Re: [HACKERS] Replication identifiers, take 4
On 25 April 2015 at 00:32, Andres Freund and...@anarazel.de wrote: Attached is a patch that does this, and some more, renaming. That was more work than I'd imagined. I've also made the internal naming in origin.c more consistent/simpler and did a bunch of other cleanup. Hi, It looks like bowerbird is not too happy with this, neither is my build environment. http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bowerbirddt=2015-04-29%2019%3A31%3A06 The attached seems to fix it. I put the include in the origin.h rather than in the origin.c as the DoNotReplicateId macro uses UINT16_MAX and is also used in xact.c. Regards David Rowley UINT16_MAX_fix.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Replication identifiers, take 4
On 29/04/15 22:12, David Rowley wrote: On 25 April 2015 at 00:32, Andres Freund and...@anarazel.de mailto:and...@anarazel.de wrote: Attached is a patch that does this, and some more, renaming. That was more work than I'd imagined. I've also made the internal naming in origin.c more consistent/simpler and did a bunch of other cleanup. Hi, It looks like bowerbird is not too happy with this, neither is my build environment. http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bowerbirddt=2015-04-29%2019%3A31%3A06 The attached seems to fix it. I put the include in the origin.h rather than in the origin.c as the DoNotReplicateId macro uses UINT16_MAX and is also used in xact.c. I think correct fix is using PG_UINT16_MAX. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Replication identifiers, take 4
On Tue, Apr 28, 2015 at 10:00 PM, Peter Eisentraut pete...@gmx.net wrote: On 4/24/15 4:29 PM, Andres Freund wrote: Shouldn't this be backed up by pg_dump(all?)? Given it deals with LSNs and is, quite fundamentally due to concurrency, non transactional, I doubt it's worth it. The other side's slots also aren't going to be backed up as pg dump obviously can't know about then. So the represented data won't make much sense. I agree it might not be the best match. But we should consider that we used to say, a backup by pg_dumpall plus configuration files is a complete backup. Now we have replication slots and possibly replication identifiers and maybe similar features in the future that are not covered by this backup method. That's true. But if you did backup the replication slots with pg_dump, and then you restored them as part of restoring the dump, your replication setup would be just as broken as if you had never backed up those replication slots at all. I think the problem here is that replication slots are part of *cluster* configuration, not individual node configuration. If you back up a set of nodes that make up a cluster, and then restore them, you might hope that you will end up with working slots established between the same pairs of machines that had working slots between them before. But I don't see a way to make that happen when you look at it from the point of view of backing up and restoring just one node. As we get better clustering facilities into core, we may develop more instances of this problem; the best way of solving it is not clear to me. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Replication identifiers, take 4
On 24/04/15 14:32, Andres Freund wrote: On 2015-04-20 11:26:29 +0300, Heikki Linnakangas wrote: On 04/17/2015 11:54 AM, Andres Freund wrote: I've attached a rebased patch, that adds decision about origin logging to the relevant XLogInsert() callsites for external 2 byte identifiers and removes the pad-reusing version in the interest of moving forward. Putting aside the 2 vs. 4 byte identifier issue, let's discuss naming: I just realized that it talks about replication identifier as the new fundamental concept. The system table is called pg_replication_identifier. But that's like talking about index identifiers, instead of just indexes, and calling the system table pg_index_oid. The important concept this patch actually adds is the *origin* of each transaction. That term is already used in some parts of the patch. I think we should roughly do a search-replace of replication identifier - replication origin to the patch. Or even transaction origin. Attached is a patch that does this, and some more, renaming. That was more work than I'd imagined. I've also made the internal naming in origin.c more consistent/simpler and did a bunch of other cleanup. There are few oversights in the renaming: doc/src/sgml/func.sgml: +Return the replay position for the passed in replication +identifier. The parameter parameterflush/parameter src/include/replication/origin.h: + * replication_identifier.h +extern PGDLLIMPORT RepOriginId replident_sesssion_origin; +extern PGDLLIMPORT XLogRecPtr replident_sesssion_origin_lsn; +extern PGDLLIMPORT TimestampTz replident_sesssion_origin_timestamp; (these are used then in multiple places in code afterwards and also mentioned in comment above replorigin_advance) src/backend/replication/logical/origin.c: + ereport(ERROR, + (errcode(ERRCODE_OBJECT_IN_USE), +errmsg(replication identiefer + default: + elog(PANIC, replident_redo: unknown op code + * This function may only be called if a origin was setup with + * replident_session_setup(). I also think the replident_checkpoint file should be renamed to replorigin_checkpoint. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Replication identifiers, take 4
On 2015-04-20 11:26:29 +0300, Heikki Linnakangas wrote: On 04/17/2015 11:54 AM, Andres Freund wrote: I've attached a rebased patch, that adds decision about origin logging to the relevant XLogInsert() callsites for external 2 byte identifiers and removes the pad-reusing version in the interest of moving forward. Putting aside the 2 vs. 4 byte identifier issue, let's discuss naming: I just realized that it talks about replication identifier as the new fundamental concept. The system table is called pg_replication_identifier. But that's like talking about index identifiers, instead of just indexes, and calling the system table pg_index_oid. The important concept this patch actually adds is the *origin* of each transaction. That term is already used in some parts of the patch. I think we should roughly do a search-replace of replication identifier - replication origin to the patch. Or even transaction origin. Attached is a patch that does this, and some more, renaming. That was more work than I'd imagined. I've also made the internal naming in origin.c more consistent/simpler and did a bunch of other cleanup. I'm pretty happy with this state. Greetings, Andres Freund From fc406a87f2e2ac08a7ac112ef6b75be1e8256a16 Mon Sep 17 00:00:00 2001 From: Andres Freund and...@anarazel.de Date: Fri, 24 Apr 2015 14:25:56 +0200 Subject: [PATCH] Introduce replication progress tracking infrastructure. (v2.0) When implementing a replication solution ontop of logical decoding two related problems exist: * How to safely keep track of replication progress * How to change replication behavior, based on the origin of a row; e.g. to avoid loops in bi-directional replication setups The solution to these problems, as implemented in this commit, consist out of three parts: 1) 'replication origins', which identify nodes in a replication setup. 2) 'replication progress tracking', which remembers, for each replication origin, how far replay has progressed in a efficient and crash safe manner. 3) The ability to filter out changes performed on the behest of a replication origin during logical decoding; this allows complex replication topologies. Most of this could also be implemented in userspace, e.g. by inserting additional rows contain origin information, but that ends up being much less efficient and more complicated. We don't want to require various replication solutions to reimplement logic for this independently. The infrastructure is intended to be generic enough to be reusable. This infrastructure also replaces the 'nodeid' infrastructure of commit timestamps. It is intended to provide all former capabilities, except that there's only 2^16 different origins; but now they integrate with logical decoding. Additionally more functionality is accessible via SQL. Since the commit timestamp infrastructure has also been introduced in 9.5 that's not a problem. For now the number of origins for which the replication progress can be tracked is determined by the max_replication_slots GUC. That GUC is not a perfect match to configure this, but there doesn't seem to be sufficient reason to introduce a separate new one. Bumps both catversion and wal page magic. Author: Andres Freund, with contributions from Petr Jelinek and Craig Ringer Reviewed-By: Heikki Linnakangas, Robert Haas, Steve Singer Discussion: 20150216002155.gi15...@awork2.anarazel.de, 20140923182422.ga15...@alap3.anarazel.de, 20131114172632.ge7...@alap2.anarazel.de --- contrib/test_decoding/Makefile |3 +- contrib/test_decoding/expected/replorigin.out | 141 +++ contrib/test_decoding/sql/replorigin.sql| 64 + contrib/test_decoding/test_decoding.c | 28 + doc/src/sgml/catalogs.sgml | 123 ++ doc/src/sgml/filelist.sgml |1 + doc/src/sgml/func.sgml | 201 ++- doc/src/sgml/logicaldecoding.sgml | 35 +- doc/src/sgml/postgres.sgml |1 + doc/src/sgml/replication-origins.sgml | 93 ++ src/backend/access/heap/heapam.c| 19 + src/backend/access/rmgrdesc/Makefile|4 +- src/backend/access/rmgrdesc/replorigindesc.c| 61 + src/backend/access/rmgrdesc/xactdesc.c | 24 +- src/backend/access/transam/commit_ts.c | 53 +- src/backend/access/transam/rmgr.c |1 + src/backend/access/transam/xact.c | 72 +- src/backend/access/transam/xlog.c |8 + src/backend/access/transam/xloginsert.c | 27 +- src/backend/access/transam/xlogreader.c |6 + src/backend/catalog/Makefile|2 +- src/backend/catalog/catalog.c |8 +- src/backend/catalog/system_views.sql|7 + src/backend/replication/logical/Makefile|3 +- src/backend/replication/logical/decode.c| 49 +-
Re: [HACKERS] Replication identifiers, take 4
On April 24, 2015 10:26:23 PM GMT+02:00, Peter Eisentraut pete...@gmx.net wrote: On 4/24/15 8:32 AM, Andres Freund wrote: On 2015-04-20 11:26:29 +0300, Heikki Linnakangas wrote: On 04/17/2015 11:54 AM, Andres Freund wrote: I've attached a rebased patch, that adds decision about origin logging to the relevant XLogInsert() callsites for external 2 byte identifiers and removes the pad-reusing version in the interest of moving forward. Putting aside the 2 vs. 4 byte identifier issue, let's discuss naming: I just realized that it talks about replication identifier as the new fundamental concept. The system table is called pg_replication_identifier. But that's like talking about index identifiers, instead of just indexes, and calling the system table pg_index_oid. The important concept this patch actually adds is the *origin* of each transaction. That term is already used in some parts of the patch. I think we should roughly do a search-replace of replication identifier - replication origin to the patch. Or even transaction origin. Attached is a patch that does this, and some more, renaming. That was more work than I'd imagined. I've also made the internal naming in origin.c more consistent/simpler and did a bunch of other cleanup. I'm pretty happy with this state. Shouldn't this be backed up by pg_dump(all?)? Given it deals with LSNs and is, quite fundamentally due to concurrency, non transactional, I doubt it's worth it. The other side's slots also aren't going to be backed up as pg dump obviously can't know about then. So the represented data won't make much sense. Andres --- Please excuse brevity and formatting - I am writing this on my mobile phone. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Replication identifiers, take 4
On 4/24/15 8:32 AM, Andres Freund wrote: On 2015-04-20 11:26:29 +0300, Heikki Linnakangas wrote: On 04/17/2015 11:54 AM, Andres Freund wrote: I've attached a rebased patch, that adds decision about origin logging to the relevant XLogInsert() callsites for external 2 byte identifiers and removes the pad-reusing version in the interest of moving forward. Putting aside the 2 vs. 4 byte identifier issue, let's discuss naming: I just realized that it talks about replication identifier as the new fundamental concept. The system table is called pg_replication_identifier. But that's like talking about index identifiers, instead of just indexes, and calling the system table pg_index_oid. The important concept this patch actually adds is the *origin* of each transaction. That term is already used in some parts of the patch. I think we should roughly do a search-replace of replication identifier - replication origin to the patch. Or even transaction origin. Attached is a patch that does this, and some more, renaming. That was more work than I'd imagined. I've also made the internal naming in origin.c more consistent/simpler and did a bunch of other cleanup. I'm pretty happy with this state. Shouldn't this be backed up by pg_dump(all?)? -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Replication identifiers, take 4
On 2015-03-24 22:22:29 +0100, Petr Jelinek wrote: Perhaps we should have some Logical replication developer documentation section and put all those three as subsections of that? So I just played around with this and it didn't find it worthwhile. Primarily because there's lots of uses of logical decoding besides building a logical replication solution. I've reverted to putting it into a separate chapter 'besides' logical decoding. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Replication identifiers, take 4
On 21/04/15 22:36, Andres Freund wrote: On 2015-04-21 16:26:08 -0400, Robert Haas wrote: On Tue, Apr 21, 2015 at 8:08 AM, Andres Freund and...@anarazel.de wrote: I've now named the functions: * pg_replication_origin_create * pg_replication_origin_drop * pg_replication_origin_get (map from name to id) * pg_replication_progress_setup_origin : configure session to replicate from a specific origin * pg_replication_progress_reset_origin * pg_replication_progress_setup_tx_details : configure per transaction details (LSN and timestamp currently) * pg_replication_progress_is_replaying : Is a origin configured for the session * pg_replication_progress_advance : manually set the replication progress to a value. Primarily useful for copying values from other systems and such. * pg_replication_progress_get : How far did replay progress for a certain origin * pg_get_replication_progress : SRF returning the replay progress for all origin. Any comments? Why are we using functions for this rather than DDL? Unless I miss something the only two we really could use ddl for is pg_replication_origin_create/pg_replication_origin_drop. We could use DDL for them if we really want, but I'm not really seeing the advantage. I think the only value of having DDL for this would be consistency (catalog objects are created via DDL) as it looks like something that will be called only by extensions and not users during normal operation. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Replication identifiers, take 4
On 20 April 2015 at 09:28, Andres Freund and...@anarazel.de wrote: On 2015-04-20 11:26:29 +0300, Heikki Linnakangas wrote: I just realized that it talks about replication identifier as the new fundamental concept. The system table is called pg_replication_identifier. But that's like talking about index identifiers, instead of just indexes, and calling the system table pg_index_oid. The important concept this patch actually adds is the *origin* of each transaction. That term is already used in some parts of the patch. I think we should roughly do a search-replace of replication identifier - replication origin to the patch. Or even transaction origin. Sounds good to me. +1 -- Simon Riggshttp://www.2ndQuadrant.com/ http://www.2ndquadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [HACKERS] Replication identifiers, take 4
Andres Freund wrote: I'm working on changing this (I've implemented the missing WAL bits). I'd like to discuss the new terms for a sec, before I go and revise the docs. I'm now calling the feature 'replication progress tracking'. There's replication origins and there's progress tracking infrastructure that tracks how far data from a replication origin has replicated. Sounds good to me. Catalog wise there's an actual table 'pg_replication_origin' that maps between 'roident' and 'roname'. There's a pg_replication_progress view (used to be named pg_replication_identifier_progress). I'm not sure if the latter name isn't too generic? Maybe pg_logical_replication_progress? I think if we wanted pg_logical_replication_progress (and I don't really agree that we do) then we would add the logical bit to the names above as well. This seems unnecessary. pg_replication_progress seems okay to me. I've now named the functions: * pg_replication_origin_create * pg_replication_origin_drop * pg_replication_origin_get (map from name to id) * pg_replication_progress_setup_origin : configure session to replicate from a specific origin * pg_replication_progress_reset_origin * pg_replication_progress_is_replaying : Is a origin configured for the session * pg_replication_progress_advance : manually set the replication progress to a value. Primarily useful for copying values from other systems and such. These all look acceptable to me. * pg_replication_progress_get : How far did replay progress for a certain origin * pg_get_replication_progress : SRF returning the replay progress for all origin. This combination seems confusing. In some other thread not too long ago there was the argument that all functions 'get' something, so that verb should not appear in the function name. That would call for pg_replication_progress on the singleton. Maybe to distinguish the SRF, add all as a suffix? * pg_replication_progress_setup_tx_details : configure per transaction details (LSN and timestamp currently) Not sure about the tx here. We use xact as an abbreviation for transaction in most places. If nowadays we don't like that term, maybe just spell out transaction in full. I assume this function pairs up with pg_replication_progress_setup_origin, yes? -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Replication identifiers, take 4
On 2015-04-20 10:28:02 +0200, Andres Freund wrote: On 2015-04-20 11:26:29 +0300, Heikki Linnakangas wrote: I just realized that it talks about replication identifier as the new fundamental concept. The system table is called pg_replication_identifier. But that's like talking about index identifiers, instead of just indexes, and calling the system table pg_index_oid. The important concept this patch actually adds is the *origin* of each transaction. That term is already used in some parts of the patch. I think we should roughly do a search-replace of replication identifier - replication origin to the patch. Or even transaction origin. Sounds good to me. I'm working on changing this (I've implemented the missing WAL bits). I'd like to discuss the new terms for a sec, before I go and revise the docs. I'm now calling the feature 'replication progress tracking'. There's replication origins and there's progress tracking infrastructure that tracks how far data from a replication origin has replicated. Catalog wise there's an actual table 'pg_replication_origin' that maps between 'roident' and 'roname'. There's a pg_replication_progress view (used to be named pg_replication_identifier_progress). I'm not sure if the latter name isn't too generic? Maybe pg_logical_replication_progress? I've now named the functions: * pg_replication_origin_create * pg_replication_origin_drop * pg_replication_origin_get (map from name to id) * pg_replication_progress_setup_origin : configure session to replicate from a specific origin * pg_replication_progress_reset_origin * pg_replication_progress_setup_tx_details : configure per transaction details (LSN and timestamp currently) * pg_replication_progress_is_replaying : Is a origin configured for the session * pg_replication_progress_advance : manually set the replication progress to a value. Primarily useful for copying values from other systems and such. * pg_replication_progress_get : How far did replay progress for a certain origin * pg_get_replication_progress : SRF returning the replay progress for all origin. Any comments? Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Replication identifiers, take 4
On 2015-04-21 12:20:42 -0300, Alvaro Herrera wrote: Andres Freund wrote: Catalog wise there's an actual table 'pg_replication_origin' that maps between 'roident' and 'roname'. There's a pg_replication_progress view (used to be named pg_replication_identifier_progress). I'm not sure if the latter name isn't too generic? Maybe pg_logical_replication_progress? I think if we wanted pg_logical_replication_progress (and I don't really agree that we do) then we would add the logical bit to the names above as well. This seems unnecessary. pg_replication_progress seems okay to me. Cool. * pg_replication_progress_get : How far did replay progress for a certain origin * pg_get_replication_progress : SRF returning the replay progress for all origin. This combination seems confusing. In some other thread not too long ago there was the argument that all functions 'get' something, so that verb should not appear in the function name. That would call for pg_replication_progress on the singleton. Hm. I don't like that. That'd e.g. clash with the above view. I think it's good to distinguish between functions (that have a verb in the name) and views/tables (that don't). I agree that the above combination isn't optimal. Although pg_get (and pg_stat_get) is what's used for a lot of other SRF backed views. Maybe naming the SRF pg_get_all_replication_progress? * pg_replication_progress_setup_tx_details : configure per transaction details (LSN and timestamp currently) Not sure about the tx here. We use xact as an abbreviation for transaction in most places. Oh, yea. Xact is more consistent. If nowadays we don't like that term, maybe just spell out transaction in full. I assume this function pairs up with pg_replication_progress_setup_origin, yes? pg_replication_progress_setup_origin sets up the per session state, setup_xact_details the per replayed transaction state. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Replication identifiers, take 4
On Tue, Apr 21, 2015 at 8:08 AM, Andres Freund and...@anarazel.de wrote: I've now named the functions: * pg_replication_origin_create * pg_replication_origin_drop * pg_replication_origin_get (map from name to id) * pg_replication_progress_setup_origin : configure session to replicate from a specific origin * pg_replication_progress_reset_origin * pg_replication_progress_setup_tx_details : configure per transaction details (LSN and timestamp currently) * pg_replication_progress_is_replaying : Is a origin configured for the session * pg_replication_progress_advance : manually set the replication progress to a value. Primarily useful for copying values from other systems and such. * pg_replication_progress_get : How far did replay progress for a certain origin * pg_get_replication_progress : SRF returning the replay progress for all origin. Any comments? Why are we using functions for this rather than DDL? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Replication identifiers, take 4
On 2015-04-21 16:26:08 -0400, Robert Haas wrote: On Tue, Apr 21, 2015 at 8:08 AM, Andres Freund and...@anarazel.de wrote: I've now named the functions: * pg_replication_origin_create * pg_replication_origin_drop * pg_replication_origin_get (map from name to id) * pg_replication_progress_setup_origin : configure session to replicate from a specific origin * pg_replication_progress_reset_origin * pg_replication_progress_setup_tx_details : configure per transaction details (LSN and timestamp currently) * pg_replication_progress_is_replaying : Is a origin configured for the session * pg_replication_progress_advance : manually set the replication progress to a value. Primarily useful for copying values from other systems and such. * pg_replication_progress_get : How far did replay progress for a certain origin * pg_get_replication_progress : SRF returning the replay progress for all origin. Any comments? Why are we using functions for this rather than DDL? Unless I miss something the only two we really could use ddl for is pg_replication_origin_create/pg_replication_origin_drop. We could use DDL for them if we really want, but I'm not really seeing the advantage. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Replication identifiers, take 4
On 2015-04-20 11:26:29 +0300, Heikki Linnakangas wrote: I just realized that it talks about replication identifier as the new fundamental concept. The system table is called pg_replication_identifier. But that's like talking about index identifiers, instead of just indexes, and calling the system table pg_index_oid. The important concept this patch actually adds is the *origin* of each transaction. That term is already used in some parts of the patch. I think we should roughly do a search-replace of replication identifier - replication origin to the patch. Or even transaction origin. Sounds good to me. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Replication identifiers, take 4
On 04/17/2015 11:45 PM, Petr Jelinek wrote: The argument to move to 4 bytes is a poor one. If it was reasonable in terms of code or cosmetic value then all values used in the backend would be 4 bytes. We wouldn't have any 2 byte values anywhere. But we don't do that. The change does nothing useful, since I doubt anyone will ever need 32768 nodes in their cluster. And if they did there would be other much bigger problems than replication identifier being 16bit (it's actually 65534 as it's unsigned btw). Can you name some of the bigger problems you'd have? Obviously, if you have 10 high-volume OLTP nodes connected to a single server, feeding transactions as a continous stream, you're going to choke the system. But you might have 10 tiny satellite databases that sync up with the master every few hours, and each of them do only a few updates per day. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Replication identifiers, take 4
On 2015-04-20 11:09:20 +0300, Heikki Linnakangas wrote: Can you name some of the bigger problems you'd have? Several parts of the system are O(#max_replication_slots). Having 100k outgoing logical replication slots is going to be expensive. Nothing unsolvable, but the 65k 16 bit limit surely isn't going to be the biggest problem. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Replication identifiers, take 4
On 04/17/2015 11:36 PM, Simon Riggs wrote: On 17 April 2015 at 19:18, Heikki Linnakangas hlinn...@iki.fi wrote: To be honest, I'm not entirely sure what we're arguing over. When arguing over something you consider small, it is customary to allow the author precedence. We can't do things our own way all the time. Sure, I'm not going to throw a tantrum if Andres commits this as it is. I said that IMO the difference in WAL size is so small that we should just use 4-byte OIDs for the replication identifiers, instead of trying to make do with 2 bytes. Not because I find it too likely that you'll run out of IDs (although it could happen), but more to make replication IDs more like all other system objects we have. Andreas did some pgbench benchmarking to show that the difference in WAL size is about 10%. The WAL records generated by pgbench happen to have just the right sizes so that the 2-3 extra bytes bump them over to the next alignment boundary. That's why there is such a big difference - on average it'll be less. I think that's acceptable, Andreas seems to think otherwise. But if the WAL size really is so precious, we could remove the two padding bytes from XLogRecord, instead of dedicating them for the replication ids. That would be an even better use for them. The argument to move to 4 bytes is a poor one. If it was reasonable in terms of code or cosmetic value then all values used in the backend would be 4 bytes. We wouldn't have any 2 byte values anywhere. But we don't do that. That's a straw man argument. I'm not saying we should never use 2 byte values anywhere. OID is usually used as the primary key in system tables. There are exceptions, but that is nevertheless the norm. I'm saying that saving in WAL size is not worth making an exception here, and we should go with the simplest option of using OIDs. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Replication identifiers, take 4
On 04/17/2015 11:54 AM, Andres Freund wrote: I've attached a rebased patch, that adds decision about origin logging to the relevant XLogInsert() callsites for external 2 byte identifiers and removes the pad-reusing version in the interest of moving forward. Putting aside the 2 vs. 4 byte identifier issue, let's discuss naming: I just realized that it talks about replication identifier as the new fundamental concept. The system table is called pg_replication_identifier. But that's like talking about index identifiers, instead of just indexes, and calling the system table pg_index_oid. The important concept this patch actually adds is the *origin* of each transaction. That term is already used in some parts of the patch. I think we should roughly do a search-replace of replication identifier - replication origin to the patch. Or even transaction origin. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Replication identifiers, take 4
On 17 April 2015 at 09:54, Andres Freund and...@anarazel.de wrote: Hrmpf. Says the person that used a lot of padding, without much discussion, for the WAL level infrastructure making pg_rewind more maintainable. Sounds bad. What padding are we talking about? -- Simon Riggshttp://www.2ndQuadrant.com/ http://www.2ndquadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [HACKERS] Replication identifiers, take 4
On 17 April 2015 at 19:18, Heikki Linnakangas hlinn...@iki.fi wrote: To be honest, I'm not entirely sure what we're arguing over. When arguing over something you consider small, it is customary to allow the author precedence. We can't do things our own way all the time. I didn't much like pg_rewind, but it doesn't hurt and you like it, so I didn't object. We've all got better things to do. I said that IMO the difference in WAL size is so small that we should just use 4-byte OIDs for the replication identifiers, instead of trying to make do with 2 bytes. Not because I find it too likely that you'll run out of IDs (although it could happen), but more to make replication IDs more like all other system objects we have. Andreas did some pgbench benchmarking to show that the difference in WAL size is about 10%. The WAL records generated by pgbench happen to have just the right sizes so that the 2-3 extra bytes bump them over to the next alignment boundary. That's why there is such a big difference - on average it'll be less. I think that's acceptable, Andreas seems to think otherwise. But if the WAL size really is so precious, we could remove the two padding bytes from XLogRecord, instead of dedicating them for the replication ids. That would be an even better use for them. The argument to move to 4 bytes is a poor one. If it was reasonable in terms of code or cosmetic value then all values used in the backend would be 4 bytes. We wouldn't have any 2 byte values anywhere. But we don't do that. The change does nothing useful, since I doubt anyone will ever need 32768 nodes in their cluster. Increasing WAL size for any non-zero amount is needlessly wasteful for a change with only cosmetic value. But for a change that has significant value for database resilience, it is a sensible use of bytes. +1 to Andres' very reasonable suggestion. Lets commit this and go home. -- Simon Riggshttp://www.2ndQuadrant.com/ http://www.2ndquadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [HACKERS] Replication identifiers, take 4
On 17/04/15 22:36, Simon Riggs wrote: I said that IMO the difference in WAL size is so small that we should just use 4-byte OIDs for the replication identifiers, instead of trying to make do with 2 bytes. Not because I find it too likely that you'll run out of IDs (although it could happen), but more to make replication IDs more like all other system objects we have. Andreas did some pgbench benchmarking to show that the difference in WAL size is about 10%. The WAL records generated by pgbench happen to have just the right sizes so that the 2-3 extra bytes bump them over to the next alignment boundary. That's why there is such a big difference - on average it'll be less. I think that's acceptable, Andreas seems to think otherwise. But if the WAL size really is so precious, we could remove the two padding bytes from XLogRecord, instead of dedicating them for the replication ids. That would be an even better use for them. The argument to move to 4 bytes is a poor one. If it was reasonable in terms of code or cosmetic value then all values used in the backend would be 4 bytes. We wouldn't have any 2 byte values anywhere. But we don't do that. The change does nothing useful, since I doubt anyone will ever need 32768 nodes in their cluster. And if they did there would be other much bigger problems than replication identifier being 16bit (it's actually 65534 as it's unsigned btw). Considering the importance and widespread use of replication I think we should really make sure the related features have small overhead. Increasing WAL size for any non-zero amount is needlessly wasteful for a change with only cosmetic value. But for a change that has significant value for database resilience, it is a sensible use of bytes. +1 to Andres' very reasonable suggestion. Lets commit this and go home. +1 -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Replication identifiers, take 4
On 04/12/2015 02:56 AM, Petr Jelinek wrote: On 10/04/15 18:03, Andres Freund wrote: On 2015-04-07 17:08:16 +0200, Andres Freund wrote: I'm starting benchmarks now. What I'm benchmarking here is the WAL overhead, since that's what we're debating. The test setup I used was a pgbench scale 10 instance. I've run with full_page_write=off to have more reproducible results. This of course over-emphasizes the overhead a bit. But for a long checkpoint interval and a memory resident working set it's not that unrealistic. I ran 50k transactions in a signle b baseline: - 20445024 - 20437128 - 20436864 - avg: 20439672 extern 2byte identifiers: - 23318368 - 23148648 - 23128016 - avg: 23198344 - avg overhead: 13.5% padding 2byte identifiers: - 21160408 - 21319720 - 21164280 - avg: 21214802 - avg overhead: 3.8% extern 4byte identifiers: - 23514216 - 23540128 - 23523080 - avg: 23525808 - avg overhead: 15.1% To me that shows pretty clearly that a) reusing the padding is worthwhile b) even without that using 2byte instead of 4 byte identifiers is beneficial. My opinion is that 10% of WAL size difference is quite high price to pay so that we can keep the padding for some other, yet unknown feature that hasn't come up in several years, which would need those 2 bytes. But if we are willing to pay it then we can really go all the way and just use Oids... This needs to be weighed against removing the padding bytes altogether. See attached. That would reduce the WAL size further when you don't need replication IDs. It's very straightforward, but need to do some performance/scalability testing to make sure that using memcpy instead of a straight 32-bit assignment doesn't hurt performance, since it happens in very performance critical paths. I'm surprised there's such a big difference between the extern and padding versions above. At a quick approximation, storing the ID as a separate fragment, along with XLogRecordDataHeaderShort and XLogRecordDataHeaderLong, should add one byte of overhead plus the ID itself. So that would be 3 extra bytes for 2-byte identifiers, or 5 bytes for 4-byte identifiers. Does that mean that the average record length is only about 30 bytes? That's what it seems like, if adding the extern 2 byte identifiers added about 10% of overhead compared to the padding 2 byte identifiers version. That doesn't sound right, 30 bytes is very little. Perhaps the size of the records created by pgbench happen to cross a 8-byte alignment boundary at that point, making a big difference. In another workload, there might be no difference at all, due to alignment. Also, you don't need to tag every record type with the replication ID. All indexam records can skip it, for starters, since logical decoding doesn't care about them. That should remove a fair amount of bloat. - Heikki diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 24cf520..09934f5 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -963,10 +963,10 @@ XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn) * Now that xl_prev has been filled in, calculate CRC of the record * header. */ - rdata_crc = rechdr-xl_crc; + memcpy(rdata_crc, rechdr-xl_crc, sizeof(pg_crc32)); COMP_CRC32C(rdata_crc, rechdr, offsetof(XLogRecord, xl_crc)); FIN_CRC32C(rdata_crc); - rechdr-xl_crc = rdata_crc; + memcpy(rechdr-xl_crc, rdata_crc, sizeof(pg_crc32)); /* * All the record data, including the header, is now ready to be @@ -4685,7 +4685,7 @@ BootStrapXLOG(void) COMP_CRC32C(crc, ((char *) record) + SizeOfXLogRecord, record-xl_tot_len - SizeOfXLogRecord); COMP_CRC32C(crc, (char *) record, offsetof(XLogRecord, xl_crc)); FIN_CRC32C(crc); - record-xl_crc = crc; + memcpy(record-xl_crc, crc, sizeof(pg_crc32)); /* Create first XLOG segment file */ use_existent = false; diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c index 88209c3..fbe97b1 100644 --- a/src/backend/access/transam/xloginsert.c +++ b/src/backend/access/transam/xloginsert.c @@ -724,7 +724,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info, rechdr-xl_info = info; rechdr-xl_rmid = rmid; rechdr-xl_prev = InvalidXLogRecPtr; - rechdr-xl_crc = rdata_crc; + memcpy(rechdr-xl_crc, rdata_crc, sizeof(pg_crc32)); return hdr_rdt; } diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index a4124d9..80a48ba 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -666,6 +666,9 @@ static bool ValidXLogRecord(XLogReaderState *state, XLogRecord *record, XLogRecPtr recptr) { pg_crc32 crc; + pg_crc32 rec_crc; + + memcpy(rec_crc, record-xl_crc, sizeof(pg_crc32)); /* Calculate the CRC */ INIT_CRC32C(crc); @@ -674,7 +677,7 @@ ValidXLogRecord(XLogReaderState *state, XLogRecord *record, XLogRecPtr recptr) COMP_CRC32C(crc, (char *) record,
Re: [HACKERS] Replication identifiers, take 4
On 10/04/15 18:03, Andres Freund wrote: On 2015-04-07 17:08:16 +0200, Andres Freund wrote: I'm starting benchmarks now. What I'm benchmarking here is the WAL overhead, since that's what we're debating. The test setup I used was a pgbench scale 10 instance. I've run with full_page_write=off to have more reproducible results. This of course over-emphasizes the overhead a bit. But for a long checkpoint interval and a memory resident working set it's not that unrealistic. I ran 50k transactions in a signle b baseline: - 20445024 - 20437128 - 20436864 - avg: 20439672 extern 2byte identifiers: - 23318368 - 23148648 - 23128016 - avg: 23198344 - avg overhead: 13.5% padding 2byte identifiers: - 21160408 - 21319720 - 21164280 - avg: 21214802 - avg overhead: 3.8% extern 4byte identifiers: - 23514216 - 23540128 - 23523080 - avg: 23525808 - avg overhead: 15.1% To me that shows pretty clearly that a) reusing the padding is worthwhile b) even without that using 2byte instead of 4 byte identifiers is beneficial. My opinion is that 10% of WAL size difference is quite high price to pay so that we can keep the padding for some other, yet unknown feature that hasn't come up in several years, which would need those 2 bytes. But if we are willing to pay it then we can really go all the way and just use Oids... -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Replication identifiers, take 4
On 2015-03-28 23:50:20 +0100, Petr Jelinek wrote: And finally I have issue with how the new identifiers are allocated. Currently, if you create identifier 'foo', remove identifier 'foo' and create identifier 'bar', the identifier 'bar' will have same id as the old 'foo' identifier. This can be problem because the identifier id is used as origin of the data and the replication solution using the replication identifiers can end up thinking that data came from node 'bar' even though they came from the node 'foo' which no longer exists. This can have bad effects for example on conflict detection or debugging problems with replication. Maybe another reason to use standard Oids? As the same reason exists for oids, just somewhat less likely, I don't see it as a reason for much. It's really not that hard to get oid conflicts once your server has lived for a while. As soon as the oid counter has wrapped around once, it's not unlikely to have conflicts. And with temp tables (or much more extremely WITH OID tables) and such it's not that hard to reach that point. The only material difference this makes is that it's much easier to notice the problem during development. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Replication identifiers, take 4
On 2015-03-24 23:11:26 -0400, Robert Haas wrote: On Mon, Feb 16, 2015 at 4:46 AM, Andres Freund and...@2ndquadrant.com wrote: At a quick glance, this basic design seems workable. I would suggest expanding the replication IDs to regular 4 byte oids. Two extra bytes is a small price to pay, to make it work more like everything else in the system. I don't know. Growing from 3 to 5 byte overhead per relevant record (or even 0 to 5 in case the padding is reused) is rather noticeable. If we later find it to be a limit (I seriously doubt that), we can still increase it in a major release without anybody really noticing. You might notice that Heikki is making the same point here that I've attempted to make multiple times in the past: limiting to replication identifier to 2 bytes because that's how much padding space you happen to have available is optimizing for the wrong thing. What we should be optimizing for is consistency and uniformity of design. System catalogs have OIDs, so this one should, too. You're not going to be able to paper over the fact that the column has some funky data type that is unlike every other column in the system. To the best of my knowledge, the statement that there is a noticeable performance cost for those 2 extra bytes is also completely unsupported by any actual benchmarking. I'm starting benchmarks now. But I have to say: I find the idea that you'd need more than 2^16 identifiers anytime soon not very credible. The likelihood that replication identifiers are the limiting factor towards that seems incredibly small. Just consider how you'd apply changes from so many remotes; how to stream changes to them; how to even configure such a complex setup. We can easily change the size limits in the next major release without anybody being inconvenienced. We've gone through quite some lengths reducing the overhead of WAL. I don't understand why it's important that we do not make compromises here; but why that doesn't matter elsewhere. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Replication identifiers, take 4
On 2015-04-07 16:30:25 +0200, Andres Freund wrote: And with temp tables (or much more extremely WITH OID tables) and such it's not that hard to reach that point. Oh, and obviously toast data. A couple tables with toasted columns is also a good way to rapidly consume oids. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Replication identifiers, take 4
On 4/7/15 9:30 AM, Andres Freund wrote: On 2015-03-28 23:50:20 +0100, Petr Jelinek wrote: And finally I have issue with how the new identifiers are allocated. Currently, if you create identifier 'foo', remove identifier 'foo' and create identifier 'bar', the identifier 'bar' will have same id as the old 'foo' identifier. This can be problem because the identifier id is used as origin of the data and the replication solution using the replication identifiers can end up thinking that data came from node 'bar' even though they came from the node 'foo' which no longer exists. This can have bad effects for example on conflict detection or debugging problems with replication. Maybe another reason to use standard Oids? As the same reason exists for oids, just somewhat less likely, I don't see it as a reason for much. It's really not that hard to get oid conflicts once your server has lived for a while. As soon as the oid counter has wrapped around once, it's not unlikely to have conflicts. And with temp tables (or much more extremely WITH OID tables) and such it's not that hard to reach that point. The only material difference this makes is that it's much easier to notice the problem during development. Why not just create a sequence? I suspect it may not be as fast to assign as an OID, but it's not like you'd be doing this all the time. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Replication identifiers, take 4
Why not just create a sequence? I suspect it may not be as fast to assign as an OID, but it's not like you'd be doing this all the time. What does that have to do with the thread? Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Replication identifiers, take 4
On 4/7/15 10:58 AM, Andres Freund wrote: Why not just create a sequence? I suspect it may not be as fast to assign as an OID, but it's not like you'd be doing this all the time. What does that have to do with the thread? The original bit was... And finally I have issue with how the new identifiers are allocated. Currently, if you create identifier 'foo', remove identifier 'foo' and create identifier 'bar', the identifier 'bar' will have same id as the old 'foo' identifier. This can be problem because the identifier id is used as origin of the data and the replication solution using the replication identifiers can end up thinking that data came from node 'bar' even though they came from the node 'foo' which no longer exists. This can have bad effects for example on conflict detection or debugging problems with replication. Maybe another reason to use standard Oids? Wasn't the reason for using OIDs so that we're not doing the equivalent of max(identifier) + 1? Perhaps I'm just confused... -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Replication identifiers, take 4
On Tue, Apr 7, 2015 at 11:37 PM, Andres Freund and...@anarazel.de wrote: On 2015-04-07 16:30:25 +0200, Andres Freund wrote: And with temp tables (or much more extremely WITH OID tables) and such it's not that hard to reach that point. Oh, and obviously toast data. A couple tables with toasted columns is also a good way to rapidly consume oids. You are forgetting as well large objects on the stack, when client application does not assign an OID by itself. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Replication identifiers, take 4
So I did some more in depth look, I do have couple of comments. I would really like to have something like Logical Replication Infrastructure doc section that would have both decoding and identifiers (and possibly even CommitTs) underneath. There is typo in docs: + para + The optional functionfilter_by_origin_cb/function callback + is called to determine wheter data that has been replayed wheter - whether And finally I have issue with how the new identifiers are allocated. Currently, if you create identifier 'foo', remove identifier 'foo' and create identifier 'bar', the identifier 'bar' will have same id as the old 'foo' identifier. This can be problem because the identifier id is used as origin of the data and the replication solution using the replication identifiers can end up thinking that data came from node 'bar' even though they came from the node 'foo' which no longer exists. This can have bad effects for example on conflict detection or debugging problems with replication. Maybe another reason to use standard Oids? -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Replication identifiers, take 4
On Mon, Feb 16, 2015 at 4:46 AM, Andres Freund and...@2ndquadrant.com wrote: At a quick glance, this basic design seems workable. I would suggest expanding the replication IDs to regular 4 byte oids. Two extra bytes is a small price to pay, to make it work more like everything else in the system. I don't know. Growing from 3 to 5 byte overhead per relevant record (or even 0 to 5 in case the padding is reused) is rather noticeable. If we later find it to be a limit (I seriously doubt that), we can still increase it in a major release without anybody really noticing. You might notice that Heikki is making the same point here that I've attempted to make multiple times in the past: limiting to replication identifier to 2 bytes because that's how much padding space you happen to have available is optimizing for the wrong thing. What we should be optimizing for is consistency and uniformity of design. System catalogs have OIDs, so this one should, too. You're not going to be able to paper over the fact that the column has some funky data type that is unlike every other column in the system. To the best of my knowledge, the statement that there is a noticeable performance cost for those 2 extra bytes is also completely unsupported by any actual benchmarking. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Replication identifiers, take 4
Hi, Here's the next version of this patch. I've tried to address the biggest issue (documentation) and some more. Now that both the more flexible commit WAL record format and the BKI_FORCE_NOT_NULL thing is in, it looks much cleaner. Changes: * Loads of documentation and comments * Revamped locking strategy. There's now a LWLock protecting all the replication progress array and spinlock for the individual sltos. * Simpler checkpoint format. * A new pg_replication_identifier_progress() function returning a individual identifier's replication progress; previously there was only the view showing all of them. * Lots of minor cleanup * Some more tests I'd greatly appreciate some feedback on the documentation. I'm not entirely sure into how much detail to go; and where exactly in the docs to place it. I do wonder if we shouldn't merge this with the logical decoding section and whether we could also document commit timestamps somewhere in there. I've verified that this correctly works on a stanby; replication progress is replicated correctly. I think there's two holes though: Changing the replication progress without replicating anything and dropping a replication identifier with some replication progress might not work correctly. That's fairly easily fixed and I intend to do so. Other than that I'm not aware of outstanding issues. Comments? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services From 32ff44f3f045b0805301af8d94d1f11336c6ce8f Mon Sep 17 00:00:00 2001 From: Andres Freund and...@anarazel.de Date: Sun, 15 Mar 2015 18:11:01 +0100 Subject: [PATCH] Introduce replication identifiers: v1.0 Replication identifiers are used to identify nodes in a replication setup, identify changes that are created due to replication and to keep track of replication progress. Primarily this is useful because solving these in other ways is possible, but ends up being much less efficient and more complicated. We don't want to require replication solutions to reimplement logic for this independently. The infrastructure is intended to be generic enough to be reusable. This infrastructure replaces the 'nodeid' infrastructure of commit timestamps. Except that there's only 2^16 identifiers, the infrastructure provided here integrates with logical replication and is available via SQL. Since the commit timestamp infrastructure has also been introduced in 9.5 that's not a problem. For now the number of nodes whose replication progress can be tracked is determined by the max_replication_slots GUC. It's not perfect to reuse that GUC, but there doesn't seem to be sufficient reason to introduce a separate new one. Bumps both catversion and wal page magic. Author: Andres Freund, with contributions from Petr Jelinek and Craig Ringer Reviewed-By: Robert Haas, Heikki Linnakangas, Steve Singer Discussion: 20150216002155.gi15...@awork2.anarazel.de, 20140923182422.ga15...@alap3.anarazel.de, 20131114172632.ge7...@alap2.anarazel.de --- contrib/test_decoding/Makefile |3 +- contrib/test_decoding/expected/replident.out | 127 ++ contrib/test_decoding/sql/replident.sql| 58 + contrib/test_decoding/test_decoding.c | 28 + doc/src/sgml/catalogs.sgml | 124 ++ doc/src/sgml/filelist.sgml |1 + doc/src/sgml/func.sgml | 162 ++- doc/src/sgml/logicaldecoding.sgml | 35 +- doc/src/sgml/postgres.sgml |1 + doc/src/sgml/replication-identifiers.sgml | 89 ++ src/backend/access/rmgrdesc/xactdesc.c | 24 +- src/backend/access/transam/commit_ts.c | 53 +- src/backend/access/transam/xact.c | 69 +- src/backend/access/transam/xlog.c |8 + src/backend/access/transam/xloginsert.c| 22 +- src/backend/access/transam/xlogreader.c| 10 + src/backend/catalog/Makefile |2 +- src/backend/catalog/catalog.c |8 +- src/backend/catalog/system_views.sql |7 + src/backend/replication/logical/Makefile |3 +- src/backend/replication/logical/decode.c | 48 +- src/backend/replication/logical/logical.c | 33 + src/backend/replication/logical/reorderbuffer.c|5 +- .../replication/logical/replication_identifier.c | 1300 src/backend/storage/ipc/ipci.c |3 + src/backend/utils/cache/syscache.c | 23 + src/bin/pg_resetxlog/pg_resetxlog.c|5 + src/include/access/commit_ts.h | 14 +- src/include/access/xact.h | 11 + src/include/access/xlog.h |1 + src/include/access/xlog_internal.h |2
Re: [HACKERS] Replication identifiers, take 4
On 24/03/15 16:33, Andres Freund wrote: Hi, Here's the next version of this patch. I've tried to address the biggest issue (documentation) and some more. Now that both the more flexible commit WAL record format and the BKI_FORCE_NOT_NULL thing is in, it looks much cleaner. Nice, I see you also did the more close integration with CommitTs. I only skimmed the patch but so far and it looks quite good, I'll take closer look around end of the week. I'd greatly appreciate some feedback on the documentation. I'm not entirely sure into how much detail to go; and where exactly in the docs to place it. I do wonder if we shouldn't merge this with the logical decoding section and whether we could also document commit timestamps somewhere in there. Perhaps we should have some Logical replication developer documentation section and put all those three as subsections of that? -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Replication identifiers, take 4
On 22/02/15 09:57, Andres Freund wrote: On 2015-02-19 00:49:50 +0100, Petr Jelinek wrote: On 16/02/15 10:46, Andres Freund wrote: On 2015-02-16 11:34:10 +0200, Heikki Linnakangas wrote: At a quick glance, this basic design seems workable. I would suggest expanding the replication IDs to regular 4 byte oids. Two extra bytes is a small price to pay, to make it work more like everything else in the system. I don't know. Growing from 3 to 5 byte overhead per relevant record (or even 0 to 5 in case the padding is reused) is rather noticeable. If we later find it to be a limit (I seriously doubt that), we can still increase it in a major release without anybody really noticing. I agree that limiting the overhead is important. But I have related though about this - I wonder if it's worth to try to map this more directly to the CommitTsNodeId. Maybe. I'd rather go the other way round though; replication_identifier.c/h's stuff seems much more generic than CommitTsNodeId. Probably not more generic but definitely nicer as the nodes are named and yes it has richer API. I mean it is currently using that for the actual storage, but I am thinking of having the Replication Identifiers be the public API and the CommitTsNodeId stuff be just hidden implementation detail. This thought is based on the fact that CommitTsNodeId provides somewhat meaningless number while Replication Identifiers give us nice name for that number. And if we do that then the type should perhaps be same for both? I'm not sure. Given that this is included in a significant number of recordsd I'd really rather not increase the overhead as described above. Maybe we can just limit CommitTsNodeId to the same size for now? That would make sense. I also wonder about the default nodeid infrastructure the committs provides. I can't think of use-case which it can be used for and which isn't solved by replication identifiers - in the end the main reason I added that infra was to make it possible to write something like replication identifiers as part of an extension. In any case I don't think the default nodeid can be used in parallel with replication identifiers since one will overwrite the SLRU record for the other. Maybe it's enough if this is documented but I think it might be better if this patch removed that default committs nodeid infrastructure. It's just few lines of code which nobody should be using yet. Thinking about this some more and rereading the code I see that you are setting TransactionTreeSetCommitTsData during xlog replay, but not during transaction commit, that does not seem correct as the local records will not have any nodeid/origin. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Replication identifiers, take 4
On 2/15/15 7:24 PM, Andres Freund wrote: On 2015-02-16 01:21:55 +0100, Andres Freund wrote: Here's my next attept attempt at producing something we can agree upon. The major change that might achieve that is that I've now provided a separate method to store the origin_id of a node. I've made it conditional on !REPLICATION_IDENTIFIER_REUSE_PADDING, to show both paths. That new method uses Heikki's xlog rework to dynamically add the origin to the record if a origin is set up. That works surprisingly simply. I'm trying to figure out what this feature is supposed to do, but the patch contains no documentation or a commit message. Where is one supposed to start? -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Replication identifiers, take 4
On 2015-02-22 10:03:59 -0500, Peter Eisentraut wrote: On 2/15/15 7:24 PM, Andres Freund wrote: On 2015-02-16 01:21:55 +0100, Andres Freund wrote: Here's my next attept attempt at producing something we can agree upon. The major change that might achieve that is that I've now provided a separate method to store the origin_id of a node. I've made it conditional on !REPLICATION_IDENTIFIER_REUSE_PADDING, to show both paths. That new method uses Heikki's xlog rework to dynamically add the origin to the record if a origin is set up. That works surprisingly simply. I'm trying to figure out what this feature is supposed to do, but the patch contains no documentation or a commit message. Where is one supposed to start? For a relatively short summary: http://archives.postgresql.org/message-id/20150216002155.GI15326%40awork2.anarazel.de For a longer one: http://www.postgresql.org/message-id/20140923182422.ga15...@alap3.anarazel.de Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Replication identifiers, take 4
On 2015-02-22 04:59:30 +0100, Petr Jelinek wrote: Now that the issue with padding seems to no longer exists since the patch works both with and without padding, I went through the code and here are some comments I have (in no particular order). In CheckPointReplicationIdentifier: + * FIXME: Add a CRC32 to the end. The function already does it (I guess you forgot to remove the comment). Yep. I locally have a WIP version that's much cleaned up and doesn't contain it anymore. Using max_replication_slots as limit for replication_identifier states does not really make sense to me as replication_identifiers track remote info while and slots are local and in case of master-slave replication you need replication identifiers but don't need slots. On the other hand, it's quite cheap if unused. Not sure if several variables are worth it. In bootstrap.c: #define MARKNOTNULL(att) \ ((att)-attlen 0 || \ (att)-atttypid == OIDVECTOROID || \ - (att)-atttypid == INT2VECTOROID) + (att)-atttypid == INT2VECTOROID || \ + strcmp(NameStr((att)-attname), riname) == 0 \ +) Huh? Can this be solved in a nicer way? Yes. I'd mentioned that this is just a temporary hack ;). I've since pushed a more proper solution; BKI_FORCE_NOT_NULL can now be specified on the column. Since we call XLogFlush with local_lsn as parameter, shouldn't we check that it's actually within valid range? Currently we'll get errors like this if set to invalid value: ERROR: xlog flush request 123/123 is not satisfied --- flushed only to 0/168FB18 I think we should rather remove local_lsn from all parameters that are user callable, adding them there was a mistake. It's really only relevant for the cases where it's called by commit. In AdvanceReplicationIndentifier: +/* + * XXX: should we restore into a hashtable and dump into shmem only after + * recovery finished? + */ Probably no given that the function is also callable via SQL interface. Well, it's still a good idea regardless... As I wrote in another email, I would like to integrate the RepNodeId and CommitTSNodeId into single thing. Will reply separately there. There are no docs for the new sql interfaces. Yea. The whole thing needs docs. Thanks, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Replication identifiers, take 4
On 2015-02-19 00:49:50 +0100, Petr Jelinek wrote: On 16/02/15 10:46, Andres Freund wrote: On 2015-02-16 11:34:10 +0200, Heikki Linnakangas wrote: At a quick glance, this basic design seems workable. I would suggest expanding the replication IDs to regular 4 byte oids. Two extra bytes is a small price to pay, to make it work more like everything else in the system. I don't know. Growing from 3 to 5 byte overhead per relevant record (or even 0 to 5 in case the padding is reused) is rather noticeable. If we later find it to be a limit (I seriously doubt that), we can still increase it in a major release without anybody really noticing. I agree that limiting the overhead is important. But I have related though about this - I wonder if it's worth to try to map this more directly to the CommitTsNodeId. Maybe. I'd rather go the other way round though; replication_identifier.c/h's stuff seems much more generic than CommitTsNodeId. I mean it is currently using that for the actual storage, but I am thinking of having the Replication Identifiers be the public API and the CommitTsNodeId stuff be just hidden implementation detail. This thought is based on the fact that CommitTsNodeId provides somewhat meaningless number while Replication Identifiers give us nice name for that number. And if we do that then the type should perhaps be same for both? I'm not sure. Given that this is included in a significant number of recordsd I'd really rather not increase the overhead as described above. Maybe we can just limit CommitTsNodeId to the same size for now? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Replication identifiers, take 4
Now that the issue with padding seems to no longer exists since the patch works both with and without padding, I went through the code and here are some comments I have (in no particular order). In CheckPointReplicationIdentifier: + * FIXME: Add a CRC32 to the end. The function already does it (I guess you forgot to remove the comment). Using max_replication_slots as limit for replication_identifier states does not really make sense to me as replication_identifiers track remote info while and slots are local and in case of master-slave replication you need replication identifiers but don't need slots. In bootstrap.c: #define MARKNOTNULL(att) \ ((att)-attlen 0 || \ (att)-atttypid == OIDVECTOROID || \ -(att)-atttypid == INT2VECTOROID) +(att)-atttypid == INT2VECTOROID || \ +strcmp(NameStr((att)-attname), riname) == 0 \ + ) Huh? Can this be solved in a nicer way? Since we call XLogFlush with local_lsn as parameter, shouldn't we check that it's actually within valid range? Currently we'll get errors like this if set to invalid value: ERROR: xlog flush request 123/123 is not satisfied --- flushed only to 0/168FB18 In AdvanceReplicationIndentifier: + /* +* XXX: should we restore into a hashtable and dump into shmem only after +* recovery finished? +*/ Probably no given that the function is also callable via SQL interface. As I wrote in another email, I would like to integrate the RepNodeId and CommitTSNodeId into single thing. There are no docs for the new sql interfaces. The replication_identifier.c might deserve some intro/notes text. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Replication identifiers, take 4
On 16/02/15 10:46, Andres Freund wrote: On 2015-02-16 11:34:10 +0200, Heikki Linnakangas wrote: At a quick glance, this basic design seems workable. I would suggest expanding the replication IDs to regular 4 byte oids. Two extra bytes is a small price to pay, to make it work more like everything else in the system. I don't know. Growing from 3 to 5 byte overhead per relevant record (or even 0 to 5 in case the padding is reused) is rather noticeable. If we later find it to be a limit (I seriously doubt that), we can still increase it in a major release without anybody really noticing. I agree that limiting the overhead is important. But I have related though about this - I wonder if it's worth to try to map this more directly to the CommitTsNodeId. I mean it is currently using that for the actual storage, but I am thinking of having the Replication Identifiers be the public API and the CommitTsNodeId stuff be just hidden implementation detail. This thought is based on the fact that CommitTsNodeId provides somewhat meaningless number while Replication Identifiers give us nice name for that number. And if we do that then the type should perhaps be same for both? -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Replication identifiers, take 4
On 2015-02-16 11:07:09 +0200, Heikki Linnakangas wrote: On 02/16/2015 02:21 AM, Andres Freund wrote: Furthermore the fact that the origin of records is recorded allows to avoid decoding them in logical decoding. That has both efficiency advantages (we can do so before they are stored in memory/disk) and functionality advantages. Imagine using a logical replication solution to replicate inserts to a single table between two databases where inserts are allowed on both - unless you prevent the replicated inserts from being replicated again you obviously have a loop. This infrastructure lets you avoid that. That makes sense. How does this work if you have multiple replication systems in use in the same cluster? You might use Slony to replication one table to one system, and BDR to replication another table with another system. Or the same replication software, but different hosts. It should just work. Replication identifiers are identified by a free form text, each replication solution can add the information/distinguising data they need in there. Bdr uses something like #define BDR_NODE_ID_FORMAT bdr_UINT64_FORMAT_%u_%u_%u_%s with remote_sysid, remote_tlid, remote_dboid, MyDatabaseId, configurable_name as parameters as a replication identifier name. I've been wondering whether the bdr_ part in the above should be split of into a separate field, similar to how the security label stuff does it. But I don't think it'd really buy us much, especially as we did not do that for logical slot names. Each of the used replication solution would probably ask their output plugin to only stream locally generated (i.e. origin_id = InvalidRepNodeId) changes, and possibly from a defined list of other known hosts in the cascading case. Does that answer your question? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Replication identifiers, take 4
On 02/16/2015 02:21 AM, Andres Freund wrote: Furthermore the fact that the origin of records is recorded allows to avoid decoding them in logical decoding. That has both efficiency advantages (we can do so before they are stored in memory/disk) and functionality advantages. Imagine using a logical replication solution to replicate inserts to a single table between two databases where inserts are allowed on both - unless you prevent the replicated inserts from being replicated again you obviously have a loop. This infrastructure lets you avoid that. That makes sense. How does this work if you have multiple replication systems in use in the same cluster? You might use Slony to replication one table to one system, and BDR to replication another table with another system. Or the same replication software, but different hosts. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Replication identifiers, take 4
On 02/16/2015 11:18 AM, Andres Freund wrote: On 2015-02-16 11:07:09 +0200, Heikki Linnakangas wrote: How does this work if you have multiple replication systems in use in the same cluster? You might use Slony to replication one table to one system, and BDR to replication another table with another system. Or the same replication software, but different hosts. It should just work. Replication identifiers are identified by a free form text, each replication solution can add the information/distinguising data they need in there. Bdr uses something like #define BDR_NODE_ID_FORMAT bdr_UINT64_FORMAT_%u_%u_%u_%s with remote_sysid, remote_tlid, remote_dboid, MyDatabaseId, configurable_name as parameters as a replication identifier name. I've been wondering whether the bdr_ part in the above should be split of into a separate field, similar to how the security label stuff does it. But I don't think it'd really buy us much, especially as we did not do that for logical slot names. Each of the used replication solution would probably ask their output plugin to only stream locally generated (i.e. origin_id = InvalidRepNodeId) changes, and possibly from a defined list of other known hosts in the cascading case. Does that answer your question? Yes, thanks. Note to self and everyone else looking at this: It's important to keep in mind is that the replication IDs are completely internal to the local cluster. They are *not* sent over the wire. That's not completely true if you also use physical replication, though. The replication IDs will be included in the WAL stream. Can you have logical decoding running in a hot standby server? And how does the progress report stuff that's checkpointed in pg_logical/checkpoints work in a hot standby? (Sorry if I'm not making sense, I haven't really thought hard about this myself) At a quick glance, this basic design seems workable. I would suggest expanding the replication IDs to regular 4 byte oids. Two extra bytes is a small price to pay, to make it work more like everything else in the system. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Replication identifiers, take 4
On 2015-02-16 11:34:10 +0200, Heikki Linnakangas wrote: Yes, thanks. Note to self and everyone else looking at this: It's important to keep in mind is that the replication IDs are completely internal to the local cluster. They are *not* sent over the wire. Well, if you *want* to, you could send the external (free form text) replication identifiers over the wire in the output plugin. There are scenarios where that might make sense. That's not completely true if you also use physical replication, though. The replication IDs will be included in the WAL stream. Right. But then a physical rep server isn't realy a different server. Can you have logical decoding running in a hot standby server? Not at the moment, there's some minor stuff missing (following timelines, enforcing tuple visibility on the primary). And how does the progress report stuff that's checkpointed in pg_logical/checkpoints work in a hot standby? (Sorry if I'm not making sense, I haven't really thought hard about this myself) It doesn't work that greatly atm, they'd need to be WAL logged for that - which would not be hard. It'd be better to include the information in the checkpoint, but that unfortunately doesn't really work, because we store the checkpoint in the control file. And that has to be = 512 bytes. What, I guess, we could do is log it in the checkpoint, after determining the redo pointer, and store the LSN for it in the checkpoint record proper. That'd mean we'd read one more record at startup, but that isn't particularly bad. At a quick glance, this basic design seems workable. I would suggest expanding the replication IDs to regular 4 byte oids. Two extra bytes is a small price to pay, to make it work more like everything else in the system. I don't know. Growing from 3 to 5 byte overhead per relevant record (or even 0 to 5 in case the padding is reused) is rather noticeable. If we later find it to be a limit (I seriously doubt that), we can still increase it in a major release without anybody really noticing. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Replication identifiers, take 4
Hi, Here's my next attept attempt at producing something we can agree upon. The major change that might achieve that is that I've now provided a separate method to store the origin_id of a node. I've made it conditional on !REPLICATION_IDENTIFIER_REUSE_PADDING, to show both paths. That new method uses Heikki's xlog rework to dynamically add the origin to the record if a origin is set up. That works surprisingly simply. Other changes: * Locking preventing several backends to replay changes at the same time. This is actually overly restrictive in some cases, but I think good enough for now. * Logical decoding grew a filter_by_origin callback that allows to ignore changes that were replayed on a remote system. Such filters are executed before much is done with records, potentially saving a fair bit of costs. * Rebased. That took a bit due the xlog and other changes. * A couple more SQL interface functions (like dropping a replication identifier). I also want to quickly recap replication identifiers, given that in-person conversations with several people proved that the concept was slightly misunderstood: Think about a logical replication solution trying to replay changes. The postmaster in which the data is replayed into crashes every now and then. Replication identifiers allow you to do something like: do_replication() { source = ConnectToSourceSystem('mysource'); target = ConnectToSourceSystem('target'); # mark we're replayin target.exec($$SELECT pg_replication_identifier_setup_replaying_from('myrep_mysource')$$); # get how far we've replayed last time round remote_lsn = target.exec($$SELECT remote_lsn FROM pg_get_replication_identifier_progress WHERE external_id = 'myrep_mysource'); # and now replay changes copystream = source.exec('START_LOGICAL_REPLICATION SLOT ... START %x', remote_lsn); while (record = copystream.get_record()) { if (record.type = 'begin') { target.exec('BEGIN'); # setup the position of this individual xact target.exec('SELECT pg_replication_identifier_setup_tx_origin($1, $2);', record.origin_lsn, record.origin_commit_timestamp); } else if (record.type = 'change') target.exec(record.change_sql) else if (record.type = 'commit') target.exec('COMMIT'); } } A non pseudocode version of the above would be safe against crashes of both the source and the target system. If the target system crashes the replication identifier logic will recover how far we replayed during crash recovery. If the source system crashes/disconnects we'll have the current value in memory. Note that this works perfectly well if the target system (and obviously the source system, but that's obvious) use synchronous_commit = off - we'll not miss any changes. Furthermore the fact that the origin of records is recorded allows to avoid decoding them in logical decoding. That has both efficiency advantages (we can do so before they are stored in memory/disk) and functionality advantages. Imagine using a logical replication solution to replicate inserts to a single table between two databases where inserts are allowed on both - unless you prevent the replicated inserts from being replicated again you obviously have a loop. This infrastructure lets you avoid that. The SQL interface consists out of: # manage existance of identifiers internal_id pg_replication_identifier_create(external_id); void pg_replication_identifier_drop(external_id); # replay management void pg_replication_identifier_setup_replaying_from(external_id); void pg_replication_identifier_reset_replaying_from(); bool pg_replication_identifier_is_replaying(); void pg_replication_identifier_setup_tx_origin(remote_lsn, remote_commit_time); # replication progress status view SELECT * FROM pgreplication_identifier_progress; # replicatation identifiers SELECT * FROM pg_replication_identifier; Petr has developed (for UDR, i.e. logical replication ontop of 9.4) a SQL reimplementation of replication identifiers and that has proven that for busier workloads doing a table update to store the replication progress indeed has a noticeable overhead. Especially if there's some longer running activity on the standby. The bigger questions I have are: 1) Where to store the origin. I personally still think that using the padding is fine. Now that I have proven that it's pretty simple to store additional information the argument that it might be needed for something else doesn't really hold anymore. But I can live with the other solution as well - 3 bytes additional overhead ain't so bad. 2) If we go with the !REPLICATION_IDENTIFIER_REUSE_PADDING solution, do we want to store the origin only on relevant records? That'd be XLOG_HEAP_INSERT/XLOG_HEAPMULTI_INSERT/XLOG_HEAP_UPDATE // XLOG_XACT_COMMIT/XLOG_XACT_COMMIT_PREPARED. I'm thinking of something
Re: [HACKERS] Replication identifiers, take 4
On 2015-02-16 01:21:55 +0100, Andres Freund wrote: Here's my next attept attempt at producing something we can agree upon. The major change that might achieve that is that I've now provided a separate method to store the origin_id of a node. I've made it conditional on !REPLICATION_IDENTIFIER_REUSE_PADDING, to show both paths. That new method uses Heikki's xlog rework to dynamically add the origin to the record if a origin is set up. That works surprisingly simply. Other changes: * Locking preventing several backends to replay changes at the same time. This is actually overly restrictive in some cases, but I think good enough for now. * Logical decoding grew a filter_by_origin callback that allows to ignore changes that were replayed on a remote system. Such filters are executed before much is done with records, potentially saving a fair bit of costs. * Rebased. That took a bit due the xlog and other changes. * A couple more SQL interface functions (like dropping a replication identifier). And here an actual patch. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services From 268d52cac6bf7fe1c019fd68248853c7c7ae18b1 Mon Sep 17 00:00:00 2001 From: Andres Freund and...@anarazel.de Date: Mon, 16 Feb 2015 01:22:08 +0100 Subject: [PATCH] Introduce replication identifiers to keep track of replication progress: v0.6 --- contrib/test_decoding/Makefile |3 +- contrib/test_decoding/expected/replident.out | 84 ++ contrib/test_decoding/sql/replident.sql| 40 + contrib/test_decoding/test_decoding.c | 28 + src/backend/access/rmgrdesc/xactdesc.c | 17 +- src/backend/access/transam/xact.c | 64 +- src/backend/access/transam/xlog.c | 34 +- src/backend/access/transam/xloginsert.c| 22 +- src/backend/access/transam/xlogreader.c| 10 + src/backend/bootstrap/bootstrap.c |5 +- src/backend/catalog/Makefile |2 +- src/backend/catalog/catalog.c |8 +- src/backend/catalog/system_views.sql |7 + src/backend/replication/logical/Makefile |3 +- src/backend/replication/logical/decode.c | 63 +- src/backend/replication/logical/logical.c |5 + src/backend/replication/logical/reorderbuffer.c|5 +- .../replication/logical/replication_identifier.c | 1190 src/backend/storage/ipc/ipci.c |3 + src/backend/utils/cache/syscache.c | 23 + src/backend/utils/misc/guc.c |1 + src/bin/initdb/initdb.c|1 + src/bin/pg_resetxlog/pg_resetxlog.c|3 + src/include/access/xact.h | 10 +- src/include/access/xlog.h |1 + src/include/access/xlogdefs.h |6 + src/include/access/xlogreader.h|9 + src/include/access/xlogrecord.h|5 +- src/include/catalog/indexing.h |6 + src/include/catalog/pg_proc.h | 28 + src/include/catalog/pg_replication_identifier.h| 75 ++ src/include/pg_config_manual.h |6 + src/include/replication/output_plugin.h|8 + src/include/replication/reorderbuffer.h|8 +- src/include/replication/replication_identifier.h | 58 + src/include/utils/syscache.h |2 + src/test/regress/expected/rules.out|5 + src/test/regress/expected/sanity_check.out |1 + 38 files changed, 1816 insertions(+), 33 deletions(-) create mode 100644 contrib/test_decoding/expected/replident.out create mode 100644 contrib/test_decoding/sql/replident.sql create mode 100644 src/backend/replication/logical/replication_identifier.c create mode 100644 src/include/catalog/pg_replication_identifier.h create mode 100644 src/include/replication/replication_identifier.h diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile index 438be44..f8334cc 100644 --- a/contrib/test_decoding/Makefile +++ b/contrib/test_decoding/Makefile @@ -37,7 +37,8 @@ submake-isolation: submake-test_decoding: $(MAKE) -C $(top_builddir)/contrib/test_decoding -REGRESSCHECKS=ddl rewrite toast permissions decoding_in_xact decoding_into_rel binary prepared +REGRESSCHECKS=ddl rewrite toast permissions decoding_in_xact decoding_into_rel \ + binary prepared replident regresscheck: all | submake-regress submake-test_decoding $(MKDIR_P) regression_output diff --git a/contrib/test_decoding/expected/replident.out b/contrib/test_decoding/expected/replident.out new file mode 100644 index 000..1c508a5 --- /dev/null +++