Re: [HACKERS] Slow synchronous logical replication
On 12 October 2017 at 16:09, Konstantin Knizhnik wrote:

>> Is the CREATE TABLE and INSERT done in the same transaction?
>
> No. The table was created in a separate transaction. Moreover, the same
> effect takes place if the table is created before the start of
> replication. The problem in this case seems to be caused by spilling the
> decoded transaction to file in ReorderBufferSerializeTXN.

Yeah. That's known to perform sub-optimally, and it also uses way more memory than it should. Your design compounds that by spilling transactions it will then discard, and doing so multiple times.

To make your design viable you likely need some kind of cache of serialized reorder buffer transactions, where you don't rebuild one if it's already been generated. And likely a fair bit of optimisation on the serialisation.

Or you might want a table- and even row-level filter that can be run during decoding, before appending to the ReorderBuffer, to let you skip changes early. Right now this can only be done at the transaction level, based on replication origin. Of course, if you do this you can't do the caching thing.

> Unfortunately it is not quite clear how to make the wal-sender smarter
> and let it skip transactions not affecting its publication.

You'd need more hooks to be implemented by the output plugin.

> I am really not sure that it is possible to skip over WAL. But the
> particular problem with invalidation records etc. can be solved by
> always having the WAL sender process such records. I.e. if a backend is
> inserting an invalidation record, or some other record which should
> always be processed by the WAL sender, it can always promote the LSN of
> this record to the WAL sender. So the WAL sender will skip only those
> WAL records which are safe to skip (insert/update/delete records not
> affecting this publication).

That sounds like a giant layering violation too. I suggest focusing on reducing the amount of work done when reading WAL, not trying to jump over whole ranges of WAL.
-- 
Craig Ringer
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Slow synchronous logical replication
On 12.10.2017 04:23, Craig Ringer wrote:

> On 12 October 2017 at 00:57, Konstantin Knizhnik wrote:
>> The reason for such behavior is obvious: the wal sender has to decode
>> the huge transaction generated by the insert although it has no
>> relation to this publication.
>
> It does. Though I wouldn't expect anywhere near the kind of drop you
> report, and haven't observed it here. Is the CREATE TABLE and INSERT
> done in the same transaction?

No. The table was created in a separate transaction. Moreover, the same effect takes place if the table is created before the start of replication. The problem in this case seems to be caused by spilling the decoded transaction to file in ReorderBufferSerializeTXN.

Please look at two profiles: http://garret.ru/lr1.svg corresponds to normal work of pgbench with synchronous replication to one replica; http://garret.ru/lr2.svg is the same with concurrent execution of a huge insert statement.

And here is the output of pgbench (at the fifth second the insert is started):

progress: 1.0 s, 10020.9 tps, lat 0.791 ms stddev 0.232
progress: 2.0 s, 10184.1 tps, lat 0.786 ms stddev 0.192
progress: 3.0 s, 10058.8 tps, lat 0.795 ms stddev 0.301
progress: 4.0 s, 10230.3 tps, lat 0.782 ms stddev 0.194
progress: 5.0 s, 10335.0 tps, lat 0.774 ms stddev 0.192
progress: 6.0 s, 4535.7 tps, lat 1.591 ms stddev 9.370
progress: 7.0 s, 419.6 tps, lat 20.897 ms stddev 55.338
progress: 8.0 s, 105.1 tps, lat 56.140 ms stddev 76.309
progress: 9.0 s, 9.0 tps, lat 504.104 ms stddev 52.964
progress: 10.0 s, 14.0 tps, lat 797.535 ms stddev 156.082
progress: 11.0 s, 14.0 tps, lat 601.865 ms stddev 93.598
progress: 12.0 s, 11.0 tps, lat 658.276 ms stddev 138.503
progress: 13.0 s, 9.0 tps, lat 784.120 ms stddev 127.206
progress: 14.0 s, 7.0 tps, lat 870.944 ms stddev 156.377
progress: 15.0 s, 8.0 tps, lat .578 ms stddev 140.987
progress: 16.0 s, 7.0 tps, lat 1258.750 ms stddev 75.677
progress: 17.0 s, 6.0 tps, lat 991.023 ms stddev 229.058
progress: 18.0 s, 5.0 tps, lat 1063.986 ms stddev 269.361

It seems to be an effect of large transactions. The presence of several channels of synchronous logical replication reduces performance, but not by so much. Below are the results at another machine, with pgbench scale 10:

Configuration               TPS
standalone                  15k
1 async logical replica     13k
1 sync logical replica      10k
3 async logical replicas    13k
3 sync logical replicas      8k

> Only partly true. The output plugin can register a transaction origin
> filter and use that to say it's entirely uninterested in a transaction.
> But this only works based on filtering by origins. Not tables.

Yes, I know about the origin filtering mechanism (and we are using it in multimaster). But I am speaking about the standard pgoutput.c output plugin: its pgoutput_origin_filter always returns false.

> I imagine we could call another hook in output plugins, "do you care
> about this table", and use it to skip some more work for tuples that
> particular decoding session isn't interested in. Skip adding them to
> the reorder buffer, etc. No such hook currently exists, but it'd be an
> interesting patch for Pg11 if you feel like working on it.
>
>> Unfortunately it is not quite clear how to make the wal-sender smarter
>> and let it skip transactions not affecting its publication.
>
> As noted, it already can do so by origin. Mostly. We cannot totally
> skip over WAL, since we need to process various invalidations etc. See
> ReorderBufferSkip.

The problem is that before the end of a transaction we do not know whether it touches this publication or not, so filtering by origin will not work in this case. I am really not sure that it is possible to skip over WAL. But the particular problem with invalidation records etc. can be solved by always having the WAL sender process such records. I.e. if a backend is inserting an invalidation record, or some other record which should always be processed by the WAL sender, it can always promote the LSN of this record to the WAL sender. So the WAL sender will skip only those WAL records which are safe to skip (insert/update/delete records not affecting this publication).

I wonder if there can be some other problems with skipping parts of transactions in the WAL sender.

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] Slow synchronous logical replication
On 12 October 2017 at 00:57, Konstantin Knizhnik wrote:

> The reason for such behavior is obvious: the wal sender has to decode
> the huge transaction generated by the insert although it has no
> relation to this publication.

It does. Though I wouldn't expect anywhere near the kind of drop you report, and haven't observed it here.

Is the CREATE TABLE and INSERT done in the same transaction? Because that's a known pathological case for logical replication; it has to do a LOT of extra work when it's in a transaction that has done DDL. I'm sure there's room for optimisation there, but the general recommendation for now is "don't do that".

> Filtering of insert records of this transaction is done only inside the
> output plug-in.

Only partly true. The output plugin can register a transaction origin filter and use that to say it's entirely uninterested in a transaction. But this only works based on filtering by origins. Not tables.

I imagine we could call another hook in output plugins, "do you care about this table", and use it to skip some more work for tuples that particular decoding session isn't interested in. Skip adding them to the reorder buffer, etc. No such hook currently exists, but it'd be an interesting patch for Pg11 if you feel like working on it.

> Unfortunately it is not quite clear how to make the wal-sender smarter
> and let it skip transactions not affecting its publication.

As noted, it already can do so by origin. Mostly. We cannot totally skip over WAL, since we need to process various invalidations etc. See ReorderBufferSkip.

It's not so simple by table, since we don't know early enough whether the xact affects tables of interest or not. But you could definitely do some selective skipping. Making it efficient could be the challenge.

> One of the possible solutions is to let the backend inform the
> wal-sender about the smallest LSN it should wait for (the backend knows
> which table is affected by the current operation, and so which
> publications are interested in this operation, and so can point the
> wal-sender to the proper LSN without decoding a huge part of the WAL).
> But it seems to be not so easy to implement.

Sounds like confusing layering violations to me.

-- 
Craig Ringer
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Slow synchronous logical replication
On 11.10.2017 10:07, Craig Ringer wrote:

> On 9 October 2017 at 15:37, Konstantin Knizhnik wrote:
>> Thank you for the explanations.
>>
>> On 08.10.2017 16:00, Craig Ringer wrote:
>>> I think it'd be helpful if you provided reproduction instructions,
>>> test programs, etc, making it very clear when things are / aren't
>>> related to your changes.
>>
>> It will not be so easy to provide a reproducing scenario, because it
>> actually involves many components (postgres_fdw, pg_pathman,
>> pg_shardman, LR, ...)
>
> So simplify it to a test case that doesn't.

The simplest reproducing scenario is the following:

1. Start two Postgres instances: synchronous_commit=on, fsync=off.
2. Initialize a pgbench database at both instances: pgbench -i
3. Create a publication for the pgbench_accounts table at one node.
4. Create a corresponding subscription at the other node with the copy_data=false parameter.
5. Add the subscription to synchronous_standby_names at the first node.
6. Start pgbench -c 8 -N -T 100 -P 1 at the first node.

On my system the results are the following:

standalone postgres:        8600 TPS
asynchronous replication:   6600 TPS
synchronous replication:    5600 TPS

Quite good results.

7. Create some dummy table and perform a bulk insert into it:

create table dummy(x integer primary key);
insert into dummy values (generate_series(1,1000));

pgbench almost gets stuck: until the end of the insert, performance drops almost to zero.

The reason for such behavior is obvious: the wal sender has to decode the huge transaction generated by the insert although it has no relation to this publication. Filtering of insert records of this transaction is done only inside the output plug-in. Unfortunately it is not quite clear how to make the wal-sender smarter and let it skip transactions not affecting its publication.

One of the possible solutions is to let the backend inform the wal-sender about the smallest LSN it should wait for (the backend knows which table is affected by the current operation, and so which publications are interested in this operation, and so can point the wal-sender to the proper LSN without decoding a huge part of the WAL). But it seems to be not so easy to implement.

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] Slow synchronous logical replication
Hi,

On 2017-10-09 10:37:01 +0300, Konstantin Knizhnik wrote:
> So we have implemented sharding - splitting data between several remote
> tables using pg_pathman and postgres_fdw. It means that an insert or
> update of the parent table causes inserts or updates of some derived
> partitions, which are forwarded by postgres_fdw to the correspondent
> node. The number of shards is significantly larger than the number of
> nodes, i.e. for 5 nodes we have 50 shards, which means that at each
> node we have 10 shards. To provide fault tolerance each shard is
> replicated using logical replication to one or more nodes. Right now we
> considered only redundancy level 1 - each shard has only one replica.
> So from each node we establish 10 logical replication channels.

Isn't that part of the pretty fundamental problem? There shouldn't be 10 different replication channels per node. There should be one.

Greetings,

Andres Freund
Re: [HACKERS] Slow synchronous logical replication
On 9 October 2017 at 15:37, Konstantin Knizhnik wrote:

> Thank you for the explanations.
>
> On 08.10.2017 16:00, Craig Ringer wrote:
>> I think it'd be helpful if you provided reproduction instructions,
>> test programs, etc, making it very clear when things are / aren't
>> related to your changes.
>
> It will not be so easy to provide a reproducing scenario, because it
> actually involves many components (postgres_fdw, pg_pathman,
> pg_shardman, LR, ...)

So simplify it to a test case that doesn't.

> I have checked the syncrepl.c file, particularly the
> SyncRepGetSyncRecPtr function. Each wal sender independently calculates
> the minimal LSN among all synchronous replicas and wakes up backends
> waiting for this LSN. It means that a transaction performing an update
> of data in one shard will actually wait for confirmation from the
> replication channels for all shards.

That's expected for the current sync rep design, yes. Because it's based on LSN, and was designed for physical rep, where there's no question about whether we're sending some data to some peers and not others.

So all backends will wait for the slowest-responding peer, including peers that don't need to actually do anything for this xact.

You could possibly hack around that by having the output plugin advance the slot position when it sees that it just processed an empty xact.

-- 
Craig Ringer
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Slow synchronous logical replication
On Mon, Oct 9, 2017 at 4:37 PM, Konstantin Knizhnik wrote:

> Thank you for the explanations.
>
> On 08.10.2017 16:00, Craig Ringer wrote:
>> I think it'd be helpful if you provided reproduction instructions,
>> test programs, etc, making it very clear when things are / aren't
>> related to your changes.
>
> It will not be so easy to provide a reproducing scenario, because it
> actually involves many components (postgres_fdw, pg_pathman,
> pg_shardman, LR, ...) and requires a multinode installation. But let me
> try to explain what is going on.
>
> We have implemented sharding - splitting data between several remote
> tables using pg_pathman and postgres_fdw. It means that an insert or
> update of the parent table causes inserts or updates of some derived
> partitions, which are forwarded by postgres_fdw to the correspondent
> node. The number of shards is significantly larger than the number of
> nodes, i.e. for 5 nodes we have 50 shards, which means that at each
> node we have 10 shards. To provide fault tolerance each shard is
> replicated using logical replication to one or more nodes. Right now we
> considered only redundancy level 1 - each shard has only one replica.
> So from each node we establish 10 logical replication channels.
>
> We want commit to wait until data is actually stored at all replicas,
> so we are using synchronous replication: we set the synchronous_commit
> option to "on" and include all 10 subscriptions in the
> synchronous_standby_names list.
>
> In this setup commit latency is very large (about 100 msec, and most of
> the time is actually spent in commit) and performance is very bad:
> pgbench shows about 300 TPS for the optimal number of clients (about
> 10; for a larger number performance is almost the same). Without
> logical replication in the same setup we get about 6000 TPS.
>
> I have checked the syncrepl.c file, particularly the
> SyncRepGetSyncRecPtr function. Each wal sender independently calculates
> the minimal LSN among all synchronous replicas and wakes up backends
> waiting for this LSN. It means that a transaction performing an update
> of data in one shard will actually wait for confirmation from the
> replication channels for all shards. If some shard is updated more
> rarely than the others, or is not updated at all (for example because
> the communication channels between the nodes are broken), then all
> backends will get stuck. Also all backends are competing for the single
> SyncRepLock, which can also be a contention point.

IIUC, I guess you meant to say that in current synchronous logical replication a transaction has to wait for updated table data to be replicated even on servers that don't subscribe to the table. If we change it so that a transaction needs to wait only for the servers that subscribe to the table, it would be more efficient, at least for your use case. We send at least the begin and commit data to all subscriptions and then wait for the reply from them, but can we skip waiting for them, for example, when the walsender actually didn't send any data modified by the transaction?

Regards,

-- 
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Re: [HACKERS] Slow synchronous logical replication
Thank you for the explanations.

On 08.10.2017 16:00, Craig Ringer wrote:

> I think it'd be helpful if you provided reproduction instructions, test
> programs, etc, making it very clear when things are / aren't related to
> your changes.

It will not be so easy to provide a reproducing scenario, because it actually involves many components (postgres_fdw, pg_pathman, pg_shardman, LR, ...) and requires a multinode installation. But let me try to explain what is going on.

We have implemented sharding - splitting data between several remote tables using pg_pathman and postgres_fdw. It means that an insert or update of the parent table causes inserts or updates of some derived partitions, which are forwarded by postgres_fdw to the correspondent node. The number of shards is significantly larger than the number of nodes, i.e. for 5 nodes we have 50 shards, which means that at each node we have 10 shards. To provide fault tolerance each shard is replicated using logical replication to one or more nodes. Right now we considered only redundancy level 1 - each shard has only one replica. So from each node we establish 10 logical replication channels.

We want commit to wait until data is actually stored at all replicas, so we are using synchronous replication: we set the synchronous_commit option to "on" and include all 10 subscriptions in the synchronous_standby_names list.

In this setup commit latency is very large (about 100 msec, and most of the time is actually spent in commit) and performance is very bad: pgbench shows about 300 TPS for the optimal number of clients (about 10; for a larger number performance is almost the same). Without logical replication in the same setup we get about 6000 TPS.

I have checked the syncrepl.c file, particularly the SyncRepGetSyncRecPtr function. Each wal sender independently calculates the minimal LSN among all synchronous replicas and wakes up backends waiting for this LSN. It means that a transaction performing an update of data in one shard will actually wait for confirmation from the replication channels for all shards. If some shard is updated more rarely than the others, or is not updated at all (for example because the communication channels between the nodes are broken), then all backends will get stuck. Also all backends are competing for the single SyncRepLock, which can also be a contention point.

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] Slow synchronous logical replication
On 8 October 2017 at 03:58, Konstantin Knizhnik wrote:

> The question was about the logical replication mechanism in the
> mainstream version of Postgres.

I think it'd be helpful if you provided reproduction instructions, test programs, etc, making it very clear when things are / aren't related to your changes.

> I think that most people are using asynchronous logical replication,
> and synchronous LR is something exotic, not well tested and
> investigated. It will be great if I am wrong :)

I doubt it's widely used. That said, a lot of people use synchronous replication with BDR and pglogical, which are ancestors of the core logical rep code and design.

I think you actually need to collect some proper timings and diagnostics here, rather than hand-waving about it being "slow". A good starting point might be setting some custom 'perf' tracepoints, or adding some 'elog()'ing with timestamps. Then scrape the results and build a latency graph.

That said, if I had to guess why it's slow, I'd say that you're facing a number of factors:

* By default, logical replication in PostgreSQL does not do an immediate flush to disk after downstream commit. In the interests of faster apply performance it instead delays sending flush confirmations until the next time WAL is flushed out. See the docs for CREATE SUBSCRIPTION, notably the synchronous_commit option. This will obviously greatly increase latencies on sync commit.

* Logical decoding doesn't *start* streaming a transaction until the origin node finishes the xact and writes a COMMIT, then the xlogreader picks it up.

* As a consequence of the above, a big xact holds up commit confirmations of smaller ones by a LOT more than is the case for streaming physical replication.

Hopefully that gives you something to look into, anyway. Maybe you'll be inspired to work on parallelized logical decoding :)

-- 
Craig Ringer
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Slow synchronous logical replication
On 10/07/2017 10:42 PM, Andres Freund wrote:

> Hi,
>
> On 2017-10-07 22:39:09 +0300, konstantin knizhnik wrote:
>> In our sharded cluster project we are trying to use logical
>> replication for providing HA (maintaining redundant shard copies).
>> Using asynchronous logical replication does not make much sense in the
>> context of HA, which is why we try to use synchronous logical
>> replication. Unfortunately it shows very bad performance. With 50
>> shards and redundancy level 1 (just one copy) the cluster is 20 times
>> slower than without logical replication. With asynchronous replication
>> it is "only" two times slower.
>>
>> As far as I understand, the reason for such bad performance is that
>> the synchronous replication mechanism was originally developed for
>> streaming replication, where all replicas have the same content and
>> LSNs. When it is used for logical replication, it behaves very
>> inefficiently. A commit has to wait for confirmations from all
>> receivers mentioned in the "synchronous_standby_names" list. So we are
>> waiting not only for our own single logical replication standby, but
>> for all other standbys as well.
>
> This seems to be a question that is a) about a commercial project we
> don't know much about b) hasn't received a lot of investigation.

Sorry if I was not clear. The question was about the logical replication mechanism in the mainstream version of Postgres.

I think that most people are using asynchronous logical replication, and synchronous LR is something exotic, not well tested and investigated. It will be great if I am wrong :)

Concerning our sharded cluster (pg_shardman): it is not a commercial product yet, it is in the development phase. We are going to open its sources when it is more or less stable. But unlike multimaster, this sharded cluster is mostly built from existing components: pg_pathman + postgres_fdw + logical replication. So we are just trying to combine them all into an integrated system. Currently the most obscure point is logical replication. The main goal of my e-mail was to learn the opinion of the authors and users of LR on whether it is a good idea to use LR to provide fault tolerance in a sharded cluster, or whether some other approaches, for example sharding with redundancy or using streaming replication, are preferable.

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] Slow synchronous logical replication
Hi,

On 2017-10-07 22:39:09 +0300, konstantin knizhnik wrote:
> In our sharded cluster project we are trying to use logical replication
> for providing HA (maintaining redundant shard copies). Using
> asynchronous logical replication does not make much sense in the
> context of HA, which is why we try to use synchronous logical
> replication. Unfortunately it shows very bad performance. With 50
> shards and redundancy level 1 (just one copy) the cluster is 20 times
> slower than without logical replication. With asynchronous replication
> it is "only" two times slower.
>
> As far as I understand, the reason for such bad performance is that the
> synchronous replication mechanism was originally developed for
> streaming replication, where all replicas have the same content and
> LSNs. When it is used for logical replication, it behaves very
> inefficiently. A commit has to wait for confirmations from all
> receivers mentioned in the "synchronous_standby_names" list. So we are
> waiting not only for our own single logical replication standby, but
> for all other standbys as well. The number of synchronous standbys is
> equal to the number of shards divided by the number of nodes. To
> provide uniform distribution the number of shards should be much larger
> than the number of nodes; for example for 10 nodes we usually create
> 100 shards. As a result we get awful performance, and blocking of any
> replication channel blocks all backends.
>
> So my question is whether my understanding is correct and synchronous
> logical replication cannot be efficiently used in such a manner. If so,
> the next question is how difficult it would be to make the synchronous
> replication mechanism more efficient for logical replication, and
> whether there are plans to work in this direction.

This seems to be a question that is a) about a commercial project we don't know much about b) hasn't received a lot of investigation.

Greetings,

Andres Freund