Re: [HACKERS] logical replication - possible remaining problem
Hi,

On 07/06/17 22:49, Erik Rijkers wrote:
> I am not sure whether what I found here amounts to a bug, I might be
> doing something dumb.
>
> During the last few months I did tests by running pgbench over logical
> replication. Earlier emails have details.
>
> The basic form of that now works well (and the fix has been committed)
> but as I looked over my testing program I noticed one change I made to
> it, already many weeks ago:
>
> In the cleanup during startup (pre-flight check you might say) and also
> before the end, instead of
>
> echo "delete from pg_subscription;" | psql -qXp $port2 -- (1)
>
> I changed that (as I say, many weeks ago) to:
>
> echo "delete from pg_subscription;
> delete from pg_subscription_rel;
> delete from pg_replication_origin; " | psql -qXp $port2 -- (2)
>
> This occurs (2x) inside the bash function clean_pubsub(), in main test
> script pgbench_detail2.sh
>
> This change was an effort to ensure to arrive at a 'clean' start (and
> end-) state which would always be the same.
>
> All my more recent testing (and that of Mark, I have to assume) was thus
> done with (2).
>
> Now, looking at the script again I am thinking that it would be
> reasonable to expect that after issuing
>    delete from pg_subscription;
>
> the other 2 tables are /also/ cleaned, automatically, as a consequence.
> (Is this reasonable? this is really the main question of this email).
>

Hmm, they are not cleaned automatically. Deleting from system catalogs
manually like this never propagates to related tables; we don't use FKs
there.

> So I removed the latter two delete statements again, and ran the tests
> again with the form in (1)
>
> I have established that (after a number of successful cycles) the test
> stops succeeding with in the replica log repetitions of:
>
> 2017-06-07 22:10:29.057 CEST [2421] LOG: logical replication apply
> worker for subscription "sub1" has started
> 2017-06-07 22:10:29.057 CEST [2421] ERROR: could not find free
> replication state slot for replication origin with OID 11
> 2017-06-07 22:10:29.057 CEST [2421] HINT: Increase
> max_replication_slots and try again.
> 2017-06-07 22:10:29.058 CEST [2061] LOG: worker process: logical
> replication worker for subscription 29235 (PID 2421) exited with exit
> code 1
>
> when I manually 'clean up' by doing:
>    delete from pg_replication_origin;
>

Yeah, because you consumed all the origins (I am still not a huge fan of
how that limit works, but that's a separate discussion).

> then, and only then, does the session finish and succeed ('replica ok').
>
> So to me it looks as if there is an omission of
> pg_replication_origin-cleanup when pg_subscription is deleted.
>

There is no omission; the origin is not supposed to be deleted
automatically unless you use DROP SUBSCRIPTION.

-- 
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
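
For readers of the archive, the DDL-based cleanup recommended in this thread might be sketched roughly as follows. This is only an illustration, not the actual contents of pgbench_detail2.sh: the helper name drop_subscriptions_sql is hypothetical, the subscription name sub1 is taken from the log above, and $port2 is assumed to be set as in the original script. It also assumes simple subscription names that contain no double quotes.

```shell
# Hypothetical helper: print one DROP SUBSCRIPTION statement per name given.
# Assumes names without embedded double quotes.
drop_subscriptions_sql() {
  for sub in "$@"; do
    printf 'DROP SUBSCRIPTION IF EXISTS "%s";\n' "$sub"
  done
}

# Sketch of a cleanup function that uses DDL instead of deleting from
# pg_subscription, pg_subscription_rel and pg_replication_origin directly.
# $port2 is assumed to hold the replica's port, as in the thread.
clean_pubsub() {
  drop_subscriptions_sql sub1 | psql -qXp "$port2"
}
```

Note that DROP SUBSCRIPTION normally also tries to drop the replication slot on the publisher, so in a test harness the publisher would need to be reachable when clean_pubsub runs.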
Re: [HACKERS] logical replication - possible remaining problem
On 2017-06-07 23:18, Alvaro Herrera wrote:
> Erik Rijkers wrote:
>
>> Now, looking at the script again I am thinking that it would be
>> reasonable to expect that after issuing
>>    delete from pg_subscription;
>>
>> the other 2 tables are /also/ cleaned, automatically, as a
>> consequence. (Is this reasonable? this is really the main question
>> of this email).
>
> I don't think it's reasonable to expect that the system recovers
> automatically from what amounts to catalog corruption. You should be
> using the DDL that removes subscriptions instead.

You're right, that makes sense.

Thanks.
Re: [HACKERS] logical replication - possible remaining problem
Erik Rijkers wrote:

> Now, looking at the script again I am thinking that it would be reasonable
> to expect that after issuing
>    delete from pg_subscription;
>
> the other 2 tables are /also/ cleaned, automatically, as a consequence. (Is
> this reasonable? this is really the main question of this email).

I don't think it's reasonable to expect that the system recovers
automatically from what amounts to catalog corruption. You should be
using the DDL that removes subscriptions instead.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
[HACKERS] logical replication - possible remaining problem
I am not sure whether what I found here amounts to a bug, I might be
doing something dumb.

During the last few months I did tests by running pgbench over logical
replication. Earlier emails have details.

The basic form of that now works well (and the fix has been committed)
but as I looked over my testing program I noticed one change I made to
it, already many weeks ago:

In the cleanup during startup (pre-flight check you might say) and also
before the end, instead of

  echo "delete from pg_subscription;" | psql -qXp $port2    -- (1)

I changed that (as I say, many weeks ago) to:

  echo "delete from pg_subscription;
  delete from pg_subscription_rel;
  delete from pg_replication_origin; " | psql -qXp $port2   -- (2)

This occurs (2x) inside the bash function clean_pubsub(), in main test
script pgbench_detail2.sh

This change was an effort to ensure to arrive at a 'clean' start (and
end-) state which would always be the same.

All my more recent testing (and that of Mark, I have to assume) was thus
done with (2).

Now, looking at the script again I am thinking that it would be
reasonable to expect that after issuing

  delete from pg_subscription;

the other 2 tables are /also/ cleaned, automatically, as a consequence.
(Is this reasonable? this is really the main question of this email).

So I removed the latter two delete statements again, and ran the tests
again with the form in (1)

I have established that (after a number of successful cycles) the test
stops succeeding with in the replica log repetitions of:

  2017-06-07 22:10:29.057 CEST [2421] LOG: logical replication apply worker for subscription "sub1" has started
  2017-06-07 22:10:29.057 CEST [2421] ERROR: could not find free replication state slot for replication origin with OID 11
  2017-06-07 22:10:29.057 CEST [2421] HINT: Increase max_replication_slots and try again.
  2017-06-07 22:10:29.058 CEST [2061] LOG: worker process: logical replication worker for subscription 29235 (PID 2421) exited with exit code 1

when I manually 'clean up' by doing:

  delete from pg_replication_origin;

then, and only then, does the session finish and succeed ('replica ok').

So to me it looks as if there is an omission of
pg_replication_origin-cleanup when pg_subscription is deleted.

Does that make sense? All this is probably vague and I am only posting
in the hope that Petr (or someone else) perhaps immediately understands
what goes wrong, with even this limited amount of info.

In the meantime I will try to dig up more detailed info...

thanks,

Erik Rijkers
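
As a diagnostic aid for the situation described above, one could list the replication origins that no longer have a matching subscription. The following is a minimal sketch under stated assumptions: the helper name orphan_origins_sql is hypothetical, and the query assumes that origins created for subscriptions are named pg_ followed by the subscription's OID. Origins created by other means (for example with pg_replication_origin_create) would show up in the result as well.

```shell
# Hypothetical helper: print a query that lists replication origins with
# no matching row in pg_subscription.
# Assumption: subscription origins are named 'pg_' || <subscription oid>.
orphan_origins_sql() {
  cat <<'SQL'
SELECT o.roident, o.roname
FROM pg_replication_origin o
LEFT JOIN pg_subscription s ON o.roname = 'pg_' || s.oid::text
WHERE s.oid IS NULL;
SQL
}
```

It could be run against the replica in the same way as the cleanup statements in the script, e.g. `orphan_origins_sql | psql -qXp "$port2"`.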