On Tue, Feb 21, 2017 at 7:11 AM, nuko yokohama <nuko.yokoh...@gmail.com> wrote: > Hi. > While verifying the logical replication of PostgreSQL 10-devel, I found the > following problem. > > * When you run the SUBSCRIPTION command against a table in the same > PostgreSQL server, hang up. > * Canceling the hung SUBSCRIPTION command with CTRL + C causes a server > process occurs Segfault, and the PostgreSQL server to restart. > > -------- > [Steps to Reproduce] > > $ createdb db1 -U postgres > $ createdb db2 -U postgres > $ psql -U postgres db1 -c "CREATE TABLE test (id int primary key, data > text)" > $ psql -U postgres db2 -c "CREATE TABLE test (id int primary key, data > text)" > $ psql -U postgres db1 -c "CREATE PUBLICATION pub_db1_test FOR TABLE test" > $ psql -U postgres db2 -c "CREATE SUBSCRIPTION sub_db2_test CONNECTION > 'dbname=db1 port=5432 user=postgres' PUBLICATION pub_db1_test" > > The SUBSCRIPTION command does not end, it hangs up. > At this time, the following logs are output in the server log. > > 2017-02-21 06:58:05.082 JST [22060] LOG: logical decoding found initial > starting point at 0/1C06660 > 2017-02-21 06:58:05.082 JST [22060] DETAIL: 1 transaction needs to finish. > > Suspending psql (running SUBSCRIPTION) with CTRL + C causes a Segfault in > the server process. > At this time, the following message is output to the server log. > > 2017-02-21 07:01:00.246 JST [22059] ERROR: canceling statement due to user > request > 2017-02-21 07:01:00.246 JST [22059] STATEMENT: CREATE SUBSCRIPTION > sub_db2_test CONNECTION 'dbname=db1 port=5432 user=postgres' PUBLICATION > pub_db1_test > 2017-02-21 07:01:01.006 JST [21445] LOG: server process (PID 22060) was > terminated by signal 11: Segmentation fault > 2017-02-21 07:01:01.007 JST [21445] LOG: terminating any other active > server processes > > -------- > [Environment] > > postgres=# SELECT version(); > version > ------------------------------------------------------------------------------------------------------------ > PostgreSQL 10devel on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.2 > 20140120 (Red Hat 4.8.2-16), 64-bit > (1 row) > > postgres=# SHOW wal_level; > wal_level > ----------- > logical > (1 row) > > postgres=# SHOW max_wal_senders;; > max_wal_senders > ----------------- > 10 > (1 row) > > postgres=# SHOW max_replication_slots; > max_replication_slots > ----------------------- > 10 > (1 row) > > postgres=# TABLE pg_hba_file_rules ; > line_number | type | database | user_name | address | > netmask | auth_method | options | error > -------------+-------+---------------+------------+-----------+-----------------------------------------+-------------+---------+------- > 2 | local | {all} | {all} | | > | trust | | > 4 | host | {all} | {all} | 127.0.0.1 | > 255.255.255.255 | trust | | > 6 | host | {all} | {all} | ::1 | > ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff | trust | | > 9 | local | {replication} | {postgres} | | > | trust | | > (4 rows) > >
As log message says, the reason why CREATE SUBSCRIPTION hangs up is that it's waiting for another transaction finish. Regarding SEGV, it occurs by wal sender process creating replication slot. When SEV occurs the backtrace is as follows. (gdb) bt #0 0x00000000006e3d47 in resetStringInfo (str=0xea1c60) at stringinfo.c:96 #1 0x00000000007e4738 in ProcessRepliesIfAny () at walsender.c:1347 #2 0x00000000007e42c5 in WalSndWaitForWal (loc=23400264) at walsender.c:1142 #3 0x00000000007e38f6 in logical_read_xlog_page (state=0x2732008, targetPagePtr=23396352, reqLen=3912, targetRecPtr=23400240, cur_page=0x2733d58 "\225\320\005", pageTLI=0x27328b4) at walsender.c:717 #4 0x00000000005458c8 in ReadPageInternal (state=0x2732008, pageptr=23396352, reqLen=3912) at xlogreader.c:556 #5 0x0000000000545068 in XLogReadRecord (state=0x2732008, RecPtr=23400240, errormsg=0x7fff7cf34cf8) at xlogreader.c:255 #6 0x00000000007d09db in DecodingContextFindStartpoint (ctx=0x2731d48) at logical.c:432 #7 0x00000000007e3a87 in CreateReplicationSlot (cmd=0x26edfa0) at walsender.c:796 #8 0x00000000007e45ae in exec_replication_command (cmd_string=0x268b5f8 "CREATE_REPLICATION_SLOT \"sub_db2_test\" LOGICAL pgoutput") at walsender.c:1272 #9 0x0000000000846783 in PostgresMain (argc=1, argv=0x2697f38, dbname=0x2697e68 "db1", username=0x2697e40 "masahiko") at postgres.c:4064 #10 0x00000000007b34e0 in BackendRun (port=0x268f4b0) at postmaster.c:4310 #11 0x00000000007b2c60 in BackendStartup (port=0x268f4b0) at postmaster.c:3982 #12 0x00000000007af2e0 in ServerLoop () at postmaster.c:1722 #13 0x00000000007ae9a4 in PostmasterMain (argc=5, argv=0x266c020) at postmaster.c:1330 #14 0x00000000006f487a in main (argc=5, argv=0x266c020) at main.c:228 I guess that the cause of SEGV is that reply_message is reset without initialize by initStringInfo. The reply_message and other buffers are initialized at beggning of WalSndLoop but resetStringInfo can be called by WalSndWaitForWal->ProcessRepliesIfAny without calling WalSndLoop when creating replication slot. To fix it, we can initialize reply_message in exec_replication_command like follows. diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c index ba506e2..01b54f6 100644 --- a/src/backend/replication/walsender.c +++ b/src/backend/replication/walsender.c @@ -1258,6 +1258,7 @@ exec_replication_command(const char *cmd_string) cmd_node = replication_parse_result; + initStringInfo(&reply_message); switch (cmd_node->type) { case T_IdentifySystemCmd: And maybe we should put assertion in resetStringInfo. -- Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers