Re: [HACKERS] CREATE REPLICATION SLOT fails on a timeout
On 05/28/2014 06:41 PM, Andres Freund wrote: Hi, Pushed a fix for it. I am pretty sure it will, but could you still test that it fixes your problem? Thanks! The fix seems to work (I am no longer getting the timeout on slot creation) Thanks Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CREATE REPLICATION SLOT fails on a timeout
Hi, On 2014-05-17 01:34:25 +0200, Andres Freund wrote: On 2014-05-16 17:02:33 -0400, Steve Singer wrote: I don't think that's going to cut it though. The creation can take longer than whatever wal_sender_timeout is set to (when there's lots of longrunning transactions). I think checking whether last_reply_timestamp = 0 during timeout checking is more robust. That makes sense. A patch that does that is attached. I've attached a heavily revised version of that patch. Didn't touch all the necessary places... Survives creation of logical replication slots under 'intensive pressure', with a wal_sender_timeout=10ms. Thanks for the bugreport! Pushed a fix for it. I am pretty sure it will, but could you still test that it fixes your problem? Thanks! Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] CREATE REPLICATION SLOT fails on a timeout
I am finding that my logical walsender connections are being terminated due to a timeout on the CREATE REPLICATION SLOT command. with terminating walsender process due to replication timeout Below is the stack trace when this happens #3 0x0067df28 in WalSndCheckTimeOut (now=now@entry=453585463823871) at walsender.c:1748 #4 0x0067eedc in WalSndWaitForWal (loc=691727888) at walsender.c:1216 #5 logical_read_xlog_page (state=optimized out, targetPagePtr=691724288, reqLen=optimized out, targetRecPtr=optimized out, cur_page=0x18476e0 }\320\005, pageTLI=optimized out) at walsender.c:741 #6 0x004f41bf in ReadPageInternal (state=state@entry=0x17aa140, pageptr=pageptr@entry=691724288, reqLen=reqLen@entry=3600) at xlogreader.c:525 #7 0x004f454a in XLogReadRecord (state=0x17aa140, RecPtr=691727856, RecPtr@entry=0, errormsg=errormsg@entry=0x7fff7f668c28) at xlogreader.c:228 #8 0x00675e5c in DecodingContextFindStartpoint (ctx=ctx@entry=0x17a0358) at logical.c:460 #9 0x00680f16 in CreateReplicationSlot (cmd=0x1798b50) at walsender.c:800 #10 exec_replication_command ( cmd_string=cmd_string@entry=0x17f1478 CREATE_REPLICATION_SLOT \slon_4_1\ LOGICAL \slony1_funcs.2.2.0\) at walsender.c:1291 #11 0x006bf4a1 in PostgresMain (argc=optimized out, argv=argv@entry=0x177db50, dbname=0x177db30 test1, (gdb) p last_reply_timestamp $1 = 0 I propose the attached patch sets last_reply_timestamp to now() it starts processing a command. Since receiving a command is hearing something from the client. Steve diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c new file mode 100644 index 5c11d68..56a2f10 *** a/src/backend/replication/walsender.c --- b/src/backend/replication/walsender.c *** exec_replication_command(const char *cmd *** 1276,1281 --- 1276,1282 parse_rc; cmd_node = replication_parse_result; + last_reply_timestamp = GetCurrentTimestamp(); switch (cmd_node-type) { -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CREATE REPLICATION SLOT fails on a timeout
Hi, On 2014-05-16 16:37:16 -0400, Steve Singer wrote: I am finding that my logical walsender connections are being terminated due to a timeout on the CREATE REPLICATION SLOT command. with terminating walsender process due to replication timeout Below is the stack trace when this happens #3 0x0067df28 in WalSndCheckTimeOut (now=now@entry=453585463823871) at walsender.c:1748 #4 0x0067eedc in WalSndWaitForWal (loc=691727888) at walsender.c:1216 ... #9 0x00680f16 in CreateReplicationSlot (cmd=0x1798b50) at walsender.c:800 #10 exec_replication_command () at walsender.c:1291 #11 0x006bf4a1 in PostgresMain (argc=optimized out, argv=argv@entry=0x177db50, dbname=0x177db30 test1, (gdb) p last_reply_timestamp $1 = 0 I propose the attached patch sets last_reply_timestamp to now() it starts processing a command. Since receiving a command is hearing something from the client. Hm. Yes, that's a problem. diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c new file mode 100644 index 5c11d68..56a2f10 *** a/src/backend/replication/walsender.c --- b/src/backend/replication/walsender.c *** exec_replication_command(const char *cmd *** 1276,1281 --- 1276,1282 parse_rc; cmd_node = replication_parse_result; + last_reply_timestamp = GetCurrentTimestamp(); switch (cmd_node-type) { I don't think that's going to cut it though. The creation can take longer than whatever wal_sender_timeout is set to (when there's lots of longrunning transactions). I think checking whether last_reply_timestamp = 0 during timeout checking is more robust. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CREATE REPLICATION SLOT fails on a timeout
On 05/16/2014 04:43 PM, Andres Freund wrote: Hi, I don't think that's going to cut it though. The creation can take longer than whatever wal_sender_timeout is set to (when there's lots of longrunning transactions). I think checking whether last_reply_timestamp = 0 during timeout checking is more robust. Greetings, Andres Freund That makes sense. A patch that does that is attached. Steve diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c new file mode 100644 index 5c11d68..d23f06b *** a/src/backend/replication/walsender.c --- b/src/backend/replication/walsender.c *** WalSndCheckTimeOut(TimestampTz now) *** 1738,1744 timeout = TimestampTzPlusMilliseconds(last_reply_timestamp, wal_sender_timeout); ! if (wal_sender_timeout 0 now = timeout) { /* * Since typically expiration of replication timeout means --- 1738,1744 timeout = TimestampTzPlusMilliseconds(last_reply_timestamp, wal_sender_timeout); ! if (last_reply_timestamp 0 wal_sender_timeout 0 now = timeout) { /* * Since typically expiration of replication timeout means -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CREATE REPLICATION SLOT fails on a timeout
On 2014-05-16 17:02:33 -0400, Steve Singer wrote: I don't think that's going to cut it though. The creation can take longer than whatever wal_sender_timeout is set to (when there's lots of longrunning transactions). I think checking whether last_reply_timestamp = 0 during timeout checking is more robust. That makes sense. A patch that does that is attached. I've attached a heavily revised version of that patch. Didn't touch all the necessary places... Survives creation of logical replication slots under 'intensive pressure', with a wal_sender_timeout=10ms. Thanks for the bugreport! Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services From 324d74e16dd8ee2a0fa977d92a269fd43a746196 Mon Sep 17 00:00:00 2001 From: Andres Freund and...@anarazel.de Date: Sat, 17 May 2014 01:22:01 +0200 Subject: [PATCH] Don't pay heed to wal_sender_timeout while creating a decoding slot. Sometimes CREATE_REPLICATION_SLOT ... LOGICAL ... needs to wait for futher WAL using WalSndWaitForWal(). That used to always respect wal_sender_timeout and kill the session when waiting long enough because no feedback/ping messages can be sent while the slot is still being created. Add the notion that last_reply_timestamp = 0 means that the walsender currently doesn't need timeout processing and add checks for it in a couple of places. Bugreport and initial patch by Steve Singer, revised by Andres Freund. --- src/backend/replication/walsender.c | 25 + 1 file changed, 21 insertions(+), 4 deletions(-) diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c index 5c11d68..90394ce 100644 --- a/src/backend/replication/walsender.c +++ b/src/backend/replication/walsender.c @@ -148,9 +148,10 @@ static StringInfoData reply_message; static StringInfoData tmpbuf; /* - * Timestamp of the last receipt of the reply from the standby. + * Timestamp of the last receipt of the reply from the standby. Set to 0 if a + * the process currently shouldn't be killed by wal_sender_timeout. */ -static TimestampTz last_reply_timestamp; +static TimestampTz last_reply_timestamp = 0; /* Have we sent a heartbeat message asking for reply, since last reply? */ static bool waiting_for_ping_response = false; @@ -796,6 +797,14 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd) logical_read_xlog_page, WalSndPrepareWrite, WalSndWriteData); + /* + * Signal that we don't need the timeout mechanism. We're just + * creating the replication slot and don't yet accept feedback + * messages or send keepalives. As we possibly need to wait for + * further WAL the walsender would possibly be killed too soon. + */ + last_reply_timestamp = 0; + /* build initial snapshot, might take a while */ DecodingContextFindStartpoint(ctx); @@ -1693,7 +1702,7 @@ WalSndComputeSleeptime(TimestampTz now) { long sleeptime = 1; /* 10 s */ - if (wal_sender_timeout 0) + if (wal_sender_timeout 0 last_reply_timestamp 0) { TimestampTz wakeup_time; long sec_to_timeout; @@ -1735,6 +1744,10 @@ WalSndCheckTimeOut(TimestampTz now) { TimestampTz timeout; + /* don't bail out if we're doing something that doesn't require timeouts */ + if (last_reply_timestamp = 0) + return; + timeout = TimestampTzPlusMilliseconds(last_reply_timestamp, wal_sender_timeout); @@ -2879,7 +2892,11 @@ WalSndKeepaliveIfNecessary(TimestampTz now) { TimestampTz ping_time; - if (wal_sender_timeout = 0) + /* + * Don't send keepalive messages if timeouts are globally disabled or + * we're doing something not partaking in timeouts. + */ + if (wal_sender_timeout = 0 || last_reply_timestamp = 0) return; if (waiting_for_ping_response) -- 2.0.0.rc2.4.g1dc51c6.dirty -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers