Hello, I received an report that pg_basebackup with "-X stream"
always exits with the following error.

> pg_basebackup: could not receive data from WAL stream: server closed the 
> connection unexpectedly
>         This probably means the server terminated abnormally
>         before or while processing the request.

The walsender had been terminated by replication timeout a bit
before the above message.

> LOG:  terminating walsender process due to replication timeout

====

I digged into this and found that a accumulation of delay in
receiving xlog stream can cause such an error. This hardly occurs
on an ordinary environment but this would be caused by temporary
(or permanent) heavy load on the receiver-side machine. A virtual
machine environment could raises the chance, I suppose.

In HandleCopyStream(), the feedbacks are sent only after breaks
of the xlog stream, so continuous flow of xlog stream prevents it
from being sent with expected intervals. walsender sends
keepalive message for such a case, but it arrives with a long
delay being caught by the congestion of the stream, but this is
not a problem because the keepalive is intended to be sent while
idle.

I think that the status feedback should be sent whenever
standby_message_timeout has elapsed just after (or before) an
incoming message is processed. The seemingly most straightforward
way to fix this is breaking the innner-loop (while(r != 0)) if
the time to feedback comes as the attached patch #1
(0001-Make-sure-to-send-...).

What do you thing about this?

regards,


=====
- How to reproduce the situation.

As mentioned before, this hardly occurs on ordinary
environment. But simulating the heavy load by inserting a delay
in HandleCopyStream() effectively let the error occur.

With the attached patch #2(insert_interval.diff), the following
operation let us see the situation. The walsender reports timeout
although the stream processing is delayed but running steadily.

postgresql.conf:
> wal_level = hot_standby
> max_wal_senders = 2
> wal_sender_timeout = 20s  # * 2 of default of standby_message_timeout

Terminal1$ psql postgres
postgres=# CREATE TABLE t1 (a text);
postgres=# ALTER TABLE t1 ALTER COLUMN a SET STORAGE EXTERNAL;
postgres=# INSERT INTO t1 (SELECT repeat('x', 10000) FROM generate_series(0, 
99999)); -- about 1GB
postgres=# ^D

Terminal1$ pgbench -i postgres
Terminal1$ pgbench -T 600 -h localhost postgres

Terminal2$ pg_basebackupo -r 32k -X stream -D data -h localhost

... in about a couple of minutes on my environment..

Terminal1:
> LOG:  terminating walsender process due to replication timeout

... after another couple of minutes.

Terminal2:
> pg_basebackup: could not receive data from WAL stream: server closed...

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 0df5eb2229d0e417e8aa52d83ea6b30681bdeef9 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyot...@lab.ntt.co.jp>
Date: Mon, 2 Feb 2015 12:49:45 +0900
Subject: [PATCH] Make sure to send feedback at desired timing.

Continuous stream due to heavy-load on client side can prevent
feedbacks to be sent with expected interval, and it results in a
replication timeout on walsender. Exiting from the fast-path loop when
the time comes to make sure the feedback to be sent with expected
intervals.
---
 src/bin/pg_basebackup/receivelog.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c
index 8caedff..fb6738d 100644
--- a/src/bin/pg_basebackup/receivelog.c
+++ b/src/bin/pg_basebackup/receivelog.c
@@ -925,9 +925,15 @@ HandleCopyStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline,
 			}
 
 			/*
-			 * Process the received data, and any subsequent data we
-			 * can read without blocking.
+			 * Process the received data, and any subsequent data we can read
+			 * without blocking except when the time to feedback comes.
 			 */
+			now = feGetCurrentTimestamp();
+			if (standby_message_timeout > 0 &&
+				feTimestampDifferenceExceeds(last_status, now,
+											 standby_message_timeout))
+				break;
+
 			r = CopyStreamReceive(conn, 0, &copybuf);
 		}
 	}
-- 
2.1.0.GIT

diff --git a/src/bin/pg_basebackup/receivelog.c b/src/bin/pg_basebackup/receivelog.c
index 8caedff..69be40a 100644
--- a/src/bin/pg_basebackup/receivelog.c
+++ b/src/bin/pg_basebackup/receivelog.c
@@ -928,6 +928,7 @@ HandleCopyStream(PGconn *conn, XLogRecPtr startpos, uint32 timeline,
 			 * Process the received data, and any subsequent data we
 			 * can read without blocking.
 			 */
+			usleep(100 * 1000);
 			r = CopyStreamReceive(conn, 0, &copybuf);
 		}
 	}
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to