Hi all

I've been investigating some intermittent timeout issues I've seen
with BDR users in the wild and I think I've identified an issue with
how logical decoding sends data and handles feedback.

TL;DR: logical decoding clients need to generate their own keepalives
rather than rely on the server requesting them, or timeouts can result.
Alternatively, admins should raise wal_sender_timeout by a LOT when
using logical decoding on DBs with any big rows.


An output plugin accumulates data in a StringInfo, which is then sent
by WalSndWriteData(). That function does nonblocking sends, checks for
interrupts, processes client replies, and requests keepalives when
we're approaching the walsender timeout.
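
For reference, the usual pattern in a plugin's change callback is
something like this (just a sketch; format_tuple_somehow() stands in
for whatever the plugin does to serialise the row):

#include "postgres.h"
#include "replication/logical.h"
#include "replication/output_plugin.h"

/*
 * Sketch of an output plugin change callback. All of the row data for
 * this change is appended to ctx->out (a StringInfo) and then handed to
 * the walsender in one go by OutputPluginWrite(), which lands in
 * WalSndWriteData(). format_tuple_somehow() is a hypothetical helper.
 */
static void
pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
                 Relation relation, ReorderBufferChange *change)
{
    OutputPluginPrepareWrite(ctx, true);
    /* append the (possibly very large) row image to ctx->out */
    format_tuple_somehow(ctx->out, relation, change);   /* hypothetical */
    OutputPluginWrite(ctx, true);   /* sent as a single CopyData message */
}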

A problem arises when the data buffered for the client is big enough
that it takes more than wal_sender_timeout to transmit the whole chunk
and get a reply back on the link. A slow link, a big row, or some
badly timed packet loss can easily result in a timeout.

Unlike in physical replication, a timeout on a logical decoding
session can be expensive:

* We have to restart logical decoding at restart_lsn, which can be a
LONG time ago if there are any long running tx's, and decode up to the
last confirmed xact;

* We then have to resend the whole xact we were working on when we
timed out, which might be quite big.

So we should do our best to avoid timeouts in logical decoding
sessions. Physical replication chunks data into small pieces and can
restart cheaply, but logical decoding will send a 100MB+ row as a
single message and have to re-read hundreds of GB of WAL and resend
gigabytes of data on the wire to get back where it started after a
timeout.

When sending a big message, WalSndWriteData() notices that it's
approaching the timeout and tries to send a keepalive request, but the
request just gets queued behind the remaining output plugin data and
isn't seen by the client until the client has received the rest of the
pending data. The keepalive requests are useful so long as no single
message takes more than wal_sender_timeout to send, since the client's
reply will defer the timeout and give us enough time to process the
next message. But they won't prevent a timeout within a single big
message.

How to fix this?

We can't inject keepalive requests in the middle of a CopyData message.

After the cancel-message discussion I know better than to suggest
using out-of-band TCP data and SIGURG.

So: We could ask output plugins to deal with this for us, by chunking
up their data in small pieces and calling OutputPluginPrepareWrite()
and OutputPluginWrite() more than once per output plugin callback if
they expect to send a big message. But this pushes the complexity of
splitting up and handling big rows, and big Datums, onto each plugin.
It's awkward to do well and hard to avoid splitting things up
unnecessarily.
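
Roughly, each plugin would have to grow something like this in its
change callback (sketch only; format_tuple_somehow() and the 1MB cap
are illustrative, not an existing API):

#define CHUNK_LIMIT (1024 * 1024)   /* arbitrary 1MB cap per message */

static void
pg_decode_change_chunked(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
                         Relation relation, ReorderBufferChange *change)
{
    StringInfoData buf;
    int         off = 0;

    /* serialise the whole change into a local buffer first */
    initStringInfo(&buf);
    format_tuple_somehow(&buf, relation, change);   /* hypothetical helper */

    /* then emit it as several smaller protocol messages */
    while (off < buf.len)
    {
        int     n = Min(buf.len - off, CHUNK_LIMIT);
        bool    last = (off + n >= buf.len);

        OutputPluginPrepareWrite(ctx, last);
        appendBinaryStringInfo(ctx->out, buf.data + off, n);
        OutputPluginWrite(ctx, last);
        off += n;
    }
    pfree(buf.data);
}

The downstream then has to know how to stitch the pieces back
together, which is exactly the complexity this pushes onto every
plugin and client.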

We could extend the logical decoding protocol slightly, so that the
CopyData messages that carry output plugin payloads are limited to
some fixed, reasonable size (1MB? 10MB?) where the CopyData message
overhead is insignificant and most rows won't need to be reassembled
from multiple CopyData messages by the downstream. This gives us gaps
in which we can inject keepalives / status updates during big messages
so we get the client replies and process them. This won't fix existing
releases though, and it makes clients reassemble the chunks. It also
means clients can't know the full size of a message from the CopyData
length anymore, so clients that rely on the CopyData message length
will be forced to add their own length fields in the payload;
CopyData doesn't have anything we can use as a "continuation message
follows" flag.

We could permit the walsender to reset the keepalive timer whenever it
confirms that it's sent a reasonable chunk of data such that it's
unlikely to just be lurking in the kernel's socket send buffer or
various routers in transit. We don't care that the client received up
to any particular point, only that it's alive and making progress. We
know the client must be receiving and acking it since we're using TCP.
The problem is knowing how much sent data is "enough" to mean the
client must've received some of it. The socket send buffer is easy to
check and bounded, but TCP window scaling means there can be a huge
amount of un-acked data in transit. AFAIK we have no visibility into
the network stack's view of what's been acked, at least not portably.
Linux has TCP_INFO but it's hard to use and nonportable. So we can't
assume that successfully sending 1GB of data means the client has
received the data from 1GB ago, unless we clamp the congestion window
or do other possibly-performance-affecting things.
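
For illustration only, the Linux-specific poke at the TCP stack looks
something like this; I'm not proposing we use it, just showing why
it's unattractive (available fields vary by kernel version):

#include <stdio.h>
#include <string.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/*
 * Linux-only sketch: ask the kernel how much of what we've sent is
 * still un-acked by the peer. Nonportable, and the tcp_info fields
 * differ between kernel versions, which is exactly why relying on this
 * in the walsender is unappealing.
 */
static void
report_unacked(int sockfd)
{
    struct tcp_info ti;
    socklen_t       len = sizeof(ti);

    memset(&ti, 0, sizeof(ti));
    if (getsockopt(sockfd, IPPROTO_TCP, TCP_INFO, &ti, &len) == 0)
        fprintf(stderr, "unacked segments: %u, rtt: %u us\n",
                ti.tcpi_unacked, ti.tcpi_rtt);
    else
        perror("getsockopt(TCP_INFO)");
}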

Or we could expect logical decoding downstream clients to send their
own proactive feedback when receiving and processing large messages.
The walsender is already capable of noticing and processing such
replies, just not requesting them when it should. Again this pushes
complexity onto each client, but less so than requiring output plugins
to chunk up their messages. Also, since each client-initiated feedback
message resets the walsender's waiting_for_ping_response flag, it'll
queue multiple keepalives after big messages and we'll process them in
a burst. Annoying but minor.
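
For a client using libpq, unsolicited feedback is just a Standby
Status Update ('r') message pushed down the COPY BOTH channel, roughly
like this sketch (the LSN tracking is up to the client):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <time.h>
#include <libpq-fe.h>

/* PostgreSQL epoch (2000-01-01) relative to the Unix epoch, in seconds */
#define PG_EPOCH_OFFSET_SECS 946684800LL

static void
send_u64_be(char *p, uint64_t v)
{
    for (int i = 0; i < 8; i++)
        p[i] = (char) (v >> (56 - 8 * i));
}

/*
 * Send a Standby Status Update ('r') message on an open logical
 * replication connection. written/flushed/applied are LSNs the caller
 * tracks; reply_requested asks the server to answer with a keepalive.
 */
static bool
send_feedback(PGconn *conn, uint64_t written, uint64_t flushed,
              uint64_t applied, bool reply_requested)
{
    char        buf[1 + 8 + 8 + 8 + 8 + 1];
    uint64_t    now_us;

    /* client clock as microseconds since 2000-01-01, per the protocol */
    now_us = ((uint64_t) time(NULL) - PG_EPOCH_OFFSET_SECS) * 1000000;

    buf[0] = 'r';
    send_u64_be(&buf[1], written);
    send_u64_be(&buf[9], flushed);
    send_u64_be(&buf[17], applied);
    send_u64_be(&buf[25], now_us);
    buf[33] = reply_requested ? 1 : 0;

    if (PQputCopyData(conn, buf, sizeof(buf)) <= 0 || PQflush(conn) != 0)
        return false;
    return true;
}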

I think the least bad option is to have clients generate their own
feedback messages, unprompted by the upstream, when enough time has
elapsed. Unfortunately wal_sender_timeout isn't sent in the initial
parameters by the server, and the walsender protocol doesn't seem to
have any way to ask the server for parameter values, so you can't
check the upstream's wal_sender_timeout and have to just pick a
sensible default.
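
Building on the send_feedback() sketch above, the client-side loop
just has to check the clock while it receives; the 10s interval below
is a guess that stays well under the 60s default wal_sender_timeout:

#include <sys/select.h>
#include <sys/time.h>

#define FEEDBACK_INTERVAL_SECS 10   /* guess; can't ask the server */

static void
receive_loop(PGconn *conn, uint64_t *recv_lsn)
{
    time_t  last_feedback = time(NULL);
    int     sock = PQsocket(conn);

    for (;;)
    {
        char   *buf = NULL;
        int     len;

        /* async mode: returns 0 when no complete message is ready yet */
        len = PQgetCopyData(conn, &buf, 1);
        if (len < 0)
            break;                      /* end of stream or error */

        if (len > 0)
        {
            /* ... decode the message, update *recv_lsn ... */
            PQfreemem(buf);
        }
        else
        {
            /* wait up to 1s for more data before checking the clock */
            fd_set          rfds;
            struct timeval  tv = {1, 0};

            FD_ZERO(&rfds);
            FD_SET(sock, &rfds);
            if (select(sock + 1, &rfds, NULL, NULL, &tv) > 0)
                PQconsumeInput(conn);
        }

        if (time(NULL) - last_feedback >= FEEDBACK_INTERVAL_SECS)
        {
            send_feedback(conn, *recv_lsn, *recv_lsn, *recv_lsn, false);
            last_feedback = time(NULL);
        }
    }
}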

(A separate issue is that we can also time out when doing logical
_replication_ if the downstream side blocks on a lock, since it's not
safe to send on a socket from a signal handler and you can't interrupt
the lock wait to do some work then resume the lock wait. But that's
for later...)

Search kw:

* Logical decoding
* terminating walsender process due to replication timeout
* wal_sender_timeout

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services