On 11.01.2018 22:41, Peter Eisentraut wrote:
On 12/22/17 23:57, Tomas Vondra wrote:
PART 1: adding logical_work_mem memory limit (0001)
---------------------------------------------------

Currently, limiting the amount of memory consumed by logical decoding is
tricky (or you might say impossible) for several reasons:

I would like to see some more discussion on this, but I think not a lot
of people understand the details, so I'll try to write up an explanation
here.  This code is also somewhat new to me, so please correct me if
there are inaccuracies, while keeping in mind that I'm trying to simplify.

The data in the WAL is written as it happens, so the changes belonging
to different transactions are all mixed together.  One of the jobs of
logical decoding is to reassemble the changes belonging to each
transaction.  The top-level data structure for that is the infamous
ReorderBuffer.  So as it reads the WAL and sees something about a
transaction, it keeps a copy of that change in memory, indexed by
transaction ID (ReorderBufferChange).  When the transaction commits, the
accumulated changes are passed to the output plugin and then freed.  If
the transaction aborts, then changes are just thrown away.

So when logical decoding is active, a copy of the changes for each
active transaction is kept in memory (once per walsender).

More precisely, the above happens for each subtransaction.  When the
top-level transaction commits, it finds all its subtransactions in the
ReorderBuffer, reassembles everything in the right order, then invokes
the output plugin.
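
To make that a bit more concrete, here is a heavily simplified sketch of the data structures involved; the real definitions in src/include/replication/reorderbuffer.h carry many more fields, so treat this only as an illustration:

/*
 * Heavily simplified sketch; the real structs in reorderbuffer.h have many
 * more fields.
 */
typedef struct ReorderBufferChange
{
    XLogRecPtr  lsn;        /* location of the change in the WAL */
    int         action;     /* insert/update/delete/... */
    /* ... payload: old/new tuple versions, etc. ... */
    dlist_node  node;       /* link in the owning transaction's change list */
} ReorderBufferChange;

typedef struct ReorderBufferTXN
{
    TransactionId   xid;            /* (sub)transaction id */
    dlist_head      changes;        /* changes seen so far, in WAL order */
    uint64          nentries;       /* total number of changes */
    uint64          nentries_mem;   /* changes currently held in memory */
    dlist_head      subtxns;        /* known subtransactions, if top-level */
} ReorderBufferTXN;

typedef struct ReorderBuffer
{
    HTAB           *by_txn;          /* xid -> ReorderBufferTXN lookup */
    dlist_head      toplevel_by_lsn; /* top-level txns in start-LSN order */
} ReorderBuffer;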

All this could end up using an unbounded amount of memory, so there is a
mechanism to spill changes to disk.  The way this currently works is
hardcoded, and this patch proposes to change that.

Currently, when a transaction or subtransaction has accumulated 4096
changes, it is spilled to disk.  When the top-level transaction commits,
things are read back from disk to do the final processing mentioned above.
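
In the current code the check is roughly this (simplified from reorderbuffer.c, where max_changes_in_memory is the hardcoded 4096):

static const Size max_changes_in_memory = 4096;

static void
ReorderBufferCheckSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
    /*
     * Once a single (sub)transaction has accumulated too many changes in
     * memory, write them out to its spill file(s) on disk and free them.
     */
    if (txn->nentries_mem >= max_changes_in_memory)
    {
        ReorderBufferSerializeTXN(rb, txn);
        Assert(txn->nentries_mem == 0);
    }
}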

This all works mostly fine, but you can construct some more extreme
cases where this can blow up.

Here is a mundane example.  Let's say a change entry takes 100 bytes (it
might contain a new row, or an update key and some new column values,
for example).  If you have 100 concurrent active sessions and no
subtransactions, then logical decoding memory is bounded by 4096 * 100 *
100 = 40 MB (per walsender) before things spill to disk.

Now let's say you are using a lot of subtransactions, because you are
using PL functions, exception handling, triggers, doing batch updates.
If you have 200 subtransactions on average per concurrent session, the
memory usage bound in that case would be 4096 * 100 * 100 * 200 = 8 GB
(per walsender).  And so on.  If you have more concurrent sessions or
larger changes or more subtransactions, you'll use much more than those
8 GB.  And if you don't have those 8 GB, then you're stuck at this point.

That is the consideration when we record changes, but we also need
memory when we do the final processing at commit time.  That is slightly
less problematic because we only process one top-level transaction at a
time, so the formula is only 4096 * avg_size_of_changes * nr_subxacts
(without the concurrent sessions factor).

So, this patch proposes to improve this as follows:

- We compute the actual size of each ReorderBufferChange and keep a
running tally for each transaction, instead of just counting the number
of changes.

- We have a configuration setting that allows us to change the limit
instead of the hardcoded 4096.  The configuration setting is also in
terms of memory, not in number of changes.

- The configuration setting is for the total memory usage per decoding
session, not per subtransaction.  (So we also keep a running tally for
the entire ReorderBuffer.)
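
A rough sketch of what that accounting could look like, assuming a new "size" field in both ReorderBufferTXN and ReorderBuffer (the names follow the spirit of the patch but are not necessarily the final ones; logical_work_mem is the proposed GUC, in kilobytes):

/* Update the running tallies whenever a change is added or removed. */
static void
ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb, ReorderBufferTXN *txn,
                                ReorderBufferChange *change, bool addition)
{
    Size        sz = ReorderBufferChangeSize(change);   /* actual size of this change */

    if (addition)
    {
        txn->size += sz;        /* per-(sub)transaction tally */
        rb->size += sz;         /* tally for the whole decoding session */
    }
    else
    {
        Assert(txn->size >= sz && rb->size >= sz);
        txn->size -= sz;
        rb->size -= sz;
    }
}

/* Called after queuing each change. */
static void
ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
{
    if (rb->size < logical_work_mem * 1024L)
        return;                 /* still under the limit, nothing to do */

    /* Over the limit: pick a transaction and spill it to disk (see below). */
    ReorderBufferSerializeTXN(rb, ReorderBufferLargestTXN(rb));
}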

There are two open issues with this patch:

One, this mechanism only applies when recording changes.  The processing
at commit time still uses the previous hardcoded mechanism.  The reason
for this is, AFAIU, that as things currently work, you have to have all
subtransactions in memory to do the final processing.  There are some
proposals to change this as well, but they are more involved.  Arguably,
per my explanation above, memory use at commit time is less likely to be
a problem.

Two, what to do when the memory limit is reached.  With the old
accounting, this was easy, because we'd decide for each subtransaction
independently whether to spill it to disk, when it has reached its 4096
limit.  Now, we are looking at a global limit, so we have to find a
transaction to spill in some other way.  The proposed patch searches
through the entire list of transactions to find the largest one.  But as
the patch says:

"XXX With many subtransactions this might be quite slow, because we'll
have to walk through all of them. There are some options how we could
improve that: (a) maintain some secondary structure with transactions
sorted by amount of changes, (b) not looking for the entirely largest
transaction, but e.g. for transaction using at least some fraction of
the memory limit, and (c) evicting multiple transactions at once, e.g.
to free a given portion of the memory limit (e.g. 50%)."

(a) would create more overhead for the case where everything fits into
memory, so it seems unattractive.  Some combination of (b) and (c) seems
useful, but we'd have to come up with something concrete.
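
For reference, the linear search described above amounts to something like this (a simplified sketch, not the exact patch code): it walks every transaction tracked in the buffer, subtransactions included, and picks the one with the largest tally.

static ReorderBufferTXN *
ReorderBufferLargestTXN(ReorderBuffer *rb)
{
    HASH_SEQ_STATUS hash_seq;
    ReorderBufferTXNByIdEnt *ent;
    ReorderBufferTXN *largest = NULL;

    /* O(number of transactions) scan over the xid -> txn hash table */
    hash_seq_init(&hash_seq, rb->by_txn);
    while ((ent = hash_seq_search(&hash_seq)) != NULL)
    {
        ReorderBufferTXN *txn = ent->txn;

        if (largest == NULL || txn->size > largest->size)
            largest = txn;
    }

    return largest;
}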

Thoughts?


I am very sorry that I did not notice this thread before.
Spilling to files in the reorder buffer is the main factor limiting the speed of importing data in multimaster and shardman (sharding based on FDW, with redundancy provided by LR).
This is why we have thought a lot about possible ways of addressing this issue.
Right now the data of a huge transaction is written to disk three times before it is applied at the replica, and obviously read three times as well: first it is saved in the WAL, then spilled to disk by the reorder buffer, and once again spilled to disk at the replica before being assigned to a particular apply worker (the last step is specific to multimaster, which can apply received transactions concurrently).

We considered three different approaches:
1. Streaming. This is similar to the proposed patch; the main difference is that we do not want to spill the transaction to a temporary file at the replica, but rather apply it immediately in a separate backend and abort that transaction if it is aborted at the master. Certainly this will only work with 2PC.
2. Eliminating spilling by rescanning the WAL.
3. Bypassing the WAL: add hooks to heapam to buffer changes and propagate them immediately to the replica, where they are applied in a dedicated backend. I have implemented a prototype of such replication. With one replica it shows about a 1.5x slowdown compared with standalone/async LR and about a 2-3x improvement compared with sync LR. With two replicas the result is 2x slower than async LR and 2-8 times faster than sync LR (depending on the number of concurrent connections).

Approach 3 seems to be specific to multimaster/shardman, so most likely it cannot be considered for general LR.
So I want to compare 1 and 2. Have you ever thought about something like 2?

Right now, the proposed patch just moves spilling to a file from the master to the replica. It can still make sense to avoid memory overflow and reduce disk IO at the master, but if we have just one huge transaction (COPY) importing gigabytes of data into the database, then performance will be almost the same with or without your patch.
The only difference is where we serialize the transaction: at the master or at the replica side. In this sense the patch doesn't solve the problem of slow loading of large bulks of data through LR.

Alternatively (approach 2), we can have a small in-memory buffer for decoding a transaction and remember the LSN and snapshot of that transaction's start. In case of buffer overflow we just continue the WAL traversal until we reach the end of the transaction. After that we restart scanning the WAL from the beginning of that transaction, and on this second pass we send the changes directly to the output plugin. So we have to scan the WAL several times, but we do not need to spill anything to disk, neither at the publisher nor at the subscriber side. Certainly this approach will be inefficient if we have several long interleaving transactions, but in most customer use cases we have observed so far there is just one huge transaction performing a bulk load. Maybe I have missed something, but this approach seems easier to implement than transaction streaming, and it doesn't require any changes to the output plugin API. I realize that it is a little bit late to ask this question once your patch is almost ready, but what do you think about it? Are there some pitfalls with this approach?
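
To illustrate the control flow, here is a very rough, self-contained toy sketch (this is not PostgreSQL code; a tiny array stands in for the WAL and the buffer limit is artificially small, purely to show the two passes):

#include <stdio.h>

#define WAL_LEN     10
#define BUF_LIMIT   4       /* tiny in-memory buffer, to force the overflow path */

static const char *wal[WAL_LEN] = {
    "BEGIN", "INSERT 1", "INSERT 2", "INSERT 3", "INSERT 4",
    "INSERT 5", "INSERT 6", "INSERT 7", "INSERT 8", "COMMIT"
};

static void
output_plugin(const char *change)
{
    printf("send to plugin: %s\n", change);
}

int
main(void)
{
    const char *buffer[BUF_LIMIT];
    int         nbuffered = 0;
    int         start_pos = 0;      /* remembered "LSN" of transaction start */
    int         overflowed = 0;
    int         i;

    /* First pass: try to keep the whole transaction in the small buffer. */
    for (i = start_pos; i < WAL_LEN; i++)
    {
        if (!overflowed && nbuffered < BUF_LIMIT)
            buffer[nbuffered++] = wal[i];
        else
            overflowed = 1;     /* keep scanning to find the end of the txn */
    }

    if (!overflowed)
    {
        /* Whole transaction fit in memory: send the buffered changes. */
        for (i = 0; i < nbuffered; i++)
            output_plugin(buffer[i]);
        return 0;
    }

    /* Second pass: rescan from the remembered start and stream directly. */
    for (i = start_pos; i < WAL_LEN; i++)
        output_plugin(wal[i]);

    return 0;
}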

There is one more aspect and performance problem with LR that we have faced in shardman: if there are several publications for different subsets of tables at one instance, then the WAL senders have to do a lot of useless work. They decode transactions which have no relation to their publication, but a WAL sender doesn't know that until it reaches the end of the transaction. What is worse: if the transaction is huge, then all WAL senders will spill it to disk even though only one of them actually needs it. So the data of a huge transaction is written not three times, but N times, where N is the number of publications. The only solution to this problem we can imagine is to let the backend somehow inform the WAL sender (through a shared message queue?) about the LSNs it should consider. In that case the WAL sender can skip large portions of the WAL without decoding them. We would also like to know the opinion of 2ndQuadrant about this idea.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

