On Thu, Aug 29, 2019 at 05:37:45PM +0300, Alexey Kondratov wrote:
On 28.08.2019 22:06, Tomas Vondra wrote:
Interesting. Any idea where the extra overhead in this particular case
comes from? It's hard to deduce that from the single flame graph, when I
don't have anything to compare it with (i.e. the flame graph for the
"normal" case).
I guess the bottleneck is in disk operations. You can check the
logical_repl_worker_new_perf.svg flame graph: disk reads (~9%) and
writes (~26%) take around 35% of CPU time in total. For comparison,
please see the attached flame graph for the following transaction:

INSERT INTO large_text
    SELECT (SELECT string_agg('x', ',')
            FROM generate_series(1, 2000))
    FROM generate_series(1, 1000000);

Execution Time: 44519.816 ms
Time: 98333,642 ms (01:38,334)

where disk I/O is only ~7-8% in total. So we get very roughly the same
~4-5x performance drop here. JFYI, I am using a machine with an SSD for
these tests.
Therefore, you could probably write changes on the receiver in bigger
chunks, not each change separately.
Possibly. I/O is certainly a possible culprit, although we should be
using buffered I/O and there are no fsyncs here. So I'm not sure why it
would be cheaper to do the writes in batches.
BTW does this mean you see the overhead on the apply side? Or are you
running this on a single machine, and it's difficult to decide?
I run this on a single machine, but the walsender and the apply worker
are each utilizing almost 100% of a CPU all the time, and on the apply
side I/O syscalls take about 1/3 of the CPU time. I am still not sure,
but for me this result links the performance drop to problems on the
receiver side.
Writing in batches was just a hypothesis, and to validate it I performed
a test with a large transaction consisting of a smaller number of wide
rows. This test does not exhibit any significant performance drop, even
though it was streamed too, so the hypothesis seems to hold. Anyway, I do
not have any other reasonable ideas besides that right now.
It seems that the overhead added by a synchronous replica is 2-3 times
lower compared with Postgres master and streaming with spilling. So the
original patch eliminated the delay before the sender starts processing
a large transaction, while this additional patch speeds up the apply
side.

Although the overall speedup is clearly measurable, there is still room
for improvement:
1) Currently bgworkers are only spawned on demand, without any initial
pool, and are never stopped. Maybe we should create a small pool at
replication start and shut down some of the idle bgworkers if they
exceed some limit?
2) Probably we can somehow track that an incoming change conflicts with
one of the xacts currently being processed, so we only have to wait for
specific bgworkers in that case?
3) Since the communication between the main logical apply worker and
each bgworker from the pool is a 'single producer --- single consumer'
problem, it should be possible to set/check flags and wait without
taking locks, using just atomics (a rough sketch follows this list).
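To make point 3 a bit more concrete, here is a minimal sketch of what the
lock-free handshake between the main apply worker (producer) and one
bgworker (consumer) could look like, using the pg_atomic and latch
primitives. This is an illustration only, not part of the patch: the
StreamWorkerSlot struct and the slot_* functions are made-up names, and
error/interrupt handling is reduced to the bare minimum.

/*
 * Hypothetical sketch (not from the patch): a lock-free single-producer /
 * single-consumer progress counter pair kept in shared memory, one slot
 * per bgworker.  The main apply worker bumps nchanges_sent for every
 * change it pushes into the worker's queue; the bgworker bumps
 * nchanges_applied after applying each change and pokes the producer's
 * latch.  Waiting at STREAM STOP then needs no heavyweight lock.
 */
#include "postgres.h"

#include "miscadmin.h"
#include "pgstat.h"
#include "port/atomics.h"
#include "storage/latch.h"

typedef struct StreamWorkerSlot
{
	pg_atomic_uint64 nchanges_sent;		/* written by main apply worker */
	pg_atomic_uint64 nchanges_applied;	/* written by the bgworker */
	Latch	   *apply_latch;			/* main apply worker's process latch
										 * (&MyProc->procLatch, lives in shmem) */
} StreamWorkerSlot;

/* Producer: one more change handed over to the bgworker. */
static inline void
slot_note_change_sent(StreamWorkerSlot *slot)
{
	pg_atomic_fetch_add_u64(&slot->nchanges_sent, 1);
}

/* Consumer: one more change applied; wake the producer if it is waiting. */
static inline void
slot_note_change_applied(StreamWorkerSlot *slot)
{
	pg_atomic_fetch_add_u64(&slot->nchanges_applied, 1);
	SetLatch(slot->apply_latch);
}

/* Producer: wait lock-free until the bgworker catches up (e.g. at STREAM STOP). */
static void
slot_wait_for_catchup(StreamWorkerSlot *slot)
{
	uint64		sent = pg_atomic_read_u64(&slot->nchanges_sent);

	while (pg_atomic_read_u64(&slot->nchanges_applied) < sent)
	{
		(void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH,
						 -1L, PG_WAIT_EXTENSION);
		ResetLatch(MyLatch);
		CHECK_FOR_INTERRUPTS();
	}
}

The pg_atomic fetch-add operations act as full barriers, so the counters
stay ordered with respect to the queued data without any explicit
spinlock; whether this is measurably cheaper than a lock here is exactly
the open question above.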
What do you think about this concept in general? Any concerns and
criticism are welcome!
Hi Tomas,
Thank you for the quick response.
I don't think it matters very much whether the workers are started at the
beginning or allocated ad hoc, that's IMO a minor implementation detail.
OK, I had the same view on this point. Any minor differences here will be
negligible for a sufficiently large transaction.
There's one huge challenge that I however don't see mentioned in your
message or in the patch (after a cursory reading): ensuring the same
commit order, and the risk of introducing deadlocks that would not exist
in single-process apply.
Probably I haven't explained this part well, sorry for that. In my patch
I don't use the worker pool for concurrent transaction apply, but rather
for fast context switching between long-lived streamed transactions. In
other words, we apply all changes arriving from the sender in a
completely serial manner. Written out step by step, it looks like this:
1) Read the STREAM START message and figure out the target worker by xid.
2) Put all changes that belong to this xact into the selected worker's
queue one by one via shm_mq_send.
3) Read the STREAM STOP message and wait until our worker has applied
all changes in the queue.
4) Process all other chunks of streamed xacts in the same manner.
5) Process all non-streamed xacts immediately in the main apply worker loop.
6) If we read a streamed COMMIT/ABORT, we again wait until the selected
worker either commits or aborts.
Thus, it automatically guarantees the same commit order on the replica as
on the master. Yes, we lose some performance here, since we don't apply
transactions concurrently, but doing so would bring all those problems
you have described.
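To illustrate steps 1-3, the dispatch in the main apply worker could look
roughly like the sketch below. This is not the actual patch code:
StreamWorker, stream_worker_for_xid() and wait_for_worker_to_drain() are
hypothetical helpers, and error handling is trimmed.

/*
 * Illustration only: routing one chunk of a streamed transaction
 * (STREAM START .. STREAM STOP) to its bgworker over shm_mq, while
 * everything is still applied in a strictly serial order overall.
 */
#include "postgres.h"

#include "lib/stringinfo.h"
#include "nodes/pg_list.h"
#include "storage/shm_mq.h"

typedef struct StreamWorker
{
	shm_mq_handle *mqh;			/* queue into the bgworker */
	TransactionId xid;			/* streamed xact this worker handles */
} StreamWorker;

extern StreamWorker *stream_worker_for_xid(TransactionId xid);
extern void wait_for_worker_to_drain(StreamWorker *worker);

static void
apply_streamed_chunk(TransactionId xid, List *changes)
{
	/* 1) figure out the target worker by xid (spawned on demand) */
	StreamWorker *worker = stream_worker_for_xid(xid);
	ListCell   *lc;

	/* 2) hand the changes over one by one */
	foreach(lc, changes)
	{
		StringInfo	change = (StringInfo) lfirst(lc);

		/* nowait = false: block until the bgworker frees up queue space */
		if (shm_mq_send(worker->mqh, change->len, change->data, false) !=
			SHM_MQ_SUCCESS)
			elog(ERROR, "streaming apply worker exited unexpectedly");
	}

	/* 3) STREAM STOP: wait until the worker has drained its queue */
	wait_for_worker_to_drain(worker);
}

Non-streamed transactions (step 5) bypass this entirely and are applied
directly in the main loop, and a streamed COMMIT/ABORT (step 6) waits for
the selected worker to finish before the next message is consumed, which
is what preserves the commit order.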
OK, so it's apply in multiple processes, but at any moment only a single
apply process is active.
However, you helped me figure out another point I had forgotten. Although
we ensure the commit order automatically, the start of streamed xacts may
be reordered. It happens if some small xacts have been committed on the
master since the streamed one started, because we do not start streaming
immediately, but only after the logical_work_mem limit is hit. I have
performed some tests with conflicting xacts and it seems that this is not
a problem, since the locking mechanism in Postgres guarantees that if
there were any deadlocks, they would have happened earlier on the master.
So if some records hit the WAL, it is safe to apply them sequentially. Am
I wrong?
I think you're right that the way you interleave the changes ensures you
can't introduce new deadlocks between transactions in this stream. I
don't think reordering the blocks of streamed transactions matters, as
long as the commit order is ensured in this case.
Anyway, I'm going to double check the safety of this part later.
OK.
FWIW my understanding is that the speedup comes mostly from the
elimination of the serialization to a file. That however requires
savepoints to handle aborts of subtransactions - I'm pretty sure it'd be
trivial to create a workload where this will be much slower (with many
aborts of large subtransactions).
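For what it's worth, a minimal sketch of the savepoint-style handling
mentioned here could look like the snippet below. This is purely
illustrative (the apply_handle_* names are invented, and the actual patch
may structure this differently), but it shows why many aborts of large
subtransactions hurt: all the work applied since the matching start has
to be rolled back.

/*
 * Hypothetical sketch: mapping streamed subtransaction boundaries onto
 * internal subtransactions on the apply side, so that a streamed
 * subxact abort can be replayed without spilling changes to a file.
 */
#include "postgres.h"

#include "access/xact.h"

static void
apply_handle_stream_subxact_start(const char *subxact_name)
{
	MemoryContext oldcxt = CurrentMemoryContext;

	/* open an internal subtransaction; a later abort can undo it */
	BeginInternalSubTransaction(subxact_name);
	MemoryContextSwitchTo(oldcxt);
}

static void
apply_handle_stream_subxact_abort(void)
{
	/* throw away everything applied since the matching start */
	RollbackAndReleaseCurrentSubTransaction();
}

static void
apply_handle_stream_subxact_commit(void)
{
	/* keep the applied work; fold it into the parent transaction */
	ReleaseCurrentSubTransaction();
}

(Resource-owner and memory-context bookkeeping is glossed over; the point
is only that an aborted subtransaction discards already-applied work,
which the spill-to-file approach never has to apply in the first place.)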
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services