On Thu, Aug 29, 2019 at 05:37:45PM +0300, Alexey Kondratov wrote:
On 28.08.2019 22:06, Tomas Vondra wrote:
Interesting. Any idea where the extra overhead in this particular case
comes from? It's hard to deduce that from the single flame graph, when I
don't have anything to compare it with (i.e. the flame graph for the
"normal" case).
I guess the bottleneck is in disk operations. You can check the
logical_repl_worker_new_perf.svg flame graph: disk reads (~9%) and
writes (~26%) take around 35% of CPU time in total. For comparison,
please see the attached flame graph for the following transaction:

INSERT INTO large_text
    SELECT (SELECT string_agg('x', ',')
            FROM generate_series(1, 2000))
    FROM generate_series(1, 1000000);

Execution Time: 44519.816 ms
Time: 98333,642 ms (01:38,334)

where disk I/O is only ~7-8% in total. So we get very roughly the same
~4-5x performance drop here. JFYI, I am using a machine with an SSD for
these tests.
Therefore, you could probably write changes on the receiver in bigger
chunks, not each change separately.
Possibly. I/O is certainly a possible culprit, although we should be
using buffered I/O and there are no fsyncs here. So I'm not sure why it
would be cheaper to do the writes in batches.
BTW does this mean you see the overhead on the apply side? Or are you
running this on a single machine, and it's difficult to decide?
I run this on a single machine, but the walsender and the apply worker
are each utilizing almost 100% of a CPU all the time, and on the apply
side I/O syscalls take about 1/3 of the CPU time. I am still not sure,
but for me this result links the performance drop to problems on the
receiver side.
Writing in batches was just a hypothesis, and to validate it I performed
a test with a large transaction consisting of a smaller number of wide
rows. This test does not exhibit any significant performance drop, even
though it was streamed too, so the hypothesis seems to hold. Anyway, I do
not have any other reasonable ideas besides that right now.
It seems that the overhead added by a synchronous replica is 2-3 times
lower compared with Postgres master and streaming with spilling. So the
original patch eliminated the delay before the sender starts processing
a large transaction, while this additional patch speeds up the apply
side.

Although the overall speedup is clearly measurable, there is still room
for improvement:
1) Currently bgworkers are only spawned on demand, without any initial
pool, and are never stopped. Maybe we should create a small pool at
replication start and shut down some of the idle bgworkers if they
exceed some limit?
2) Probably we can somehow track that an incoming change conflicts with
one of the xacts currently being processed, so we only have to wait for
specific bgworkers in that case?
3) Since the communication between the main logical apply worker and
each bgworker from the pool is a 'single producer --- single consumer'
problem, it should be possible to set/check flags and wait without
taking locks, using just atomics (a rough sketch follows this list).
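To make point 3 a bit more concrete, here is a minimal sketch of what the
lock-free handshake between the main apply worker (producer) and one
bgworker (consumer) could look like, using the pg_atomic and latch
primitives. This is an illustration only, not part of the patch: the
StreamWorkerSlot struct and the slot_* functions are made-up names, and
error/interrupt handling is reduced to the bare minimum.

/*
 * Hypothetical sketch (not from the patch): a lock-free single-producer /
 * single-consumer progress counter pair kept in shared memory, one slot
 * per bgworker.  The main apply worker bumps nchanges_sent for every
 * change it pushes into the worker's queue; the bgworker bumps
 * nchanges_applied after applying each change and pokes the producer's
 * latch.  Waiting at STREAM STOP then needs no heavyweight lock.
 */
#include "postgres.h"

#include "miscadmin.h"
#include "pgstat.h"
#include "port/atomics.h"
#include "storage/latch.h"

typedef struct StreamWorkerSlot
{
	pg_atomic_uint64 nchanges_sent;		/* written by main apply worker */
	pg_atomic_uint64 nchanges_applied;	/* written by the bgworker */
	Latch	   *apply_latch;			/* main apply worker's process latch
										 * (&MyProc->procLatch, lives in shmem) */
} StreamWorkerSlot;

/* Producer: one more change handed over to the bgworker. */
static inline void
slot_note_change_sent(StreamWorkerSlot *slot)
{
	pg_atomic_fetch_add_u64(&slot->nchanges_sent, 1);
}

/* Consumer: one more change applied; wake the producer if it is waiting. */
static inline void
slot_note_change_applied(StreamWorkerSlot *slot)
{
	pg_atomic_fetch_add_u64(&slot->nchanges_applied, 1);
	SetLatch(slot->apply_latch);
}

/* Producer: wait lock-free until the bgworker catches up (e.g. at STREAM STOP). */
static void
slot_wait_for_catchup(StreamWorkerSlot *slot)
{
	uint64		sent = pg_atomic_read_u64(&slot->nchanges_sent);

	while (pg_atomic_read_u64(&slot->nchanges_applied) < sent)
	{
		(void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH,
						 -1L, PG_WAIT_EXTENSION);
		ResetLatch(MyLatch);
		CHECK_FOR_INTERRUPTS();
	}
}

The pg_atomic fetch-add operations act as full barriers, so the counters
stay ordered with respect to the queued data without any explicit
spinlock; whether this is measurably cheaper than a lock here is exactly
the open question above.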
What do you think about this concept in general? Any concerns and
criticism are welcome!
Hi Tomas,
Thank you for the quick response.
I don't think it matters very much whether the workers are started at the
beginning or allocated ad hoc, that's IMO a minor implementation detail.
OK, I had the same view on this point. Any minor differences here will be
negligible for a sufficiently large transaction.
There's one huge challenge that I however don't see mentioned in your
message or in the patch (after a cursory reading): ensuring the same
commit order, and the risk of introducing deadlocks that would not exist
in single-process apply.
Probably I haven't explained this part well, sorry for that. In my patch
I don't use the worker pool for concurrent transaction apply, but rather
for fast context switching between long-lived streamed transactions. In
other words, we apply all changes arriving from the sender in a
completely serial manner. Written out step by step, it looks like this:
1) Read the STREAM START message and figure out the target worker by xid.
2) Put all changes that belong to this xact into the selected worker's
queue one by one via shm_mq_send.
3) Read the STREAM STOP message and wait until our worker has applied
all changes in the queue.
4) Process all other chunks of streamed xacts in the same manner.
5) Process all non-streamed xacts immediately in the main apply worker loop.
6) If we read a streamed COMMIT/ABORT, we again wait until the selected
worker either commits or aborts.
Thus, it automatically guarantees the same commit order on the replica as
on the master. Yes, we lose some performance here, since we don't apply
transactions concurrently, but doing so would bring all those problems
you have described.
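To illustrate steps 1-3, the dispatch in the main apply worker could look
roughly like the sketch below. This is not the actual patch code:
StreamWorker, stream_worker_for_xid() and wait_for_worker_to_drain() are
hypothetical helpers, and error handling is trimmed.

/*
 * Illustration only: routing one chunk of a streamed transaction
 * (STREAM START .. STREAM STOP) to its bgworker over shm_mq, while
 * everything is still applied in a strictly serial order overall.
 */
#include "postgres.h"

#include "lib/stringinfo.h"
#include "nodes/pg_list.h"
#include "storage/shm_mq.h"

typedef struct StreamWorker
{
	shm_mq_handle *mqh;			/* queue into the bgworker */
	TransactionId xid;			/* streamed xact this worker handles */
} StreamWorker;

extern StreamWorker *stream_worker_for_xid(TransactionId xid);
extern void wait_for_worker_to_drain(StreamWorker *worker);

static void
apply_streamed_chunk(TransactionId xid, List *changes)
{
	/* 1) figure out the target worker by xid (spawned on demand) */
	StreamWorker *worker = stream_worker_for_xid(xid);
	ListCell   *lc;

	/* 2) hand the changes over one by one */
	foreach(lc, changes)
	{
		StringInfo	change = (StringInfo) lfirst(lc);

		/* nowait = false: block until the bgworker frees up queue space */
		if (shm_mq_send(worker->mqh, change->len, change->data, false) !=
			SHM_MQ_SUCCESS)
			elog(ERROR, "streaming apply worker exited unexpectedly");
	}

	/* 3) STREAM STOP: wait until the worker has drained its queue */
	wait_for_worker_to_drain(worker);
}

Non-streamed transactions (step 5) bypass this entirely and are applied
directly in the main loop, and a streamed COMMIT/ABORT (step 6) waits for
the selected worker to finish before the next message is consumed, which
is what preserves the commit order.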
OK, so it's apply in multiple processes, but at any moment only a single
apply process is active.
However, you helped me figure out another point I had forgotten. Although
we ensure the commit order automatically, the start of streamed xacts may
be reordered. It happens if some small xacts have been committed on the
master since the streamed one started, because we do not start streaming
immediately, but only after the logical_work_mem limit is hit. I have
performed some tests with conflicting xacts and it seems that this is not
a problem, since the locking mechanism in Postgres guarantees that if
there were any deadlocks, they would have happened earlier on the master.
So if some records hit the WAL, it is safe to apply them sequentially. Am
I wrong?
I think you're right that the way you interleave the changes ensures you
can't introduce new deadlocks between transactions in this stream. I
don't think reordering the blocks of streamed transactions matters, as
long as the commit order is ensured in this case.
Anyway, I'm going to double check the safety of this part later.
OK.
FWIW my understanding is that the speedup comes mostly from the
elimination of the serialization to a file. That however requires
savepoints to handle aborts of subtransactions - I'm pretty sure it'd be
trivial to create a workload where this will be much slower (with many
aborts of large subtransactions).
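For what it's worth, a minimal sketch of the savepoint-style handling
mentioned here could look like the snippet below. This is purely
illustrative (the apply_handle_* names are invented, and the actual patch
may structure this differently), but it shows why many aborts of large
subtransactions hurt: all the work applied since the matching start has
to be rolled back.

/*
 * Hypothetical sketch: mapping streamed subtransaction boundaries onto
 * internal subtransactions on the apply side, so that a streamed
 * subxact abort can be replayed without spilling changes to a file.
 */
#include "postgres.h"

#include "access/xact.h"

static void
apply_handle_stream_subxact_start(const char *subxact_name)
{
	MemoryContext oldcxt = CurrentMemoryContext;

	/* open an internal subtransaction; a later abort can undo it */
	BeginInternalSubTransaction(subxact_name);
	MemoryContextSwitchTo(oldcxt);
}

static void
apply_handle_stream_subxact_abort(void)
{
	/* throw away everything applied since the matching start */
	RollbackAndReleaseCurrentSubTransaction();
}

static void
apply_handle_stream_subxact_commit(void)
{
	/* keep the applied work; fold it into the parent transaction */
	ReleaseCurrentSubTransaction();
}

(Resource-owner and memory-context bookkeeping is glossed over; the point
is only that an aborted subtransaction discards already-applied work,
which the spill-to-file approach never has to apply in the first place.)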
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services