Hi,
Could you please share repro steps for running these benchmarks? I am doing
performance testing in this area and want to use the same benchmarks.
Thanks,
Muhammad
From: Andres Freund
Sent: Friday, October 28, 2022 7:54 PM
To: pgsql-hack...@postgresql.org; Thomas Munro; Melanie Plageman
Cc: Yura Sokolov; Robert Haas
Subject: refactoring relation extension and BufferAlloc(), faster COPY
Hi,
I'm working to extract independently useful bits from my AIO work, to reduce
the size of that patchset. This is one of those pieces.
In workloads that extend relations a lot, we end up being extremely contended
on the relation extension lock. We've attempted to address that to some degree
by using batching, which helps, but only so much.
The fundamental issue, in my opinion, is that we do *way* too much while
holding the relation extension lock. We acquire a victim buffer; if that
buffer is dirty, we potentially flush the WAL and then write the buffer
out. Then we zero out the buffer contents and call smgrextend().
Most of that work does not actually need to happen while holding the relation
extension lock. As far as I can tell, the minimum that needs to be covered by
the extension lock is the following:
1) call smgrnblocks()
2) insert buffer[s] into the buffer mapping table at the location returned by
smgrnblocks
3) mark buffer[s] as IO_IN_PROGRESS
1) obviously has to happen with the relation extension lock held because
otherwise we might miss another relation extension. 2+3) need to happen with
the lock held, because otherwise another backend not doing an extension could
read the block before we're done extending, dirty it, write it out, and then
have it overwritten by the extending backend.
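In code form, the narrowed critical section looks roughly like this;
GetVictimBuffer(), InsertBufferMapping() and MarkBufferIOInProgress() are
hypothetical helper names, used here only for illustration:

Buffer      victim;
BlockNumber blkno;

/* expensive work (finding/cleaning a victim buffer) happens unlocked */
victim = GetVictimBuffer();                 /* hypothetical */

LockRelationForExtension(rel, ExclusiveLock);

/* 1) find where the relation currently ends */
blkno = smgrnblocks(RelationGetSmgr(rel), MAIN_FORKNUM);

/* 2) make the new block visible in the buffer mapping table */
InsertBufferMapping(victim, rel, MAIN_FORKNUM, blkno);  /* hypothetical */

/* 3) keep other backends off the buffer until the IO is done */
MarkBufferIOInProgress(victim);             /* hypothetical */

UnlockRelationForExtension(rel, ExclusiveLock);

/* zeroing and smgrextend()/smgrzeroextend() can now happen unlocked */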
The reason we currently do so much work while holding the relation extension
lock is that bufmgr.c does not know about the relation lock and that relation
extension happens entirely within ReadBuffer* - there's no way to use a
narrower scope for the lock.
My fix for that is to add a dedicated function for extending relations, which
can acquire the extension lock if necessary (callers can tell it to skip that,
e.g., when initially creating an init fork). This routine is called by
ReadBuffer_common() when P_NEW is passed in, to provide backward
compatibility.
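Concretely, the entry point could look something like the following
declaration; the name and flag are hypothetical, chosen just to illustrate
the knob for skipping the lock:

/* Hypothetical shape of the dedicated extension API; not the actual
 * signature from the patch. */
#define EB_SKIP_EXTENSION_LOCK  (1 << 0)    /* e.g. init fork creation */

extern Buffer ExtendRelationBuffered(Relation rel, ForkNumber fork,
                                     uint32 flags);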
To be able to acquire victim buffers outside of the extension lock, victim
buffers are now acquired separately from inserting the new buffer mapping
entry. Victim buffers are pinned, cleaned, removed from the buffer mapping
table and marked invalid. Because they are pinned, clock sweeps in other
backends won't return them. This is done in a new function,
[Local]BufferAlloc().
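Roughly, the victim-buffer step does something like the following; all the
helper names here are simplified, hypothetical stand-ins for the bufmgr.c
internals, not the patch's actual functions:

/* Sketch of acquiring a victim buffer outside the extension lock. */
static Buffer
AcquireVictimBuffer(void)
{
    for (;;)
    {
        Buffer      buf = RunClockSweep();      /* hypothetical */

        /* pin first, so other backends' clock sweeps skip this buffer */
        PinVictimBuffer(buf);                   /* hypothetical */

        if (VictimBufferIsDirty(buf))           /* hypothetical */
            FlushVictimBuffer(buf);             /* may flush WAL first */

        /* remove the old buffer mapping entry; this can race with
         * other backends, hence the retry loop */
        if (RemoveBufferMappingEntry(buf))      /* hypothetical */
        {
            MarkVictimBufferInvalid(buf);       /* hypothetical */
            return buf;                         /* still pinned */
        }

        UnpinVictimBuffer(buf);                 /* lost the race, retry */
    }
}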
This is similar to Yura's patch at [0], but not that similar to earlier or
later approaches in that thread. I don't really understand why that thread
went on to ever more complicated approaches, when the basic approach shows
plenty of gains, with no issues around the number of buffer mapping entries
that can exist, etc.
Other interesting bits I found:
a) For workloads that [mostly] fit into s_b, the smgrwrite() that BufferAlloc()
does nearly doubles the amount of writes: first the kernel ends up writing
out all the zeroed-out pages after a while, and then again when we write
out the actual buffer contents.
The best fix for that seems to be to optionally use posix_fallocate() to
reserve space, without dirtying pages in the kernel page cache. However, it
looks like that's only beneficial when extending by multiple pages at once,
because it ends up causing one filesystem-journal entry for each extension
on at least some filesystems.
I added 'smgrzeroextend()' that can extend by multiple blocks, without the
caller providing a buffer to write out. When extending by 8 or more blocks,
posix_fallocate() is used if available, otherwise pg_pwritev_with_retry() is
used to extend the file.
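For reference, the strategy reduces to something like this standalone
sketch; it is not the patch's actual smgrzeroextend(), the 8-block cutoff
is just the heuristic mentioned above, and the patch batches the fallback
writes with pg_pwritev_with_retry() rather than looping over pwrite():

#include <fcntl.h>
#include <unistd.h>

#define BLCKSZ 8192

static int
zero_extend(int fd, off_t offset, int nblocks)
{
#ifdef HAVE_POSIX_FALLOCATE             /* set by configure in PG proper */
    if (nblocks >= 8)
    {
        /* Reserve space without creating dirty pages in the kernel
         * page cache.  Returns an errno value on failure, not -1. */
        return posix_fallocate(fd, offset,
                               (off_t) nblocks * BLCKSZ) == 0 ? 0 : -1;
    }
#endif

    /* Otherwise write explicit zeros; real code retries partial
     * writes instead of failing outright. */
    {
        static const char zbuf[BLCKSZ]; /* zero-initialized */

        for (int i = 0; i < nblocks; i++)
        {
            if (pwrite(fd, zbuf, BLCKSZ,
                       offset + (off_t) i * BLCKSZ) != BLCKSZ)
                return -1;
        }
    }
    return 0;
}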
b) I found that it is quite beneficial to bulk-extend the relation with
smgrextend() even without concurrency. The reason for that is primarily
the aforementioned dirty buffers that our current extension method causes.
One bit that stumped me for quite a while was knowing how much to extend the
relation by. RelationGetBufferForTuple() bases the decision of whether / how
much to bulk-extend purely on the contention on the extension lock, which
obviously does not work for non-concurrent workloads.
After quite a while I figured out that we actually have good information on
how much to extend by, at least for COPY /
heap_multi_insert(). heap_multi_insert() can compute how much space is
needed to store all tuples, and pass that on to
RelationGetBufferForTuple().
For that to be accurate we need to recompute that number whenever we use an
already partially filled page. That's not great, but doesn't appear to be a
measurable overhead.
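A rough sketch of that computation, with an illustrative helper name and
simplified overhead math (the real code also has to account for the page
header, fill factor, etc.):

/* Illustrative only: estimate how many new pages are needed for the
 * tuples of a heap_multi_insert() batch that have not been placed yet.
 * Assumes free_space (usable bytes per new page) is > 0. */
static BlockNumber
pages_needed(HeapTuple *tuples, int ntuples, int done, Size free_space)
{
    Size        total = 0;

    for (int i = done; i < ntuples; i++)
        total += MAXALIGN(tuples[i]->t_len) + sizeof(ItemIdData);

    /* round up, assuming each new page offers ~free_space bytes */
    return (BlockNumber) ((total + free_space - 1) / free_space);
}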
c) Contention on the FSM and the pages returned by it is a serious bottleneck.