Hi,

Could you please share repro steps for running these benchmarks? I am doing performance testing in this area and want to use the same benchmarks.
Thanks,
Muhammad

________________________________
From: Andres Freund <and...@anarazel.de>
Sent: Friday, October 28, 2022 7:54 PM
To: pgsql-hack...@postgresql.org <pgsql-hack...@postgresql.org>; Thomas Munro <thomas.mu...@gmail.com>; Melanie Plageman <melanieplage...@gmail.com>
Cc: Yura Sokolov <y.soko...@postgrespro.ru>; Robert Haas <robertmh...@gmail.com>
Subject: refactoring relation extension and BufferAlloc(), faster COPY

Hi,

I'm working to extract independently useful bits from my AIO work, to reduce the size of that patchset. This is one of those pieces.

In workloads that extend relations a lot, we end up being extremely contended on the relation extension lock. We've attempted to address that to some degree by using batching, which helps, but only so much.

The fundamental issue, in my opinion, is that we do *way* too much while holding the relation extension lock. We acquire a victim buffer; if that buffer is dirty, we potentially flush the WAL and then write out that buffer. Then we zero out the buffer contents and call smgrextend().

Most of that work does not actually need to happen while holding the relation extension lock. As far as I can tell, the minimum that needs to be covered by the extension lock is the following:

1) call smgrnblocks()
2) insert buffer[s] into the buffer mapping table at the location returned by smgrnblocks
3) mark buffer[s] as IO_IN_PROGRESS

1) obviously has to happen with the relation extension lock held, because otherwise we might miss another backend's concurrent extension. 2+3) need to happen with the lock held, because otherwise another backend not doing an extension could read the block before we're done extending, dirty it, write it out, and then have it overwritten by the extending backend.

The reason we currently do so much work while holding the relation extension lock is that bufmgr.c does not know about the relation lock, and relation extension happens entirely within ReadBuffer* - there's no way to use a narrower scope for the lock.

My fix for that is to add a dedicated function for extending relations, which can acquire the extension lock if necessary (callers can tell it to skip that, e.g., when initially creating an init fork). This routine is called by ReadBuffer_common() when P_NEW is passed in, to provide backward compatibility.

To be able to acquire victim buffers outside of the extension lock, victim buffers are now acquired separately from inserting the new buffer mapping entry. Victim buffers are pinned, cleaned, removed from the buffer mapping table and marked invalid. Because they are pinned, clock sweeps in other backends won't return them. This is done in a new function, [Local]BufferAlloc().

This is similar to Yura's patch at [0], but not that similar to earlier or later approaches in that thread. I don't really understand why that thread went on to ever more complicated approaches, when the basic approach shows plenty of gains, with no issues around the number of buffer mapping entries that can exist, etc.

Other interesting bits I found:

a) For workloads that [mostly] fit into s_b, the smgrwrite() that BufferAlloc() does nearly doubles the amount of writes. First the kernel ends up writing out all the zeroed-out buffers after a while, and then again when we write out the actual buffer contents.

The best fix for that seems to be to optionally use posix_fallocate() to reserve space, without dirtying pages in the kernel page cache.
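To make that concrete, here is a small standalone sketch - an illustration only, not the patch's smgr code - of extending a file by a number of blocks with posix_fallocate() where available, falling back to writing out zeroes otherwise. BLCKSZ and the HAVE_POSIX_FALLOCATE-style feature macro are stand-ins for what the build system would normally provide:

/*
 * Illustration only: reserve nblocks new blocks at the end of a file
 * without pushing BLCKSZ * nblocks of dirty zero pages into the kernel
 * page cache.
 */
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192

static int
zero_extend(int fd, off_t old_size, int nblocks)
{
#ifdef HAVE_POSIX_FALLOCATE
	/* reserves the space; no dirty pages end up in the kernel page cache */
	return posix_fallocate(fd, old_size, (off_t) nblocks * BLCKSZ);
#else
	/* fallback: actually write zeroes, one block at a time */
	static const char zerobuf[BLCKSZ];

	for (int i = 0; i < nblocks; i++)
	{
		if (pwrite(fd, zerobuf, BLCKSZ,
				   old_size + (off_t) i * BLCKSZ) != BLCKSZ)
			return -1;
	}
	return 0;
#endif
}

int
main(void)
{
	int fd = open("extend_demo.tmp", O_RDWR | O_CREAT, 0600);

	if (fd < 0 || zero_extend(fd, 0, 16) != 0)
		return 1;
	close(fd);
	return 0;
}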
However, it looks like that's only beneficial when extending by multiple pages at once, because it ends up causing one filesystem-journal entry for each extension on at least some filesystems.

I added smgrzeroextend(), which can extend by multiple blocks without the caller providing a buffer to write out. When extending by 8 or more blocks, posix_fallocate() is used if available, otherwise pg_pwritev_with_retry() is used to extend the file.

b) I found that it is quite beneficial to bulk-extend the relation with smgrextend() even without concurrency. The reason for that is primarily the aforementioned dirty buffers that our current extension method causes.

One bit that stumped me for quite a while is how to know how much to extend the relation by. RelationGetBufferForTuple() drives the decision whether / how much to bulk extend purely on the contention on the extension lock, which obviously does not work for non-concurrent workloads.

After quite a while I figured out that we actually have good information on how much to extend by, at least for COPY / heap_multi_insert(). heap_multi_insert() can compute how much space is needed to store all tuples, and pass that on to RelationGetBufferForTuple(). For that to be accurate, we need to recompute that number whenever we use an already partially filled page. That's not great, but doesn't appear to be a measurable overhead.

c) Contention on the FSM, and on the pages returned by it, is a serious bottleneck after a) and b).

The biggest issue is that the current bulk insertion logic in hio.c enters all but one of the new pages into the freespacemap. That will immediately cause all the other backends to contend on the first few pages returned by the FSM, and cause contention on the FSM pages themselves.

I've, partially, addressed that by using the information about the required number of pages from b). Whether we bulk insert or not, the pages we know we're going to need for one heap_multi_insert() don't need to be added to the FSM - we're going to use them anyway. I've stashed the number of free blocks in the BulkInsertState for now, but I'm not convinced that that's the right place.

If I revert just this part, the "concurrent COPY into unlogged table" benchmark goes from ~240 tps to ~190 tps.

Even after that change the FSM is a major bottleneck. Below I included benchmarks showing this by just removing the use of the FSM, but I haven't done anything further about it. The contention seems to come both from updating the FSM, as well as from thundering-herd-like symptoms when accessing the FSM. The update part could likely be addressed to some degree with a batch update operation that updates the state for multiple pages. The access part could perhaps be addressed by adding an operation that gets a page and immediately marks it as fully used, so other backends won't also try to access it.

d) Doing

  /* new buffers are zero-filled */
  MemSet((char *) bufBlock, 0, BLCKSZ);

under the extension lock is surprisingly expensive on my two-socket workstation (but much less noticeable on my laptop). If I move the MemSet back under the extension lock, the "concurrent COPY into unlogged table" benchmark goes from ~240 tps to ~200 tps.

e) When running a few benchmarks for this email, I noticed that there was a sharp performance dropoff for the patched code for a pgbench -S -s100 on a database with 1GB s_b, starting between 512 and 1024 clients. This started with the patch only acquiring one buffer partition lock at a time. Lots of debugging ensued, resulting in [3].
The problem isn't actually related to the change; the change just makes it more visible, because the "lock chains" between two partitions reduce the average length of the wait queues substantially, by distributing them between more partitions. [3] has a reproducer that's entirely independent of this patchset.

Bulk extension acquires a number of victim buffers, acquires the extension lock, inserts the buffers into the buffer mapping table and marks them as io-in-progress, calls smgrextend() and releases the extension lock. After that, the buffer[s] are locked (depending on mode and an argument indicating the number of blocks to be locked), and TerminateBufferIO() is called. (A toy model of this lock scoping follows further below, before the benchmark numbers.)

This requires two new pieces of infrastructure:

First, pinning multiple buffers opens up the obvious danger that we might run out of non-pinned buffers. I added LimitAdditional[Local]Pins(), which allows each backend to pin a proportional share of buffers (although always allowing one, as we do today); a toy sketch of that idea follows below as well.

Second, having multiple IOs in progress at the same time isn't possible with the InProgressBuf mechanism. I added ResourceOwnerRememberBufferIO() etc. to deal with that instead. I like that this ends up removing a lot of AbortBufferIO() calls from the loops of various aux processes (now released inside ReleaseAuxProcessResources()). In very extreme workloads (a single backend doing a pgbench -S -s 100 against an s_b=64MB database) the memory allocations triggered by StartBufferIO() are *just about* visible; not sure if that's worth worrying about - we do such allocations for the much more common pinning of buffers as well.

The new [Bulk]ExtendSharedRelationBuffered() functions currently have both a Relation and an SMgrRelation argument, requiring at least one of them to be set. The reason for that is that, on the one hand, LockRelationForExtension() requires a Relation and, on the other hand, redo routines typically don't have a Relation around (recovery doesn't require an extension lock). That's not pretty, but seems a tad better than the ReadBufferExtended() vs ReadBufferWithoutRelcache() mess.

I've done a fair bit of benchmarking of this patchset. For COPY it comes out ahead everywhere. It's possible that there's a very small regression for extremely IO-miss-heavy workloads, more below.

server "base" configuration:

max_wal_size=150GB
shared_buffers=24GB
huge_pages=on
autovacuum=0
backend_flush_after=2MB
max_connections=5000
wal_buffers=128MB
wal_segment_size=1GB

benchmark: pgbench running COPY into a single table. pgbench -t is set according to the client count, so that the same amount of data is inserted. This is done both using small files ([1], ringbuffer not effective, no dirty data to write out within the benchmark window) and a bit larger files ([2], lots of data to write out due to the ringbuffer).

To make it a fair comparison, HEAD includes the lwlock-waitqueue fix as well.
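Here is the toy model mentioned above - plain pthreads, not PostgreSQL code, purely to illustrate the narrowed lock scope: only discovering the current end of the "relation" and claiming the new block number happen under the stand-in extension lock, while the expensive page initialization happens outside of it. In the real patch the locked section additionally covers inserting into the buffer mapping table, marking the buffers IO_IN_PROGRESS and the smgrextend() call:

/*
 * Toy model: the mutex stands in for the relation extension lock, the
 * counter for smgrnblocks().
 */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define BLCKSZ             8192
#define NWORKERS           4
#define EXTENDS_PER_WORKER 10000

static pthread_mutex_t extension_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long relation_blocks = 0;	/* stand-in for smgrnblocks() */

static void *
extend_worker(void *arg)
{
	char page[BLCKSZ];

	(void) arg;
	for (int i = 0; i < EXTENDS_PER_WORKER; i++)
	{
		unsigned long block;

		pthread_mutex_lock(&extension_lock);
		block = relation_blocks++;		/* claim the new block number */
		pthread_mutex_unlock(&extension_lock);

		memset(page, 0, sizeof(page));	/* zero-fill outside the lock */
		(void) block;
	}
	return NULL;
}

int
main(void)
{
	pthread_t workers[NWORKERS];

	for (int i = 0; i < NWORKERS; i++)
		pthread_create(&workers[i], NULL, extend_worker, NULL);
	for (int i = 0; i < NWORKERS; i++)
		pthread_join(workers[i], NULL);

	printf("blocks \"extended\": %lu\n", relation_blocks);
	return 0;
}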
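And the toy sketch of the proportional pin limiting - made-up numbers and names, not the patch's actual LimitAdditional[Local]Pins() code:

/*
 * Toy illustration: a backend may pin additional buffers only up to roughly
 * its proportional share of shared buffers, but is always allowed at least
 * one, as today.
 */
#include <stdio.h>

static int
limit_additional_pins(int shared_buffers, int max_backends, int already_pinned)
{
	int share = shared_buffers / max_backends;
	int allowed = share - already_pinned;

	return (allowed > 0) ? allowed : 1;
}

int
main(void)
{
	/* e.g. 16384 buffers of 8kB (128MB of shared_buffers), 100 backends */
	printf("may pin %d more buffers\n", limit_additional_pins(16384, 100, 3));
	return 0;
}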
s_b=24GB

test: unlogged_small_files, format: text, files: 1024, 9015MB total

          seconds  tbl-MBs    seconds  tbl-MBs    seconds  tbl-MBs
clients   HEAD     HEAD       patch    patch      no_fsm   no_fsm
1         58.63    207        50.22    242        54.35    224
2         32.67    372        25.82    472        27.30    446
4         22.53    540        13.30    916        14.33    851
8         15.14    804        7.43     1640       7.48     1632
16        14.69    829        4.79     2544       4.50     2718
32        15.28    797        4.41     2763       3.32     3710
64        15.34    794        5.22     2334       3.06     4061
128       15.49    786        4.97     2452       3.13     3926
256       15.85    768        5.02     2427       3.26     3769
512       16.02    760        5.29     2303       3.54     3471

test: logged_small_files, format: text, files: 1024, 9018MB total

          seconds  tbl-MBs    seconds  tbl-MBs    seconds  tbl-MBs
clients   HEAD     HEAD       patch    patch      no_fsm   no_fsm
1         68.18    178        59.41    205        63.43    192
2         39.71    306        33.10    368        34.99    348
4         27.26    446        19.75    617        20.09    607
8         18.84    646        12.86    947        12.68    962
16        15.96    763        9.62     1266       8.51     1436
32        15.43    789        8.20     1486       7.77     1579
64        16.11    756        8.91     1367       8.90     1383
128       16.41    742        10.00    1218       9.74     1269
256       17.33    702        11.91    1023       10.89    1136
512       18.46    659        14.07    866        11.82    1049

test: unlogged_medium_files, format: text, files: 64, 9018MB total

          seconds  tbl-MBs    seconds  tbl-MBs    seconds  tbl-MBs
clients   HEAD     HEAD       patch    patch      no_fsm   no_fsm
1         63.27s   192        56.14    217        59.25    205
2         40.17s   303        29.88    407        31.50    386
4         27.57s   442        16.16    754        17.18    709
8         21.26s   573        11.89    1025       11.09    1099
16        21.25s   573        10.68    1141       10.22    1192
32        21.00s   580        10.72    1136       10.35    1178
64        20.64s   590        10.15    1200       9.76     1249
128       skipped
256       skipped
512       skipped

test: logged_medium_files, format: text, files: 64, 9018MB total

          seconds  tbl-MBs    seconds  tbl-MBs    seconds  tbl-MBs
clients   HEAD     HEAD       patch    patch      no_fsm   no_fsm
1         71.89s   169        65.57    217        69.09    69.09
2         47.36s   257        36.22    407        38.71    38.71
4         33.10s   368        21.76    754        22.78    22.78
8         26.62s   457        15.89    1025       15.30    15.30
16        24.89s   489        17.08    1141       15.20    15.20
32        25.15s   484        17.41    1136       16.14    16.14
64        26.11s   466        17.89    1200       16.76    16.76
128       skipped
256       skipped
512       skipped

Just to see how far it can be pushed, with binary format we can now get to nearly 6GB/s into a table when disabling the FSM - note the 2x difference between patch and patch+no-fsm at 32 clients.
test: unlogged_small_files, format: binary, files: 1024, 9508MB total

          seconds  tbl-MBs    seconds  tbl-MBs    seconds  tbl-MBs
clients   HEAD     HEAD       patch    patch      no_fsm   no_fsm
1         34.14    357        28.04    434        29.46    413
2         22.67    537        14.42    845        14.75    826
4         16.63    732        7.62     1599       7.69     1587
8         13.48    904        4.36     2795       4.13     2959
16        14.37    848        3.78     3224       2.74     4493
32        14.79    823        4.20     2902       2.07     5974
64        14.76    825        5.03     2423       2.21     5561
128       14.95    815        4.36     2796       2.30     5343
256       15.18    802        4.31     2828       2.49     4935
512       15.41    790        4.59     2656       2.84     4327

s_b=4GB

test: unlogged_small_files, format: text, files: 1024, 9018MB total

          seconds  tbl-MBs    seconds  tbl-MBs
clients   HEAD     HEAD       patch    patch
1         62.55    194        54.22    224
2         37.11    328        28.94    421
4         25.97    469        16.42    742
8         20.01    609        11.92    1022
16        19.55    623        11.05    1102
32        19.34    630        11.27    1081
64        19.07    639        12.04    1012
128       19.22    634        12.27    993
256       19.34    630        12.28    992
512       19.60    621        11.74    1038

test: logged_small_files, format: text, files: 1024, 9018MB total

          seconds  tbl-MBs    seconds  tbl-MBs
clients   HEAD     HEAD       patch    patch
1         71.71    169        63.63    191
2         46.93    259        36.31    335
4         30.37    401        22.41    543
8         22.86    533        16.90    721
16        20.18    604        14.07    866
32        19.64    620        13.06    933
64        19.71    618        15.08    808
128       19.95    610        15.47    787
256       20.48    595        16.53    737
512       21.56    565        16.86    722

test: unlogged_medium_files, format: text, files: 64, 9018MB total

          seconds  tbl-MBs    seconds  tbl-MBs
clients   HEAD     HEAD       patch    patch
1         62.65    194        55.74    218
2         40.25    302        29.45    413
4         27.37    445        16.26    749
8         22.07    552        11.75    1037
16        21.29    572        10.64    1145
32        20.98    580        10.70    1139
64        20.65    590        10.21    1193
128       skipped
256       skipped
512       skipped

test: logged_medium_files, format: text, files: 64, 9018MB total

          seconds  tbl-MBs    seconds  tbl-MBs
clients   HEAD     HEAD       patch    patch
1         71.72    169        65.12    187
2         46.46    262        35.74    341
4         32.61    373        21.60    564
8         26.69    456        16.30    747
16        25.31    481        17.00    716
32        24.96    488        17.47    697
64        26.05    467        17.90    680
128       skipped
256       skipped
512       skipped

test: unlogged_small_files, format: binary, files: 1024, 9505MB total

          seconds  tbl-MBs    seconds  tbl-MBs
clients   HEAD     HEAD       patch    patch
1         37.62    323        32.77    371
2         28.35    429        18.89    645
4         20.87    583        12.18    1000
8         19.37    629        10.38    1173
16        19.41    627        10.36    1176
32        18.62    654        11.04    1103
64        18.33    664        11.89    1024
128       18.41    661        11.91    1023
256       18.52    658        12.10    1007
512       18.78    648        11.49    1060

benchmark: Run a pgbench -S workload with scale 100, so it doesn't fit into s_b, thereby exercising BufferAlloc()'s buffer replacement path heavily. The run-to-run variance on my workstation is high for this workload (both before/after my changes). I also found that the ramp-up time at higher client counts is very significant:

progress: 2.1 s, 5816.8 tps, lat 1.835 ms stddev 4.450, 0 failed
progress: 3.0 s, 666729.4 tps, lat 5.755 ms stddev 16.753, 0 failed
progress: 4.0 s, 899260.1 tps, lat 3.619 ms stddev 41.108, 0 failed
...

One would need to run pgbench for impractically long to make that effect vanish. My not great solution for these was to run with -T21 -P5 and use the best 5s as the tps.
s_b=1GB

          tps       tps
clients   master    patched
1         49541     48805
2         85342     90010
4         167340    168918
8         308194    303222
16        524294    523678
32        649516    649100
64        932547    937702
128       908249    906281
256       856496    903979
512       764254    934702
1024      653886    925113
2048      569695    917262
4096      526782    903258

s_b=128MB:

          tps       tps
clients   master    patched
1         40407     39854
2         73180     72252
4         143334    140860
8         240982    245331
16        429265    420810
32        544593    540127
64        706408    726678
128       713142    718087
256       611030    695582
512       552751    686290
1024      508248    666370
2048      474108    656735
4096      448582    633040

As there might be a small regression at the smallest end, I ran a more extreme version of the above: a pipelined pgbench -S, with a single client, for longer, with s_b=8MB. To further reduce noise I pinned the server to one CPU, the client to another, and disabled turbo mode on the CPU.

master "total" tps: 61.52
master "best 5s" tps: 61.8
patch "total" tps: 61.20
patch "best 5s" tps: 61.4

Hardly conclusive, but it does look like there's a small effect. It could be code layout or such. My guess, however, is that it's the resource owner for in-progress IO that I added - that adds an additional allocation inside the resowner machinery. I commented those out (that's obviously incorrect!) just to see whether that changes anything:

no-resowner "total" tps: 62.03
no-resowner "best 5s" tps: 62.2

So it does indeed look like it's the resowner. I am a bit surprised, because we already use that mechanism for pins, which are obviously more frequent.

I'm not sure it's worth worrying about - this is a pretty absurd workload. But if we decide it is, I can think of a few ways to address it, e.g.:

- We could preallocate an initial element inside the ResourceArray struct, so that a newly created resowner won't need to allocate immediately.
- We could only use resowners if there's more than one IO in progress at the same time - but I don't like that idea much.
- We could try to store the "in-progress"-ness of a buffer inside the 'bufferpin' resowner entry - on 64-bit systems there's plenty of space for that. But on 32-bit systems...

The patches here aren't fully polished (as will be evident). But they should be more than good enough to discuss whether this is a sane direction.

Greetings,

Andres Freund

[0] https://postgr.es/m/3b108afd19fa52ed20c464a69f64d545e4a14772.camel%40postgrespro.ru
[1] COPY (SELECT repeat(random()::text, 5) FROM generate_series(1, 100000)) TO '/tmp/copytest_data_text.copy' WITH (FORMAT text);
[2] COPY (SELECT repeat(random()::text, 5) FROM generate_series(1, 6*100000)) TO '/tmp/copytest_data_text.copy' WITH (FORMAT text);
[3] https://postgr.es/m/20221027165914.2hofzp4cvutj6...@awork3.anarazel.de