Hi,

Could you please share repro steps for running these benchmarks? I am doing performance testing in this area and want to use the same benchmarks.
Thanks,
Muhammad

________________________________
From: Andres Freund <and...@anarazel.de>
Sent: Friday, October 28, 2022 7:54 PM
To: pgsql-hack...@postgresql.org <pgsql-hack...@postgresql.org>; Thomas Munro <thomas.mu...@gmail.com>; Melanie Plageman <melanieplage...@gmail.com>
Cc: Yura Sokolov <y.soko...@postgrespro.ru>; Robert Haas <robertmh...@gmail.com>
Subject: refactoring relation extension and BufferAlloc(), faster COPY

Hi,

I'm working to extract independently useful bits from my AIO work, to reduce the size of that patchset. This is one of those pieces.

In workloads that extend relations a lot, we end up being extremely contended on the relation extension lock. We've attempted to address that to some degree by using batching, which helps, but only so much.

The fundamental issue, in my opinion, is that we do *way* too much while holding the relation extension lock. We acquire a victim buffer; if that buffer is dirty, we potentially flush the WAL and then write out that buffer. Then we zero out the buffer contents and call smgrextend().

Most of that work does not actually need to happen while holding the relation extension lock. As far as I can tell, the minimum that needs to be covered by the extension lock is the following:

1) call smgrnblocks()
2) insert buffer[s] into the buffer mapping table at the location returned by smgrnblocks
3) mark buffer[s] as IO_IN_PROGRESS

1) obviously has to happen with the relation extension lock held, because otherwise we might miss another backend's concurrent extension. 2+3) need to happen with the lock held, because otherwise another backend not doing an extension could read the block before we're done extending, dirty it, write it out, and then have it overwritten by the extending backend.

The reason we currently do so much work while holding the relation extension lock is that bufmgr.c does not know about the relation lock, and relation extension happens entirely within ReadBuffer* - there's no way to use a narrower scope for the lock.

My fix for that is to add a dedicated function for extending relations, which can acquire the extension lock if necessary (callers can tell it to skip that, e.g., when initially creating an init fork). This routine is called by ReadBuffer_common() when P_NEW is passed in, to provide backward compatibility.

To be able to acquire victim buffers outside of the extension lock, victim buffers are now acquired separately from inserting the new buffer mapping entry. Victim buffers are pinned, cleaned, removed from the buffer mapping table and marked invalid. Because they are pinned, clock sweeps in other backends won't return them. This is done in a new function, [Local]BufferAlloc().

This is similar to Yura's patch at [0], but not that similar to earlier or later approaches in that thread. I don't really understand why that thread went on to ever more complicated approaches, when the basic approach shows plenty of gains, with no issues around the number of buffer mapping entries that can exist, etc.

Other interesting bits I found:

a) For workloads that [mostly] fit into s_b, the smgrwrite() that BufferAlloc() does nearly doubles the amount of writes. First the kernel ends up writing out all the zeroed-out buffers after a while, and then again when we write out the actual buffer contents.

The best fix for that seems to be to optionally use posix_fallocate() to reserve space, without dirtying pages in the kernel page cache.
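To make that concrete, here is a small standalone sketch - an illustration only, not the patch's smgr code - of extending a file by a number of blocks with posix_fallocate() where available, falling back to writing out zeroes otherwise. BLCKSZ and the HAVE_POSIX_FALLOCATE-style feature macro are stand-ins for what the build system would normally provide:

/*
 * Illustration only: reserve nblocks new blocks at the end of a file
 * without pushing BLCKSZ * nblocks of dirty zero pages into the kernel
 * page cache.
 */
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192

static int
zero_extend(int fd, off_t old_size, int nblocks)
{
#ifdef HAVE_POSIX_FALLOCATE
	/* reserves the space; no dirty pages end up in the kernel page cache */
	return posix_fallocate(fd, old_size, (off_t) nblocks * BLCKSZ);
#else
	/* fallback: actually write zeroes, one block at a time */
	static const char zerobuf[BLCKSZ];

	for (int i = 0; i < nblocks; i++)
	{
		if (pwrite(fd, zerobuf, BLCKSZ,
				   old_size + (off_t) i * BLCKSZ) != BLCKSZ)
			return -1;
	}
	return 0;
#endif
}

int
main(void)
{
	int fd = open("extend_demo.tmp", O_RDWR | O_CREAT, 0600);

	if (fd < 0 || zero_extend(fd, 0, 16) != 0)
		return 1;
	close(fd);
	return 0;
}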
However, it looks like that's only beneficial when extending by multiple pages at once, because it ends up causing one filesystem-journal entry for each extension on at least some filesystems.

I added smgrzeroextend(), which can extend by multiple blocks without the caller providing a buffer to write out. When extending by 8 or more blocks, posix_fallocate() is used if available, otherwise pg_pwritev_with_retry() is used to extend the file.

b) I found that it is quite beneficial to bulk-extend the relation with smgrextend() even without concurrency. The reason for that is primarily the aforementioned dirty buffers that our current extension method causes.

One bit that stumped me for quite a while is how to know how much to extend the relation by. RelationGetBufferForTuple() drives the decision whether / how much to bulk extend purely on the contention on the extension lock, which obviously does not work for non-concurrent workloads.

After quite a while I figured out that we actually have good information on how much to extend by, at least for COPY / heap_multi_insert(). heap_multi_insert() can compute how much space is needed to store all tuples, and pass that on to RelationGetBufferForTuple(). For that to be accurate, we need to recompute that number whenever we use an already partially filled page. That's not great, but doesn't appear to be a measurable overhead.

c) Contention on the FSM, and on the pages returned by it, is a serious bottleneck after a) and b).

The biggest issue is that the current bulk insertion logic in hio.c enters all but one of the new pages into the freespacemap. That will immediately cause all the other backends to contend on the first few pages returned by the FSM, and cause contention on the FSM pages themselves.

I've, partially, addressed that by using the information about the required number of pages from b). Whether we bulk insert or not, the pages we know we're going to need for one heap_multi_insert() don't need to be added to the FSM - we're going to use them anyway. I've stashed the number of free blocks in the BulkInsertState for now, but I'm not convinced that that's the right place.

If I revert just this part, the "concurrent COPY into unlogged table" benchmark goes from ~240 tps to ~190 tps.

Even after that change the FSM is a major bottleneck. Below I included benchmarks showing this by just removing the use of the FSM, but I haven't done anything further about it. The contention seems to come both from updating the FSM, as well as from thundering-herd-like symptoms when accessing the FSM. The update part could likely be addressed to some degree with a batch update operation that updates the state for multiple pages. The access part could perhaps be addressed by adding an operation that gets a page and immediately marks it as fully used, so other backends won't also try to access it.

d) Doing

  /* new buffers are zero-filled */
  MemSet((char *) bufBlock, 0, BLCKSZ);

under the extension lock is surprisingly expensive on my two-socket workstation (but much less noticeable on my laptop). If I move the MemSet back under the extension lock, the "concurrent COPY into unlogged table" benchmark goes from ~240 tps to ~200 tps.

e) When running a few benchmarks for this email, I noticed that there was a sharp performance dropoff for the patched code for a pgbench -S -s100 on a database with 1GB s_b, starting between 512 and 1024 clients. This started with the patch only acquiring one buffer partition lock at a time. Lots of debugging ensued, resulting in [3].
The problem isn't actually related to the change; the change just makes it more visible, because the "lock chains" between two partitions reduce the average length of the wait queues substantially, by distributing them between more partitions. [3] has a reproducer that's entirely independent of this patchset.

Bulk extension acquires a number of victim buffers, acquires the extension lock, inserts the buffers into the buffer mapping table and marks them as io-in-progress, calls smgrextend() and releases the extension lock. After that, the buffer[s] are locked (depending on mode and an argument indicating the number of blocks to be locked), and TerminateBufferIO() is called. (A toy model of this lock scoping follows further below, before the benchmark numbers.)

This requires two new pieces of infrastructure:

First, pinning multiple buffers opens up the obvious danger that we might run out of non-pinned buffers. I added LimitAdditional[Local]Pins(), which allows each backend to pin a proportional share of buffers (although always allowing one, as we do today); a toy sketch of that idea follows below as well.

Second, having multiple IOs in progress at the same time isn't possible with the InProgressBuf mechanism. I added ResourceOwnerRememberBufferIO() etc. to deal with that instead. I like that this ends up removing a lot of AbortBufferIO() calls from the loops of various aux processes (now released inside ReleaseAuxProcessResources()). In very extreme workloads (a single backend doing a pgbench -S -s 100 against an s_b=64MB database) the memory allocations triggered by StartBufferIO() are *just about* visible; not sure if that's worth worrying about - we do such allocations for the much more common pinning of buffers as well.

The new [Bulk]ExtendSharedRelationBuffered() functions currently have both a Relation and an SMgrRelation argument, requiring at least one of them to be set. The reason for that is that, on the one hand, LockRelationForExtension() requires a Relation and, on the other hand, redo routines typically don't have a Relation around (recovery doesn't require an extension lock). That's not pretty, but seems a tad better than the ReadBufferExtended() vs ReadBufferWithoutRelcache() mess.

I've done a fair bit of benchmarking of this patchset. For COPY it comes out ahead everywhere. It's possible that there's a very small regression for extremely IO-miss-heavy workloads, more below.

server "base" configuration:

max_wal_size=150GB
shared_buffers=24GB
huge_pages=on
autovacuum=0
backend_flush_after=2MB
max_connections=5000
wal_buffers=128MB
wal_segment_size=1GB

benchmark: pgbench running COPY into a single table. pgbench -t is set according to the client count, so that the same amount of data is inserted. This is done both using small files ([1], ringbuffer not effective, no dirty data to write out within the benchmark window) and a bit larger files ([2], lots of data to write out due to the ringbuffer).

To make it a fair comparison, HEAD includes the lwlock-waitqueue fix as well.
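Here is the toy model mentioned above - plain pthreads, not PostgreSQL code, purely to illustrate the narrowed lock scope: only discovering the current end of the "relation" and claiming the new block number happen under the stand-in extension lock, while the expensive page initialization happens outside of it. In the real patch the locked section additionally covers inserting into the buffer mapping table, marking the buffers IO_IN_PROGRESS and the smgrextend() call:

/*
 * Toy model: the mutex stands in for the relation extension lock, the
 * counter for smgrnblocks().
 */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define BLCKSZ             8192
#define NWORKERS           4
#define EXTENDS_PER_WORKER 10000

static pthread_mutex_t extension_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long relation_blocks = 0;	/* stand-in for smgrnblocks() */

static void *
extend_worker(void *arg)
{
	char page[BLCKSZ];

	(void) arg;
	for (int i = 0; i < EXTENDS_PER_WORKER; i++)
	{
		unsigned long block;

		pthread_mutex_lock(&extension_lock);
		block = relation_blocks++;		/* claim the new block number */
		pthread_mutex_unlock(&extension_lock);

		memset(page, 0, sizeof(page));	/* zero-fill outside the lock */
		(void) block;
	}
	return NULL;
}

int
main(void)
{
	pthread_t workers[NWORKERS];

	for (int i = 0; i < NWORKERS; i++)
		pthread_create(&workers[i], NULL, extend_worker, NULL);
	for (int i = 0; i < NWORKERS; i++)
		pthread_join(workers[i], NULL);

	printf("blocks \"extended\": %lu\n", relation_blocks);
	return 0;
}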
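And the toy sketch of the proportional pin limiting - made-up numbers and names, not the patch's actual LimitAdditional[Local]Pins() code:

/*
 * Toy illustration: a backend may pin additional buffers only up to roughly
 * its proportional share of shared buffers, but is always allowed at least
 * one, as today.
 */
#include <stdio.h>

static int
limit_additional_pins(int shared_buffers, int max_backends, int already_pinned)
{
	int share = shared_buffers / max_backends;
	int allowed = share - already_pinned;

	return (allowed > 0) ? allowed : 1;
}

int
main(void)
{
	/* e.g. 16384 buffers of 8kB (128MB of shared_buffers), 100 backends */
	printf("may pin %d more buffers\n", limit_additional_pins(16384, 100, 3));
	return 0;
}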
s_b=24GB

test: unlogged_small_files, format: text, files: 1024, 9015MB total

          seconds  tbl-MBs    seconds  tbl-MBs    seconds  tbl-MBs
clients   HEAD     HEAD       patch    patch      no_fsm   no_fsm
1         58.63    207        50.22    242        54.35    224
2         32.67    372        25.82    472        27.30    446
4         22.53    540        13.30    916        14.33    851
8         15.14    804        7.43     1640       7.48     1632
16        14.69    829        4.79     2544       4.50     2718
32        15.28    797        4.41     2763       3.32     3710
64        15.34    794        5.22     2334       3.06     4061
128       15.49    786        4.97     2452       3.13     3926
256       15.85    768        5.02     2427       3.26     3769
512       16.02    760        5.29     2303       3.54     3471

test: logged_small_files, format: text, files: 1024, 9018MB total

          seconds  tbl-MBs    seconds  tbl-MBs    seconds  tbl-MBs
clients   HEAD     HEAD       patch    patch      no_fsm   no_fsm
1         68.18    178        59.41    205        63.43    192
2         39.71    306        33.10    368        34.99    348
4         27.26    446        19.75    617        20.09    607
8         18.84    646        12.86    947        12.68    962
16        15.96    763        9.62     1266       8.51     1436
32        15.43    789        8.20     1486       7.77     1579
64        16.11    756        8.91     1367       8.90     1383
128       16.41    742        10.00    1218       9.74     1269
256       17.33    702        11.91    1023       10.89    1136
512       18.46    659        14.07    866        11.82    1049

test: unlogged_medium_files, format: text, files: 64, 9018MB total

          seconds  tbl-MBs    seconds  tbl-MBs    seconds  tbl-MBs
clients   HEAD     HEAD       patch    patch      no_fsm   no_fsm
1         63.27s   192        56.14    217        59.25    205
2         40.17s   303        29.88    407        31.50    386
4         27.57s   442        16.16    754        17.18    709
8         21.26s   573        11.89    1025       11.09    1099
16        21.25s   573        10.68    1141       10.22    1192
32        21.00s   580        10.72    1136       10.35    1178
64        20.64s   590        10.15    1200       9.76     1249
128       skipped
256       skipped
512       skipped

test: logged_medium_files, format: text, files: 64, 9018MB total

          seconds  tbl-MBs    seconds  tbl-MBs    seconds  tbl-MBs
clients   HEAD     HEAD       patch    patch      no_fsm   no_fsm
1         71.89s   169        65.57    217        69.09    69.09
2         47.36s   257        36.22    407        38.71    38.71
4         33.10s   368        21.76    754        22.78    22.78
8         26.62s   457        15.89    1025       15.30    15.30
16        24.89s   489        17.08    1141       15.20    15.20
32        25.15s   484        17.41    1136       16.14    16.14
64        26.11s   466        17.89    1200       16.76    16.76
128       skipped
256       skipped
512       skipped

Just to see how far it can be pushed, with binary format we can now get to nearly 6GB/s into a table when disabling the FSM - note the 2x difference between patch and patch+no-fsm at 32 clients.
test: unlogged_small_files, format: binary, files: 1024, 9508MB total

          seconds  tbl-MBs    seconds  tbl-MBs    seconds  tbl-MBs
clients   HEAD     HEAD       patch    patch      no_fsm   no_fsm
1         34.14    357        28.04    434        29.46    413
2         22.67    537        14.42    845        14.75    826
4         16.63    732        7.62     1599       7.69     1587
8         13.48    904        4.36     2795       4.13     2959
16        14.37    848        3.78     3224       2.74     4493
32        14.79    823        4.20     2902       2.07     5974
64        14.76    825        5.03     2423       2.21     5561
128       14.95    815        4.36     2796       2.30     5343
256       15.18    802        4.31     2828       2.49     4935
512       15.41    790        4.59     2656       2.84     4327

s_b=4GB

test: unlogged_small_files, format: text, files: 1024, 9018MB total

          seconds  tbl-MBs    seconds  tbl-MBs
clients   HEAD     HEAD       patch    patch
1         62.55    194        54.22    224
2         37.11    328        28.94    421
4         25.97    469        16.42    742
8         20.01    609        11.92    1022
16        19.55    623        11.05    1102
32        19.34    630        11.27    1081
64        19.07    639        12.04    1012
128       19.22    634        12.27    993
256       19.34    630        12.28    992
512       19.60    621        11.74    1038

test: logged_small_files, format: text, files: 1024, 9018MB total

          seconds  tbl-MBs    seconds  tbl-MBs
clients   HEAD     HEAD       patch    patch
1         71.71    169        63.63    191
2         46.93    259        36.31    335
4         30.37    401        22.41    543
8         22.86    533        16.90    721
16        20.18    604        14.07    866
32        19.64    620        13.06    933
64        19.71    618        15.08    808
128       19.95    610        15.47    787
256       20.48    595        16.53    737
512       21.56    565        16.86    722

test: unlogged_medium_files, format: text, files: 64, 9018MB total

          seconds  tbl-MBs    seconds  tbl-MBs
clients   HEAD     HEAD       patch    patch
1         62.65    194        55.74    218
2         40.25    302        29.45    413
4         27.37    445        16.26    749
8         22.07    552        11.75    1037
16        21.29    572        10.64    1145
32        20.98    580        10.70    1139
64        20.65    590        10.21    1193
128       skipped
256       skipped
512       skipped

test: logged_medium_files, format: text, files: 64, 9018MB total

          seconds  tbl-MBs    seconds  tbl-MBs
clients   HEAD     HEAD       patch    patch
1         71.72    169        65.12    187
2         46.46    262        35.74    341
4         32.61    373        21.60    564
8         26.69    456        16.30    747
16        25.31    481        17.00    716
32        24.96    488        17.47    697
64        26.05    467        17.90    680
128       skipped
256       skipped
512       skipped

test: unlogged_small_files, format: binary, files: 1024, 9505MB total

          seconds  tbl-MBs    seconds  tbl-MBs
clients   HEAD     HEAD       patch    patch
1         37.62    323        32.77    371
2         28.35    429        18.89    645
4         20.87    583        12.18    1000
8         19.37    629        10.38    1173
16        19.41    627        10.36    1176
32        18.62    654        11.04    1103
64        18.33    664        11.89    1024
128       18.41    661        11.91    1023
256       18.52    658        12.10    1007
512       18.78    648        11.49    1060

benchmark: Run a pgbench -S workload with scale 100, so it doesn't fit into s_b, thereby exercising BufferAlloc()'s buffer replacement path heavily. The run-to-run variance on my workstation is high for this workload (both before/after my changes). I also found that the ramp-up time at higher client counts is very significant:

progress: 2.1 s, 5816.8 tps, lat 1.835 ms stddev 4.450, 0 failed
progress: 3.0 s, 666729.4 tps, lat 5.755 ms stddev 16.753, 0 failed
progress: 4.0 s, 899260.1 tps, lat 3.619 ms stddev 41.108, 0 failed
...

One would need to run pgbench for impractically long to make that effect vanish. My not great solution for these was to run with -T21 -P5 and use the best 5s as the tps.
s_b=1GB

          tps       tps
clients   master    patched
1         49541     48805
2         85342     90010
4         167340    168918
8         308194    303222
16        524294    523678
32        649516    649100
64        932547    937702
128       908249    906281
256       856496    903979
512       764254    934702
1024      653886    925113
2048      569695    917262
4096      526782    903258

s_b=128MB:

          tps       tps
clients   master    patched
1         40407     39854
2         73180     72252
4         143334    140860
8         240982    245331
16        429265    420810
32        544593    540127
64        706408    726678
128       713142    718087
256       611030    695582
512       552751    686290
1024      508248    666370
2048      474108    656735
4096      448582    633040

As there might be a small regression at the smallest end, I ran a more extreme version of the above: a pipelined pgbench -S, with a single client, for longer, with s_b=8MB. To further reduce noise I pinned the server to one CPU, the client to another, and disabled turbo mode on the CPU.

master "total" tps: 61.52
master "best 5s" tps: 61.8
patch "total" tps: 61.20
patch "best 5s" tps: 61.4

Hardly conclusive, but it does look like there's a small effect. It could be code layout or such. My guess, however, is that it's the resource owner for in-progress IO that I added - that adds an additional allocation inside the resowner machinery. I commented those out (that's obviously incorrect!) just to see whether that changes anything:

no-resowner "total" tps: 62.03
no-resowner "best 5s" tps: 62.2

So it does indeed look like it's the resowner. I am a bit surprised, because we already use that mechanism for pins, which are obviously more frequent.

I'm not sure it's worth worrying about - this is a pretty absurd workload. But if we decide it is, I can think of a few ways to address it, e.g.:

- We could preallocate an initial element inside the ResourceArray struct, so that a newly created resowner won't need to allocate immediately.
- We could only use resowners if there's more than one IO in progress at the same time - but I don't like that idea much.
- We could try to store the "in-progress"-ness of a buffer inside the 'bufferpin' resowner entry - on 64-bit systems there's plenty of space for that. But on 32-bit systems...

The patches here aren't fully polished (as will be evident). But they should be more than good enough to discuss whether this is a sane direction.

Greetings,

Andres Freund

[0] https://postgr.es/m/3b108afd19fa52ed20c464a69f64d545e4a14772.camel%40postgrespro.ru
[1] COPY (SELECT repeat(random()::text, 5) FROM generate_series(1, 100000)) TO '/tmp/copytest_data_text.copy' WITH (FORMAT text);
[2] COPY (SELECT repeat(random()::text, 5) FROM generate_series(1, 6*100000)) TO '/tmp/copytest_data_text.copy' WITH (FORMAT text);
[3] https://postgr.es/m/20221027165914.2hofzp4cvutj6...@awork3.anarazel.de