Re: SLRUs in the main buffer pool, redux
Hi, > > Unfortunately the patchset rotted quite a bit since February and needs a > > rebase. > > A consensus was reached to mark this patch as RwF for now. There > are many patches to be reviewed and this one doesn't seem to be in the > best shape, so we have to prioritise. Please feel free re-submitting > the patch for the next commitfest. See also [1] """ [...] Also, please consider joining the efforts and having one thread with a single patchset rather than different threads with different competing patches. This will simplify the work of the reviewers a lot. Personally I would suggest taking one step back and agree on a particular RFC first and then continue working on a single patchset according to this RFC. We did it in the past in similar cases and this approach proved to be productive. [...] """ [1]: https://postgr.es/m/CAJ7c6TME5Z8k4undYUmKavD_dQFL0ujA%2BzFCK1eTH0_pzM%3DXrA%40mail.gmail.com -- Best regards, Aleksander Alekseev
Re: SLRUs in the main buffer pool, redux
Hi, > Unfortunately the patchset rotted quite a bit since February and needs a > rebase. A consensus was reached [1] to mark this patch as RwF for now. There are many patches to be reviewed and this one doesn't seem to be in the best shape, so we have to prioritise. Please feel free re-submitting the patch for the next commitfest. [1]: https://postgr.es/m/0737f444-59bb-ac1d-2753-873c40da0840%40eisentraut.org -- Best regards, Aleksander Alekseev
Re: SLRUs in the main buffer pool, redux
Hi, > > Here's a rebased set of patches. > > > > The second patch is failing the pg_upgrade tests. Before I dig into > > that, I'd love to get some feedback on this general approach. > > Forgot to include the new "slrulist.h" file in the previous patch, fixed > here. Unfortunately the patchset rotted quite a bit since February and needs a rebase. -- Best regards, Aleksander Alekseev
Re: SLRUs in the main buffer pool, redux
On Mon, Jul 25, 2022 at 11:54:36AM +0300, Heikki Linnakangas wrote: > Oh I just saw that you had a comment about that in the patch and had hacked > around it. Anyway, calling ResourceOwnerEnlargeBuffers() might be a > solution. Or switch to a separate "CriticalResourceOwner" that's guaranteed > to have enough pre-allocated space, before entering the critical section. Wanted to bump up this thread. Rishu in my team posted a patch in the other SLRU thread [1] with the latest updates and fixes and looks like performance numbers do not show any regression. This change is currently in the January commitfest [2] as well. Any feedback would be appreciated! [1] https://www.postgresql.org/message-id/A09EAE0D-0D3F-4A34-ADE9-8AC1DCBE7D57%40amazon.com [2] https://commitfest.postgresql.org/41/3514/ Shawn Amazon Web Services (AWS)
Re: SLRUs in the main buffer pool, redux
On Sat, Sep 17, 2022 at 12:41 PM Bagga, Rishu wrote: > While I was working on adding the page headers to SLRU pages on your patch, I > came across this code where it seems like "MultiXactIdToMemberPage" is > mistakenly being used instead of MultiXactIdToOffsetPage in the TrimMultiXact > function. Thanks Rishu. Right. Will fix soon in the next version, along with my long overdue replies to Heikki and Konstantin.
Re: SLRUs in the main buffer pool, redux
Hi Thomas, While I was working on adding the page headers to SLRU pages on your patch, I came across this code where it seems like "MultiXactIdToMemberPage" is mistakenly being used instead of MultiXactIdToOffsetPage in the TrimMultiXact function. Below is the area of concern in the patch: @@ -2045,14 +1977,7 @@ TrimMultiXact(void) oldestMXactDB = MultiXactState->oldestMultiXactDB; LWLockRelease(MultiXactGenLock); - /* Clean up offsets state */ - LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE); - - /* -* (Re-)Initialize our idea of the latest page number for offsets. -*/ - pageno = MultiXactIdToOffsetPage(nextMXact); - MultiXactOffsetCtl->shared->latest_page_number = pag0eno; + pageno = MXOffsetToMemberPage(offset); Let us know if I am missing something here or if it is an error. Sincerely, Rishu Bagga (Amazon Web Services) On 9/16/22, 5:37 PM, "Thomas Munro" wrote: CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. Rebased, debugged and fleshed out a tiny bit more, but still with plenty of TODO notes and questions. I will talk about this idea at PGCon, so I figured it'd help to have a patch that actually applies, even if it doesn't work quite right yet. It's quite a large patch but that's partly because it removes a lot of lines...
Re: SLRUs in the main buffer pool, redux
On 25/07/2022 09:54, Heikki Linnakangas wrote: In RecordTransactionCommit(), we enter a critical section, and then call TransactionIdCommitTree() to update the CLOG pages. That now involves a call to ReadBuffer_common(), which in turn calls ResourceOwnerEnlargeBuffers(). That can fail, because it might require allocating memory, which is forbidden in a critical section. I ran into an assertion about that with "make check" when I was playing around with a heavily modified version of this patch. Haven't seen it with your original one, but I believe that's just luck. Calling ResourceOwnerEnlargeBuffers() before entering the critical section would probably fix that, although I'm a bit worried about having the Enlarge call so far away from the point where it's needed. Oh I just saw that you had a comment about that in the patch and had hacked around it. Anyway, calling ResourceOwnerEnlargeBuffers() might be a solution. Or switch to a separate "CriticalResourceOwner" that's guaranteed to have enough pre-allocated space, before entering the critical section. - Heikki
Re: SLRUs in the main buffer pool, redux
Good day, Thomas В Пт, 27/05/2022 в 23:24 +1200, Thomas Munro пишет: > Rebased, debugged and fleshed out a tiny bit more, but still with > plenty of TODO notes and questions. I will talk about this idea at > PGCon, so I figured it'd help to have a patch that actually applies, > even if it doesn't work quite right yet. It's quite a large patch but > that's partly because it removes a lot of lines... Looks like it have to be rebased again.
Re: SLRUs in the main buffer pool, redux
On Thu, Jun 16, 2022 at 1:13 PM Konstantin Knizhnik wrote: > I wonder which workload can cause CLOG to become a bottleneck? > Usually Postgres uses hint bits to avoid clog access. So standard pgbench > doesn't demonstrate any degrade of performance even in case of presence of > long living transactions, > which keeps XMIN horizon. I haven't done research on this in a number of years and a bunch of other improvements have been made since then, but I remember discovering that CLOG could become a bottleneck on standard pgbench tests especially when using unlogged tables. The higher you raised the scale factor, the more of a bottleneck CLOG became. That makes sense: no matter what the scale factor is, you're constantly updating rows that have not previously been hinted, but as the scale factor gets bigger, those rows are likely to be older (in terms of XID age) on average, and so you need to cache more CLOG buffers to maintain performance. But the SLRU system provides no such flexibility: it doesn't scale to large numbers of buffers the way the code is written, and it certainly can't vary the number of buffers devoted to this purpose at runtime. So I think that the approach we're talking about here has potential in that sense. However, another problem with the SLRU code is that it's old and crufty and hard to work with. It's hard to imagine anyone being able to improve things very much as long as that's the basis. I don't know that whatever code Thomas has written or will write is better, but if it is, that would be good, because I don't see a lot of improvement in this area being possible otherwise. -- Robert Haas EDB: http://www.enterprisedb.com
Re: SLRUs in the main buffer pool, redux
On 28.05.2022 04:13, Thomas Munro wrote: On Fri, May 27, 2022 at 11:24 PM Thomas Munro wrote: Rebased, debugged and fleshed out a tiny bit more, but still with plenty of TODO notes and questions. I will talk about this idea at PGCon, so I figured it'd help to have a patch that actually applies, even if it doesn't work quite right yet. It's quite a large patch but that's partly because it removes a lot of lines... FWIW, here are my PGCon slides about this: https://speakerdeck.com/macdice/improving-the-slru-subsystem There was a little bit of discussion on #pgcon-stream2 which I could summarise as: can we figure out a way to keep parts of the CLOG pinned so that backends don't have to do that for each lookup? Then CLOG checks become simple reads. There may be some relation to the idea of 'nailing' btree root pages that I've heard of from a couple of people now (with ProcSignalBarrier or something more fine grained along those lines if you need to unnail anything). Something to think about. I'm also wondering if it would be possible to do "optimistic" pinning instead for reads that normally need only a pin, using some kind of counter scheme with read barriers to tell you if the page might have been evicted after you read the data... I wonder if there are some tests which can illustrate advantages of storing SLRU pages in shared buffers? In PgPro we had a customer which run PL-PgSql code with recursively called function containing exception handling code. Each exception block creates subtransaction and subxids SLRU becomes bottleneck. I have simulated this workload with large number subxids using the following function: create or replace function do_update(id integer, level integer) returns void as $$ begin begin if level > 0 then perform do_update(id, level-1); else update pgbench_accounts SET abalance = abalance + 1 WHERE aid = id; end if; exception WHEN OTHERS THEN raise notice '% %', SQLERRM, SQLSTATE; end; end; $$ language plpgsql; With the following test script: \set aid random(1, 1000) select do_update(:aid,100) I got the following results: knizhnik@xps:~/db$ pgbench postgres -f update.sql -c 10 -T 100 -P 1 -M prepared pgbench (15beta1) starting vacuum...end. progress: 1.0 s, 3030.8 tps, lat 3.238 ms stddev 1.110, 0 failed progress: 2.0 s, 3018.0 tps, lat 3.303 ms stddev 1.088, 0 failed progress: 3.0 s, 3000.4 tps, lat 3.329 ms stddev 1.063, 0 failed progress: 4.0 s, 2855.6 tps, lat 3.494 ms stddev 1.152, 0 failed progress: 5.0 s, 2747.0 tps, lat 3.631 ms stddev 1.306, 0 failed progress: 6.0 s, 2664.0 tps, lat 3.743 ms stddev 1.410, 0 failed progress: 7.0 s, 2498.0 tps, lat 3.992 ms stddev 1.659, 0 failed ... progress: 93.0 s, 670.0 tps, lat 14.964 ms stddev 10.555, 0 failed progress: 94.0 s, 615.0 tps, lat 16.222 ms stddev 11.419, 0 failed progress: 95.0 s, 580.0 tps, lat 17.251 ms stddev 11.622, 0 failed progress: 96.0 s, 568.0 tps, lat 17.582 ms stddev 11.679, 0 failed progress: 97.0 s, 573.0 tps, lat 17.389 ms stddev 11.771, 0 failed progress: 98.0 s, 611.0 tps, lat 16.428 ms stddev 11.768, 0 failed progress: 99.0 s, 568.0 tps, lat 17.622 ms stddev 11.912, 0 failed progress: 100.0 s, 568.0 tps, lat 17.631 ms stddev 11.672, 0 failed tps = 1035.566054 (without initial connection time) With Thomas patch results are the following: progress: 1.0 s, 2949.8 tps, lat 3.332 ms stddev 1.285, 0 failed progress: 2.0 s, 3009.1 tps, lat 3.317 ms stddev 1.077, 0 failed progress: 3.0 s, 2993.6 tps, lat 3.338 ms stddev 1.099, 0 failed progress: 4.0 s, 3034.4 tps, lat 3.291 ms stddev 1.056, 0 failed ... progress: 97.0 s, 1113.0 tps, lat 8.972 ms stddev 3.885, 0 failed progress: 98.0 s, 1138.0 tps, lat 8.803 ms stddev 3.496, 0 failed progress: 99.0 s, 1174.8 tps, lat 8.471 ms stddev 3.875, 0 failed progress: 100.0 s, 1094.1 tps, lat 9.123 ms stddev 3.842, 0 failed tps = 2133.240094 (without initial connection time) So there is still degrade of performance but smaller than in case of vanilla and total TPS are almost two times higher. And this is another example demonstrating degrade of performance from presentation by Alexander Korotkov: pgbench script: \setaid random(1, 10 * :scale) \setbid random(1, 1 * :scale) \settid random(1, 10 * :scale) \setdelta random(-5000, 5000) BEGIN; INSERT INTOpgbench_history (tid, bid, aid, delta, mtime) VALUES(:tid, :bid, :aid, :delta,CURRENT_TIMESTAMP); SAVEPOINT s1; INSERT INTOpgbench_history (tid, bid, aid, delta, mtime) VALUES(:tid, :bid, :aid, :delta,CURRENT_TIMESTAMP); SAVEPOINT sN; INSERT INTOpgbench_history (tid, bid, aid, delta, mtime) VALUES(:tid, :bid, :aid, :delta,CURRENT_TIMESTAMP); SELECTpg_sleep(1.0); END; I wonder which workload can cause CLOG to become a bottleneck? Usually Postgres uses hint bits to avoid clog access. So standard pgbench doesn't demonstrate any degrade of performance even in case of presence of long living
Re: SLRUs in the main buffer pool, redux
Hi, On 2022-05-28 13:13:20 +1200, Thomas Munro wrote: > There was a little bit of discussion on #pgcon-stream2 which I could > summarise as: can we figure out a way to keep parts of the CLOG pinned > so that backends don't have to do that for each lookup? Then CLOG > checks become simple reads. Included in that is not needing to re-check that the identity of the buffer changed since the last use and to not need a PrivateRefCountEntry. Neither is cheap... I'd structure it so that there's a small list of slru buffers that's pinned in a "shared" mode. Entering the buffer into that increases the BufferDesc's refcount, but is *not* memorialized in the backend's refcount structures, because it's "owned by the SLRU". > There may be some relation to the idea of > 'nailing' btree root pages that I've heard of from a couple of people > now (with ProcSignalBarrier or something more fine grained along those > lines if you need to unnail anything). Something to think about. I'm very doubtful it's a good idea to combine those things - I think it's quite different to come up with a design for SLRUs, of which there's a constant number and shared memory ownership datastructures, and btree root pages etc, of which there are arbitrary many. For the nbtree (and similar) cases, I think it'd make sense to give backends a size-limited number of pages they can keep pinned, but in a backend local way. With, as you suggest, a procsignal barrier or such to force release. > I'm also wondering if it would be possible to do "optimistic" pinning > instead for reads that normally need only a pin, using some kind of > counter scheme with read barriers to tell you if the page might have > been evicted after you read the data... -many That seems fragile and complicated, without, at least to me, a clear need. Greetings, Andres Freund
Re: SLRUs in the main buffer pool, redux
On Fri, May 27, 2022 at 11:24 PM Thomas Munro wrote: > Rebased, debugged and fleshed out a tiny bit more, but still with > plenty of TODO notes and questions. I will talk about this idea at > PGCon, so I figured it'd help to have a patch that actually applies, > even if it doesn't work quite right yet. It's quite a large patch but > that's partly because it removes a lot of lines... FWIW, here are my PGCon slides about this: https://speakerdeck.com/macdice/improving-the-slru-subsystem There was a little bit of discussion on #pgcon-stream2 which I could summarise as: can we figure out a way to keep parts of the CLOG pinned so that backends don't have to do that for each lookup? Then CLOG checks become simple reads. There may be some relation to the idea of 'nailing' btree root pages that I've heard of from a couple of people now (with ProcSignalBarrier or something more fine grained along those lines if you need to unnail anything). Something to think about. I'm also wondering if it would be possible to do "optimistic" pinning instead for reads that normally need only a pin, using some kind of counter scheme with read barriers to tell you if the page might have been evicted after you read the data...
Re: SLRUs in the main buffer pool, redux
On Mon, Jan 17, 2022 at 11:23 PM Heikki Linnakangas wrote: > IIRC one issue with this has been performance. When an SLRU is working > well, a cache hit in the SLRU is very cheap. Linearly scanning the SLRU > array is cheap, compared to computing the hash and looking up a buffer > in the buffer cache. Would be good to do some benchmarking of that. One trick I want to experiment with is trying to avoid the mapping table lookup by using ReadRecentBuffer(). In the prototype patch I do that for one buffer per SLRU cache (so that'll often point to the head CLOG page), but it could be extended to a small array of recently accessed buffers to scan linearly, much like the current SLRU happy case except that it's not shared and doesn't need a lock so it's even happier. I'm half joking here, but that would let us keep calling this subsystem SLRU :-) > I wanted to do this with the CSN (Commit Sequence Number) work a long > time ago. That would benefit from something like an SLRU with very fast > access, to look up CSNs. Even without CSNs, if the CLOG was very fast to > access we might not need hint bits anymore. I guess trying to make SLRUs > faster than they already are is moving the goalposts, but at a minimum > let's make sure they don't get any slower. One idea I've toyed with is putting a bitmap into shared memory where each bit corresponds to a range of (say) 256 xids, accessed purely with atomics. If a bit is set we know they're all committed, and otherwise you have to do more pushups. I've attached a quick and dirty experimental patch along those lines, but I don't think it's quite right, especially with people waving 64 bit xid patches around. Perhaps it'd need to be a smaller sliding window, somehow, and perhaps people will balk at the wild and unsubstantiated assumption that transactions generally commit. From ee0db89b7c054ca0297d42ef94058b37bb8b9436 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Fri, 14 Jan 2022 11:15:48 +1300 Subject: [PATCH] WIP: Try to cache CLOG results. To make CLOG lookups faster, maintain a bitmap of xid groups that are known to be all committed, using atomic ops. XXX Just an experiment... --- src/backend/access/transam/clog.c | 144 +- 1 file changed, 143 insertions(+), 1 deletion(-) diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c index de787c3d37..0678fa828c 100644 --- a/src/backend/access/transam/clog.c +++ b/src/backend/access/transam/clog.c @@ -41,6 +41,7 @@ #include "miscadmin.h" #include "pg_trace.h" #include "pgstat.h" +#include "port/atomics.h" #include "storage/proc.h" #include "storage/sync.h" @@ -81,6 +82,127 @@ */ #define THRESHOLD_SUBTRANS_CLOG_OPT5 +/* + * The number of transactions to summarize into a single bit of shared memory + * that says 'everything in this group has committed'. Must be a power of two + * and <= CLOG_XACTS_PER_PAGE. Currently set so that it corresponds to one + * cache line of CLOG page, so that it's cheap to scan a group on every + * commit. That results in a 1MB summary array. XXX Many other policies + * including lazy computation and course granularity are possible. + */ +#define CLOG_XACTS_PER_SUMMARY_GROUP (64 * CLOG_XACTS_PER_BYTE) + +/* Fixed size of the array of all-committed summary bits, in bytes. */ +#define CLOG_SUMMARY_SIZE (0x8000 / CLOG_XACTS_PER_SUMMARY_GROUP / 8) + +static pg_atomic_uint32 *CLOGSummary; + +/* + * Write the all-committed bit for the summary group containing 'xid'. + */ +static void +TransactionXidSetSummaryBit(TransactionId xid, bool value) +{ + size_t bit_index = (xid & 0x7fff) / CLOG_XACTS_PER_SUMMARY_GROUP; + size_t bits_per_word = sizeof(uint32) * 8; + size_t word_index = bit_index / bits_per_word; + uint32 mask = 1 << (bit_index % bits_per_word); + + if (value) + pg_atomic_fetch_or_u32([word_index], mask); + else + pg_atomic_fetch_and_u32([word_index], ~mask); +} + +/* + * Read the all-committed bit for the summary group containing 'xid'. + */ +static bool +TransactionXidGetSummaryBit(TransactionId xid) +{ + size_t bit_index = (xid & 0x7fff) / CLOG_XACTS_PER_SUMMARY_GROUP; + size_t bits_per_word = sizeof(uint32) * 8; + size_t word_index = bit_index / bits_per_word; + uint32 mask = 1 << (bit_index % bits_per_word); + + return (pg_atomic_read_u32([word_index]) & mask) != 0; +} + +/* + * Summarize a cache line's worth of CLOG statuses into an all-committed bit, + * if possible. Called for each CLOG commit, so had better be fast. + */ +static void +TransactionXidSummarizeGroup(TransactionId xid, char *page) +{ + int first_chunk; + int end_chunk; + uint64 all_committed_chunk; + uint64 *page_chunks = (uint64 *) page; + + /* Already set? */ + if (unlikely(TransactionXidGetSummaryBit(xid))) + return; + + /* +* XXX Async commit LSNs should prevent
Re: SLRUs in the main buffer pool, redux
On 13/01/2022 15:59, Thomas Munro wrote: Hi, I was re-reviewing the proposed batch of GUCs for controlling the SLRU cache sizes[1], and I couldn't resist sketching out $SUBJECT as an obvious alternative. This patch is highly experimental and full of unresolved bits and pieces (see below for some), but it passes basic tests and is enough to start trying the idea out and figuring out where the real problems lie. The hypothesis here is that CLOG, multixact, etc data should compete for space with relation data in one unified buffer pool so you don't have to tune them, and they can benefit from the better common implementation (mapping, locking, replacement, bgwriter, checksums, etc and eventually new things like AIO, TDE, ...). +1 I know that many people have talked about doing this and maybe they already have patches along these lines too; I'd love to know what others imagined differently/better. IIRC one issue with this has been performance. When an SLRU is working well, a cache hit in the SLRU is very cheap. Linearly scanning the SLRU array is cheap, compared to computing the hash and looking up a buffer in the buffer cache. Would be good to do some benchmarking of that. I wanted to do this with the CSN (Commit Sequence Number) work a long time ago. That would benefit from something like an SLRU with very fast access, to look up CSNs. Even without CSNs, if the CLOG was very fast to access we might not need hint bits anymore. I guess trying to make SLRUs faster than they already are is moving the goalposts, but at a minimum let's make sure they don't get any slower. - Heikki
Re: SLRUs in the main buffer pool, redux
On Thu, Jan 13, 2022 at 9:00 AM Thomas Munro wrote: > I was re-reviewing the proposed batch of GUCs for controlling the SLRU > cache sizes[1], and I couldn't resist sketching out $SUBJECT as an > obvious alternative. This patch is highly experimental and full of > unresolved bits and pieces (see below for some), but it passes basic > tests and is enough to start trying the idea out and figuring out > where the real problems lie. The hypothesis here is that CLOG, > multixact, etc data should compete for space with relation data in one > unified buffer pool so you don't have to tune them, and they can > benefit from the better common implementation (mapping, locking, > replacement, bgwriter, checksums, etc and eventually new things like > AIO, TDE, ...). I endorse this hypothesis. The performance cliff when the XID range you're regularly querying exceeds the hardcoded constant is quite steep, and yet we can't just keep pushing that constant up. Linear search does not scale well to infinitely large arrays.[citation needed] > [ long list of dumpster-fire level problems with the patch ] Honestly, none of this sounds that bad. I mean, it sounds bad in the sense that you're going to have to fix all of this somehow and I'm going to unhelpfully give you no advice whatsoever about how to do that, but my guess is that a moderate amount of persistence will be sufficient for you to get the job done. None of it sounds hopeless. Before fixing all of that, one thing you might want to consider is whether it uh, works. And by "work" I don't mean "get the right answer" even though I agree with my esteemed fellow hacker that this is an important thing to do.[1] What I mean is that it would be good to see some evidence that the number of buffers that end up being used to cache any particular SLRU is somewhat sensible, and varies by workload. For example, consider a pgbench workload. As you increase the scale factor, the age of the oldest XIDs that you regularly encounter will also increase, because on the average, the row you're now updating will not have been updated for a larger number of transactions. So it would be interesting to know whether all of the CLOG buffers that are regularly being accessed do in fact remain in cache - and maybe even whether buffers that stop being regularly accessed get evicted in the face of cache pressure. Also, the existing clog code contains a guard that absolutely prevents the latest CLOG buffer from being evicted. Because - at least in a pgbench test like the one postulated above, and probably in general - the frequency of access to older CLOG buffers decays exponentially, evicting the newest or even the second or third newest CLOG buffer is really bad. At present, there's a hard-coded guard to prevent the newest buffer from being evicted, which is a band-aid, but an effective one. Even with that band-aid, evicting any of the most recent few can produce a system-wide stall, where every backend ends up waiting for the evicted buffer to be retrieved. It would be interesting to know whether the same problem can be recreated with your patch, because the buffer eviction algorithm for shared buffers is only a little bit less dumb than the one for SLRUs, and can pretty commonly devolve into little more than evict-at-random. Evict-at-random is very bad here, because evicting a hot CLOG page is probably even worse than evicting, say, a btree root page. Another interesting test might be one that puts pressure on some other SLRU, like pg_multixact or pg_subtrans. In general SLRU pages that are actually being used are hot enough that we should keep them in cache almost no matter what else is competing for cache space ... but the number of such pages can be different from one SLRU to another, and can change over time. -- Robert Haas EDB: http://www.enterprisedb.com [1] http://postgr.es/m/3151122.1642086...@sss.pgh.pa.us
SLRUs in the main buffer pool, redux
Hi, I was re-reviewing the proposed batch of GUCs for controlling the SLRU cache sizes[1], and I couldn't resist sketching out $SUBJECT as an obvious alternative. This patch is highly experimental and full of unresolved bits and pieces (see below for some), but it passes basic tests and is enough to start trying the idea out and figuring out where the real problems lie. The hypothesis here is that CLOG, multixact, etc data should compete for space with relation data in one unified buffer pool so you don't have to tune them, and they can benefit from the better common implementation (mapping, locking, replacement, bgwriter, checksums, etc and eventually new things like AIO, TDE, ...). I know that many people have talked about doing this and maybe they already have patches along these lines too; I'd love to know what others imagined differently/better. In the attached sketch, the SLRU caches are psuedo-relations in pseudo-database 9. Yeah. That's a straw-man idea stolen from the Zheap/undo project[2] (I also stole DiscardBuffer() from there); better ideas for identifying these buffers without making BufferTag bigger are very welcome. You can list SLRU buffers with: WITH slru(relfilenode, path) AS (VALUES (0, 'pg_xact'), (1, 'pg_multixact/offsets'), (2, 'pg_multixact/members'), (3, 'pg_subtrans'), (4, 'pg_serial'), (5, 'pg_commit_ts'), (6, 'pg_notify')) SELECT bufferid, path, relblocknumber, isdirty, usagecount, pinning_backends FROM pg_buffercache NATURAL JOIN slru WHERE reldatabase = 9 ORDER BY path, relblocknumber; Here are some per-cache starter hypotheses about locking that might be completely wrong and obviously need real analysis and testing. pg_xact: I couldn't easily get rid of XactSLRULock, because it didn't just protect buffers, it's also used to negotiate "group CLOG updates". (I think it'd be nice to replace that system with an atomic page update scheme so that concurrent committers stay on CPU, something like [3], but that's another topic.) I decided to try a model where readers only have to pin the page (the reads are sub-byte values that we can read atomically, and you'll see a value as least as fresh as the time you took the pin, right?), but writers have to take an exclusive content lock because otherwise they'd clobber each other at byte level, and because they need to maintain the page LSN consistently. Writing back is done with a share lock as usual and log flushing can be done consistently. I also wanted to try avoiding the extra cost of locking and accessing the buffer mapping table in common cases, so I use ReadRecentBuffer() for repeat access to the same page (this applies to the other SLRUs too). pg_subtrans: I got rid of SubtransSLRULock because it only protected page contents. Can be read with only a pin. Exclusive page content lock to write. pg_multixact: I got rid of the MultiXact{Offset,Members}SLRULock locks. Can be read with only a pin. Writers take exclusive page content lock. The multixact.c module still has its own MultiXactGenLock. pg_commit_ts: I got rid of CommitTsSLRULock since it only protected buffers, but here I had to take shared content locks to read pages, since the values can't be read atomically. Exclusive content lock to write. pg_serial: I could not easily get rid of SerialSLRULock, because it protects the SLRU + also some variables in serialControl. Shared and exclusive page content locks. pg_notify: I got rid of NotifySLRULock. Shared and exclusive page content locks are used for reading and writing. The module still has a separate lock NotifyQueueLock to coordinate queue positions. Some problems tackled incompletely: * I needed to disable checksums and in-page LSNs, since SLRU pages hold raw data with no header. We'd probably eventually want regular (standard? formatted?) pages (the real work here may be implementing FPI for SLRUs so that checksums don't break your database on torn writes). In the meantime, suppressing those things is done by the kludge of recognising database 9 as raw data, but there should be something better than this. A separate array of size NBuffer holds "external" page LSNs, to drive WAL flushing. * The CLOG SLRU also tracks groups of async commit LSNs in a fixed sized array. The obvious translation would be very wasteful (an array big enough for NBuffers * groups per page), but I hope that there is a better way to do this... in the sketch patch I changed it to use the single per-page LSN for simplicity (basically group size is 32k instead of 32...), which is certainly not good enough. Some stupid problems not tackled yet: * It holds onto the virtual file descriptor for the last segment accessed, but there is no invalidation for