Re: index prefetching
On 1/16/24 2:10 PM, Konstantin Knizhnik wrote: Amazon RDS is just vanilla Postgres with file system mounted on EBS (Amazon distributed file system). EBS provides good throughput but larger latencies compared with local SSDs. I am not sure if read-ahead works for EBS.

Actually, EBS only provides a block device - it's definitely not a filesystem itself (*EFS* is a filesystem - but it's also significantly different than EBS). So as long as readahead is happening somewhere above the block device I would expect it to JustWork on EBS.

Of course, Aurora Postgres (like Neon) is completely different. If you look at page 53 of [1] you'll note that there are two different terms used: prefetch and batch. I'm not sure how much practical difference there is, but batched IO (one IO request to Aurora Storage for many blocks) predates index prefetch; VACUUM in APG has used batched IO for a very long time (it also *only* reads blocks that aren't marked all visible/frozen; none of the "only skip if skipping at least 32 blocks" logic is used).

1: https://d1.awsstatic.com/events/reinvent/2019/REPEAT_1_Deep_dive_on_Amazon_Aurora_with_PostgreSQL_compatibility_DAT328-R1.pdf

-- Jim Nasby, Data Architect, Austin TX
Re: Emit fewer vacuum records by reaping removable tuples during pruning
On 1/12/24 12:45 PM, Robert Haas wrote: P.P.S. to everyone: Yikes, this logic is really confusing.

Having studied all this code several years ago when it was even simpler - it was *still* very hard to grok even back then. I *greatly appreciate* the effort that y'all are putting into increasing the clarity here.

BTW, back in the day the whole "no indexes" optimization was a really tiny amount of code... I think it amounted to 2 or 3 if statements. I haven't yet attempted to grok this patchset, but I'm definitely wondering how much it's worth continuing to optimize that case. Clearly it'd be very expensive to memoize dead tuples just to trawl that list a single time to clean the heap, but outside of that I'm not sure other optimizations are worth it given the amount of code complexity/duplication they seem to require - especially for code where correctness is so crucial.

-- Jim Nasby, Data Architect, Austin TX
Re: Make NUM_XLOGINSERT_LOCKS configurable
On 1/12/24 12:32 AM, Bharath Rupireddy wrote:

Test case:
./pgbench --initialize --scale=100 --username=ubuntu postgres
./pgbench --progress=10 --client=64 --time=300 --builtin=tpcb-like --username=ubuntu postgres

Setup:
./configure --prefix=$PWD/inst/ CFLAGS="-ggdb3 -O3" > install.log && make -j 8 install > install.log 2>&1 &
shared_buffers = '8GB'
max_wal_size = '32GB'
track_wal_io_timing = on

Stats measured: I've used the attached patch to measure WAL Insert Lock Acquire Time (wal_insert_lock_acquire_time) and WAL Wait for In-progress Inserts to Finish Time (wal_wait_for_insert_to_finish_time).

Unfortunately this leaves open the question of how frequently WaitXLogInsertionsToFinish() is being called and by whom. One possibility here is that wal_buffers is too small, so backends are constantly having to write WAL data to free up buffers.

-- Jim Nasby, Data Architect, Austin TX
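A quick way to sanity-check the wal_buffers theory (assuming PostgreSQL 14 or later, where pg_stat_wal is available) is to watch how often a backend had to write WAL simply because it ran out of WAL buffers:

-- wal_buffers_full counts WAL writes forced by running out of wal_buffers;
-- if it climbs rapidly during the benchmark, wal_buffers is likely too small.
SELECT wal_buffers_full, wal_write, wal_sync, stats_reset
FROM pg_stat_wal;

-- Reset the shared WAL statistics between runs to get per-run numbers.
SELECT pg_stat_reset_shared('wal');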
Re: Confine vacuum skip logic to lazy_scan_skip
On 1/11/24 5:50 PM, Melanie Plageman wrote: On Fri, Jan 5, 2024 at 5:51 AM Nazir Bilal Yavuz wrote: On Fri, 5 Jan 2024 at 02:25, Jim Nasby wrote: On 1/4/24 2:23 PM, Andres Freund wrote: On 2024-01-02 12:36:18 -0500, Melanie Plageman wrote: Subject: [PATCH v2 1/6] lazy_scan_skip remove unnecessary local var rel_pages Subject: [PATCH v2 2/6] lazy_scan_skip remove unneeded local var nskippable_blocks I think these may lead to worse code - the compiler has to reload vacrel->rel_pages/next_unskippable_block for every loop iteration, because it can't guarantee that they're not changed within one of the external functions called in the loop body. Admittedly I'm not up to speed on recent vacuum changes, but I have to wonder if the concept of skipping should go away in the context of vector IO? Instead of thinking about "we can skip this range of blocks", why not maintain a list of "here's the next X number of blocks that we need to vacuum"? Sorry if I misunderstood. AFAIU, with the help of vectored IO, "the next X number of blocks that need to be vacuumed" will be prefetched by calculating the unskippable blocks (using the lazy_scan_skip() function) and the X will be determined by Postgres itself. Do you have something different in mind? I think you are both right. As we gain more control of readahead from within Postgres, we will likely want to revisit this heuristic as it may not serve us anymore. But the streaming read interface/vectored I/O is also not a drop-in replacement for it. To change anything and ensure there is no regression, we will probably have to do cross-platform benchmarking, though. That being said, I would absolutely love to get rid of the skippable ranges because I find them very error-prone and confusing. Hopefully now that the skipping logic is isolated to a single function, it will be easier not to trip over it when working on lazy_scan_heap().

Yeah, arguably it's just a matter of semantics, but IMO it's a lot clearer to simply think in terms of "here's the next blocks we know we want to vacuum" instead of "we vacuum everything, but sometimes we skip some blocks".

-- Jim Nasby, Data Architect, Austin TX
Re: Add BF member koel-like indentation checks to SanityCheck CI
On 1/9/24 3:20 PM, Tom Lane wrote: In short, I don't think that putting this into CI is the answer. Putting it into committers' standard workflow is a better idea, if we can get all the committers on board with that.

FWIW, that's the approach that go takes - not only for committing to go itself, but it is *strongly* recommended[1] that anyone writing any code in go makes running `go fmt` a standard part of their workflow. In my experience, it makes collaborating noticeably easier because you never need to worry about formatting differences. FYI, vim makes this easy via vim-autoformat[2] (which also supports line-by-line formatting if the format tool allows it); presumably any modern editor has similar support.

1: Literally 3rd item at https://go.dev/doc/effective_go
2: https://github.com/vim-autoformat/vim-autoformat

-- Jim Nasby, Data Architect, Austin TX
Re: Emit fewer vacuum records by reaping removable tuples during pruning
On 1/8/24 2:10 PM, Robert Haas wrote: On Fri, Jan 5, 2024 at 3:57 PM Andres Freund wrote: I will be astonished if you can make this work well enough to avoid huge regressions in plausible cases. There are plenty of cases where we do a very thorough job opportunistically removing index tuples. These days the AM is often involved with that, via table_index_delete_tuples()/heap_index_delete_tuples(). That IIRC has to happen before physically removing the already-marked-killed index entries. We can't rely on being able to actually prune the heap page at that point, there might be other backends pinning it, but often we will be able to. If we were to prune below heap_index_delete_tuples(), we wouldn't need to recheck that index again during "individual tuple pruning", if the to-be-marked-unused heap tuple is one of the tuples passed to heap_index_delete_tuples(). Which presumably will be very commonly the case. At least for nbtree, we are much more aggressive about marking index entries as killed, than about actually removing the index entries. "individual tuple pruning" would have to look for killed-but-still-present index entries, not just for "live" entries. I don't want to derail this thread, but I don't really see what you have in mind here. The first paragraph sounds like you're imagining that while pruning the index entries we might jump over to the heap and clean things up there, too, but that seems like it wouldn't work if the table has more than one index. I thought you were talking about starting with a heap tuple and bouncing around to every index to see if we can find index pointers to kill in every one of them. That *could* work out, but you only need one index to have been opportunistically cleaned up in order for it to fail to work out. There might well be some workloads where that's often the case, but the regressions in the workloads where it isn't the case seem like they would be rather substantial, because doing an extra lookup in every index for each heap tuple visited sounds pricey.

The idea of probing indexes for tuples that are now dead has come up in the past, and the concern has always been whether it's actually safe to do so. An obvious example is an index on a function whose behavior has since changed, so you can't reliably determine whether a particular tuple is present in the index. That's bad enough during an index scan, but potentially worse while doing heap cleanup. Given that operators are functions, this risk exists to some degree in even simple indexes.

Depending on the gains this might still be worth doing, at least for some cases. It's hard to conceive of this breaking for indexes on integers, for example. But we'd still need to be cautious.

-- Jim Nasby, Data Architect, Austin TX
Re: Random pg_upgrade test failure on drongo
On 1/4/24 10:19 PM, Amit Kapila wrote: On Thu, Jan 4, 2024 at 5:30 PM Alexander Lakhin wrote: 03.01.2024 14:42, Amit Kapila wrote: And the internal process is ... background writer (BgBufferSync()). So, I tried just adding bgwriter_lru_maxpages = 0 to postgresql.conf and got 20 x 10 tests passing. Thus, if we want just to get rid of the test failure, maybe it's enough to add this to the test's config... What about checkpoints? Can't it do the same while writing the buffers? As we deal here with pg_upgrade/pg_restore, it must not be very easy to get the desired effect, but I think it's not impossible in principle. More details below. What happens during the pg_upgrade execution is essentially:
1) CREATE DATABASE "postgres" WITH TEMPLATE = template0 OID = 5 ...; -- this command flushes file buffers as well
2) ALTER DATABASE postgres OWNER TO ...
3) COMMENT ON DATABASE "postgres" IS ...
4) -- For binary upgrade, preserve pg_largeobject and index relfilenodes
SELECT pg_catalog.binary_upgrade_set_next_index_relfilenode('2683'::pg_catalog.oid);
SELECT pg_catalog.binary_upgrade_set_next_heap_relfilenode('2613'::pg_catalog.oid);
TRUNCATE pg_catalog.pg_largeobject;
-- ^^^ here we can get the error "could not create file "base/5/2683": File exists"
...
We get the effect discussed when the background writer process decides to flush a file buffer for pg_largeobject during stage 1. (Thus, if a checkpoint somehow happened to occur during CREATE DATABASE, the result must be the same.) And another important factor is shared_buffers = 1MB (set during the test). With the default setting of 128MB I couldn't see the failure. It can be reproduced easily (on old Windows versions) just by running pg_upgrade in a loop (I've got failures on iterations 22, 37, 17 (with the default cluster)). If an old cluster contains a dozen databases, this increases the failure probability significantly (with 10 additional databases I've got failures on iterations 4, 1, 6). I don't have an old Windows environment to test but I agree with your analysis and theory. The question is what should we do for these new random BF failures? I think we should set bgwriter_lru_maxpages to 0 and checkpoint_timeout to 1hr for these new tests. Doing some invasive fix as part of this doesn't sound reasonable because this is an existing problem and there seems to be another patch by Thomas that probably deals with the root cause of the existing problem [1] as pointed out by you. [1] - https://commitfest.postgresql.org/40/3951/

Isn't this just sweeping the problem (non-POSIX behavior on SMB and ReFS) under the carpet? I realize that synthetic test workloads like pg_upgrade in a loop aren't themselves real-world scenarios, but what about other cases? Even if we're certain it's not possible for these issues to wedge a server, it's still not a good experience for users to get random, unexplained IO-related errors...

-- Jim Nasby, Data Architect, Austin TX
Re: Confine vacuum skip logic to lazy_scan_skip
On 1/4/24 2:23 PM, Andres Freund wrote: On 2024-01-02 12:36:18 -0500, Melanie Plageman wrote: Subject: [PATCH v2 1/6] lazy_scan_skip remove unnecessary local var rel_pages Subject: [PATCH v2 2/6] lazy_scan_skip remove unneeded local var nskippable_blocks I think these may lead to worse code - the compiler has to reload vacrel->rel_pages/next_unskippable_block for every loop iteration, because it can't guarantee that they're not changed within one of the external functions called in the loop body. Admittedly I'm not up to speed on recent vacuum changes, but I have to wonder if the concept of skipping should go away in the context of vector IO? Instead of thinking about "we can skip this range of blocks", why not maintain a list of "here's the next X number of blocks that we need to vacuum"? -- Jim Nasby, Data Architect, Austin TX
Re: the s_lock_stuck on perform_spin_delay
On 1/4/24 10:33 AM, Tom Lane wrote: Robert Haas writes: On Thu, Jan 4, 2024 at 10:22 AM Tom Lane wrote: We should be making an effort to ban coding patterns like "return with spinlock still held", because they're just too prone to errors similar to this one. I agree that we don't want to add overhead, and also about how spinlocks should be used, but I dispute "easily detectable statically." I mean, if you or I look at some code that uses a spinlock, we'll know whether the pattern that you mention is being followed or not, modulo differences of opinion in debatable cases. But you and I cannot be there to look at all the code all the time. If we had a static checking tool that was run as part of every build, or in the buildfarm, or by cfbot, or somewhere else that raised the alarm if this rule was violated, then we could claim to be effectively enforcing this rule. I was indeed suggesting that maybe we could find a way to detect such things automatically. While I've not been paying close attention, I recall there's been some discussions of using LLVM/clang infrastructure for customized static analysis, so maybe it'd be possible without an undue amount of effort.

FWIW, the lackey[1] tool in Valgrind is able to do some kinds of instruction counting, so it might be possible to measure how many instructions are actually being executed while holding a spinlock. Might be easier than code analysis. Another possibility might be using the CPU's timestamp counter.

1: https://valgrind.org/docs/manual/lk-manual.html

-- Jim Nasby, Data Architect, Austin TX
Re: doing also VM cache snapshot and restore with pg_prewarm, having more information of the VM inside PostgreSQL
On 1/3/24 5:57 PM, Cedric Villemain wrote: for 15 years pgfincore has been sitting quietly and being used in large setups to help in HA and resources management. It can perfectly stay as is, to be honest I was expecting to one day include a windows support and propose that to PostgreSQL, it appears getting support on linux and BSD is more than enough today. So I wonder if there is interest for having virtual memory snapshot and restore operations with, for example, pg_prewarm/autowarm ?

IMHO, to improve the user experience here we'd need something that combined the abilities of all these extensions into a cohesive interface that allowed users to simply say "please get this data into cache". Simply moving pgfincore into core Postgres wouldn't satisfy that need. So I think the real question is whether the community feels support for better cache (buffercache + filesystem) management is a worthwhile feature to add to Postgres.

Micromanaging cache contents for periodic jobs seems almost like a mis-feature. While it's a useful tool to have in the toolbox, it's also a non-trivial layer of complexity. IMHO not something we'd want to add. Though, there might be smaller items that would make creating tools to do that easier, such as some ability to see what blocks a backend is accessing (perhaps via a hook).

On the surface, improving RTO via cache warming sounds interesting ... but I'm not sure how useful it would be in reality. Users that care about RTO would almost always have some form of hot-standby, and generally those will already have a lot of data in cache. While they won't have read-only data in cache, I have to wonder if the answer to that problem is allowing writers to tell a replica what blocks are being read, so the replica can keep them in cache. Also, most (all?) operations that require a restart could be handled via a failover, so I'm not sure how much cache management moves the needle there.

Some use cases covered: snapshot/restore cache around cronjobs, around dumps, switchover, failover, on stop/start of postgres (think kernel upgrade with a cold restart), ... pgfincore also provides some nice information with mincore (on FreeBSD mincore is more interesting) or cachestat, again it can remain as an out of tree extension but I will be happy to add to commitfest if there is interest from the community. An example of cachestat output:

postgres=# select * from vm_relation_cachestat('foo',range:=1024*32);
 block_start | block_count | nr_pages | nr_cache | nr_dirty | nr_writeback | nr_evicted | nr_recently_evicted
-------------+-------------+----------+----------+----------+--------------+------------+---------------------
           0 |       32768 |    65536 |    62294 |        0 |            0 |       3242 |                3242
       32768 |       32768 |    65536 |    39279 |        0 |            0 |      26257 |               26257
       65536 |       32768 |    65536 |    22516 |        0 |            0 |      43020 |               43020
       98304 |       32768 |    65536 |    24944 |        0 |            0 |      40592 |               40592
      131072 |        1672 |     3344 |      487 |        0 |            0 |       2857 |                2857

Comments?

---
Cédric Villemain +33 (0)6 20 30 22 52
https://Data-Bene.io
PostgreSQL Expertise, Support, Training, R&D

-- Jim Nasby, Data Architect, Austin TX
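For comparison, the closest thing available in core today is pg_prewarm, which covers part of the "please get this data into cache" case going forward but knows nothing about what was cached before a restart (autoprewarm only records shared_buffers contents, not the filesystem cache). A minimal example:

CREATE EXTENSION IF NOT EXISTS pg_prewarm;
-- Hint the kernel page cache asynchronously (posix_fadvise where available), or...
SELECT pg_prewarm('foo', 'prefetch');
-- ...pull the blocks all the way into shared_buffers.
SELECT pg_prewarm('foo', 'buffer');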
Re: SET ROLE x NO RESET
On 1/3/24 5:47 PM, Nico Williams wrote: Though maybe `NO RESET` isn't really needed to build these, since after all one could use an unprivileged role and a SECURITY DEFINER function that does the `SET ROLE` following some user-defined authentication method, and so what if the client can RESET the role, since that brings it back to the otherwise unprivileged role.

An unprivileged role that has the ability to assume any other role ;p Also, last I checked you can't do SET ROLE in a security definer function.

Who needs to RESET roles anyways? Answer: connection pools, but not every connection is used via a pool. This brings up something: attempts to reset a NO RESET session need to fail in such a way that a connection pool can detect this and disconnect, or else it needs to fail by terminating the connection altogether.

RESET ROLE is also useful for setting up a "superuser role" that DBAs have access to via a NO INHERIT role. It allows them to do things like...

SET ROLE su;
-- Do some superuserly thing
RESET ROLE;

Admittedly, the last step could be just to set their role back to themselves, but RESET ROLE removes the risk of typos.

-- Jim Nasby, Data Architect, Austin TX
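A minimal sketch of that setup (the role names here are purely illustrative):

-- The privileged role nobody logs in as directly
CREATE ROLE su SUPERUSER NOLOGIN;
-- The DBA logs in unprivileged and does not automatically inherit su's privileges...
CREATE ROLE alice LOGIN NOINHERIT;
-- ...but may escalate explicitly with SET ROLE su when needed
GRANT su TO alice;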
Re: add function argument names to regex* functions.
On 1/3/24 5:05 PM, Dian Fay wrote: Another possibility is `index`, which is relatively short and not a reserved keyword ^1. `position` is not as precise but would avoid the conceptual overloading of ordinary indices. I'm not a fan of "index" since that leaves the question of whether it's 0 or 1 based. "Position" is a bit better, but I think Jian's suggestion of "occurrence" is best. We do have precedent for one-based `index` in Postgres: array types are 1-indexed by default! "Occurrence" removes that ambiguity but it's long and easy to misspell (I looked it up after typing it just now and it _still_ feels off). How's "instance"?

Presumably someone referencing arguments by name would have just looked up the names via \df or whatever, so misspelling wouldn't be a big issue. But I think "instance" is OK as well.

-- Jim Nasby, Data Architect, Austin TX
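For reference, the positional form of this call already exists (the start/N variant of regexp_replace was added in PostgreSQL 15); only the parameter names are under discussion here, so the named-notation call below is hypothetical:

-- Works today: replace only the 2nd match, starting the search at position 1
SELECT regexp_replace('one two three', '\w+', 'X', 1, 2);

-- With named arguments along the lines being discussed (the names are hypothetical):
SELECT regexp_replace('one two three', '\w+', 'X', start => 1, occurrence => 2);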
Re: Extension Enhancement: Buffer Invalidation in pg_buffercache
On 1/3/24 10:25 AM, Cédric Villemain wrote: Hi Palak, I did a quick review of the patch:

+CREATE FUNCTION pg_buffercache_invalidate(IN int, IN bool default true)
+RETURNS bool
+AS 'MODULE_PATHNAME', 'pg_buffercache_invalidate'
+LANGUAGE C PARALLEL SAFE;

--> Not enforced anywhere, but you can also add a comment to the function, for end users... The arguments should also have names...

+ force = PG_GETARG_BOOL(1);

I think you also need to test PG_ARGISNULL with the force parameter.

Actually, that's true for the first argument as well. Or, just mark the function as STRICT.

-- Jim Nasby, Data Architect, Austin TX
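To illustrate the STRICT suggestion: a STRICT function is never called with NULL arguments (the result is simply NULL), so the C code would not need any PG_ARGISNULL checks at all. A sketch of what the declaration might look like with argument names added as well (the names and the comment text are just guesses at the intended semantics):

CREATE FUNCTION pg_buffercache_invalidate(IN buffernum int, IN force bool DEFAULT true)
RETURNS bool
AS 'MODULE_PATHNAME', 'pg_buffercache_invalidate'
LANGUAGE C STRICT PARALLEL SAFE;

COMMENT ON FUNCTION pg_buffercache_invalidate(int, bool) IS
'Try to evict the given buffer from shared_buffers; with force, evict even if it is dirty.';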
Re: Password leakage avoidance
On 1/3/24 7:53 AM, Robert Haas wrote: Also, +1 for the general idea. I don't think this is a whole answer to the problem of passwords appearing in log files because (1) you have to be using libpq in order to make use of this and (2) you have to actually use it instead of just doing something else and complaining about the problem. But neither of those things is a reason not to have it. There's no reason why a sophisticated user who goes through libpq shouldn't have an easy way to do this instead of being asked to reimplement it if they want the functionality. ISTM the only way to really move the needle (short of removing all SQL support for changing passwords) would be a GUC that allows disabling the use of SQL for setting passwords. While that doesn't prevent leakage, it does at least force users to use a secure method of setting passwords. -- Jim Nasby, Data Architect, Austin TX
Re: add function argument names to regex* functions.
On 1/1/24 12:05 PM, Dian Fay wrote: I agree that the parameter name `n` is not ideal. For example, in `regexp_replace` it's easy to misinterpret it as "make up to n replacements". This has not been a problem when `n` only lives in the documentation which explains exactly what it does, but that context is not readily available in code expressing `n => 3`. Agreed; IMO it's worth diverging from what Oracle has done here. Another possibility is `index`, which is relatively short and not a reserved keyword ^1. `position` is not as precise but would avoid the conceptual overloading of ordinary indices. I'm not a fan of "index" since that leaves the question of whether it's 0 or 1 based. "Position" is a bit better, but I think Jian's suggestion of "occurrence" is best. -- Jim Nasby, Data Architect, Austin TX
Re: Things I don't like about \du's "Attributes" column
On 1/2/24 1:38 PM, Robert Haas wrote: But to try to apply that concept here means that we suppose the user knows whether the default is INHERIT or NOINHERIT, whether the default is BYPASSRLS or NOBYPASSRLS, etc. And I'm just a little bit skeptical of that assumption. Perhaps it's just that I've spent less time doing user management than table administration and so I'm the only one who finds this fuzzier than some other kinds of SQL objects, but I'm not sure it's just that. Roles are pretty weird.

In my consulting experience, it's extremely rare for users to do anything remotely sophisticated with roles (I was always happy just to see apps weren't connecting as a superuser...). Like you, I view \du and friends as more of a "helping hand" to seeing the state of things, without the expectation that every tiny nuance will always be visible, because I don't think it's practical to do that in psql. While that behavior might surprise some users, the good news is once they start exploring non-default options the behavior becomes self-evident.

Some attributes are arguably important enough to warrant their own column. The most obvious is NOLOGIN, since those roles are generally used for a very different purpose than LOGIN roles. SUPERUSER might be another candidate (though, I much prefer a dedicated "sudo role" to explicit SU on roles).

I'm on the fence when it comes to SQL syntax vs what we have now. What we currently have is more readable, but off-hand I think the other places where we list attributes use SQL syntax. It might be worth changing just for consistency's sake.

-- Jim Nasby, Data Architect, Austin TX
Re: Next step towards 64bit XIDs: Switch to FullTransactionId for PGPROC->xid and XLogRecord->xl_xid
On 1/2/24 1:58 PM, Robert Haas wrote: Maybe this analysis I've just given isn't quite right, but my point is that we should try to think hard about where in the system 32-bit XIDs suck and for what reason, and use that as a guide to what to change first.

Very much this. The biggest reason 2B XIDs are such an issue is that it's incredibly expensive to move the window forward, which is governed by on-disk stuff.
Re: SET ROLE x NO RESET
On 12/31/23 1:19 PM, Joe Conway wrote: On 12/30/23 17:19, Michał Kłeczek wrote: On 30 Dec 2023, at 17:16, Eric Hanson wrote: What do you think of adding a NO RESET option to the SET ROLE command? What I proposed some time ago is SET ROLE … GUARDED BY 'password', so that you could later: RESET ROLE WITH 'password' I like that too, but see it as a separate feature. FWIW that is also supported by the set_user extension referenced elsewhere on this thread.

That means you'd need to manage that password. ISTM a better mechanism to do this in SQL would be a method of changing roles that returns a nonce that you'd then use to reset your role. I believe that'd also make it practical to do this server-side in the various PLs.
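Purely as a sketch of the nonce idea (none of these functions exist; the names and shapes are hypothetical, just to make the proposal concrete):

-- Hypothetical: switch to the restricted role and get back a one-time token
SELECT set_role_guarded('app_user') AS nonce;

-- Hypothetical: the reset only succeeds with the matching nonce; anything else
-- fails in a way a connection pooler can detect (or terminates the connection)
SELECT reset_role_guarded('a81bc81b-dead-4e5d-abff-90865d1e13b1');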
Re: Should vacuum process config file reload more often
On 3/2/23 1:36 AM, Masahiko Sawada wrote: For example, I guess we will need to take care of changes of maintenance_work_mem. Currently we initialize the dead tuple space at the beginning of lazy vacuum, but perhaps we would need to enlarge/shrink it based on the new value? Doesn't the dead tuple space grow as needed? Last I looked we don't allocate up to 1GB right off the bat. I don't think we need to do anything about that initially, just because the config can be changed in a more granular way, doesn't mean we have to react to every change for the current operation. Perhaps we can mention in the docs that a change to maintenance_work_mem will not take effect in the middle of vacuuming a table. But, I think it probably isn't needed. Agreed.

I disagree that there's no need for this. Sure, if maintenance_work_mem is 10MB then it's no big deal to just abandon your current vacuum and start a new one, but the index vacuuming phase with maintenance_work_mem set to say 500MB can take quite a while. Forcing a user to either suck it up or throw away everything done in that phase isn't terribly good. Of course, if the patch that eliminates the 1GB vacuum limit gets committed the situation will be even worse.

While it'd be nice to also honor maintenance_work_mem getting set lower, I don't see any need to go through heroics to accomplish that. Simply recording the change and honoring it for future attempts to grow the memory and on future passes through the heap would be plenty.

All that said, don't let these suggestions get in the way of committing this. Just having the ability to tweak cost parameters would be a win.
Re: refactoring relation extension and BufferAlloc(), faster COPY
On 2/21/23 3:12 PM, Andres Freund wrote: Hi, On 2023-02-21 15:00:15 -0600, Jim Nasby wrote: Some food for thought: I think it's also completely fine to extend any relation over a certain size by multiple blocks, regardless of concurrency. E.g. 10 extra blocks on an 80MB relation is 0.1%. I don't have a good feel for what algorithm would make sense here; maybe something along the lines of extend = max(relpages / 2048, 128); if extend < 8 extend = 1; (presumably extending by just a couple extra pages doesn't help much without concurrency). I previously implemented just that. It's not easy to get right. You can easily end up with several backends each extending the relation by quite a bit, at the same time (or you re-introduce contention). Which can end up with a relation being larger by a bunch if data loading stops at some point. We might want that as well at some point, but the approach implemented in the patchset is precise and thus always a win, and thus should be the baseline.

Yeah, what I was suggesting would only make sense when there *wasn't* contention.
Re: refactoring relation extension and BufferAlloc(), faster COPY
On 10/28/22 9:54 PM, Andres Freund wrote: b) I found that it is quite beneficial to bulk-extend the relation with smgrextend() even without concurrency. The reason for that is primarily the aforementioned dirty buffers that our current extension method causes. One bit that stumped me for quite a while is to know how much to extend the relation by. RelationGetBufferForTuple() drives the decision whether / how much to bulk extend purely on the contention on the extension lock, which obviously does not work for non-concurrent workloads. After quite a while I figured out that we actually have good information on how much to extend by, at least for COPY / heap_multi_insert(). heap_multi_insert() can compute how much space is needed to store all tuples, and pass that on to RelationGetBufferForTuple(). For that to be accurate we need to recompute that number whenever we use an already partially filled page. That's not great, but doesn't appear to be a measurable overhead.

Some food for thought: I think it's also completely fine to extend any relation over a certain size by multiple blocks, regardless of concurrency. E.g. 10 extra blocks on an 80MB relation is 0.1%. I don't have a good feel for what algorithm would make sense here; maybe something along the lines of

extend = max(relpages / 2048, 128);
if (extend < 8)
    extend = 1;

(presumably extending by just a couple extra pages doesn't help much without concurrency).
Re: Direct I/O
On 11/1/22 2:36 AM, Thomas Munro wrote: Hi, Here is a patch to allow PostgreSQL to use $SUBJECT. It is from the

This is exciting to see! There are two other items to add to the TODO list before this would be ready for production:

1) work_mem. This is a significant impediment to scaling shared buffers the way you'd want to.

2) Clock sweep. Specifically, currently the only thing that drives usage_count is individual backends running the clock hand. On large systems with 75% of memory going to shared_buffers, that becomes a very significant problem, especially when the backend running the clock sweep is doing so in order to perform an operation like a b-tree page split. I suspect it shouldn't be too hard to deal with this issue by just having bgwriter or another bgworker proactively ensuring some reasonable number of buffers with usage_count=0 exist.

One other thing to be aware of: overflowing an SLRU becomes a massive problem if there isn't a filesystem backing the SLRU. Obviously only an issue if you try and apply DIO to SLRU files.
Re: autovacuum: change priority of the vacuumed tables
On 3/3/18 2:53 PM, Tomas Vondra wrote: That largely depends on what knobs would be exposed. I'm against adding some low-level knobs that perhaps 1% of the users will know how to tune, and the rest will set it incorrectly. Some high-level options that would specify the workload type might work, but I have no idea about details. Not knowing about details is why we've been stuck here for years: it's not terribly obvious how to create a scheduler that is going to work in all situations. Current autovac is great for 80% of situations, but it simply doesn't handle the remaining 20% by itself. Once you're pushing your IO limits you *have* to start scheduling manual vacuums for any critical tables. At least if we exposed some low level ability to control autovac workers then others could create tools to improve the situation. Currently that's not possible because manual vacuum lacks features that autovac has. One fairly simple option would be to simply replace the logic that currently builds a worker's table list with running a query via SPI. That would allow for prioritizing important tables. It could also reduce the problem of workers getting "stuck" on a ton of large tables by taking into consideration the total number of pages/tuples a list contains. I don't see why SPI would be needed to do that, i.e. why couldn't we implement such prioritization with the current approach. Another thing Sure, it's just a SMOC. But most of the issue here is actually a query problem. I suspect that the current code would actually shrink if converted to SPI. In any case, I'm not wed to that idea. is I really doubt prioritizing "important tables" is a good solution, as it does not really guarantee anything. If by "important" you mean small tables with high update rates, prioritizing those actually would help as long as you have free workers. By itself it doesn't gain all that much though. A more fine-grained approach would be to have workers make a new selection after every vacuum they complete. That would provide the ultimate in control, since you'd be able to see exactly what all the other workers are doing. That was proposed earlier in this thread, and the issue is it may starve all the other tables when the "important" tables need cleanup all the time. There are plenty of other ways to shoot yourself in the foot in that regard already. We can always have safeguards in place if we get too close to wrap-around, just like we currently do. -- Jim Nasby, Chief Data Architect, Austin TX OpenSCG http://OpenSCG.com
Re: autovacuum: change priority of the vacuumed tables
On 2/19/18 10:00 AM, Tomas Vondra wrote: So I don't think this is a very promising approach, unfortunately. What I think might work is having a separate pool of autovac workers, dedicated to these high-priority tables. That still would not guarantee the high-priority tables are vacuumed immediately, but at least that they are not stuck in the worker queue behind low-priority ones. I wonder if we could detect tables with high update/delete activity and promote them to high-priority automatically. The reasoning is that delaying the cleanup for those tables would result in significantly more bloat than for those with low update/delete activity.

I've looked at this stuff in the past, and I think that the first step in trying to improve autovacuum needs to be allowing for a much more granular means of controlling worker table selection, and exposing that ability. There are simply too many different scenarios to account for with a single policy that will satisfy everyone. Just as a simple example, OLTP databases (especially with small queue tables) have very different vacuum needs than data warehouses.

One fairly simple option would be to simply replace the logic that currently builds a worker's table list with running a query via SPI. That would allow for prioritizing important tables. It could also reduce the problem of workers getting "stuck" on a ton of large tables by taking into consideration the total number of pages/tuples a list contains.

A more fine-grained approach would be to have workers make a new selection after every vacuum they complete. That would provide the ultimate in control, since you'd be able to see exactly what all the other workers are doing.

-- Jim Nasby, Chief Data Architect, Austin TX OpenSCG http://OpenSCG.com
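To make the SPI idea concrete, the list-building query could be something along the lines of the sketch below (illustrative only; a real implementation would also have to account for wraparound urgency, per-table autovacuum settings, and tables already claimed by other workers):

-- Rough sketch: rank candidate tables by dead-tuple ratio instead of pg_class order
SELECT c.oid::regclass AS table_name,
       s.n_dead_tup,
       c.relpages
FROM pg_class c
JOIN pg_stat_all_tables s ON s.relid = c.oid
WHERE c.relkind IN ('r', 'm', 't')
ORDER BY s.n_dead_tup::float8 / greatest(c.reltuples, 1) DESC,
         c.relpages;  -- prefer small, badly bloated tables first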