Hi Ashutosh,

I tried applying the v20260128 patches but encountered conflicts. Could you let me know the base commit they were developed against?
Thanks

On Fri, Feb 6, 2026 at 5:21 PM Ashutosh Bapat <[email protected]> wrote:
> On Fri, Nov 14, 2025 at 5:23 PM Ashutosh Bapat <[email protected]> wrote:
> >
> > Hi,
> > PFA new patchset with some TODOs from previous email addressed:
> >
> > On Mon, Oct 13, 2025 at 9:28 PM Ashutosh Bapat <[email protected]> wrote:
> > > 1. New backends join while the synchronization is going on.
> >
> > Done. Explained the solution below in detail.
> >
> > > An existing backend exiting.
> >
> > Not tested specifically, but should work.
> >
> > > 2. Failure or crash in the backend which is executing pg_resize_buffer_pool()
> >
> > Still a TODO.
> >
> > > 3. Fix crashes in the tests.
> >
> > The core regression suite passes, the pg_buffercache regression tests pass, and the tests for buffer resizing pass most of the time. So far I have seen three issues:
> > 1. An assertion from an AIO worker - which happened only once and I couldn't reproduce it again. Need to study the interaction of AIO workers with buffer resizing.
> > 2. Checkpointer crashes - which is one of the TODOs listed below.
> > 3. A shared memory ID related failure, which I don't understand but which happens more frequently than the first one. Need to look into that.
> >
> > > go through Tomas's detailed comments and address those which still apply.
> >
> > Still a TODO. But since many of those patches have been revised heavily, I think many of the comments may have been addressed, and some may not apply anymore.
>
> Thanks a lot Tomas for these review comments and the summary of the discussion as of your response. Sorry it took so long to fully respond to this email, but I wanted the patches to be in a reasonable shape before responding. It's been a while since I posted the last patchset, hence I am attaching the patchset along with responses to your comments as well as the earlier comments that you mentioned in your response. Dmitry has responded to some parts of these emails already, but I am responding to all the comments in the context of the latest implementation, which has been revised heavily since his responses. The latest design and UI are described in detail in [1]. Please note that that email also responds to some earlier comments from Andres and others; I have not repeated those responses here.
>
> The earlier patches did not build in EXEC_BACKEND mode. The attached patches build in EXEC_BACKEND mode and the regression tests pass in that build. I haven't tried running the whole test suite though.
>
> > I agree it'd be useful to be able to resize shared buffers, without having to restart the instance (which is obviously very disruptive). So if we can make this work reliably, with reasonable trade-offs (both on the backends, and also the risks/complexity introduced by the feature).
>
> Agreed. That's the goal - to make it work reliably with reasonable trade-offs. Adding some details to this goal as follows:
>
> 1. Performance
> --------------
> a. The impact of resizing on the performance of concurrent database operations should be reasonable and acceptable.
> b. The code changes should not cause significant performance degradation during normal operations.
>
> Palak Chaturvedi has already done some benchmarking. I am listing her findings here:
> 1. The code changes do not cause any noticeable performance degradation during normal operations.
> 2. Resizing itself is also reasonably fast, with very minimal, almost unnoticeable, performance impact on the concurrent transactions.
>
> We will share the benchmarks once the code is cleaner, more complete, and stable.
>
> 2. Reliability
> --------------
> We are adding tests to make sure that resizing does not introduce any crashes, hazards or data corruption. Some tests are already part of the patchset.
>
> > I'm far from an expert on mmap() and similar low-level stuff, but the current approach (reserving a big chunk of shared memory and slicing it by mmap() into smaller segments) seems reasonable.
>
> In the latest patches, we reserve address space with mmap() and manage memory within that address space using file-backed memory.
>
> > But I'm getting a bit lost in how exactly this interacts with things like overcommit, system memory accounting / OOM killer and this sort of stuff. I went through the thread and it seems to me the reserve+map approach works OK in this regard (and the messages on linux-mm seem to confirm this). But this information is scattered over many messages and it's hard to say for sure, because some of this might be relevant for an earlier approach, or a subtly different variant of it.
>
> Looking at the documentation and the messages on linux-mm, it seems that the mmap() with MAP_NORESERVE approach works well with overcommit, system memory accounting and the OOM killer. I plan to add tests verifying these aspects, so that any changes in the future do not cause any regressions. I am looking for ways to write a small test program for the same, which we can add to our test battery, but no success so far. Any ideas?
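>
> The best idea I have so far is something along these lines - a rough, untested sketch that reserves a large region with MAP_NORESERVE and checks that it is not charged against the overcommit accounting (Committed_AS in /proc/meminfo):
>
>     #include <stdio.h>
>     #include <sys/mman.h>
>
>     /* Read Committed_AS (in kB) from /proc/meminfo. */
>     static long committed_as_kb(void)
>     {
>         FILE *f = fopen("/proc/meminfo", "r");
>         char line[128];
>         long kb = -1;
>
>         while (fgets(line, sizeof(line), f) != NULL)
>             if (sscanf(line, "Committed_AS: %ld kB", &kb) == 1)
>                 break;
>         fclose(f);
>         return kb;
>     }
>
>     int main(void)
>     {
>         size_t  size = (size_t) 8 * 1024 * 1024 * 1024;  /* 8GB */
>         long    before = committed_as_kb();
>         void   *p = mmap(NULL, size, PROT_NONE,
>                          MAP_ANONYMOUS | MAP_PRIVATE | MAP_NORESERVE,
>                          -1, 0);
>         long    after = committed_as_kb();
>
>         if (p == MAP_FAILED)
>         {
>             perror("mmap");
>             return 1;
>         }
>         /* The reservation should not show up in Committed_AS. */
>         printf("Committed_AS delta: %ld kB after reserving %zu kB\n",
>                after - before, size / 1024);
>         munmap(p, size);
>         return 0;
>     }
>
> A TAP test could run this and fail if the delta comes anywhere near the reserved size, though I am not sure how stable Committed_AS is on a busy machine.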
>
> > A similar question is portability. The comments and commit messages seem to suggest most of this is Linux-specific, and other platforms just don't have these capabilities. But there's a bunch of messages (mostly by Thomas Munro) that hint FreeBSD might be capable of this too, even if to some limited extent. And possibly even Windows/EXEC_BACKEND, although that seems much trickier.
> >
> > FWIW I think it's perfectly fine to only support resizing on selected platforms, especially considering Linux is the most widely used system for running Postgres. We still need to be able to build/run on other systems, of course. And maybe it'd be good to be able to disable this even on Linux, if that eliminates some overhead and/or risks for people who don't need the feature. Just a thought.
>
> +1. The plan is to make it work on Linux first and disable it on the other platforms. I think it's a good idea to be able to disable it, at initdb time say, but I am not yet sure whether that's going to avoid any overhead or risks. I haven't yet wrapped my head around the platform dependency completely.
>
> > Anyway, my main point is that this information is important, but very scattered over the thread. It's a bit foolish to expect everyone who wants to do a review to read the whole thread (which will inevitably grow longer over time), and assemble all these pieces again and again, following all the changes in the design etc. Few people will get over that hurdle, IMHO.
> >
> > So I think it'd be very helpful to write a README, explaining the current design/approach, and summarizing all these aspects in a single place. Including things like portability, interaction with the OS accounting, OOM killer, this kind of stuff. Some of this stuff may be already mentioned in code comments, but it's hard to find those.
> >
> > Especially worth documenting are the states the processes need to go through (using the barriers), and the transitions between them (i.e. what is allowed in each phase, what blocks can be visible, etc.).
>
> That's a great idea. There is already a README in src/backend/storage/buffer/; I have added a section on buffer pool resizing. The section has pointers to the relevant code comments. I will revise this to be as self-sufficient as possible as the patches mature. The code managing the shared memory is scattered across backend/portability, storage/ipc etc., so it's slightly troublesome to find one good place for a README which explains everything.
>
> > I'll go over some higher-level items first, and then over some comments for individual patches.
> >
> > 1) no user docs
> >
> > There are no user .sgml docs, and maybe it's time to write some, explaining how to use this thing - how to configure it, how to trigger the resizing, etc. It took me a while to realize I need to do ALTER SYSTEM + pg_reload_conf() to kick this off.
>
> The latest patches have updated config.sgml and func-admin.sgml to document the functionality. Please review them.
>
> > It should also document the user-visible limitations, e.g. what activity is blocked during the resizing, etc.
>
> We are still figuring this out, but we will document the limitations in func-admin.sgml. Right now there are no limitations, and my intention is to keep the limitations as limited as possible.
>
> > 2) pending GUC changes
> >
> > I'm somewhat skeptical about the GUC approach. I don't think it was designed with this kind of use case in mind, and so I think it's quite likely it won't be able to handle it well.
> >
> > For example, there's almost no validation of the values, so how do you ensure the new value makes sense? Because if it doesn't, it can easily crash the system (I've seen such crashes repeatedly, I'll get to that). Sure, you may do ALTER SYSTEM to set shared_buffers to nonsense and it won't start after restart/reboot, but crashing an instance is maybe a little bit more annoying.
> >
> > Let's say we did the ALTER SYSTEM + pg_reload_conf(), and it gets stuck waiting on something (can't evict a buffer or something). How do you cancel it, when the change is already written to the .auto.conf file? Can you simply do ALTER SYSTEM + pg_reload_conf() again?
> >
> > It also seems a bit strange that the "switch" gets to be driven by a randomly selected backend (unless I'm misunderstanding this bit). It seems to be true for the buffer eviction during shrinking, at least.
> >
> > Perhaps this should be a separate utility command, or maybe even just a new ALTER SYSTEM variant? Or even just a function, similar to what the "online checksums" patch did, possibly combined with a bgworker (but probably not needed, there are no db-specific tasks to do).
>
> As documented in func-admin.sgml and config.sgml, resizing the buffer pool requires ALTER SYSTEM + pg_reload_conf() followed by pg_resize_shared_buffers(). There is no pending flag, and no arbitrary backend driving the process. Validation and failures are handled by pg_resize_shared_buffers(). The backend in which pg_resize_shared_buffers() is called is responsible for driving the resizing process.
> I think the new implementation takes care of all the concerns mentioned above.
>
> > 3) max_available_memory
> >
> > Speaking of GUCs, I dislike how max_available_memory works. It seems a bit backwards to me. I mean, we're specifying shared_buffers (and some other parameters), and the system calculates the amount of shared memory needed. But the limit determines the total limit?
> >
> > I think the GUC should specify the maximum shared_buffers we want to allow, and then we'd work out the total to pre-allocate? Considering we're only allowing to resize shared_buffers, that should be pretty trivial. Yes, it might happen that the "total limit" happens to exceed the available memory or something, but we already have that problem with shared_buffers. Seems fine if we explain this in the docs, and perhaps print the calculated memory limit on start.
> >
> > In any case, we should not allow setting a value that ends up overflowing the internal reserved space. It's true we don't have a good way to do checks for GUCs, but it's a bit silly to crash because of hitting some non-obvious internal limit that we don't necessarily know about.
> >
> > Maybe this is a reason why GUC hooks are not a good way to set this.
>
> The latest patches introduce max_shared_buffers, which specifies the maximum shared buffers allowed. max_available_memory, which some previous patchsets had, has been removed. This GUC is mentioned in the documentation changes.
>
> One relatively small thing to think about is how we classify the shared_buffers GUC - it's not PGC_SIGHUP per se, since it requires a function to be called after the reload, but it's not PGC_POSTMASTER either, since a restart is not required. I think we need a new PGC_ context for it but haven't yet thought of a good name. Do you have any suggestions?
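>
> For reference, the current entry in guc_tables.c looks roughly like this (abbreviated, quoted from memory); the open question is what replaces PGC_POSTMASTER in the second line:
>
>     {
>         {"shared_buffers", PGC_POSTMASTER, RESOURCES_MEM,
>             gettext_noop("Sets the number of shared memory buffers used by the server."),
>             NULL,
>             GUC_UNIT_BLOCKS
>         },
>         &NBuffers,
>         16384, 16, INT_MAX / 2,
>         NULL, NULL, NULL
>     },
>
> Whatever we call the new context, the GUC machinery would need to know that the assign hook alone does not complete the change.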
>
> > 4) SHMEM_RESIZE_RATIO
> >
> > The SHMEM_RESIZE_RATIO thing seems a bit strange too. There's no way these ratios can make sense. For example, BLCKSZ is 8192 but the buffer descriptor is 64B. That's a 128x difference, but the ratios say 0.6 and 0.1, so 6x. Sure, we'll actually allocate only the memory we need, and the rest is only "reserved".
> >
> > However, that just makes the max_available_memory a bit misleading, because you can't ever use it. You can use the 60% for shared buffers (which is not mentioned anywhere, and good luck not overflowing that, as it's never checked), but those smaller regions are guaranteed to be mostly unused. Unfortunate.
> >
> > And it's not just a matter of fixing those ratios, because then someone rebuilds with 32kB blocks and you're in the same situation.
> >
> > Moreover, all of the above is for mappings sized based on NBuffers. But if we allocate 10% for MAIN_SHMEM_SEGMENT, won't that be a problem the moment someone increases max_connections, max_locks_per_transaction and possibly some other stuff?
>
> +1. Fixed. max_shared_buffers only deals with the shared buffers.
>
> > 5) no tests
> >
> > I mentioned no "user docs", but the patch has 0 tests too. Which seems a bit strange for a patch of this age.
> >
> > A really serious part of the patch series seems to be the coordination of processes when going through the phases, enforced by the barriers. This seems like a perfect match for testing using injection points, and I know we did something like this in the online checksums patch, which needs to coordinate processes in a similar way.
> >
> > But even just a simple TAP test that does a bunch of (random?) resizes while running a pgbench seems better than no tests. (That's what I did manually, and it crashed right away.)
> >
> > There's a lot more stuff to test here, I think. Idle sessions with buffers pinned by open cursors, multiple backends doing ALTER SYSTEM + pg_reload_conf() concurrently, other kinds of failures.
>
> +1. The latest patches already have a few tests, but we are adding more tests to cover different scenarios.
>
> > 6) SIGBUS failures
> >
> > As mentioned, I did some simple tests with shrink/resize with a pgbench in the background, and it almost immediately crashed for me :-( With a SIGBUS, which I think is fairly rare on x86 (definitely much less common than e.g. SIGSEGV).
> >
> > An example backtrace attached.
>
> There is a test which runs pgbench concurrently with resizing, and it fails much less frequently (1 in 20 or even lower). Quite likely the crash you have seen is also fixed. Please let me know if the test fails for you, and whether there is something in your test that is not in that test.
>
> > 7) EXEC_BACKEND, FreeBSD
> >
> > We clearly need to keep this working on systems without the necessary bits (so likely EXEC_BACKEND, FreeBSD etc.). But the builds currently fail in both cases, it seems.
> >
> > I think it's fine to not support resizing on every platform, otherwise we'd never get it, but it still needs to build. It would be good to not have two very different code versions, one for resizing and one without it, though. I wonder if we can just have the "no-resize" version use the same struct (with the segments/mapping, ...) and all that, but skipping the space reservation.
>
> I have not thought fully through support on platforms other than Linux. However, my current plan is to not support the GUC max_shared_buffers and the resizing function pg_resize_shared_buffers() on platforms which do not have the necessary bits. The code will still build and run on those platforms, but resizing shared buffers without a restart won't be possible.
>
> > 8) monitoring
> >
> > So, let's say I start a resize of shared buffers. How will I know what it's currently doing, how much longer it might take, what it's waiting for, etc.? I think it'd be good to have progress monitoring, through the regular system view (e.g. pg_stat_shmem_resize_progress?).
>
> When pg_resize_shared_buffers() finishes, the user knows that the resizing is finished. It usually takes a few seconds for the resizing to finish, so I am not sure whether we will need progress reporting. But there might be cases where the resizing has to wait for something, or a given phase, e.g. eviction, takes longer. So we may require progress reporting. I have added a TODO in the code for the same.
>
> > 10) what to do about stuck resize?
> >
> > AFAICS the resize can get stuck for various reasons, e.g. because it can't evict pinned buffers, possibly indefinitely. Not great, and it's not clear to me if there's a way out (canceling the resize) after a timeout, or something like that? Not great to start an "online resize" only to get stuck with all activity blocked for an indefinite amount of time, and have to restart anyway.
> >
> > Seems related to Thomas' message [2], but AFAICS the patch does not do anything about this yet, right? What's the plan here?
>
> pg_resize_shared_buffers() is affected by timeouts like statement_timeout, as well as query cancellation. However, the current implementation lacks graceful handling of these events. Another TODO.
>
> > 11) preparatory actions?
> >
> > Even if it doesn't get stuck, some of the actions can take a while, like evicting dirty buffers before shrinking, etc. This is similar to what happens on restart, when the shutdown checkpoint can take a while, while the system is (partly) unavailable.
> >
> > The common mitigation is to do an explicit checkpoint right before the restart, to make the shutdown checkpoint cheap. Could we do something similar for the shrinking, e.g. flush buffers from the part to be removed before actually starting the resize?
>
> Eviction is carried out before any changes to the memory or the shared buffer metadata. If eviction fails, the resizing is rolled back and the system continues to work with the old buffer pool size. A test specifically covering this aspect remains to be added.
>
> > 12) does this affect e.g. fork() costs?
> >
> > I wonder if this affects the cost of fork() in some undesirable way? Could it make fork() measurably more expensive?
>
> AFAIK, a new backend needs to attach to the extra shared memory segments and detach from them when exiting, but I don't think there's any other extra work that a backend needs to do when starting or exiting. Windows might be different. I haven't seen any noticeable difference in pgbench performance with a new connection for every transaction. However, we haven't tried a benchmark with empty transactions, which would exercise fork() and exit() at a higher frequency. Another TODO.
>
> > 13) resize "state" is all over the place
> >
> > For me, a big hurdle when reasoning about the resizing correctness is that there's quite a lot of distinct pieces tracking what the current "state" is. I mean, there's:
> >
> > - ShmemCtrl->NSharedBuffers
> > - NBuffers
> > - NBuffersOld
> > - NBuffersPending
> > - ... (I'm sure I missed something)
> >
> > There's no cohesive description of how this fits together; it seems a bit "ad hoc". Could be correct, but I find it hard to reason about.
>
> The next set of patches will consolidate the state in only two places: NBuffersPending and StrategyControl, which needs a new name. But even in the attached patches it's consolidated in NBuffersPending, ShmemCtrl and StrategyControl. There are instances of NBuffers which will be replaced by variables from ShmemCtrl or StrategyControl.
>
> > 14) interesting messages from the thread
> >
> > While reading through the thread, I noticed a couple messages that I think are still relevant:
> >
> > - I see Peter E posted some review in 2024/11 [3], but it seems his comments were mostly ignored. I agree with most of them.
>
> Find a detailed reply to that email at the end.
>
> > - Robert mentioned a couple interesting failure scenarios in [4], not sure if all of this was handled. He however assumes pointers would not be stable (and that's something we should not allow, and the current approach works OK in this regard, I think). He also outlines how it'd happen in phases - this would be useful for the design README I think. It also reminds me of the "phases" in the checksums patch.
>
> Pointers are stable in the latest patch. The phases are explained in the prologue of pg_resize_shared_buffers().
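>
> In rough outline, the flow condenses to the following (illustrative names only, not the actual code; the prologue is the authoritative description):
>
>     /* Illustrative only; names and granularity do not match the patch. */
>     typedef enum ShmemResizePhase
>     {
>         RESIZE_PHASE_VALIDATE,  /* check the new size against max_shared_buffers */
>         RESIZE_PHASE_EVICT,     /* shrinking only: evict buffers beyond the new
>                                  * size; a pinned buffer aborts and rolls back */
>         RESIZE_PHASE_REMAP,     /* grow/shrink the memory backing the segments */
>         RESIZE_PHASE_METADATA,  /* update buffer descriptors, StrategyControl etc. */
>         RESIZE_PHASE_DONE
>     } ShmemResizePhase;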
>
> > - Robert asked [5] if Linux might abruptly break this, but I find that unlikely. We'd point out we rely on this, and they'd likely rethink. This would be made safer if this was specified by POSIX - taking that away once implemented seems way harder than for custom extensions. It's likely they'd not take away the feature without an alternative way to achieve the same effect, I think (yes, harder to maintain). Tom suggests [7] this is not in POSIX.
> >
> > - Matthias mentioned [6] similar flags on other operating systems. Could some of those be used to implement the same resizing?
>
> Part of the portability TODO.
>
> > - Andres had an interesting comment about how overcommit interacts with MAP_NORESERVE. AFAIK it means we need the flag to not break overcommit accounting. There are also some comments from the linux-mm people [9].
>
> The new implementation uses MAP_NORESERVE. See my earlier response about overcommit.
>
> > - There seem to be some issues with releasing memory backing a mapping with hugetlb [10]. With the fd (and truncating the file), this seems to release the memory, but it's Linux-specific? But most of this stuff is specific to Linux, it seems. So is this a problem? With this it should be working even for hugetlb ...
>
> Right. mmap() + ftruncate(), instead of mmap() + mremap(), allows us to avoid going through the postmaster, which makes the implementation much simpler, and it also supports hugetlb. But it will be Linux-only.
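>
> To illustrate the idea (a simplified sketch, not the patch's code; error handling omitted, and memfd_create() needs _GNU_SOURCE and a reasonably recent glibc): the full reserved range is mapped against an anonymous file exactly once, and ftruncate() alone then grows or shrinks the memory actually backing it, so no pointer ever moves and the postmaster is not involved:
>
>     #define _GNU_SOURCE
>     #include <sys/mman.h>
>     #include <unistd.h>
>
>     /* Reserve the maximum address space once, map the anonymous file
>      * over all of it, and let the file size control the usable part. */
>     void *
>     reserve_segment(size_t max_size, size_t initial_size, int *fdp)
>     {
>         int     fd = memfd_create("pg_shmem_segment", 0);
>         void   *base;
>
>         ftruncate(fd, initial_size);
>         base = mmap(NULL, max_size, PROT_READ | PROT_WRITE,
>                     MAP_SHARED | MAP_NORESERVE, fd, 0);
>         *fdp = fd;
>         return base;            /* stable for the life of the mapping */
>     }
>
>     /* Resizing is then just a matter of changing the file size. */
>     void
>     resize_segment(int fd, size_t new_size)
>     {
>         ftruncate(fd, new_size);
>     }
>
> An access beyond the truncated size raises SIGBUS, which is also why the backends must agree on the new size before the file actually shrinks.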
>
> > - It seems FreeBSD has MFD_HUGETLB [11], so maybe we could use this and make the hugetlb stuff work just like on Linux? Unclear. Also, I thought the mfd stuff is Linux-specific ... or am I confused?
>
> Part of the portability TODO.
>
> > - Andres objected to any approach without pointer stability, and I agree with that. If we can figure out such a solution, of course.
>
> Since we call mmap() only once, the address of the mapping does not change; it is always stable. So pointer stability is guaranteed.
>
> > - Thomas asked [13] why we need to stop all the backends, instead of just waiting for them to acknowledge the new (smaller) NBuffers value and then letting them continue. I also don't quite see why this should not work, and it'd limit the disruption when we have to wait for eviction of buffers pinned by paused cursors, etc.
>
> The approach in the latest patches does not stop all backends. Backends continue to work while resizing is in progress; they are synchronized at certain points using barriers. The details of the synchronization for each subsystem and worker backend need to be worked out. We are working on that.
>
> > Now, some comments about the individual patches (some of this may be a bit redundant with the earlier points):
>
> Since these patches have been heavily rewritten as part of redesigning and reimplementing the resizing process, some of the comments may not be relevant any more. I will try to answer the underlying concerns wherever applicable.
>
> > v5-0001-Process-config-reload-in-AIO-workers.patch
> >
> > 1) Hmmm, so which other workers may need such explicit handling? Do all other processes participate in the procsignal stuff, or does anything need explicit handling?
>
> As Andres mentioned in [2], this is not needed. It's not part of the latest patches.
>
> > v5-0003-Introduce-pss_barrierReceivedGeneration.patch
> >
> > 1) Do we actually need this? Isn't it enough to just have two barriers? Or a barrier + condition variable, or something like that.
> >
> > 2) The comment talks about a "coordinated way" when processing messages, but it's not very clear to me. It should explain what is needed and not possible with the current barrier code.
> >
> > 3) This very much reminds me of what the online checksums patch needed to do, and we managed to do it using plain barriers. So why does this need this new thing? (No opinion on whether it's correct.)
>
> This isn't required in the latest implementation. Removed from the latest patchset.
>
> > v5-0004-Allow-to-use-multiple-shared-memory-mappings.patch
> >
> > 1) "int shmem_segment" - wouldn't it be better to have a separate enum for this? I mean, we'll have a predefined list of segments, right?
>
> +1. Andres suggested [2] keeping only two segments, so an enum may be superfluous, but it may be good to have if we need more segments in the future. A TODO for now.
>
> > 2) typedef struct AnonymousMapping would deserve some comment
>
> This structure no longer exists. Instead we use MemoryMappingSizes to hold the required sizes of the shared memory segments.
>
> > 3) ANON_MAPPINGS - Probably should be MAX_ANON_MAPPINGS? But we'll know how many we have, so why not allocate exactly the right number? Or even just an array of structs, like in similar cases?
>
> +1. Renamed to NUM_MEMORY_MAPPINGS; it is used to declare and traverse the corresponding arrays.
>
> > 4) static int next_free_segment = 0;
> >
> > We know exactly what segments we'll create and in which order, no? So why do we even bother with this next_free_segment thing? Can't we simply declare an array of AnonymousMapping elements, with all the elements, and then just walk it and calculate the sizes/pointers?
>
> next_free_segment is removed from the latest patches, as it's not needed.
>
> > 5) I'm a bit confused about the segment/mapping difference. The patch seems to randomly mix those, or maybe I'm just confused. I mean, we are creating just one shmem segment, and the pieces are mappings, right? So why do we index them by "shmem_segment"?
> >
> > Also, consider
> >
> >     CreateAnonymousSegment(AnonymousMapping *mapping)
> >
> > so is that creating a segment or a mapping? Or what's the difference? Or are we creating multiple segments, and I missed that? Or are there different "segment" concepts, or what?
> >
> > 6) There should probably be some sort of API wrapping the mappings, so that the various places don't need to mess with next_free_segments directly, etc. Perhaps PGSharedMemoryCreate() shouldn't do this, and should just pass the size to CreateAnonymousSegment(), which then finds the empty slot in Mappings, etc.? Not sure that'll work, but it's a bit error-prone if a struct is modified from multiple places like this.
>
> Fixed this confusion in the latest patches. There are multiple segments, each mapped to a different address space. Since we are using mmap() + ftruncate(), the memory allocated for each segment is controlled by a separate fd. If we used a single address space reservation and carved multiple segments out of it, we could not resize each segment separately. Just to clarify, we do not reserve one large part of the address space and then carve it into smaller segments.
>
> > 7) We should remember which segments got to use huge pages and which did not. And we should make it optional for each segment.
> > Although, maybe I'm just confused about the "segment" definition - if we only have one, that's where huge pages are applied.
> >
> > If we could have multiple segments for different segments (whatever that means), it's not clear what we'd report for cases when some segments get to use huge pages and others don't - either because we don't want huge pages for some segments, or because we happen to run out of the available huge pages.
>
> This is an interesting idea. In the current implementation, either we use huge pages for all the segments or for none of them. I think per-segment huge page usage could be an add-on feature in the next version. What do you think? Using separate segments for reserving separate address spaces will make it easy to use huge pages for some and not for others.
>
> > 8) It seems PGSharedMemoryDetach got some significant changes, but the comment was not modified at all. I'd guess that means the comment is perhaps stale, or maybe there's something we should mention.
>
> Done.
>
> > 9) I doubt the Assert on GetConfigOption needs to be repeated for all segments (in CreateSharedMemoryAndSemaphores).
>
> Done.
>
> > 10) Why do we have the Mappings and Segments indexed in different ways? I mean, Mappings seem to be filled FIFO (just grab the next free slot), while Segments are indexed by segment ID.
>
> Fixed this confusion in the latest patches. Both are indexed by segment ID.
>
> > 11) Actually, what's the difference between the contents of Mappings and Segments? Isn't that the same thing, indexed in the same way? Or could it be unified? Or are they conceptually different things?
>
> See the explanation above.
>
> > 12) I believe we'll have a predefined list of segments, with fixed IDs, so why not just have a MAX of those IDs as the capacity?
>
> Yes. Fixed in the latest patches.
>
> > 13) Would it be good to have some checks on shmem_segment values? That it's valid with respect to defined segments, etc. An assert, maybe? What about some asserts on the Mapping/Segment elements? To check that the element is sensible, and that the arrays "match" (if we need both).
>
> That's a good idea. Added Asserts to that effect.
>
> > 14) Some of the lines got pretty long, e.g. in pg_get_shmem_allocations. I suggest we define some macros to make this shorter, or something like that.
>
> Done.
>
> > 15) I'd maybe rename ShmemSegment to PGShmemSegment, for consistency with PGShmemHeader?
>
> Actually, we track information about the shared memory segments in multiple places. shmem.c has APIs similar to MemoryContext but for shared memory; the information there is consolidated into the ShmemSegment structure. pg_shmem.h has platform-independent APIs to manage shared memory segments, and each implementation has implementation-specific information. Some of that information is passed from the parent process (postmaster) to the child processes (PostgreSQL backends). I have created PGInhShmemSegment for the information that is passed from the postmaster to the backends, and an AnonymousShmemSegment structure for tracking information about anonymous shared memory in sysv_shmem.c. I don't like the name PGInhShmemSegment, but I haven't figured out a better one. I may rearrange these structures a bit more in the next patches.
>
> > 16) Is MAIN_SHMEM_SEGMENT something we want to expose in a public header file?
> > Seems very much like an internal thing; people should access it only through APIs ...
>
> If we convert it to an enum, we can't avoid exposing it in a public header file. In the future, we may allow users to allocate memory in specific segments, so I think it's fine to expose it.
>
> > v5-0005-Address-space-reservation-for-shared-memory.patch
> >
> > 1) Shouldn't reserved_offset and huge_pages_on really be in the segment info? Or maybe even in the mapping info? (again, maybe I'm confused about what these structs store)
>
> I can't find reserved_offset in the latest patches. huge_pages_on is now a global flag; either all segments use huge pages or none of them do. As mentioned earlier, per-segment huge page usage may be a separate add-on feature.
>
> > 2) The CreateSharedMemoryAndSemaphores comment is rather light on what it does, considering it now reserves space and then carves it into segments.
>
> The detailed comment is in PGSharedMemoryCreate(), which seems to be a better place for explaining memory reservation etc. I just changed the comment here to mention plural shared memory segments and to differentiate between the shared memory segments and the shared memory structures. Please let me know if this looks good.
>
> > 3) So ReserveAnonymousMemory is what makes decisions about huge pages, for the whole reserved space / all segments in it. That's a bit unfortunate with respect to the desirability of some segments benefiting from huge pages and others not. Maybe we should have two "reserved" areas, one with huge pages, one without?
>
> See my responses above about mapping vs. segment and per-segment huge page usage.
>
> > I guess we don't want too many segments, because that might make fork() more expensive, etc. Just guessing, though. Also, how would this work with threading?
>
> See my earlier response about fork() performance, and also about limiting the number of segments to just 2. If we see fork() getting expensive, we can limit the number of segments to 1.
>
> > 4) Any particular reason to define max_available_memory as GUC_UNIT_BLOCKS and not GUC_UNIT_MB? Of course, if we change this to have a "max shared buffers limit" then it'd make sense to use blocks, but a "total limit" is not in blocks.
>
> Right. See the response about max_shared_buffers above.
>
> > 5) The general approach seems sound to me, but I'm no expert on this. I wonder how portable this behavior is. I mean, will it work on other Unix systems / Windows? Is it POSIX or a Linux extension?
>
> See my earlier response about portability.
>
> > 6) It might be a good idea to have Assert procedures to check mappings and segments (that they don't overflow the reserved space, etc.). It took me ages to realize I can change shared_buffers to >60% of the limit; it'll happily oblige and then just crash with OOM when calling mprotect().
>
> This shouldn't happen with the latest patches. There are tests for the same. Please let me know if you still face the issue in your tests.
>
> > v5-0006-Introduce-multiple-shmem-segments-for-shared-buff.patch
> >
> > 1) I suspect the SHMEM_RESIZE_RATIO is the wrong direction, because it entirely ignores relationships between the parts. See the earlier comment about this.
> >
> > 2) In fact, what happens if the user tries to resize to a value that is too large for one of the segments? How would the system know before starting the resize (and failing)?
>
> +1.
> No SHMEM_RESIZE_RATIO in the latest patches. The reservation is entirely based on max_shared_buffers.
>
> > 3) It seems wrong to modify BufferManagerShmemSize like this. It's probably better to have a "...SegmentSize" function for individual segments, and let BufferManagerShmemSize() still return the sum of all segments.
>
> I have reimplemented CalculateShmemSize() in the latest patches. Please review.
>
> > 4) I think MaxAvailableMemory is the wrong abstraction, because that's not what people specify. See the earlier comment.
>
> +1. No MaxAvailableMemory in the latest patches.
>
> > 5) Let's say we change the shared memory size (ALTER SYSTEM) and trigger the config reload (pg_reload_conf). But then we find that we can't actually shrink the buffers, for some unpredictable reason (e.g. there are pinned buffers). How do we "undo" the change? We can't really undo the ALTER SYSTEM; that's already written in the .conf and we don't know the old value, IIRC. Is it reasonable to start killing backends from the assign_hook or something? Seems weird.
>
> The buffer pool resizing is done using the function pg_resize_shared_buffers(), which does not need to roll back the config value. As mentioned earlier, appropriate error handling is still a TBD.
>
> > v5-0007-Allow-to-resize-shared-memory-without-restart.patch
> >
> > 1) Why would AdjustShmemSize be needed? Isn't that a sign of a bug somewhere in the resizing?
>
> It has been replaced by BufferManagerShmemResize() in the latest patches. BufferManagerShmemResize() only resizes the buffer-manager-related shared memory segments and data structures.
>
> > 2) Isn't the pg_memory_barrier() in CoordinateShmemResize a bit weird? Why is it needed, exactly? If it's to flush stuff for processes consuming EmitProcSignalBarrier, isn't that too late? What if a process consumes the barrier between the emit and the memory barrier?
> >
> > 3) WaitOnShmemBarrier seems a bit under-documented.
>
> Both of these functions are not required in the new implementation. Removed from the latest patches.
>
> > 4) Is this actually adding buffers to the freelist? I see buf_init only links the new buffers by setting freeNext, but where are the new buffers added to the existing freelist?
>
> The freelist does not exist anymore.
>
> > 5) The issue with a new backend seeing an old NBuffers value reminds me of the "support enabling checksums online" thread, where we ran into similar race conditions. See message [1], the part about race #2 (the other race might be relevant too, not sure). It's been a while, but I think our conclusion in that thread was that the "best" fix would be to change the order of steps in InitPostgres(), i.e. set up the ProcSignal stuff first, and only then "copy" the NBuffers value. And handle the possibility that we receive "duplicate" barriers.
>
> I plan to remove NBuffers entirely and instead use shared memory variables, so that we don't have to worry about backends inheriting stale values from the postmaster, or even involving the postmaster in the resizing process. This is being worked on and will be available in a future patchset.
>
> > 6) In fact, the online checksums thread seems like a possible source of inspiration for some of the issues, because it needs to do similar stuff (e.g. make sure all backends follow steps in a synchronized way, etc.). And it didn't need new types of Barrier to do that.
>
> Right. The new implementation does not require any changes to the barrier implementation.
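>
> The existing ProcSignal barrier machinery suffices; the coordination step boils down to something like this sketch (EmitProcSignalBarrier() and WaitForProcSignalBarrier() are the existing functions, while the barrier type name here is made up):
>
>     /* Ask every live backend to run its resize barrier handler, and
>      * wait until all of them have absorbed the barrier. */
>     uint64 generation = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SHMEM_RESIZE);
>
>     WaitForProcSignalBarrier(generation);
>     /* From here on, no backend is still using the old buffer pool size. */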
>
> > 7) Also, this seems like a perfect match for testing using injection points. In fact, there's not a single test in the whole patch series. Or a single line of .sgml docs, for that matter. It took me a while to realize I'm supposed to change the size by ALTER SYSTEM + reloading the config.
>
> There are some tests in the latest patches. More tests are being added.
>
> > v5-0008-Support-shrinking-shared-buffers.patch
> >
> > 1) Why is ShmemCtrl->evictor_pid reset in AnonymousShmemResize? Isn't there a place starting it and waiting for it to complete? Why shouldn't it do EvictExtraBuffers itself?
>
> evictor_pid is not used in the latest patches. Eviction is done in pg_resize_shared_buffers() itself, by the backend which executes that function.
>
> > 2) Isn't the change to BufferManagerShmemInit wrong? How do we know the last buffer is still at the end of the freelist? Seems unlikely.
>
> No freelist anymore.
>
> > 3) Seems a bit strange to do it from a random backend. Shouldn't it be the responsibility of a process like checkpointer/bgwriter, or maybe a dedicated dynamic bgworker? Can we even rely on a backend to be available?
>
> The backend which executes pg_resize_shared_buffers() coordinates the resizing itself. I have not seen a need for it to use another background worker yet.
>
> > 4) Unsolved issues with buffers pinned for a long time. Could be an issue if a buffer is pinned indefinitely (e.g. a cursor in an idle connection), and the resizing blocks some activity (new connections or stuff like that).
>
> Resizing stops immediately after it encounters a pinned buffer. More sophisticated handling of such situations is a TBD, and mostly v2 material.
>
> > 5) Funny that "AI suggests" something, but doesn't the block fail to reset nextVictimBuffer of the clocksweep? It may point to a buffer we're removing, and it'll be invalid, no?
>
> TODO.
>
> > 6) It's not clear to me in what situations this triggers (in the call to BufferManagerShmemInit)
> >
> >     if (FirstBufferToInit < NBuffers) ...
>
> That code does not exist in the latest patches.
>
> > v5-0009-Reinitialize-StrategyControl-after-resizing-buffe.patch
> >
> > 1) IMHO this should be included in the earlier resize/shrink patches, I don't see a reason to keep it separate (assuming this is the correct way, and the "init" is not).
>
> Right. Merged into the resizing patch in the latest patches.
>
> > 2) Doesn't StrategyPurgeFreeList already do some of this for the case of shrinking memory?
>
> No freelist anymore.
>
> > 3) Not great adding a bunch of static variables to bufmgr.c. Why do we need to make "everything" static global? Isn't it enough to make only the "valid" flag global? The rest can stay local, no?
> >
> > If everything needs to be global for some reason, could we at least make it a struct, to group the fields, not just separate random variables? And maybe at the top, not half-way through the file?
> >
> > 4) Isn't the name BgBufferSyncAdjust misleading? It's not adjusting anything, it's just invalidating the info about past runs.
>
> Being worked on.
>
> > 5) I don't quite understand why BufferSync needs to do the dance with delay_shmem_resize. I mean, we certainly should not run BufferSync from the code that resizes buffers, right?
> > Certainly not after the eviction, from the part that actually rebuilds the shmem structs etc. So perhaps something could trigger a resize while we're running BufferSync()? Isn't that a bit strange? If this flag is needed, it seems more like a band-aid for some issue in the architecture.
> >
> > 6) Also, why should it be fine to get into a situation where some of the buffers might not be valid, during shrinking? I mean, why should we need this check (pg_atomic_read_u32(&ShmemCtrl->NSharedBuffers) != NBuffers)? It seems better to ensure we never get into "sync" in a way that might leave some of the buffers invalid. Seems way too low-level to care about whether a resize is happening.
> >
> > 7) I don't understand the new condition for "Execute the LRU scan". Won't this stop the LRU scan even in cases when we want it to happen? Don't we want to scan the buffers in the remaining part (after shrinking), for example? Also, we already checked this shmem flag at the beginning of the function - sure, it could change (if some other process modifies it), but does that make sense? Wouldn't it cause problems if it can change at an arbitrary point while running BufferSync? IMHO just another sign it may not make sense to allow this, i.e. buffer sync should not run during the "actual" resize.
>
> Resizing runs for a few seconds. With the default BufferSync frequency of 200ms, there will be a few cycles of BufferSync during resizing. If we don't let BufferSync run during resizing, we might have more dirty buffers to tackle post-resizing. I think it would be wise to let it run during resizing, on the portion of the buffer pool which is not being removed. Working on that.
>
> > v5-0010-Additional-validation-for-buffer-in-the-ring.patch
> >
> > 1) So the problem is we might create a ring before shrinking shared buffers, and then GetBufferFromRing will see bogus buffers? OK, but we should be more careful with these checks, otherwise we'll miss real issues when we incorrectly get an invalid buffer. Can't the backends do this only when they know for sure we did shrink the shared buffers? Or maybe even handle that during the barrier?
>
> AFAIK, the buffer rings are inside the Scan nodes, which are not accessible globally or to the barrier handling functions, so we can't purge the rings just once, during or after resizing. If we already have a mechanism to access the rings globally, and hence in the barrier code, or we can develop such a mechanism, we can purge the rings only once. Please let me know if you have any ideas on this.
>
> > 2) IMHO it's a sign the "transitions" between different NBuffers values may not be clear enough, and we're allowing stuff to happen in the "blurry" area. I think that's likely to cause bugs (it did cause issues for the online checksums patch, I think).
>
> NBuffers is going to be replaced by shared memory variables. The current code relies on the buffer pool size being static for the entire lifetime of the server; we are now changing that assumption. So yes, there will be some "blurry" areas and bugs to tackle. We are developing tests to uncover those bugs and fix them.
>
> Here's a reply to the email by Peter E [3].
>
> On Tue, Nov 19, 2024 at 6:27 PM Peter Eisentraut <[email protected]> wrote:
> >
> > On 18.10.24 21:21, Dmitry Dolgov wrote:
> > > v1-0001-Allow-to-use-multiple-shared-memory-mappings.patch
> > >
> > > Preparation, introduces the possibility to work with many shmem mappings. To make it less invasive, I've duplicated the shmem API to extend it with the shmem_slot argument, while redirecting the original API to it. There are probably better ways of doing that, I'm open for suggestions.
> >
> > After studying this a bit, I tend to think you should just change the existing APIs in place. So for example,
> >
> >     void *ShmemAlloc(Size size);
> >
> > becomes
> >
> >     void *ShmemAlloc(int shmem_slot, Size size);
> >
> > There aren't that many callers, and all these duplicated interfaces almost add more new code than they save.
> >
> > It might be worth making exceptions for interfaces that are likely to be used by extensions. For example, I see pg_stat_statements using ShmemInitStruct() and ShmemInitHash(). But that seems to be it. Are there any other examples out there? Maybe there are many more that I don't see right now. But at least for the initialization functions, it doesn't seem worth it to preserve the existing interfaces exactly.
> >
> > In any case, I think the slot number should be the first argument. This matches how MemoryContextAlloc() or also talloc() work.
>
> Fixed. Please take a look at the shmem.c changes.
>
> > (Now here is an idea: Could these just be memory contexts? Instead of making six shared memory slots, could you make six memory contexts with a special shared memory type. And ShmemAlloc becomes the allocation function, etc.?)
>
> I don't see a need to do that right now, but I will revisit this idea.
>
> > I noticed the existing code made inconsistent use of PGShmemHeader * vs. void *, which also bled into your patch. I made the attached little patch to clean that up a bit.
>
> The latest patches are rebased on top of your commit.
>
> > I suggest splitting the struct ShmemSegment into one struct for the three memory addresses and a separate array just for the slock_t's. The former struct can then stay private in storage/ipc/shmem.c; only the locks need to be exported.
> >
> > Also, maybe some of this should be declared in storage/shmem.h rather than in storage/pg_shmem.h. We have the existing ShmemLock in there, so it would be a bit confusing to have the per-segment locks elsewhere.
>
> I have described the new structures above. Please review.
>
> > Maybe rename ANON_MAPPINGS to something like NUM_ANON_MAPPINGS.
>
> Done.
>
> > > v1-0003-Introduce-multiple-shmem-slots-for-shared-buffers.patch
> > >
> > > Splits shared_buffers into multiple slots, moving out structures that depend on NBuffers into separate mappings. There are two large gaps here:
> > >
> > > * Shmem size calculation for those mappings is not correct yet, it includes too many other things (no particular issues here, just haven't had time).
> > > * It makes hardcoded assumptions about what is the upper limit for resizing, which is currently low purely for experiments. Ideally there should be a new configuration option to specify the total available memory, which would be a base for subsequent calculations.
> >
> > Yes, I imagine a shared_buffers_hard_limit setting.
> > We could maybe default that to the total available memory, but it would also be good to be able to specify it directly, for testing.
>
> The latest patches introduce max_shared_buffers instead.
>
> > > v1-0005-Use-anonymous-files-to-back-shared-memory-segment.patch
> > >
> > > Allows an anonymous file to back a shared mapping. This makes certain things easier, e.g. visual representation of the mappings, and gives an fd for possible future customizations.
> >
> > I think this could be a useful patch just by itself, without the rest of the series, because of
> >
> > > * By default, Linux will not add file-backed shared mappings into a core dump, making it more convenient to work with them in PostgreSQL: no more huge dumps to process.
> >
> > This could be a significant operational benefit.
> >
> > When you say "by default", is this adjustable? Does someone actually want the whole shared memory in their core file? (If it's adjustable, is it also adjustable for anonymous mappings?)
>
> 'man core' mentions that the /proc/[PID]/coredump_filter file controls the memory segments written to the core file. The value in the file is a bitmask of memory mapping types; the bits correspond to the flags passed to mmap(). This file can be written to by the program itself through /proc/self/coredump_filter, or externally by something like echo > /proc/PID/coredump_filter. A boot-time setting becomes the default for all programs. A child process inherits the setting from its parent through fork(), and execve() does not change it. So it looks like DBAs should be able to control what gets dumped in PostgreSQL core dumps using any of those options. If we use file-backed shared memory as the patches do, DBAs may have to change their default setting, which may not have considered file-backed shared memory. We will need to highlight this in the release notes. I have added a TODO for the same in the code for now; we will add this note in the commit message or update the relevant documentation as the patches become committable.
>
> The man page says that bits 0, 1 and 4 are set by default (anonymous private and shared mappings, and ELF headers). But on my machine I see that the default is different. It's possible that different installations use different configurations.
>
> > I'm wondering about this change:
> >
> >     -#define PG_MMAP_FLAGS (MAP_SHARED|MAP_ANONYMOUS|MAP_HASSEMAPHORE)
> >     +#define PG_MMAP_FLAGS (MAP_SHARED|MAP_HASSEMAPHORE)
> >
> > It looks like this would affect all mmap() calls, not only the one you're changing. But that's the only one that uses this macro! I don't understand why we need this; I don't see anything in the commit log about this ever being used for any portability. I think we should just get rid of it and have mmap() use the right flags directly.
>
> This is committed as c100340729b66dc46d4f9d68a794957bf2c468d8.
>
> > I see that FreeBSD has a memfd_create() function. Might be worth a try. Obviously, this whole thing needs a configure test for memfd_create() anyway.
>
> We are using memfd_create() in the latest patches. I will look into creating a configure test for memfd_create(). I have added a TODO comment for the same.
>
> > I see that memfd_create() has an MFD_HUGETLB flag. It's not very clear how that interacts with the MAP_HUGETLB flag for mmap(). Do you need to specify both of them if you want huge pages?
>
> We need both. I used the toy program attached to [0] to verify that.
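>
> The working combination boils down to this (a minimal sketch; sizing and error handling omitted, and it assumes huge pages are configured on the system):
>
>     /* Both flags are needed: MFD_HUGETLB for the anonymous file,
>      * MAP_HUGETLB for the mapping over it. */
>     int fd = memfd_create("hp_segment", MFD_HUGETLB);
>
>     ftruncate(fd, size);
>     void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
>                    MAP_SHARED | MAP_HUGETLB, fd, 0);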
> Maybe we could use a similar program for a configure test.
>
> There are three important areas of the puzzle that I will work on next:
>
> 1. Synchronization: The coordinator sends ProcSignal barriers to all the concurrent backends during resizing. The code which deals with the buffer pool needs to react to these barriers and may need to adjust its course of action. The checkpointer, the background writer, new buffer allocation, and code scanning the buffer pool sequentially are a few examples that need to react to the barrier.
>
> 2. Graceful handling of errors and failures during resizing.
>
> 3. Portability: a configure test to check whether a platform has the APIs needed to support this feature, enabling the feature in the corresponding builds. On other platforms the build succeeds and the regression tests run, but the feature is disabled.
>
> All the TODOs mentioned above, and those not covered by the above three points, are added to the code. One of them is to reduce the number of segments to just two (main and shared buffer blocks). I plan to address them as the feature matures.
>
> [0] https://www.postgresql.org/message-id/CAExHW5sVxEwQsuzkgjjJQP9-XVe0H2njEVw1HxeYFdT7u7J%2BeQ%40mail.gmail.com
> [1] https://www.postgresql.org/message-id/CAExHW5sOu8%2B9h6t7jsA5jVcQ--N-LCtjkPnCw%2BrpoN0ovT6PHg%40mail.gmail.com
> [2] https://www.postgresql.org/message-id/qltuzcdxapofdtb5mrd4em3bzu2qiwhp3cdwdsosmn7rhrtn4u%40yaogvphfwc4h
> [3] https://www.postgresql.org/message-id/12add41a-7625-4639-a394-a5563e349322%40eisentraut.org
> [4] https://www.postgresql.org/message-id/CAExHW5vTWABxuM5fbQcFkGuTLwaxuZDEE2vtx2WuMUWk6JnF4g%40mail.gmail.com
> [5] https://www.postgresql.org/message-id/CAEze2WiMkmXUWg10y%2B_oGhJzXirZbYHB5bw0%3DVWte%2BYHwSBa%3DA%40mail.gmail.com
>
> --
> Best Wishes,
> Ashutosh Bapat
