Hi Ashutosh,

I tried applying the v20260128 patches but encountered conflicts. Could you let me know the base commit they were developed against?
Thanks

On Fri, Feb 6, 2026 at 5:21 PM Ashutosh Bapat <[email protected]> wrote:
> On Fri, Nov 14, 2025 at 5:23 PM Ashutosh Bapat <[email protected]> wrote:
> >
> > Hi,
> > PFA new patchset with some TODOs from previous email addressed:
> >
> > On Mon, Oct 13, 2025 at 9:28 PM Ashutosh Bapat <[email protected]> wrote:
> > > 1. New backends join while the synchronization is going on.
> >
> > Done. Explained the solution below in detail.
> >
> > > An existing backend exiting.
> >
> > Not tested specifically, but should work.
> >
> > > 2. Failure or crash in the backend which is executing pg_resize_buffer_pool()
> >
> > Still a TODO.
> >
> > > 3. Fix crashes in the tests.
> >
> > The core regression suite passes, the pg_buffercache regression tests pass, and the tests for buffer resizing pass most of the time. So far I have seen three issues:
> > 1. An assertion from an AIO worker - which happened only once and I couldn't reproduce it again. Need to study the interaction of AIO workers with buffer resizing.
> > 2. Checkpointer crashes - which is one of the TODOs listed below.
> > 3. A shared memory ID related failure, which I don't understand but which happens more frequently than the first one. Need to look into that.
> >
> > > go through Tomas's detailed comments and address those which still apply.
> >
> > Still a TODO. But since many of those patches have been revised heavily, I think many of the comments may have been addressed, and some may not apply anymore.
>
> Thanks a lot Tomas for these review comments and the summary of the discussion as of your response. Sorry it took so long to fully respond to this email, but I wanted the patches to be in a reasonable shape before responding. It's been a while since I posted the last patchset, hence I am attaching the patchset along with responses to your comments as well as the earlier comments that you mentioned in your response. Dmitry has responded to some parts of these emails already, but I am responding to all the comments in the context of the latest implementation, which has been revised heavily since his responses. The latest design and UI are described in detail in [1]. Please note that that email also responds to some earlier comments from Andres and others; I have not repeated those responses here.
>
> The earlier patches did not build in EXEC_BACKEND mode. The attached patches build in EXEC_BACKEND mode and the regression tests pass in that build. I haven't tried running the whole test suite though.
>
> > I agree it'd be useful to be able to resize shared buffers, without having to restart the instance (which is obviously very disruptive). So if we can make this work reliably, with reasonable trade-offs (both on the backends, and also the risks/complexity introduced by the feature).
>
> Agreed. That's the goal - to make it work reliably with reasonable trade-offs. Adding some details to this goal as follows:
>
> 1. Performance
> --------------
> a. The impact of resizing on the performance of concurrent database operations should be reasonable and acceptable.
> b. The code changes should not cause significant performance degradation during normal operations.
>
> Palak Chaturvedi has already done some benchmarking. I am listing her findings here:
> 1. The code changes do not cause any noticeable performance degradation during normal operations.
> 2. Resizing itself is also reasonably fast, with very minimal, almost unnoticeable, performance impact on the concurrent transactions.
>
> We will share the benchmarks once the code is cleaner, more complete, and stable.
>
> 2. Reliability
> --------------
> We are adding tests to make sure that resizing does not introduce any crashes, hazards or data corruption. Some tests are already part of the patchset.
>
> > I'm far from an expert on mmap() and similar low-level stuff, but the current approach (reserving a big chunk of shared memory and slicing it by mmap() into smaller segments) seems reasonable.
>
> In the latest patches, we reserve address space with mmap() and manage memory within that address space using file-backed memory.
>
> > But I'm getting a bit lost in how exactly this interacts with things like overcommit, system memory accounting / OOM killer and this sort of stuff. I went through the thread and it seems to me the reserve+map approach works OK in this regard (and the messages on linux-mm seem to confirm this). But this information is scattered over many messages and it's hard to say for sure, because some of this might be relevant for an earlier approach, or a subtly different variant of it.
>
> Looking at the documentation and the messages on linux-mm, it seems that the mmap() with MAP_NORESERVE approach works well with overcommit, system memory accounting and the OOM killer. I plan to add tests verifying these aspects, so that any changes in the future do not cause any regressions. I am looking for ways to write a small test program for the same, which we can add to our test battery, but no success so far. Any ideas?
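>
> The best idea I have so far is something along these lines - a rough, untested sketch that reserves a large region with MAP_NORESERVE and checks that it is not charged against the overcommit accounting (Committed_AS in /proc/meminfo):
>
>     #include <stdio.h>
>     #include <sys/mman.h>
>
>     /* Read Committed_AS (in kB) from /proc/meminfo. */
>     static long committed_as_kb(void)
>     {
>         FILE *f = fopen("/proc/meminfo", "r");
>         char line[128];
>         long kb = -1;
>
>         while (fgets(line, sizeof(line), f) != NULL)
>             if (sscanf(line, "Committed_AS: %ld kB", &kb) == 1)
>                 break;
>         fclose(f);
>         return kb;
>     }
>
>     int main(void)
>     {
>         size_t  size = (size_t) 8 * 1024 * 1024 * 1024;  /* 8GB */
>         long    before = committed_as_kb();
>         void   *p = mmap(NULL, size, PROT_NONE,
>                          MAP_ANONYMOUS | MAP_PRIVATE | MAP_NORESERVE,
>                          -1, 0);
>         long    after = committed_as_kb();
>
>         if (p == MAP_FAILED)
>         {
>             perror("mmap");
>             return 1;
>         }
>         /* The reservation should not show up in Committed_AS. */
>         printf("Committed_AS delta: %ld kB after reserving %zu kB\n",
>                after - before, size / 1024);
>         munmap(p, size);
>         return 0;
>     }
>
> A TAP test could run this and fail if the delta comes anywhere near the reserved size, though I am not sure how stable Committed_AS is on a busy machine.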
>
> > A similar question is portability. The comments and commit messages seem to suggest most of this is Linux-specific, and other platforms just don't have these capabilities. But there's a bunch of messages (mostly by Thomas Munro) that hint FreeBSD might be capable of this too, even if to some limited extent. And possibly even Windows/EXEC_BACKEND, although that seems much trickier.
> >
> > FWIW I think it's perfectly fine to only support resizing on selected platforms, especially considering Linux is the most widely used system for running Postgres. We still need to be able to build/run on other systems, of course. And maybe it'd be good to be able to disable this even on Linux, if that eliminates some overhead and/or risks for people who don't need the feature. Just a thought.
>
> +1. The plan is to make it work on Linux first and disable it on the other platforms. I think it's a good idea to be able to disable it, at initdb time say, but I am not yet sure whether that's going to avoid any overhead or risks. I haven't yet wrapped my head around the platform dependency completely.
>
> > Anyway, my main point is that this information is important, but very scattered over the thread. It's a bit foolish to expect everyone who wants to do a review to read the whole thread (which will inevitably grow longer over time), and assemble all these pieces again and again, following all the changes in the design etc. Few people will get over that hurdle, IMHO.
> >
> > So I think it'd be very helpful to write a README, explaining the current design/approach, and summarizing all these aspects in a single place. Including things like portability, interaction with the OS accounting, OOM killer, this kind of stuff. Some of this stuff may be already mentioned in code comments, but it's hard to find those.
> >
> > Especially worth documenting are the states the processes need to go through (using the barriers), and the transitions between them (i.e. what is allowed in each phase, what blocks can be visible, etc.).
>
> That's a great idea. There is already a README in src/backend/storage/buffer/; I have added a section on buffer pool resizing. The section has pointers to the relevant code comments. I will revise this to be as self-sufficient as possible as the patches mature. The code managing the shared memory is scattered across backend/portability, storage/ipc etc., so it's slightly troublesome to find one good place for a README which explains everything.
>
> > I'll go over some higher-level items first, and then over some comments for individual patches.
> >
> > 1) no user docs
> >
> > There are no user .sgml docs, and maybe it's time to write some, explaining how to use this thing - how to configure it, how to trigger the resizing, etc. It took me a while to realize I need to do ALTER SYSTEM + pg_reload_conf() to kick this off.
>
> The latest patches have updated config.sgml and func-admin.sgml to document the functionality. Please review them.
>
> > It should also document the user-visible limitations, e.g. what activity is blocked during the resizing, etc.
>
> We are still figuring this out, but we will document the limitations in func-admin.sgml. Right now there are no limitations, and my intention is to keep the limitations as limited as possible.
>
> > 2) pending GUC changes
> >
> > I'm somewhat skeptical about the GUC approach. I don't think it was designed with this kind of use case in mind, and so I think it's quite likely it won't be able to handle it well.
> >
> > For example, there's almost no validation of the values, so how do you ensure the new value makes sense? Because if it doesn't, it can easily crash the system (I've seen such crashes repeatedly, I'll get to that). Sure, you may do ALTER SYSTEM to set shared_buffers to nonsense and it won't start after restart/reboot, but crashing an instance is maybe a little bit more annoying.
> >
> > Let's say we did the ALTER SYSTEM + pg_reload_conf(), and it gets stuck waiting on something (can't evict a buffer or something). How do you cancel it, when the change is already written to the .auto.conf file? Can you simply do ALTER SYSTEM + pg_reload_conf() again?
> >
> > It also seems a bit strange that the "switch" gets to be driven by a randomly selected backend (unless I'm misunderstanding this bit). It seems to be true for the buffer eviction during shrinking, at least.
> >
> > Perhaps this should be a separate utility command, or maybe even just a new ALTER SYSTEM variant? Or even just a function, similar to what the "online checksums" patch did, possibly combined with a bgworker (but probably not needed, there are no db-specific tasks to do).
>
> As documented in func-admin.sgml and config.sgml, resizing the buffer pool requires ALTER SYSTEM + pg_reload_conf() followed by pg_resize_shared_buffers(). There is no pending flag, and no arbitrary backend driving the process. Validation and failures are handled by pg_resize_shared_buffers(). The backend in which pg_resize_shared_buffers() is called is responsible for driving the resizing process.
> I think the new implementation takes care of all the concerns mentioned above.
>
> > 3) max_available_memory
> >
> > Speaking of GUCs, I dislike how max_available_memory works. It seems a bit backwards to me. I mean, we're specifying shared_buffers (and some other parameters), and the system calculates the amount of shared memory needed. But the limit determines the total limit?
> >
> > I think the GUC should specify the maximum shared_buffers we want to allow, and then we'd work out the total to pre-allocate? Considering we're only allowing to resize shared_buffers, that should be pretty trivial. Yes, it might happen that the "total limit" happens to exceed the available memory or something, but we already have that problem with shared_buffers. Seems fine if we explain this in the docs, and perhaps print the calculated memory limit on start.
> >
> > In any case, we should not allow setting a value that ends up overflowing the internal reserved space. It's true we don't have a good way to do checks for GUCs, but it's a bit silly to crash because of hitting some non-obvious internal limit that we don't necessarily know about.
> >
> > Maybe this is a reason why GUC hooks are not a good way to set this.
>
> The latest patches introduce max_shared_buffers, which specifies the maximum shared buffers allowed. max_available_memory, which some previous patchsets had, has been removed. This GUC is mentioned in the documentation changes.
>
> One relatively small thing to think about is how we classify the shared_buffers GUC - it's not PGC_SIGHUP per se, since it requires a function to be called after the reload, but it's not PGC_POSTMASTER either, since a restart is not required. I think we need a new PGC_ context for it but haven't yet thought of a good name. Do you have any suggestions?
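>
> For reference, the current entry in guc_tables.c looks roughly like this (abbreviated, quoted from memory); the open question is what replaces PGC_POSTMASTER in the second line:
>
>     {
>         {"shared_buffers", PGC_POSTMASTER, RESOURCES_MEM,
>             gettext_noop("Sets the number of shared memory buffers used by the server."),
>             NULL,
>             GUC_UNIT_BLOCKS
>         },
>         &NBuffers,
>         16384, 16, INT_MAX / 2,
>         NULL, NULL, NULL
>     },
>
> Whatever we call the new context, the GUC machinery would need to know that the assign hook alone does not complete the change.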
>
> > 4) SHMEM_RESIZE_RATIO
> >
> > The SHMEM_RESIZE_RATIO thing seems a bit strange too. There's no way these ratios can make sense. For example, BLCKSZ is 8192 but the buffer descriptor is 64B. That's a 128x difference, but the ratios say 0.6 and 0.1, so 6x. Sure, we'll actually allocate only the memory we need, and the rest is only "reserved".
> >
> > However, that just makes the max_available_memory a bit misleading, because you can't ever use it. You can use the 60% for shared buffers (which is not mentioned anywhere, and good luck not overflowing that, as it's never checked), but those smaller regions are guaranteed to be mostly unused. Unfortunate.
> >
> > And it's not just a matter of fixing those ratios, because then someone rebuilds with 32kB blocks and you're in the same situation.
> >
> > Moreover, all of the above is for mappings sized based on NBuffers. But if we allocate 10% for MAIN_SHMEM_SEGMENT, won't that be a problem the moment someone increases max_connections, max_locks_per_transaction and possibly some other stuff?
>
> +1. Fixed. max_shared_buffers only deals with the shared buffers.
>
> > 5) no tests
> >
> > I mentioned no "user docs", but the patch has 0 tests too. Which seems a bit strange for a patch of this age.
> >
> > A really serious part of the patch series seems to be the coordination of processes when going through the phases, enforced by the barriers. This seems like a perfect match for testing using injection points, and I know we did something like this in the online checksums patch, which needs to coordinate processes in a similar way.
> >
> > But even just a simple TAP test that does a bunch of (random?) resizes while running a pgbench seems better than no tests. (That's what I did manually, and it crashed right away.)
> >
> > There's a lot more stuff to test here, I think. Idle sessions with buffers pinned by open cursors, multiple backends doing ALTER SYSTEM + pg_reload_conf() concurrently, other kinds of failures.
>
> +1. The latest patches already have a few tests, but we are adding more tests to cover different scenarios.
>
> > 6) SIGBUS failures
> >
> > As mentioned, I did some simple tests with shrink/resize with a pgbench in the background, and it almost immediately crashed for me :-( With a SIGBUS, which I think is fairly rare on x86 (definitely much less common than e.g. SIGSEGV).
> >
> > An example backtrace attached.
>
> There is a test which runs pgbench concurrently with resizing, and it fails much less frequently (1 in 20 or even lower). Quite likely the crash you have seen is also fixed. Please let me know if the test fails for you, and whether there is something in your test that is not in that test.
>
> > 7) EXEC_BACKEND, FreeBSD
> >
> > We clearly need to keep this working on systems without the necessary bits (so likely EXEC_BACKEND, FreeBSD etc.). But the builds currently fail in both cases, it seems.
> >
> > I think it's fine to not support resizing on every platform, otherwise we'd never get it, but it still needs to build. It would be good to not have two very different code versions, one for resizing and one without it, though. I wonder if we can just have the "no-resize" version use the same struct (with the segments/mapping, ...) and all that, but skipping the space reservation.
>
> I have not thought fully through support on platforms other than Linux. However, my current plan is to not support the GUC max_shared_buffers and the resizing function pg_resize_shared_buffers() on platforms which do not have the necessary bits. The code will still build and run on those platforms, but resizing shared buffers without a restart won't be possible.
>
> > 8) monitoring
> >
> > So, let's say I start a resize of shared buffers. How will I know what it's currently doing, how much longer it might take, what it's waiting for, etc.? I think it'd be good to have progress monitoring, through the regular system view (e.g. pg_stat_shmem_resize_progress?).
>
> When pg_resize_shared_buffers() finishes, the user knows that the resizing is finished. It usually takes a few seconds for the resizing to finish, so I am not sure whether we will need progress reporting. But there might be cases where the resizing has to wait for something, or a given phase, e.g. eviction, takes longer. So we may require progress reporting. I have added a TODO in the code for the same.
>
> > 10) what to do about stuck resize?
> >
> > AFAICS the resize can get stuck for various reasons, e.g. because it can't evict pinned buffers, possibly indefinitely. Not great, and it's not clear to me if there's a way out (canceling the resize) after a timeout, or something like that? Not great to start an "online resize" only to get stuck with all activity blocked for an indefinite amount of time, and have to restart anyway.
> >
> > Seems related to Thomas' message [2], but AFAICS the patch does not do anything about this yet, right? What's the plan here?
>
> pg_resize_shared_buffers() is affected by timeouts like statement_timeout, as well as query cancellation. However, the current implementation lacks graceful handling of these events. Another TODO.
>
> > 11) preparatory actions?
> >
> > Even if it doesn't get stuck, some of the actions can take a while, like evicting dirty buffers before shrinking, etc. This is similar to what happens on restart, when the shutdown checkpoint can take a while, while the system is (partly) unavailable.
> >
> > The common mitigation is to do an explicit checkpoint right before the restart, to make the shutdown checkpoint cheap. Could we do something similar for the shrinking, e.g. flush buffers from the part to be removed before actually starting the resize?
>
> Eviction is carried out before any changes to the memory or the shared buffer metadata. If eviction fails, the resizing is rolled back and the system continues to work with the old buffer pool size. A test specifically covering this aspect remains to be added.
>
> > 12) does this affect e.g. fork() costs?
> >
> > I wonder if this affects the cost of fork() in some undesirable way? Could it make fork() measurably more expensive?
>
> AFAIK, a new backend needs to attach to the extra shared memory segments and detach from them when exiting, but I don't think there's any other extra work that a backend needs to do when starting or exiting. Windows might be different. I haven't seen any noticeable difference in pgbench performance with a new connection for every transaction. However, we haven't tried a benchmark with empty transactions, which would exercise fork() and exit() at a higher frequency. Another TODO.
>
> > 13) resize "state" is all over the place
> >
> > For me, a big hurdle when reasoning about the resizing correctness is that there's quite a lot of distinct pieces tracking what the current "state" is. I mean, there's:
> >
> > - ShmemCtrl->NSharedBuffers
> > - NBuffers
> > - NBuffersOld
> > - NBuffersPending
> > - ... (I'm sure I missed something)
> >
> > There's no cohesive description of how this fits together; it seems a bit "ad hoc". Could be correct, but I find it hard to reason about.
>
> The next set of patches will consolidate the state in only two places: NBuffersPending and StrategyControl, which needs a new name. But even in the attached patches it's consolidated in NBuffersPending, ShmemCtrl and StrategyControl. There are instances of NBuffers which will be replaced by variables from ShmemCtrl or StrategyControl.
>
> > 14) interesting messages from the thread
> >
> > While reading through the thread, I noticed a couple messages that I think are still relevant:
> >
> > - I see Peter E posted some review in 2024/11 [3], but it seems his comments were mostly ignored. I agree with most of them.
>
> Find a detailed reply to that email at the end.
>
> > - Robert mentioned a couple interesting failure scenarios in [4], not sure if all of this was handled. He however assumes pointers would not be stable (and that's something we should not allow, and the current approach works OK in this regard, I think). He also outlines how it'd happen in phases - this would be useful for the design README I think. It also reminds me of the "phases" in the checksums patch.
>
> Pointers are stable in the latest patch. The phases are explained in the prologue of pg_resize_shared_buffers().
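>
> In rough outline, the flow condenses to the following (illustrative names only, not the actual code; the prologue is the authoritative description):
>
>     /* Illustrative only; names and granularity do not match the patch. */
>     typedef enum ShmemResizePhase
>     {
>         RESIZE_PHASE_VALIDATE,  /* check the new size against max_shared_buffers */
>         RESIZE_PHASE_EVICT,     /* shrinking only: evict buffers beyond the new
>                                  * size; a pinned buffer aborts and rolls back */
>         RESIZE_PHASE_REMAP,     /* grow/shrink the memory backing the segments */
>         RESIZE_PHASE_METADATA,  /* update buffer descriptors, StrategyControl etc. */
>         RESIZE_PHASE_DONE
>     } ShmemResizePhase;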
>
> > - Robert asked [5] if Linux might abruptly break this, but I find that unlikely. We'd point out we rely on this, and they'd likely rethink. This would be made safer if this was specified by POSIX - taking that away once implemented seems way harder than for custom extensions. It's likely they'd not take away the feature without an alternative way to achieve the same effect, I think (yes, harder to maintain). Tom suggests [7] this is not in POSIX.
> >
> > - Matthias mentioned [6] similar flags on other operating systems. Could some of those be used to implement the same resizing?
>
> Part of the portability TODO.
>
> > - Andres had an interesting comment about how overcommit interacts with MAP_NORESERVE. AFAIK it means we need the flag to not break overcommit accounting. There are also some comments from the linux-mm people [9].
>
> The new implementation uses MAP_NORESERVE. See my earlier response about overcommit.
>
> > - There seem to be some issues with releasing memory backing a mapping with hugetlb [10]. With the fd (and truncating the file), this seems to release the memory, but it's Linux-specific? But most of this stuff is specific to Linux, it seems. So is this a problem? With this it should be working even for hugetlb ...
>
> Right. mmap() + ftruncate(), instead of mmap() + mremap(), allows us to avoid going through the postmaster, which makes the implementation much simpler, and it also supports hugetlb. But it will be Linux-only.
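>
> To illustrate the idea (a simplified sketch, not the patch's code; error handling omitted, and memfd_create() needs _GNU_SOURCE and a reasonably recent glibc): the full reserved range is mapped against an anonymous file exactly once, and ftruncate() alone then grows or shrinks the memory actually backing it, so no pointer ever moves and the postmaster is not involved:
>
>     #define _GNU_SOURCE
>     #include <sys/mman.h>
>     #include <unistd.h>
>
>     /* Reserve the maximum address space once, map the anonymous file
>      * over all of it, and let the file size control the usable part. */
>     void *
>     reserve_segment(size_t max_size, size_t initial_size, int *fdp)
>     {
>         int     fd = memfd_create("pg_shmem_segment", 0);
>         void   *base;
>
>         ftruncate(fd, initial_size);
>         base = mmap(NULL, max_size, PROT_READ | PROT_WRITE,
>                     MAP_SHARED | MAP_NORESERVE, fd, 0);
>         *fdp = fd;
>         return base;            /* stable for the life of the mapping */
>     }
>
>     /* Resizing is then just a matter of changing the file size. */
>     void
>     resize_segment(int fd, size_t new_size)
>     {
>         ftruncate(fd, new_size);
>     }
>
> An access beyond the truncated size raises SIGBUS, which is also why the backends must agree on the new size before the file actually shrinks.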
>
> > - It seems FreeBSD has MFD_HUGETLB [11], so maybe we could use this and make the hugetlb stuff work just like on Linux? Unclear. Also, I thought the mfd stuff is Linux-specific ... or am I confused?
>
> Part of the portability TODO.
>
> > - Andres objected to any approach without pointer stability, and I agree with that. If we can figure out such a solution, of course.
>
> Since we call mmap() only once, the address of the mapping does not change; it is always stable. So pointer stability is guaranteed.
>
> > - Thomas asked [13] why we need to stop all the backends, instead of just waiting for them to acknowledge the new (smaller) NBuffers value and then letting them continue. I also don't quite see why this should not work, and it'd limit the disruption when we have to wait for eviction of buffers pinned by paused cursors, etc.
>
> The approach in the latest patches does not stop all backends. Backends continue to work while resizing is in progress; they are synchronized at certain points using barriers. The details of the synchronization for each subsystem and worker backend need to be worked out. We are working on that.
>
> > Now, some comments about the individual patches (some of this may be a bit redundant with the earlier points):
>
> Since these patches have been heavily rewritten as part of redesigning and reimplementing the resizing process, some of the comments may not be relevant any more. I will try to answer the underlying concerns wherever applicable.
>
> > v5-0001-Process-config-reload-in-AIO-workers.patch
> >
> > 1) Hmmm, so which other workers may need such explicit handling? Do all other processes participate in the procsignal stuff, or does anything need explicit handling?
>
> As Andres mentioned in [2], this is not needed. It's not part of the latest patches.
>
> > v5-0003-Introduce-pss_barrierReceivedGeneration.patch
> >
> > 1) Do we actually need this? Isn't it enough to just have two barriers? Or a barrier + condition variable, or something like that.
> >
> > 2) The comment talks about a "coordinated way" when processing messages, but it's not very clear to me. It should explain what is needed and not possible with the current barrier code.
> >
> > 3) This very much reminds me of what the online checksums patch needed to do, and we managed to do it using plain barriers. So why does this need this new thing? (No opinion on whether it's correct.)
>
> This isn't required in the latest implementation. Removed from the latest patchset.
>
> > v5-0004-Allow-to-use-multiple-shared-memory-mappings.patch
> >
> > 1) "int shmem_segment" - wouldn't it be better to have a separate enum for this? I mean, we'll have a predefined list of segments, right?
>
> +1. Andres suggested [2] keeping only two segments, so an enum may be superfluous, but it may be good to have if we need more segments in the future. A TODO for now.
>
> > 2) typedef struct AnonymousMapping would deserve some comment
>
> This structure no longer exists. Instead we use MemoryMappingSizes to hold the required sizes of the shared memory segments.
>
> > 3) ANON_MAPPINGS - Probably should be MAX_ANON_MAPPINGS? But we'll know how many we have, so why not allocate exactly the right number? Or even just an array of structs, like in similar cases?
>
> +1. Renamed to NUM_MEMORY_MAPPINGS; it is used to declare and traverse the corresponding arrays.
>
> > 4) static int next_free_segment = 0;
> >
> > We know exactly what segments we'll create and in which order, no? So why do we even bother with this next_free_segment thing? Can't we simply declare an array of AnonymousMapping elements, with all the elements, and then just walk it and calculate the sizes/pointers?
>
> next_free_segment is removed from the latest patches, as it's not needed.
>
> > 5) I'm a bit confused about the segment/mapping difference. The patch seems to randomly mix those, or maybe I'm just confused. I mean, we are creating just one shmem segment, and the pieces are mappings, right? So why do we index them by "shmem_segment"?
> >
> > Also, consider
> >
> >     CreateAnonymousSegment(AnonymousMapping *mapping)
> >
> > so is that creating a segment or a mapping? Or what's the difference? Or are we creating multiple segments, and I missed that? Or are there different "segment" concepts, or what?
> >
> > 6) There should probably be some sort of API wrapping the mappings, so that the various places don't need to mess with next_free_segments directly, etc. Perhaps PGSharedMemoryCreate() shouldn't do this, and should just pass the size to CreateAnonymousSegment(), which then finds the empty slot in Mappings, etc.? Not sure that'll work, but it's a bit error-prone if a struct is modified from multiple places like this.
>
> Fixed this confusion in the latest patches. There are multiple segments, each mapped to a different address space. Since we are using mmap() + ftruncate(), the memory allocated for each segment is controlled by a separate fd. If we used a single address space reservation and carved multiple segments out of it, we could not resize each segment separately. Just to clarify, we do not reserve one large part of the address space and then carve it into smaller segments.
>
> > 7) We should remember which segments got to use huge pages and which did not. And we should make it optional for each segment.
> > Although, maybe I'm just confused about the "segment" definition - if we only have one, that's where huge pages are applied.
> >
> > If we could have multiple segments for different segments (whatever that means), it's not clear what we'd report for cases when some segments get to use huge pages and others don't - either because we don't want huge pages for some segments, or because we happen to run out of the available huge pages.
>
> This is an interesting idea. In the current implementation, either we use huge pages for all the segments or for none of them. I think per-segment huge page usage could be an add-on feature in the next version. What do you think? Using separate segments for reserving separate address spaces will make it easy to use huge pages for some and not for others.
>
> > 8) It seems PGSharedMemoryDetach got some significant changes, but the comment was not modified at all. I'd guess that means the comment is perhaps stale, or maybe there's something we should mention.
>
> Done.
>
> > 9) I doubt the Assert on GetConfigOption needs to be repeated for all segments (in CreateSharedMemoryAndSemaphores).
>
> Done.
>
> > 10) Why do we have the Mappings and Segments indexed in different ways? I mean, Mappings seem to be filled FIFO (just grab the next free slot), while Segments are indexed by segment ID.
>
> Fixed this confusion in the latest patches. Both are indexed by segment ID.
>
> > 11) Actually, what's the difference between the contents of Mappings and Segments? Isn't that the same thing, indexed in the same way? Or could it be unified? Or are they conceptually different things?
>
> See the explanation above.
>
> > 12) I believe we'll have a predefined list of segments, with fixed IDs, so why not just have a MAX of those IDs as the capacity?
>
> Yes. Fixed in the latest patches.
>
> > 13) Would it be good to have some checks on shmem_segment values? That it's valid with respect to defined segments, etc. An assert, maybe? What about some asserts on the Mapping/Segment elements? To check that the element is sensible, and that the arrays "match" (if we need both).
>
> That's a good idea. Added Asserts to that effect.
>
> > 14) Some of the lines got pretty long, e.g. in pg_get_shmem_allocations. I suggest we define some macros to make this shorter, or something like that.
>
> Done.
>
> > 15) I'd maybe rename ShmemSegment to PGShmemSegment, for consistency with PGShmemHeader?
>
> Actually, we track information about the shared memory segments in multiple places. shmem.c has APIs similar to MemoryContext but for shared memory; the information there is consolidated into the ShmemSegment structure. pg_shmem.h has platform-independent APIs to manage shared memory segments, and each implementation has implementation-specific information. Some of that information is passed from the parent process (postmaster) to the child processes (PostgreSQL backends). I have created PGInhShmemSegment for the information that is passed from the postmaster to the backends, and an AnonymousShmemSegment structure for tracking information about anonymous shared memory in sysv_shmem.c. I don't like the name PGInhShmemSegment, but I haven't figured out a better one. I may rearrange these structures a bit more in the next patches.
>
> > 16) Is MAIN_SHMEM_SEGMENT something we want to expose in a public header file?
> > Seems very much like an internal thing; people should access it only through APIs ...
>
> If we convert it to an enum, we can't avoid exposing it in a public header file. In the future, we may allow users to allocate memory in specific segments, so I think it's fine to expose it.
>
> > v5-0005-Address-space-reservation-for-shared-memory.patch
> >
> > 1) Shouldn't reserved_offset and huge_pages_on really be in the segment info? Or maybe even in the mapping info? (again, maybe I'm confused about what these structs store)
>
> I can't find reserved_offset in the latest patches. huge_pages_on is now a global flag; either all segments use huge pages or none of them do. As mentioned earlier, per-segment huge page usage may be a separate add-on feature.
>
> > 2) The CreateSharedMemoryAndSemaphores comment is rather light on what it does, considering it now reserves space and then carves it into segments.
>
> The detailed comment is in PGSharedMemoryCreate(), which seems to be a better place for explaining memory reservation etc. I just changed the comment here to mention plural shared memory segments and to differentiate between the shared memory segments and the shared memory structures. Please let me know if this looks good.
>
> > 3) So ReserveAnonymousMemory is what makes decisions about huge pages, for the whole reserved space / all segments in it. That's a bit unfortunate with respect to the desirability of some segments benefiting from huge pages and others not. Maybe we should have two "reserved" areas, one with huge pages, one without?
>
> See my responses above about mapping vs. segment and per-segment huge page usage.
>
> > I guess we don't want too many segments, because that might make fork() more expensive, etc. Just guessing, though. Also, how would this work with threading?
>
> See my earlier response about fork() performance, and also about limiting the number of segments to just 2. If we see fork() getting expensive, we can limit the number of segments to 1.
>
> > 4) Any particular reason to define max_available_memory as GUC_UNIT_BLOCKS and not GUC_UNIT_MB? Of course, if we change this to have a "max shared buffers limit" then it'd make sense to use blocks, but a "total limit" is not in blocks.
>
> Right. See the response about max_shared_buffers above.
>
> > 5) The general approach seems sound to me, but I'm no expert on this. I wonder how portable this behavior is. I mean, will it work on other Unix systems / Windows? Is it POSIX or a Linux extension?
>
> See my earlier response about portability.
>
> > 6) It might be a good idea to have Assert procedures to check mappings and segments (that they don't overflow the reserved space, etc.). It took me ages to realize I can change shared_buffers to >60% of the limit; it'll happily oblige and then just crash with OOM when calling mprotect().
>
> This shouldn't happen with the latest patches. There are tests for the same. Please let me know if you still face the issue in your tests.
>
> > v5-0006-Introduce-multiple-shmem-segments-for-shared-buff.patch
> >
> > 1) I suspect the SHMEM_RESIZE_RATIO is the wrong direction, because it entirely ignores relationships between the parts. See the earlier comment about this.
> >
> > 2) In fact, what happens if the user tries to resize to a value that is too large for one of the segments? How would the system know before starting the resize (and failing)?
>
> +1.
> No SHMEM_RESIZE_RATIO in the latest patches. The reservation is entirely based on max_shared_buffers.
>
> > 3) It seems wrong to modify BufferManagerShmemSize like this. It's probably better to have a "...SegmentSize" function for individual segments, and let BufferManagerShmemSize() still return the sum of all segments.
>
> I have reimplemented CalculateShmemSize() in the latest patches. Please review.
>
> > 4) I think MaxAvailableMemory is the wrong abstraction, because that's not what people specify. See the earlier comment.
>
> +1. No MaxAvailableMemory in the latest patches.
>
> > 5) Let's say we change the shared memory size (ALTER SYSTEM) and trigger the config reload (pg_reload_conf). But then we find that we can't actually shrink the buffers, for some unpredictable reason (e.g. there are pinned buffers). How do we "undo" the change? We can't really undo the ALTER SYSTEM; that's already written in the .conf and we don't know the old value, IIRC. Is it reasonable to start killing backends from the assign_hook or something? Seems weird.
>
> The buffer pool resizing is done using the function pg_resize_shared_buffers(), which does not need to roll back the config value. As mentioned earlier, appropriate error handling is still a TBD.
>
> > v5-0007-Allow-to-resize-shared-memory-without-restart.patch
> >
> > 1) Why would AdjustShmemSize be needed? Isn't that a sign of a bug somewhere in the resizing?
>
> It has been replaced by BufferManagerShmemResize() in the latest patches. BufferManagerShmemResize() only resizes the buffer-manager-related shared memory segments and data structures.
>
> > 2) Isn't the pg_memory_barrier() in CoordinateShmemResize a bit weird? Why is it needed, exactly? If it's to flush stuff for processes consuming EmitProcSignalBarrier, isn't that too late? What if a process consumes the barrier between the emit and the memory barrier?
> >
> > 3) WaitOnShmemBarrier seems a bit under-documented.
>
> Both of these functions are not required in the new implementation. Removed from the latest patches.
>
> > 4) Is this actually adding buffers to the freelist? I see buf_init only links the new buffers by setting freeNext, but where are the new buffers added to the existing freelist?
>
> The freelist does not exist anymore.
>
> > 5) The issue with a new backend seeing an old NBuffers value reminds me of the "support enabling checksums online" thread, where we ran into similar race conditions. See message [1], the part about race #2 (the other race might be relevant too, not sure). It's been a while, but I think our conclusion in that thread was that the "best" fix would be to change the order of steps in InitPostgres(), i.e. set up the ProcSignal stuff first, and only then "copy" the NBuffers value. And handle the possibility that we receive "duplicate" barriers.
>
> I plan to remove NBuffers entirely and instead use shared memory variables, so that we don't have to worry about backends inheriting stale values from the postmaster, or even involving the postmaster in the resizing process. This is being worked on and will be available in a future patchset.
>
> > 6) In fact, the online checksums thread seems like a possible source of inspiration for some of the issues, because it needs to do similar stuff (e.g. make sure all backends follow steps in a synchronized way, etc.). And it didn't need new types of Barrier to do that.
>
> Right. The new implementation does not require any changes to the barrier implementation.
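>
> The existing ProcSignal barrier machinery suffices; the coordination step boils down to something like this sketch (EmitProcSignalBarrier() and WaitForProcSignalBarrier() are the existing functions, while the barrier type name here is made up):
>
>     /* Ask every live backend to run its resize barrier handler, and
>      * wait until all of them have absorbed the barrier. */
>     uint64 generation = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SHMEM_RESIZE);
>
>     WaitForProcSignalBarrier(generation);
>     /* From here on, no backend is still using the old buffer pool size. */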
>
> > 7) Also, this seems like a perfect match for testing using injection points. In fact, there's not a single test in the whole patch series. Or a single line of .sgml docs, for that matter. It took me a while to realize I'm supposed to change the size by ALTER SYSTEM + reloading the config.
>
> There are some tests in the latest patches. More tests are being added.
>
> > v5-0008-Support-shrinking-shared-buffers.patch
> >
> > 1) Why is ShmemCtrl->evictor_pid reset in AnonymousShmemResize? Isn't there a place starting it and waiting for it to complete? Why shouldn't it do EvictExtraBuffers itself?
>
> evictor_pid is not used in the latest patches. Eviction is done in pg_resize_shared_buffers() itself, by the backend which executes that function.
>
> > 2) Isn't the change to BufferManagerShmemInit wrong? How do we know the last buffer is still at the end of the freelist? Seems unlikely.
>
> No freelist anymore.
>
> > 3) Seems a bit strange to do it from a random backend. Shouldn't it be the responsibility of a process like checkpointer/bgwriter, or maybe a dedicated dynamic bgworker? Can we even rely on a backend to be available?
>
> The backend which executes pg_resize_shared_buffers() coordinates the resizing itself. I have not seen a need for it to use another background worker yet.
>
> > 4) Unsolved issues with buffers pinned for a long time. Could be an issue if a buffer is pinned indefinitely (e.g. a cursor in an idle connection), and the resizing blocks some activity (new connections or stuff like that).
>
> Resizing stops immediately after it encounters a pinned buffer. More sophisticated handling of such situations is a TBD, and mostly v2 material.
>
> > 5) Funny that "AI suggests" something, but doesn't the block fail to reset nextVictimBuffer of the clocksweep? It may point to a buffer we're removing, and it'll be invalid, no?
>
> TODO.
>
> > 6) It's not clear to me in what situations this triggers (in the call to BufferManagerShmemInit)
> >
> >     if (FirstBufferToInit < NBuffers) ...
>
> That code does not exist in the latest patches.
>
> > v5-0009-Reinitialize-StrategyControl-after-resizing-buffe.patch
> >
> > 1) IMHO this should be included in the earlier resize/shrink patches, I don't see a reason to keep it separate (assuming this is the correct way, and the "init" is not).
>
> Right. Merged into the resizing patch in the latest patches.
>
> > 2) Doesn't StrategyPurgeFreeList already do some of this for the case of shrinking memory?
>
> No freelist anymore.
>
> > 3) Not great adding a bunch of static variables to bufmgr.c. Why do we need to make "everything" static global? Isn't it enough to make only the "valid" flag global? The rest can stay local, no?
> >
> > If everything needs to be global for some reason, could we at least make it a struct, to group the fields, not just separate random variables? And maybe at the top, not half-way through the file?
> >
> > 4) Isn't the name BgBufferSyncAdjust misleading? It's not adjusting anything, it's just invalidating the info about past runs.
>
> Being worked on.
>
> > 5) I don't quite understand why BufferSync needs to do the dance with delay_shmem_resize. I mean, we certainly should not run BufferSync from the code that resizes buffers, right?
> > Certainly not after the eviction, from the part that actually rebuilds the shmem structs etc. So perhaps something could trigger a resize while we're running BufferSync()? Isn't that a bit strange? If this flag is needed, it seems more like a band-aid for some issue in the architecture.
> >
> > 6) Also, why should it be fine to get into a situation where some of the buffers might not be valid, during shrinking? I mean, why should we need this check (pg_atomic_read_u32(&ShmemCtrl->NSharedBuffers) != NBuffers)? It seems better to ensure we never get into "sync" in a way that might leave some of the buffers invalid. Seems way too low-level to care about whether a resize is happening.
> >
> > 7) I don't understand the new condition for "Execute the LRU scan". Won't this stop the LRU scan even in cases when we want it to happen? Don't we want to scan the buffers in the remaining part (after shrinking), for example? Also, we already checked this shmem flag at the beginning of the function - sure, it could change (if some other process modifies it), but does that make sense? Wouldn't it cause problems if it can change at an arbitrary point while running BufferSync? IMHO just another sign it may not make sense to allow this, i.e. buffer sync should not run during the "actual" resize.
>
> Resizing runs for a few seconds. With the default BufferSync frequency of 200ms, there will be a few cycles of BufferSync during resizing. If we don't let BufferSync run during resizing, we might have more dirty buffers to tackle post-resizing. I think it would be wise to let it run during resizing, on the portion of the buffer pool which is not being removed. Working on that.
>
> > v5-0010-Additional-validation-for-buffer-in-the-ring.patch
> >
> > 1) So the problem is we might create a ring before shrinking shared buffers, and then GetBufferFromRing will see bogus buffers? OK, but we should be more careful with these checks, otherwise we'll miss real issues when we incorrectly get an invalid buffer. Can't the backends do this only when they know for sure we did shrink the shared buffers? Or maybe even handle that during the barrier?
>
> AFAIK, the buffer rings are inside the Scan nodes, which are not accessible globally or to the barrier handling functions, so we can't purge the rings just once, during or after resizing. If we already have a mechanism to access the rings globally, and hence in the barrier code, or we can develop such a mechanism, we can purge the rings only once. Please let me know if you have any ideas on this.
>
> > 2) IMHO it's a sign the "transitions" between different NBuffers values may not be clear enough, and we're allowing stuff to happen in the "blurry" area. I think that's likely to cause bugs (it did cause issues for the online checksums patch, I think).
>
> NBuffers is going to be replaced by shared memory variables. The current code relies on the buffer pool size being static for the entire lifetime of the server; we are now changing that assumption. So yes, there will be some "blurry" areas and bugs to tackle. We are developing tests to uncover those bugs and fix them.
>
> Here's a reply to the email by Peter E [3].
>
> On Tue, Nov 19, 2024 at 6:27 PM Peter Eisentraut <[email protected]> wrote:
> >
> > On 18.10.24 21:21, Dmitry Dolgov wrote:
> > > v1-0001-Allow-to-use-multiple-shared-memory-mappings.patch
> > >
> > > Preparation, introduces the possibility to work with many shmem mappings. To make it less invasive, I've duplicated the shmem API to extend it with the shmem_slot argument, while redirecting the original API to it. There are probably better ways of doing that, I'm open for suggestions.
> >
> > After studying this a bit, I tend to think you should just change the existing APIs in place. So for example,
> >
> >     void *ShmemAlloc(Size size);
> >
> > becomes
> >
> >     void *ShmemAlloc(int shmem_slot, Size size);
> >
> > There aren't that many callers, and all these duplicated interfaces almost add more new code than they save.
> >
> > It might be worth making exceptions for interfaces that are likely to be used by extensions. For example, I see pg_stat_statements using ShmemInitStruct() and ShmemInitHash(). But that seems to be it. Are there any other examples out there? Maybe there are many more that I don't see right now. But at least for the initialization functions, it doesn't seem worth it to preserve the existing interfaces exactly.
> >
> > In any case, I think the slot number should be the first argument. This matches how MemoryContextAlloc() or also talloc() work.
>
> Fixed. Please take a look at the shmem.c changes.
>
> > (Now here is an idea: Could these just be memory contexts? Instead of making six shared memory slots, could you make six memory contexts with a special shared memory type. And ShmemAlloc becomes the allocation function, etc.?)
>
> I don't see a need to do that right now, but I will revisit this idea.
>
> > I noticed the existing code made inconsistent use of PGShmemHeader * vs. void *, which also bled into your patch. I made the attached little patch to clean that up a bit.
>
> The latest patches are rebased on top of your commit.
>
> > I suggest splitting the struct ShmemSegment into one struct for the three memory addresses and a separate array just for the slock_t's. The former struct can then stay private in storage/ipc/shmem.c; only the locks need to be exported.
> >
> > Also, maybe some of this should be declared in storage/shmem.h rather than in storage/pg_shmem.h. We have the existing ShmemLock in there, so it would be a bit confusing to have the per-segment locks elsewhere.
>
> I have described the new structures above. Please review.
>
> > Maybe rename ANON_MAPPINGS to something like NUM_ANON_MAPPINGS.
>
> Done.
>
> > > v1-0003-Introduce-multiple-shmem-slots-for-shared-buffers.patch
> > >
> > > Splits shared_buffers into multiple slots, moving out structures that depend on NBuffers into separate mappings. There are two large gaps here:
> > >
> > > * Shmem size calculation for those mappings is not correct yet, it includes too many other things (no particular issues here, just haven't had time).
> > > * It makes hardcoded assumptions about what is the upper limit for resizing, which is currently low purely for experiments. Ideally there should be a new configuration option to specify the total available memory, which would be a base for subsequent calculations.
> >
> > Yes, I imagine a shared_buffers_hard_limit setting.
> > We could maybe default that to the total available memory, but it would also be good to be able to specify it directly, for testing.
>
> The latest patches introduce max_shared_buffers instead.
>
> > > v1-0005-Use-anonymous-files-to-back-shared-memory-segment.patch
> > >
> > > Allows an anonymous file to back a shared mapping. This makes certain things easier, e.g. visual representation of the mappings, and gives an fd for possible future customizations.
> >
> > I think this could be a useful patch just by itself, without the rest of the series, because of
> >
> > > * By default, Linux will not add file-backed shared mappings into a core dump, making it more convenient to work with them in PostgreSQL: no more huge dumps to process.
> >
> > This could be a significant operational benefit.
> >
> > When you say "by default", is this adjustable? Does someone actually want the whole shared memory in their core file? (If it's adjustable, is it also adjustable for anonymous mappings?)
>
> 'man core' mentions that the /proc/[PID]/coredump_filter file controls the memory segments written to the core file. The value in the file is a bitmask of memory mapping types; the bits correspond to the flags passed to mmap(). This file can be written to by the program itself through /proc/self/coredump_filter, or externally by something like echo > /proc/PID/coredump_filter. A boot-time setting becomes the default for all programs. A child process inherits the setting from its parent through fork(), and execve() does not change it. So it looks like DBAs should be able to control what gets dumped in PostgreSQL core dumps using any of those options. If we use file-backed shared memory as the patches do, DBAs may have to change their default setting, which may not have considered file-backed shared memory. We will need to highlight this in the release notes. I have added a TODO for the same in the code for now; we will add this note in the commit message or update the relevant documentation as the patches become committable.
>
> The man page says that bits 0, 1 and 4 are set by default (anonymous private and shared mappings, and ELF headers). But on my machine I see that the default is different. It's possible that different installations use different configurations.
>
> > I'm wondering about this change:
> >
> >     -#define PG_MMAP_FLAGS (MAP_SHARED|MAP_ANONYMOUS|MAP_HASSEMAPHORE)
> >     +#define PG_MMAP_FLAGS (MAP_SHARED|MAP_HASSEMAPHORE)
> >
> > It looks like this would affect all mmap() calls, not only the one you're changing. But that's the only one that uses this macro! I don't understand why we need this; I don't see anything in the commit log about this ever being used for any portability. I think we should just get rid of it and have mmap() use the right flags directly.
>
> This is committed as c100340729b66dc46d4f9d68a794957bf2c468d8.
>
> > I see that FreeBSD has a memfd_create() function. Might be worth a try. Obviously, this whole thing needs a configure test for memfd_create() anyway.
>
> We are using memfd_create() in the latest patches. I will look into creating a configure test for memfd_create(). I have added a TODO comment for the same.
>
> > I see that memfd_create() has an MFD_HUGETLB flag. It's not very clear how that interacts with the MAP_HUGETLB flag for mmap(). Do you need to specify both of them if you want huge pages?
>
> We need both. I used the toy program attached to [0] to verify that.
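>
> The working combination boils down to this (a minimal sketch; sizing and error handling omitted, and it assumes huge pages are configured on the system):
>
>     /* Both flags are needed: MFD_HUGETLB for the anonymous file,
>      * MAP_HUGETLB for the mapping over it. */
>     int fd = memfd_create("hp_segment", MFD_HUGETLB);
>
>     ftruncate(fd, size);
>     void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
>                    MAP_SHARED | MAP_HUGETLB, fd, 0);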
> Maybe we could use a similar program for a configure test.
>
> There are three important areas of the puzzle that I will work on next:
>
> 1. Synchronization: The coordinator sends ProcSignal barriers to all the concurrent backends during resizing. The code which deals with the buffer pool needs to react to these barriers and may need to adjust its course of action. The checkpointer, the background writer, new buffer allocation, and code scanning the buffer pool sequentially are a few examples that need to react to the barrier.
>
> 2. Graceful handling of errors and failures during resizing.
>
> 3. Portability: a configure test to check whether a platform has the APIs needed to support this feature, enabling the feature in the corresponding builds. On other platforms the build succeeds and the regression tests run, but the feature is disabled.
>
> All the TODOs mentioned above, and those not covered by the above three points, are added to the code. One of them is to reduce the number of segments to just two (main and shared buffer blocks). I plan to address them as the feature matures.
>
> [0] https://www.postgresql.org/message-id/CAExHW5sVxEwQsuzkgjjJQP9-XVe0H2njEVw1HxeYFdT7u7J%2BeQ%40mail.gmail.com
> [1] https://www.postgresql.org/message-id/CAExHW5sOu8%2B9h6t7jsA5jVcQ--N-LCtjkPnCw%2BrpoN0ovT6PHg%40mail.gmail.com
> [2] https://www.postgresql.org/message-id/qltuzcdxapofdtb5mrd4em3bzu2qiwhp3cdwdsosmn7rhrtn4u%40yaogvphfwc4h
> [3] https://www.postgresql.org/message-id/12add41a-7625-4639-a394-a5563e349322%40eisentraut.org
> [4] https://www.postgresql.org/message-id/CAExHW5vTWABxuM5fbQcFkGuTLwaxuZDEE2vtx2WuMUWk6JnF4g%40mail.gmail.com
> [5] https://www.postgresql.org/message-id/CAEze2WiMkmXUWg10y%2B_oGhJzXirZbYHB5bw0%3DVWte%2BYHwSBa%3DA%40mail.gmail.com
>
> --
> Best Wishes,
> Ashutosh Bapat
