On Wed, Jan 28, 2026 at 2:19 PM Ashutosh Bapat
<[email protected]> wrote:
>v 20260128*.patch
Short intro: I started trying out these patches for a slightly different reason
than the online buffer resizing. There was a recent post [1] that Alvaro brought
to our attention. That article complains about the postmaster being unscalable,
more or less saturating at 2-3k new connections / second while becoming a CPU
hog (one could argue that's too much and not a sensible setup).
My thinking was that the main reason for the hit would be a slow fork(), which
made me wonder why we fork() with the majority of memory being shared_buffers
(BufferBlocks), memory that is not really used inside the postmaster itself
(only backends use it). I thought it would be cool if we could just initialize
the memory, keep only the fd from memfd_create() for s_b around (that is,
munmap() BufferBlocks from the postmaster, lowering its RSS/smaps footprint),
so that fork() would NOT have to copy that big kernel VMA for shared_buffers -
in theory only the fd that references it - thereby increasing the scalability
of the postmaster (the kernel would have less work to do during fork()). Later
on, the classic backends would mmap() the region back from the fd created
earlier (in the postmaster) via memfd_create(2), but that would happen in many
backends, so the work would be spread across many CPUs. The critical assumption
here is that although Linux seems to do huge PMD sharing for
MAP_SHARED | MAP_HUGETLB, I was still wondering whether we couldn't accelerate
fork() further by simply not having this memory mapped at all before calling
fork().
Initially I created a simple PoC benchmark on 64GB which, even with hugepages,
showed some potential (a simplified sketch of the memfd scenario follows right
after the numbers):
Scenario 1 (mmap inherited): 20001 total forks, 0.302ms per fork
Scenario 2 (MADV_DONTFORK): 20001 total forks, 0.292ms per fork
Scenario 3 (memfd_create): 20002 total forks, 0.145ms per fork
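For reference, this is roughly the mechanism that scenario 3 exercised (a
simplified, standalone sketch rather than the actual PoC/timing code; the 1GB
size and the plain memfd without MFD_HUGETLB are just placeholders):

/*
 * Parent creates the region with memfd_create(), initializes it, then drops
 * the mapping and keeps only the fd, so fork() has no large VMA to copy;
 * the child re-attaches the very same pages via mmap() on the inherited fd.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define SEG_SIZE ((size_t) 1 << 30)

int
main(void)
{
	int		fd = memfd_create("s_b", 0);
	char   *p;

	if (fd < 0 || ftruncate(fd, SEG_SIZE) != 0)
	{
		perror("memfd_create/ftruncate");
		return 1;
	}

	p = mmap(NULL, SEG_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
	{
		perror("mmap");
		return 1;
	}
	memset(p, 1, SEG_SIZE);		/* "initialize shared_buffers" */
	munmap(p, SEG_SIZE);		/* postmaster keeps only the fd around */

	if (fork() == 0)
	{
		/* backend side: map the same pages back through the inherited fd */
		p = mmap(NULL, SEG_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
		if (p == MAP_FAILED)
			_exit(1);
		printf("child sees %d\n", p[0]);	/* prints 1 */
		_exit(0);
	}
	wait(NULL);
	return 0;
}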
Quite unexpectedly, that's how I discovered your and Dimitry's patch, as it
already had the shared memory split into separate segments (rather than one
big mmap() blob) and used memfd_create(2) too, so I just gave it a try. So I
benchmarked your patchset with respect to establishing new connections:
1s4c 32GB RAM, 6.14.x kernel, 16GB shared_buffers
benchmark: /usr/pgsql19/bin/pgbench -n --connect -j 4 -c 100
-f <(echo "SELECT 1;") postgres -P 1 -T 30
# master
latency average = 358.681 ms
latency stddev = 225.813 ms
average connection time = 2.989 ms
tps = 1329.733460 (including reconnection times)
# memfd/thispatchset
latency average = 363.584 ms
latency stddev = 230.529 ms
average connection time = 3.022 ms
tps = 1315.810761 (including reconnection times)
# memfd + my trick: showed some promise in the lower latency stddev, but not in TPS
latency average = 34.229 ms
latency stddev = 22.059 ms
average connection time = 2.908 ms
tps = 1369.785773 (including reconnection times)
Another box, 4s32c64, 128GB RAM, 6.14.x kernel,
64GB shared_buffers (4 NUMA nodes)
benchmark: /usr/pgsql19/bin/pgbench -n --connect -j 128 -c 1000
-f <(echo "SELECT 1;") postgres -P 1 -T 30
#master
latency average = 240.179 ms
latency stddev = 119.379 ms
average connection time = 62.049 ms
tps = 2058.434343 (including reconnection times)
#memfd
latency average = 268.384 ms
latency stddev = 133.501 ms
average connection time = 69.081 ms
tps = 1847.422995 (including reconnection times)
#memfd+mytrick
latency average = 261.726 ms
latency stddev = 130.161 ms
average connection time = 67.579 ms
tps = 1889.988400 (including reconnection times)
So:
a) yes, my idea fizzled - I still have no crystal clear idea why - but at least
I got to try your patch :) We are still in the ballpark of ~1800..3000
new connections per second.
A proper review of the patchset follows:
b) the patch changes the behavior on startup: it now appears to touch all
the memory during startup, which takes much more time (I'm thinking of HA
failover/promote scenarios where a long startup could mean trouble, e.g.
after pg_rewind). E.g. without the patch startup takes 1-2s, with the patch
it takes 49s (no HugePages, 64GB s_b, on a slow machine). It happens due to
the new fallocate() in shmem_fallocate(). If it is supposed to stay like
that, IMHO it should elog() what it is doing ("allocating memory..."),
otherwise users can be left confused. It almost behaves as if MAP_POPULATE
were used.
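Something along these lines is what I have in mind (just a sketch; the exact
wording, log level, variable name and placement inside shmem_fallocate() are
guesses):

	/* sketch: "size" stands for whatever the requested segment size is
	 * called at that point in shmem_fallocate() */
	elog(LOG, "pre-allocating %zu bytes of shared memory, this may take a while",
		 size);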
c) as per the above measurements, on NUMA there seems to be a regression to
~89% of baseline (1847/2058) when it comes to establishing new connections,
and since you are operating on sysv_shmem.c this affects all users. Possibly
this would have to be re-tested on some more modern hardware (I don't see it
on a single socket, but I do see it on multiple sockets).
d) MADV_HUGEPAGES is Linux 4.14+, and although that was released nearly 10
years ago, the buildfarm probably still has some animals (Ubuntu 16?) running
such old kernels (??)
e) so maybe because of b+c+d we should consider putting this under some new
shared_memory_type in the long run?
f) with huge_pages=on and no asserts it never worked for me, failing with:
FATAL: segment[main]: could not truncate anonymous file to size 313483264: Invalid argument
Please see the trace below (this is with both(!)
max_shared_buffers=shared_buffers=1GB); for some reason ftruncate() ended up
being called with roughly twice the size.
[pid 1252287] memfd_create("main", MFD_HUGETLB) = 4
[pid 1252287] mmap(NULL, 157286400, PROT_NONE, MAP_SHARED|MAP_NORESE..
[pid 1252287] mprotect(0x7f2a1a400000, 157286400, PROT_READ|PROT_WRI..
[pid 1252287] ftruncate(4, 313483264) = -1 EINVAL (Invalid argument)
It appears that I'm getting this due to a bug in
round_off_mapping_sizes_for_hugepages(): before it I'm getting
shmem_reserved=156196864, shmem_req_size=156196864
and after it is called it returns
shmem_reserved=157286400, shmem_req_size=313483264
(the "+=" combined with add_size() seems to add the rounded-up size on top of
the existing value, which roughly doubles it). Maybe TYPEALIGN() would be a
better fit there.
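To illustrate, the rounding could probably be done like this instead (an
untested sketch reusing the names from the hunk quoted at the bottom;
TYPEALIGN() is fine here because hugepagesize is always a power of two):

	/* round shmem_req_size up to the next hugepage boundary; already-aligned
	 * values stay unchanged, so the preceding "if" becomes unnecessary */
	mapping->shmem_req_size = TYPEALIGN(hugepagesize, mapping->shmem_req_size);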
-J.
[1] - https://www.recall.ai/blog/postgres-postmaster-does-not-scale
diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 0399265c4dd..b80fcd58931 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -758,6 +758,8 @@ round_off_mapping_sizes_for_hugepages(MemoryMappingSizes *mapping, int hugepages
 	if (hugepagesize == 0)
 		return;
 
+	elog(WARNING, "shmem_reserved=%ld, shmem_req_size=%ld", mapping->shmem_reserved, mapping->shmem_req_size);
+
 	if (mapping->shmem_req_size % hugepagesize != 0)
 		mapping->shmem_req_size += add_size(mapping->shmem_req_size,
 											hugepagesize - (mapping->shmem_req_size % hugepagesize));
@@ -839,7 +841,7 @@ CreateAnonymousSegment(int segment_id, MemoryMappingSizes *mapping)
 				(errmsg("segment[%s]: could not create anonymous shared memory file: %m",
 						segname)));
 
-	elog(DEBUG1, "segment[%s]: mmap(%zu)", segname, mapping->shmem_req_size);
+	elog(WARNING, "segment[%s]: mmap(%zu)", segname, mapping->shmem_req_size);
 
 	/*
 	 * Reserve maximum required address space for future expansion of this
@@ -894,7 +896,6 @@ CreateAnonymousSegment(int segment_id, MemoryMappingSizes *mapping)
 	anonseg->addr = ptr;
 	anonseg->size = mapping->shmem_reserved;
 }
-
 /*
  * PrepareHugePages
  *
@@ -1418,6 +1419,29 @@ PGSharedMemoryDetach(void)
 	}
 }
 
+void
+JWPGSharedMemoryBuffersDetachTrick(void)
+{
+	AnonShmemSegment *sbseg = &AnonShmemSegs[BUFFERS_SHMEM_SEGMENT];
+	elog(WARNING, "unmapping s_b, but not closing fd(%d), from postmaster to accelerate fork()", sbseg->fd);
+	if(munmap(sbseg->addr, sbseg->size) != 0) {
+		ereport(FATAL, (errmsg("sb segment: could not unmap anonymous shared memory: %m")));
+	}
+}
+
+void
+JWPGSharedMemoryBuffersReattachTrick(void)
+{
+	AnonShmemSegment *sbseg = &AnonShmemSegs[BUFFERS_SHMEM_SEGMENT];
+
+	if(sbseg->fd == -1)
+		elog(PANIC, "children got wrong memfd fd");
+
+	sbseg->addr = mmap(NULL, sbseg->size, PROT_READ | PROT_WRITE, MAP_SHARED, sbseg->fd, 0);
+	if (sbseg->addr == MAP_FAILED)
+		elog(PANIC, "children failed mmap: %m");
+}
+
 void
 ShmemControlInit(void)
 {
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 80e3088fc7e..b59c82f6c1f 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -93,6 +93,7 @@ BackgroundWriterMain(const void *startup_data, size_t startup_data_len)
 	WritebackContext wb_context;
 
 	Assert(startup_data_len == 0);
+	elog(WARNING, "bwriter starting");
 	MyBackendType = B_BG_WRITER;
 	AuxiliaryProcessMainCommon();
diff --git a/src/backend/postmaster/launch_backend.c b/src/backend/postmaster/launch_backend.c
index 85da8ac381a..43db2570a95 100644
--- a/src/backend/postmaster/launch_backend.c
+++ b/src/backend/postmaster/launch_backend.c
@@ -227,6 +227,9 @@ postmaster_child_launch(BackendType child_type, int child_slot,
 		conn_timing.fork_end = GetCurrentTimestamp();
 	}
 
+	//elog(WARNING, "starting %d", pid);
+	JWPGSharedMemoryBuffersReattachTrick();
+
 	/* Close the postmaster's sockets */
 	ClosePostmasterPorts(child_type == B_LOGGER);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 1e92b0bcc5e..a18cb6227c5 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -257,7 +257,7 @@ CreateSharedMemoryAndSemaphores(void)
 		inhseg->UsedShmemSegID = i;
 
 		/* Compute the size of the shared-memory block */
-		elog(DEBUG3, "invoking IpcMemoryCreate(segment %s, size=%zu, reserved address space=%zu)",
+		elog(WARNING, "invoking IpcMemoryCreate(segment %s, size=%zu, reserved address space=%zu)",
			 MappingName(i), mapping->shmem_req_size, mapping->shmem_reserved);
 
 		/*
@@ -289,6 +289,10 @@ CreateSharedMemoryAndSemaphores(void)
 	 */
 	if (shmem_startup_hook)
 		shmem_startup_hook();
+
+	/* JW HACK */
+	JWPGSharedMemoryBuffersDetachTrick();
+
 }
 
 /*
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index ac679259787..2160af8de1a 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -219,6 +219,8 @@ extern PGShmemHeader *PGSharedMemoryCreate(int segment_id, MemoryMappingSizes *m
 											PGShmemHeader **shim);
 extern bool PGSharedMemoryIsInUse(unsigned long id1, unsigned long id2);
 extern void PGSharedMemoryDetach(void);
+extern void JWPGSharedMemoryBuffersDetachTrick(void);
+extern void JWPGSharedMemoryBuffersReattachTrick(void);
 extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags,
 							int *memfd_flags);
 extern bool PGSharedMemoryResize(int segment_id, MemoryMappingSizes *mapping_sizes);