Hi,

On 2026-02-10 19:14:44 +0200, Heikki Linnakangas wrote:
> On 10/02/2026 18:41, Andres Freund wrote:
> > On 2026-02-10 17:52:16 +0200, Heikki Linnakangas wrote:
> > > If there's a performance reason to keep it aligned - and maybe there
> > > is - we should pad it explicitly.
> >
> > We should make it a power of two or such. There are some workloads
> > where the indexing from GetPGProcByNumber() shows up, because it ends
> > up having to be implemented as a 64 bit multiplication, which has a
> > reasonably high latency (3-5 cycles). Whereas a shift has a latency of
> > 1 and typically higher throughput too.
>
> Power of two means going to 1024 bytes. That's a lot of padding. Where
> have you seen that show up?
LWLock contention heavy code, due to the GetPGProcByNumber() in
LWLockWakeup() and other similar places (what that indexing boils down to
is sketched in the PS below).


> Attached is a patch to align to cache line boundary. That's
> straightforward if that's what we want to do.

Yea, I think we should do that. Even if we don't see a difference today,
it's worth it just because it's hell to find production issues around
this: it's so dependent on which processes use which PGPROC etc., and
false sharing issues are generally expensive to debug.


> > Re false sharing: We should really separate stuff that changes (like
> > e.g. pendingRecoveryConflicts) and never changing stuff (backendType).
> > You don't need overlapping structs to have false sharing issues if you
> > mix different access patterns inside a struct that's accessed across
> > processes...
>
> Makes sense, although I don't want to optimize too hard for performance,
> at the expense of readability. The current order is pretty random
> anyway, though.

Yea, I don't think we need to be perfect here. Just a bit less bad. And,
as you say, the current order doesn't make a lot of sense. Just grouping
things like

- pid, pgxactoff, backendType (i.e. barely if ever changing)
- wait_event_info, waitStart (i.e. very frequently changing, but typically
  accessed within one proc)
- sem, lwWaiting, waitLockMode (i.e. stuff that is updated frequently and
  accessed across processes)

would already help (a rough sketch of what I mean is in the PPS below).


> It'd probably be good to move the subxids cache to the end of the
> struct. That'd act as natural padding, as it's not very frequently used,
> especially the tail end of the cache.

Yea, that'd make sense.


> Or come to think of it, it might be good to move the subxids cache out
> of PGPROC altogether. It's mostly frequently accessed in
> GetSnapshotData(), and for that it'd actually be better if it was in a
> separate "mirrored" array, similar to the main xid and subxidStates.
> That would eliminate the pgprocnos[pgxactoff] lookup from
> GetSnapshotData() altogether.

I doubt it's worth it - that way we'd need to move a lot larger array
around during [dis]connect. The subxids stuff is a lot larger than the
xid, statusFlags arrays...


> I'm a little reluctant to mess with this without a concrete benchmark
> though. Got one in mind?

I'm travelling this week, but I'll try to recreate the benchmarks I've
seen this on. Unfortunately you really need a large machine to see
differences; without that the memory latency between cores is just too
low to realistically see issues.

Greetings,

Andres Freund
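
PS: To make the GetPGProcByNumber() point a bit more concrete, here's a
toy example of what the indexing compiles down to. This is not the real
PGPROC definition - the struct names and sizes below are made up, only
power-of-two vs. not matters; GetPGProcByNumber(n) is essentially
&ProcGlobal->allProcs[n], i.e. base + n * sizeof(PGPROC).

    typedef struct OddSized  { char data[840]; }  OddSized;   /* not a power of two */
    typedef struct Pow2Sized { char data[1024]; } Pow2Sized;  /* power of two */

    extern OddSized  *odd_base;
    extern Pow2Sized *pow2_base;

    OddSized *
    get_odd(int n)
    {
        /* base + n * 840: ends up as a 64 bit multiply, ~3-5 cycles of latency */
        return &odd_base[n];
    }

    Pow2Sized *
    get_pow2(int n)
    {
        /* base + (n << 10): just a shift and an add */
        return &pow2_base[n];
    }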

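PPS: And roughly the kind of reordering + explicit padding I mean.
Entirely untested, the field subset and types are from memory (see proc.h
for the real thing), and the PGPROCPadded name is made up; the
union-with-pad trick is the same one we already use for LWLockPadded /
BufferDescPadded. Note that padding to a cache line multiple only buys
actual alignment if the allProcs array itself starts on a cache line
boundary, of course.

    struct PGPROC
    {
        /* set once at backend start, barely if ever changing */
        int             pid;
        int             pgxactoff;
        BackendType     backendType;

        /* changes frequently, but mostly touched by the owning backend */
        uint32          wait_event_info;
        pg_atomic_uint64 waitStart;

        /* updated frequently and read/written across processes */
        PGSemaphore     sem;
        uint8           lwWaiting;
        LOCKMODE        waitLockMode;

        /* ... everything else ..., with the rarely-used tail last */
        XidCacheStatus  subxidStatus;
        struct XidCache subxids;
    };

    /* pad to a multiple of the cache line size (or to a power of two) */
    typedef union PGPROCPadded
    {
        struct PGPROC proc;
        char        pad[TYPEALIGN(PG_CACHE_LINE_SIZE, sizeof(struct PGPROC))];
    } PGPROCPadded;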