Hi,

On 2026-02-10 19:14:44 +0200, Heikki Linnakangas wrote:
> On 10/02/2026 18:41, Andres Freund wrote:
> > On 2026-02-10 17:52:16 +0200, Heikki Linnakangas wrote:
> > > If there's a performance reason to keep it aligned - and maybe there
> > > is - we should pad it explicitly.
> > 
> > We should make it a power of two or such. There are some workloads where the
> > indexing from GetPGProcByNumber() shows up, because it ends up having to be
> > implemented as a 64 bit multiplication, which has a reasonably high latency
> > (3-5 cycles). Whereas a shift has a latency of 1 and typically higher
> > throughput too.
> 
> Power of two means going to 1024 bytes. That's a lot of padding. Where have
> you seen that show up?

LWLock contention heavy code, due to the GetPGProcByNumber() in LWLockWakeup()
and other similar places.
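
To make that concrete - just a sketch, not the real definitions (the actual
GetPGProcByNumber() indexes into the shared allProcs array):

/* illustration only: with a power-of-two element size (here 1024 bytes),
 * computing &array[n] is a shift plus an add; with an "odd" size the
 * compiler has to emit a 64 bit integer multiply instead, which has a
 * latency of several cycles. */
typedef struct PaddedProcExample
{
	char		data[1024];
} PaddedProcExample;

extern PaddedProcExample *allProcsExample;	/* stand-in for ProcGlobal->allProcs */

static inline PaddedProcExample *
GetProcByNumberExample(int n)
{
	return &allProcsExample[n];		/* base + ((size_t) n << 10) */
}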


> Attached is a patch to align to cache line boundary. That's straightforward
> if that's what we want to do.

Yea, I think we should do that. Even if we don't see a difference today: it's
hell to track down production issues around this, because the behavior depends
so much on which processes end up using which PGPROC etc, and because false
sharing issues are generally expensive to debug.
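
For reference, the kind of padding this would be - a minimal sketch assuming a
64 byte cache line, not the actual patch (in-tree this is what
PG_CACHE_LINE_SIZE and the LWLockPadded-style unions are for):

/* round each array element up to a cache line multiple, so that two
 * neighbouring backends' entries never share a cache line */
#define CACHE_LINE_SIZE_EXAMPLE 64

typedef struct SharedEntryExample
{
	int			pid;
	unsigned int wait_event_info;
	/* ... */
} SharedEntryExample;

typedef union SharedEntryPaddedExample
{
	SharedEntryExample entry;
	char		pad[((sizeof(SharedEntryExample) + CACHE_LINE_SIZE_EXAMPLE - 1)
					 / CACHE_LINE_SIZE_EXAMPLE) * CACHE_LINE_SIZE_EXAMPLE];
} SharedEntryPaddedExample;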


> > Re false sharing: We should really separate stuff that changes (like
> > e.g. pendingRecoveryConflicts) and never changing stuff (backendType). You
> > don't need overlapping structs to have false sharing issues if you mix
> > different access patterns inside a struct that's accessed across 
> > processes...
> 
> Makes sense, although I don't want to optimize too hard for performance, at
> the expense of readability. The current order is pretty random anyway,
> though.

Yea, I don't think we need to be perfect here. Just a bit less bad. And, as
you say, the current order doesn't make a lot of sense. Just grouping things
would already help, e.g. (rough sketch after the list):
- pid, pgxactoff, backendType (i.e. barely if ever changing)
- wait_event_info, waitStart (i.e. very frequently changing, but typically
  accessed within one proc)
- sem, lwWaiting, waitLockMode (i.e. stuff that is updated frequently and
  accessed across processes)
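
Something along these lines - purely illustrative types and names, not the
actual PGPROC declarations:

/* illustration of the grouping above, not the real PGPROC layout */
struct GroupedProcExample
{
	/* barely if ever changing after backend start */
	int			pid;
	int			pgxactoff;
	int			backendType;	/* stand-in for the real BackendType */

	/* very frequently changing, but mostly touched by the owning proc */
	unsigned int wait_event_info;
	long long	waitStart;		/* stand-in for TimestampTz */

	/* updated frequently and read/written across processes */
	void	   *sem;			/* stand-in for PGSemaphore */
	unsigned char lwWaiting;
	int			waitLockMode;

	/* possibly explicit padding between the groups, so the cross-process
	 * hot fields don't share a cache line with the rest */
};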


> It'd probably be good to move the subxids cache to the end of the struct.
> That'd act as natural padding, as it's not very frequently used, especially
> the tail end of the cache.

Yea, that'd make sense.


> Or come to think of it, it might be good to move the subxids cache out of
> PGPROC altogether. It's mostly frequently accessed in GetSnapshotData(), and
> for that it'd actually be better if it was in a separate "mirrored" array,
> similar to the main xid and subxidStates. That would eliminate the
> pgprocnos[pgxactoff] lookup from GetSnapshotData() altogether.

I doubt it's worth it - that way we'd need to move a much larger array around
during [dis]connect. The subxids cache is a lot larger than the xid and
statusFlags arrays...
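
Rough numbers (assuming PGPROC_MAX_CACHED_SUBXIDS is still 64): the dense
arrays mirrored today hold a 4 byte xid and a 1 byte status flag per backend,
whereas a mirrored subxid cache would be about 64 * 4 = 256 bytes per backend
that would have to be moved whenever the dense arrays get compacted at
[dis]connect time.

/* sketch of the pgxactoff-indexed dense arrays, names simplified */
struct DenseArraysExample
{
	unsigned int  *xids;			/* 4 bytes per backend */
	unsigned char *statusFlags;		/* 1 byte per backend */
	unsigned int (*subxids)[64];	/* ~256 bytes per backend - the problem */
};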


> I'm a little reluctant to mess with this without a concrete benchmark
> though. Got one in mind?

I'm travelling this week, but I'll try to recreate the benchmarks I've seen
this on.  Unfortunately you really need a large machine to see differences;
without that, the memory latency between cores is just too low to
realistically show the issue.


Greetings,

Andres Freund

