Re: Configurable FP_LOCK_SLOTS_PER_BACKEND

2023-08-07 Thread Matt Smiley
>
> Why would the access frequency be uniform? In particular, there's a huge
> variability in how long the locks need to exist
>

As a supporting data point, our example production workload shows a 3x
difference between the most and the least frequently contended
lock_manager locks:
https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1365630726

Since we deterministically distribute relations among those 16 lock_manager
lwlocks by hashing their lock tag, we can probably assume a roughly uniform
number of relations are being managed by each lock_manager lock, but the
demand (and contention) for them is non-uniform. This 3x spread
corroborates the intuition that some relations are locked more frequently
than others (that being both a schema- and workload-specific property).
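
For anyone following along, the mapping I'm referring to works roughly
like the sketch below.  This is my paraphrase of the NUM_LOCK_PARTITIONS /
LockHashPartition logic, not the verbatim source:

#include <stdint.h>

/*
 * A heavyweight lock tag is hashed, and the hash code selects one of the
 * 16 lock_manager lwlock partitions.  A given relation always hashes to
 * the same partition, so a "hot" relation concentrates all of its
 * slow-path lock traffic on a single lwlock.
 */
#define LOG2_NUM_LOCK_PARTITIONS  4
#define NUM_LOCK_PARTITIONS       (1 << LOG2_NUM_LOCK_PARTITIONS)  /* 16 */

static inline uint32_t
lock_hash_partition(uint32_t locktag_hashcode)
{
    return locktag_hashcode % NUM_LOCK_PARTITIONS;
}

A roughly even spread of relations across partitions therefore does not
imply an even spread of lock traffic across those partitions.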

Since we're contemplating a new hashing scheme, I wonder how we could
accommodate that kind of asymmetry, where some relations are locked more
frequently than others.


Re: Configurable FP_LOCK_SLOTS_PER_BACKEND

2023-08-07 Thread Matt Smiley
Hi Andres, thanks for helping!  Great questions, replies are inline below.

On Sun, Aug 6, 2023 at 1:00 PM Andres Freund  wrote:

> Hm, I'm curious whether you have a way to trigger the issue outside of your
> prod environment. Mainly because I'm wondering if you're potentially
> hitting
> the issue fixed in a4adc31f690 - we ended up not backpatching that fix, so
> you'd not see the benefit unless you reproduced the load in 16+.
>

Thanks for sharing this!

I have not yet written a reproducer, since we see this daily in production.
I have a sketch of a few ways that I think will reproduce the behavior
we're observing, but haven't had time to implement them.

I'm not sure if we're seeing this behavior in production, but it's
definitely an interesting find.  Currently we are running postgres 12.11,
with an upcoming upgrade to 15 planned.  Good to know there's a potential
improvement waiting in 16.  I noticed that in LWLockAcquire the call to
LWLockDequeueSelf occurs (
https://github.com/postgres/postgres/blob/REL_12_11/src/backend/storage/lmgr/lwlock.c#L1218)
directly between the unsuccessful attempt to immediately acquire the lock
and reporting the backend's wait event.  The distinctive indicators we have
been using for this pathology are the "lock_manager" wait_event and its
associated USDT probe (
https://github.com/postgres/postgres/blob/REL_12_11/src/backend/storage/lmgr/lwlock.c#L1236-L1237),
both of which occur after whatever overhead is incurred by
LWLockDequeueSelf.  As you mentioned in your commit message, that overhead
is hard to detect.  My first impression is that whatever overhead it incurs
is in addition to what we are investigating.
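
For reference, here is the rough shape of that code path as I read it in
REL_12.  This is a paraphrase of LWLockAcquire's slow path, abridged, not
the verbatim source:

for (;;)
{
    if (!LWLockAttemptLock(lock, mode))
        break;                       /* acquired immediately */

    LWLockQueueSelf(lock, mode);     /* join the wait queue */

    if (!LWLockAttemptLock(lock, mode))
    {
        /*
         * Acquired on the second try, so undo the queueing.  This is the
         * call whose worst-case cost a4adc31f690 improves, and it runs
         * before any wait event is reported.
         */
        LWLockDequeueSelf(lock);
        break;
    }

    /* Only at this point do the wait_event and the USDT probe fire. */
    LWLockReportWaitStart(lock);
    TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(lock), mode);

    /* ... sleep on the semaphore until woken ... */

    TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(lock), mode);
    LWLockReportWaitEnd();
    /* loop back and try to acquire the lock again */
}

So any extra time spent in LWLockDequeueSelf would not show up between the
wait-start and wait-done probes we instrumented.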


> I'm also wondering if it's possible that the reason for the throughput
> drops
> are possibly correlated with heavyweight contention or higher frequency
> access
> to the pg_locks view. Deadlock checking and the locks view acquire locks on
> all lock manager partitions... So if there's a bout of real lock contention
> (for longer than deadlock_timeout)...
>

Great questions, but we ruled that out.  The deadlock_timeout is 5 seconds,
so frequently hitting that would massively violate SLO and would alert the
on-call engineers.  The pg_locks view is scraped a couple times per minute
for metrics collection, but the lock_manager lwlock contention can be
observed thousands of times every second, typically with very short
durations.  The following example (captured just now) shows the number of
times per second over a 10-second window that any 1 of the 16
"lock_manager" lwlocks was contended:

msmiley@patroni-main-2004-103-db-gprd.c.gitlab-production.internal:~$ sudo
./bpftrace -e 'usdt:/usr/lib/postgresql/12/bin/postgres:lwlock__wait__start
/str(arg0) == "lock_manager"/ { @[arg1] = count(); } interval:s:1 {
print(@); clear(@); } interval:s:10 { exit(); }'
Attaching 5 probes...
@[0]: 12122
@[0]: 12888
@[0]: 13011
@[0]: 13348
@[0]: 11461
@[0]: 10637
@[0]: 10892
@[0]: 12334
@[0]: 11565
@[0]: 11596

Typically that contention only lasts a couple microseconds.  But the long
tail can sometimes be much slower.  Details here:
https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1365159507
.

> Given that most of your lock manager traffic comes from query planning -
> have
> you evaluated using prepared statements more heavily?
>

Yes, there are unrelated obstacles to doing so -- that's a separate can of
worms, unfortunately.  But in this pathology, even if we used prepared
statements, the backend would still need to reacquire the same locks during
each executing transaction.  So in terms of lock acquisition rate, whether
it's the planner or the executor doing the locking, the same relations have
to be locked.


Re: Configurable FP_LOCK_SLOTS_PER_BACKEND

2023-08-07 Thread Matt Smiley
Thank you Tomas!  I really appreciate your willingness to dig in here and
help us out!  The rest of my replies are inline below.

On Thu, Aug 3, 2023 at 1:39 PM Tomas Vondra 
wrote:

> The analysis in the linked gitlab issue is pretty amazing. I wasn't
> planning to argue against the findings anyway, but plenty of data
> supporting the conclusions is good.
>

Thank you!  I totally agree, having supporting data is so helpful.

> I'm not an expert on locking, so some of the stuff I say may be
> trivially obvious - it's just me thinking about ...
>

Absolutely makes sense to check assumptions, etc.  Thanks for being open!
For what it's worth, I've also been working with Postgres for many years,
and I love that it keeps teaching me new things, this topic being just the
latest.

> I wonder what's the rough configuration of those systems, though. Both
> the hardware and PostgreSQL side. How many cores / connections, etc.?
>

Each of the postgres hosts had 96 vCPUs and at peak handled roughly 80
concurrently active connections.

For purposes of reproducing the pathology, I think a single postgres
instance will suffice.  We will need a high enough query rate to push the
bottleneck to lock_manager lwlock contention.  The simplest way to do that
is probably to give it a small dataset that fits easily in cache and run
several concurrent client connections doing cheap single-row queries, each
in its own transaction, against a target table that has many indexes, many
partitions, or both.

For context, here's a brief summary of the production environment where we
first observed this pathology:
The writable primary postgres instance has several streaming replicas, used
for read-only portions of the workload.  All of them run on equivalent
hardware.  Most of the research focuses on the streaming replica postgres
instances, although the same pathology has been observed in the writable
primary node as well. The general topology is thousands of client
connections fanning down into several pgbouncer instances per Postgres
instance.  From each Postgres instance's perspective, its workload
generally has a daily peak of roughly 80 concurrently active backends
supporting a throughput of 75K transactions per second, where most transactions
run a single query.

> Yes, I agree with that. Partitioning makes this issue worse, I guess.
> Schemas with indexes on every column are disturbingly common these days
> too, which hits the issue too ...
>

Agreed.


> Those are pretty great pieces of information. I wonder if some of the
> measurements may be affecting the observation (by consuming too much
> CPU, making the contention worse), but overall it seems convincing.
>

Yes, definitely overhead is a concern, glad you asked!

Here are my notes on the overhead for each bpftrace script:
https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1357834678

Here is a summary of where that overhead comes from:
https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301#note_1365310956

Here are more generic benchmark results for uprobe overhead:
https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1383

Briefly, we generally expect the instrumentation overhead to be roughly
1-2 microseconds each time the instrumented instruction is hit.  It partly
depends on what we're doing in the instrumentation, but most of that
overhead is just the interrupt-handling to transfer control flow to/from
the BPF code.

> Would it be difficult to sample just a small fraction of the calls? Say,
> 1%, to get good histograms/estimated with acceptable CPU usage.
>

That would be great, but since the overhead comes mostly from the control
transfer, it wouldn't help to put sampling logic in the tracer itself.  The
main way to mitigate that overhead is to choose instrumentation points
where the call rate is tolerably low.  That's why the only instrumentation
I used for more than a few seconds at a time was the set of "low overhead"
scripts that instrument only the stalled calls to LWLockAcquire.


> In any case, it's a great source of information to reproduce the issue
> and evaluate possible fixes.
>

Thanks, that's my hope!


> After looking at the code etc. I think the main trade-off here is going
> to be the cost of searching the fpRelId array. At the moment it's
> searched linearly, which is cheap for 16 locks. But at some point it'll
> become as expensive as updating the slowpath, and the question is when.
>
> I wonder if we could switch to a more elaborate strategy if the number
> of locks is high enough. Say, a hash table, or some hybrid approach.
>
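
For anyone following along, the scan in question is essentially the loop
below.  This is my paraphrase of the fast-path slot search (e.g.
FastPathGrantRelationLock() in lock.c), not the verbatim source; the Oid
typedef is only a stand-in to keep the sketch self-contained:

typedef unsigned int Oid;

#define FP_LOCK_SLOTS_PER_BACKEND 16    /* the constant under discussion */

/* Linear scan of one backend's fast-path slots for a relation's OID. */
static int
fastpath_find_slot(const Oid *fpRelId, int nslots, Oid relid)
{
    for (int i = 0; i < nslots; i++)
    {
        if (fpRelId[i] == relid)
            return i;       /* O(nslots) comparisons in the worst case */
    }
    return -1;              /* no match: fall back to the shared lock table */
}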

Interesting idea!  I was hoping a linear search would stay cheap enough but
you're right, it's going to become too inefficient at some point.  It might
make sense to start with just blackbox timing or throughput measurements,
because directly measuring that search duration may not be cheap.  To
observe durations via BPF, we have to instrument 2 points (e.g. function
entry and return, or more generally the 

Re: Configurable FP_LOCK_SLOTS_PER_BACKEND

2023-08-02 Thread Matt Smiley
nswer may be workload- and
platform-specific.  Exposing this as a GUC gives the admin a way to make a
different choice if our default (currently 16) is bad for them.
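
To make that concrete, here is one purely illustrative sketch of how such
a setting might be registered in guc.c if it were a plain integer GUC.
The name, bounds, and variable are invented for illustration (this is not
from an actual patch), and because the slots live in a fixed-size array in
shared memory it would presumably have to be PGC_POSTMASTER, i.e.
changeable only at server start:

/* Hypothetical ConfigureNamesInt entry; names and bounds are examples only. */
{
    {"fastpath_lock_slots_per_backend", PGC_POSTMASTER, LOCK_MANAGEMENT,
        gettext_noop("Sets the number of fast-path relation lock slots per backend."),
        NULL
    },
    &FastPathLockSlotsPerBackend,
    16, 1, 1024,
    NULL, NULL, NULL
},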

I share your reluctance to add another low-level tunable, but like many
other GUCs, having a generally reasonable default that can be adjusted is
better than forcing folks to fork postgres to adjust a compile-time
constant.  And unfortunately I don't see a better way to solve this
problem.  Growing the lock_manager lwlock tranche isn't as effective,
because it doesn't help workloads where one or more relations are locked
frequently enough to hit this saturation point.

Handling a larger percentage of heavyweight lock acquisitions via fastpath
instead of slowpath seems likely to help many high-throughput workloads,
since it avoids having to exclusively acquire an lwlock.  It seemed like
the least intrusive general-purpose solution we've come up with so far.
That's why we wanted to solicit feedback or new ideas from the community.
Currently, the only options folks have for addressing this class of
saturation are some combination of schema changes, application changes,
vertical scaling, and spreading the query rate among more postgres
instances.  None of those options is both feasible and efficient.  Lacking a better
solution, exposing a GUC that rarely needs tuning seems reasonable to me.

Anyway, hopefully the extra context is helpful!  Please do share your
thoughts.

-- 
*Matt Smiley* | Staff Site Reliability Engineer at GitLab