Postgres version: PostgreSQL 17.5 on x86_64-pc-linux-gnu, compiled by
Debian clang version 12.0.1, 64-bit

Postgres hosting: Google Cloud Platform, Cloud SQL

Postgres CPU: 40 vCPU

Postgres Memory: 200 GB

Postgres setup: 1 primary with 1 read-replica (hot_standby_feedback flag is
on)

Issue:

We experience a lot of LWLock:lock_manager wait events on our
read-replica, i.e. 150+. They mostly come in bursts, and the occurrences
seem to have been steadily increasing while our load increased only very
slightly. These lock_manager waits also seem concentrated on one or two
specific queries, even though we have 100+ different queries, some with a
higher frequency and as many joins as the “problematic” one. Note that the
total number of locks during that time is more than 10k.

These lock_manager waits end up taking dozens of seconds, and because of
that we run into our statement timeout of 60s.
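
For context, the waits can be counted straight from pg_stat_activity with a
snapshot along these lines; during a burst, LWLock / lock_manager is at the
top:

    -- Count active backends per wait event; run repeatedly during a burst.
    SELECT wait_event_type, wait_event, count(*)
    FROM pg_stat_activity
    WHERE state = 'active'
    GROUP BY wait_event_type, wait_event
    ORDER BY count(*) DESC;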

Some statistics on our database:

   - Average number of queries per minute is 12-15k
   - Maximum number of concurrent requests is ~100, but it goes up to
   ~150 when the lock waits spike
   - The relevant tables have between 10 and 20 indexes, mostly on
   foreign keys
   - The relevant tables are ~20-40 GB
   - Total database size is 981 GB
   - The average replication lag on the read replica is <100ms
   - A query plan of the problematic query shows it reads between 100MB
   and 1GB of data according to the Buffers output (see the sketch after
   this list)
   - We don’t have any partitions
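
For reference, those Buffers numbers come from running the query under
EXPLAIN (ANALYZE, BUFFERS), along these lines (the table names below are
placeholders, not our real schema):

    -- Hypothetical tables; the real query joins several of the 20-40 GB
    -- tables mentioned above.
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT o.id, c.name
    FROM orders o
    JOIN customers c ON c.id = o.customer_id
    WHERE o.created_at > now() - interval '1 day';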

The information available on when you run into this lock is somewhat
limited, but we found
https://www.postgresql.org/message-id/E1ss4gX-000IvX-63%40gemulon.postgresql.org
which mentions that the number of fast-path lock slots should have been
configurable since that release via the max_locks_per_transaction
parameter. However, if we check the pg_locks view, we see that the number
of fast-path locks per pid and lock mode still maxes out at 16, even
though max_locks_per_transaction is set to 64.
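
The check against pg_locks is roughly along these lines (a sketch of what
we look at):

    SHOW max_locks_per_transaction;   -- 64 in our case

    -- Fast-path lock counts per backend and lock mode; the maximum we
    -- observe here is 16.
    SELECT pid, mode, count(*) AS fastpath_locks
    FROM pg_locks
    WHERE fastpath
    GROUP BY pid, mode
    ORDER BY fastpath_locks DESC;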

If we increase the memory of the read-replica to 300 GB with the same
number of vCPUs, the waits occur far less often, which makes us think that
increasing the fast-path slots wouldn’t really solve the issue and that
something else is going on. Maybe we didn’t provision enough memory for
the number of vCPUs.

When adding a second replica with the same vCPUs and memory, we don’t see
any LWLock:lock_manager waits (or negligible amounts at least). The
traffic is distributed randomly across both replicas, so they should hit
the same data, and we would expect to see the issue in this case as well,
but the behaviour differs from having one larger read-replica.

Additionally, we notice a periodicity of roughly 10 minutes in these
locking issues, which we can’t correlate with any increase in load.

So the questions we have, are:

   - Under what conditions do the LWLock:lock_manager waits occur?
   - Why do they not occur consistently, but in waves?
   - Why do they seem to correlate with the amount of memory provisioned?
   - How can we solve this?



*Sam Persoon*
Team Lead Frontend
qargo.com <https://www.qargo.com>
