Postgres version: PostgreSQL 17.5 on x86_64-pc-linux-gnu, compiled by Debian clang version 12.0.1, 64-bit
Postgres hosting: Google Cloud Platform, Cloud SQL
Postgres CPU: 40 vCPU
Postgres Memory: 200 GB
Postgres setup: 1 primary with 1 read replica (hot_standby_feedback flag is on)

Issue: We experience a lot of LWLock:lock_manager wait events on our read replica, i.e. 150+. They mostly come in bursts, and the occurrences have been steadily increasing while our load increased only very slightly. These lock_manager waits are also concentrated on mostly one or two specific queries, while we have 100+ different queries, some with a higher frequency and as many joins as the "problematic" one. Note that the total number of locks during that time is more than 10k. These lock_manager waits end up taking dozens of seconds, and because of that we run into our statement timeout of 60s.

Some statistics on our database:
- Average number of queries per minute is 12-15k
- Maximum number of concurrent requests is ~100, but it goes up to ~150 when the lock waits spike
- The relevant tables have between 10 and 20 indexes, mostly on foreign keys
- The relevant tables are ~20-40 GB
- Total database size is 981 GB
- The average replication lag on the read replica is <100ms
- A query plan of the problematic query shows it reads between 100 MB and 1 GB of data according to the Buffers output
- We don't have any partitions

The amount of information on when you run into this lock is somewhat limited, but we found https://www.postgresql.org/message-id/E1ss4gX-000IvX-63%40gemulon.postgresql.org which mentions that the number of fast-path locks should have become configurable as of that release via the max_locks_per_transaction parameter. However, if we check the pg_locks view we see that the number of fast-path locks per pid and lock mode still maxes out at 16, even though max_locks_per_transaction is set to 64 (the query we used is in the P.S. below).

If we increase the memory of the read replica to 300 GB with the same number of vCPUs, we see it occur far less often, which makes us think that increasing the fast-path slots wouldn't really solve the issue and that something else is going on. Maybe we didn't give it enough memory for the number of vCPUs. When we add a second replica with the same vCPUs and memory, we don't see any LWLock:lock_manager waits (or a negligible number at least). Traffic is routed to both replicas randomly, so they should hit the same data, and we would expect to see the issue in this case as well, but the behaviour differs from having one larger read replica. Additionally, we notice a periodicity in these locking issues of about every 10 minutes, which we can't correlate with any load increase.

So the questions we have are:
- When would the LWLock:lock_manager waits occur?
- Why do they not occur consistently but in waves?
- Why do they seem to correlate with the amount of memory given?
- How can we solve this?

*Sam Persoon*
Team Lead Frontend
qargo.com <https://www.qargo.com>
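
P.S. For reference, this is roughly the check we ran for the fast-path lock counts mentioned above. It is a minimal sketch that only uses the stock pg_locks view and the server setting, so nothing in it is specific to our schema:

    -- max_locks_per_transaction is set to 64 on this instance
    SHOW max_locks_per_transaction;

    -- Count fast-path locks per backend and lock mode; "fastpath" is a
    -- standard boolean column of pg_locks. Per pid and mode we never
    -- see this count go above 16.
    SELECT pid, mode, count(*) AS fastpath_locks
    FROM pg_locks
    WHERE fastpath
    GROUP BY pid, mode
    ORDER BY fastpath_locks DESC;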