Hi hackers!

Recently I observed a very peculiar incident.

=== Incident description ===

An ETL database had been operating fine for many months and was updated 
regularly. The workload was not changing much, but since it was an ETL 
database, most queries were different all the time.
On the night of September 7th the database got stuck: no monitoring query could 
be executed. DBAs started to deal with the incident, but there is not much you 
can do with a database service when you cannot execute a single query. 
According to VM metrics, the VM was writing a lot to disk.
The on-call engineer was summoned. He observed a lot of backends stuck with a 
similar backtrace:
#5  LWLockAcquire()
#6  pgss_store()
#7  pgss_post_parse_analyze()
#8  parse_analyze()
#9  pg_analyze_and_rewrite()

After a restart, the problem reproduced within 50 minutes. But this time the 
monitoring queries were working, which showed that all backends were stuck on 
the LWLock pg_stat_statements. It was impossible to disable pgss with SQL, so 
the engineer edited postgresql.auto.conf and restarted the database. This 
resolved the incident.

Later I was analyzing the incident. Re-enabling pgss showed traces of the 
problem:
            Fri 09 Sep 2022 08:52:31 AM MSK (every 2s)

     usename     |        state        |     wait_event     | cnt 
-----------------+---------------------+--------------------+-----
 content         | active              | DataFileRead       |   1
 content         | active              | pg_stat_statements |  42
 content         | idle in transaction | ClientRead         |   2
 pgwatch_monitor | active              | BufFileRead        |   1
 pgwatch_monitor | active              | [null]             |   5
 pgwatch_monitor | active              | pg_stat_statements |  85
 postgres        | active              | [null]             |   1
 repl            | active              | WalSenderMain      |   2
 [null]          | active              | [null]             |   2
 [null]          | active              | VacuumDelay        |   7
(10 rows)

pgwatch was querying 60 databases every minute, and each call to 
pg_stat_statements() took approximately 3-4 seconds.
The backend holding the long-lived shared lock looked like this in 
pg_stat_activity:

datid            | 127782
pid              | 4077900
usename          | pgwatch_monitor
application_name | pgwatch2 - 10.96.17.68
wait_event_type  | IO
wait_event       | BufFileWrite
state            | active
backend_xid      | [null]
backend_xmin     | 67048029

The contents of the pg_stat_statements view exceeded work_mem and were 
materialized in a tuplestore on disk. This is what caused a lot of disk writes 
on a database that was not accepting any user queries.

=== End of incident description ===

TL;DR: the "pg_stat_statements" LWLock blocked all SQL queries.


=== How the problem develops ===
Prerequisite 1. pgwatch is querying pgss often.
Prerequisite 2. pgss has grown big enough that the tuplestore spills to disk 
while the shared lock is held.
Prerequisite 3. Someone calls pg_stat_statements_reset(), or pgss_store() needs 
the exclusive lock for a new entry.

Consequence. The exclusive lock request queues behind the long-held shared lock 
and prevents any further shared locks from being taken.
Result. Any query that goes through the pgss hooks hangs.
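
For reference, here is a simplified sketch (not verbatim source) of the two 
code paths that interact: pg_stat_statements_internal() holds the shared lock 
for the whole time the result set is pushed into the tuplestore (which may 
spill to disk), while pgss_store() re-acquires the lock in exclusive mode when 
it sees a query that is not yet in the hashtable.

    /* Reader side, pg_stat_statements_internal(), simplified: */
    LWLockAcquire(pgss->lock, LW_SHARED);
    hash_seq_init(&hash_seq, pgss_hash);
    while ((entry = hash_seq_search(&hash_seq)) != NULL)
    {
        /* ... fill values[]/nulls[] for this entry ... */

        /* may spill to disk and take seconds once the result exceeds work_mem */
        tuplestore_putvalues(tupstore, tupdesc, values, nulls);
    }
    LWLockRelease(pgss->lock);

    /* Writer side, pgss_store(), simplified: */
    LWLockAcquire(pgss->lock, LW_SHARED);
    entry = (pgssEntry *) hash_search(pgss_hash, &key, HASH_FIND, NULL);
    if (!entry)
    {
        /*
         * New query: we need the exclusive lock.  It queues behind the
         * long-held shared lock above, and every later shared request
         * (i.e. every other backend running the pgss hooks) queues
         * behind us.
         */
        LWLockRelease(pgss->lock);
        LWLockAcquire(pgss->lock, LW_EXCLUSIVE);
        /* ... entry_alloc() ... */
    }
    /* ... update the entry's counters under its spinlock ... */
    LWLockRelease(pgss->lock);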


=== Reproduction for development purposes ===
0. Set up a database with pg_stat_statements.track = all.
1. Modify pg_stat_statements_internal() to wait for a long time under 
LWLockAcquire(pgss->lock, LW_SHARED) (see the sketch at the end of this 
section).
2. select * from pg_stat_statements();
3. select pg_stat_statements_reset();

Now the database is wedged. Any query will hang until pg_stat_statements() 
finishes.
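
For step 1, one crude way to simulate the long scan (for reproduction only; the 
exact placement does not matter much) is to sleep right after the shared lock 
is taken in pg_stat_statements_internal():

    /* In pg_stat_statements_internal(), reproduction hack only:
     * hold the shared lock long enough for an exclusive request to queue up. */
    LWLockAcquire(pgss->lock, LW_SHARED);
    pg_usleep(60L * 1000000L);   /* sleep 60 seconds while holding the lock */

While step 2 is sleeping inside this code, step 3 queues an exclusive lock 
request, and from that moment every backend that hits the pgss hooks blocks.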


=== How to fix ===
pgss uses an LWLock to protect its hashtable.
When the hashtable is reset or a new query is inserted in pgss_store(), the 
exclusive lock is taken.
When stats are updated for a known query or pg_stat_statements is read, the 
shared lock is taken.

I propose to change the shared lock used for stats updates into a conditional 
shared lock.
It cannot be acquired during reset() - and that's totally fine. It cannot be 
acquired while a new query is being inserted - and I think it's OK to skip some 
stats in this case. PFA the patch.

This patch prevents the disaster caused by the locking described here.
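
The attached patch is the authoritative version; the core of the idea is 
roughly the following (a sketch, not the patch itself):

    /*
     * In pgss_store(), on the stats-update path: try to take the shared
     * lock instead of blocking behind a queued exclusive request.
     */
    if (!LWLockConditionalAcquire(pgss->lock, LW_SHARED))
        return;             /* lock is contended: skip this stats update */

    entry = (pgssEntry *) hash_search(pgss_hash, &key, HASH_FIND, NULL);
    /* ... update counters as before ... */

    LWLockRelease(pgss->lock);

With this, a backend that merely wants to bump counters never waits behind 
reset() or a new-entry insertion, so user queries keep running even while 
pg_stat_statements() holds the shared lock for a long time.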


=== Other possible mitigation ===
The incident would not have occurred if the tuplestore had not spilled to its 
on-disk representation. But I don't see how to reliably protect against that.

What do you think?


Thanks!


Best regards, Andrey Borodin.

Attachment: v1-0001-Demonstrate-and-fix-lock-of-all-SQL-queries-by-pg.patch