Josh Berkus wrote:
The lack of vacuum could be occurring for any of 4 reasons:

1) Locking
2) You have a lot of tables and not enough autovac_workers / too much
sleep time
3) You need to autovac this particular table more frequently, since it
gets dirtied really fast
4) The table has been set with special autovac settings which keep it
from being autovac'd

We can currently distinguish between cased 2, 3, 4 based on existing
available facts.  However, distinguishing case 1 from 2 or 3, in
particular, isn't currently possible except by methods which require
collecting a lot of ad-hoc monitoring data over a period of time.  This
makes the effort required for the diagnosis completely out of proportion
with the magnitude of the problem.

Since no one coded up something more exciting, I just did a round of self-review of the little logging patch I sent upthread to see if it was a worthy CF candidate. I agree that *something* should be done, and I'd rather have a logging-based solution than nothing at all here.

So that patch is, to use a polite UK term I've adopted recently, rubbish. But unless I'm confused (it is late), I think there's a problem here that's much worse than what you described--in the case where someone has grabbed a really large lock on a table that needs cleanup at least. I started with the following test case:

SHOW log_autovacuum_min_duration;
DROP TABLE t;
CREATE TABLE t(s serial,i integer);
INSERT INTO t(i) SELECT generate_series(1,100000);
SELECT relname,last_autovacuum,autovacuum_count FROM pg_stat_user_tables WHERE relname='t';
DELETE FROM t WHERE s<50000;
\q

Which, so long as the min duration is 0, does mostly what I expected. I followed immediately by starting a new session, doing:

BEGIN;
LOCK t;

Before autovacuum got to the table; I left this session open, stuck there idle in a transaction holding a lock. I threw some extra logging in the code path leading up to where we'd both like the lack of lock acquisition to be logged, and it shows the following (at DEBUG3):

DEBUG:  t: vac: 49999 (threshold 50), anl: 149999 (threshold 50)
DEBUG:  autovacuum: t: vac needed
INFO:  vacuum_rel:  processing 16528
INFO:  vacuum_rel:  trying to open relation 16528

All but the first one of those lines are not in the current code, and as noted already that one existing line is at DEBUG3. This was trying to simulate the case you were complaining about: autovacuum has been triggered, it knows there is work to be done, and it can't do said work.

It hasn't triggered the error message I tried to add yet through, because it's not hitting that spot. Look where it's actually blocked at:

$ ps -eaf | grep postgres
gsmith 5490 4880 0 04:09 ? 00:00:00 postgres: autovacuum worker process gsmith waiting $ psql -c "select procpid,now() - query_start as runtime,current_query from pg_stat_activity where current_query like 'autovacuum%'" procpid | runtime | current_query ---------+-----------------+-------------------------------------
   5490 | 00:11:35.813727 | autovacuum: VACUUM ANALYZE public.t

I haven't just failed to vacuum the table that needs it without any entry in the logs saying as much. I've actually tied down an autovacuum worker and kept it from doing anything else until that lock is released! When vacuum_rel inside of vacuum.c does this on a relation it wants to acquire a lock on:

   onerel = try_relation_open(relid, lmode);

It just hangs out there forever, if the only issue is not being able to get the lock it wants. The abort code path I tried to insert logging into only occurs if the relation was deleted. Watch what's happened while I've been typing:

procpid | runtime | current_query ---------+-----------------+-------------------------------------
   5490 | 00:36:24.426528 | autovacuum: VACUUM ANALYZE public.t

I've now hit the AV worker with a denial of service attack, essentially, for over 35 minutes. I'd bet that if I was a malicious user, I could make AV come to a halt altogether for the whole server with this approach, even with relatively limited rights on the database.

Does try_relation_open need to have a lock acquisition timeout when AV is calling it? I don't see any other way to get a log mesage noting when this problem passed by in there, you can only see it live in these process view. And the current behavior doesn't feel right to me.

--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to