Re: [HACKERS] We need to log aborted autovacuums

Greg Smith Sat, 15 Jan 2011 01:48:45 -0800

Josh Berkus wrote:

The lack of vacuum could be occurring for any of 4 reasons:


1) Locking
2) You have a lot of tables and not enough autovac_workers / too much
sleep time
3) You need to autovac this particular table more frequently, since it
gets dirtied really fast
4) The table has been set with special autovac settings which keep it
from being autovac'd

We can currently distinguish between cased 2, 3, 4 based on existing
available facts.  However, distinguishing case 1 from 2 or 3, in
particular, isn't currently possible except by methods which require
collecting a lot of ad-hoc monitoring data over a period of time.  This
makes the effort required for the diagnosis completely out of proportion
with the magnitude of the problem.

Since no one coded up something more exciting, I just did a round ofself-review of the little logging patch I sent upthread to see if it wasa worthy CF candidate. I agree that *something* should be done, and I'drather have a logging-based solution than nothing at all here.

So that patch is, to use a polite UK term I've adopted recently,rubbish. But unless I'm confused (it is late), I think there's aproblem here that's much worse than what you described--in the casewhere someone has grabbed a really large lock on a table that needscleanup at least. I started with the following test case:


SHOW log_autovacuum_min_duration;
DROP TABLE t;
CREATE TABLE t(s serial,i integer);
INSERT INTO t(i) SELECT generate_series(1,100000);

SELECT relname,last_autovacuum,autovacuum_count FROM pg_stat_user_tablesWHERE relname='t';

DELETE FROM t WHERE s<50000;
\q

Which, so long as the min duration is 0, does mostly what I expected. Ifollowed immediately by starting a new session, doing:


BEGIN;
LOCK t;

Before autovacuum got to the table; I left this session open, stuckthere idle in a transaction holding a lock. I threw some extra loggingin the code path leading up to where we'd both like the lack of lockacquisition to be logged, and it shows the following (at DEBUG3):


DEBUG:  t: vac: 49999 (threshold 50), anl: 149999 (threshold 50)
DEBUG:  autovacuum: t: vac needed
INFO:  vacuum_rel:  processing 16528
INFO:  vacuum_rel:  trying to open relation 16528

All but the first one of those lines are not in the current code, and asnoted already that one existing line is at DEBUG3. This was trying tosimulate the case you were complaining about: autovacuum has beentriggered, it knows there is work to be done, and it can't do said work.

It hasn't triggered the error message I tried to add yet through,because it's not hitting that spot. Look where it's actually blocked at:


$ ps -eaf | grep postgres

gsmith 5490 4880 0 04:09 ? 00:00:00 postgres: autovacuumworker process gsmith waiting$ psql -c "select procpid,now() - query_start as runtime,current_queryfrom pg_stat_activity where current_query like 'autovacuum%'"procpid | runtime | current_query---------+-----------------+-------------------------------------

   5490 | 00:11:35.813727 | autovacuum: VACUUM ANALYZE public.t

I haven't just failed to vacuum the table that needs it without anyentry in the logs saying as much. I've actually tied down an autovacuumworker and kept it from doing anything else until that lock isreleased! When vacuum_rel inside of vacuum.c does this on a relation itwants to acquire a lock on:


   onerel = try_relation_open(relid, lmode);

It just hangs out there forever, if the only issue is not being able toget the lock it wants. The abort code path I tried to insert logginginto only occurs if the relation was deleted. Watch what's happenedwhile I've been typing:

procpid | runtime | current_query---------+-----------------+-------------------------------------

   5490 | 00:36:24.426528 | autovacuum: VACUUM ANALYZE public.t

I've now hit the AV worker with a denial of service attack, essentially,for over 35 minutes. I'd bet that if I was a malicious user, I couldmake AV come to a halt altogether for the whole server with thisapproach, even with relatively limited rights on the database.

Does try_relation_open need to have a lock acquisition timeout when AVis calling it? I don't see any other way to get a log mesage notingwhen this problem passed by in there, you can only see it live in theseprocess view. And the current behavior doesn't feel right to me.


--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] We need to log aborted autovacuums

Reply via email to