Kevin and Robert are well aware of most of the below. I just want to put
this out here so other people, who haven't followed the discussion too
closely, may chime in.
Some details on the problem:
First of all, the vacuum scan must detect a minimum of 1000 pages as
possibly empty at the end of a relation. Without at least 8MB of
potential free space at the end (1000 pages at 8kB each), the code
never calls lazy_truncate_heap(). This means we don't have to worry
about tiny relations at all. Any relation whose turnover stays under
8MB between autovacuum VACUUM runs can never get into this situation.
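The threshold logic can be sketched roughly like this (a simplified
model; the constant and function names are illustrative stand-ins, not
the actual symbols in PostgreSQL's lazy vacuum code):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative sketch of the truncation threshold described above:
 * at 8 kB per page, 1000 possibly-empty trailing pages equal 8 MB.
 * The constant name is a stand-in for the real source symbol. */
#define TRUNCATE_MIN_PAGES 1000L    /* 1000 * 8 kB = 8 MB */

/* Would lazy_truncate_heap() even be considered for this relation? */
static bool
truncation_considered(long possibly_empty_trailing_pages)
{
    return possibly_empty_trailing_pages >= TRUNCATE_MIN_PAGES;
}
```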
Relations with higher turnover than that, but spread across random
places or with a high percentage of rather static rows, don't fall into
the problem category either. They may never accumulate that much
contiguous free space at the end; the turnover keeps reusing free space
all over the place, so lazy_truncate_heap() is again never called.
Relations that eventually build up more than 8MB of free space at the
end aren't automatically a problem either. The autovacuum VACUUM scan
has just visited those pages at the end, which means that the truncate
safety scan, done under exclusive lock, checks exactly those pages
while they are most likely still in memory. That safety scan will be
fast thanks to a 99+% buffer cache hit rate.
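The safety scan's walk back from the end of the relation can be modeled
like this (a heavily simplified sketch: the real code re-reads heap
pages under the exclusive lock, which I have reduced here to one
boolean per page; all names are hypothetical):

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified model of the truncate safety scan: starting from the
 * end of the relation, walk backward while pages are empty and
 * return the number of pages to keep.  In the real code these are
 * heap pages re-checked under the exclusive lock; since the VACUUM
 * scan just visited them, they are usually still in shared buffers. */
static long
pages_to_keep(const bool *page_is_empty, long rel_pages)
{
    long keep = rel_pages;

    while (keep > 0 && page_is_empty[keep - 1])
        keep--;
    return keep;
}

/* Example layout: four pages, the last two empty. */
static const bool example_pages[4] = {false, false, true, true};
```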
The only actual problem case I have found so far is rolling window
tables of significant size that can bloat to multiple times their
normal size every now and then. This is indeed a rare corner case, and
I have no idea how many users may (unknowingly) be suffering from it.
This rare corner case triggers lazy_truncate_heap() with a significant
amount of free space to truncate. The table bloats, then all the bloat
is deleted, and the periodic 100% turnover guarantees that all "live"
tuples will shortly circulate in lower block numbers again, leaving
gigabytes of empty space at the end.
Even this by itself isn't a problem. The existing code may do the job
just fine unless there is frequent access to that very table. Only in
this special combination of circumstances do we actually have a
problem: with a significant amount of free space at the end and
frequent access to the table, the truncate safety scan takes long
enough, and has to actually read pages from disk, to interfere with
client transactions.
At this point, the truncate safety scan may have to be interrupted to
let the frequent other traffic go through. This is what we accomplish
with the autovacuum_truncate_lock_check interval, where we voluntarily
release the lock whenever someone else needs it. I agree with Kevin that
a 20ms check interval is reasonable because the code to check this is
even less expensive than releasing the exclusive lock we're holding.
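That voluntary yield could be sketched as follows. To keep the sketch
self-contained I replace the 20ms timer with a page-count interval,
and the lock-manager call with a callback; every name here is made up
for illustration:

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the 20 ms autovacuum_truncate_lock_check timer:
 * here we simply poll every N pages instead of every N ms. */
#define LOCK_CHECK_INTERVAL 32L

typedef bool (*waiter_check_fn)(void);

/* Run the safety scan over total_pages, polling for lock waiters at
 * the check interval.  Returns how many pages were scanned before
 * voluntarily releasing the exclusive lock, or all of them if nobody
 * else needed the lock. */
static long
scan_until_yield(long total_pages, waiter_check_fn lock_has_waiters)
{
    for (long page = 0; page < total_pages; page++)
    {
        if (page > 0 && page % LOCK_CHECK_INTERVAL == 0 &&
            lock_has_waiters())
            return page;        /* yield: release the exclusive lock */
    }
    return total_pages;         /* finished undisturbed */
}

/* Sample waiter callbacks for demonstration. */
static bool nobody_waiting(void)  { return false; }
static bool someone_waiting(void) { return true; }
```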
At the same time, completely giving up and relying on the autovacuum
launcher to start another worker isn't as free as it looks, either. The
next autovacuum worker will have to do the VACUUM scan first before
getting to the truncate phase; we cannot just skip blindly to the
truncate code. If the truncate is repeatedly abandoned, the table would
deteriorate and accumulate dead tuples again, and the removal of dead
tuples and their index tuples has priority.
As said earlier in the discussion, the VACUUM scan will skip pages
that are marked as all-visible, so it won't physically read the
majority of the empty pages at the end of the table over and over. But
it will at least scan all pages that have been modified since the last
VACUUM run.
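As a simplified model of that skipping behavior (the real code
consults the visibility map and only skips sufficiently long runs of
all-visible pages; this sketch ignores the run-length rule, and the
names are hypothetical):

```c
#include <assert.h>
#include <stdbool.h>

/* Count the pages a VACUUM scan would physically visit, given one
 * all-visible flag per page.  Pages still marked all-visible since
 * the last run are skipped; everything modified since then is read. */
static long
pages_physically_scanned(const bool *all_visible, long rel_pages)
{
    long scanned = 0;

    for (long i = 0; i < rel_pages; i++)
        if (!all_visible[i])
            scanned++;
    return scanned;
}

/* Example: six pages, only the first two modified since last VACUUM. */
static const bool example_vm[6] = {false, false, true, true, true, true};
```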
To me this means that we want to be more generous to the truncate code
about acquiring the exclusive lock. In my tests, I've seen that a
rolling window table with a "live" set of just 10MB or so, but empty
space of 3GB, can still have a 2 minute VACUUM scan time. Throwing that
work away because we can't acquire the exclusive lock within 2 seconds
is a waste of effort.
Something between 2 and 60 seconds sounds more reasonable to me.
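In other words, instead of one short deadline, the worker could keep
retrying the conditional lock grab until a configurable deadline
expires. A minimal sketch, with the lock manager and the sleep between
attempts abstracted away (all names hypothetical):

```c
#include <assert.h>
#include <stdbool.h>

typedef bool (*try_lock_fn)(void);

/* Retry a conditional exclusive-lock acquisition up to max_attempts
 * times, each attempt standing in for one short sleep-and-retry cycle
 * within the proposed 2-60 second window.  Returns true if the lock
 * was obtained before the deadline. */
static bool
acquire_before_deadline(try_lock_fn try_lock, int max_attempts)
{
    for (int attempt = 0; attempt < max_attempts; attempt++)
        if (try_lock())
            return true;
    return false;           /* give up; the VACUUM scan work is wasted */
}

/* Demonstration lock that becomes free on the fifth attempt. */
static int demo_calls = 0;
static bool
demo_try_lock(void)
{
    return ++demo_calls >= 5;
}
```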
Jan
--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)