Kevin and Robert are well aware of most of the below. I just want to put this out here so that other people who haven't followed the discussion too closely may chime in.

Some details on the problem:

First of all, there is a minimum of 1000 pages that the VACUUM scan must detect as possibly being all empty at the end of a relation. Without at least 8MB of possible free space at the end, the code never calls lazy_truncate_heap(). This means we don't have to worry about tiny relations at all. Any relation whose turnover between autovacuum VACUUM runs stays under 8MB can never get into this situation.
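
Just to spell out the arithmetic: with the default 8K block size, 1000 pages is exactly 8MB. Here is a toy sketch of that threshold, purely as an illustration of the math and not the actual vacuumlazy.c logic (the helper function is made up for the example; only the block size and the 1000-page minimum are the real numbers):

#include <stdio.h>
#include <stdbool.h>

#define BLCKSZ                8192    /* default PostgreSQL block size */
#define REL_TRUNCATE_MINIMUM  1000    /* pages: 1000 * 8KB = 8MB */

/* made-up helper: would this relation even be considered for truncation? */
static bool
would_consider_truncate(long rel_pages, long nonempty_pages)
{
    long    possibly_freeable = rel_pages - nonempty_pages;

    return possibly_freeable >= REL_TRUNCATE_MINIMUM;
}

int
main(void)
{
    /* 10MB table with 2MB of trailing empty pages: never considered */
    printf("%d\n", would_consider_truncate(1280, 1024));    /* prints 0 */

    /* 3GB table whose live rows fit in the first 10MB: considered */
    printf("%d\n", would_consider_truncate(393216, 1280));  /* prints 1 */

    return 0;
}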

Relations with higher turnover than that, but in random places or with a high percentage of rather static rows, don't fall into the problem category either. They may never accumulate that much "contiguous free space at the end", because the turnover keeps reusing free space all over the place. So again, lazy_truncate_heap() will never be called.

Relations that eventually build up more than 8MB of free space at the end aren't automatically a problem. The autovacuum VACUUM scan just scanned those pages at the end, which means that the truncate safety scan, done under exclusive lock, checks exactly those same pages and most likely finds them still in memory. The truncate safety scan will be fast due to a 99+% buffer cache hit rate.

The only actual problem case I have found so far is rolling window tables of significant size that can bloat to multiple times their normal size every now and then. This is indeed a rare corner case, and I have no idea how many users may (unknowingly) be suffering from it.

This rare corner case triggers lazy_truncate_heap() with a significant amount of free space to truncate. The table bloats, then all the bloat is deleted, and the periodic 100% turnover guarantees that all "live" tuples will shortly circulate in lower block numbers again, leaving gigabytes of empty space at the end.

This by itself still isn't a problem. The existing code may do the job just fine "unless" there is "frequent" access to that very table. Only with this special combination of circumstances do we actually have a problem.

Only now, with a significant amount of free space at the end and frequent access to the table, does the truncate safety scan take long enough, and actually have to read pages from disk, to interfere with client transactions.

At this point, the truncate safety scan may have to be interrupted to let the frequent other traffic through. This is what we accomplish with the autovacuum_truncate_lock_check interval, where we voluntarily release the lock whenever someone else needs it. I agree with Kevin that a 20ms check interval is reasonable, because the check itself is even less expensive than releasing the exclusive lock we're holding.
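
To make that concrete, the pattern inside the truncate safety scan would look roughly like the sketch below. This is a standalone illustration only; lock_has_waiters() and scan_one_page() are stand-ins for the real backend calls, and the 20ms figure is simply the check interval discussed above:

#include <stdio.h>
#include <stdbool.h>
#include <time.h>

#define CHECK_INTERVAL_MS 20    /* the proposed autovacuum_truncate_lock_check */

/* stand-in: is some other backend queued up behind our exclusive lock? */
static bool
lock_has_waiters(void)
{
    return false;               /* pretend nobody is waiting */
}

/* stand-in for checking one page of the relation for being empty */
static void
scan_one_page(long blkno)
{
    (void) blkno;
}

static double
elapsed_ms(const struct timespec *start)
{
    struct timespec now;

    clock_gettime(CLOCK_MONOTONIC, &now);
    return (now.tv_sec - start->tv_sec) * 1000.0 +
           (now.tv_nsec - start->tv_nsec) / 1000000.0;
}

int
main(void)
{
    long            rel_pages = 100000;
    long            blkno;
    struct timespec last_check;

    clock_gettime(CLOCK_MONOTONIC, &last_check);

    /* the safety scan walks backwards from the end of the table */
    for (blkno = rel_pages - 1; blkno >= 0; blkno--)
    {
        if (elapsed_ms(&last_check) >= CHECK_INTERVAL_MS)
        {
            if (lock_has_waiters())
            {
                printf("waiter detected, releasing lock at block %ld\n", blkno);
                return 0;
            }
            clock_gettime(CLOCK_MONOTONIC, &last_check);
        }
        scan_one_page(blkno);
    }

    printf("no waiters, the empty tail can be truncated\n");
    return 0;
}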

At the same time, completely giving up and relying on the autovacuum launcher to start another worker isn't as free as it looks either. The next autovacuum worker will have to do the VACUUM scan first, before getting to the truncate phase; we cannot just skip blindly to the truncate code. If the truncate were repeatedly aborted, the table would deteriorate and accumulate dead tuples again. The removal of dead tuples and their index tuples has priority.

As said earlier in the discussion, the VACUUM scan will skip pages that are marked as all visible. So the scan won't physically read the majority of the empty pages at the end of the table over and over. But it will at least scan all pages that have been modified since the last VACUUM run.
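
For illustration, the skipping behaviour boils down to something like this sketch, with the visibility map reduced to a plain boolean array (the real code reads the map itself and is more careful about not breaking up sequential read-ahead for short skippable runs):

#include <stdio.h>
#include <stdbool.h>
#include <stdlib.h>

int
main(void)
{
    long    rel_pages = 16;
    bool   *all_visible = calloc(rel_pages, sizeof(bool));
    long    blkno;
    long    scanned = 0;

    /* pretend everything past block 3 is empty and already all visible */
    for (blkno = 4; blkno < rel_pages; blkno++)
        all_visible[blkno] = true;

    for (blkno = 0; blkno < rel_pages; blkno++)
    {
        if (all_visible[blkno])
            continue;           /* skip the physical read entirely */
        scanned++;              /* would actually read and vacuum this page */
    }

    printf("scanned %ld of %ld pages\n", scanned, rel_pages);
    free(all_visible);
    return 0;
}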

To me this means that we want to be more generous to the truncate code about acquiring the exclusive lock. In my tests, I've seen that a rolling window table with a "live" set of just 10MB or so, but 3GB of empty space, can still have a 2-minute VACUUM scan time. Throwing that work away because we can't acquire the exclusive lock within 2 seconds is a waste of effort.

Something between 2 and 60 seconds sounds more reasonable to me.
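
In code form, what I have in mind is roughly the following: keep retrying a conditional lock request until some deadline in that range, instead of giving up after 2 seconds. Again a standalone sketch, with try_exclusive_lock() as a stand-in for a conditional lock attempt inside the backend, and the interval and deadline values just picked for illustration:

#include <stdio.h>
#include <stdbool.h>
#include <unistd.h>

#define RETRY_INTERVAL_MS  50       /* nap between lock attempts */
#define LOCK_DEADLINE_MS   10000    /* somewhere in the 2-60 second range */

/* stand-in: conditional (non-blocking) exclusive lock attempt */
static bool
try_exclusive_lock(void)
{
    static int  attempts = 0;

    return ++attempts > 40;         /* pretend the lock frees up after ~2s */
}

int
main(void)
{
    long    waited_ms = 0;

    while (!try_exclusive_lock())
    {
        if (waited_ms >= LOCK_DEADLINE_MS)
        {
            printf("gave up after %ld ms, keeping the vacuum results but skipping the truncate\n",
                   waited_ms);
            return 0;
        }
        usleep(RETRY_INTERVAL_MS * 1000);
        waited_ms += RETRY_INTERVAL_MS;
    }

    printf("got the exclusive lock after roughly %ld ms of waiting\n", waited_ms);
    return 0;
}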


Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin

