Here's my roadmap for the "scan-resistant buffer cache" and "synchronized scans" patches.

1. Fix the current vacuum behavior of throwing dirty buffers to the freelist, forcing a lot of WAL flushes. Instead, use a backend-private ring of shared buffers that are recycled. This is what Simon's "scan-resistant buffer manager" did.

The theory here is that if a page is read in by vacuum, it's unlikely to be accessed in the near future, therefore it should be recycled. If vacuum doesn't dirty the page, it's best to reuse the buffer immediately for the next page. However, if the buffer is dirty (and not just because we set hint bits), we ought to delay writing it to disk until the corresponding WAL record has been flushed to disk.

Simon's patch used a fixed size ring of buffers that are recycled, but I think the ring should be dynamically sized. Start with a small ring, and whenever you need to do a WAL flush to write a dirty buffer, increase the ring size. On every full iteration through the ring, decrease its size to trim down an unnecessarily large ring.

This only alters the behavior of vacuums, and it's pretty safe to say it won't get worse than what we have now. In the future, we can use the buffer ring for seqscans as well; more on that on step 3.

2. Implement the list/table of last/ongoing seq scan positions. This is Jeff's "synchronized scans" patch. When a seq scan starts on a table larger than some threshold, it starts from where the previous seq scan is currently, or where it ended. This will synchronize the scans so that for two concurrent scans the total I/O is halved in the best case. There should be no other effect on performance.

If you have a partitioned table, or union of multiple tables or any other plan where multiple seq scans are performed in arbitrary order, this change won't change the order the partitions are scanned and won't therefore ensure they will be synchronized.


Now that we have both pieces of the puzzle in place, it's time to consider what more we can do with them:


3A. To take advantage of the "cache trail" of a previous seq scan, scan backwards from where the previous seq scan ended, until a you hit a buffer that's not in cache.

This will allow taking advantage of the buffer cache even if the table doesn't fit completely in RAM. That can make a big difference if the table size is just slightly bigger than RAM, and can avoid the nasty surprise when a table grows beyond RAM size and queries start taking minutes instead of seconds.

This should be a non-controversial change on its own from performance point of view. No query should get slower, and some will become faster. But see step 3B:

3B. Currently, sequential scans on a large table spoils the buffer cache by evicting other pages from the cache. In CVS HEAD, as soon as the table is larger than shared_buffers, the pages in the buffer won't be used to speed up running the same query again, and there's no reason to believe the pages read in would be more useful than any other page in the database, and in particular the pages that were in the buffer cache before the huge seq scan. If the table being scanned is > 5 * shared_buffers, the scan will evict every other page from the cache if there's no other activity in the database (max usage_count is 5).

If the table is much larger than shared_buffers, say 10 times as large, even with the change 3B to read the pages that are in cache first, using all shared_buffers to cache the table will only speed up the query by 10%. We should not spoil the cache for such a small gain, and use the local buffer ring strategy instead. It's better to make queries that are slow anyway a little bit slower, than making queries that are normally really fast, slow.


As you may notice, 3A and 3B are at odds with each other. We can implement both, but you can't use both strategies in the same scan. Therefore we need to have decision logic of some kind to figure out which strategy is optimal.

A simple heuristic is to decide based on the table size:

< 0.1*shared_buffers -> start from page 0, keep in cache (like we do now)
< 5 * shared_buffers -> start from last read page, keep in cache
> 5 * shared_buffers -> start from last read page, use buffer ring

I'm not sure about the constants, we might need to make them GUC variables as Simon argued, but that would be the general approach.

In the future, I'm envisioning a smarter algorithm to size the local buffer ring. Take all buffers with usage_count=0, plus a few with usage_count=1 from the clock sweep. That way if there's a lot of buffers in the buffer cache that are seldomly used, we'll use more buffers to cache the large scan, and vice versa. And no matter how large the scan, it wouldn't blow all buffers from the cache. But if you execute the same query again, the buffers left in the cache from the last scan were apparently useful, so we use a bigger ring this time.

I'm going to do this incrementally, and we'll see how far we get for 8.3. We might push 3A and/or 3B to 8.4. First, I'm going to finish up Simon's patch (step 1), run some performance tests with vacuum, and submit a patch for that. Then I'll move to Jeff's patch (step 2).

Thoughts? Everyone happy with the roadmap?

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

              http://archives.postgresql.org

Reply via email to