On Fri, Apr 18, 2014 at 11:46 AM, Greg Stark <st...@mit.edu> wrote:
> On Fri, Apr 18, 2014 at 4:14 PM, Robert Haas <robertmh...@gmail.com> wrote:
>> I am a bit confused by this remark.  In *any* circumstance when you
>> evict you're incurring precisely one page fault I/O when the page is
>> read back in.  That doesn't mean that the choice of which page to
>> evict is irrelevant.
>
> But you might be evicting a page that will be needed soon or one that
> won't be needed for a while. If it's not needed for a while you might
> be able to avoid many page evictions by caching a page that will be
> used several times.
Sure.

> If all the pages currently in RAM are hot -- meaning they're hot
> enough that they'll be needed again before the page you're reading in
> -- then they're all equally bad to evict.

Also true.  But the problem is that it is very rarely, if ever, the
case that all pages are *equally* hot.  On a pgbench workload, for
example, I'm very confident that while there's not really any cold
data, the btree roots and visibility map pages are a whole lot hotter
than a randomly-selected heap page.  If you evict a heap page, you're
going to need it back pretty quick, because it won't be long until the
random-number generator again chooses a key that happens to be located
on that page.  But if you evict the root of the btree index, you're
going to need it back *immediately*, because the very next query, no
matter what key it's looking for, is going to need that page.  I'm
pretty sure that's a significant difference.

> I'm trying to push us away from the gut instinct that frequently used
> pages are important to cache and towards actually counting how many
> i/os we're saving. In the extreme it's possible to simulate any cache
> algorithm on a recorded list of page requests and count how many page
> misses it generates to compare it with an optimal cache algorithm.

There's another issue, which Simon clued me into a few years back:
evicting the wrong page can cause system-wide stalls.  In the pgbench
case, evicting a heap page will force the next process that chooses a
random number that maps to a tuple on that page to wait for the page
to be faulted back in.  That's sad, but unless the scale factor is
small compared to the number of backends, there will probably be only
ONE process waiting.  On the other hand, if we evict the btree root,
within a fraction of a second, EVERY process that isn't already
waiting on some other I/O will be waiting for that I/O to complete.
The impact on throughput is much bigger in that case.
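For what it's worth, the trace-driven comparison Greg describes is easy
to sketch.  The toy Python below (not PostgreSQL code; the trace, cache
size, and page names are made-up illustrative values) replays a recorded
list of page requests through LRU and through Belady's optimal MIN
policy -- evict the page whose next use is farthest in the future -- and
counts the misses each incurs:

```python
from collections import OrderedDict

def lru_misses(trace, size):
    """Replay trace through an LRU cache of the given size; return miss count."""
    cache = OrderedDict()  # keys kept in recency order, oldest first
    misses = 0
    for page in trace:
        if page in cache:
            cache.move_to_end(page)        # mark as most recently used
        else:
            misses += 1
            if len(cache) >= size:
                cache.popitem(last=False)  # evict least recently used
            cache[page] = True
    return misses

def optimal_misses(trace, size):
    """Same replay under Belady's MIN: evict the page referenced farthest ahead."""
    cache = set()
    misses = 0
    for i, page in enumerate(trace):
        if page in cache:
            continue
        misses += 1
        if len(cache) >= size:
            def next_use(p):
                try:
                    return trace.index(p, i + 1)   # position of next reference
                except ValueError:
                    return float("inf")            # never referenced again
            cache.remove(max(cache, key=next_use))
        cache.add(page)
    return misses

# A cyclic trace whose working set (3 pages) exceeds the cache (2 pages):
# LRU misses on every request, while MIN avoids a third of the misses.
trace = [1, 2, 3] * 3
print(lru_misses(trace, 2), optimal_misses(trace, 2))   # -> 9 6
```

Note this counts misses only; it says nothing about the stall-width
effect above, where one miss on a btree root blocks every backend at
once while a miss on a heap page blocks just one.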
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers