On Mon, Mar 21, 2011 at 5:24 AM, Greg Stark <gsst...@mit.edu> wrote:
> On Fri, Mar 18, 2011 at 11:55 PM, Josh Berkus <j...@agliodbs.com> wrote:
>>> To take the opposite approach... has anyone looked at having the OS just
>>> manage all caching for us? Something like MMAPed shared buffers? Even if
>>> we find the issue with large shared buffers, we still can't dedicate
>>> serious amounts of memory to them because of work_mem issues. Granted,
>>> that's something else on the TODO list, but it really seems like we're
>>> re-inventing the wheels that the OS has already created here...
>
> A lot of people have talked about it. You can find references to mmap
> going back at least as far as 2001 or so. The problem is that it would
> depend on the OS implementing things in a certain way and guaranteeing
> things we don't think can be portably assumed. We would need to mlock
> large amounts of address space, which most OSes don't allow, and we
> would need to mlock and munlock lots of small bits of memory all over
> the place, which would create lots and lots of mappings that the
> kernel and hardware implementations would generally not appreciate.
>
>> As far as I know, no OS has a more sophisticated approach to eviction
>> than LRU. And clock-sweep is a significant improvement in performance
>> over LRU for frequently accessed database objects ... plus our
>> optimizations around not overwriting the whole cache for things like
>> VACUUM.
>
> The clock-sweep algorithm was standard OS design before you or I knew
> how to type. I would expect any half-decent OS to have something at
> least as good -- perhaps better, because it can rely on hardware
> features to handle things.
>
> However, the second point is the crux of the issue, and of all similar
> issues about where to draw the line between the OS and Postgres. The
> OS knows more about the hardware characteristics and can better
> optimize the overall system behaviour, but Postgres understands its
> own access patterns better and can optimize its own behaviour, whereas
> the OS is stuck reverse-engineering what Postgres needs, usually from
> simple heuristics.
>
>> 2-level caches work well for a variety of applications.
>
> I think a 2-level cache with simple heuristics like "pin all the
> indexes" is unlikely to be helpful. At least it won't optimize the
> average case, and I think that's been proven. It might help optimize
> the worst case, which would reduce the standard deviation. Perhaps
> we're at the point now where that matters.
>
> Where it might be helpful is as a more refined version of the
> "sequential scans use a limited set of buffers" patch. Instead of
> having each sequential scan use a hard-coded number of buffers,
> perhaps all sequential scans should share a fraction of the global
> buffer pool managed separately from the main pool. Though in my
> thought experiments I don't see any real win here: in the current
> scheme, if there's any sign a buffer is useful it gets promoted out of
> the sequential scan's set of buffers for reuse anyway.
>
>> Now, what would be *really* useful is some way to avoid all the data
>> copying we do between shared_buffers and the FS cache.
>
> Well, the two options are mmap/mlock or direct I/O. The former might
> be a fun experiment, but I expect any OS to fall over pretty quickly
> when faced with thousands (or millions) of 8kB mappings. The latter
> would need Postgres to do async I/O, and hopefully maintain a global
> view of its own I/O access patterns so it could do prefetching in a
> lot more cases.
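For what it's worth, the async-I/O-plus-prefetch half of that doesn't
look too bad on paper. Here is an untested, Linux-flavored sketch of
the shape of it (the file path and sizes are invented): with O_DIRECT
there is no kernel readahead, so the "prefetch" is just issuing the
read early, here via POSIX AIO:

#define _GNU_SOURCE             /* O_DIRECT */
#include <aio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BLCKSZ 8192

int
main(void)
{
    /* O_DIRECT bypasses the kernel page cache -- and its readahead. */
    int fd = open("base/16384/16385", O_RDONLY | O_DIRECT);
    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    /* O_DIRECT requires an aligned buffer. */
    void *buf;
    if (posix_memalign(&buf, BLCKSZ, BLCKSZ) != 0)
        return 1;

    /* Issue the read asynchronously: this is the "prefetch". */
    struct aiocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf = buf;
    cb.aio_nbytes = BLCKSZ;
    cb.aio_offset = 0;

    if (aio_read(&cb) != 0)
    {
        perror("aio_read");
        return 1;
    }

    /* ... execution continues while the read is in flight ... */

    /* Block only when we actually need the page. */
    const struct aiocb *list[1] = { &cb };
    aio_suspend(list, 1, NULL);
    return aio_return(&cb) == BLCKSZ ? 0 : 1;
}

(Link with -lrt. The hard part Greg is pointing at isn't this
mechanism, it's having a global enough view of access patterns to know
which blocks to issue early.)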
Can't you make just one large mapping and lock it in 8k regions? I
thought the problems with mmap were not being able to detect changes
made by other processes
(http://www.mail-archive.com/pgsql-general@postgresql.org/msg122301.html),
compatibility issues (possibly obsolete), etc.
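I mean something like this (again untested and Linux-flavored, sizes
invented):

#define _GNU_SOURCE             /* MAP_ANONYMOUS */
#include <stdio.h>
#include <sys/mman.h>

#define BLCKSZ   8192
#define NBUFFERS (256 * 1024)   /* ~2GB of 8kB buffers, say */

int
main(void)
{
    /* One large anonymous mapping up front, sized like shared_buffers. */
    char *pool = mmap(NULL, (size_t) NBUFFERS * BLCKSZ,
                      PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (pool == MAP_FAILED)
    {
        perror("mmap");
        return 1;
    }

    /* Pin and unpin individual 8kB buffers as they become hot or cold.
     * mlock works at page granularity, so this is legal. */
    if (mlock(pool + 42 * BLCKSZ, BLCKSZ) != 0)
        perror("mlock");        /* usually RLIMIT_MEMLOCK */
    if (munlock(pool + 42 * BLCKSZ, BLCKSZ) != 0)
        perror("munlock");

    return 0;
}

The catch -- and I suspect this is Greg's "lots and lots of mappings"
objection in another form -- is that on Linux each mlock/munlock at a
distinct offset can split the region's VMA, so after enough churn the
kernel is tracking a huge number of mappings even though userspace
only ever sees one.

merlin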