James Bottomley <james.bottom...@hansenpartnership.com> writes:
> The current mechanism for coherency between a userspace cache and the
> in-kernel page cache is mmap ... that's the only way you get the same
> page in both currently.

Right.

> glibc used to have an implementation of read/write in terms of mmap, so
> it should be possible to insert it into your current implementation
> without a major rewrite.  The problem I think this brings you is
> uncontrolled writeback: you don't want dirty pages to go to disk until
> you issue a write()

Exactly.

> I think we could fix this with another madvise():
> something like MADV_WILLUPDATE telling the page cache we expect to alter
> the pages again, so don't be aggressive about cleaning them.

"Don't be aggressive" isn't good enough.  The prohibition on early write
has to be absolute, because writing a dirty page before we've done
whatever else we need to do results in a corrupt database.  It has to
be treated like a write barrier.

> The problem is we can't give you absolute control of when pages are
> written back because that interface can be used to DoS the system: once
> we get too many dirty uncleanable pages, we'll thrash looking for memory
> and the system will livelock.

Understood, but that makes this direction a dead end.  We can't use
it if the kernel might decide to write anyway.

			regards, tom lane
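
For illustration, the mmap-based update path discussed above might look
roughly like the C sketch below.  The helper name update_page_via_mmap,
the single-page file layout, and the error handling are assumptions made
for this example, not anything taken from the thread or from PostgreSQL.
It uses only the standard mmap()/msync() calls; MADV_WILLUPDATE is a
proposal in the quoted text and is deliberately not used.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: modify the first page of a file through a shared
 * mapping, then explicitly flush it.  Returns 0 on success, -1 on failure.
 */
int update_page_via_mmap(const char *path, const char *payload)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t n = strlen(payload);

    /* The payload (plus its terminating NUL) must fit in one page. */
    assert(page > 0 && n < (size_t) page);

    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;

    /* Make sure the file is at least one page long before mapping it. */
    if (ftruncate(fd, (off_t) page) != 0) {
        close(fd);
        return -1;
    }

    /* MAP_SHARED: the mapped page is the kernel's page-cache page. */
    char *buf = mmap(NULL, (size_t) page, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED) {
        close(fd);
        return -1;
    }

    /*
     * After this store the page is dirty.  The kernel is free to write
     * it to disk at any moment from here on; the application has no way
     * to forbid that.
     */
    memcpy(buf, payload, n + 1);

    /*
     * Whatever else must reach disk first would have to be completed
     * here, but nothing guarantees the dirty page above has not already
     * been written back.
     */

    /* Explicit flush: the latest point at which the page goes to disk. */
    int rc = msync(buf, (size_t) page, MS_SYNC);

    if (munmap(buf, (size_t) page) != 0)
        rc = -1;
    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

The msync() call only sets an upper bound on when the page is written;
it does not set a lower bound.  That missing lower bound is the "write
barrier" guarantee the reply above says is required.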