On Wed, Feb 5, 2014 at 10:03 PM, Matthew Ahrens <mahr...@delphix.com> wrote:
> On Wed, Feb 5, 2014 at 2:31 AM, Glauber Costa <glom...@cloudius-systems.com> wrote:
>>
>> Hi,
>>
>> I've been recently trying to devise some mechanism to reuse the ARC
>> buffers directly in a file-backed mapping (created by mmap, for
>> instance). My main goal is to avoid duplication between what is in the
>> ARC and what is in the page cache (and, to be honest, the OS I am
>> working on does not have a page cache, so my real goal is to keep it
>> that way).
>>
>> It seems that Solaris and BSD never did this, but I could not find any
>> indication of why. This is the kind of thing I am fairly sure was
>> considered before, so I wonder whether the lack of such an
>> implementation is due to a major showstopper you found.
>>
>> So before I dive too deeply into this, can anybody advise me?
>>
>> Thanks
>
> As I recall, the main reason we kept the ZFS cache separate from the page
> cache was to avoid complexity related to the different locking models. If
> you are designing mmap from scratch, I imagine you could avoid that.

Most definitely. This does not exist for me yet outside of a sheet of paper,
but as I evolve the design, one of the main problems I foresee is that every
ARC access is mediated by a read or write operation that can call arc_access
at a well-defined point. If a buffer is used outside of the ARC, those
accesses won't happen. Especially for mmap, that information is held in the
processor's accessed/dirty bits (dirty mattering for the case of moving to
anonymous memory), and periodically scanning the whole mapping can be
prohibitive - although I am talking about dozens of GBs of memory here;
hundreds is a bit out of our scope. Did you give this any previous thought?

My proposed solution so far is, every time a new page is inserted into the
cache, to check the bits of the page it would dislodge and update the ARC
state accordingly if needed. The same goes for eviction, especially
eviction to a ghost list.
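To make the idea concrete, here is a toy sketch of the bit-harvesting
scheme I have in mind. None of these names exist in ZFS or in my code yet
(Page, ArcBuf, sync_access_bit, evict_to_ghost are all hypothetical); the
point is only to show where the accessed/dirty bits would be folded into
the recency state arc_access normally maintains:

```python
# Toy model: ARC recency is normally updated by arc_access() on every
# read/write, but an mmap'ed buffer is touched through the MMU, so the
# only record of an access is the page-table accessed bit. This sketch
# harvests that bit at the proposed well-defined points: when a new page
# dislodges this one, and on eviction (especially to a ghost list).
# All names are illustrative stand-ins, not ZFS code.

import time

class Page:
    """A mapped page with the MMU-maintained accessed/dirty bits."""
    def __init__(self):
        self.accessed = False   # set by hardware on any load/store
        self.dirty = False      # set by hardware on a store

class ArcBuf:
    """A cached buffer with the recency info arc_access() would keep."""
    def __init__(self, name):
        self.name = name
        self.last_access = 0.0
        self.page = Page()      # the page currently mapping this buffer

def sync_access_bit(buf, now=None):
    """Fold the hardware accessed bit into ARC recency state."""
    if buf.page.accessed:
        buf.last_access = now if now is not None else time.monotonic()
        buf.page.accessed = False   # re-arm the bit for the next interval
        return True
    return False

def evict_to_ghost(buf, now=None):
    """Before ghosting a buffer, harvest its bits so the ghost list sees
    accurate recency, and report whether a write-back is still needed."""
    sync_access_bit(buf, now)
    needs_writeback = buf.page.dirty
    buf.page.dirty = False
    return needs_writeback
```

The cost is bounded because the bits are only consulted at
insertion/eviction time, never by a periodic scan of the whole mapping.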
> Read-only mmap should be relatively straightforward. When the page is
> faulted in you can just keep the dbuf (dmu_buf_impl_t) held, so that it
> stays in memory, and then find its page_t and map it into the process's
> address space.

Actually, I don't want it to stay in memory. One of the big wins for me in
doing this is being able to reuse ZFS's paging policy instead of
implementing our own. We only ever want to do paging for file-backed shared
mappings, so we have no other kind of paging. What I intend to do is have
ZFS tell the OS when a page is about to be taken out of the cache, and then
allow me to clear the present bit. That still seems doable given what you
said, as long as I do all those updates with the buffer still held (which
doesn't seem to be a problem).

> Write-back mmap (i.e. PROT_WRITE + MAP_SHARED) will be trickier, because
> you can't modify the page while the dbuf is being written (due to
> checksums, raid-z, etc.). Nor can you have transactions of indefinite
> length (e.g. create a transaction when the page is first stored to,
> commit it when pageout gets around to flushing it). I guess you could do
> something like mark the dbuf dirty and then, when syncing context
> (dbuf_sync_leaf()) gets around to writing it, copy the data from the page
> to a new arc_buf that exists only while it is being written out.

In the worst case I can COW on write and implement a write-back mechanism
that keeps the mapping populated with clean pages only (re-reading them
upon write-back completion), but that would be less desirable. Is the
limitation about not being able to modify the dbuf only valid for the
period in which I/O is in flight? That seems possible to overcome as well,
although we'd already be out of the trivial zone. As for transactions of
indefinite length, I'd have to look at that part of the code more
carefully. I believe I am still lacking some understanding here - which
makes your heads-up even more valuable.

Thanks again.
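For my own notes, a toy model of the copy-at-sync idea you describe - the
page stays writable by the application, and only when syncing context gets
to it is the content snapshotted into a private buffer, so the checksum
(and raid-z column generation, etc.) sees a stable image. Everything here
(Dbuf, sync_leaf, the checksum) is an illustrative stand-in, not the real
dbuf_sync_leaf() path:

```python
# Toy model of "mark the dbuf dirty, snapshot at sync time": stores race
# freely against sync because the checksum is computed over a snapshot,
# never over the live page. Names are hypothetical, not ZFS code.

class Dbuf:
    def __init__(self, data):
        self.page = bytearray(data)  # the live, application-visible page
        self.dirty = False

def store(dbuf, off, val):
    """Application store through the mapping: just dirty the dbuf."""
    dbuf.page[off] = val
    dbuf.dirty = True

def checksum(buf):
    """Stand-in checksum; the point is it must see a stable buffer."""
    s = 0
    for b in buf:
        s = (s * 31 + b) & 0xFFFFFFFF
    return s

def sync_leaf(dbuf):
    """Syncing context: snapshot the page, checksum and 'write' the
    snapshot. Later stores to the page cannot invalidate the checksum."""
    if not dbuf.dirty:
        return None
    snapshot = bytes(dbuf.page)      # the short-lived copy, in the mail's
    dbuf.dirty = False               # terms a new arc_buf for the write
    return snapshot, checksum(snapshot)
```

The snapshot plays the role of the short-lived arc_buf: it pins a
consistent image for the duration of the write while the mapping keeps
taking stores, which re-dirty the dbuf for the next sync pass.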
_______________________________________________
developer mailing list
developer@open-zfs.org
http://lists.open-zfs.org/mailman/listinfo/developer