On Wed, Feb 5, 2014 at 10:03 PM, Matthew Ahrens <mahr...@delphix.com> wrote:

> On Wed, Feb 5, 2014 at 2:31 AM, Glauber Costa <
> glom...@cloudius-systems.com> wrote:
>
>> Hi
>>
>> I've been recently trying to devise some mechanism to reuse the ARC
>> buffers directly in a file-backed mapping (created by mmap, for instance).
>> My main goal is not to have duplication between what is in the ARC and what
>> is in the page cache (and to be honest, the OS I am working on does not
>> have a page cache, so my real goal is to keep it this way).
>>
>> It seems like Solaris and BSD never did that, but I could not find any
>> indication as to why.
>> That's the kind of thing I am pretty sure was thought about before, so I
>> wonder if the lack of an implementation like that is due to a major
>> showstopper found by you guys.
>>
>> So before I dive too deeply into this, can anybody advise me on this?
>>
>>
>> Thanks


> As I recall, the main reason we kept the ZFS cache separate from the page
> cache was to avoid complexity related to the different locking models.  If
> you are designing mmap from scratch, I imagine you could avoid that.
>

Most definitely. This does not exist yet for me outside of a sheet of paper,
but as the design evolves, one of the main problems I foresee is that every
ARC access is mediated by a read or write operation that can call arc_access
at a very well-defined point. If the buffer is used outside of the ARC, those
calls won't happen. For mmap especially, that information is held in the
processor's accessed / dirty bits (for the case of moving to anonymous), and
periodically synchronizing the whole of memory can be prohibitive - although
I am talking about dozens of GBs of memory here; hundreds is a bit out of our
scope.

Did you guys give this any previous thought?

My proposed solution so far is this: every time a new page is inserted into
the cache, check the bits of the page that it would dislodge and update
accordingly if needed. The same goes for eviction, especially eviction to the
ghost list.
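
A minimal sketch of that lazy approach, to make the idea concrete. Every name
here (mapped_buf, harvest_bits, pick_victim) is hypothetical and nothing below
is real ZFS or OS code: the two pte_* flags only model the hardware accessed /
dirty bits, and "hits" only models what arc_access would have recorded. The
point is that bits are read for eviction candidates alone, never for the whole
of memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a cache buffer that is also mapped into a process. */
struct mapped_buf {
    bool pte_accessed;     /* models the hardware "accessed" bit */
    bool pte_dirty;        /* models the hardware "dirty" bit */
    unsigned hits;         /* stand-in for what arc_access would record */
    bool needs_writeback;  /* set once a dirty bit has been observed */
};

/* Fold the hardware bits into the cache's own state, then clear them. */
static void harvest_bits(struct mapped_buf *b)
{
    if (b->pte_accessed) {
        b->hits++;
        b->pte_accessed = false;
    }
    if (b->pte_dirty) {
        b->needs_writeback = true;
        b->pte_dirty = false;
    }
}

/*
 * Choose a buffer to dislodge, second-chance style. Only the candidates
 * under the hand have their bits inspected. Returns the victim's index,
 * or -1 if every buffer is waiting for writeback.
 */
static int pick_victim(struct mapped_buf *bufs, size_t n, size_t *hand)
{
    assert(n > 0 && *hand < n);
    for (size_t step = 0; step < 2 * n; step++) {
        size_t idx = *hand;
        struct mapped_buf *b = &bufs[idx];
        bool touched = b->pte_accessed;

        harvest_bits(b);
        *hand = (idx + 1) % n;
        if (!touched && !b->needs_writeback)
            return (int)idx;
    }
    return -1;
}
```

A buffer touched since the last look gets its access recorded and survives one
more pass; a dirty one is never picked here and would have to be written back
first.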



> Read-only mmap should be relatively straightforward.  When the page is
> faulted in you can just keep the dbuf (dmu_buf_impl_t) held, so that it
> stays in memory, and then find its page_t and map it into the process's
> address space.
>

Actually, I don't want it to stay in memory. One of the big wins for me in
doing this is being able to reuse ZFS's paging policy instead of implementing
our own. We only ever want to do paging for file-backed shared mappings, so
we have no other kind of paging. What I intend to do is have ZFS tell the OS
when the page is about to be taken out of the cache, and then allow me to
clear the present bit. That still seems doable given what you said, as long
as I do all those updates with the buffer still held (which doesn't seem to
be a problem).
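
Roughly, the notification could look like the sketch below. It is only an
illustration under assumed names: cache_buf, evict_cb_t, cache_evict and the
one-field pte are all hypothetical, and the real dbuf hold/eviction machinery
in ZFS is considerably more involved. It shows the ordering that matters:
take a hold, let the OS clear the present bit, then release and evict.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback type: the cache calls it just before eviction. */
typedef void (*evict_cb_t)(void *data, void *arg);

/* Hypothetical cache buffer; "holds" models a reference count. */
struct cache_buf {
    void *data;
    int holds;
    evict_cb_t on_evict;
    void *cb_arg;
    bool resident;
};

/* Hypothetical page-table entry reduced to what the example needs. */
struct pte {
    bool present;
    void *frame;
};

/* OS side: drop the mapping so the next access faults again. */
static void unmap_cb(void *data, void *arg)
{
    struct pte *pte = arg;

    assert(pte->frame == data);
    pte->present = false;
    pte->frame = NULL;
}

/* Cache side: evict only if nobody else holds the buffer. */
static bool cache_evict(struct cache_buf *b)
{
    if (!b->resident || b->holds > 0)
        return false;

    b->holds++;                      /* keep the buffer alive for the callback */
    if (b->on_evict != NULL)
        b->on_evict(b->data, b->cb_arg);
    b->holds--;

    b->resident = false;             /* memory can be reclaimed from here on */
    return true;
}
```

A real implementation would also need a TLB shootdown after clearing the
present bit, which this sketch leaves out.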


>
> Write-back mmap (i.e. PROT_WRITE + MAP_SHARED) will be trickier, because
> you can't modify the page while the dbuf is being written (due to
> checksums, raid-z, etc).  Nor can you have transactions of indefinite
> length (e.g. create transaction when page is first stored to, commit it
> when pageout gets around to flushing it).  I guess you could do something
> like mark the dbuf dirty and then when syncing context (dbuf_sync_leaf())
> gets around to writing it, copy the data from the page to a new arc_buf
> that's used only while writing it out.
>
As a last resort, I can COW on write and implement a writeback mechanism that
lets me simplify things by keeping only clean pages in the mapping (re-reading
them upon writeback completion). But that would be less desirable.

Does the limitation on modifying the dbuf apply only for the period during
which the IO is in progress? That seems possible to overcome as well,
although we'd already be out of the trivial zone.
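
For reference, the copy-at-sync idea quoted above can be sketched as follows.
This is a toy under stated assumptions, not ZFS code: sync_snapshot,
snapshot_for_sync and toy_checksum are hypothetical names, and the checksum is
a simple Fletcher-style stand-in rather than anything ZFS uses. It shows why a
private copy helps: the checksum is computed over bytes that the mapping can
no longer change.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical private copy that lives only for the duration of the write. */
struct sync_snapshot {
    uint8_t *copy;
    size_t len;
    uint32_t checksum;
};

/* Toy Fletcher-style checksum; a stand-in, not the one ZFS uses. */
static uint32_t toy_checksum(const uint8_t *p, size_t len)
{
    uint32_t a = 0, b = 0;

    for (size_t i = 0; i < len; i++) {
        a += p[i];
        b += a;
    }
    return (b << 16) | (a & 0xffffu);
}

/* Copy the live page and checksum the copy. Returns 0 on success, -1 on OOM. */
static int snapshot_for_sync(const uint8_t *page, size_t len,
                             struct sync_snapshot *out)
{
    assert(page != NULL && out != NULL);
    out->copy = malloc(len);
    if (out->copy == NULL)
        return -1;
    memcpy(out->copy, page, len);
    out->len = len;
    out->checksum = toy_checksum(out->copy, len);
    return 0;
}

/* Release the copy once the write has completed. */
static void snapshot_free(struct sync_snapshot *s)
{
    free(s->copy);
    s->copy = NULL;
    s->len = 0;
}
```

Stores that land in the page after the copy is taken simply leave it dirty for
the next sync; the cost is one extra copy per dirty page written.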

As for transactions of indefinite length, I'd have to look at that part of
the code more carefully. I believe I am still lacking some understanding
here - which makes your heads-up even more valuable.

Thanks again
_______________________________________________
developer mailing list
developer@open-zfs.org
http://lists.open-zfs.org/mailman/listinfo/developer
