dgaudet 98/02/20 00:25:35
  Added:       docs     page_io
  Log:
  expand on what I mean by page-based i/o

  Revision  Changes    Path
  1.1                  apache-2.0/docs/page_io

  Index: page_io
  ===================================================================
  From [EMAIL PROTECTED] Fri Feb 20 00:36:52 1998
  Date: Fri, 20 Feb 1998 00:35:37 -0800 (PST)
  From: Dean Gaudet <[EMAIL PROTECTED]>
  To: [email protected]
  Subject: page-based i/o
  X-Comment: Visit http://www.arctic.org/~dgaudet/legal for information regarding copyright and disclaimer.
  Reply-To: [email protected]

  Ed asked me for more details on what I mean when I talk about
  "page-based zero copy i/o".

  While writing mod_mmap_static I was thinking about the primitives that
  the core requires of the filesystem.  What exactly is it that ties us
  into the filesystem?  And how would we abstract it?  The metadata (last
  modified time, file length) is actually pretty easy to abstract.  It's
  also easy to define an "index" function so that MultiViews and such can
  be implemented.  And with layered I/O we can hide the actual details of
  how you access these "virtual" files.

  But therein lies an inefficiency.  If we had only bread() for reading
  virtual files, then we would enforce at least one copy of the data.
  bread() supplies the place that the caller wants to see the data, and
  so the bread() code has to copy it.  But there's very little reason
  that bread() callers have to supply the buffer... bread() itself could
  supply the buffer.  Call this new interface page_read().  It looks
  something like this:

      typedef struct {
          const void *data;
          size_t data_len;    /* amt of data on page which is valid */
          ... other stuff necessary for managing the page pool ...
      } a_page_head;

      /* returns NULL if an error or EOF occurs, on EOF errno will be
       * set to 0
       */
      a_page_head *page_read(BUFF *fb);

      /* queues entire page for writing, returns 0 on success, -1 on
       * error
       */
      int page_write(BUFF *fb, a_page_head *);

  It's very important that a_page_head structures point to the data page
  rather than be part of the data page.
  This way we can build a_page_head structures which refer to parts of
  mmap()d memory.

  This stuff is a little more tricky to do, but is a big win for
  performance.  With this integrated into our layered I/O it means that
  we can have zero-copy performance while still getting the advantages of
  layering.

  But note I'm glossing over a bunch of details... like the fact that we
  have to decide if a_page_heads are shared data, and hence need
  reference counting (i.e. I said "queues for writing" up there, which
  means some bit of the a_page_head data has to be kept until it's
  actually written).  Similarly for the page data.

  There are other tricks in this area that we can take advantage of --
  like interprocess communication on architectures that do page flipping.
  On these boxes if you write() something that's page-aligned and
  page-sized to a pipe or unix socket, and the other end read()s into a
  page-aligned page-sized buffer then the kernel can get away without
  copying any data.  It just marks the two pages as shared copy-on-write,
  and only when they're written to will the copy be made.  So to make
  this work, your writer uses a ring of 2+ page-aligned/sized buffers so
  that it's not writing on something the reader is still reading.

  Dean
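
  Editorial sketch, not part of the original mail: one possible way to
  fill in the reference counting that the mail leaves open, showing a
  page head that points into shared bytes instead of copying them.
  Everything beyond the data and data_len fields is an assumption made
  for illustration: the page_region type, the refs and region fields,
  and the region_new, region_unref, page_head_new, page_head_ref and
  page_head_unref helpers are hypothetical names, and a malloc()ed
  buffer stands in for mmap()d memory to keep the example portable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical backing store; stands in for an mmap()d file. */
typedef struct {
    char  *base;   /* start of the shared bytes */
    size_t len;    /* total bytes in the region */
    int    refs;   /* creator plus every page head pointing in */
} page_region;

/* a_page_head as in the mail, plus assumed bookkeeping fields. */
typedef struct {
    const void  *data;      /* points INTO the region, never a copy */
    size_t       data_len;  /* valid bytes starting at data */
    int          refs;      /* assumed: current holders of this head */
    page_region *region;    /* assumed: owner of the bytes */
} a_page_head;

/* Create a region holding a copy of bytes; the creator owns one ref. */
page_region *region_new(const char *bytes, size_t len)
{
    page_region *r = malloc(sizeof *r);
    if (r == NULL)
        return NULL;
    r->base = malloc(len ? len : 1);
    if (r->base == NULL) {
        free(r);
        return NULL;
    }
    memcpy(r->base, bytes, len);
    r->len = len;
    r->refs = 1;
    return r;
}

/* Drop one reference; frees the region at zero. Returns refs left. */
int region_unref(page_region *r)
{
    int left = --r->refs;
    if (left == 0) {
        free(r->base);
        free(r);
    }
    return left;
}

/* Build a head covering [off, off + len) of the region, zero-copy. */
a_page_head *page_head_new(page_region *r, size_t off, size_t len)
{
    a_page_head *p;
    if (off > r->len || len > r->len - off)
        return NULL;            /* range falls outside the region */
    p = malloc(sizeof *p);
    if (p == NULL)
        return NULL;
    p->data = r->base + off;
    p->data_len = len;
    p->refs = 1;
    p->region = r;
    r->refs++;                  /* the head keeps the bytes alive */
    return p;
}

/* Take an extra reference, e.g. while a page sits in a write queue. */
void page_head_ref(a_page_head *p)
{
    assert(p->refs > 0);
    p->refs++;
}

/* Drop one reference; frees the head at zero. Returns refs left. */
int page_head_unref(a_page_head *p)
{
    int left;
    assert(p->refs > 0);
    left = --p->refs;
    if (left == 0) {
        region_unref(p->region);
        free(p);
    }
    return left;
}
```

  Under these assumptions, a page_write() that only queues the page
  would call page_head_ref() before returning and page_head_unref()
  once the bytes have actually gone out, so the caller can drop its own
  reference right away without the data disappearing underneath the
  pending write.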
