On Fri, Oct 27, 2000 at 02:23:04PM -0700, Linus Torvalds wrote:
>
> [...]
>
> That solution, btw, might be as simple as just saying:
> 
>  - raw IO is based on physical pages, and the COW mapping created by
>    fork() may cause the changes to be visible to either child or parent
>    or both, depending on usage patterns to the page in question.  For
>    repeatable behaviour, do not have outstanding direct IO in progress
>    over a fork().
> 
> Ie, just _document_ it. It's not _wrong_, it can just be surprising (but
> it is actually entirely straightforward and sane if you just look at it
> the right way).

Ok, here is an updated patch without that change, but instead with a little
piece of kiobuf documentation that does document this and other things
related to kiobufs.
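
For anyone skimming the raw.c hunk: a transfer that starts past the end
of the device now fails with -ENXIO instead of quietly returning 0,
unless the request is zero-length.  A minimal user-space sketch of that
check follows.  The function name raw_check_range, passing the limit in
as a parameter, and the return value 1 for "go ahead" are assumptions
made for illustration only; the real check sits inline in
drivers/char/raw.c.

```c
#include <errno.h>
#include <stddef.h>

/*
 * User-space model of the range check in the raw.c hunk (illustrative only).
 *
 * Returns  1       if the transfer may proceed (value chosen for this sketch),
 *          0       for a zero-length request that starts past the limit,
 *         -EINVAL  if offset or size is not sector aligned,
 *         -ENXIO   for a non-empty request that starts past the limit.
 */
long raw_check_range(unsigned long long off, size_t size,
                     unsigned int sector_bits, unsigned long long limit)
{
    unsigned long long sector_mask = (1ULL << sector_bits) - 1;

    /* Both the file offset and the length must be whole sectors. */
    if ((off & sector_mask) || (size & sector_mask))
        return -EINVAL;

    /* Starting sector beyond the device: error unless nothing was asked for. */
    if ((off >> sector_bits) > limit) {
        if (size)
            return -ENXIO;
        return 0;
    }

    return 1;
}
```

With 512-byte sectors (sector_bits = 9) and limit = 100, a 512-byte
request at sector 101 yields -ENXIO, while a zero-length request at the
same offset yields 0.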

        Christoph

-- 
Always remember that you are unique.  Just like everyone else.

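The mm/memory.c part of the patch deals with a race: between
handle_mm_fault() and the lookup under page_table_lock, kswapd may steal
the page again, so the fault-in is retried.  Below is a toy user-space
model of that loop, under loudly stated assumptions: fake_pte,
fake_fault_in, fake_follow and pin_page are hypothetical stand-ins, the
"steal" is simulated with a counter, and the loop is bounded by max_tries
only so that the sketch always terminates (the patch retries without a
bound).  In this sketch the lookup writes its out-parameter on every
path, so the caller never reads an unset flag.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one page-table entry. */
struct fake_pte {
    bool present;
    bool writable;
    int  steals_left;  /* how often the simulated swapper still steals the page */
    int  faults;       /* number of fault-ins performed, for inspection */
};

/* Stand-in for handle_mm_fault(): bring the page in, then maybe lose it. */
static int fake_fault_in(struct fake_pte *pte, bool write)
{
    pte->present = true;
    if (write)
        pte->writable = true;
    pte->faults++;

    /* Simulate the page being stolen before the caller can look it up. */
    if (pte->steals_left > 0) {
        pte->steals_left--;
        pte->present = false;
    }
    return 1;
}

/* Stand-in for follow_page(): the entry, or NULL with *retry set to 1. */
static struct fake_pte *fake_follow(struct fake_pte *pte, bool write, int *retry)
{
    *retry = 0;  /* set on the success path too */
    if (pte->present && (!write || pte->writable))
        return pte;
    *retry = 1;
    return NULL;
}

/* Fault in and look up until the page is still there, at most max_tries times. */
struct fake_pte *pin_page(struct fake_pte *pte, bool write, int max_tries)
{
    int retry;

    for (int i = 0; i < max_tries; i++) {
        if (fake_fault_in(pte, write) <= 0)
            return NULL;
        struct fake_pte *map = fake_follow(pte, write, &retry);
        if (!retry)
            return map;
        /* Page got stolen before we could pin it: go around again. */
    }
    return NULL;
}
```

If the simulated swapper steals the page twice, pin_page() succeeds on
the third fault-in.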

--- linux.orig/drivers/char/raw.c       Thu Oct 19 13:21:24 2000
+++ linux/drivers/char/raw.c    Sun Oct 29 20:55:43 2000
@@ -277,8 +277,11 @@
        
        if ((*offp & sector_mask) || (size & sector_mask))
                return -EINVAL;
-       if ((*offp >> sector_bits) > limit)
+       if ((*offp >> sector_bits) > limit) {
+               if (size)
+                       return -ENXIO;
                return 0;
+       }
 
        /* 
         * We'll just use one kiobuf
--- linux.orig/fs/buffer.c      Fri Oct 27 12:28:40 2000
+++ linux/fs/buffer.c   Sun Oct 29 20:55:43 2000
@@ -1924,6 +1924,8 @@
        
        spin_unlock(&unused_list_lock);
 
+       if (!iosize)
+               return -EIO;
        return iosize;
 }
 
--- linux.orig/mm/memory.c      Fri Oct 27 12:28:42 2000
+++ linux/mm/memory.c   Sun Oct 29 20:56:09 2000
@@ -382,9 +382,12 @@
 
 
 /*
- * Do a quick page-table lookup for a single page. 
+ * Do a quick page-table lookup for a single page. We have already verified
+ * access type, and done a fault in. But, kswapd might have stolen the page
+ * in the meantime. Return an indication of whether we should retry the fault
+ * in. Writability test is superfluous but conservative.
  */
-static struct page * follow_page(unsigned long address) 
+static struct page * follow_page(unsigned long address, int writeacc, int * ret) 
 {
        pgd_t *pgd;
        pmd_t *pmd;
@@ -393,10 +396,15 @@
        pmd = pmd_offset(pgd, address);
        if (pmd) {
                pte_t * pte = pte_offset(pmd, address);
-               if (pte && pte_present(*pte))
+               if (pte && pte_present(*pte)) {
+                       if (writeacc && !pte_write(*pte))
+                               goto retry;
                        return pte_page(*pte);
+               }
        }
-       
+
+retry:
+       *ret = 1;
        return NULL;
 }
 
@@ -428,7 +436,8 @@
        struct page *           map;
        int                     i;
        int                     datain = (rw == READ);
-       
+       int                     failed;
+
        /* Make sure the iobuf is not already mapped somewhere. */
        if (iobuf->nr_pages)
                return -EINVAL;
@@ -467,18 +476,22 @@
                        }
                        if (((datain) && (!(vma->vm_flags & VM_WRITE))) ||
                                        (!(vma->vm_flags & VM_READ))) {
-                               err = -EACCES;
                                goto out_unlock;
                        }
                }
+
+faultin:
                if (handle_mm_fault(current->mm, vma, ptr, datain) <= 0) 
                        goto out_unlock;
                spin_lock(&mm->page_table_lock);
-               map = follow_page(ptr);
-               if (!map) {
+               map = follow_page(ptr, datain, &failed);
+               if (!map) {
+                       /*
+                        * Page got stolen before we could lock it down.
+                        * Retry.
+                        */
                        spin_unlock(&mm->page_table_lock);
-                       dprintk (KERN_ERR "Missing page in map_user_kiobuf\n");
-                       goto out_unlock;
+                       goto faultin;
                }
                map = get_page_map(map);
                if (map)
diff -uNr linux.orig/Documentation/kiobuf.txt linux/Documentation/kiobuf.txt
--- linux.orig/Documentation/kiobuf.txt Thu Jan  1 01:00:00 1970
+++ linux/Documentation/kiobuf.txt      Sun Oct 29 21:38:20 2000
@@ -0,0 +1,100 @@
+               Abstract Kernel IO Buffers
+                     Under Linux
+       
+           Christoph Hellwig <[EMAIL PROTECTED]>
+
+
+This document describes the kiobuf concept used in the Linux Kernel
+IO/memory subsystem.  It describes its uses and the functions that work
+with kernel IO buffers, and shows some examples of kiobuf usage.
+
+
+The main reason for implementing kernel IO buffers (by Stephen Tweedie)
+was the lack of raw device support in Linux kernels <= 2.2.  Raw devices
+are the character devices that AT&T-derived UNIX versions implement to
+allow character-based, uncached access to mass storage devices.  In
+Linux kernels <= 2.2 all block device IO goes through either the buffer
+cache or the page cache, so applications like databases cannot get full
+control over their data.
+
+The solution in Linux 2.3 and higher is that the new raw device driver
+locks down the virtual memory it gets passed by the ->read and ->write
+methods and does physical page IO on it, bypassing the caches.
+NOTE: the physical memory referenced by kiobufs does not - unlike nearly
+everything else in Linux memory management - have reasonable COW
+semantics.  So don't even try to fork while doing raw IO or while using
+user-space memory in kiobufs in any other way.
+
+
+To use kiobufs in this way you need to allocate one or more kiobufs (an
+array of kiobufs is called a kiovec - do not confuse those with BSD iovecs).
+
+       err = alloc_kiovec (count, iovec);
+
+This allocates the memory for the wanted number of kiobufs (and adds them
+to a cache) and initializes some variables - in an OO language this would be
+the constructor.  Then you force the virtual memory to be faulted in and
+locked in physical memory, and reference it by the kiobuf.  (NOTE: this must
+be done for each kiobuf, not for the whole kiovec.)
+
+       err = map_user_kiobuf (rw, iobuf, address, len);
+
+After that you request IO against the wanted device.  For the case of
+raw devices, where IO should be requested against a block device, there
+is a function in fs/buffer.c that does exactly this.  (The parameter
+'blocks' is an array of the block numbers the IO should be requested
+against.)
+
+       err = brw_kiovec (rw, count, iovec, dev, blocks, sector_size);
+
+After the IO for this kiobuf is done, unmap the virtual memory.
+
+       unmap_kiobuf (iobuf);
+
+And when we are finished with the kiovec, free it.
+
+       free_kiovec (count, iovec);
+
+
+Locking down user memory and doing mass storage device IO with it is not
+the only purpose of kiobufs.  Another use for kiobufs is allowing
+user space to mmap DMA memory, e.g. in sound drivers.  To do so you
+need to lock down kernel virtual memory and reference it using kiobufs.
+The code that does exactly this is not yet in the kernel - get Stephen
+Tweedie's kiobuf patchset if you want to use this.
+
+
+In the long term it looks like all block device IO will be done using
+kiobufs.  In the SGI XFS tree there is code that allows passing kiovecs
+to the individual low-level block drivers.  There are lots of advantages
+to doing it this way:  the page cache no longer needs to fit the outstanding
+IO into lots of bufferheads and pass each bufferhead to ll_rw_block(),
+where the elevator merges some of them together for better device usage
+and submits them to the drivers.  Instead the cache locks down the pages
+and submits the kiovec to the low-level driver.  The low-level driver knows
+better how the request should be split for DMA or whatever.  On the
+other hand software RAID or LVM get more complicated:  instead of just
+doing block remapping they must split the kiobufs and - in the case of LVM -
+find ways to do efficient IO on contiguous areas.
+
+
+
+References:
+
+       Linux kernel source code
+           (fs/buffer.c, fs/iobuf.c, mm/memory.c, drivers/char/raw.c)
+       
+       SGI XFS for Linux
+           (http://oss.sgi.com/projects/linux-xfs/)
+       
+       Stephen Tweedie's kiobuf patchset
+           (ftp://ftp.linux.org.uk/pub/linux/sct/fs/raw-io/)
+
+       Linux MM mailing list
+           (http://humbolt.geo.uu.nl/Linux-MM/linux-mm.html)
+
+
+Thanks to Arjan van de Ven, Daniel Phillips and Marcelo Tosatti for
+proofreading this document and giving useful hints.
+
+$Id: kiobuf.txt,v 1.2 2000/10/29 20:37:54 hch Exp hch $
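
The call sequence described in Documentation/kiobuf.txt above (allocate,
map, do IO, unmap, free) can be sketched in user space roughly as
follows.  Everything prefixed sk_ is a hypothetical stand-in and not the
kernel API: the real alloc_kiovec(), map_user_kiobuf(), brw_kiovec(),
unmap_kiobuf() and free_kiovec() exist only inside the kernel, and the
stand-ins take simplified arguments (no device, no block list).  The
sketch only illustrates the ordering of the calls and the unwinding on
error; the -EIO for a zero-sized transfer mirrors the fs/buffer.c hunk.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

#define SK_READ  0
#define SK_WRITE 1

/* Hypothetical stand-in for struct kiobuf: only tracks the mapped state. */
struct sk_kiobuf {
    int mapped;
};

static int sk_alloc_kiovec(int count, struct sk_kiobuf **iovec)
{
    for (int i = 0; i < count; i++) {
        iovec[i] = calloc(1, sizeof(*iovec[i]));
        if (!iovec[i]) {
            while (i--)              /* undo the partial allocation */
                free(iovec[i]);
            return -ENOMEM;
        }
    }
    return 0;
}

static void sk_free_kiovec(int count, struct sk_kiobuf **iovec)
{
    for (int i = 0; i < count; i++) {
        free(iovec[i]);
        iovec[i] = NULL;
    }
}

static int sk_map_user_kiobuf(int rw, struct sk_kiobuf *iobuf,
                              void *addr, size_t len)
{
    (void)rw;
    (void)len;
    if (iobuf->mapped)               /* already mapped somewhere */
        return -EINVAL;
    if (!addr)
        return -EFAULT;
    iobuf->mapped = 1;               /* "locked down" from here on */
    return 0;
}

static void sk_unmap_kiobuf(struct sk_kiobuf *iobuf)
{
    iobuf->mapped = 0;
}

static long sk_brw_kiovec(int rw, int count, struct sk_kiobuf **iovec,
                          size_t len)
{
    (void)rw;
    (void)count;
    long iosize = iovec[0]->mapped ? (long)len : 0;

    if (!iosize)                     /* nothing transferred counts as an error */
        return -EIO;
    return iosize;
}

/* One transfer: allocate, map, do IO, unmap, free, in that order. */
long sk_raw_rw(int rw, void *addr, size_t len)
{
    struct sk_kiobuf *iobuf;
    long ret;

    ret = sk_alloc_kiovec(1, &iobuf);
    if (ret)
        return ret;

    ret = sk_map_user_kiobuf(rw, iobuf, addr, len);   /* per kiobuf */
    if (ret == 0) {
        ret = sk_brw_kiovec(rw, 1, &iobuf, len);
        sk_unmap_kiobuf(iobuf);      /* unmap only what was mapped */
    }

    sk_free_kiovec(1, &iobuf);
    return ret;
}
```

The kiovec is freed on every path, while the unmap happens only if the
mapping step succeeded.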