Don't feed the troll [offtopic] Re: Why Plan 9 C compilers don't have asm("")
Hey folks,

Just a quick reminder: don't feed the troll. He's very hungry.

-ben

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [RFD w/info-PATCH] device arguments from lookup, partition code in userspace
On Sat, 19 May 2001, Alexander Viro wrote:
> On Sat, 19 May 2001, Ben LaHaise wrote:
> >
> > It's not done yet, but similar techniques would be applied. I envision
> > that a raid device would support operations such as
> > open("/dev/md0/slot=5,hot-add=/dev/sda")
>
> Think for a moment and you'll see why it's not only ugly as hell, but
> simply won't work.

Yeah, I shouldn't be replying to email anymore in my bleary-eyed state. =) Of course slash-separated data doesn't work, so it would have to be hot-add=<filedescriptor> or some such. Gah, that's why the options are all parsed from a single lookup name anyways...

-ben (who's going to sleep)
Re: [RFD w/info-PATCH] device arguments from lookup, partition code in userspace
On Sat, 19 May 2001, Andrew Clausen wrote:
> (1) these issues are independent. The partition parsing could
> be done in user space, today, by blkpg, if I read the code correctly
> ;-) (there's an ioctl for [un]registering partitions) Never
> tried it though ;-)

I tried to imply that through the use of the word component. Yes, they're independent, but the code is pretty meaningless without a demonstration of how it's used. ;-)

> (2) what about bootstrapping? how do you find the root device?
> Do you do "root=/dev/hda/offset=63,limit=1235823"? Bit nasty.

root= becomes a parameter to mount, and initrd becomes mandatory. I'd be all for including all of the bits needed to build the initrd boot code in the tree, but it's completely in the air.

> (3) how does this work for LVM and RAID?

It's not done yet, but similar techniques would be applied. I envision that a raid device would support operations such as

	open("/dev/md0/slot=5,hot-add=/dev/sda")

> (4) <propaganda>libparted already has a fair bit of partition
> scanning code, etc. Should be trivial to hack it up... That said,
> it should be split up into .so modules... 200k is a bit heavy just
> for mounting partitions (most of the bulk is file system stuff).
> </propaganda>

Good. Less work to do.

> (5) what happens to /etc/fstab? User-space ([u]mount?) translates
> /dev/hda1 into /dev/hda/offset=63,limit=1235823, and back?

I'd just create a symlink to /dev/hda1 at mount time, although that really isn't what the user wants to see: the label or uuid is more useful.

-ben
[RFD w/info-PATCH] device arguments from lookup, partition code in userspace
Hey folks,

The work-in-progress patch (for demonstration purposes only) below consists of 3 major components, and is meant to start discussion about the future direction of device naming and its interaction with the block layer. The main motivations here are the wasting of minor numbers for partitions, and the duplication of code between user and kernel space in areas such as partition detection, uuid location, lvm setup, mount by label, journal replay, and so on...

1. Generic lookup method and argument parsing (fs/lookupargs.c)

This code implements a lookup function which is, for demonstration purposes, used in fs/block_dev.c. The general idea is to pass additional parameters to device drivers on open via a comma-separated list of options following the device's name. Sample uses:

	/dev/sda/raw                    -> open sda in raw mode
	/dev/sda/limit=102400           -> open sda with a limit of 100K
	/dev/sda/offset=1024,limit=2048 -> open a device that gives a view of
	                                   sda at an offset of 1KB to 2KB

The arguments are defined in a table (fs/block_dev.c:660), which defines the name and type of argument to parse. This table is used at lookup time to determine if an option name is valid (resulting in a positive dentry) or invalid. Potential uses for this are numerous: opening a control channel to a device, specifying a graphics mode for a framebuffer on open, replacing ioctls, lots of options. Please separate comments on this portion from the other parts of the patch.

2. Restricted block device (drivers/block/blkrestrict.c)

This is a quick-n-dirty implementation of a simple md-like block device that adds an offset to sector requests and limits the maximum offset on the device. The idea here is to replace the special-case minor numbers used for the partitioning code with a generic runtime-allocated translation node. The idea will work best once its data can be stored in a kdev_t structure.
The API for use is simple:

	kdev_t restrict_create_dev(kdev_t dev, unsigned long long offset,
	                           unsigned long long limit)

The associated cleanup of the startup code is not addressed here. Comments on this part (I know the implementation is ugly, talk about the ideas please)?

3. Userspace partition code proposal

Given the above two bits, here's a brief explanation of a proposal to move management of the partitioning scheme into userspace, along with portions of raid startup, lvm, uuid and mount by label code needed for mounting the root filesystem.

Consider that the device node currently known as /dev/hda5 can also be viewed as /dev/hda at offset 512000 with a limit of 10GB. With the extensions in fs/block_dev.c, you could replace /dev/hda5 with /dev/hda/offset=512000,limit=1024000. Now, by putting the partition parsing code into a libpart and binding mount to a libpart, the root filesystem mounting code can be run out of an initrd image. The use of mount gives us the ability to mount filesystems by UUID, by label or other exotic schemes without having to add any additional code to the kernel.

I'm going to stop writing this now. I need sleep... Folks, please let me know your opinions on the ideas presented herein, and do attempt to keep the bits of code that are useful.

Cheers,

-ben

[23:34:07] <viro> bcrl: you are sick.
[23:41:13] <viro> bcrl: you _are_ sick.
[23:43:24] <viro> bcrl: you are _fscking_ sick.
here starts v2.4.5-pre3_bdev_naming-A0.diff

diff -urN kernels/2.4/v2.4.5-pre3/Makefile bdev_naming/Makefile
--- kernels/2.4/v2.4.5-pre3/Makefile	Thu May 17 18:09:42 2001
+++ bdev_naming/Makefile	Sat May 19 01:33:39 2001
@@ -1,7 +1,7 @@
 VERSION = 2
 PATCHLEVEL = 4
 SUBLEVEL = 5
-EXTRAVERSION =-pre3
+EXTRAVERSION =-pre3-sick-test
 
 KERNELRELEASE=$(VERSION).$(PATCHLEVEL).$(SUBLEVEL)$(EXTRAVERSION)
diff -urN kernels/2.4/v2.4.5-pre3/arch/i386/boot/install.sh bdev_naming/arch/i386/boot/install.sh
--- kernels/2.4/v2.4.5-pre3/arch/i386/boot/install.sh	Tue Jan  3 06:57:26 1995
+++ bdev_naming/arch/i386/boot/install.sh	Fri May 18 20:24:36 2001
@@ -21,6 +21,7 @@
 
 # User may have a custom install script
 
+if [ -x ~/bin/installkernel ]; then exec ~/bin/installkernel "$@"; fi
 if [ -x /sbin/installkernel ]; then exec /sbin/installkernel "$@"; fi
 
 # Default install - same as make zlilo
diff -urN kernels/2.4/v2.4.5-pre3/drivers/block/Makefile bdev_naming/drivers/block/Makefile
--- kernels/2.4/v2.4.5-pre3/drivers/block/Makefile	Fri Dec 29 17:07:21 2000
+++ bdev_naming/drivers/block/Makefile	Sat May 19 00:29:08 2001
@@ -12,7 +12,7 @@
 export-objs
[PATCH] v2.4.4-ac9 highmem deadlock
Hey folks,

The patch below consists of 3 separate fixes for helping remove the deadlocks present in current kernels with respect to highmem systems. Each fix is to a separate file, so please accept/reject as such.

The first patch, adding __GFP_FAIL to GFP_BUFFER, is needed to fix a livelock caused by the kswapd -> swap out -> create_page_buffers -> GFP_BUFFER allocation -> waits for kswapd to wake up and free memory code path.

The second patch (to highmem.c) silences the critical shortage messages that make viewing any console output impossible, as well as managing to slow the machine down to a crawl when running with a serial console.

The third patch (to vmscan.c) adds a SCHED_YIELD to the page launder code before starting a launder loop. This one needs discussion, but what I'm attempting to accomplish is that when kswapd is cycling through page_launder repeatedly, bdflush or some other task submitting io via the bounce buffers needs to be given a chance to run and complete their io again. Failure to do so limits the rate of progress under extremely high load when the vast majority of io will be transferred via bounce buffers.

Comments?
-ben

start of v2.4.4-ac9-highmem-1.diff

diff -ur v2.4.4-ac9/include/linux/mm.h work/include/linux/mm.h
--- v2.4.4-ac9/include/linux/mm.h	Mon May 14 15:22:17 2001
+++ work/include/linux/mm.h	Mon May 14 18:33:21 2001
@@ -528,7 +528,7 @@
 #define GFP_BOUNCE	(__GFP_HIGH | __GFP_FAIL)
-#define GFP_BUFFER	(__GFP_HIGH | __GFP_WAIT)
+#define GFP_BUFFER	(__GFP_HIGH | __GFP_FAIL | __GFP_WAIT)
 #define GFP_ATOMIC	(__GFP_HIGH)
 #define GFP_USER	(__GFP_WAIT | __GFP_IO)
 #define GFP_HIGHUSER	(__GFP_WAIT | __GFP_IO | __GFP_HIGHMEM)
diff -ur v2.4.4-ac9/mm/highmem.c work/mm/highmem.c
--- v2.4.4-ac9/mm/highmem.c	Mon May 14 14:57:00 2001
+++ work/mm/highmem.c	Mon May 14 15:39:03 2001
@@ -279,6 +279,7 @@
 struct page *alloc_bounce_page (void)
 {
+	static int buffer_warning;
 	struct list_head *tmp;
 	struct page *page;
@@ -308,7 +309,8 @@
 	if (page)
 		return page;
 
-	printk(KERN_WARNING "mm: critical shortage of bounce buffers.\n");
+	if (!buffer_warning++)
+		printk(KERN_WARNING "mm: critical shortage of bounce buffers.\n");
 
 	current->policy |= SCHED_YIELD;
@@ -319,6 +321,7 @@
 struct buffer_head *alloc_bounce_bh (void)
 {
+	static int bh_warning;
 	struct list_head *tmp;
 	struct buffer_head *bh;
@@ -348,7 +351,8 @@
 	if (bh)
 		return bh;
 
-	printk(KERN_WARNING "mm: critical shortage of bounce bh's.\n");
+	if (!bh_warning++)
+		printk(KERN_WARNING "mm: critical shortage of bounce bh's.\n");
 
 	current->policy |= SCHED_YIELD;
diff -ur v2.4.4-ac9/mm/vmscan.c work/mm/vmscan.c
--- v2.4.4-ac9/mm/vmscan.c	Mon May 14 14:57:00 2001
+++ work/mm/vmscan.c	Mon May 14 16:43:05 2001
@@ -636,6 +636,12 @@
 	 */
 	shortage = free_shortage();
 	if (can_get_io_locks && !launder_loop && shortage) {
+		if (gfp_mask & __GFP_WAIT) {
+			__set_current_state(TASK_RUNNING);
+			current->policy |= SCHED_YIELD;
+			schedule();
+		}
+
 		launder_loop = 1;
 		/*
Re: [PATCH] zero^H^H^H^Hsingle copy pipe
On Mon, 7 May 2001, Manfred Spraul wrote:
> The main problem is that map_user_kiobuf() locks pages into memory.
> It's a bad idea for pipes. Either we must severely limit the maximum
> amount of data in the direct-copy buffers, or we must add a swap file
> based backing store. If I understand the BSD direct-pipe code correctly
> it has a swap file based backing store. I think that's insane. And
> limiting the direct copy buffers to a few kB defeats the purpose of
> direct copy.

Okay, how about the following instead (I'm thinking of generic code that we can reuse): continue to queue the (mm, address, length) tuple (I've actually got use for this too), and then use a map_mm_kiobuf (which is map_user_kiobuf but with an mm parameter) for the portion of the buffer that's currently being copied. That improves code reuse and gives us a few primitives that are quite useful elsewhere.

> And the current pipe_{read,write} are a total mess with nested loops and
> gotos. It's possible to create wakeup storms. I rewrote them as well ;-)

Cool! =)

-ben
Re: [PATCH] zero^H^H^H^Hsingle copy pipe
Manfred Spraul wrote:
> I'm now running with the patch for several hours, no problems.
>
> bw_pipe transfer rate has nearly doubled and the number of context
> switches for one bw_pipe run is down from 71500 to 5500.
>
> Please test it.

Any particular reason for not using davem's single copy kiobuf based code?

-ben
Re: rw_semaphores
On Sun, 8 Apr 2001, Linus Torvalds wrote:
> The "down_writer_failed()" case was wrong:

Which is exactly the same problem in the original code. How about the following patch against the original code? I hadn't sent it yet as the test code isn't finished (hence, it's untested), but given that Andrew is going full steam ahead, people might as well give this a try.

-ben

rwsem-2.4.4-pre1-A0.diff

diff -ur v2.4.4-pre1/arch/i386/kernel/semaphore.c work-2.4.4-pre1/arch/i386/kernel/semaphore.c
--- v2.4.4-pre1/arch/i386/kernel/semaphore.c	Sat Nov 18 20:31:25 2000
+++ work-2.4.4-pre1/arch/i386/kernel/semaphore.c	Mon Apr  9 09:47:02 2001
@@ -269,10 +269,9 @@
 	ret
 2:	call	down_write_failed
-	" LOCK "subl	$" RW_LOCK_BIAS_STR ",(%eax)
-	jz	1b
-	jnc	2b
-	jmp	3b
+	popl	%ecx
+	popl	%edx
+	ret
 "
 );
@@ -366,19 +365,56 @@
 	struct task_struct *tsk = current;
 	DECLARE_WAITQUEUE(wait, tsk);
 
-	__up_write(sem);	/* this takes care of granting the lock */
+	/* Originally we called __up_write here, but that
+	 * doesn't work: the lock add operation could result
+	 * in us failing to detect a bias grant.  Instead,
+	 * we'll use a compare and exchange to get the lock
+	 * from a known state: either <= -BIAS while another
+	 * waiter is still around, or > -BIAS if we were given
+	 * the lock's bias.
+	 */
+	do {
+		int old = atomic_read(&sem->count), new, res;
+
+		if (old > -RW_LOCK_BIAS)
+			return down_write_failed_biased(sem);
+
+		new = old + RW_LOCK_BIAS;
+		res = cmpxchg(&sem->count.counter, old, new);
+	} while (res != old);
+
+again:
+	/* We are now removed from the lock.  Wait for all other
+	 * waiting writers to go away.
+	 */
 	add_wait_queue_exclusive(&sem->wait, &wait);
 
 	while (atomic_read(&sem->count) < 0) {
 		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
-		if (atomic_read(&sem->count) >= 0)
+		if (atomic_read(&sem->count) >= 0) {
 			break;	/* we must attempt to acquire or bias the lock */
+		}
+
 		schedule();
 	}
 
 	remove_wait_queue(&sem->wait, &wait);
 	tsk->state = TASK_RUNNING;
+
+	/* Okay, try to grab the lock. */
+	for (;;) {
+		int old = atomic_read(&sem->count), new, res;
+
+		if (old < 0)
+			goto again;
+
+		new = old - RW_LOCK_BIAS;
+		res = cmpxchg(&sem->count.counter, old, new);
+		if (res != old)
+			continue;
+		if (old == RW_LOCK_BIAS)
+			break;
+		if (old >= 0)
+			return down_write_failed_biased(sem);
+		goto again;
+	}
 	return sem;
 }
Re: kernel BUG at page_alloc.c:75! / exit.c
On Thu, 5 Apr 2001 [EMAIL PROTECTED] wrote:
> "Albert D. Cahalan" wrote:
> >
> > > I'm running the 2.4.3 kernel and my system always (!) crashes when I try
> > > to generate the "Linux kernel poster" from lgp.linuxcare.com.au. After
> > > working for one hour, the kernel printed this message:
> >
> > I'd guess you have a heat problem. Check for dust, a slow fan,
> > an overclocked CPU, memory chips with airflow blocked by cables,
> > motherboard chips that are too hot to touch...

This is *not* a hardware problem. We're tracking something fishy in the vm code that is resulting in exactly the same BUG() tripping up on a number of boxes (4 and 8 way SMP).

-ben
Re: Writing on raw device with software RAID 0 is slow
On Thu, 1 Mar 2001, Stephen C. Tweedie wrote: > Yep. There shouldn't be any problem increasing the 64KB size, it's > only the lack of accounting for the pinned memory which stopped me > increasing it by default. Actually, how about making it a sysctl? That's probably the most reasonable approach for now since the optimal size depends on hardware. -ben
Re: Writing on raw device with software RAID 0 is slow
Hello all, On Thu, 1 Mar 2001, Stephen C. Tweedie wrote: > Raw IO is always synchronous: it gets flushed to disk before the write > returns. You don't get any write-behind with raw IO, so the smaller > the blocksize you write in, the slower things get. More importantly, the mainstream raw io code only writes in 64KB chunks that are unpipelined, which can lead to writes not hitting the drive before the sector passes under the rw head. You can work around this to some extent by issuing multiple writes (via threads, or the aio work I've done) at the expense of atomicity. Also, before we allow locking of arbitrary larger ios in main memory, we need bean counting to prevent the obvious DoSes. -ben
Re: Bug Report in pc_keyb
On Tue, 27 Feb 2001, Russell C. Hay wrote: > I'm not really sure who to send this too. Unfortunately, I don't really have > much information on this bug, and I will provide more when I'm around the box > in question. I have linux 2.2.16 running fine on the box. I am currently > trying to upgrade to linux 2.4.2. However, after compiling 2.4.2 and > installing in lilo and rebooting, I get the following error scrolling on > my screen I'm working on a patch for pc_keyb which should hopefully address this problem. I'll send a copy to you for testing as soon as it's ready. -ben
Re: [PATCH] make nfsroot accept server addresses from BOOTP root
On Tue, 20 Feb 2001, Tom Rini wrote: > Er, say that again? Right now, for bootp if you specify "sa=xxx.xxx.xxx.xxx" > Linux uses that as the host for the NFS server (which does have the side > effect of if TFTP server != NFS server, you don't boot). Are you saying > your patch takes "rp=xxx.xxx.xxx.xxx:/foo/root" ? Just curious, since I > don't know, whats the RFC say about this? Yeah, that's the problem I was trying to work around, mostly because the docs on dhcpd are sufficiently vague and obscure. Personally, I don't actually need tftp support, so I've just configured the system to now point at the NFS server. For anyone who cares, the last patch was wrong, this one is right. -ben

diff -ur v2.4.1-ac18/fs/nfs/nfsroot.c work/fs/nfs/nfsroot.c
--- v2.4.1-ac18/fs/nfs/nfsroot.c	Mon Sep 25 16:13:53 2000
+++ work/fs/nfs/nfsroot.c	Tue Feb 20 01:59:32 2001
@@ -226,6 +226,7 @@
 	if (name[0] && strcmp(name, "default")) {
 		strncpy(buf, name, NFS_MAXPATHLEN-1);
 		buf[NFS_MAXPATHLEN-1] = 0;
+		root_nfs_parse_addr(buf);
 	}
 }
Re: [PATCH] trylock for rw_semaphores: 2.4.1
On Mon, 19 Feb 2001, Brian J. Watson wrote: > Here is an x86 implementation of down_read_trylock() and down_write_trylock() > for read/write semaphores. As with down_trylock() for exclusive semaphores, they > don't block if they fail to get the lock. They just return 1, as opposed to 0 in > the success case. How about the following instead? Warning: compiled, not tested. -ben

diff -ur v2.4.2-pre3/include/asm-i386/semaphore.h trylock/include/asm-i386/semaphore.h
--- v2.4.2-pre3/include/asm-i386/semaphore.h	Mon Feb 12 16:04:59 2001
+++ trylock/include/asm-i386/semaphore.h	Mon Feb 19 23:50:03 2001
@@ -382,5 +382,32 @@
 	__up_write(sem);
 }
 
+/* returns 1 if it successfully obtained the semaphore for write */
+static inline int down_write_trylock(struct rw_semaphore *sem)
+{
+	int old = RW_LOCK_BIAS, new = 0;
+	int res;
+
+	res = cmpxchg(&sem->count.counter, old, new);
+	return (res == RW_LOCK_BIAS);
+}
+
+/* returns 1 if it successfully obtained the semaphore for read */
+static inline int down_read_trylock(struct rw_semaphore *sem)
+{
+	int ret = 1;
+	asm volatile(
+		LOCK "subl $1,%0\n\t"
+		"js 2f\n"
+		"1:\n"
+		".section .text.lock,\"ax\"\n"
+		"2:\t" LOCK "inc %0\n\t"
+		"subl %1,%1\n\t"
+		"jmp 1b\n"
+		".previous"
+		:"=m" (*(volatile int *)sem), "=r" (ret) : "1" (ret) : "memory");
+	return ret;
+}
+
 #endif
 #endif
[PATCH] make nfsroot accept server addresses from BOOTP root
Hello, Here's a handy little patch that makes the kernel parse out the ip address of the nfs server from the bootp root path. Otherwise it's impossible to boot the kernel without command line options on diskless workstations (I hate RPL). -ben

diff -ur v2.4.1-ac18/fs/nfs/nfsroot.c work/fs/nfs/nfsroot.c
--- v2.4.1-ac18/fs/nfs/nfsroot.c	Mon Sep 25 16:13:53 2000
+++ work/fs/nfs/nfsroot.c	Mon Feb 19 18:05:24 2001
@@ -224,8 +224,7 @@
 		}
 	}
 	if (name[0] && strcmp(name, "default")) {
-		strncpy(buf, name, NFS_MAXPATHLEN-1);
-		buf[NFS_MAXPATHLEN-1] = 0;
+		root_nfs_parse_addr(name);
 	}
 }
Re: x86 ptep_get_and_clear question
On Fri, 16 Feb 2001, Linus Torvalds wrote: > This is, actually, a problem that I suspect ends up being _very_ similar > to the zap_page_range() case. zap_page_range() needs to make sure that > everything has been updated by the time the page is actually free'd. While > filemap_sync() needs to make sure that everything has been updated before > the page is written out (or marked dirty - which obviously also guarantees > the ordering, and makes the problems look even more similar). Ah, I see what I was missing. So long as the tlb flush is in between the ptep_test_and_clear_dirty and the set_page_dirty, we're fine (ie the current code is good). If we really want to reduce the number of tlb flushes, yes, we can use the gather code and then just do the set_page_dirty after a tlb_flush_range. -ben
Re: x86 ptep_get_and_clear question
On Fri, 16 Feb 2001, Manfred Spraul wrote: > That leaves msync() - it currently does a flush_tlb_page() for every > single dirty page. > Is it possible to integrate that into the mmu gather code? > > tlb_transfer_dirty() in addition to tlb_clear_page()? Actually, in the filemap_sync case, the flush_tlb_page is redundant -- there's already a call to flush_tlb_range in filemap_sync after the dirty bits are cleared. None of the cpus we support document having a writeback tlb, and intel's docs explicitly state that they do not, as they state that the dirty bit is updated on the first write to dirty the pte. -ben
Re: x86 ptep_get_and_clear question
On Fri, 16 Feb 2001, Linus Torvalds wrote: > How do you expect to ever see this in practice? Sounds basically > impossible to test for this hardware race. The obvious "try to dirty as > fast as possible on one CPU while doing an atomic get-and-clear on the > other" thing is not valid - it's in fact quite likely to get into > lock-step because of page table cache movement synchronization. And as > such it could hide any race. That's not the behaviour I'm testing, but whether the CPU is doing

	lock
	pte = *ptep
	if (present && writable)
		pte |= dirty
	*ptep = pte
	unlock

versus

	lock
	pte = *ptep
	pte |= dirty
	*ptep = pte
	unlock

Which can be tested by means of getting the pte into the tlb then changing the pte without flushing and observing the results (page fault vs changed pte). I'm willing to bet that all cpus are doing the first version. -ben
Re: x86 ptep_get_and_clear question
On Fri, 16 Feb 2001, Jamie Lokier wrote: > It should be fast on known CPUs, correct on unknown ones, and much > simpler than "gather" code which may be completely unnecessary and > rather difficult to test. > > If anyone reports the message, _then_ we think about the problem some more. > > Ben, fancy writing a boot-time test? Sure, I'll whip one up this afternoon. -ben
Re: x86 ptep_get_and_clear question
On Thu, 15 Feb 2001, Kanoj Sarcar wrote: > No. All architectures do not have this problem. For example, if the > Linux "dirty" (not the pte dirty) bit is managed by software, a fault > will actually be taken when processor 2 tries to do the write. The fault > is solely to make sure that the Linux "dirty" bit can be tracked. As long > as the fault handler grabs the right locks before updating the Linux "dirty" > bit, things should be okay. This is the case with mips, for example. > > The problem with x86 is that we depend on automatic x86 dirty bit > update to manage the Linux "dirty" bit (they are the same!). So appropriate > locks are not grabbed. Will you please go off and prove that this "problem" exists on some x86 processor before continuing this rant? None of the PII, PIII, Athlon, K6-2 or 486s I checked exhibited the worrisome behaviour you're speculating about, plus it is logically consistent with the statements the manual does make about updating ptes; otherwise how could an smp os perform a reliable shootdown by doing an atomic bit clear on the present bit of a pte? -ben
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Tue, 6 Feb 2001, Linus Torvalds wrote: > There are currently no other alternatives in user space. You'd have to > create whole new interfaces for aio_read/write, and ways for the kernel to > inform user space that "now you can re-try submitting your IO". > > Could be done. But that's a big thing. Has been done. Still needs some work, but it works pretty well. As for throttling io, having ios submitted does not have to correspond to them being queued in the lower layers. The main issue with async io is limiting the amount of pinned memory for ios; if that's taken care of, I don't think it matters how many ios are in flight. > > An application which sets non blocking behavior and busy waits for a > > request (which seems to be your argument) is just stupid, of course. > > Tell me what else it could do at some point? You need something like > select() to wait on it. There are no such interfaces right now... > > (besides, latency would suck. I bet you're better off waiting for the > requests if they are all used up. It takes too long to get deep into the > kernel from user space, and you cannot use the exclusive waiters with its > anti-herd behaviour etc). Ah, but no. In fact for some things, the wait queue extensions I'm using will be more efficient as things like test_and_set_bit for obtaining a lock gets executed without waking up a task. > Simple rule: if you want to optimize concurrency and avoid waiting - use > several processes or threads instead. At which point you can get real work > done on multiple CPU's, instead of worrying about what happens when you > have to wait on the disk. There do exist plenty of cases where threads are not efficient enough. Just the stack overhead alone with 8000 threads makes things really suck. Event based io completion means that server processes don't need to have the overhead of select/poll. Add in NT style completion ports for waking up the right number of worker threads off of the completion queue, and [...] That said, I don't expect all devices to support async io. But given support for files, raw and sockets all the important cases are covered. The remainder can be supported via userspace helpers. -ben
Re: select() returning busy for regular files [was Re: [Kiobuf-io-devel]RFC: Kernel mechanism: Compound event wait]
On Thu, 8 Feb 2001, Pavel Machek wrote: > Hi! > > > > Its arguing against making a smart application block on the disk while its > > > able to use the CPU for other work. > > > > There are currently no other alternatives in user space. You'd have to > > create whole new interfaces for aio_read/write, and ways for the kernel to > > inform user space that "now you can re-try submitting your IO". > > Why is current select() interface not good enough? Think of random disk io scattered across the disk. Think about aio_write providing a means to perform zero copy io without needing to resort to playing mm tricks write protecting pages in the user's page tables. It's also a means for dealing efficiently with thousands of outstanding requests for network io. Using a select based interface is going to be an ugly kludge that still has all the overhead of select/poll. -ben
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Tue, 6 Feb 2001, Ingo Molnar wrote: > > On Tue, 6 Feb 2001, Ben LaHaise wrote: > > > This small correction is the crux of the problem: if it blocks, it > > takes away from the ability of the process to continue doing useful > > work. If it returns -EAGAIN, then that's okay, the io will be > > resubmitted later when other disk io has completed. But, it should be > > possible to continue servicing network requests or user io while disk > > io is underway. > > typical blocking point is waiting for page completion, not > __wait_request(). But, this is really not an issue, NR_REQUESTS can be > increased anytime. If NR_REQUESTS is large enough then think of it as the > 'absolute upper limit of doing IO', and think of the blocking as 'the > kernel pulling the brakes'. =) This is what I'm seeing: lots of processes waiting with wchan == __get_request_wait. With async io and a database flushing lots of io asynchronously spread out across the disk, the NR_REQUESTS limit is hit very quickly. > [overhead of 512-byte bhs in the raw IO code is an artificial problem of > the raw IO code.] True, and in the tests I've run, raw io is using 2KB blocks (same as the database). -ben
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Tue, 6 Feb 2001, Ingo Molnar wrote: > > On Tue, 6 Feb 2001, Ben LaHaise wrote: > > > > > You mentioned non-spindle base io devices in your last message. Take > > > > something like a big RAM disk. Now compare kiobuf base io to buffer > > > > head based io. Tell me which one is going to perform better. > > > > > > roughly equal performance when using 4K bhs. And a hell of a lot more > > > complex and volatile code in the kiobuf case. > > > > I'm willing to benchmark you on this. > > sure. Could you specify the actual workload, and desired test-setups? Sure. General parameters will be as follows (since I think we both have access to these machines): - 4xXeon, 4GB memory, 3GB to be used for the ramdisk (enough for a base install plus data files). - data to/from the ram block device must be copied within the ram block driver. - the filesystem used must be ext2. optimisations to ext2 for tweaks to the interface are permitted & encouraged. The main item I'm interested in is read (page cache cold)/synchronous write performance for blocks from 256 bytes to 16MB in powers of two, much like what I've done in testing the aio patches that shows where improvement in latency is needed. Including a few other items on disk like the timings of find/make -s dep/bonnie/dbench is probably worthwhile to show changes in throughput. Sound fair? -ben
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Tue, 6 Feb 2001, Linus Torvalds wrote:
>
> On Tue, 6 Feb 2001, Ben LaHaise wrote:
>
> > s/impossible/unpleasant/. ll_rw_blk blocks; it should be possible to
> > have a non blocking variant that does all of the setup in the
> > caller's context. Yes, I know that we can do it with a kernel
> > thread, but that isn't as clean and it significantly penalises small
> > ios (hint: databases issue *lots* of small random ios and a good
> > chunk of large ios).
>
> Ehh.. submit_bh() does everything you want. And, btw, ll_rw_block()
> does NOT block. Never has. Never will.
>
> (Small correction: it doesn't block on anything else than allocating a
> request structure if needed, and quite frankly, you have to block
> SOMETIME. You can't just try to throw stuff at the device faster than
> it can take it. Think of it as a "there can only be this many IO's in
> flight")

This small correction is the crux of the problem: if it blocks, it takes
away from the ability of the process to continue doing useful work. If
it returns -EAGAIN, then that's okay, the io will be resubmitted later
when other disk io has completed. But, it should be possible to continue
servicing network requests or user io while disk io is underway.

> If you want to use kiobuf's because you think they are asynchronous
> and bh's aren't, then somebody has been feeding you a lot of crap. The
> kiobuf PR department seems to have been working overtime on some FUD
> strategy.

I'm using bh's to refer to what is currently being done, and kiobuf when
talking about what could be done. It's probably the wrong thing to do,
and if bh's are extended to operate on arbitrary sized blocks then there
is no difference between the two.

> If you want to make a "raw disk device", you can do so TODAY with
> bh's. How? Don't use "bread()" (which allocates the backing store and
> creates the cache). Allocate a separate anonymous bh (or multiple),
> and set them up to point to whatever data source/sink you have, and
> let it rip. All asynchronous. All with nice completion callbacks. All
> with existing code, no kiobuf's in sight.
>
> What more do you think your kiobuf's should be able to do?

That's what my code is doing today. There are a ton of bh's setup for a
single kiobuf request that is issued. For something like a single 256kb
io, this is the difference between the batched io requests being passed
into submit_bh fitting in L1 cache and overflowing it. Resizable bh's
would certainly improve this.

		-ben
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Tue, 6 Feb 2001, Ingo Molnar wrote:
>
> On Tue, 6 Feb 2001, Ben LaHaise wrote:
>
> > 	- reduce the overhead in submitting block ios, especially for
> > 	  large ios. Look at the %CPU usages differences between 512
> > 	  byte blocks and 4KB blocks, this can be better.
>
> my system is already submitting 4KB bhs. If anyone's raw-IO setup
> submits 512 byte bhs thats a problem of the raw IO code ...
>
> > 	- make asynchronous io possible in the block layer. This is
> > 	  impossible with the current ll_rw_block scheme and io request
> > 	  plugging.
>
> why is it impossible?

s/impossible/unpleasant/. ll_rw_blk blocks; it should be possible to
have a non blocking variant that does all of the setup in the caller's
context. Yes, I know that we can do it with a kernel thread, but that
isn't as clean and it significantly penalises small ios (hint: databases
issue *lots* of small random ios and a good chunk of large ios).

> > You mentioned non-spindle base io devices in your last message. Take
> > something like a big RAM disk. Now compare kiobuf base io to buffer
> > head based io. Tell me which one is going to perform better.
>
> roughly equal performance when using 4K bhs. And a hell of a lot more
> complex and volatile code in the kiobuf case.

I'm willing to benchmark you on this.

		-ben
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Tue, 6 Feb 2001, Ingo Molnar wrote:
> If you are merging based on (device, offset) values, then that's
> lowlevel - and this is what we have been doing for years.
>
> If you are merging based on (inode, offset), then it has flaws like
> not being able to merge through a loopback or stacked filesystem.

I disagree. Loopback filesystems typically have their data contiguously
on disk and won't split up incoming requests any further. Here are the
points I'm trying to address:

	- reduce the overhead in submitting block ios, especially for
	  large ios. Look at the %CPU usages differences between 512
	  byte blocks and 4KB blocks, this can be better.
	- make asynchronous io possible in the block layer. This is
	  impossible with the current ll_rw_block scheme and io request
	  plugging.
	- provide a generic mechanism for reordering io requests for
	  devices which will benefit from this. Make it a library for
	  drivers to call into. IDE for example will probably make use
	  of it, but some high end devices do this on the controller.
	  This is the important point: Make it OPTIONAL.

You mentioned non-spindle base io devices in your last message. Take
something like a big RAM disk. Now compare kiobuf base io to buffer head
based io. Tell me which one is going to perform better.

		-ben
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Tue, 6 Feb 2001, Ingo Molnar wrote:
> - higher levels do not have the kind of state to eg. merge requests
> done by different users. The only chance for merging is often the
> lowest level, where we already know what disk, which sector.

That's what a readaround buffer is for, and I suspect that readaround
will give us a big performance boost.

> - merging is not even *required* for some devices - and chances are
> high that we'll get away from this inefficient and unreliable
> 'rotating array of disks' business of storing bulk data in this
> century. (solid state disks, holographic storage, whatever.)

Interesting that you've brought up this point, as it's an example

> i'm truly shocked that you and Stephen are both saying this.

Merging != sorting. Sorting of requests has to be carried out at the
lower layers, and the specific block device should be able to choose the
Right Thing To Do for the next item in a chain of sequential requests.

		-ben
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
On Tue, 6 Feb 2001, Jens Axboe wrote:
> Stephen already covered this point, the merging is not a problem
> to deal with for read-ahead. The underlying system can easily

I just wanted to make sure that was clear =)

> queue that in nice big chunks. Delayed allocation makes it
> easier to flush big chunks as well. I seem to recall the xfs people
> having problems with the lack of merging causing a performance hit
> on smaller I/O.

That's where readaround buffers come into play. If we have a fixed
number of readaround buffers that are used when small ios are issued,
they should provide a low overhead means of substantially improving
things like find (which reads many nearby inodes out of order but
sequentially). I need to implement this and get cache hit rates for
various workloads. ;-)

> Of course merging doesn't have to happen in ll_rw_blk.
>
> > As for io completion, can't we just issue separate requests for the
> > critical data and the readahead? That way for SCSI disks, the
> > important io should be finished while the readahead can continue.
> > Thoughts?
>
> Priorities?

Definitely. I'd like to be able to issue readaheads with a "don't bother
executing this request unless the cost is low" bit set. It might also be
helpful for heavy multiuser loads (or even a single user with multiple
processes) to ensure progress is made for others.

		-ben
Re: sync & async i/o
On Tue, 6 Feb 2001, Stephen C. Tweedie wrote:
> It's worth noting that it *is* defined unambiguously in the standards:
> fsync waits until all the data is hard on disk. Linux will obey that
> if it possibly can: only in cases where the hardware is actively lying
> about when the data has hit disk will the guarantee break down.

It is defined for writes that have begun before the fsync() started.
fsync has no bearing on aio writes until the async writes have
completed. If people are worried about the interaction between an fsync
in their app and an async write, they should be using synchronous writes
(which are perfectly usable with async io).

		-ben
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait
Hey folks,

On Tue, 6 Feb 2001, Stephen C. Tweedie wrote:
> The whole point of the post was that it is merging, not splitting,
> which is troublesome. How are you going to merge requests without
> having chains of scatter-gather entities each with their own
> completion callbacks?

Let me just emphasize what Stephen is pointing out: if requests are
properly merged at higher layers, then merging is neither required nor
desired. Traditionally, ext2 has not done merging because the underlying
system doesn't support it. This leads to rather convoluted code for
readahead which doesn't result in appropriately merged requests on
indirect block boundaries, and in fact leads to suboptimal performance.

The only case I see where merging of requests can improve things is when
dealing with lots of small files. But we already know that small files
need to be treated differently (e.g. tail merging). Besides, most of the
benefit of merging can be had by doing readaround for these small files.

As for io completion, can't we just issue separate requests for the
critical data and the readahead? That way for SCSI disks, the important
io should be finished while the readahead can continue. Thoughts?

		-ben
Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains
On Tue, 30 Jan 2001 [EMAIL PROTECTED] wrote:
> Comments, suggestions, advise, feedback solicited !
>
> If this seems like something that might (after some refinements) be a
> useful abstraction to have, then I need some help in straightening out
> the design. I am not very satisfied with it in its current form.

Here's my first bit of feedback from the point of "this is what my code
currently does and why".

The waitqueue extension below is a minimalist approach for providing
kernel support for fully asynchronous io. The basic idea is that a
function pointer is added to the wait queue structure that is called
during wake_up on a wait queue head. (The patch below also includes
support for exclusive lifo wakeups, which isn't crucial/perfect, but
just happened to be part of the code.) No data pointer is added to the
wait queue structure. Rather, users are expected to make use of it by
embedding the wait queue structure within their own data structure that
contains all needed info for running the state machine.

Here's a snippet of code which demonstrates a non blocking lock of a
page cache page:

struct worktodo {
	wait_queue_t		wait;
	struct tq_struct	tq;
	void			*data;
};

static void __wtd_lock_page_waiter(wait_queue_t *wait)
{
	struct worktodo *wtd = (struct worktodo *)wait;
	struct page *page = (struct page *)wtd->data;

	if (!TryLockPage(page)) {
		__remove_wait_queue(&page->wait, &wtd->wait);
		wtd_queue(wtd);
	} else {
		schedule_task(&run_disk_tq);
	}
}

void wtd_lock_page(struct worktodo *wtd, struct page *page)
{
	if (TryLockPage(page)) {
		int raced = 0;
		wtd->data = page;
		init_waitqueue_func_entry(&wtd->wait, __wtd_lock_page_waiter);
		add_wait_queue_cond(&page->wait, &wtd->wait,
				    TryLockPage(page), raced = 1);

		if (!raced) {
			run_task_queue(&tq_disk);
			return;
		}
	}
	wtd->tq.routine(wtd->tq.data);
}

The use of wakeup functions is also useful for waking a specific reader
or writer in the rw_sems, making semaphores avoid spurious wakeups, etc.
I suspect that chaining of events should be built on top of the
primitives, which should be kept as simple as possible. Comments?

		-ben

diff -urN v2.4.1pre10/include/linux/mm.h work/include/linux/mm.h
--- v2.4.1pre10/include/linux/mm.h	Fri Jan 26 19:03:05 2001
+++ work/include/linux/mm.h	Fri Jan 26 19:14:07 2001
@@ -198,10 +198,11 @@
  */
 #define UnlockPage(page)	do { \
 		smp_mb__before_clear_bit(); \
+		if (!test_bit(PG_locked, &(page)->flags)) { printk("last: %p\n", (page)->last_unlock); BUG(); } \
+		(page)->last_unlock = current_text_addr(); \
 		if (!test_and_clear_bit(PG_locked, &(page)->flags)) BUG(); \
 		smp_mb__after_clear_bit(); \
-		if (waitqueue_active(&(page)->wait)) \
-			wake_up(&(page)->wait); \
+		wake_up(&(page)->wait); \
 	} while (0)
 #define PageError(page)		test_bit(PG_error, &(page)->flags)
 #define SetPageError(page)	set_bit(PG_error, &(page)->flags)
diff -urN v2.4.1pre10/include/linux/sched.h work/include/linux/sched.h
--- v2.4.1pre10/include/linux/sched.h	Fri Jan 26 19:03:05 2001
+++ work/include/linux/sched.h	Fri Jan 26 19:14:07 2001
@@ -751,6 +751,7 @@
 extern void FASTCALL(add_wait_queue(wait_queue_head_t *q, wait_queue_t * wait));
 extern void FASTCALL(add_wait_queue_exclusive(wait_queue_head_t *q, wait_queue_t * wait));
+extern void FASTCALL(add_wait_queue_exclusive_lifo(wait_queue_head_t *q, wait_queue_t * wait));
 extern void FASTCALL(remove_wait_queue(wait_queue_head_t *q, wait_queue_t * wait));

 #define __wait_event(wq, condition)					\
diff -urN v2.4.1pre10/include/linux/wait.h work/include/linux/wait.h
--- v2.4.1pre10/include/linux/wait.h	Thu Jan  4 17:50:46 2001
+++ work/include/linux/wait.h	Fri Jan 26 19:14:06 2001
@@ -43,17 +43,20 @@
 } while (0)
 #endif

+typedef struct __wait_queue wait_queue_t;
+typedef void (*wait_queue_func_t)(wait_queue_t *wait);
+
 struct __wait_queue {
 	unsigned int flags;
 #define WQ_FLAG_EXCLUSIVE	0x01
 	struct task_struct * task;
 	struct list_head task_list;
+	wait_queue_func_t func;
 #if WAITQUEUE_DEBUG
 	long __magic;
 	long __waker;
 #endif
 };
-typedef struct __wait_queue wait_queue_t;

 /*
  * 'dual' spinlock architecture. Can be switched between spinlock_t and
@@ -110,7 +113,7 @@
 #endif

 #define __WAITQUEUE_INITIALIZER(name,task) \
-	{ 0x0, task, { NULL, NULL }
Re: oopses in test10-pre4 (was Re: [RFC] atomic pte updates and pae changes, take 3)
On Thu, 19 Oct 2000, Linus Torvalds wrote:
> I think you overlooked the fact that SHM mappings use the page cache,
> and it's ok if such pages are dirty and writable - they will get
> written out by the shm_swap() logic once there are no mappings active
> any more.
>
> I like the test per se, because I think it's correct for the "normal"
> case of a private page, but I really think those two BUG()'s are not
> bugs at all in general, and we should just remove the two tests.
>
> Comments? Anything I've overlooked?

The primary reason I added the BUG was that if this is valid, it means
that the pte has to be removed from the page tables first with
ptep_get_and_clear since it can be modified by the other CPU. Although
this may be safe for shm, I think it's very ugly and inconsistent. I'd
rather make the code transfer the dirty bit to the page struct so that
we *know* there is no information loss.

If the above is correct, then the following patch should do (untested).
Oh, I think I missed adding pte_same in the generic pgtable.h macros,
too. I'm willing to take a closer look if you think it's needed.

		-ben

diff -urN v2.4.0-test10-pre4/include/asm-generic/pgtable.h work-foo/include/asm-generic/pgtable.h
--- v2.4.0-test10-pre4/include/asm-generic/pgtable.h	Fri Oct 20 00:58:03 2000
+++ work-foo/include/asm-generic/pgtable.h	Fri Oct 20 01:42:24 2000
@@ -38,4 +38,6 @@
 		set_pte(ptep, pte_mkdirty(old_pte));
 }

+#define pte_same(left,right)	(pte_val(left) == pte_val(right))
+
 #endif /* _ASM_GENERIC_PGTABLE_H */
diff -urN v2.4.0-test10-pre4/mm/vmscan.c work-foo/mm/vmscan.c
--- v2.4.0-test10-pre4/mm/vmscan.c	Fri Oct 20 00:58:04 2000
+++ work-foo/mm/vmscan.c	Fri Oct 20 01:43:54 2000
@@ -87,6 +87,13 @@
 	if (TryLockPage(page))
 		goto out_failed;

+	/* From this point on, the odds are that we're going to
+	 * nuke this pte, so read and clear the pte.  This hook
+	 * is needed on CPUs which update the accessed and dirty
+	 * bits in hardware.
+	 */
+	pte = ptep_get_and_clear(page_table);
+
 	/*
 	 * Is the page already in the swap cache? If so, then
 	 * we can just drop our reference to it without doing
@@ -98,10 +105,6 @@
 	if (PageSwapCache(page)) {
 		entry.val = page->index;
 		swap_duplicate(entry);
-		if (pte_dirty(pte))
-			BUG();
-		if (pte_write(pte))
-			BUG();
 		set_pte(page_table, swp_entry_to_pte(entry));
 drop_pte:
 		UnlockPage(page);
@@ -111,13 +114,6 @@
 		page_cache_release(page);
 		goto out_failed;
 	}
-
-	/* From this point on, the odds are that we're going to
-	 * nuke this pte, so read and clear the pte.  This hook
-	 * is needed on CPUs which update the accessed and dirty
-	 * bits in hardware.
-	 */
-	pte = ptep_get_and_clear(page_table);

 	/*
 	 * Is it a clean page? Then it must be recoverable
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
[RFC] atomic pte updates and pae changes, take 2
Hey folks,

Below is take two of the patch making pte_clear use an atomic xchg in an
effort to avoid the loss of dirty bits.  PAE no longer uses cmpxchg8b for
updates; set_pte is two ordered long writes with a barrier between them.
The use of long long for ptes is also removed; gcc should generate better
code now.  A quick test with filemap_rw shows no measurable difference
between the PAE and non-PAE code, as well as no degradation from the
original non-atomic non-PAE code.  This code has been tested on a box
with 4GB (about 48MB is above the 4GB boundary) in PAE mode, and in
non-PAE mode on a couple of other boxes too.

Linus: comments?  Ingo: could you have a look over the code?

Thanks,

		-ben

diff -ur v2.4.0-test10-pre2/arch/i386/boot/install.sh work-10-2/arch/i386/boot/install.sh
--- v2.4.0-test10-pre2/arch/i386/boot/install.sh	Tue Jan  3 06:57:26 1995
+++ work-10-2/arch/i386/boot/install.sh	Fri Oct 13 17:19:47 2000
@@ -21,6 +21,7 @@
 
 # User may have a custom install script
 
+if [ -x ~/bin/installkernel ]; then exec ~/bin/installkernel "$@"; fi
 if [ -x /sbin/installkernel ]; then exec /sbin/installkernel "$@"; fi
 
 # Default install - same as make zlilo
diff -ur v2.4.0-test10-pre2/include/asm-i386/page.h work-10-2/include/asm-i386/page.h
--- v2.4.0-test10-pre2/include/asm-i386/page.h	Thu Oct 12 17:42:11 2000
+++ work-10-2/include/asm-i386/page.h	Fri Oct 13 17:36:02 2000
@@ -37,20 +37,20 @@
  * These are used to make use of C type-checking..
  */
 #if CONFIG_X86_PAE
-typedef struct { unsigned long long pte; } pte_t;
+typedef struct { unsigned long pte_low, pte_high; } pte_t;
 typedef struct { unsigned long long pmd; } pmd_t;
 typedef struct { unsigned long long pgd; } pgd_t;
-#define PTE_MASK	(~(unsigned long long) (PAGE_SIZE-1))
+#define pte_val(x)	((x).pte_low | ((unsigned long long)(x).pte_high << 32))
 #else
-typedef struct { unsigned long pte; } pte_t;
+typedef struct { unsigned long pte_low; } pte_t;
 typedef struct { unsigned long pmd; } pmd_t;
 typedef struct { unsigned long pgd; } pgd_t;
-#define PTE_MASK	PAGE_MASK
+#define pte_val(x)	((x).pte_low)
 #endif
+#define PTE_MASK	PAGE_MASK
 typedef struct { unsigned long pgprot; } pgprot_t;
 
-#define pte_val(x)	((x).pte)
 #define pmd_val(x)	((x).pmd)
 #define pgd_val(x)	((x).pgd)
 #define pgprot_val(x)	((x).pgprot)
diff -ur v2.4.0-test10-pre2/include/asm-i386/pgtable-2level.h work-10-2/include/asm-i386/pgtable-2level.h
--- v2.4.0-test10-pre2/include/asm-i386/pgtable-2level.h	Fri Dec  3 14:12:23 1999
+++ work-10-2/include/asm-i386/pgtable-2level.h	Fri Oct 13 17:41:14 2000
@@ -18,7 +18,7 @@
 #define PTRS_PER_PTE	1024
 
 #define pte_ERROR(e) \
-	printk("%s:%d: bad pte %08lx.\n", __FILE__, __LINE__, pte_val(e))
+	printk("%s:%d: bad pte %08lx.\n", __FILE__, __LINE__, (e).pte_low)
 #define pmd_ERROR(e) \
 	printk("%s:%d: bad pmd %08lx.\n", __FILE__, __LINE__, pmd_val(e))
 #define pgd_ERROR(e) \
@@ -54,5 +54,12 @@
 {
 	return (pmd_t *) dir;
 }
+
+#define __HAVE_ARCH_pte_get_and_clear
+#define pte_get_and_clear(xp)	__pte(xchg(&(xp)->pte_low, 0))
+#define pte_same(a, b)		((a).pte_low == (b).pte_low)
+#define pte_page(x)		(mem_map+((unsigned long)(((x).pte_low >> PAGE_SHIFT))))
+#define pte_none(x)		(!(x).pte_low)
+#define __mk_pte(page_nr,pgprot) __pte(((page_nr) << PAGE_SHIFT) | pgprot_val(pgprot))
 
 #endif /* _I386_PGTABLE_2LEVEL_H */
diff -ur v2.4.0-test10-pre2/include/asm-i386/pgtable-3level.h work-10-2/include/asm-i386/pgtable-3level.h
--- v2.4.0-test10-pre2/include/asm-i386/pgtable-3level.h	Mon Dec  6 19:19:13 1999
+++ work-10-2/include/asm-i386/pgtable-3level.h	Fri Oct 13 17:39:53 2000
@@ -27,7 +27,7 @@
 #define PTRS_PER_PTE	512
 
 #define pte_ERROR(e) \
-	printk("%s:%d: bad pte %p(%016Lx).\n", __FILE__, __LINE__, &(e), pte_val(e))
+	printk("%s:%d: bad pte %p(%08lx%08lx).\n", __FILE__, __LINE__, &(e), (e).pte_high, (e).pte_low)
 #define pmd_ERROR(e) \
 	printk("%s:%d: bad pmd %p(%016Lx).\n", __FILE__, __LINE__, &(e), pmd_val(e))
 #define pgd_ERROR(e) \
@@ -45,8 +45,12 @@
 extern inline int pgd_bad(pgd_t pgd)		{ return 0; }
 extern inline int pgd_present(pgd_t pgd)	{ return !pgd_none(pgd); }
 
-#define set_pte(pteptr,pteval) \
-	set_64bit((unsigned long long *)(pteptr),pte_val(pteval))
+extern inline void set_pte(pte_t *ptep, pte_t pte)
+{
+	ptep->pte_high = pte.pte_high;
+	barrier();
+	ptep->pte_low = pte.pte_low;
+}
 #define set_pmd(pmdptr,pmdval) \
 	set_64bit((unsigned long long *)(pmdptr),pmd_val(pmdval))
 #define set_pgd(pgdptr,pgdval) \
@@ -75,5 +79,35 @@
 /* Find an entry in the second-level page table.. */
 #define pmd_offset(dir, address) ((pmd_t *) pgd_page(*(dir)) + \
 			__pmd_offset(address))
+
+#define __HAVE_ARCH_pte_get_and_clear
+extern inline pte_t pte_get_and_clear(pte_t *ptep)
+{
+	pte_t res;
+
[RFC] atomic pte updates for x86 smp
On Wed, 11 Oct 2000 [EMAIL PROTECTED] wrote:
> > 2. Capable Of Corrupting Your FS/data
> >
> >    * Non-atomic page-map operations can cause loss of dirty bit on
> >      pages (sct, alan)
> >
> > Is anybody looking into fixing this bug ?
>
> According to sct (who's sitting next to me in my hotel room at ALS) Ben
> LaHaise has a bugfix for this, but it hasn't been merged.

Here's an updated version of the patch that doesn't do the funky
RISC-like dirty bit updates.  It doesn't incur the additional overhead of
page faults on dirty, which actually happens a lot on SHM attaches
(during Oracle runs this is quite noticeable due to their use of hundreds
of MB of SHM).

Ted: note that there are a couple of other SMP races that still need
fixing; list them under the VM threading bug under SMP (different bug).

		-ben

# v2.4.0-test10-1-smp_pte_fix.diff
diff -ur v2.4.0-test10-pre1/include/asm-i386/pgtable-2level.h work-v2.4.0-test10-pre1/include/asm-i386/pgtable-2level.h
--- v2.4.0-test10-pre1/include/asm-i386/pgtable-2level.h	Fri Dec  3 14:12:23 1999
+++ work-v2.4.0-test10-pre1/include/asm-i386/pgtable-2level.h	Wed Oct 11 16:08:08 2000
@@ -55,4 +55,7 @@
 	return (pmd_t *) dir;
 }
 
+#define __HAVE_ARCH_pte_xchg_clear
+#define pte_xchg_clear(xp)	__pte(xchg(&(xp)->pte, 0))
+
 #endif /* _I386_PGTABLE_2LEVEL_H */
diff -ur v2.4.0-test10-pre1/include/asm-i386/pgtable-3level.h work-v2.4.0-test10-pre1/include/asm-i386/pgtable-3level.h
--- v2.4.0-test10-pre1/include/asm-i386/pgtable-3level.h	Mon Dec  6 19:19:13 1999
+++ work-v2.4.0-test10-pre1/include/asm-i386/pgtable-3level.h	Wed Oct 11 16:14:40 2000
@@ -76,4 +76,17 @@
 #define pmd_offset(dir, address) ((pmd_t *) pgd_page(*(dir)) + \
 			__pmd_offset(address))
 
+#define __HAVE_ARCH_pte_xchg_clear
+extern inline pte_t pte_xchg_clear(pte_t *ptep)
+{
+	long long res = pte_val(*ptep);
+	__asm__ __volatile__ (
+		"1:	cmpxchg8b (%1)\n"
+		"	jnz 1b"
+		: "=A" (res)
+		: "D" (ptep), "0" (res), "b" (0), "c" (0)
+		: "memory");
+	return (pte_t){ res };
+}
+
 #endif /* _I386_PGTABLE_3LEVEL_H */
diff -ur v2.4.0-test10-pre1/include/asm-i386/pgtable.h work-v2.4.0-test10-pre1/include/asm-i386/pgtable.h
--- v2.4.0-test10-pre1/include/asm-i386/pgtable.h	Mon Oct  2 14:06:43 2000
+++ work-v2.4.0-test10-pre1/include/asm-i386/pgtable.h	Wed Oct 11 17:44:04 2000
@@ -17,6 +17,10 @@
 #include <asm/fixmap.h>
 #include <linux/threads.h>
 
+#ifndef _I386_BITOPS_H
+#include <asm/bitops.h>
+#endif
+
 extern pgd_t swapper_pg_dir[1024];
 extern void paging_init(void);
@@ -145,6 +149,16 @@
  * the page directory entry points directly to a 4MB-aligned block of
  * memory.
  */
+#define _PAGE_BIT_PRESENT	0
+#define _PAGE_BIT_RW		1
+#define _PAGE_BIT_USER		2
+#define _PAGE_BIT_PWT		3
+#define _PAGE_BIT_PCD		4
+#define _PAGE_BIT_ACCESSED	5
+#define _PAGE_BIT_DIRTY		6
+#define _PAGE_BIT_PSE		7	/* 4 MB (or 2MB) page, Pentium+, if present.. */
+#define _PAGE_BIT_GLOBAL	8	/* Global TLB entry PPro+ */
+
 #define _PAGE_PRESENT	0x001
 #define _PAGE_RW	0x002
 #define _PAGE_USER	0x004
@@ -234,6 +248,24 @@
 #define pte_none(x)	(!pte_val(x))
 #define pte_present(x)	(pte_val(x) & (_PAGE_PRESENT | _PAGE_PROTNONE))
 #define pte_clear(xp)	do { set_pte(xp, __pte(0)); } while (0)
+
+#define __HAVE_ARCH_pte_test_and_clear_dirty
+static inline int pte_test_and_clear_dirty(pte_t *page_table, pte_t pte)
+{
+	return test_and_clear_bit(_PAGE_BIT_DIRTY, page_table);
+}
+
+#define __HAVE_ARCH_pte_test_and_clear_young
+static inline int pte_test_and_clear_young(pte_t *page_table, pte_t pte)
+{
+	return test_and_clear_bit(_PAGE_BIT_ACCESSED, page_table);
+}
+
+#define __HAVE_ARCH_atomic_pte_wrprotect
+static inline void atomic_pte_wrprotect(pte_t *page_table, pte_t old_pte)
+{
+	clear_bit(_PAGE_BIT_RW, page_table);
+}
 
 #define pmd_none(x)	(!pmd_val(x))
 #define pmd_present(x)	(pmd_val(x) & _PAGE_PRESENT)
diff -ur v2.4.0-test10-pre1/include/linux/mm.h work-v2.4.0-test10-pre1/include/linux/mm.h
--- v2.4.0-test10-pre1/include/linux/mm.h	Tue Oct  3 13:40:38 2000
+++ work-v2.4.0-test10-pre1/include/linux/mm.h	Wed Oct 11 17:44:38 2000
@@ -532,6 +532,42 @@
 #define vmlist_modify_lock(mm)		vmlist_access_lock(mm)
 #define vmlist_modify_unlock(mm)	vmlist_access_unlock(mm)
 
+#ifndef __HAVE_ARCH_pte_test_and_clear_young
+static inline int pte_test_and_clear_young(pte_t *page_table, pte_t pte)
+{
+	if (!pte_young(pte))
+		return 0;
+	set_pte(page_table, pte_mkold(pte));
+	return 1;
+}
+#endif
+
+#ifndef __HAVE_ARCH_pte_test_and_clear_dirty
+static inline int pte_test_and_clear_dirty(pte_t *page_table, pte_t pte)
+{
+	if (!pte_dirty(pte))
+
Re: test10-pre1 problems on 4-way SuperServer8050
On Wed, 11 Oct 2000, Tigran Aivazian wrote:
> it works fine then. Kernel compiles in 68 seconds as it should. Shall I
> keep incrementing mem= to see what happens next...

I suspect fixing the mtrrs on the machine will fix this problem, as a
38-40 times slowdown on a machine that isn't swapping is most likely a
lack of memory caching (as Rik pointed out, 38-40 times is right on the
nose for the difference in speed between the cache and main memory).

		-ben