Don't feed the troll [offtopic] Re: Why Plan 9 C compilers don't have asm("")

2001-07-04 Thread Ben LaHaise

Hey folks,

Just a quick reminder: don't feed the troll.  He's very hungry.

-ben

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [RFD w/info-PATCH] device arguments from lookup, partition code in userspace

2001-05-19 Thread Ben LaHaise

On Sat, 19 May 2001, Alexander Viro wrote:

> On Sat, 19 May 2001, Ben LaHaise wrote:
>
> > It's not done yet, but similar techniques would be applied.  I envision
> > that a raid device would support operations such as
> > open("/dev/md0/slot=5,hot-add=/dev/sda")
>
> Think for a moment and you'll see why it's not only ugly as hell, but simply
> won't work.

Yeah, I shouldn't be replying to email anymore in my bleary-eyed state.
=) Of course slash separated data doesn't work, so it would have to be
hot-add=<filedescriptor> or somesuch.  Gah, that's why the options are all
parsed from a single lookup name anyways...

-ben (who's going to sleep)


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [RFD w/info-PATCH] device arguments from lookup, partition code in userspace

2001-05-19 Thread Ben LaHaise

On Sat, 19 May 2001, Andrew Clausen wrote:

> (1) these issues are independent.  The partition parsing could
> be done in user space, today, by blkpg, if I read the code correctly
> ;-)  (there's an ioctl for [un]registering partitions)  Never
> tried it though ;-)

I tried to imply that through the use of the word component.  Yes,
they're independent, but the code is pretty meaningless without a
demonstration of how it's used. ;-)

> (2) what about bootstrapping?  how do you find the root device?
> Do you do "root=/dev/hda/offset=63,limit=1235823"?  Bit nasty.

root= becomes a parameter to mount, and initrd becomes mandatory.  I'd be
all for including all of the bits needed to build the initrd boot code in
the tree, but it's completely in the air.

> (3) how does this work for LVM and RAID?

It's not done yet, but similar techniques would be applied.  I envision
that a raid device would support operations such as
open("/dev/md0/slot=5,hot-add=/dev/sda")

> (4) <propaganda>libparted already has a fair bit of partition
> scanning code, etc.  Should be trivial to hack it up... That said,
> it should be split up into .so modules... 200k is a bit heavy just
> for mounting partitions (most of the bulk is file system stuff).
> </propaganda>

Good.  Less work to do.

> (5) what happens to /etc/fstab?  User-space ([u]mount?) translates
> /dev/hda1 into /dev/hda/offset=63,limit=1235823, and back?

I'd just create a symlink to /dev/hda1 at mount time, although that really
isn't what the user wants to see: the label or uuid is more useful.

-ben

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



[RFD w/info-PATCH] device arguments from lookup, partition code in userspace

2001-05-19 Thread Ben LaHaise

Hey folks,

The work-in-progress patch for-demonstration-purposes-only below consists
of 3 major components, and is meant to start discussion about the future
direction of device naming and its interaction with the block layer.  The main
motivations here are the wasting of minor numbers for partitions, and the
duplication of code between user and kernel space in areas such as
partition detection, uuid location, lvm setup, mount by label, journal
replay, and so on...

1. Generic lookup method and argument parsing (fs/lookupargs.c)

This code implements a lookup function which is for demonstration
purposes used in fs/block_dev.c.  The general idea is to pass
additional parameters to device drivers on open via a comma
separated list of options following the device's name.  Sample
uses:

/dev/sda/raw            -> open sda in raw mode.
/dev/sda/limit=102400   -> open sda with a limit of 100K
/dev/sda/offset=1024,limit=2048
-> open a device that gives a view of sda at an
   offset of 1KB to 2KB

The arguments are defined in a table (fs/block_dev.c:660), which
defines the name and type of argument to parse.  This table is
used at lookup time to determine if an option name is valid
(resulting in a positive dentry) or invalid.  Potential uses for
this are numerous: opening a control channel to a device,
specifying a graphics mode for a framebuffer on open, replacing
ioctls,  lots of options.  Please separate comments on this
portion from the other parts of the patch.
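For a rough idea of what such a table might look like, here is a tiny
sketch in C.  The names and constants below are made up for illustration;
they are not the actual structures from fs/lookupargs.c in the patch:

    /* Hypothetical sketch only -- not the real fs/lookupargs.c definitions. */
    struct lookup_arg {
            const char *name;   /* option name, e.g. "offset"                */
            int         type;   /* how to parse it: flag, 64-bit number, ... */
    };

    static struct lookup_arg blkdev_args[] = {
            { "raw",    LOOKUP_ARG_FLAG },  /* boolean option */
            { "offset", LOOKUP_ARG_U64  },  /* numeric option */
            { "limit",  LOOKUP_ARG_U64  },
            { NULL,     0 }                 /* terminator */
    };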

2. Restricted block device (drivers/block/blkrestrict.c)

This is a quick-n-dirty implementation of a simple md-like block
device that adds an offset to sector requests and limits the
maximum offset on the device.  The idea here is to replace the
special case minor numbers used for the partitioning code with
a generic runtime allocated translation node.  The idea will work
best once its data can be stored in a kdev_t structure.  The API
for use is simple:

kdev_t restrict_create_dev(kdev_t dev,
unsigned long long offset,
unsigned long long limit)

The associated cleanup of the startup code is not addressed here.
Comments on this part (I know the implementation is ugly, talk
about the ideas please)?
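As a usage illustration only (not taken from the patch): carving what is
now /dev/hda5 out of the whole disk might look roughly like this, where
MKDEV/IDE0_MAJOR are the usual 2.4 macros and the units of offset/limit
are an assumption on my part:

    kdev_t whole = MKDEV(IDE0_MAJOR, 0);              /* /dev/hda */
    kdev_t part  = restrict_create_dev(whole,
                                       512000ULL,     /* offset */
                                       1024000ULL);   /* limit  */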

3. Userspace partition code proposal

Given the above two bits, here's a brief explanation of a
proposal to move management of the partitioning scheme into
userspace, along with portions of raid startup, lvm, uuid and
mount by label code needed for mounting the root filesystem.

Consider that the device node currently known as /dev/hda5 can
also be viewed as /dev/hda at offset 512000 with a limit of 10GB.
With the extensions in fs/block_dev.c, you could replace /dev/hda5
with /dev/hda/offset=512000,limit=1024000.  Now, by putting
the partition parsing code into a libpart and binding mount to a
libpart, the root filesystem mounting code can be run out of an
initrd image.  The use of mount gives us the ability to mount
filesystems by UUID, by label or other exotic schemes without
having to add any additional code to the kernel.
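As a purely hypothetical example (reusing the numbers above), the initrd's
boot script then does little more than

    mount /dev/hda/offset=512000,limit=1024000 /mnt/root

with libpart (or a label/UUID lookup) having computed the offset/limit
pair entirely in userspace.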

I'm going to stop writing this now.  I need sleep...

Folks, please let me know your opinions on the ideas presented herein, and
do attempt to keep the bits of code that are useful.  Cheers,

-ben

[23:34:07] <viro> bcrl: you are sick.
[23:41:13] <viro> bcrl: you _are_ sick.
[23:43:24] <viro> bcrl: you are _fscking_ sick.

here starts v2.4.5-pre3_bdev_naming-A0.diff
diff -urN kernels/2.4/v2.4.5-pre3/Makefile bdev_naming/Makefile
--- kernels/2.4/v2.4.5-pre3/Makefile    Thu May 17 18:09:42 2001
+++ bdev_naming/Makefile    Sat May 19 01:33:39 2001
@@ -1,7 +1,7 @@
 VERSION = 2
 PATCHLEVEL = 4
 SUBLEVEL = 5
-EXTRAVERSION =-pre3
+EXTRAVERSION =-pre3-sick-test

 KERNELRELEASE=$(VERSION).$(PATCHLEVEL).$(SUBLEVEL)$(EXTRAVERSION)

diff -urN kernels/2.4/v2.4.5-pre3/arch/i386/boot/install.sh 
bdev_naming/arch/i386/boot/install.sh
--- kernels/2.4/v2.4.5-pre3/arch/i386/boot/install.sh   Tue Jan  3 06:57:26 1995
+++ bdev_naming/arch/i386/boot/install.sh   Fri May 18 20:24:36 2001
@@ -21,6 +21,7 @@

 # User may have a custom install script

+if [ -x ~/bin/installkernel ]; then exec ~/bin/installkernel "$@"; fi
 if [ -x /sbin/installkernel ]; then exec /sbin/installkernel "$@"; fi

 # Default install - same as make zlilo
diff -urN kernels/2.4/v2.4.5-pre3/drivers/block/Makefile 
bdev_naming/drivers/block/Makefile
--- kernels/2.4/v2.4.5-pre3/drivers/block/Makefile  Fri Dec 29 17:07:21 2000
+++ bdev_naming/drivers/block/Makefile  Sat May 19 00:29:08 2001
@@ -12,7 +12,7 @@

 export-objs 



[PATCH] v2.4.4-ac9 highmem deadlock

2001-05-14 Thread Ben LaHaise

Hey folks,

The patch below consists of 3 separate fixes for helping remove the
deadlocks present in current kernels with respect to highmem systems.
Each fix is to a separate file, so please accept/reject as such.

The first patch adding __GFP_FAIL to GFP_BUFFER is needed to fix a
livelock caused by the kswapd -> swap out -> create_page_buffers ->
GFP_BUFFER allocation -> waits for kswapd to wake up and free memory code
path.

Second patch (to highmem.c) silences the critical shortage messages that
make viewing any console output impossible, as well as managing to slow
the machine down to a crawl when running with a serial console.

The third patch (to vmscan.c) adds a SCHED_YIELD to the page launder code
before starting a launder loop.  This one needs discussion, but what I'm
attempting to accomplish is that when kswapd is cycling through
page_launder repeatedly, bdflush or some other task submitting io via the
bounce buffers needs to be given a chance to run and complete their io
again.  Failure to do so limits the rate of progress under extremely high
load when the vast majority of io will be transferred via bounce buffers.

Comments?

-ben

start of v2.4.4-ac9-highmem-1.diff
diff -ur v2.4.4-ac9/include/linux/mm.h work/include/linux/mm.h
--- v2.4.4-ac9/include/linux/mm.h   Mon May 14 15:22:17 2001
+++ work/include/linux/mm.h Mon May 14 18:33:21 2001
@@ -528,7 +528,7 @@


 #define GFP_BOUNCE (__GFP_HIGH | __GFP_FAIL)
-#define GFP_BUFFER (__GFP_HIGH | __GFP_WAIT)
+#define GFP_BUFFER (__GFP_HIGH | __GFP_FAIL | __GFP_WAIT)
 #define GFP_ATOMIC (__GFP_HIGH)
 #define GFP_USER   ( __GFP_WAIT | __GFP_IO)
 #define GFP_HIGHUSER   ( __GFP_WAIT | __GFP_IO | __GFP_HIGHMEM)
diff -ur v2.4.4-ac9/mm/highmem.c work/mm/highmem.c
--- v2.4.4-ac9/mm/highmem.c Mon May 14 14:57:00 2001
+++ work/mm/highmem.c   Mon May 14 15:39:03 2001
@@ -279,6 +279,7 @@

 struct page *alloc_bounce_page (void)
 {
+   static int buffer_warning;
struct list_head *tmp;
struct page *page;

@@ -308,7 +309,8 @@
if (page)
return page;

-   printk(KERN_WARNING "mm: critical shortage of bounce buffers.\n");
+   if (!buffer_warning++)
+   printk(KERN_WARNING "mm: critical shortage of bounce buffers.\n");


current->policy |= SCHED_YIELD;
@@ -319,6 +321,7 @@

 struct buffer_head *alloc_bounce_bh (void)
 {
+   static int bh_warning;
struct list_head *tmp;
struct buffer_head *bh;

@@ -348,7 +351,8 @@
if (bh)
return bh;

-   printk(KERN_WARNING "mm: critical shortage of bounce bh's.\n");
+   if (!bh_warning++)
+   printk(KERN_WARNING "mm: critical shortage of bounce bh's.\n");


current->policy |= SCHED_YIELD;
diff -ur v2.4.4-ac9/mm/vmscan.c work/mm/vmscan.c
--- v2.4.4-ac9/mm/vmscan.c  Mon May 14 14:57:00 2001
+++ work/mm/vmscan.c    Mon May 14 16:43:05 2001
@@ -636,6 +636,12 @@
 */
shortage = free_shortage();
if (can_get_io_locks && !launder_loop && shortage) {
+   if (gfp_mask & __GFP_WAIT) {
+   __set_current_state(TASK_RUNNING);
+   current->policy |= SCHED_YIELD;
+   schedule();
+   }
+
launder_loop = 1;

/*

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [PATCH] zero^H^H^H^Hsingle copy pipe

2001-05-07 Thread Ben LaHaise

On Mon, 7 May 2001, Manfred Spraul wrote:

> The main problem is that map_user_kiobuf() locks pages into memory.
> It's a bad idea for pipes. Either we must severely limit the maximum
> amount of data in the direct-copy buffers, or we must add a swap file
> based backing store. If I understand the BSD direct-pipe code correctly
> it has a swap file based backing store. I think that's insane. And
> limiting the direct copy buffers to a few kB defeats the purpose of
> direct copy.

Okay, how about the following instead (I'm thinking of generic code that
we can reuse): continue to queue the mm, address, length tuple (I've
actually got use for this too), and then use a map_mm_kiobuf (which is
map_user_kiobuf but with an mm parameter) for the portion of the buffer
that's currently being copied.  That improves code reuse and gives us a
few primitives that are quite useful elsewhere.

> And the current pipe_{read,write} are a total mess with nested loops and
> gotos. It's possible to create wakeup storms. I rewrote them as well ;-)

Cool! =)

-ben

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [PATCH] zero^H^H^H^Hsingle copy pipe

2001-05-07 Thread Ben LaHaise

Manfred Spraul wrote:
> 
> I'm now running with the patch for several hours, no problems.
> 
> bw_pipe transfer rate has nearly doubled and the number of context
> switches for one bw_pipe run is down from 71500 to 5500.
> 
> Please test it.

Any particular reason for not using davem's single copy kiobuf based
code?

-ben
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: rw_semaphores

2001-04-09 Thread Ben LaHaise

On Sun, 8 Apr 2001, Linus Torvalds wrote:

>
> The "down_writer_failed()" case was wrong:

Which is exactly the same problem in the original code.  How about the
following patch against the original code?  I hadn't sent it yet as the
test code isn't finished (hence, it's untested), but given that Andrew is
going full steam ahead, people might as well give this a try.

-ben

rwsem-2.4.4-pre1-A0.diff
diff -ur v2.4.4-pre1/arch/i386/kernel/semaphore.c 
work-2.4.4-pre1/arch/i386/kernel/semaphore.c
--- v2.4.4-pre1/arch/i386/kernel/semaphore.c    Sat Nov 18 20:31:25 2000
+++ work-2.4.4-pre1/arch/i386/kernel/semaphore.c    Mon Apr  9 09:47:02 2001
@@ -269,10 +269,9 @@
ret

 2: call down_write_failed
-   " LOCK "subl $" RW_LOCK_BIAS_STR ",(%eax)
-   jz  1b
-   jnc 2b
-   jmp 3b
+   popl%ecx
+   popl%edx
+   ret
 "
 );

@@ -366,19 +365,56 @@
struct task_struct *tsk = current;
DECLARE_WAITQUEUE(wait, tsk);

-   __up_write(sem);/* this takes care of granting the lock */
+   /* Originally we called __up_write here, but that
+* doesn't work: the lock add operation could result
+* in us failing to detect a bias grant.  Instead,
+* we'll use a compare and exchange to get the lock
+* from a known state: either <= -BIAS while another
+* waiter is still around, or > -BIAS if we were given
+* the lock's bias.
+*/

+   do {
+   int old = atomic_read(&sem->count), new, res;
+   if (old > -RW_LOCK_BIAS)
+   return down_write_failed_biased(sem);
+   new = old + RW_LOCK_BIAS;
+   res = cmpxchg(&sem->count.counter, old, new);
+   } while (res != old);
+
+again:
+   /* We are now removed from the lock.  Wait for all other
+* waiting writers to go away.
+*/
add_wait_queue_exclusive(&sem->wait, &wait);

while (atomic_read(&sem->count) < 0) {
set_task_state(tsk, TASK_UNINTERRUPTIBLE);
-   if (atomic_read(&sem->count) >= 0)
+   if (atomic_read(&sem->count) >= 0) {
break;  /* we must attempt to acquire or bias the lock */
+   }
+
schedule();
}

remove_wait_queue(&sem->wait, &wait);
tsk->state = TASK_RUNNING;
+
+   /* Okay, try to grab the lock. */
+   for (;;) {
+   int old = atomic_read(&sem->count), new, res;
+   if (old < 0)
+   goto again;
+   new = old - RW_LOCK_BIAS;
+   res = cmpxchg(&sem->count.counter, old, new);
+   if (res != old)
+   continue;
+   if (old == RW_LOCK_BIAS)
+   break;
+   if (old >= 0)
+   return down_write_failed_biased(sem);
+   goto again;
+   }

return sem;
 }

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/




Re: kernel BUG at page_alloc.c:75! / exit.c

2001-04-05 Thread Ben LaHaise

On Thu, 5 Apr 2001 [EMAIL PROTECTED] wrote:

> "Albert D. Cahalan" wrote:
> >
> > > I'm running the 2.4.3 kernel and my system always (!) crashes when I try
> > > to generate the "Linux kernel poster" from lgp.linuxcare.com.au. After
> > > working for one hour, the kernel printed this message:
> >
> > I'd guess you have a heat problem. Check for dust, a slow fan,
> > an overclocked CPU, memory chips with airflow blocked by cables,
> > motherboard chips that are too hot to touch...

This is *not* a hardware problem.  We're tracking something fishy in the
vm code that is resulting in exactly the same BUG() tripping up on a
number of boxes (4 and 8 way SMP).

-ben

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Writing on raw device with software RAID 0 is slow

2001-03-01 Thread Ben LaHaise

On Thu, 1 Mar 2001, Stephen C. Tweedie wrote:

> Yep.  There shouldn't be any problem increasing the 64KB size, it's
> only the lack of accounting for the pinned memory which stopped me
> increasing it by default.

Actually, how about making it a sysctl?  That's probably the most
reasonable approach for now since the optimal size depends on hardware.

-ben

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Writing on raw device with software RAID 0 is slow

2001-03-01 Thread Ben LaHaise

Hello all,

On Thu, 1 Mar 2001, Stephen C. Tweedie wrote:

> Raw IO is always synchronous: it gets flushed to disk before the write
> returns.  You don't get any write-behind with raw IO, so the smaller
> the blocksize you write in, the slower things get.

More importantly, the mainstream raw io code only writes in 64KB chunks
that are unpipelined, which can lead to writes not hitting the drive
before the sector passes under the rw head.  You can work around this to
some extent by issuing multiple writes (via threads, or the aio work I've
done) at the expense of atomicity.  Also, before we allow locking of
arbitrary larger ios in main memory, we need bean counting to prevent the
obvious DoSes.

-ben

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Bug Report in pc_keyb

2001-02-27 Thread Ben LaHaise

On Tue, 27 Feb 2001, Russell C. Hay wrote:

> I'm not really sure who to send this too.  Unfortunately, I don't really have
> much information on this bug, and I will provide more when I'm around the box
> in question.  I have linux 2.2.16 running fine on the box.  I am currently
> trying to upgrade to linux 2.4.2.  However, after compiling 2.4.2 and
> installing in lilo and rebooting, I get the following error scrolling on
> my screen

I'm working on a patch for pc_keyb which should hopefully address this
problem.  I'll send a copy to you for testing as soon as it's ready.

-ben

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [PATCH] make nfsroot accept server addresses from BOOTP root

2001-02-20 Thread Ben LaHaise

On Tue, 20 Feb 2001, Tom Rini wrote:

> Er, say that again?  Right now, for bootp if you specify "sa=xxx.xxx.xxx.xxx"
> Linux uses that as the host for the NFS server (which does have the side
> effect of if TFTP server != NFS server, you don't boot).  Are you saying
> your patch takes "rp=xxx.xxx.xxx.xxx:/foo/root" ?  Just curious, since I
> don't know, whats the RFC say about this?

Yeah, that's the problem I was trying to work around, mostly because the
docs on dhcpd are sufficiently vague and obscure.  Personally, I don't
actually need tftp support, so I've just configured the system to now
point at the NFS server.  For anyone who cares, the last patch was wrong,
this one is right.

-ben

diff -ur v2.4.1-ac18/fs/nfs/nfsroot.c work/fs/nfs/nfsroot.c
--- v2.4.1-ac18/fs/nfs/nfsroot.c    Mon Sep 25 16:13:53 2000
+++ work/fs/nfs/nfsroot.c   Tue Feb 20 01:59:32 2001
@@ -226,6 +226,7 @@
if (name[0] && strcmp(name, "default")) {
strncpy(buf, name, NFS_MAXPATHLEN-1);
buf[NFS_MAXPATHLEN-1] = 0;
+   root_nfs_parse_addr(buf);
}
 }


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [PATCH] trylock for rw_semaphores: 2.4.1

2001-02-19 Thread Ben LaHaise

On Mon, 19 Feb 2001, Brian J. Watson wrote:

> Here is an x86 implementation of down_read_trylock() and down_write_trylock()
> for read/write semaphores. As with down_trylock() for exclusive semaphores, they
> don't block if they fail to get the lock. They just return 1, as opposed to 0 in
> the success case.

How about the following instead?  Warning: compiled, not tested.

-ben

diff -ur v2.4.2-pre3/include/asm-i386/semaphore.h trylock/include/asm-i386/semaphore.h
--- v2.4.2-pre3/include/asm-i386/semaphore.h    Mon Feb 12 16:04:59 2001
+++ trylock/include/asm-i386/semaphore.h    Mon Feb 19 23:50:03 2001
@@ -382,5 +382,32 @@
__up_write(sem);
 }

+/* returns 1 if it successfully obtained the semaphore for write */
+static inline int down_write_trylock(struct rw_semaphore *sem)
+{
+   int old = RW_LOCK_BIAS, new = 0;
+   int res;
+
+   res = cmpxchg(&sem->count.counter, old, new);
+   return (res == RW_LOCK_BIAS);
+}
+
+/* returns 1 if it successfully obtained the semaphore for read */
+static inline int down_read_trylock(struct rw_semaphore *sem)
+{
+   int ret = 1;
+   asm volatile(
+   LOCK "subl $1,%0
+   js 2f
+   1:
+   .section .text.lock,\"ax\"
+   2:" LOCK "inc %0
+   subl %1,%1
+   jmp 1b
+   .previous"
+   :"=m" (*(volatile int *)sem), "=r" (ret) : "1" (ret) : "memory");
+   return ret;
+}
+
 #endif
 #endif
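For what it's worth, a caller would use these roughly as in the sketch
below (illustrative only, not part of the patch; it assumes a struct
rw_semaphore mysem initialised elsewhere and relies on the
return-1-on-success convention of the helpers above):

    if (down_write_trylock(&mysem)) {
            /* got the lock without blocking: touch the protected data */
            up_write(&mysem);
    } else {
            /* contended: fall back to a blocking down_write() or retry */
    }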

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



[PATCH] make nfsroot accept server addresses from BOOTP root

2001-02-19 Thread Ben LaHaise

Hello,

Here's a handy little patch that makes the kernel parse out the ip
address of the nfs server from the bootp root path.  Otherwise it's
impossible to boot the kernel without command line options on diskless
workstations (I hate RPL).
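For example (illustrative only: the address and path are made up, and the
exact syntax depends on your dhcpd), the server hands out a root path of
the usual "server:/path" form,

    option root-path "192.168.1.1:/export/diskless/client1";

and the patch below picks the NFS server address out of the part before
the colon, so no nfsroot= is needed on the kernel command line.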

-ben

diff -ur v2.4.1-ac18/fs/nfs/nfsroot.c work/fs/nfs/nfsroot.c
--- v2.4.1-ac18/fs/nfs/nfsroot.c    Mon Sep 25 16:13:53 2000
+++ work/fs/nfs/nfsroot.c   Mon Feb 19 18:05:24 2001
@@ -224,8 +224,7 @@
}
}
if (name[0] && strcmp(name, "default")) {
-   strncpy(buf, name, NFS_MAXPATHLEN-1);
-   buf[NFS_MAXPATHLEN-1] = 0;
+   root_nfs_parse_addr(name);
}
 }


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: x86 ptep_get_and_clear question

2001-02-16 Thread Ben LaHaise

On Fri, 16 Feb 2001, Linus Torvalds wrote:

> This is, actually, a problem that I suspect ends up being _very_ similar
> to the zap_page_range() case. zap_page_range() needs to make sure that
> everything has been updated by the time the page is actually free'd. While
> filemap_sync() needs to make sure that everything has been updated before
> the page is written out (or marked dirty - which obviously also guarantees
> the ordering, and makes the problems look even more similar).

Ah, I see what I was missing.  So long as the tlb flush is in between the
ptep_test_and_clear_dirty and the set_page_dirty, we're fine (ie the
current code is good).  If we really want to reduce the number of tlb
flushes, yes, we can use the gather code and then just do the
set_page_dirty after a tlb_flush_range.

-ben

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: x86 ptep_get_and_clear question

2001-02-16 Thread Ben LaHaise

On Fri, 16 Feb 2001, Manfred Spraul wrote:

> That leaves msync() - it currently does a flush_tlb_page() for every
> single dirty page.
> Is it possible to integrate that into the mmu gather code?
>
> tlb_transfer_dirty() in addition to tlb_clear_page()?

Actually, in the filemap_sync case, the flush_tlb_page is redundant --
there's already a call to flush_tlb_range in filemap_sync after the dirty
bits are cleared.  None of the cpus we support document having a writeback
tlb, and intel's docs explicitly state that they do not, as they state
that the dirty bit is updated on the first write to dirty the pte.

-ben

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: x86 ptep_get_and_clear question

2001-02-16 Thread Ben LaHaise

On Fri, 16 Feb 2001, Linus Torvalds wrote:

> How do you expect to ever see this in practice? Sounds basically
> impossible to test for this hardware race. The obvious "try to dirty as
> fast as possible on one CPU while doing an atomic get-and-clear on the
> other" thing is not valid - it's in fact quite likely to get into
> lock-step because of page table cache movement synchronization. And as
> such it could hide any race.

That's not the behaviour I'm testing, but whether the CPU is doing

lock
pte = *ptep
if (present && writable)
pte |= dirty
*ptep = pte
unlock

versus

lock
pte = *ptep
pte |= dirty
*ptep = pte
unlock

Which can be tested by means of getting the pte into the tlb then changing
the pte without flushing and observing the results (page fault vs changed
pte).  I'm willing to bet that all cpus are doing the first version.

-ben

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: x86 ptep_get_and_clear question

2001-02-16 Thread Ben LaHaise

On Fri, 16 Feb 2001, Jamie Lokier wrote:

> It should be fast on known CPUs, correct on unknown ones, and much
> simpler than "gather" code which may be completely unnecessary and
> rather difficult to test.
>
> If anyone reports the message, _then_ we think about the problem some more.
>
> Ben, fancy writing a boot-time test?

Sure, I'll whip one up this afternoon.

-ben

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: x86 ptep_get_and_clear question

2001-02-15 Thread Ben LaHaise

On Thu, 15 Feb 2001, Kanoj Sarcar wrote:

> No. All architectures do not have this problem. For example, if the
> Linux "dirty" (not the pte dirty) bit is managed by software, a fault
> will actually be taken when processor 2 tries to do the write. The fault
> is solely to make sure that the Linux "dirty" bit can be tracked. As long
> as the fault handler grabs the right locks before updating the Linux "dirty"
> bit, things should be okay. This is the case with mips, for example.
>
> The problem with x86 is that we depend on automatic x86 dirty bit
> update to manage the Linux "dirty" bit (they are the same!). So appropriate
> locks are not grabbed.

Will you please go off and prove that this "problem" exists on some x86
processor before continuing this rant?  None of the PII, PIII, Athlon,
K6-2 or 486s I checked exhibited the worrisome behaviour you're
speculating about, plus it is logically consistent with the statements the
manual does make about updating ptes; otherwise how could an smp os
perform a reliable shootdown by doing an atomic bit clear on the present
bit of a pte?

-ben

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

2001-02-08 Thread Ben LaHaise

On Tue, 6 Feb 2001, Linus Torvalds wrote:

> There are currently no other alternatives in user space. You'd have to
> create whole new interfaces for aio_read/write, and ways for the kernel to
> inform user space that "now you can re-try submitting your IO".
>
> Could be done. But that's a big thing.

Has been done.  Still needs some work, but it works pretty well.  As for
throttling io, having ios submitted does not have to correspond to them
being queued in the lower layers.  The main issue with async io is
limiting the amount of pinned memory for ios; if that's taken care of, I
don't think it matters how many ios are in flight.

> > An application which sets non blocking behavior and busy waits for a
> > request (which seems to be your argument) is just stupid, of course.
>
> Tell me what else it could do at some point? You need something like
> select() to wait on it. There are no such interfaces right now...
>
> (besides, latency would suck. I bet you're better off waiting for the
> requests if they are all used up. It takes too long to get deep into the
> kernel from user space, and you cannot use the exclusive waiters with its
> anti-herd behaviour etc).

Ah, but no.  In fact for some things, the wait queue extensions I'm using
will be more efficient as things like test_and_set_bit for obtaining a
lock gets executed without waking up a task.
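
The lock case looks much like the wtd_lock_page waiter from the wait queue
patch I posted earlier; struct bitlock below is made up for illustration:

	struct bitlock {
		unsigned long		bits;
		wait_queue_head_t	waiters;
	};

	/* runs in the waker's context from wake_up(&lock->waiters); a task is
	 * only queued if it actually wins the bit, so losers never get
	 * scheduled just to go back to sleep */
	static void bitlock_waiter(wait_queue_t *wait)
	{
		struct worktodo *wtd = (struct worktodo *)wait;
		struct bitlock *lock = (struct bitlock *)wtd->data;

		if (!test_and_set_bit(0, &lock->bits)) {
			__remove_wait_queue(&lock->waiters, &wtd->wait);
			wtd_queue(wtd);
		}
	}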

> Simple rule: if you want to optimize concurrency and avoid waiting - use
> several processes or threads instead. At which point you can get real work
> done on multiple CPU's, instead of worrying about what happens when you
> have to wait on the disk.

There do exist plenty of cases where threads are not efficient enough.
Just the stack overhead alone with 8000 threads makes things really suck.
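(For scale: with 2.4's 8KB per-task kernel stack, 8000 threads pin roughly
8000 * 8KB = 62.5MB of unswappable memory before a single user stack or bit
of scheduler bookkeeping is counted.)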
Event based io completion means that server processes don't need to have
the overhead of select/poll.  Add in NT style completion ports for waking
up the right number of worker threads off of the completion queue, and

That said, I don't expect all devices to support async io.  But given
support for files, raw io and sockets, all the important cases are covered.
The remainder can be supported via userspace helpers.

-ben

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: select() returning busy for regular files [was Re: [Kiobuf-io-devel]RFC: Kernel mechanism: Compound event wait]

2001-02-08 Thread Ben LaHaise

On Thu, 8 Feb 2001, Pavel Machek wrote:

> Hi!
>
> > > Its arguing against making a smart application block on the disk while its
> > > able to use the CPU for other work.
> >
> > There are currently no other alternatives in user space. You'd have to
> > create whole new interfaces for aio_read/write, and ways for the kernel to
> > inform user space that "now you can re-try submitting your IO".
>
> Why is current select() interface not good enough?

Think of random disk io scattered across the disk.  Think about aio_write
providing a means to perform zero copy io without needing to resort to
playing mm tricks write protecting pages in the user's page tables.  It's
also a means for dealing efficiently with thousands of outstanding
requests for network io.  Using a select based interface is going to be an
ugly kludge that still has all the overhead of select/poll.
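
A two-minute userspace demo of why: POSIX says a regular file is always
"ready", so select() comes back immediately no matter how cold the data is
(file name arbitrary):

	#include <sys/select.h>
	#include <fcntl.h>
	#include <stdio.h>

	int main(void)
	{
		fd_set rfds;
		struct timeval tv = { 5, 0 };
		int fd = open("/etc/hosts", O_RDONLY);

		FD_ZERO(&rfds);
		FD_SET(fd, &rfds);
		/* returns instantly with fd marked readable; it cannot say
		 * "this read would have to wait for the disk" */
		printf("select() = %d\n", select(fd + 1, &rfds, NULL, NULL, &tv));
		return 0;
	}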

-ben

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

2001-02-06 Thread Ben LaHaise

On Tue, 6 Feb 2001, Ingo Molnar wrote:

>
> On Tue, 6 Feb 2001, Ben LaHaise wrote:
>
> > This small correction is the crux of the problem: if it blocks, it
> > takes away from the ability of the process to continue doing useful
> > work.  If it returns -EAGAIN, then that's okay, the io will be
> > resubmitted later when other disk io has completed.  But, it should be
> > possible to continue servicing network requests or user io while disk
> > io is underway.
>
> typical blocking point is waiting for page completion, not
> __wait_request(). But, this is really not an issue, NR_REQUESTS can be
> increased anytime. If NR_REQUESTS is large enough then think of it as the
> 'absolute upper limit of doing IO', and think of the blocking as 'the
> kernel pulling the brakes'.

=)  This is what I'm seeing: lots of processes waiting with wchan ==
__get_request_wait.  With async io and a database flushing lots of io
asynchronously spread out across the disk, the NR_REQUESTS limit is hit
very quickly.

> [overhead of 512-byte bhs in the raw IO code is an artificial problem of
> the raw IO code.]

True, and in the tests I've run, raw io is using 2KB blocks (same as the
database).

-ben

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

2001-02-06 Thread Ben LaHaise

On Tue, 6 Feb 2001, Ingo Molnar wrote:

>
> On Tue, 6 Feb 2001, Ben LaHaise wrote:
>
> > > > You mentioned non-spindle base io devices in your last message.  Take
> > > > something like a big RAM disk. Now compare kiobuf base io to buffer
> > > > head based io. Tell me which one is going to perform better.
> > >
> > > roughly equal performance when using 4K bhs. And a hell of a lot more
> > > complex and volatile code in the kiobuf case.
> >
> > I'm willing to benchmark you on this.
>
> sure. Could you specify the actual workload, and desired test-setups?

Sure.  General parameters will be as follows (since I think we both have
access to these machines):

- 4xXeon, 4GB memory, 3GB to be used for the ramdisk (enough for a
  base install plus data files).
- data to/from the ram block device must be copied within the ram
  block driver.
- the filesystem used must be ext2.  optimisations to ext2 for
  tweaks to the interface are permitted & encouraged.

The main item I'm interested in is read (page cache cold)/synchronous
write performance for blocks from 256 bytes to 16MB in powers of two, much
like what I've done in testing the aio patches that shows where
improvement in latency is needed.  Including a few other items on disk
like the timings of find/make -s dep/bonnie/dbench is probably worthwhile
to show changes in throughput.  Sound fair?
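
Something like this for the timing side, say (rough userspace sketch, path
and sizes made up; O_SYNC covers the synchronous write half):

	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <sys/time.h>
	#include <unistd.h>

	/* time synchronous writes of one block size against the ramdisk fs */
	static double time_writes(const char *path, size_t bs, size_t total)
	{
		char *buf = malloc(bs);
		int fd = open(path, O_WRONLY | O_CREAT | O_SYNC, 0644);
		struct timeval t0, t1;
		size_t done;

		memset(buf, 0, bs);
		gettimeofday(&t0, NULL);
		for (done = 0; done < total; done += bs)
			write(fd, buf, bs);
		gettimeofday(&t1, NULL);
		close(fd);
		free(buf);
		return (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
	}

	int main(void)
	{
		size_t bs;

		for (bs = 256; bs <= 16 << 20; bs <<= 1)
			printf("%8lu bytes/block: %.3f s\n", (unsigned long)bs,
			       time_writes("/mnt/test/datafile", bs, 64 << 20));
		return 0;
	}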

-ben

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

2001-02-06 Thread Ben LaHaise

On Tue, 6 Feb 2001, Linus Torvalds wrote:

>
>
> On Tue, 6 Feb 2001, Ben LaHaise wrote:
> >
> > s/impossible/unpleasant/.  ll_rw_blk blocks; it should be possible to have
> > a non blocking variant that does all of the setup in the caller's context.
> > Yes, I know that we can do it with a kernel thread, but that isn't as
> > clean and it significantly penalises small ios (hint: databases issue
> > *lots* of small random ios and a good chunk of large ios).
>
> Ehh.. submit_bh() does everything you want. And, btw, ll_rw_block() does
> NOT block. Never has. Never will.
>
> (Small correction: it doesn't block on anything else than allocating a
> request structure if needed, and quite frankly, you have to block
> SOMETIME. You can't just try to throw stuff at the device faster than it
> can take it. Think of it as a "there can only be this many IO's in
> flight")

This small correction is the crux of the problem: if it blocks, it takes
away from the ability of the process to continue doing useful work.  If it
returns -EAGAIN, then that's okay, the io will be resubmitted later when
other disk io has completed.  But, it should be possible to continue
servicing network requests or user io while disk io is underway.
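
i.e. something along these lines (both names are made up, this is just the
shape of it):

	err = submit_bh_nonblock(WRITE, bh);	/* never sleeps on the request queue */
	if (err == -EAGAIN) {
		/* out of request slots: park this io on our own list and
		 * resubmit it from the completion callback of an earlier io,
		 * instead of putting the whole server to sleep here */
		wtd_queue_resubmit(bh);
	}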

> If you want to use kiobuf's because you think they are asycnrhonous and
> bh's aren't, then somebody has been feeding you a lot of crap. The kiobuf
> PR department seems to have been working overtime on some FUD strategy.

I'm using bh's to refer to what is currently being done, and kiobuf when
talking about what could be done.  It's probably the wrong thing to do,
and if bh's are extended to operate on arbitrary sized blocks then there
is no difference between the two.

> If you want to make a "raw disk device", you can do so TODAY with bh's.
> How? Don't use "bread()" (which allocates the backing store and creates
> the cache). Allocate a separate anonymous bh (or multiple), and set them
> up to point to whatever data source/sink you have, and let it rip. All
> asynchronous. All with nice completion callbacks. All with existing code,
> no kiobuf's in sight.

> What more do you think your kiobuf's should be able to do?

That's what my code is doing today.  There are a ton of bh's setup for a
single kiobuf request that is issued.  For something like a single 256kb
io, this is the difference between the batched io requests being passed
into submit_bh fitting in L1 cache and overflowing it.  Resizable bh's
would certainly improve this.
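(Concretely: 256KB in 512-byte buffer_heads is 512 separate struct
buffer_heads and 512 submit_bh calls per request, versus 64 at 4KB; with
struct buffer_head somewhere around a hundred bytes in 2.4, that is on the
order of 50KB of descriptors walked per request, which easily overflows a
16K L1 data cache.)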

-ben

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

2001-02-06 Thread Ben LaHaise

On Tue, 6 Feb 2001, Ingo Molnar wrote:

>
> On Tue, 6 Feb 2001, Ben LaHaise wrote:
>
> > - reduce the overhead in submitting block ios, especially for
> >   large ios. Look at the %CPU usages differences between 512 byte
> >   blocks and 4KB blocks, this can be better.
>
> my system is already submitting 4KB bhs. If anyone's raw-IO setup submits
> 512 byte bhs thats a problem of the raw IO code ...
>
> > - make asynchronous io possible in the block layer.  This is
> >   impossible with the current ll_rw_block scheme and io request
> >   plugging.
>
> why is it impossible?

s/impossible/unpleasant/.  ll_rw_blk blocks; it should be possible to have
a non blocking variant that does all of the setup in the caller's context.
Yes, I know that we can do it with a kernel thread, but that isn't as
clean and it significantly penalises small ios (hint: databases issue
*lots* of small random ios and a good chunk of large ios).

> > You mentioned non-spindle base io devices in your last message.  Take
> > something like a big RAM disk. Now compare kiobuf base io to buffer
> > head based io. Tell me which one is going to perform better.
>
> roughly equal performance when using 4K bhs. And a hell of a lot more
> complex and volatile code in the kiobuf case.

I'm willing to benchmark you on this.

-ben

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

2001-02-06 Thread Ben LaHaise

On Tue, 6 Feb 2001, Ingo Molnar wrote:

> If you are merging based on (device, offset) values, then that's lowlevel
> - and this is what we have been doing for years.
>
> If you are merging based on (inode, offset), then it has flaws like not
> being able to merge through a loopback or stacked filesystem.

I disagree.  Loopback filesystems typically have their data contiguously
on disk and won't split up incoming requests any further.

Here are the points I'm trying to address:

- reduce the overhead in submitting block ios, especially for
  large ios. Look at the %CPU usages differences between 512 byte
  blocks and 4KB blocks, this can be better.
- make asynchronous io possible in the block layer.  This is
  impossible with the current ll_rw_block scheme and io request
  plugging.
- provide a generic mechanism for reordering io requests for
  devices which will benefit from this.  Make it a library for
  drivers to call into.  IDE for example will probably make use of
  it, but some high end devices do this on the controller.  This
  is the important point: Make it OPTIONAL.

You mentioned non-spindle base io devices in your last message.  Take
something like a big RAM disk.  Now compare kiobuf base io to buffer head
based io.  Tell me which one is going to perform better.

-ben

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

2001-02-06 Thread Ben LaHaise

On Tue, 6 Feb 2001, Ingo Molnar wrote:

> - higher levels do not have the kind of state to eg. merge requests done
>   by different users. The only chance for merging is often the lowest
>   level, where we already know what disk, which sector.

That's what a readaround buffer is for, and I suspect that readaround will
give us a big performance boost.

> - merging is not even *required* for some devices - and chances are high
>   that we'll get away from this inefficient and unreliable 'rotating array
>   of disks' business of storing bulk data in this century. (solid state
>   disks, holographic storage, whatever.)

Interesting that you've brought up this point, as it's an example

> i'm truly shocked that you and Stephen are both saying this.

Merging != sorting.  Sorting of requests has to be carried out at the
lower layers, and the specific block device should be able to choose the
Right Thing To Do for the next item in a chain of sequential requests.

-ben

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

2001-02-06 Thread Ben LaHaise

On Tue, 6 Feb 2001, Jens Axboe wrote:

> Stephen already covered this point, the merging is not a problem
> to deal with for read-ahead. The underlying system can easily

I just wanted to make sure that was clear =)

> queue that in nice big chunks. Delayed allocation makes it
> easier to to flush big chunks as well. I seem to recall the xfs people
> having problems with the lack of merging causing a performance hit
> on smaller I/O.

That's where readaround buffers come into play.  If we have a fixed number
of readaround buffers that are used when small ios are issued, they should
provide a low overhead means of substantially improving things like find
(which reads many nearby inodes out of order but sequentially).  I need to
implement this and get cache hit rates for various workloads. ;-)
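
The thing I have in mind is roughly this (made-up sketch, sizes and names
arbitrary):

	#define NR_READAROUND	8
	#define RA_SECTORS	128			/* 64KB per buffer */

	struct readaround {
		kdev_t		dev;
		unsigned long	start;			/* first sector held */
		char		*data;			/* RA_SECTORS << 9 bytes */
	};

	static struct readaround ra_bufs[NR_READAROUND];

	/* small read: if a buffer already covers the sector, copy it out and
	 * never touch the disk; otherwise read a whole aligned chunk into
	 * the least recently used slot and satisfy the io from that */
	static struct readaround *ra_lookup(kdev_t dev, unsigned long sector)
	{
		int i;

		for (i = 0; i < NR_READAROUND; i++)
			if (ra_bufs[i].dev == dev && sector >= ra_bufs[i].start &&
			    sector < ra_bufs[i].start + RA_SECTORS)
				return &ra_bufs[i];
		return NULL;
	}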

> Of course merging doesn't have to happen in ll_rw_blk.
>
> > As for io completion, can't we just issue seperate requests for the
> > critical data and the readahead?  That way for SCSI disks, the important
> > io should be finished while the readahead can continue.  Thoughts?
>
> Priorities?

Definitely.  I'd like to be able to issue readaheads with a "don't bother
executing this request unless the cost is low" bit set.  It might also
be helpful for heavy multiuser loads (or even a single user with multiple
processes) to ensure progress is made for others.

-ben

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: sync & asyck i/o

2001-02-06 Thread Ben LaHaise

On Tue, 6 Feb 2001, Stephen C. Tweedie wrote:

> It's worth noting that it *is* defined unambiguously in the standards:
> fsync waits until all the data is hard on disk.  Linux will obey that
> if it possibly can: only in cases where the hardware is actively lying
> about when the data has hit disk will the guarantee break down.

It is defined for writes that have begun before the fsync() started.
fsync has no bearing on aio writes until the async writes have completed.
If people are worried about the interaction between an fsync in their app
and an async write, they should be using synchronous writes (which are
perfectly usable with async io).
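
For instance (userspace illustration using the POSIX aio calls rather than my
kernel patches, but the semantics are the point): open the file O_SYNC and
every async write is durable by the time it completes, so there is nothing
left for a later fsync() to race with:

	#include <aio.h>
	#include <fcntl.h>
	#include <string.h>
	#include <sys/types.h>

	static int durable_async_write(int fd_osync, const void *buf, size_t len,
				       off_t off, struct aiocb *cb)
	{
		memset(cb, 0, sizeof(*cb));
		cb->aio_fildes = fd_osync;	/* opened with O_WRONLY|O_SYNC */
		cb->aio_buf    = (void *)buf;
		cb->aio_nbytes = len;
		cb->aio_offset = off;
		/* returns immediately; once aio_error(cb) != EINPROGRESS the
		 * data is on stable storage because of O_SYNC */
		return aio_write(cb);
	}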

-ben

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait

2001-02-06 Thread Ben LaHaise

Hey folks,

On Tue, 6 Feb 2001, Stephen C. Tweedie wrote:

> The whole point of the post was that it is merging, not splitting,
> which is troublesome.  How are you going to merge requests without
> having chains of scatter-gather entities each with their own
> completion callbacks?

Let me just emphasize what Stephen is pointing out: if requests are
properly merged at higher layers, then merging is neither required nor
desired.  Traditionally, ext2 has not done merging because the underlying
system doesn't support it.  This leads to rather convoluted code for
readahead which doesn't result in appropriately merged requests on
indirect block boundaries, and in fact leads to suboptimal performance.
The only case I see where merging of requests can improve things is when
dealing with lots of small files.  But we already know that small files
need to be treated differently (e.g. tail merging).  Besides, most of the
benefit of merging can be had by doing readaround for these small files.

As for io completion, can't we just issue separate requests for the
critical data and the readahead?  That way for SCSI disks, the important
io should be finished while the readahead can continue.  Thoughts?

-ben

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [Kiobuf-io-devel] RFC: Kernel mechanism: Compound event wait/notify + callback chains

2001-01-29 Thread Ben LaHaise

On Tue, 30 Jan 2001 [EMAIL PROTECTED] wrote:

>
> Comments, suggestions, advise, feedback solicited !
>
> If this seems like something that might (after some refinements) be a
> useful abstraction to have, then I need some help in straightening out the
> design. I am not very satisfied with it in its current form.

Here's my first bit of feedback from the point of "this is what my code
currently does and why".

The waitqueue extension below is a minimalist approach for providing
kernel support for fully asynchronous io.  The basic idea is that a
function pointer is added to the wait queue structure that is called
during wake_up on a wait queue head.  (The patch below also includes
support for exclusive lifo wakeups, which isn't crucial/perfect, but just
happened to be part of the code.)  No function pointer or other data is
added to the wait queue structure.  Rather, users are expected to make use
of it by embedding the wait queue structure within their own data
structure that contains all needed info for running the state machine.

Here's a snippet of code which demonstrates a non blocking lock of a page
cache page:

struct worktodo {
	wait_queue_t		wait;
	struct tq_struct	tq;
	void			*data;
};

/* wake function: runs in the waker's context when page->wait is woken */
static void __wtd_lock_page_waiter(wait_queue_t *wait)
{
	struct worktodo *wtd = (struct worktodo *)wait;
	struct page *page = (struct page *)wtd->data;

	if (!TryLockPage(page)) {
		/* got the lock: take ourselves off the queue and defer the
		 * rest of the state machine to task queue context */
		__remove_wait_queue(&page->wait, &wtd->wait);
		wtd_queue(wtd);
	} else {
		schedule_task(&run_disk_tq);
	}
}

void wtd_lock_page(struct worktodo *wtd, struct page *page)
{
	if (TryLockPage(page)) {
		int raced = 0;
		wtd->data = page;
		init_waitqueue_func_entry(&wtd->wait, __wtd_lock_page_waiter);
		add_wait_queue_cond(&page->wait, &wtd->wait,
				    TryLockPage(page), raced = 1);

		if (!raced) {
			run_task_queue(&tq_disk);
			return;
		}
	}

	/* lock acquired without blocking: run the continuation now */
	wtd->tq.routine(wtd->tq.data);
}


The use of wakeup functions is also useful for waking a specific reader or
writer in the rw_sems, making semaphore avoid spurious wakeups, etc.

I suspect that chaining of events should be built on top of the
primitives, which should be kept as simple as possible.  Comments?

-ben


diff -urN v2.4.1pre10/include/linux/mm.h work/include/linux/mm.h
--- v2.4.1pre10/include/linux/mm.h  Fri Jan 26 19:03:05 2001
+++ work/include/linux/mm.h Fri Jan 26 19:14:07 2001
@@ -198,10 +198,11 @@
  */
 #define UnlockPage(page)   do { \
smp_mb__before_clear_bit(); \
+   if (!test_bit(PG_locked, &(page)->flags)) { printk("last: %p\n", (page)->last_unlock); BUG(); } \
+   (page)->last_unlock = current_text_addr(); \
if (!test_and_clear_bit(PG_locked, &(page)->flags)) BUG(); \
smp_mb__after_clear_bit(); \
-   if (waitqueue_active(&page->wait)) \
-   wake_up(&page->wait); \
+   wake_up(&page->wait); \
} while (0)
 #define PageError(page)test_bit(PG_error, &(page)->flags)
 #define SetPageError(page) set_bit(PG_error, &(page)->flags)
diff -urN v2.4.1pre10/include/linux/sched.h work/include/linux/sched.h
--- v2.4.1pre10/include/linux/sched.h   Fri Jan 26 19:03:05 2001
+++ work/include/linux/sched.h  Fri Jan 26 19:14:07 2001
@@ -751,6 +751,7 @@

 extern void FASTCALL(add_wait_queue(wait_queue_head_t *q, wait_queue_t * wait));
 extern void FASTCALL(add_wait_queue_exclusive(wait_queue_head_t *q, wait_queue_t * wait));
+extern void FASTCALL(add_wait_queue_exclusive_lifo(wait_queue_head_t *q, wait_queue_t * wait));
 extern void FASTCALL(remove_wait_queue(wait_queue_head_t *q, wait_queue_t * wait));

 #define __wait_event(wq, condition)\
diff -urN v2.4.1pre10/include/linux/wait.h work/include/linux/wait.h
--- v2.4.1pre10/include/linux/wait.hThu Jan  4 17:50:46 2001
+++ work/include/linux/wait.h   Fri Jan 26 19:14:06 2001
@@ -43,17 +43,20 @@
 } while (0)
 #endif

+typedef struct __wait_queue wait_queue_t;
+typedef void (*wait_queue_func_t)(wait_queue_t *wait);
+
 struct __wait_queue {
unsigned int flags;
 #define WQ_FLAG_EXCLUSIVE  0x01
struct task_struct * task;
struct list_head task_list;
+   wait_queue_func_t func;
 #if WAITQUEUE_DEBUG
long __magic;
long __waker;
 #endif
 };
-typedef struct __wait_queue wait_queue_t;

 /*
  * 'dual' spinlock architecture. Can be switched between spinlock_t and
@@ -110,7 +113,7 @@
 #endif

 #define __WAITQUEUE_INITIALIZER(name,task) \
-   { 0x0, task, { NULL, NULL } 

Re: oopses in test10-pre4 (was Re: [RFC] atomic pte updates and pae changes, take 3)

2000-10-20 Thread Ben LaHaise

On Thu, 19 Oct 2000, Linus Torvalds wrote:


> I think you overlooked the fact that SHM mappings use the page cache, and
> it's ok if such pages are dirty and writable - they will get written out
> by the shm_swap() logic once there are no mappings active any more.
> 
> I like the test per se, because I think it's correct for the "normal"
> case of a private page, but I really think those two BUG()'s are not bugs
> at all in general, and we should just remove the two tests.
> 
> Comments? Anything I've overlooked?

The primary reason I added the BUG was that if this is valid, it means
that the pte has to be removed from the page tables first with
pte_get_and_clear since it can be modified by the other CPU.  Although
this may be safe for shm, I think it's very ugly and inconsistent.  I'd
rather make the code transfer the dirty bit to the page struct so that we
*know* there is no information loss.

If the above is correct, then the following patch should do (untested).  
Oh, I think I missed adding pte_same in the generic pgtable.h macros, too.
doh!  I'm willing to take a closer look if you think it's needed.

-ben

diff -urN v2.4.0-test10-pre4/include/asm-generic/pgtable.h 
work-foo/include/asm-generic/pgtable.h
--- v2.4.0-test10-pre4/include/asm-generic/pgtable.hFri Oct 20 00:58:03 2000
+++ work-foo/include/asm-generic/pgtable.h  Fri Oct 20 01:42:24 2000
@@ -38,4 +38,6 @@
set_pte(ptep, pte_mkdirty(old_pte));
 }
 
+#define pte_same(left,right)   (pte_val(left) == pte_val(right))
+
 #endif /* _ASM_GENERIC_PGTABLE_H */
diff -urN v2.4.0-test10-pre4/mm/vmscan.c work-foo/mm/vmscan.c
--- v2.4.0-test10-pre4/mm/vmscan.c  Fri Oct 20 00:58:04 2000
+++ work-foo/mm/vmscan.cFri Oct 20 01:43:54 2000
@@ -87,6 +87,13 @@
if (TryLockPage(page))
goto out_failed;
 
+   /* From this point on, the odds are that we're going to
+* nuke this pte, so read and clear the pte.  This hook
+* is needed on CPUs which update the accessed and dirty
+* bits in hardware.
+*/
+   pte = ptep_get_and_clear(page_table);
+
/*
 * Is the page already in the swap cache? If so, then
 * we can just drop our reference to it without doing
@@ -98,10 +105,6 @@
if (PageSwapCache(page)) {
entry.val = page->index;
swap_duplicate(entry);
-   if (pte_dirty(pte))
-   BUG();
-   if (pte_write(pte))
-   BUG();
set_pte(page_table, swp_entry_to_pte(entry));
 drop_pte:
UnlockPage(page);
@@ -111,13 +114,6 @@
page_cache_release(page);
goto out_failed;
}
-
-   /* From this point on, the odds are that we're going to
-* nuke this pte, so read and clear the pte.  This hook
-* is needed on CPUs which update the accessed and dirty
-* bits in hardware.
-*/
-   pte = ptep_get_and_clear(page_table);
 
/*
 * Is it a clean page? Then it must be recoverable

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



[RFC] atomic pte updates and pae changes, take 2

2000-10-13 Thread Ben LaHaise

Hey folks

Below is take two of the patch making pte_clear use atomic xchg in an
effort to avoid the loss of dirty bits.  PAE no longer uses cmpxchg8 for
updates; set_pte is two ordered long writes with a barrier.  The use of
long long for ptes is also removed; gcc should generate better code now. A
quick test with filemap_rw shows no measurable difference between pae and
non pae code, as well as no degradation from the original non-atomic
non-pae code.  This code has been tested on a box with 4GB (about 48MB is
above the 4G boundary) in PAE mode, and in non PAE mode on a couple of
other boxes too.  Linus: comments?  Ingo: could you have a look over the
code?  Thanks,

-ben
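
Presumably why two ordered long writes suffice: the present bit sits in
pte_low, and everywhere set_pte is used the entry has either just been
cleared or only its low-word flags change, so no hardware walk can catch a
present entry whose upper half doesn't match.  The hunk in question,
annotated:

	extern inline void set_pte(pte_t *ptep, pte_t pte)
	{
		ptep->pte_high = pte.pte_high;	/* upper half of the pfn first */
		barrier();			/* keep the compiler from reordering */
		ptep->pte_low = pte.pte_low;	/* flags + present bit last, making
						 * the whole 64 bits visible at once */
	}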

diff -ur v2.4.0-test10-pre2/arch/i386/boot/install.sh 
work-10-2/arch/i386/boot/install.sh
--- v2.4.0-test10-pre2/arch/i386/boot/install.shTue Jan  3 06:57:26 1995
+++ work-10-2/arch/i386/boot/install.sh Fri Oct 13 17:19:47 2000
@@ -21,6 +21,7 @@
 
 # User may have a custom install script
 
+if [ -x ~/bin/installkernel ]; then exec ~/bin/installkernel "$@"; fi
 if [ -x /sbin/installkernel ]; then exec /sbin/installkernel "$@"; fi
 
 # Default install - same as make zlilo
diff -ur v2.4.0-test10-pre2/include/asm-i386/page.h work-10-2/include/asm-i386/page.h
--- v2.4.0-test10-pre2/include/asm-i386/page.h  Thu Oct 12 17:42:11 2000
+++ work-10-2/include/asm-i386/page.h   Fri Oct 13 17:36:02 2000
@@ -37,20 +37,20 @@
  * These are used to make use of C type-checking..
  */
 #if CONFIG_X86_PAE
-typedef struct { unsigned long long pte; } pte_t;
+typedef struct { unsigned long pte_low, pte_high; } pte_t;
 typedef struct { unsigned long long pmd; } pmd_t;
 typedef struct { unsigned long long pgd; } pgd_t;
-#define PTE_MASK   (~(unsigned long long) (PAGE_SIZE-1))
+#define pte_val(x) ((x).pte_low | ((unsigned long long)(x).pte_high << 32))
 #else
-typedef struct { unsigned long pte; } pte_t;
+typedef struct { unsigned long pte_low; } pte_t;
 typedef struct { unsigned long pmd; } pmd_t;
 typedef struct { unsigned long pgd; } pgd_t;
-#define PTE_MASK   PAGE_MASK
+#define pte_val(x) ((x).pte_low)
 #endif
+#define PTE_MASK   PAGE_MASK
 
 typedef struct { unsigned long pgprot; } pgprot_t;
 
-#define pte_val(x) ((x).pte)
 #define pmd_val(x) ((x).pmd)
 #define pgd_val(x) ((x).pgd)
 #define pgprot_val(x)  ((x).pgprot)
diff -ur v2.4.0-test10-pre2/include/asm-i386/pgtable-2level.h 
work-10-2/include/asm-i386/pgtable-2level.h
--- v2.4.0-test10-pre2/include/asm-i386/pgtable-2level.hFri Dec  3 14:12:23 
1999
+++ work-10-2/include/asm-i386/pgtable-2level.h Fri Oct 13 17:41:14 2000
@@ -18,7 +18,7 @@
 #define PTRS_PER_PTE   1024
 
 #define pte_ERROR(e) \
-   printk("%s:%d: bad pte %08lx.\n", __FILE__, __LINE__, pte_val(e))
+   printk("%s:%d: bad pte %08lx.\n", __FILE__, __LINE__, (e).pte_low)
 #define pmd_ERROR(e) \
printk("%s:%d: bad pmd %08lx.\n", __FILE__, __LINE__, pmd_val(e))
 #define pgd_ERROR(e) \
@@ -54,5 +54,12 @@
 {
return (pmd_t *) dir;
 }
+
+#define __HAVE_ARCH_pte_get_and_clear
+#define pte_get_and_clear(xp)  __pte(xchg(&(xp)->pte_low, 0))
+#define pte_same(a, b) ((a).pte_low == (b).pte_low)
+#define pte_page(x)	(mem_map+((unsigned long)(((x).pte_low >> PAGE_SHIFT))))
+#define pte_none(x)(!(x).pte_low)
+#define __mk_pte(page_nr,pgprot) __pte(((page_nr) << PAGE_SHIFT) | pgprot_val(pgprot))
 
 #endif /* _I386_PGTABLE_2LEVEL_H */
diff -ur v2.4.0-test10-pre2/include/asm-i386/pgtable-3level.h 
work-10-2/include/asm-i386/pgtable-3level.h
--- v2.4.0-test10-pre2/include/asm-i386/pgtable-3level.hMon Dec  6 19:19:13 
1999
+++ work-10-2/include/asm-i386/pgtable-3level.h Fri Oct 13 17:39:53 2000
@@ -27,7 +27,7 @@
 #define PTRS_PER_PTE   512
 
 #define pte_ERROR(e) \
-   printk("%s:%d: bad pte %p(%016Lx).\n", __FILE__, __LINE__, &(e), pte_val(e))
+   printk("%s:%d: bad pte %p(%08lx%08lx).\n", __FILE__, __LINE__, &(e), 
+(e).pte_high, (e).pte_low)
 #define pmd_ERROR(e) \
printk("%s:%d: bad pmd %p(%016Lx).\n", __FILE__, __LINE__, &(e), pmd_val(e))
 #define pgd_ERROR(e) \
@@ -45,8 +45,12 @@
 extern inline int pgd_bad(pgd_t pgd)   { return 0; }
 extern inline int pgd_present(pgd_t pgd)   { return !pgd_none(pgd); }
 
-#define set_pte(pteptr,pteval) \
-   set_64bit((unsigned long long *)(pteptr),pte_val(pteval))
+extern inline void set_pte(pte_t *ptep, pte_t pte)
+{
+   ptep->pte_high = pte.pte_high;
+   barrier();
+   ptep->pte_low = pte.pte_low;
+}
 #define set_pmd(pmdptr,pmdval) \
set_64bit((unsigned long long *)(pmdptr),pmd_val(pmdval))
 #define set_pgd(pgdptr,pgdval) \
@@ -75,5 +79,35 @@
 /* Find an entry in the second-level page table.. */
 #define pmd_offset(dir, address) ((pmd_t *) pgd_page(*(dir)) + \
__pmd_offset(address))
+
+#define __HAVE_ARCH_pte_get_and_clear
+extern inline pte_t pte_get_and_clear(pte_t *ptep)
+{
+   pte_t res;
+
+

[RFC] atomic pte updates for x86 smp

2000-10-11 Thread Ben LaHaise

On Wed, 11 Oct 2000 [EMAIL PROTECTED] wrote:

>> 2. Capable Of Corrupting Your FS/data
>> 
>>  * Non-atomic page-map operations can cause loss of dirty bit on
>>pages (sct, alan)
> 
>Is anybody looking into fixing this bug ?
> 
> According to sct (who's sitting next to me in my hotel room at ALS) Ben
> LaHaise has a bugfix for this, but it hasn't been merged.

Here's an updated version of the patch that doesn't do the funky RISC-like
dirty bit updates.  It doesn't incur the additional overhead of taking
page faults just to set the dirty bit, which actually happens a lot on
SHM attaches (during Oracle runs this is quite noticeable due to their
use of hundreds of MB of SHM).  Ted: note that there are a couple of
other SMP races that still need fixing; list them under the VM threading
bug under SMP (a different bug).

-ben
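
[Editor's note: a minimal sketch of the dirty-bit race the patch below
closes.  The helper names are illustrative, not part of the patch; the
point is that the MMU on another CPU can set _PAGE_DIRTY in a pte at
any moment, so a plain read-then-write clear can overwrite that update:]

/* Racy: a dirty bit set by the MMU between the read and the store of
 * zero is silently discarded, and the page never gets written back. */
extern inline pte_t racy_pte_clear(pte_t *ptep)
{
        pte_t old = *ptep;              /* read the old entry           */
        set_pte(ptep, __pte(0));        /* a dirty bit set here is lost */
        return old;
}

/* Atomic: xchg swaps in zero and returns whatever was in the pte,
 * including a dirty bit set at the last possible instant; this is what
 * pte_xchg_clear below does for the 2-level case. */
extern inline pte_t atomic_pte_clear(pte_t *ptep)
{
        return __pte(xchg(&ptep->pte, 0));
}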

# v2.4.0-test10-1-smp_pte_fix.diff
diff -ur v2.4.0-test10-pre1/include/asm-i386/pgtable-2level.h work-v2.4.0-test10-pre1/include/asm-i386/pgtable-2level.h
--- v2.4.0-test10-pre1/include/asm-i386/pgtable-2level.h    Fri Dec  3 14:12:23 1999
+++ work-v2.4.0-test10-pre1/include/asm-i386/pgtable-2level.h   Wed Oct 11 16:08:08 2000
@@ -55,4 +55,7 @@
return (pmd_t *) dir;
 }
 
+#define __HAVE_ARCH_pte_xchg_clear
+#define pte_xchg_clear(xp) __pte(xchg(&(xp)->pte, 0))
+
 #endif /* _I386_PGTABLE_2LEVEL_H */
diff -ur v2.4.0-test10-pre1/include/asm-i386/pgtable-3level.h work-v2.4.0-test10-pre1/include/asm-i386/pgtable-3level.h
--- v2.4.0-test10-pre1/include/asm-i386/pgtable-3level.h    Mon Dec  6 19:19:13 1999
+++ work-v2.4.0-test10-pre1/include/asm-i386/pgtable-3level.h   Wed Oct 11 16:14:40 2000
@@ -76,4 +76,17 @@
 #define pmd_offset(dir, address) ((pmd_t *) pgd_page(*(dir)) + \
__pmd_offset(address))
 
+#define __HAVE_ARCH_pte_xchg_clear
+extern inline pte_t pte_xchg_clear(pte_t *ptep)
+{
+   long long res = pte_val(*ptep);
+__asm__ __volatile__ (
+"1: cmpxchg8b (%1);
+jnz 1b"
+: "=A" (res)
+   :"D"(ptep), "0" (res), "b"(0), "c"(0)
+: "memory");
+   return (pte_t){ res };
+}
+
 #endif /* _I386_PGTABLE_3LEVEL_H */
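
[Editor's note: on PAE a pte is 64 bits wide and 32-bit x86 has no
64-bit xchg to memory, so the 3-level pte_xchg_clear above spins on
cmpxchg8b instead.  A rough C equivalent of what the inline asm does,
using a hypothetical cmpxchg64() wrapper for the instruction (a sketch
for clarity, not part of the patch):]

extern inline pte_t pte_xchg_clear_sketch(pte_t *ptep)
{
        unsigned long long old, cur = pte_val(*ptep);

        /* Keep trying to swap zero into the entry.  On failure
         * cmpxchg8b hands back the current value, so a dirty bit the
         * MMU sets mid-loop is picked up and returned, not lost. */
        do {
                old = cur;
                cur = cmpxchg64((unsigned long long *)ptep, old, 0);
        } while (cur != old);

        return (pte_t){ old };
}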
diff -ur v2.4.0-test10-pre1/include/asm-i386/pgtable.h work-v2.4.0-test10-pre1/include/asm-i386/pgtable.h
--- v2.4.0-test10-pre1/include/asm-i386/pgtable.h   Mon Oct  2 14:06:43 2000
+++ work-v2.4.0-test10-pre1/include/asm-i386/pgtable.h  Wed Oct 11 17:44:04 2000
@@ -17,6 +17,10 @@
 #include <asm/fixmap.h>
 #include <linux/threads.h>
 
+#ifndef _I386_BITOPS_H
+#include <asm/bitops.h>
+#endif
+
 extern pgd_t swapper_pg_dir[1024];
 extern void paging_init(void);
 
@@ -145,6 +149,16 @@
  * the page directory entry points directly to a 4MB-aligned block of
  * memory. 
  */
+#define _PAGE_BIT_PRESENT  0
+#define _PAGE_BIT_RW   1
+#define _PAGE_BIT_USER 2
+#define _PAGE_BIT_PWT  3
+#define _PAGE_BIT_PCD  4
+#define _PAGE_BIT_ACCESSED 5
+#define _PAGE_BIT_DIRTY6
+#define _PAGE_BIT_PSE  7   /* 4 MB (or 2MB) page, Pentium+, if present.. */
+#define _PAGE_BIT_GLOBAL   8   /* Global TLB entry PPro+ */
+
 #define _PAGE_PRESENT  0x001
 #define _PAGE_RW   0x002
 #define _PAGE_USER 0x004
@@ -234,6 +248,24 @@
 #define pte_none(x)(!pte_val(x))
 #define pte_present(x) (pte_val(x) & (_PAGE_PRESENT | _PAGE_PROTNONE))
 #define pte_clear(xp)  do { set_pte(xp, __pte(0)); } while (0)
+
+#define __HAVE_ARCH_pte_test_and_clear_dirty
+static inline int pte_test_and_clear_dirty(pte_t *page_table, pte_t pte)
+{
+   return test_and_clear_bit(_PAGE_BIT_DIRTY, page_table);
+}
+
+#define __HAVE_ARCH_pte_test_and_clear_young
+static inline int pte_test_and_clear_young(pte_t *page_table, pte_t pte)
+{
+   return test_and_clear_bit(_PAGE_BIT_ACCESSED, page_table);
+}
+
+#define __HAVE_ARCH_atomic_pte_wrprotect
+static inline void atomic_pte_wrprotect(pte_t *page_table, pte_t old_pte)
+{
+   clear_bit(_PAGE_BIT_RW, page_table);
+}
 
 #define pmd_none(x)(!pmd_val(x))
 #define pmd_present(x) (pmd_val(x) & _PAGE_PRESENT)
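
[Editor's note: pte_test_and_clear_dirty/young above work because the
dirty and accessed bits live in the low 32 bits of the pte, so a locked
bit operation on the entry is atomic with respect to the MMU setting
those bits.  An illustrative caller, with hypothetical names that are
not from the patch, showing why the combined test-and-clear matters:]

/* Page-aging sketch: move the hardware dirty bit into software
 * bookkeeping with no window in which a concurrent update can be lost.
 * mark_page_dirty() stands in for whatever the caller does with it. */
if (pte_test_and_clear_dirty(page_table, pte))
        mark_page_dirty(page);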
diff -ur v2.4.0-test10-pre1/include/linux/mm.h work-v2.4.0-test10-pre1/include/linux/mm.h
--- v2.4.0-test10-pre1/include/linux/mm.h   Tue Oct  3 13:40:38 2000
+++ work-v2.4.0-test10-pre1/include/linux/mm.h  Wed Oct 11 17:44:38 2000
@@ -532,6 +532,42 @@
 #define vmlist_modify_lock(mm) vmlist_access_lock(mm)
 #define vmlist_modify_unlock(mm)   vmlist_access_unlock(mm)
 
+#ifndef __HAVE_ARCH_pte_test_and_clear_young
+static inline int pte_test_and_clear_young(pte_t *page_table, pte_t pte)
+{
+   if (!pte_young(pte))
+   return 0;
+   set_pte(page_table, pte_mkold(pte));
+   return 1;
+}
+#endif
+
+#ifndef __HAVE_ARCH_pte_test_and_clear_dirty
+static inline int pte_test_and_clear_dirty(pte_t *page_table, pte_t pte)
+{
+   if (!pte_dirty(pte))
+  
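
[Editor's note: the archive truncates the mm.h hunk here.  The visible
part shows the pattern used throughout the patch: linux/mm.h carries a
generic, non-atomic fallback for each helper, guarded by #ifndef
__HAVE_ARCH_..., and an architecture that can do better (as
asm-i386/pgtable.h does above) defines the macro to suppress it.  The
truncated dirty fallback presumably mirrors the young one shown above;
a sketch of the whole pattern, for clarity only:]

/* asm-foo/pgtable.h: the architecture provides an atomic version and
 * announces it ... */
#define __HAVE_ARCH_pte_test_and_clear_dirty

/* linux/mm.h: ... otherwise this read-modify-write fallback is used
 * (it can lose a dirty bit set between the test and the set_pte,
 * which is exactly what the x86 version avoids). */
#ifndef __HAVE_ARCH_pte_test_and_clear_dirty
static inline int pte_test_and_clear_dirty(pte_t *page_table, pte_t pte)
{
        if (!pte_dirty(pte))
                return 0;
        set_pte(page_table, pte_mkclean(pte));
        return 1;
}
#endif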

Re: test10-pre1 problems on 4-way SuperServer8050

2000-10-11 Thread Ben LaHaise

On Wed, 11 Oct 2000, Tigran Aivazian wrote:

> it works fine then. Kernel compiles in 68 seconds as it should. Shall I
> keep incrementing mem= to see what happens next...

I suspect fixing the MTRRs on the machine will solve this problem, as a
38-40 times slowdown on a machine that isn't swapping most likely means
memory isn't being cached (as Rik pointed out, 38-40 times is right on
the nose for the difference in speed between cache and main memory).

-ben

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/


