RE: [patch] optimize o_direct on block device - v3
Randy Dunlap wrote on Thursday, January 11, 2007 1:45 PM
> > +/* return a pge back to pvec array */
>
> is pge just a typo or some other tla that i don't know?
> (not portland general electric or pacific gas & electric)

Typo with fat fingers.  Thanks for catching it.  Full patch with the typo
fixed:

[patch] fix blk_direct_IO bio preparation.

For a large DIO that needs multiple bios, one full page worth of data was
lost at the boundary of the bio's maximum sector or segment limits.  After
a bio is full and gets submitted, the outer "while (nbytes)" loop allocates
a new bio and just marches on to index into the next page.  It forgets
about the page that bio_add_page() rejected when the previous bio was
full.  Fix it by putting the rejected page back into pvec so we pick it up
again for the next bio.

Signed-off-by: Ken Chen <[EMAIL PROTECTED]>

diff -Nurp linux-2.6.20-rc4/fs/block_dev.c linux-2.6.20.ken/fs/block_dev.c
--- linux-2.6.20-rc4/fs/block_dev.c	2007-01-06 21:45:51.0 -0800
+++ linux-2.6.20.ken/fs/block_dev.c	2007-01-10 19:54:53.0 -0800
@@ -190,6 +190,12 @@ static struct page *blk_get_page(unsigne
 	return pvec->page[pvec->idx++];
 }
 
+/* return a page back to pvec array */
+static void blk_unget_page(struct page *page, struct pvec *pvec)
+{
+	pvec->page[--pvec->idx] = page;
+}
+
 static ssize_t
 blkdev_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
 			loff_t pos, unsigned long nr_segs)
@@ -278,6 +284,8 @@ same_bio:
 				count = min(count, nbytes);
 				goto same_bio;
 			}
+		} else {
+			blk_unget_page(page, &pvec);
 		}
 
 		/* bio is ready, submit it */
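For readers outside the thread, here is an illustrative sketch of the
pattern at issue -- a hypothetical helper, not the fs/block_dev.c code:
bio_add_page() returns 0 once a bio hits its sector or segment limits,
and the rejected page must be offered to the next bio rather than
skipped.

static void build_and_submit(int rw, struct block_device *bdev,
			     struct page **pages, int nr_pages,
			     sector_t sector)
{
	struct bio *bio = NULL;
	int i = 0;

	while (i < nr_pages) {
		if (!bio) {
			bio = bio_alloc(GFP_KERNEL,
					min_t(int, nr_pages - i, BIO_MAX_PAGES));
			bio->bi_bdev = bdev;
			bio->bi_sector = sector + i * (PAGE_SIZE >> 9);
		}
		if (bio_add_page(bio, pages[i], PAGE_SIZE, 0)) {
			i++;			/* page accepted */
		} else {
			/*
			 * bio full: submit it, but do NOT advance i --
			 * pages[i] must go into the next bio.  The bug
			 * fixed above was the equivalent of advancing
			 * here.  (Completion handling elided.)
			 */
			submit_bio(rw, bio);
			bio = NULL;
		}
	}
	if (bio)
		submit_bio(rw, bio);
}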
RE: [patch] optimize o_direct on block device - v3
Andrew Morton wrote on Thursday, January 11, 2007 11:29 AM
> On Thu, 11 Jan 2007 13:21:57 -0600
> Michael Reed <[EMAIL PROTECTED]> wrote:
> > Testing on my ia64 system reveals that this patch introduces a
> > data integrity error for direct i/o to a block device.  Device
> > errors which result in i/o failure do not propagate to the
> > process issuing direct i/o to the device.
> >
> > This can be reproduced by doing writes to a fibre channel block
> > device and then disabling the switch port connecting the host
> > adapter to the switch.
>
> Does this fix it?

Darn, kicking myself in the butt.  Thank you Andrew for fixing this.

We've also been running a DIO stress test almost non-stop over the last 30
days or so, and we did uncover another bug in that patch.  Andrew, would
you please take the following bug fix patch as well.  It is critical
because it also affects data integrity.

[patch] fix blk_direct_IO bio preparation.

For a large DIO that needs multiple bios, one full page worth of data was
lost at the boundary of the bio's maximum sector or segment limits.  After
a bio is full and gets submitted, the outer "while (nbytes)" loop allocates
a new bio and just marches on to index into the next page.  It forgets
about the page that bio_add_page() rejected when the previous bio was
full.  Fix it by putting the rejected page back into pvec so we pick it up
again for the next bio.

Signed-off-by: Ken Chen <[EMAIL PROTECTED]>

diff -Nurp linux-2.6.20-rc4/fs/block_dev.c linux-2.6.20.ken/fs/block_dev.c
--- linux-2.6.20-rc4/fs/block_dev.c	2007-01-06 21:45:51.0 -0800
+++ linux-2.6.20.ken/fs/block_dev.c	2007-01-10 19:54:53.0 -0800
@@ -190,6 +190,12 @@ static struct page *blk_get_page(unsigne
 	return pvec->page[pvec->idx++];
 }
 
+/* return a pge back to pvec array */
+static void blk_unget_page(struct page *page, struct pvec *pvec)
+{
+	pvec->page[--pvec->idx] = page;
+}
+
 static ssize_t
 blkdev_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
 			loff_t pos, unsigned long nr_segs)
@@ -278,6 +284,8 @@ same_bio:
 				count = min(count, nbytes);
 				goto same_bio;
 			}
+		} else {
+			blk_unget_page(page, &pvec);
 		}
 
 		/* bio is ready, submit it */
RE: [PATCH] 4/4 block: explicit plugging
Andrew Morton wrote on Wednesday, January 03, 2007 12:09 AM
> Do you have any benchmarks which got faster with these changes?

Jens Axboe wrote on Wednesday, January 03, 2007 12:22 AM
> I've asked Ken to run this series on some of his big iron, I hope he'll
> have some results for us soonish. I can run some pseudo benchmarks on a
> 4-way here with some simulated storage to demonstrate the locking
> improvements.

Jens Axboe wrote on Thursday, January 04, 2007 6:39 AM
> > I will just retake the tip of your plug tree and retest.
>
> That would be great! There's a busy race fixed in the current branch,
> make sure that one is included as well.

Good news: the tip of the plug tree fixed the FC loop reset issue we were
seeing earlier.

Performance wise, our big db benchmark run came out with a 0.14%
regression compared to 2.6.20-rc2.  It is small enough that we declared it
a noise level change.  Unfortunately our internal profile tool broke on
2.6.20-rc2, so I don't have an execution profile to post.

- Ken
RE: open(O_DIRECT) on a tmpfs?
Hugh Dickins wrote on Thursday, January 04, 2007 11:14 AM
> On Thu, 4 Jan 2007, Hua Zhong wrote:
> > So I'd argue that it makes more sense to support O_DIRECT
> > on tmpfs as the memory IS the backing store.
>
> A few more voices in favour and I'll be persuaded.  Perhaps I'm
> out of date: when O_DIRECT came in, just a few filesystems supported
> it, and it was perfectly normal for open O_DIRECT to be failed; but
> I wouldn't want tmpfs to stand out now as a lone obstacle.

Maybe a bit hackish, but all we need is an empty .direct_IO method in
shmem_aops to make __dentry_open() pass the O_DIRECT check.  The
following patch adds 40 bytes to the kernel text on x86-64.  An even more
hackish but zero cost route is to make the .direct_IO member non-zero via
a cast of -1 or some such (that is probably ugly as hell).

diff -Nurp linus-2.6.git/mm/shmem.c linus-2.6.git.ken/mm/shmem.c
--- linus-2.6.git/mm/shmem.c	2006-12-27 19:06:11.0 -0800
+++ linus-2.6.git.ken/mm/shmem.c	2007-01-04 21:03:14.0 -0800
@@ -2314,10 +2314,18 @@ static void destroy_inodecache(void)
 	kmem_cache_destroy(shmem_inode_cachep);
 }
 
+ssize_t shmem_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
+			loff_t offset, unsigned long nr_segs)
+{
+	/* dummy direct_IO function. Not to be executed */
+	BUG();
+}
+
 static const struct address_space_operations shmem_aops = {
 	.writepage	= shmem_writepage,
 	.set_page_dirty	= __set_page_dirty_nobuffers,
 #ifdef CONFIG_TMPFS
+	.direct_IO	= shmem_direct_IO,
 	.prepare_write	= shmem_prepare_write,
 	.commit_write	= simple_commit_write,
 #endif
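For reference, the open-time check that the dummy method satisfies looks
roughly like this in __dentry_open() in fs/open.c of that era (quoted
from memory, so treat it as approximate):

	if (f->f_flags & O_DIRECT) {
		if (!f->f_mapping->a_ops ||
		    ((!f->f_mapping->a_ops->direct_IO) &&
		     (!f->f_mapping->a_ops->get_xip_page))) {
			fput(f);
			f = ERR_PTR(-EINVAL);
		}
	}

Any non-NULL .direct_IO pointer is enough to get past this test, which is
why a never-called BUG() stub works.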
RE: [PATCH] 4/4 block: explicit plugging
Jens Axboe wrote on Wednesday, January 03, 2007 2:30 PM
> > We are having some trouble with the patch set: some of our fiber channel
> > host controllers don't initialize properly anymore, and we thus lost a
> > whole bunch of disks (somewhere around 200 out of 900) at boot time.
> > Presumably the FC loop initialization commands are done through the
> > block layer etc.  I haven't looked into the problem closely.
> >
> > Jens, I assume the spin lock bug in __blk_run_queue is fixed in this
> > patch set?
>
> It is. Are you still seeing problems after the initial mail exchange we
> had prior to christmas,

Yes.  Not the same kernel panic, but a problem with the FC loop reset
itself.

> or are you referencing that initial problem?

No, we got past that point thanks to the bug fix patch you gave me prior
to Christmas.  That fixed a kernel panic on boot up.

> It's not likely to be a block layer issue, more likely the SCSI <->
> block interactions. If you mail me a new dmesg (if your problem is with
> the __blk_run_queue() fixups), I can take a look. Otherwise please do
> test with the __blk_run_queue() fixup, just use the current patchset.

I will just retake the tip of your plug tree and retest.
RE: [PATCH] 4/4 block: explicit plugging
Jens Axboe wrote on Wednesday, January 03, 2007 12:22 AM
> > Do you have any benchmarks which got faster with these changes?
>
> On the hardware I have immediately available, I see no regressions wrt
> performance. With instrumentation it's simple to demonstrate that most
> of the queueing activity of an io heavy benchmark spends less time in
> the kernel (most merging activity takes place outside of the queue lock,
> hence queueing is lock free).
>
> I've asked Ken to run this series on some of his big iron, I hope he'll
> have some results for us soonish.

We are having some trouble with the patch set: some of our fiber channel
host controllers don't initialize properly anymore, and we thus lost a
whole bunch of disks (somewhere around 200 out of 900) at boot time.
Presumably the FC loop initialization commands are done through the block
layer etc.  I haven't looked into the problem closely.

Jens, I assume the spin lock bug in __blk_run_queue is fixed in this patch
set?

- Ken
RE: [patch] aio: add per task aio wait event condition
Zach Brown wrote on Tuesday, January 02, 2007 6:06 PM
> > In the example you
> > gave earlier, task with min_nr of 2 will be woken up after 4 completed
> > events.
>
> I only gave 2 ios/events in that example.
>
> Does that clear up the confusion?

It occurs to me that people might not be aware how peculiar the current
io_getevents wakeup scheme is, to the extent of erratic behavior.

In the blocking path of read_events(), we essentially do the following
loop (simplified for clarity):

	while (i < nr) {
		add_wait_queue_exclusive(&ctx->wait, &wait);
		do {
			ret = aio_read_evt(ctx, &ent);
			if (!ret)
				schedule();
		} while (1);
		remove_wait_queue(&ctx->wait, &wait);
		copy_to_user(event, &ent, sizeof(ent));
	}

Notice that when a thread comes out of schedule(), it removes itself from
the wait queue, and requeues itself at the end of the wait queue, for
each and every event it reaps.  So if there are multiple threads waiting
in io_getevents, completed I/Os are handed out in a round robin scheme to
all waiting threads.  To illustrate it in an ascii graph, here is what
happens:

	thread 1			thread 2

	queue at head
	schedule()
					queue at 2nd position
					schedule()

	aio_complete (event 1)

	remove_wait_queue
	(now thread 2 is at head)
	reap event 1
	requeue at tail
	schedule()
					aio_complete (event 2)

					remove_wait_queue
					(now thread 1 is at head)
					reap event 2
					requeue at tail
					schedule()

If thread 1 sleeps first with min_nr = 2, and thread 2 sleeps second with
min_nr = 3, then thread 1 wakes up on event _3_.  But if thread 2 sleeps
first and thread 1 sleeps second, thread 1 wakes up on event _4_.  If
someone asked me to describe the algorithm of the io_getevents wake-up
scheme in the presence of multiple waiters, I would call it erratic and
non-deterministic.

Looking back at the example Zach gave earlier, the current implementation
behaves just like what he described as an undesired bug (modified and
tortured):

	issue 2 ops
	first io_getevents sleeps with a min_nr of 2
	second io_getevents sleeps with min_nr of 3
	2 ops complete
	first sleeper twiddles thumbs

So I can categorize my patchset as a bug fix instead of a performance
patch ;-)  Let's be serious, this ought to be fixed one way or the other.

- Ken
RE: [patch] aio: add per task aio wait event condition
Zach Brown wrote on Tuesday, January 02, 2007 6:06 PM
> On Jan 2, 2007, at 5:50 PM, Chen, Kenneth W wrote:
> > Zach Brown wrote on Tuesday, January 02, 2007 5:24 PM
> >>> That is not possible because when multiple tasks waiting for
> >>> events, they
> >>> enter the wait queue in FIFO order, prepare_to_wait_exclusive() does
> >>> __add_wait_queue_tail().  So first io_getevents() with min_nr of 2
> >>> will be woken up when 2 ops completes.
> >>
> >> So switch the order of the two sleepers in the example?
> >
> > Not sure why that would be a problem though: whoever sleep first will
> > be woken up first.
>
> Why would the min_nr = 3 sleeper be woken up in that case?  Only 2
> ios were issued.
>
> Maybe the app was relying on the min_nr = 2 completion to issue 3
> more ios for the min_nr = 3 sleeper, who knows.
>
> Does that clear up the confusion?

Not really.  I don't think I understand your concern.  You gave an
example:

	issue 2 ops
	first io_getevents sleeps with a min_nr of 2
	second io_getevents sleeps with min_nr of 3
	2 ops complete but only test the second sleeper's min_nr of 3
	first sleeper twiddles thumbs

Or:

	issue 2 ops
	first io_getevents sleeps with a min_nr of 3
	second io_getevents sleeps with min_nr of 2
	2 ops complete but only test the second sleeper's min_nr of 2
	first sleeper twiddles thumbs

The first scenario doesn't exist because in the new scheme, we test the
first sleeper (as in the head of the queue) when 2 ops complete.  It
wakes up first.  The 2nd scenario is OK to me because the first sleeper
is waiting for 3 events, and there are only 2 ops completed, so it waits.

The one scenario that I can think of that breaks down is that one task
sleeps with min_nr of 100.  Then 50 ops complete.  Along comes a 2nd
thread doing io_getevents, and it will take all 50 events.  Is that what
you are talking about?  It doesn't involve two sleepers.  That I can fix
by testing whether the wait queue is active or not at the beginning of
the fast path in read_events().

The bigger question is: what is the semantics of event reap order for
threads?  Random, FIFO or round robin?  It is not specified anywhere.
What would be the most optimal policy?
RE: [patch] aio: streamline read events after woken up
Zach Brown wrote on Tuesday, January 02, 2007 5:06 PM
> To: Chen, Kenneth W
> > Given the previous patch "aio: add per task aio wait event condition"
> > that we properly wake up event waiting process knowing that we have
> > enough events to reap, it's just plain waste of time to insert itself
> > into a wait queue, and then immediately remove itself from the wait
> > queue for *every* event reap iteration.
>
> Hmm, I dunno.  It seems like we're still left with a pretty silly loop.
>
> Would it be reasonable to have a loop that copied multiple events at
> a time?  We could use some __copy_to_user_inatomic(), it didn't exist
> when this stuff was first written.

It sounds reasonable, but I think it will be complicated because of the
kmap_atomic on the ring buffer, along with the tail wrapping around the
ring buffer index there.  By then, most of you would probably veto the
patch anyway ;-)
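What a batched copy might look like, as a rough sketch only: this
hypothetical helper pretends the ring is one flat kernel mapping
('evts'), which is precisely the simplification the real fs/aio.c does
not get -- its ring is spread across pages and needs a kmap_atomic per
page, the complication mentioned above.

	static long copy_events_batched(struct io_event *evts,
					unsigned ring_nr, unsigned *head,
					unsigned tail,
					struct io_event __user *uevent,
					long want)
	{
		long copied = 0;

		while (copied < want && *head != tail) {
			/* contiguous run: stop at tail or the wrap point */
			unsigned end = (*head < tail) ? tail : ring_nr;
			unsigned run = min_t(unsigned, end - *head,
					     want - copied);

			if (__copy_to_user_inatomic(uevent + copied,
						    evts + *head,
						    run * sizeof(*evts)))
				break;	/* fall back to a slow path */

			*head = (*head + run) % ring_nr;
			copied += run;
		}
		return copied;
	}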
RE: [patch] aio: add per task aio wait event condition
Zach Brown wrote on Tuesday, January 02, 2007 5:24 PM
> > That is not possible because when multiple tasks waiting for
> > events, they
> > enter the wait queue in FIFO order, prepare_to_wait_exclusive() does
> > __add_wait_queue_tail().  So first io_getevents() with min_nr of 2
> > will be woken up when 2 ops completes.
>
> So switch the order of the two sleepers in the example?

Not sure why that would be a problem though: whoever sleeps first will be
woken up first.

> The point is that there's no way to guarantee that the head of the
> wait queue will be the lowest min_nr.

Before I challenge that semantics, I want to mention that in the current
implementation, dribbling AIO events will be distributed in round robin
fashion to all pending tasks waiting in io_getevents.  In the example you
gave earlier, the task with min_nr of 2 will be woken up after 4
completed events.  I consider that an undesirable behavior as well.

Going back to your counter argument, why do we need the lowest min_nr at
the head of the queue?  These are tasks that share one aio ctx, and an
ioctx is shareable only among threads.  Any reason why a round robin
policy is superior to FIFO?  Also presumably, threads that share an ioctx
should be capable of handling events for the same ioctx.

From the wakeup order point of view, yes, tasks with the lowest min_nr
wake up first, but looking at it from the io completion order, they do
not.  And these are the source of excessive ctx switches.
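The FIFO claim comes straight from the waitqueue code; condensed from
kernel/wait.c of that era (the async-wait detail is elided here, so
treat this as a sketch rather than a verbatim quote):

	void prepare_to_wait_exclusive(wait_queue_head_t *q,
				       wait_queue_t *wait, int state)
	{
		unsigned long flags;

		wait->flags |= WQ_FLAG_EXCLUSIVE;
		spin_lock_irqsave(&q->lock, flags);
		if (list_empty(&wait->task_list))
			__add_wait_queue_tail(q, wait);	/* new waiters at tail */
		set_current_state(state);
		spin_unlock_irqrestore(&q->lock, flags);
	}

Since __wake_up() roughly walks the list from the head and stops after
waking an exclusive waiter, exclusive waiters are served in arrival
order.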
RE: [patch] aio: make aio_ring_info->nr_pages an unsigned int
Zach Brown wrote on Tuesday, January 02, 2007 5:14 PM
> To: Chen, Kenneth W
> > --- ./include/linux/aio.h.orig	2006-12-24 22:31:55.0 -0800
> > +++ ./include/linux/aio.h	2006-12-24 22:41:28.0 -0800
> > @@ -165,7 +165,7 @@ struct aio_ring_info {
> >
> >  	struct page		**ring_pages;
> >  	spinlock_t		ring_lock;
> > -	long			nr_pages;
> > +	unsigned		nr_pages;
> >
> >  	unsigned		nr, tail;
>
> Hmm.
>
> This seems so trivial as to not be worth it.  It'd be more compelling
> if it was more thorough -- doing things like updating the 'long i'
> iterators that a few have over ->nr_pages.  That kind of thing.
> Giving some confidence that the references of ->nr_pages were audited.

I had those changes earlier, but dropped them to make the patch smaller.
It all started with the head and tail index, which are defined as
unsigned int in the structure, while in aio.c all local variables doing
temporary head and tail calculations are unsigned long.  While cleaning
that up, it got expanded into nr_pages etc.  Oh well.

- Ken
RE: [patch] aio: add per task aio wait event condition
Zach Brown wrote on Tuesday, January 02, 2007 4:49 PM
> On Dec 29, 2006, at 6:31 PM, Chen, Kenneth W wrote:
> > This patch adds a wait condition to the wait queue and only wake-up
> > process when that condition meets.  And this condition is added on a
> > per task base for handling multi-threaded app that shares single
> > ioctx.
>
> But only one of the waiting tasks is tested, the one at the head of
> the list.  It looks like this change could starve a io_getevents()
> with a low min_nr in the presence of another io_getevents() with a
> larger min_nr.
>
> > +	if (waitqueue_active(&ctx->wait)) {
> > +		struct aio_wait_queue *wait;
> > +		wait = container_of(ctx->wait.task_list.next,
> > +				struct aio_wait_queue, wait.task_list);
> > +		if (nr_evt >= wait->nr_wait)
> > +			wake_up(&ctx->wait);
> > +	}
>
> First is the fear of starvation as mentioned previously.
>
> 	issue 2 ops
> 	first io_getevents sleeps with a min_nr of 2
> 	second io_getevents sleeps with min_nr of 3
> 	2 ops complete but only test the second sleeper's min_nr of 3
> 	first sleeper twiddles thumbs

That is not possible, because when multiple tasks are waiting for events,
they enter the wait queue in FIFO order; prepare_to_wait_exclusive() does
__add_wait_queue_tail().  So the first io_getevents() with min_nr of 2
will be woken up when 2 ops complete.
RE: [patch] remove redundant iov segment check
Zach Brown wrote on Tuesday, January 02, 2007 10:22 AM
> >> I wonder if it wouldn't be better to make this change as part of a
> >> larger change that moves towards an explicit iovec container struct
> >> rather than bare 'struct iov *' and 'nr_segs' arguments.
> >
> > I suspect it should be rather trivial to get this started.  As a first
> > step we simply add a
> >
> > struct iodesc {
> > 	int nr_segs;
> > 	struct iovec ioc[];
> > };
>
> Agreed, does anyone plan to try this in the near future?

Yes, I will take a stab at it this week.
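As a sketch of where that could go (hypothetical; neither the struct
layout nor the helper is an API that exists, just the shape floated
above):

	struct iodesc {
		unsigned long nr_segs;
		size_t count;		/* cached total byte length */
		struct iovec iov[];	/* flexible array of segments */
	};

	/* validate segment lengths once and cache the total */
	static inline ssize_t iodesc_set_length(struct iodesc *io)
	{
		size_t count = 0;
		unsigned long seg;

		for (seg = 0; seg < io->nr_segs; seg++) {
			ssize_t len = (ssize_t)io->iov[seg].iov_len;

			/* reject negative lengths and total overflow,
			 * as the generic iovec checks do */
			if (len < 0 || count + len < count)
				return -EINVAL;
			count += len;
		}
		io->count = count;
		return count;
	}

Callers would then pass a single 'struct iodesc *' down the stack instead
of re-checking (iov, nr_segs) pairs at every layer, which is the
redundancy the patch in this thread removes by hand.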
[patch] aio: make aio_ring_info->nr_pages an unsigned int
The number of io_events currently allowed in an AIO event queue is no
more than 2^32-1, because the syscall defines:

	asmlinkage long sys_io_setup(unsigned nr_events,
				     aio_context_t __user *ctxp);

We internally allocate a ring buffer for nr_events and keep track of the
page descriptors for each of these ring buffer pages.  Since the page
size is significantly larger than the AIO event size (4096 versus 32
bytes), I don't think it is ever possible to overflow nr_pages as a
32-bit quantity.  This patch changes nr_pages to unsigned int.  On a
64-bit arch, changing it to unsigned int also allows better packing of
the aio_ring_info structure.

Signed-off-by: Ken Chen <[EMAIL PROTECTED]>

--- ./include/linux/aio.h.orig	2006-12-24 22:31:55.0 -0800
+++ ./include/linux/aio.h	2006-12-24 22:41:28.0 -0800
@@ -165,7 +165,7 @@ struct aio_ring_info {
 
 	struct page		**ring_pages;
 	spinlock_t		ring_lock;
-	long			nr_pages;
+	unsigned		nr_pages;
 
 	unsigned		nr, tail;
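A quick back-of-envelope check of that claim (userspace sketch; the
4096/32 sizes are taken from the description above, and the aio_ring
header on the first page changes the result only marginally):

	#include <stdio.h>

	int main(void)
	{
		/* sys_io_setup() takes 'unsigned nr_events' */
		unsigned long long max_events = 0xffffffffULL;
		unsigned long long events_per_page = 4096 / 32;	/* 128 */
		unsigned long long max_pages =
			(max_events + events_per_page - 1) / events_per_page;

		/* prints 33554432, i.e. 2^25 -- comfortably inside
		 * the range of an unsigned int */
		printf("max ring pages: %llu\n", max_pages);
		return 0;
	}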
[patch] aio: remove spurious ring head index modulo info->nr
In aio_read_evt(), ring->head will never wrap past info->nr because we
already wrap it when updating the ring head index:

	if (head != ring->tail) {
		...
		head = (head + 1) % info->nr;
		ring->head = head;
	}

This makes the modulo on ring->head when loading it into the local
variable head unnecessary.  This patch removes that bogus code.

Signed-off-by: Ken Chen <[EMAIL PROTECTED]>

--- ./fs/aio.c.orig	2006-12-24 22:01:36.0 -0800
+++ ./fs/aio.c	2006-12-24 22:34:48.0 -0800
@@ -1019,7 +1019,7 @@ static int aio_read_evt(struct kioctx *i
 {
 	struct aio_ring_info *info = &ioctx->ring_info;
 	struct aio_ring *ring;
-	unsigned long head;
+	unsigned int head;
 	int ret = 0;
 
 	ring = kmap_atomic(info->ring_pages[0], KM_USER0);
@@ -1032,7 +1032,7 @@ static int aio_read_evt(struct kioctx *i
 
 	spin_lock(&info->ring_lock);
 
-	head = ring->head % info->nr;
+	head = ring->head;
 	if (head != ring->tail) {
 		struct io_event *evp = aio_ring_event(info, head, KM_USER1);
 		*ent = *evp;
[patch] aio: streamline read events after woken up
The read event loop in the blocking path is also inefficient.  For every
event it reaps (even when it doesn't block), it does the following in a
loop:

	while (i < nr) {
		prepare_to_wait_exclusive
		aio_read_evt
		finish_wait
		...
	}

Given the previous patch "aio: add per task aio wait event condition",
which properly wakes up the event waiting process knowing that we have
enough events to reap, it's just a plain waste of time to insert itself
into a wait queue, and then immediately remove itself from the wait
queue, for *every* event reap iteration.

This patch factors the wait queue insertion/deletion out of the event
reap loop and streamlines the event reaping after the process wakes up.

Signed-off-by: Ken Chen <[EMAIL PROTECTED]>

--- ./fs/aio.c.orig	2006-12-24 17:04:52.0 -0800
+++ ./fs/aio.c	2006-12-24 17:05:10.0 -0800
@@ -1174,42 +1174,40 @@ retry:
 	}
 
 	aio_init_wait(&wait);
+wait:
+	prepare_to_wait_exclusive(&ctx->wait, &wait.wait, TASK_INTERRUPTIBLE);
+	ret = aio_read_evt(ctx, &ent);
+	if (!ret) {
+		wait.nr_wait = min_nr - i;
+		schedule();
+		if (signal_pending(tsk))
+			ret = -EINTR;
+	}
+	finish_wait(&ctx->wait, &wait.wait);
+
+	if (ret < 0)
+		goto out_cleanup;
+
 	while (likely(i < nr)) {
-		do {
-			prepare_to_wait_exclusive(&ctx->wait, &wait.wait,
-						TASK_INTERRUPTIBLE);
-			ret = aio_read_evt(ctx, &ent);
-			if (ret)
-				break;
-			if (min_nr <= i)
-				break;
-			ret = 0;
-			if (to.timed_out)	/* Only check after read evt */
-				break;
-			wait.nr_wait = min_nr - i;
-			schedule();
-			if (signal_pending(tsk)) {
-				ret = -EINTR;
+		if (ret) {
+			if (unlikely(copy_to_user(event, &ent, sizeof(ent)))) {
+				dprintk("aio: lost an event due to EFAULT.\n");
+				ret = -EFAULT;
 				break;
 			}
-			/*ret = aio_read_evt(ctx, &ent);*/
-		} while (1) ;
-
-		finish_wait(&ctx->wait, &wait.wait);
-
-		if (unlikely(ret <= 0))
-			break;
+			event++;
+			i++;
+		}
 
-		ret = -EFAULT;
-		if (unlikely(copy_to_user(event, &ent, sizeof(ent)))) {
-			dprintk("aio: lost an event due to EFAULT.\n");
+		ret = aio_read_evt(ctx, &ent);
+		if (unlikely(!ret)) {
+			if (i < min_nr && !to.timed_out)
+				goto wait;
 			break;
 		}
-
-		/* Good, event copied to userland, update counts. */
-		event ++;
-		i ++;
 	}
 
+out_cleanup:
 	if (timeout)
 		clear_timeout(&to);
 out:
[patch] aio: add per task aio wait event condition
The AIO wake-up notification from aio_complete is really inefficient in
the current AIO implementation in the presence of a process waiting in
io_getevents().  For example, if an app calls io_getevents with
min_nr > 1 and the aio event queue doesn't have enough completed aio
events, the process will block in read_events().  However, aio_complete()
will wake up the waiting process for *each* completed I/O, even though
the number of events that the app is waiting for is much larger than 1.
This causes excessive and unnecessary context switches, because the
waiting process will just reap one single event and go back to sleep
again.  It is much more efficient to wake up the waiting process when
there are enough events for it to reap.

This patch adds a wait condition to the wait queue and only wakes up the
process when that condition is met.  And this condition is added on a per
task base for handling multi-threaded apps that share a single ioctx.

To show the effect of this patch, here is a vmstat output before and
after the patch.  The app does random O_DIRECT AIO on 60 disks.  Context
switch is reduced from 13 thousand+ down to just 40+, a significant
improvement.  (The si/so columns, zero throughout, are reconstructed here
from the vmstat header.)

Before:
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd    free  buff   cache  si  so    bi    bo   in    cs us sy id wa
 0  0      0 3972608  7056 3131200   0   0 14000     0 7840 13715  0  2 98  0
 0  0      0 3972608  7056 3131200   0   0 14300     0 7793 13641  0  2 98  0
 0  0      0 3972608  7056 3131200   0   0 14100     0 7885 13747  0  2 98  0

After:
 0  0      0 3972608  7056 3131200   0   0 14000     0 7840    49  0  2 98  0
 0  0      0 3972608  7056 3131200   0   0 13800     0 7793    53  0  2 98  0
 0  0      0 3972608  7056 3131200   0   0 13800     0 7885    42  0  2 98  0

Signed-off-by: Ken Chen <[EMAIL PROTECTED]>

--- ./fs/aio.c.orig	2006-12-24 16:41:39.0 -0800
+++ ./fs/aio.c	2006-12-24 16:52:15.0 -0800
@@ -193,6 +193,17 @@ static int aio_setup_ring(struct kioctx
 	kunmap_atomic((void *)((unsigned long)__event & PAGE_MASK), km); \
 } while(0)
 
+struct aio_wait_queue {
+	int		nr_wait;	/* wake-up condition */
+	wait_queue_t	wait;
+};
+
+static inline void aio_init_wait(struct aio_wait_queue *wait)
+{
+	wait->nr_wait = 0;
+	init_wait(&wait->wait);
+}
+
 /* ioctx_alloc
  *	Allocates and initializes an ioctx.  Returns an ERR_PTR if it failed.
  */
@@ -296,13 +307,14 @@ static void aio_cancel_all(struct kioctx
 static void wait_for_all_aios(struct kioctx *ctx)
 {
 	struct task_struct *tsk = current;
-	DECLARE_WAITQUEUE(wait, tsk);
+	struct aio_wait_queue wait;
 
 	spin_lock_irq(&ctx->ctx_lock);
 	if (!ctx->reqs_active)
 		goto out;
 
-	add_wait_queue(&ctx->wait, &wait);
+	aio_init_wait(&wait);
+	add_wait_queue(&ctx->wait, &wait.wait);
 	set_task_state(tsk, TASK_UNINTERRUPTIBLE);
 	while (ctx->reqs_active) {
 		spin_unlock_irq(&ctx->ctx_lock);
@@ -311,7 +323,7 @@ static void wait_for_all_aios(struct kio
 		spin_lock_irq(&ctx->ctx_lock);
 	}
 	__set_task_state(tsk, TASK_RUNNING);
-	remove_wait_queue(&ctx->wait, &wait);
+	remove_wait_queue(&ctx->wait, &wait.wait);
 
 out:
 	spin_unlock_irq(&ctx->ctx_lock);
@@ -932,6 +944,7 @@ int fastcall aio_complete(struct kiocb *
 	unsigned long flags;
 	unsigned long tail;
 	int ret;
+	int nr_evt = 0;
 
 	/*
 	 * Special case handling for sync iocbs:
@@ -992,6 +1005,9 @@ int fastcall aio_complete(struct kiocb *
 	info->tail = tail;
 	ring->tail = tail;
 
+	nr_evt = ring->tail - ring->head;
+	if (nr_evt < 0)
+		nr_evt += info->nr;
 	put_aio_ring_event(event, KM_IRQ0);
 	kunmap_atomic(ring, KM_IRQ1);
 
@@ -1000,8 +1016,13 @@ put_rq:
 	/* everything turned out well, dispose of the aiocb.
 	 */
 	ret = __aio_put_req(ctx, iocb);
 
-	if (waitqueue_active(&ctx->wait))
-		wake_up(&ctx->wait);
+	if (waitqueue_active(&ctx->wait)) {
+		struct aio_wait_queue *wait;
+		wait = container_of(ctx->wait.task_list.next,
+				struct aio_wait_queue, wait.task_list);
+		if (nr_evt >= wait->nr_wait)
+			wake_up(&ctx->wait);
+	}
 	spin_unlock_irqrestore(&ctx->ctx_lock, flags);
 	return ret;
@@ -1094,7 +1115,7 @@ static int read_events(struct kioctx *ct
 {
 	long			start_jiffies = jiffies;
 	struct task_struct	*tsk = current;
-	DECLARE_WAITQUEUE(wait, tsk);
+	struct aio_wait_queue	wait;
 	int			ret;
 	int			i = 0;
 	struct io_event		ent;
@@ -1152,10 +1173,11 @@ retry:
 		set_timeout(start_jiffies, &to, &ts);
 	}
 
+	aio_init_wait(&wait);
 	while (likely(i < nr)) {
-		add_wait_queue_exclusive(&ctx->wait, &wait);
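For context, here is a minimal userspace sketch (assuming the libaio
wrappers; device path and sizes are hypothetical, error handling
omitted) of the min_nr > 1 pattern this patch optimizes -- one
io_getevents call asking for a whole batch, where per-completion wakeups
are pure waste:

	#define _GNU_SOURCE
	#include <libaio.h>
	#include <fcntl.h>
	#include <stdlib.h>

	#define NR 64

	int main(void)
	{
		io_context_t ctx = 0;
		struct iocb cbs[NR], *cbp[NR];
		struct io_event events[NR];
		void *buf;
		int fd, i;

		fd = open("/dev/sdX", O_RDONLY | O_DIRECT);	/* hypothetical device */
		io_setup(NR, &ctx);

		for (i = 0; i < NR; i++) {
			posix_memalign(&buf, 4096, 4096);	/* O_DIRECT alignment */
			io_prep_pread(&cbs[i], fd, buf, 4096, (long long)i * 4096);
			cbp[i] = &cbs[i];
		}
		io_submit(ctx, NR, cbp);

		/* Block until all NR completions are available: with the
		 * patch, aio_complete() only wakes this task on the NRth
		 * event instead of NR times. */
		io_getevents(ctx, NR, NR, events, NULL);

		io_destroy(ctx);
		return 0;
	}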
[patch] aio: fix buggy put_ioctx call in aio_complete - v2
An AIO bug was reported that a sleeping function is being called in
softirq context:

BUG: warning at kernel/mutex.c:132/__mutex_lock_common()
Call Trace:
 [<a00100577b00>] __mutex_lock_slowpath+0x640/0x6c0
 [<a00100577ba0>] mutex_lock+0x20/0x40
 [<a001000a25b0>] flush_workqueue+0xb0/0x1a0
 [<a0010018c0c0>] __put_ioctx+0xc0/0x240
 [<a0010018d470>] aio_complete+0x2f0/0x420
 [<a0010019cc80>] finished_one_bio+0x200/0x2a0
 [<a0010019d1c0>] dio_bio_complete+0x1c0/0x200
 [<a0010019d260>] dio_bio_end_aio+0x60/0x80
 [<a0010014acd0>] bio_endio+0x110/0x1c0
 [<a001002770e0>] __end_that_request_first+0x180/0xba0
 [<a00100277b90>] end_that_request_chunk+0x30/0x60
 [<a002073c0c70>] scsi_end_request+0x50/0x300 [scsi_mod]
 [<a002073c1240>] scsi_io_completion+0x200/0x8a0 [scsi_mod]
 [<a002074729b0>] sd_rw_intr+0x330/0x860 [sd_mod]
 [<a002073b3ac0>] scsi_finish_command+0x100/0x1c0 [scsi_mod]
 [<a002073c2910>] scsi_softirq_done+0x230/0x300 [scsi_mod]
 [<a00100277d20>] blk_done_softirq+0x160/0x1c0
 [<a00100083e00>] __do_softirq+0x200/0x240
 [<a00100083eb0>] do_softirq+0x70/0xc0

See report: http://marc.theaimsgroup.com/?l=linux-kernel&m=116599593200888&w=2

flush_workqueue() is not allowed to be called in softirq context.
However, aio_complete() called from an I/O interrupt can potentially call
put_ioctx with the last ref count on the ioctx, and trigger the bug.  It
is simply incorrect to perform ioctx freeing from aio_complete.

The bug is trigger-able from a race between io_destroy() and
aio_complete().  A possible scenario:

	cpu0				cpu1
	io_destroy			aio_complete
	  wait_for_all_aios {		  __aio_put_req
	     ...			     ctx->reqs_active--;
	     if (!ctx->reqs_active)
		return;
	  }
	  ...
	  put_ioctx(ioctx)
					  put_ioctx(ctx);
					    __put_ioctx
					      bam! Bug trigger!

The real problem is that the condition check of ctx->reqs_active in
wait_for_all_aios() is incorrect, in that access to reqs_active is not
properly protected by a spin lock.  This patch adds that protective spin
lock, and at the same time removes all the duplicate ref counting for
each kiocb, as reqs_active is already used as a ref count for each active
ioctx.  This also ensures that the buggy call to flush_workqueue() in
softirq context is eliminated.

Signed-off-by: Ken Chen <[EMAIL PROTECTED]>

--- ./fs/aio.c.orig	2006-12-21 08:08:14.0 -0800
+++ ./fs/aio.c	2006-12-21 08:14:27.0 -0800
@@ -298,17 +298,23 @@ static void wait_for_all_aios(struct kio
 	struct task_struct *tsk = current;
 	DECLARE_WAITQUEUE(wait, tsk);
 
+	spin_lock_irq(&ctx->ctx_lock);
 	if (!ctx->reqs_active)
-		return;
+		goto out;
 
 	add_wait_queue(&ctx->wait, &wait);
 	set_task_state(tsk, TASK_UNINTERRUPTIBLE);
 	while (ctx->reqs_active) {
+		spin_unlock_irq(&ctx->ctx_lock);
 		schedule();
 		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
+		spin_lock_irq(&ctx->ctx_lock);
 	}
 	__set_task_state(tsk, TASK_RUNNING);
 	remove_wait_queue(&ctx->wait, &wait);
+
+out:
+	spin_unlock_irq(&ctx->ctx_lock);
 }
 
 /* wait_on_sync_kiocb:
@@ -424,7 +430,6 @@ static struct kiocb fastcall *__aio_get_
 	ring = kmap_atomic(ctx->ring_info.ring_pages[0], KM_USER0);
 	if (ctx->reqs_active < aio_ring_avail(&ctx->ring_info, ring)) {
 		list_add(&req->ki_list, &ctx->active_reqs);
-		get_ioctx(ctx);
 		ctx->reqs_active++;
 		okay = 1;
 	}
@@ -536,8 +541,6 @@ int fastcall aio_put_req(struct kiocb *r
 	spin_lock_irq(&ctx->ctx_lock);
 	ret = __aio_put_req(ctx, req);
 	spin_unlock_irq(&ctx->ctx_lock);
-	if (ret)
-		put_ioctx(ctx);
 	return ret;
 }
@@ -782,8 +785,7 @@ static int __aio_run_iocbs(struct kioctx
 		 */
 		iocb->ki_users++;	/* grab extra reference */
 		aio_run_iocb(iocb);
-		if (__aio_put_req(ctx, iocb))	/* drop extra ref */
-			put_ioctx(ctx);
+		__aio_put_req(ctx, iocb);
 	}
 	if (!list_empty(&ctx->run_list))
 		return 1;
@@ -998,14 +1000,10 @@ put_rq:
 	/* everything turned out well, dispose of the aiocb.
 	 */
 	ret = __aio_put_req(ctx, iocb);
 
-	spin_unlock_irqrestore(&ctx->ctx_lock, flags);
-
 	if (waitqueue_active(&ctx->wait))
 		wake_up(&ctx->wait);
 
-	if (ret)
-		put_ioctx(ctx);
-
+	spin_unlock_irqrestore(&ctx->ctx_lock, flags);
 	return ret;
 }
[patch] aio: add per task aio wait event condition
The AIO wake-up notification from aio_complete is really inefficient in the current AIO implementation in the presence of a process waiting in io_getevents(). For example, if an app calls io_getevents with min_nr > 1 and the aio event queue doesn't have enough completed aio events, the process will block in read_events(). However, aio_complete() will wake up the waiting process for *each* completed I/O, even though the number of events the app is waiting for is much larger than 1. This causes excessive and unnecessary context switches, because the waiting process will just reap one single event and go back to sleep again. It is much more efficient to wake up the waiting process when there are enough events for it to reap.

This patch adds a wait condition to the wait queue and only wakes up the process when that condition is met. The condition is added on a per-task basis to handle multi-threaded apps that share a single ioctx.

To show the effect of this patch, here is vmstat output before and after the patch. The app does random O_DIRECT AIO on 60 disks. Context switches are reduced from 13 thousand+ down to just 40+, a significant improvement.

Before:
procs ---memory-- ---swap-- -io --system-- cpu
 r b swpd    free buff   cache si so    bi bo   in    cs us sy id wa
 0 0    0 3972608 7056 3131200       14000  0 7840 13715  0  2 98  0
 0 0    0 3972608 7056 3131200       14300  0 7793 13641  0  2 98  0
 0 0    0 3972608 7056 3131200       14100  0 7885 13747  0  2 98  0

After:
 0 0    0 3972608 7056 3131200       14000  0 7840    49  0  2 98  0
 0 0    0 3972608 7056 3131200       13800  0 7793    53  0  2 98  0
 0 0    0 3972608 7056 3131200       13800  0 7885    42  0  2 98  0

Signed-off-by: Ken Chen <[EMAIL PROTECTED]>

--- ./fs/aio.c.orig	2006-12-24 16:41:39.0 -0800
+++ ./fs/aio.c	2006-12-24 16:52:15.0 -0800
@@ -193,6 +193,17 @@ static int aio_setup_ring(struct kioctx
 	kunmap_atomic((void *)((unsigned long)__event & PAGE_MASK), km); \
 } while(0)
 
+struct aio_wait_queue {
+	int		nr_wait;	/* wake-up condition */
+	wait_queue_t	wait;
+};
+
+static inline void aio_init_wait(struct aio_wait_queue *wait)
+{
+	wait->nr_wait = 0;
+	init_wait(&wait->wait);
+}
+
 /* ioctx_alloc
  *	Allocates and initializes an ioctx.  Returns an ERR_PTR if it failed.
  */
@@ -296,13 +307,14 @@ static void aio_cancel_all(struct kioctx
 static void wait_for_all_aios(struct kioctx *ctx)
 {
 	struct task_struct *tsk = current;
-	DECLARE_WAITQUEUE(wait, tsk);
+	struct aio_wait_queue wait;
 
 	spin_lock_irq(&ctx->ctx_lock);
 	if (!ctx->reqs_active)
 		goto out;
 
-	add_wait_queue(&ctx->wait, &wait);
+	aio_init_wait(&wait);
+	add_wait_queue(&ctx->wait, &wait.wait);
 	set_task_state(tsk, TASK_UNINTERRUPTIBLE);
 	while (ctx->reqs_active) {
 		spin_unlock_irq(&ctx->ctx_lock);
@@ -311,7 +323,7 @@ static void wait_for_all_aio
 		spin_lock_irq(&ctx->ctx_lock);
 	}
 	__set_task_state(tsk, TASK_RUNNING);
-	remove_wait_queue(&ctx->wait, &wait);
+	remove_wait_queue(&ctx->wait, &wait.wait);
 
 out:
 	spin_unlock_irq(&ctx->ctx_lock);
@@ -932,6 +944,7 @@ int fastcall aio_complete(struct kiocb *
 	unsigned long	flags;
 	unsigned long	tail;
 	int		ret;
+	int		nr_evt = 0;
 
 	/*
 	 * Special case handling for sync iocbs:
@@ -992,6 +1005,9 @@ int fastcall aio_complete(struct kiocb *
 	info->tail = tail;
 	ring->tail = tail;
 
+	nr_evt = ring->tail - ring->head;
+	if (nr_evt < 0)
+		nr_evt += info->nr;
 	put_aio_ring_event(event, KM_IRQ0);
 	kunmap_atomic(ring, KM_IRQ1);
@@ -1000,8 +1016,13 @@ put_rq:
 	/* everything turned out well, dispose of the aiocb. */
 	ret = __aio_put_req(ctx, iocb);
 
-	if (waitqueue_active(&ctx->wait))
-		wake_up(&ctx->wait);
+	if (waitqueue_active(&ctx->wait)) {
+		struct aio_wait_queue *wait;
+		wait = container_of(ctx->wait.task_list.next,
+				    struct aio_wait_queue, wait.task_list);
+		if (nr_evt >= wait->nr_wait)
+			wake_up(&ctx->wait);
+	}
 
 	spin_unlock_irqrestore(&ctx->ctx_lock, flags);
 	return ret;
@@ -1094,7 +1115,7 @@ static int read_events(struct kioctx *ct
 {
 	long			start_jiffies = jiffies;
 	struct task_struct	*tsk = current;
-	DECLARE_WAITQUEUE(wait, tsk);
+	struct aio_wait_queue	wait;
 	int			ret;
 	int			i = 0;
 	struct io_event		ent;
@@ -1152,10 +1173,11 @@ retry:
 		set_timeout(start_jiffies, &to, &ts);
 	}
 
+	aio_init_wait(&wait);
 	while (likely(i < nr)) {
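To see the user-visible semantics this optimizes, here is a minimal, self-contained user-space sketch using the libaio API (the device path and I/O sizes are hypothetical; build with -laio). With min_nr set to 32, a kernel that wakes the waiter on every single completion does 31 useless wakeups per batch; with the patch above, the task is only woken once enough events have queued:

	#define _GNU_SOURCE		/* for O_DIRECT */
	#include <libaio.h>
	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>

	#define NR_IO	64
	#define MIN_NR	32

	int main(void)
	{
		io_context_t ctx;
		struct iocb iocbs[NR_IO], *iocbp[NR_IO];
		struct io_event events[NR_IO];
		void *buf;
		int fd, i, n;

		memset(&ctx, 0, sizeof(ctx));
		if (io_setup(NR_IO, &ctx) < 0) {
			perror("io_setup");
			return 1;
		}

		fd = open("/dev/sdb", O_RDONLY | O_DIRECT);	/* hypothetical disk */
		if (fd < 0) {
			perror("open");
			return 1;
		}

		for (i = 0; i < NR_IO; i++) {
			if (posix_memalign(&buf, 4096, 4096))
				return 1;
			io_prep_pread(&iocbs[i], fd, buf, 4096, (long long)i * 4096);
			iocbp[i] = &iocbs[i];
		}
		if (io_submit(ctx, NR_IO, iocbp) != NR_IO) {
			perror("io_submit");
			return 1;
		}

		/* Block until at least MIN_NR completions are queued.  This is
		 * the wait whose wake-up frequency the patch reduces. */
		n = io_getevents(ctx, MIN_NR, NR_IO, events, NULL);
		printf("reaped %d events\n", n);

		io_destroy(ctx);
		return 0;
	}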
[patch] aio: streamline read events after woken up
The read event loop in the blocking path is also inefficient. For every event it reaps (when not blocking), it does the following in a loop:

	while (i < nr) {
		prepare_to_wait_exclusive
		aio_read_evt
		finish_wait
		...
	}

Given the previous patch, "aio: add per task aio wait event condition", which wakes up the event-waiting process only when there are enough events to reap, it is a plain waste of time to insert the task into a wait queue and then immediately remove it from the wait queue for *every* event-reap iteration. This patch factors the wait queue insertion/deletion out of the event reap loop and streamlines the event reaping after the process wakes up.

Signed-off-by: Ken Chen <[EMAIL PROTECTED]>

--- ./fs/aio.c.orig	2006-12-24 17:04:52.0 -0800
+++ ./fs/aio.c	2006-12-24 17:05:10.0 -0800
@@ -1174,42 +1174,40 @@ retry:
 	}
 
 	aio_init_wait(&wait);
+wait:
+	prepare_to_wait_exclusive(&ctx->wait, &wait.wait, TASK_INTERRUPTIBLE);
+	ret = aio_read_evt(ctx, &ent);
+	if (!ret) {
+		wait.nr_wait = min_nr - i;
+		schedule();
+		if (signal_pending(tsk))
+			ret = -EINTR;
+	}
+	finish_wait(&ctx->wait, &wait.wait);
+
+	if (ret < 0)
+		goto out_cleanup;
+
 	while (likely(i < nr)) {
-		do {
-			prepare_to_wait_exclusive(&ctx->wait, &wait.wait,
-						  TASK_INTERRUPTIBLE);
-			ret = aio_read_evt(ctx, &ent);
-			if (ret)
-				break;
-			if (min_nr <= i)
-				break;
-			ret = 0;
-			if (to.timed_out)	/* Only check after read evt */
-				break;
-			wait.nr_wait = min_nr - i;
-			schedule();
-			if (signal_pending(tsk)) {
-				ret = -EINTR;
+		if (ret) {
+			if (unlikely(copy_to_user(event, &ent, sizeof(ent)))) {
+				dprintk("aio: lost an event due to EFAULT.\n");
+				ret = -EFAULT;
 				break;
 			}
-			/*ret = aio_read_evt(ctx, &ent);*/
-		} while (1) ;
-
-		finish_wait(&ctx->wait, &wait.wait);
-
-		if (unlikely(ret <= 0))
-			break;
+			event++;
+			i++;
+		}
 
-		ret = -EFAULT;
-		if (unlikely(copy_to_user(event, &ent, sizeof(ent)))) {
-			dprintk("aio: lost an event due to EFAULT.\n");
+		ret = aio_read_evt(ctx, &ent);
+		if (unlikely(!ret)) {
+			if (i < min_nr && !to.timed_out)
+				goto wait;
 			break;
 		}
-
-		/* Good, event copied to userland, update counts. */
-		event ++;
-		i ++;
 	}
+out_cleanup:
 	if (timeout)
 		clear_timeout(&to);
 out:
[patch] aio: remove spurious ring head index modulo info->nr
In aio_read_evt(), ring->head will never wrap past info->nr, because we already do the wrap when updating the ring head index:

	if (head != ring->tail) {
		...
		head = (head + 1) % info->nr;
		ring->head = head;
	}

This makes the modulo of ring->head into the local variable head unnecessary. This patch removes that bogus code.

Signed-off-by: Ken Chen <[EMAIL PROTECTED]>

--- ./fs/aio.c.orig	2006-12-24 22:01:36.0 -0800
+++ ./fs/aio.c	2006-12-24 22:34:48.0 -0800
@@ -1019,7 +1019,7 @@ static int aio_read_evt(struct kioctx *i
 {
 	struct aio_ring_info *info = &ioctx->ring_info;
 	struct aio_ring *ring;
-	unsigned long head;
+	unsigned int head;
 	int ret = 0;
 
 	ring = kmap_atomic(info->ring_pages[0], KM_USER0);
@@ -1032,7 +1032,7 @@ static int aio_read_evt(struct kioctx *i
 
 	spin_lock(&info->ring_lock);
 
-	head = ring->head % info->nr;
+	head = ring->head;
 	if (head != ring->tail) {
 		struct io_event *evp = aio_ring_event(info, head, KM_USER1);
 		*ent = *evp;
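The invariant being relied on -- store a pre-wrapped index at update time, so the read side never needs a modulo -- is easy to demonstrate with a stand-alone sketch (invented names, not the aio ring code):

	#include <stdio.h>

	#define NR 8			/* ring size */

	static unsigned int head, tail;	/* both always kept in [0, NR) */
	static int ring[NR];

	static int ring_read(int *out)
	{
		if (head == tail)
			return 0;		/* empty */
		*out = ring[head];		/* no "head % NR" needed here... */
		head = (head + 1) % NR;		/* ...because the wrap happens here */
		return 1;
	}

	static int ring_write(int v)
	{
		unsigned int next = (tail + 1) % NR;

		if (next == head)
			return 0;		/* full */
		ring[tail] = v;
		tail = next;
		return 1;
	}

	int main(void)
	{
		int i, v;

		for (i = 0; i < 20; i++) {	/* force several wraps */
			ring_write(i);
			if (ring_read(&v))
				printf("%d\n", v);
		}
		return 0;
	}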
[patch] aio: make aio_ring_info->nr_pages an unsigned int
The number of io_events allowed in the AIO event queue is currently no more than 2^32-1, because the syscall defines:

	asmlinkage long sys_io_setup(unsigned nr_events, aio_context_t __user *ctxp)

We internally allocate a ring buffer for nr_events and keep track of page descriptors for each of these ring buffer pages. Since the page size is significantly larger than the AIO event size (4096 versus 32 bytes), I don't think it is ever possible to overflow nr_pages as a 32-bit quantity. This patch changes nr_pages to unsigned int. On a 64-bit arch, changing it to unsigned int also allows better packing of the aio_ring_info structure.

Signed-off-by: Ken Chen <[EMAIL PROTECTED]>

--- ./include/linux/aio.h.orig	2006-12-24 22:31:55.0 -0800
+++ ./include/linux/aio.h	2006-12-24 22:41:28.0 -0800
@@ -165,7 +165,7 @@ struct aio_ring_info {
 	struct page		**ring_pages;
 	spinlock_t		ring_lock;
-	long			nr_pages;
+	unsigned		nr_pages;
 
 	unsigned		nr, tail;
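A quick back-of-the-envelope check of that claim, as a stand-alone program (32 is the io_event size and 4096 the page size assumed in the text above):

	#include <stdio.h>
	#include <stdint.h>

	int main(void)
	{
		uint64_t max_events = 0xffffffffULL;		/* largest unsigned nr_events */
		uint64_t ring_bytes = max_events * 32;		/* 32 bytes per io_event */
		uint64_t nr_pages   = (ring_bytes + 4095) / 4096;

		/* prints 33554432, i.e. 2^25 -- far below the 2^32 ceiling of
		 * an unsigned int, so the narrower type cannot overflow */
		printf("worst-case nr_pages = %llu\n", (unsigned long long)nr_pages);
		return 0;
	}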
RE: [PATCH] mm: fix page_mkclean_one
Chen, Kenneth wrote on Wednesday, December 27, 2006 9:55 PM
> Linus Torvalds wrote on Wednesday, December 27, 2006 7:05 PM
> > On Wed, 27 Dec 2006, David Miller wrote:
> > > >
> > > > I still don't see _why_, though. But maybe smarter people than me can see
> > > > it..
> > >
> > > FWIW this program definitely triggers the bug for me.
> >
> > Ok, now that I have something simple to do repeatable stuff with, I can
> > say what the pattern is.. It's not all that surprising, but it's still
> > worth just stating for the record.
>
> Running the test code, git bisect points its finger at this commit. Reverting
> this commit on top of 2.6.20-rc2 doesn't trigger the bug from the test code.
>
> edc79b2a46ed854595e40edcf3f8b37f9f14aa3f is first bad commit
> commit edc79b2a46ed854595e40edcf3f8b37f9f14aa3f
> Author: Peter Zijlstra <[EMAIL PROTECTED]>
> Date: Mon Sep 25 23:30:58 2006 -0700
>
>     [PATCH] mm: balance dirty pages
>
>     Now that we can detect writers of shared mappings, throttle them. Avoids OOM
>     by surprise.

Oh, never mind :-( I just didn't create enough write-out pressure when testing this. I just saw the bug get triggered on a kernel I previously thought was OK.
RE: [PATCH] mm: fix page_mkclean_one
Linus Torvalds wrote on Wednesday, December 27, 2006 7:05 PM
> On Wed, 27 Dec 2006, David Miller wrote:
> >
> > > I still don't see _why_, though. But maybe smarter people than me can see
> > > it..
> >
> > FWIW this program definitely triggers the bug for me.
>
> Ok, now that I have something simple to do repeatable stuff with, I can
> say what the pattern is.. It's not all that surprising, but it's still
> worth just stating for the record.

Running the test code, git bisect points its finger at this commit. Reverting this commit on top of 2.6.20-rc2 doesn't trigger the bug from the test code.

edc79b2a46ed854595e40edcf3f8b37f9f14aa3f is first bad commit
commit edc79b2a46ed854595e40edcf3f8b37f9f14aa3f
Author: Peter Zijlstra <[EMAIL PROTECTED]>
Date: Mon Sep 25 23:30:58 2006 -0700

    [PATCH] mm: balance dirty pages

    Now that we can detect writers of shared mappings, throttle them. Avoids OOM
    by surprise.

    Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
    Cc: Hugh Dickins <[EMAIL PROTECTED]>
    Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
    Signed-off-by: Linus Torvalds <[EMAIL PROTECTED]>
RE: [patch] aio: fix buggy put_ioctx call in aio_complete
[EMAIL PROTECTED] wrote on Thursday, December 21, 2006 9:35 AM
> kenneth.w.chen> Take ioctx_lock is one part, the other part is to move
> kenneth.w.chen> spin_unlock_irqrestore(&ctx->ctx_lock, flags);
> kenneth.w.chen> in aio_complete all the way down to the end of the
> kenneth.w.chen> function, after wakeup and put_ioctx. But then the ref
> kenneth.w.chen> counting on ioctx in aio_complete path is meaningless,
> kenneth.w.chen> which is the thing I'm trying to remove.
>
> OK, right. But are we simply papering over the real problem? Earlier in
> this thread, you stated:
>
> > flush_workqueue() is not allowed to be called in the softirq context.
> > However, aio_complete() called from I/O interrupt can potentially call
> > put_ioctx with last ref count on ioctx and trigger a bug warning. It
> > is simply incorrect to perform ioctx freeing from aio_complete.
>
> But how do we end up with the last reference to the ioctx in the aio
> completion path? That's a question I haven't seen answered.

It is explained in this posting, a race between io_destroy and aio_complete:
http://groups.google.com/group/linux.kernel/msg/d2ba7d73aca1dd0c?hl=en

Trond spotted a bug in that posting. The correct way of locking is to move the spin_unlock_irqrestore in aio_complete all the way down instead of calling aio_put_req at the end. Like this:

--- ./fs/aio.c.orig	2006-12-21 08:08:14.0 -0800
+++ ./fs/aio.c	2006-12-21 08:14:27.0 -0800
@@ -298,17 +298,23 @@ static void wait_for_all_aios(struct kio
 	struct task_struct *tsk = current;
 	DECLARE_WAITQUEUE(wait, tsk);
 
+	spin_lock_irq(&ctx->ctx_lock);
 	if (!ctx->reqs_active)
-		return;
+		goto out;
 
 	add_wait_queue(&ctx->wait, &wait);
 	set_task_state(tsk, TASK_UNINTERRUPTIBLE);
 	while (ctx->reqs_active) {
+		spin_unlock_irq(&ctx->ctx_lock);
 		schedule();
 		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
+		spin_lock_irq(&ctx->ctx_lock);
 	}
 	__set_task_state(tsk, TASK_RUNNING);
 	remove_wait_queue(&ctx->wait, &wait);
+
+out:
+	spin_unlock_irq(&ctx->ctx_lock);
 }
 
 /* wait_on_sync_kiocb:
@@ -424,7 +430,6 @@ static struct kiocb fastcall *__aio_get_
 	ring = kmap_atomic(ctx->ring_info.ring_pages[0], KM_USER0);
 	if (ctx->reqs_active < aio_ring_avail(&ctx->ring_info, ring)) {
 		list_add(&req->ki_list, &ctx->active_reqs);
-		get_ioctx(ctx);
 		ctx->reqs_active++;
 		okay = 1;
 	}
@@ -536,8 +541,6 @@ int fastcall aio_put_req(struct kiocb *r
 	spin_lock_irq(&ctx->ctx_lock);
 	ret = __aio_put_req(ctx, req);
 	spin_unlock_irq(&ctx->ctx_lock);
-	if (ret)
-		put_ioctx(ctx);
 	return ret;
 }
@@ -782,8 +785,7 @@ static int __aio_run_iocbs(struct kioctx
 		 */
 		iocb->ki_users++;	/* grab extra reference */
 		aio_run_iocb(iocb);
-		if (__aio_put_req(ctx, iocb))	/* drop extra ref */
-			put_ioctx(ctx);
+		__aio_put_req(ctx, iocb);
 	}
 	if (!list_empty(&ctx->run_list))
 		return 1;
@@ -998,14 +1000,10 @@ put_rq:
 	/* everything turned out well, dispose of the aiocb. */
 	ret = __aio_put_req(ctx, iocb);
-	spin_unlock_irqrestore(&ctx->ctx_lock, flags);
-
 	if (waitqueue_active(&ctx->wait))
 		wake_up(&ctx->wait);
-	if (ret)
-		put_ioctx(ctx);
-
+	spin_unlock_irqrestore(&ctx->ctx_lock, flags);
 	return ret;
 }
RE: [patch] aio: fix buggy put_ioctx call in aio_complete
[EMAIL PROTECTED] wrote on Thursday, December 21, 2006 8:56 AM
> kenneth.w.chen> I think I'm going to abandon this whole synchronize thing
> kenneth.w.chen> and going to put the wake up call inside ioctx_lock spin
> kenneth.w.chen> lock along with the other patch you mentioned above in the
> kenneth.w.chen> waiter path. On top of that, I have another patch attempts
> kenneth.w.chen> to perform wake-up only when the waiter can truly proceed
> kenneth.w.chen> in aio_read_evt so dribbling I/O completion doesn't
> kenneth.w.chen> inefficiently waking up waiter too frequently and only to
> kenneth.w.chen> have waiter put back to sleep again. I will dig that up and
> kenneth.w.chen> experiment.
>
> In the mean time, can't we simply take the context lock in
> wait_for_all_aios? Unless I missed something, I think that will address
> the reference count problem.

Taking ioctx_lock is one part; the other part is to move

	spin_unlock_irqrestore(&ctx->ctx_lock, flags);

in aio_complete all the way down to the end of the function, after the wakeup and put_ioctx. But then the ref counting on the ioctx in the aio_complete path is meaningless, which is the thing I'm trying to remove.
RE: [patch] aio: fix buggy put_ioctx call in aio_complete
Andrew Morton wrote on Thursday, December 21, 2006 12:18 AM
> Alas, your above description doesn't really tell us what the bug is, so I'm
> at a bit of a loss here.
>
> <finds http://marc.theaimsgroup.com/?l=linux-aio&m=116616463009218&w=2>
>
> So that's a refcounting bug. But it's really a locking bug, because
> refcounting needs locking too.

I should've quoted the original bug report (kicking myself for those fat fingers!):
http://marc.theaimsgroup.com/?l=linux-kernel&m=116599593200888&w=2

The bug manifested from an expectation that __put_ioctx can be called in softirq context, but it really shouldn't be. Normally, aio_complete will not decrement the last ref count on the ioctx, but under a stressed system, it might.

> > Problem is in wait_for_all_aios(), it is checking wait status without
> > properly holding an ioctx lock. Perhaps, this patch is walking on thin
> > ice. It abuses rcu over a buggy code. OTOH, I really don't want to hold
> > ctx_lock over the entire wakeup call at the end of aio_complete:
> >
> > 	if (waitqueue_active(&ctx->wait))
> > 		wake_up(&ctx->wait);
> >
> > I'm worried about longer lock hold time in aio_complete and potentially
> > increase lock contention for concurrent I/O completion.
>
> There is a potential problem where we deliver a storm of wakeups at the
> waitqueue, and until the woken-up process actually gets going and removes
> itself from the waitqueue, those wake_up()s do lots of work waking up an
> already-runnable task.
>
> If we were using DEFINE_WAIT/prepare_to_wait/finish_wait in there then the
> *first* wake_up() would do some work, but all the following ones are
> practically free.
>
> So if you're going to get in there and run the numbers on this, try both
> approaches.

Yes, I agree, and I would like to bring your patch on "DEFINE_WAIT..." back too. I will run that experiment.

> > A quick look
> > at lockmeter data I had on a 4 socket system (with dual core + HT), it
> > already showing signs of substantial lock contention in aio_complete.
> > I'm afraid putting the above call inside ioctx_lock will make things
> > worse.
>
> It beats oopsing.

Yeah, correctness absolutely overrules optimization. I just hope I can find a way to satisfy both.

> > And synchronize_rcu fits the bill perfectly: aio_complete sets wakeup
> > status, drop ioctx_lock, do the wakeup call all protected inside rcu
> > lock. Then wait_for_all_aios will just wait for all that sequence to
> > complete before it proceed with __put_ioctx(). All nice and easy.
>
> Possibly it would be less abusive to use preempt_disable()/enable (with
> suitable comments) and synchronize_kernel(). To at least remove the rcu
> signals from in there.

I think I'm going to abandon this whole synchronize thing and instead put the wake-up call inside the ioctx_lock spin lock, along with the other patch you mentioned above in the waiter path. On top of that, I have another patch that attempts to perform the wake-up only when the waiter can truly proceed in aio_read_evt, so dribbling I/O completion doesn't inefficiently wake the waiter too frequently only to have it put back to sleep again. I will dig that up and experiment.
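For context, the idiom Andrew refers to looks roughly like this (a generic sketch of the kernel wait-queue pattern, not code from fs/aio.c; "condition" is a placeholder):

	DEFINE_WAIT(wait);

	for (;;) {
		prepare_to_wait(&ctx->wait, &wait, TASK_UNINTERRUPTIBLE);
		if (condition)
			break;
		schedule();
	}
	finish_wait(&ctx->wait, &wait);

DEFINE_WAIT() sets up the entry with autoremove_wake_function, so the first wake_up() dequeues the task; subsequent wake_up() calls find the waitqueue entry already removed and are nearly free, which is exactly the property that tames a storm of completions.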
RE: [patch] aio: fix buggy put_ioctx call in aio_complete
Andrew Morton wrote on Wednesday, December 20, 2006 8:06 PM
> On Tue, 19 Dec 2006 13:49:18 -0800
> "Chen, Kenneth W" <[EMAIL PROTECTED]> wrote:
> > Regarding to a bug report on:
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=116599593200888&w=2
> >
> > flush_workqueue() is not allowed to be called in the softirq context.
> > However, aio_complete() called from I/O interrupt can potentially call
> > put_ioctx with last ref count on ioctx and trigger a bug warning. It
> > is simply incorrect to perform ioctx freeing from aio_complete.
> >
> > This patch removes all duplicate ref counting for each kiocb as
> > reqs_active already used as a request ref count for each active ioctx.
> > This also ensures that buggy call to flush_workqueue() in softirq
> > context is eliminated. wait_for_all_aios currently will wait on last
> > active kiocb. However, it is racy. This patch also tightens it up
> > by utilizing the rcu synchronization mechanism to ensure no further
> > reference to ioctx before put_ioctx function is run.
>
> hrm, maybe. Does this count as "abuse of the RCU interfaces". Or "reuse"?

Yeah, it's abuse. The problem is in wait_for_all_aios(): it is checking wait status without properly holding an ioctx lock. Perhaps this patch is walking on thin ice. It abuses rcu over a buggy code. OTOH, I really don't want to hold ctx_lock over the entire wakeup call at the end of aio_complete:

	if (waitqueue_active(&ctx->wait))
		wake_up(&ctx->wait);

I'm worried about longer lock hold time in aio_complete and potentially increased lock contention for concurrent I/O completion. From a quick look at lockmeter data I had on a 4-socket system (with dual core + HT), it is already showing signs of substantial lock contention in aio_complete. I'm afraid putting the above call inside ioctx_lock will make things worse.

And synchronize_rcu fits the bill perfectly: aio_complete sets wakeup status, drops ioctx_lock, and does the wakeup call, all protected inside the rcu lock. Then wait_for_all_aios will just wait for all of that sequence to complete before it proceeds with __put_ioctx(). All nice and easy.
RE: [RFC PATCH 1/8] rqbased-dm: allow blk_get_request() to be called from interrupt context
Kiyoshi Ueda wrote on Wednesday, December 20, 2006 9:50 AM
> On Wed, 20 Dec 2006 14:48:49 +0100, Jens Axboe <[EMAIL PROTECTED]> wrote:
> > Big NACK on this - it's not only really ugly, it's also buggy to pass
> > interrupt flags as function arguments. As you also mention in the 0/1
> > mail, this also breaks CFQ.
> >
> > Why do you need in-interrupt request allocation?
>
> Because I'd like to use blk_get_request() in q->request_fn()
> which can be called from interrupt context like below:
> scsi_io_completion -> scsi_end_request -> scsi_next_command
> -> scsi_run_queue -> blk_run_queue -> q->request_fn
>
> [ ...]
>
> Do you think creating another function like blk_get_request_nowait()
> is acceptable?

You don't need to create another function. blk_get_request already has both wait and nowait semantics via the gfp_mask argument. If you cannot block, then clear the __GFP_WAIT bit in the mask before calling blk_get_request.

- Ken
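A minimal sketch of what that looks like at a call site (illustrative fragment only, assuming a request queue pointer q is in scope; GFP_ATOMIC is simply a mask without __GFP_WAIT set):

	struct request *rq;

	/* no __GFP_WAIT in the mask: the allocation may fail, but it will
	 * never sleep, so this is safe from interrupt context */
	rq = blk_get_request(q, WRITE, GFP_ATOMIC);
	if (!rq)
		return;		/* out of requests; the caller must retry later */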
[patch] aio: fix buggy put_ioctx call in aio_complete
Regarding a bug report on:
http://marc.theaimsgroup.com/?l=linux-kernel&m=116599593200888&w=2

flush_workqueue() is not allowed to be called in the softirq context. However, aio_complete() called from I/O interrupt can potentially call put_ioctx with the last ref count on the ioctx and trigger a bug warning. It is simply incorrect to perform ioctx freeing from aio_complete.

This patch removes all duplicate ref counting for each kiocb, as reqs_active is already used as a request ref count for each active ioctx. This also ensures that the buggy call to flush_workqueue() in softirq context is eliminated. wait_for_all_aios currently will wait on the last active kiocb. However, it is racy. This patch also tightens it up by utilizing the rcu synchronization mechanism to ensure no further references to the ioctx before the put_ioctx function is run.

Signed-off-by: Ken Chen <[EMAIL PROTECTED]>

--- ./fs/aio.c.orig	2006-12-19 08:35:01.0 -0800
+++ ./fs/aio.c	2006-12-19 08:46:34.0 -0800
@@ -308,6 +308,7 @@ static void wait_for_all_aios(struct kio
 		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
 	}
 	__set_task_state(tsk, TASK_RUNNING);
+	synchronize_rcu();
 	remove_wait_queue(&ctx->wait, &wait);
 }
 
@@ -425,7 +426,6 @@ static struct kiocb fastcall *__aio_get_
 	ring = kmap_atomic(ctx->ring_info.ring_pages[0], KM_USER0);
 	if (ctx->reqs_active < aio_ring_avail(&ctx->ring_info, ring)) {
 		list_add(&req->ki_list, &ctx->active_reqs);
-		get_ioctx(ctx);
 		ctx->reqs_active++;
 		okay = 1;
 	}
@@ -538,8 +538,6 @@ int fastcall aio_put_req(struct kiocb *r
 	spin_lock_irq(&ctx->ctx_lock);
 	ret = __aio_put_req(ctx, req);
 	spin_unlock_irq(&ctx->ctx_lock);
-	if (ret)
-		put_ioctx(ctx);
 	return ret;
 }
@@ -795,8 +793,7 @@ static int __aio_run_iocbs(struct kioctx
 		 */
 		iocb->ki_users++;	/* grab extra reference */
 		aio_run_iocb(iocb);
-		if (__aio_put_req(ctx, iocb))	/* drop extra ref */
-			put_ioctx(ctx);
+		__aio_put_req(ctx, iocb);
 	}
 	if (!list_empty(&ctx->run_list))
 		return 1;
@@ -1012,6 +1009,7 @@ int fastcall aio_complete(struct kiocb *
 		iocb->ki_nbytes - iocb->ki_left, iocb->ki_nbytes);
 put_rq:
 	/* everything turned out well, dispose of the aiocb. */
+	rcu_read_lock();
 	ret = __aio_put_req(ctx, iocb);
 	spin_unlock_irqrestore(&ctx->ctx_lock, flags);
 
@@ -1019,9 +1017,7 @@ put_rq:
 	if (waitqueue_active(&ctx->wait))
 		wake_up(&ctx->wait);
 
-	if (ret)
-		put_ioctx(ctx);
-
+	rcu_read_unlock();
 	return ret;
 }
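Reduced to its generic RCU shape, the synchronization the patch relies on is the following handshake (sketch only, not the fs/aio.c code itself):

	/* completion side: everything that may still touch the ioctx runs
	 * inside an RCU read-side critical section */
	rcu_read_lock();
	/* ... final __aio_put_req(), unlock, wake up waiters ... */
	rcu_read_unlock();

	/* teardown side: synchronize_rcu() returns only after every
	 * read-side section that began before it has finished, so the
	 * ioctx cannot be freed out from under a completion in flight */
	synchronize_rcu();
	/* ... safe to run __put_ioctx() ... */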
RE: [PATCH] incorrect direct io error handling
Dmitriy Monakhov wrote on Monday, December 18, 2006 5:23 AM
> This patch is the result of a discussion started a week ago here:
> http://lkml.org/lkml/2006/12/11/66
> changes from original patch:
> - Update wrong comments about i_mutex locking.
> - Add BUG_ON(!mutex_is_locked(..)) for non blkdev.
> - vmtruncate call only for non blockdev
> LOG:
> If generic_file_direct_write() has failed (ENOSPC condition) inside
> __generic_file_aio_write_nolock(), it may have instantiated
> a few blocks outside i_size, and fsck will complain about wrong i_size
> (ext2, ext3 and reiserfs interpret a difference between i_size and the biggest
> block as an error); after fsck fixes the error, i_size will be increased to
> the biggest block, but these blocks contain garbage from the previous write
> attempt. This is not an information leak, but it is silent file data
> corruption. This issue affects filesystems regardless of the values of
> blocksize or pagesize.
> We need to truncate any block beyond i_size after the write has failed, as is
> done in the similar generic_file_buffered_write() error path. If the host is
> !S_ISBLK, i_mutex is always held inside generic_file_aio_write_nolock() and we
> may safely call vmtruncate().
> Some fs (XFS at least) may directly call generic_file_direct_write() with
> i_mutex not held. There is no general scenario in this case. Such filesystems
> have to handle generic_file_direct_write() errors in their own specific way
> (place).

I'm puzzled: if ext2 is able to instantiate some blocks, then why does it return a no-space error? Where is the error coming from?
RE: [PATCH] IA64: alignment bug in ldscript
Kirill Korotaev wrote on Monday, December 18, 2006 4:05 AM
> [IA64] bug in ldscript (mainstream)
>
> Occasionally, in mainstream number of fsys entries is even.

Is "fsys entries is even" a typo? If not, then this changelog is misleading. It is the instruction patch list of FSYS_RETURN that can potentially cause other data structures to be out of alignment, and the number of FSYS_RETURN call sites will not necessarily match the number of fsys entries.

> In OpenVZ it is odd and we get misaligned kernel image,
> which does not boot.

Otherwise, the patch looks fine to me.
RE: 2.6.18.4: flush_workqueue calls mutex_lock in interrupt environment
Trond Myklebust wrote on Friday, December 15, 2006 6:01 AM
> Oops. Missed the fact that you have removed the put_ioctx from
> aio_put_req, but the first sentence is still true. If you try to wake up
> wait_for_all_aios before you've changed the condition it is waiting for,
> then it may end up hanging forever.

The easy fix for that is to put the wake_up in aio_complete inside the ctx spin lock.

> Why not fix this by having the context freed via an RCU callback? That
> way you can protect the combined call to aio_put_req() +
> wake_up(&ctx->wait) using a simple preempt_off/preempt_on, and all is
> good.

That has been suggested before on a different subject. I will whip up something.
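For concreteness, the suggestion might look roughly like this. It is a hypothetical sketch: it assumes struct kioctx grows an rcu_head member named "rcu", which fs/aio.c does not have as-is (kioctx_cachep is the existing slab cache in fs/aio.c):

	static void ioctx_rcu_free(struct rcu_head *head)
	{
		struct kioctx *ctx = container_of(head, struct kioctx, rcu);

		kmem_cache_free(kioctx_cachep, ctx);
	}

	/* in place of freeing the ioctx directly from __put_ioctx(): */
	call_rcu(&ctx->rcu, ioctx_rcu_free);

The completion path could then bracket the final __aio_put_req() plus wake_up() with preempt_disable()/preempt_enable(), as Trond describes, and the deferred free would be guaranteed not to run until that region has been left.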
RE: [PATCH] incorrect error handling inside generic_file_direct_write
Christoph Hellwig wrote on Friday, December 15, 2006 2:44 AM
> So we're doing the sync_page_range once in __generic_file_aio_write
> with i_mutex held.
>
> > 	mutex_lock(&inode->i_mutex);
> > -	ret = __generic_file_aio_write_nolock(iocb, iov, nr_segs,
> > -			&iocb->ki_pos);
> > +	ret = __generic_file_aio_write(iocb, iov, nr_segs, pos);
> > 	mutex_unlock(&inode->i_mutex);
> >
> > 	if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
>
> And then another time after it's unlocked, this seems wrong.

I didn't invent that mess though. I should've asked the question first: in 2.6.20-rc1, generic_file_aio_write will call sync_page_range twice, once from __generic_file_aio_write_nolock and once within the function itself. Is it redundant? Can we delete the one in the top-level function? Like the following?

--- ./mm/filemap.c.orig	2006-12-15 09:02:58.0 -0800
+++ ./mm/filemap.c	2006-12-15 09:03:19.0 -0800
@@ -2370,14 +2370,6 @@ ssize_t generic_file_aio_write(struct ki
 	ret = __generic_file_aio_write_nolock(iocb, iov, nr_segs,
 			&iocb->ki_pos);
 	mutex_unlock(&inode->i_mutex);
-
-	if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
-		ssize_t err;
-
-		err = sync_page_range(inode, mapping, pos, ret);
-		if (err < 0)
-			ret = err;
-	}
 	return ret;
 }
 EXPORT_SYMBOL(generic_file_aio_write);
RE: 2.6.18.4: flush_workqueue calls mutex_lock in interrupt environment
Chen, Kenneth wrote on Thursday, December 14, 2006 5:59 PM
> > It seems utterly insane to have aio_complete() flush a workqueue. That
> > function has to be called from a number of different environments,
> > including non-sleep tolerant environments.
> >
> > For instance it means that directIO on NFS will now cause the rpciod
> > workqueues to call flush_workqueue(aio_wq), thus slowing down all RPC
> > activity.
>
> The bug appears to be somewhere else; somehow the ref count on the ioctx
> is messed up.
>
> In aio_complete, __put_ioctx() should not be invoked, because the ref
> count on the ioctx is supposed to be more than 2; aio_complete decrements
> it once and should return without invoking the free function.
>
> The real freeing of the ioctx should come from exit_aio() or io_destroy(),
> both of which wait until there are no more pending AIO requests, via
> wait_for_all_aios().

Ah, I think I see the bug: it must be a race between io_destroy() and
aio_complete().  A possible scenario:

	cpu0				cpu1
	io_destroy			aio_complete
	  wait_for_all_aios {		  __aio_put_req
	      ...			      ctx->reqs_active--;
	      if (!ctx->reqs_active)
	          return;
	  }
	  ...
	  put_ioctx(ioctx)
					  put_ioctx(ctx); /* bam! bug triggered */

The AIO finished on cpu1, and while cpu1 is in the middle of aio_complete,
cpu0 starts the io_destroy sequence, sees no pending AIO, and goes ahead
and decrements the ref count on the ioctx.  At a later point in
aio_complete, put_ioctx decrements the last ref count and calls the ioctx
freeing function, and there it triggers the bug warning.

A simple fix is to access ctx->reqs_active inside the ctx spin lock in
wait_for_all_aios().  At the same time, I would like to remove the ref
counting on each iocb, because we already perform that ref counting via
reqs_active.  This would also prevent similar buggy code in the future.

Signed-off-by: Ken Chen <[EMAIL PROTECTED]>

--- ./fs/aio.c.orig	2006-11-29 13:57:37.000000000 -0800
+++ ./fs/aio.c	2006-12-14 20:45:14.000000000 -0800
@@ -298,17 +298,23 @@ static void wait_for_all_aios(struct kio
 	struct task_struct *tsk = current;
 	DECLARE_WAITQUEUE(wait, tsk);

+	spin_lock_irq(&ctx->ctx_lock);
 	if (!ctx->reqs_active)
-		return;
+		goto out;

 	add_wait_queue(&ctx->wait, &wait);
 	set_task_state(tsk, TASK_UNINTERRUPTIBLE);
 	while (ctx->reqs_active) {
+		spin_unlock_irq(&ctx->ctx_lock);
 		schedule();
 		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
+		spin_lock_irq(&ctx->ctx_lock);
 	}
 	__set_task_state(tsk, TASK_RUNNING);
 	remove_wait_queue(&ctx->wait, &wait);
+
+out:
+	spin_unlock_irq(&ctx->ctx_lock);
 }

 /* wait_on_sync_kiocb:
@@ -425,7 +431,6 @@ static struct kiocb fastcall *__aio_get_
 	ring = kmap_atomic(ctx->ring_info.ring_pages[0], KM_USER0);
 	if (ctx->reqs_active < aio_ring_avail(&ctx->ring_info, ring)) {
 		list_add(&req->ki_list, &ctx->active_reqs);
-		get_ioctx(ctx);
 		ctx->reqs_active++;
 		okay = 1;
 	}
@@ -538,8 +543,6 @@ int fastcall aio_put_req(struct kiocb *r
 	spin_lock_irq(&ctx->ctx_lock);
 	ret = __aio_put_req(ctx, req);
 	spin_unlock_irq(&ctx->ctx_lock);
-	if (ret)
-		put_ioctx(ctx);
 	return ret;
 }
@@ -795,8 +798,7 @@ static int __aio_run_iocbs(struct kioctx
 		 */
 		iocb->ki_users++;	/* grab extra reference */
 		aio_run_iocb(iocb);
-		if (__aio_put_req(ctx, iocb))	/* drop extra ref */
-			put_ioctx(ctx);
+		__aio_put_req(ctx, iocb);
 	}
 	if (!list_empty(&ctx->run_list))
 		return 1;
@@ -942,7 +944,6 @@ int fastcall aio_complete(struct kiocb *
 	struct io_event	*event;
 	unsigned long	flags;
 	unsigned long	tail;
-	int		ret;

 	/*
 	 * Special case handling for sync iocbs:
@@ -1011,18 +1012,12 @@ int fastcall aio_complete(struct kiocb *
 	pr_debug("%ld retries: %zd of %zd\n", iocb->ki_retried,
 		iocb->ki_nbytes - iocb->ki_left, iocb->ki_nbytes);
 put_rq:
-	/* everything turned out well, dispose of the aiocb. */
-	ret = __aio_put_req(ctx, iocb);
-
 	spin_unlock_irqrestore(&ctx->ctx_lock, flags);

 	if (waitqueue_active(&ctx->wait))
 		wake_up(&ctx->wait);

-	if (ret)
-		put_ioctx(ctx);
-
-	return ret;
+	return aio_put_req(iocb);
 }

 /* aio_read_evt
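For readers following the locking change above, here is a minimal userspace
sketch of the check-under-lock pattern the fix applies. It is illustrative
only, assuming pthreads; the names mirror the kernel structures, but none of
this is the actual fs/aio.c code.

#include <pthread.h>
#include <sched.h>
#include <stdio.h>

struct ioctx {
	pthread_mutex_t lock;	/* stand-in for ctx->ctx_lock */
	int reqs_active;	/* pending AIO requests */
};

/* completion path: decrement under the lock (what __aio_put_req does) */
static void complete_req(struct ioctx *ctx)
{
	pthread_mutex_lock(&ctx->lock);
	ctx->reqs_active--;
	pthread_mutex_unlock(&ctx->lock);
}

/*
 * destroy path: both the initial check and every re-check of reqs_active
 * happen under the lock, so "no pending requests" can no longer be
 * concluded while a completion is still mid-flight on another cpu.
 */
static void wait_for_all(struct ioctx *ctx)
{
	pthread_mutex_lock(&ctx->lock);
	while (ctx->reqs_active) {
		pthread_mutex_unlock(&ctx->lock);
		sched_yield();	/* the kernel version sleeps on a waitqueue */
		pthread_mutex_lock(&ctx->lock);
	}
	pthread_mutex_unlock(&ctx->lock);
}

int main(void)
{
	struct ioctx ctx = { PTHREAD_MUTEX_INITIALIZER, 1 };
	complete_req(&ctx);	/* the one outstanding request finishes */
	wait_for_all(&ctx);	/* now returns immediately */
	printf("done\n");
	return 0;
}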
RE: 2.6.18.4: flush_workqueue calls mutex_lock in interrupt environment
Andrew Morton wrote on Thursday, December 14, 2006 5:20 PM
> it's hard to disagree.
>
> Begin forwarded message:
> > On Wed, 2006-12-13 at 08:25 +0100, xb wrote:
> > > Hi all,
> > >
> > > Running some IO stress tests on an 8-way IA64 platform, we got a
> > > BUG: warning at kernel/mutex.c:132/__mutex_lock_common() message
> > > followed by:
> > > Unable to handle kernel paging request at virtual address
> > > 00200200
> > > an oops corresponding to anon_vma_unlink() calling list_del() on a
> > > poisoned list.
> > >
> > > Having a look at the stack, we see that flush_workqueue() calls
> > > mutex_lock() with softirqs disabled.
> >
> > Something is wrong here... flush_workqueue() is a sleeping function and
> > is not allowed to be called in such a context!
>
> It seems utterly insane to have aio_complete() flush a workqueue. That
> function has to be called from a number of different environments,
> including non-sleep tolerant environments.
>
> For instance it means that directIO on NFS will now cause the rpciod
> workqueues to call flush_workqueue(aio_wq), thus slowing down all RPC
> activity.

The bug appears to be somewhere else; somehow the ref count on the ioctx
is messed up.

In aio_complete, __put_ioctx() should not be invoked, because the ref count
on the ioctx is supposed to be more than 2; aio_complete decrements it once
and should return without invoking the free function.

The real freeing of the ioctx should come from exit_aio() or io_destroy(),
both of which wait until there are no more pending AIO requests, via
wait_for_all_aios().

- Ken
RE: cfq performance gap
Miquel van Smoorenburg wrote on Wednesday, December 13, 2006 1:57 AM
> Chen, Kenneth W <[EMAIL PROTECTED]> wrote:
> > This rawio test plows through sequential I/O and distributes each small
> > record over a number of threads. So each thread appears to be
> > non-contiguous within its own process context, while the overall requests
> > hitting the device are sequential. I can't see how any application does
> > that kind of I/O pattern.
>
> A NNTP server that has many incoming connections, handled by
> multiple threads, that stores the data in cyclic buffers?

Then whichever thread dumps the buffer contents to the storage will do one
large contiguous I/O.
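To illustrate Miquel's scenario, here is a minimal sketch of that buffering
scheme, assuming POSIX I/O and pthreads; the buffer size, the names, and the
omission of wraparound handling are all simplifications, not any real NNTP
server's code. Many connection threads append small records, and one flush
produces a single large contiguous write.

#include <pthread.h>
#include <string.h>
#include <unistd.h>

#define BUF_SIZE (1 << 20)	/* 1MB cyclic buffer, illustrative */

static char buf[BUF_SIZE];
static size_t head;		/* next free byte; wraparound ignored here */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* many connection threads append small records */
void append_record(const char *rec, size_t len)
{
	pthread_mutex_lock(&lock);
	if (head + len <= BUF_SIZE) {
		memcpy(buf + head, rec, len);
		head += len;
	}
	pthread_mutex_unlock(&lock);
}

/* one thread flushes the accumulated region: one large contiguous write */
void flush_buffer(int fd, off_t dev_off)
{
	pthread_mutex_lock(&lock);
	pwrite(fd, buf, head, dev_off);
	head = 0;
	pthread_mutex_unlock(&lock);
}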
RE: cfq performance gap
AVANTIKA R. MATHUR wrote on Tuesday, December 12, 2006 5:33 PM
> >> rawio is actually performing sequential reads, but I don't believe it is
> >> purely sequential with the multiple processes.
> >> I am currently running the test with longer runtimes and will post
> >> results once it is complete.
> >> I've also attached the rawio source.
> >
> > It's certainly the slice and idling hurting here. But at the same time,
> > I don't really think your test case is very interesting. The test area
> > is very small and you have 16 threads trying to read the same thing,
> > optimizing for that would be silly as I don't think it has much real
> > world relevance.
>
> Could a database have similar workload to this test?

No. Nothing I have seen with db workloads exhibits such a pattern. There
are basically two types of db workloads. One does transaction processing,
where the I/O pattern is truly random with a large stride, both within a
process context and in the overall I/O seen at the device level. The second
is decision-making type db queries, which do large sequential I/O within
one process context.

This rawio test plows through sequential I/O and distributes each small
record over a number of threads. So each thread appears to be
non-contiguous within its own process context, while the overall requests
hitting the device are sequential. I can't see how any application does
that kind of I/O pattern.

- Ken
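To make the access pattern being described concrete, here is a hedged sketch
of the per-thread read loop; the record size, thread count, and names are
assumptions, not the actual rawio source. Thread t reads records t, t+N,
t+2N, ..., so each thread strides across the device while the union of all
threads covers it sequentially.

#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

#define RECORD_SIZE	4096	/* assumed small record */
#define NR_THREADS	16

struct worker { int fd; int tid; size_t nr_records; };

/*
 * Per-thread body: each thread reads every NR_THREADS-th record.  Strided
 * (non-contiguous) within the thread, but all threads together touch the
 * device sequentially.
 */
static void *reader(void *arg)
{
	struct worker *w = arg;
	char *buf = malloc(RECORD_SIZE);
	size_t i;

	for (i = w->tid; i < w->nr_records; i += NR_THREADS)
		pread(w->fd, buf, RECORD_SIZE, (off_t)i * RECORD_SIZE);
	free(buf);
	return NULL;
}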
RE: [PATCH] incorrect error handling inside generic_file_direct_write
Andrew Morton wrote on Tuesday, December 12, 2006 2:40 AM
> On Tue, 12 Dec 2006 16:18:32 +0300
> Dmitriy Monakhov <[EMAIL PROTECTED]> wrote:
> > >> but according to filemap's locking rules: mm/filemap.c:77
> > >>   ...
> > >>   *  ->i_mutex			(generic_file_buffered_write)
> > >>   *    ->mmap_sem		(fault_in_pages_readable->do_page_fault)
> > >>   ...
> > >> I'm confused a little bit, where is the truth?
> > >
> > > xfs_write() calls generic_file_direct_write() without taking i_mutex for
> > > O_DIRECT writes.
> >
> > Yes, but my question is about __generic_file_aio_write_nolock().
> > As I understand it, the _nolock suffix means that i_mutex was already
> > locked by the caller, am I right?
>
> Nope. It just means that __generic_file_aio_write_nolock() doesn't take
> the lock. We don't assume or require that the caller took it. For example
> the raw driver calls generic_file_aio_write_nolock() without taking
> i_mutex. Raw isn't relevant to the problem (although ocfs2 might be). But
> we cannot assume that all callers have taken i_mutex, I think.

I think we should also clean up generic_file_aio_write_nolock. This was
brought up a couple of weeks ago and I gave up too early. Here is my second
attempt.

How about the following patch: I think we can kill
generic_file_aio_write_nolock and merge both *file_aio_write_nolock
variants into one function, so that

	generic_file_aio_write
	ocfs2_file_aio_write
	blk_dev->aio_write

all point to a non-locking version, __generic_file_aio_write(). The first
two already hold i_mutex, while the block device's aio_write method doesn't
require i_mutex to be held.

Signed-off-by: Ken Chen <[EMAIL PROTECTED]>

diff -Nurp linux-2.6.19/drivers/char/raw.c linux-2.6.19.ken/drivers/char/raw.c
--- linux-2.6.19/drivers/char/raw.c	2006-11-29 13:57:37.000000000 -0800
+++ linux-2.6.19.ken/drivers/char/raw.c	2006-12-12 16:41:39.000000000 -0800
@@ -242,7 +242,7 @@ static const struct file_operations raw_
 	.read	=	do_sync_read,
 	.aio_read =	generic_file_aio_read,
 	.write	=	do_sync_write,
-	.aio_write =	generic_file_aio_write_nolock,
+	.aio_write =	__generic_file_aio_write,
 	.open	=	raw_open,
 	.release =	raw_release,
 	.ioctl	=	raw_ioctl,
diff -Nurp linux-2.6.19/fs/block_dev.c linux-2.6.19.ken/fs/block_dev.c
--- linux-2.6.19/fs/block_dev.c	2006-11-29 13:57:37.000000000 -0800
+++ linux-2.6.19.ken/fs/block_dev.c	2006-12-12 16:47:58.000000000 -0800
@@ -1198,7 +1198,7 @@ const struct file_operations def_blk_fop
 	.read		= do_sync_read,
 	.write		= do_sync_write,
 	.aio_read	= generic_file_aio_read,
-	.aio_write	= generic_file_aio_write_nolock,
+	.aio_write	= __generic_file_aio_write,
 	.mmap		= generic_file_mmap,
 	.fsync		= block_fsync,
 	.unlocked_ioctl	= block_ioctl,
diff -Nurp linux-2.6.19/fs/ocfs2/file.c linux-2.6.19.ken/fs/ocfs2/file.c
--- linux-2.6.19/fs/ocfs2/file.c	2006-11-29 13:57:37.000000000 -0800
+++ linux-2.6.19.ken/fs/ocfs2/file.c	2006-12-12 16:42:09.000000000 -0800
@@ -1107,7 +1107,7 @@ static ssize_t ocfs2_file_aio_write(stru
 	/* communicate with ocfs2_dio_end_io */
 	ocfs2_iocb_set_rw_locked(iocb);

-	ret = generic_file_aio_write_nolock(iocb, iov, nr_segs, iocb->ki_pos);
+	ret = __generic_file_aio_write(iocb, iov, nr_segs, iocb->ki_pos);

 	/* buffered aio wouldn't have proper lock coverage today */
 	BUG_ON(ret == -EIOCBQUEUED && !(filp->f_flags & O_DIRECT));
diff -Nurp linux-2.6.19/include/linux/fs.h linux-2.6.19.ken/include/linux/fs.h
--- linux-2.6.19/include/linux/fs.h	2006-11-29 13:57:37.000000000 -0800
+++ linux-2.6.19.ken/include/linux/fs.h	2006-12-12 16:41:58.000000000 -0800
@@ -1742,7 +1742,7 @@ extern int file_send_actor(read_descript
 int generic_write_checks(struct file *file, loff_t *pos, size_t *count, int isblk);
 extern ssize_t generic_file_aio_read(struct kiocb *, const struct iovec *, unsigned long, loff_t);
 extern ssize_t generic_file_aio_write(struct kiocb *, const struct iovec *, unsigned long, loff_t);
-extern ssize_t generic_file_aio_write_nolock(struct kiocb *, const struct iovec *,
+extern ssize_t __generic_file_aio_write(struct kiocb *, const struct iovec *,
 		unsigned long, loff_t);
 extern ssize_t generic_file_direct_write(struct kiocb *, const struct iovec *,
 		unsigned long *, loff_t, loff_t *, size_t, size_t);
diff -Nurp linux-2.6.19/mm/filemap.c linux-2.6.19.ken/mm/filemap.c
--- linux-2.6.19/mm/filemap.c	2006-11-29 13:57:37.000000000 -0800
+++ linux-2.6.19.ken/mm/filemap.c	2006-12-12 16:47:58.000000000 -0800
@@ -2219,9 +2219,9 @@ zero_length_segment:
 }
 EXPORT_SYMBOL(generic_file_buffered_write);

-static ssize_t
-__generic_file_aio_write_nolock(struct kiocb *iocb, const struct iovec *iov,
-				unsigned long nr_segs, loff_t *ppos)
+ssize_t
+__generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
+				unsigned long nr_segs, loff_t *ppos)
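A sketch of the resulting calling convention, for illustration only; the
wrapper below is hypothetical, and the bodies are stubs rather than the real
2.6.19 code. Locked entry points take i_mutex themselves and then call the
non-locking worker, while the block device entry point calls the worker
directly.

#include <linux/fs.h>
#include <linux/aio.h>
#include <linux/mutex.h>
#include <linux/uio.h>

/* the single non-locking worker everything converges on */
extern ssize_t __generic_file_aio_write(struct kiocb *iocb,
					const struct iovec *iov,
					unsigned long nr_segs, loff_t *ppos);

/* regular-file style wrapper: provides i_mutex itself (illustrative) */
static ssize_t example_file_aio_write(struct kiocb *iocb,
				      const struct iovec *iov,
				      unsigned long nr_segs, loff_t pos)
{
	struct inode *inode = iocb->ki_filp->f_mapping->host;
	ssize_t ret;

	mutex_lock(&inode->i_mutex);
	ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
	mutex_unlock(&inode->i_mutex);
	return ret;
}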
RE: [PATCH] connector: Some fixes for ia64 unaligned access errors
Pete Zaitcev wrote on Monday, December 11, 2006 5:29 PM
> On Mon, 11 Dec 2006 15:52:47 -0800, Matt Helsley <[EMAIL PROTECTED]> wrote:
> > I'm shocked memcpy() introduces 8-byte stores that violate architecture
> > alignment rules. Is there any chance this is a bug in ia64's memcpy()
> > implementation? I've tried to read it but since I'm not familiar with
> > ia64 asm I can't make out significant parts of it in
> > arch/ia64/lib/memcpy.S.
>
> The arch/ia64/lib/memcpy.S is probably fine, it must be gcc doing
> an inline substitution of a well-known function.

arch/ia64/lib/memcpy.S is fine because it checks alignment at the very
beginning of the function, and depending on the dst/src alignment it does
different things. The unaligned access is coming from gcc's inlined version
of memcpy.

But looking more deeply, the memory allocation for proc_event in
proc_fork_connector doesn't look correct at all. In
drivers/connector/cn_proc.c:

#define CN_PROC_MSG_SIZE (sizeof(struct cn_msg) + sizeof(struct proc_event))

void proc_fork_connector(struct task_struct *task)
{
	struct cn_msg *msg;
	struct proc_event *ev;
	__u8 buffer[CN_PROC_MSG_SIZE];

You can't do that, because gcc assumes struct proc_event aligns on a
certain boundary. Doing fancy hand-crafting like that breaks the code
generated by gcc.

- Ken
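One possible remedy, sketched below, is to force the on-stack buffer to the
alignment gcc assumes for the embedded structure. This is a hedged sketch:
the 8-byte alignment value is an assumption, the function name is marked as
a sketch, and this is not necessarily the fix that was eventually merged.

#include <linux/connector.h>
#include <linux/cn_proc.h>
#include <linux/sched.h>

#define CN_PROC_MSG_SIZE (sizeof(struct cn_msg) + sizeof(struct proc_event))

void proc_fork_connector_sketch(struct task_struct *task)
{
	/*
	 * Force 8-byte alignment so that the struct proc_event following
	 * the struct cn_msg header sits where gcc expects an aligned
	 * object.  (The alignment value is an assumption for the sketch.)
	 */
	__u8 buffer[CN_PROC_MSG_SIZE] __attribute__((aligned(8)));
	struct cn_msg *msg = (struct cn_msg *)buffer;
	struct proc_event *ev = (struct proc_event *)msg->data;

	/* ... fill in msg and ev as the real function does ... */
	(void)task;
	(void)ev;
}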
RE: [patch] speed up single bio_vec allocation
> Chen, Kenneth wrote on Wednesday, December 06, 2006 10:20 AM
> > Jens Axboe wrote on Wednesday, December 06, 2006 2:09 AM
> > This is what I had in mind, in case it wasn't completely clear. Not
> > tested, other than it compiles. Basically it eliminates the small
> > bio_vec pool, and grows the bio by 16 bytes on 64-bit archs, or by
> > 12 bytes on 32-bit archs instead, and uses the room at the end for the
> > bio_vec structure.
>
> Yeah, I had a very similar patch queued internally for the large benchmark
> measurement. I will post the result as soon as I get it.

Jens, this improves 0.25% on our db transaction processing benchmark setup.
The patch tested is (on top of 2.6.19):
http://marc.theaimsgroup.com/?l=linux-kernel&m=116539972229021&w=2

- Ken
RE: [patch] speed up single bio_vec allocation
Andi Kleen wrote on Thursday, December 07, 2006 6:28 PM
> "Chen, Kenneth W" <[EMAIL PROTECTED]> writes:
> > I tried to use cache_line_size() to find out the alignment of struct
> > bio, but stumbled on the fact that it is a runtime function for x86_64.
>
> It's a single global variable access:
>
> #define cache_line_size() (boot_cpu_data.x86_cache_alignment)
>
> Or do you mean it caused cache misses? boot_cpu_data is cache aligned
> and practically read only, so there shouldn't be any false sharing at
> least.

No, I was looking for a generic constant that describes the cache line
size. I needed a compile-time constant in order to avoid a runtime check
and to rely on the compiler to optimize a conditional check away.
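For illustration, here is a hedged sketch of the kind of build-time check a
constant enables, assuming the kernel's L1_CACHE_BYTES constant from
<linux/cache.h>; the struct and the check are made up. With a compile-time
constant, gcc can fold the conditional away entirely and emit only one arm,
which a runtime cache_line_size() lookup cannot guarantee.

#include <linux/cache.h>	/* L1_CACHE_BYTES is a compile-time constant */

struct example {
	unsigned long a, b;
};

/* the compiler evaluates this condition at build time */
static inline int example_fits_one_cache_line(void)
{
	if (sizeof(struct example) <= L1_CACHE_BYTES)
		return 1;
	return 0;
}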
RE: [patch] speed up single bio_vec allocation
Nate Diller wrote on Thursday, December 07, 2006 1:46 PM
> the current code is straightforward and obviously correct. you want
> to make the alloc/dealloc paths more complex, by special-casing for an
> arbitrary limit of "small" I/O, AFAICT. of *course* you can expect
> less overhead when you're doing one large I/O vs. two small ones,
> that's the whole reason we have all this code to try to coalesce
> contiguous I/O, do readahead, swap page clustering, etc. we *want*
> more complexity if it will get us bigger I/Os. I don't see why we
> want more complexity to reduce the *inherent* penalty of doing smaller
> ones.

You should check out the latest proposal from Jens Axboe, which treats all
biovec sizes the same and stuffs the biovec along with struct bio. I think
it is a better approach than my first cut of special-casing the 1-segment
biovec. His patch will speed up I/O of all sizes.

- Ken
RE: [patch] speed up single bio_vec allocation
Nate Diller wrote on Thursday, December 07, 2006 11:22 AM
> > > I still can't help but think we can do better than this, and that this
> > > is nothing more than optimizing for a benchmark. For high performance
> > > I/O, you will be doing > 1 page bio's anyway and this patch wont help
> > > you at all. Perhaps we can just kill bio_vec slabs completely, and
> > > create bio slabs instead with differing sizes. So instead of having 1
> > > bio slab and 5 bio_vec slabs, change that to 5 bio slabs that leave room
> > > for the bio_vec list at the end. That would always eliminate the extra
> > > allocation, at the cost of blowing the 256-page case into an order 1 page
> > > allocation (256*16 + sizeof(*bio) > PAGE_SIZE) for the 4kb 64-bit archs,
> > > which is something I've always tried to avoid.
> >
> > I took a quick query of biovec-* slab stats on various production
> > machines; the majority of the allocations are for 1 and 4 segments, and
> > usage falls off quickly at 16 or more. A 256-segment biovec allocation
> > is really rare. I think it makes sense to heavily bias towards smaller
> > biovec allocations and have a separate biovec allocation path for really
> > large ones.
>
> what file system? have you tested with more than one? have you
> tested with file systems that build their own bio's instead of using
> get_block() calls? have you tested with large files or streaming
> workloads? how about direct I/O?
>
> i think that a "heavy bias" toward small biovecs is FS and workload
> dependent, and that it's irresponsible to make such unjustified
> changes just to show improvement on your particular benchmark.

No doubt the above data is just a quick estimate of one usage model; there
are tons of other usages in the world. After all, any algorithm in the
kernel has to be generic and self-tune to its specific environment.

On a very large I/O, the relative overhead of allocating the biovec
decreases, because a larger I/O needs more code to do the setup, more code
to perform I/O completion, more code in the device driver, etc. So the time
spent on one mempool alloc is amortized over the size of the I/O. On a
smaller I/O, the overhead is more visible, and it thus makes sense to me to
cut down that relative overhead.

In fact, large I/O already has an unfair advantage: if you do one 1MB I/O,
only one call to mempool is made, to get a 256-segment biovec. However, if
you do two 512KB I/Os, two calls to mempool are made. So in some sense, the
current scheme is unfair to small I/O.

- Ken
[patch] optimize o_direct on block device - v3
This patch implements a block device specific .direct_IO method instead of
going through the generic direct_io_worker for block devices.
direct_io_worker is fairly complex because it needs to handle O_DIRECT on
file systems, where it has to perform block allocation, hole detection,
extending the file on write, and tons of other corner cases. The end result
is that it takes tons of CPU time to submit an I/O.

For a block device, the block allocation is much simpler, and a tight
triple loop can be written to iterate over each iovec and each page within
the iovec in order to construct/prepare a bio structure and then submit it
to the block layer. This significantly speeds up O_DIRECT on block devices.

Signed-off-by: Ken Chen <[EMAIL PROTECTED]>

---
This is v3 of the patch; I have addressed all the comments from Andrew,
Christoph, and Zach. There are too many v2->v3 changes to list here. I've
created 34 test cases specifically for corner cases and tested this patch.
(My monkey test code is on http://kernel-perf.sourceforge.net/diotest).

diff -Nurp linux-2.6.19/fs/block_dev.c linux-2.6.19.ken/fs/block_dev.c
--- linux-2.6.19/fs/block_dev.c	2006-11-29 13:57:37.000000000 -0800
+++ linux-2.6.19.ken/fs/block_dev.c	2006-12-06 13:16:43.000000000 -0800
@@ -129,43 +129,188 @@ blkdev_get_block(struct inode *inode, se
 	return 0;
 }

-static int
-blkdev_get_blocks(struct inode *inode, sector_t iblock,
-		struct buffer_head *bh, int create)
+static int blk_end_aio(struct bio *bio, unsigned int bytes_done, int error)
 {
-	sector_t end_block = max_block(I_BDEV(inode));
-	unsigned long max_blocks = bh->b_size >> inode->i_blkbits;
+	struct kiocb *iocb = bio->bi_private;
+	atomic_t *bio_count = &iocb->ki_bio_count;

-	if ((iblock + max_blocks) > end_block) {
-		max_blocks = end_block - iblock;
-		if ((long)max_blocks <= 0) {
-			if (create)
-				return -EIO;	/* write fully beyond EOF */
-			/*
-			 * It is a read which is fully beyond EOF. We return
-			 * a !buffer_mapped buffer
-			 */
-			max_blocks = 0;
-		}
+	if (bio_data_dir(bio) == READ)
+		bio_check_pages_dirty(bio);
+	else {
+		bio_release_pages(bio);
+		bio_put(bio);
+	}
+
+	/* iocb->ki_nbytes stores error code from LLDD */
+	if (error)
+		iocb->ki_nbytes = -EIO;
+
+	if (atomic_dec_and_test(bio_count)) {
+		if (iocb->ki_nbytes < 0)
+			aio_complete(iocb, iocb->ki_nbytes, 0);
+		else
+			aio_complete(iocb, iocb->ki_left, 0);
 	}

-	bh->b_bdev = I_BDEV(inode);
-	bh->b_blocknr = iblock;
-	bh->b_size = max_blocks << inode->i_blkbits;
-	if (max_blocks)
-		set_buffer_mapped(bh);
 	return 0;
 }

+#define VEC_SIZE	16
+struct pvec {
+	unsigned short nr;
+	unsigned short idx;
+	struct page *page[VEC_SIZE];
+};
+
+#define PAGES_SPANNED(addr, len)	\
+	(DIV_ROUND_UP((addr) + (len), PAGE_SIZE) - (addr) / PAGE_SIZE);
+
+/*
+ * get page pointer for user addr, we internally cache struct page array for
+ * (addr, count) range in pvec to avoid frequent call to get_user_pages.  If
+ * internal page list is exhausted, a batch count of up to VEC_SIZE is used
+ * to get next set of page struct.
+ */
+static struct page *blk_get_page(unsigned long addr, size_t count, int rw,
+				 struct pvec *pvec)
+{
+	int ret, nr_pages;
+	if (pvec->idx == pvec->nr) {
+		nr_pages = PAGES_SPANNED(addr, count);
+		nr_pages = min(nr_pages, VEC_SIZE);
+		down_read(&current->mm->mmap_sem);
+		ret = get_user_pages(current, current->mm, addr, nr_pages,
+				     rw == READ, 0, pvec->page, NULL);
+		up_read(&current->mm->mmap_sem);
+		if (ret < 0)
+			return ERR_PTR(ret);
+		pvec->nr = ret;
+		pvec->idx = 0;
+	}
+	return pvec->page[pvec->idx++];
+}
+
 static ssize_t
 blkdev_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
-			loff_t offset, unsigned long nr_segs)
+			loff_t pos, unsigned long nr_segs)
 {
-	struct file *file = iocb->ki_filp;
-	struct inode *inode = file->f_mapping->host;
+	struct inode *inode = iocb->ki_filp->f_mapping->host;
+	unsigned blkbits = blksize_bits(bdev_hardsect_size(I_BDEV(inode)));
+	unsigned blocksize_mask = (1 << blkbits) - 1;
+	unsigned long seg = 0;	/* iov segment iterator */
+	unsigned long nvec;	/* number of bio vec needed */
+	unsigned long cur_off;	/* offset into current page */
+	unsigned long cur_len;	/* I/O len of current page, up to PAGE_SIZE */
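As a quick sanity check of the PAGES_SPANNED arithmetic in the patch above,
here is a standalone userspace rendition; this is not the kernel macro
itself, and PAGE_SIZE is assumed to be 4096 for the example.

#include <assert.h>
#include <stdio.h>

#define PAGE_SIZE 4096UL	/* assumed for the example */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
#define PAGES_SPANNED(addr, len) \
	(DIV_ROUND_UP((addr) + (len), PAGE_SIZE) - (addr) / PAGE_SIZE)

int main(void)
{
	/* one full, page-aligned page: spans exactly 1 page frame */
	assert(PAGES_SPANNED(0x1000UL, 4096UL) == 1);
	/* 4096 bytes starting mid-page: straddles 2 page frames */
	assert(PAGES_SPANNED(0x1800UL, 4096UL) == 2);
	/* 64KB starting unaligned: 16 pages of data touch 17 page frames */
	assert(PAGES_SPANNED(0x1234UL, 65536UL) == 17);
	printf("ok\n");
	return 0;
}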
RE: [patch] speed up single bio_vec allocation
Jens Axboe wrote on Wednesday, December 06, 2006 2:09 AM
> > > I will try that too. I'm a bit touchy about sharing a cache line for
> > > different bio. But given that there are 200,000 I/O per second we are
> > > currently pushing the kernel, the chances of two cpu working on two
> > > bio that sits in the same cache line are pretty small.
> >
> > Yep I really think so. Besides, it's not like we are repeatedly writing
> > to these objects in the first place.
>
> This is what I had in mind, in case it wasn't completely clear. Not
> tested, other than it compiles. Basically it eliminates the small
> bio_vec pool, and grows the bio by 16-bytes on 64-bit archs, or by
> 12-bytes on 32-bit archs instead and uses the room at the end for the
> bio_vec structure.

Yeah, I had a very similar patch queued internally for the large benchmark
measurement. I will post the result as soon as I get it.

> I still can't help but think we can do better than this, and that this
> is nothing more than optimizing for a benchmark. For high performance
> I/O, you will be doing > 1 page bio's anyway and this patch wont help
> you at all. Perhaps we can just kill bio_vec slabs completely, and
> create bio slabs instead with differing sizes. So instead of having 1
> bio slab and 5 bio_vec slabs, change that to 5 bio slabs that leave room
> for the bio_vec list at the end. That would always eliminate the extra
> allocation, at the cost of blowing the 256-page case into an order 1 page
> allocation (256*16 + sizeof(*bio) > PAGE_SIZE) for the 4kb 64-bit archs,
> which is something I've always tried to avoid.

I took a quick query of biovec-* slab stats on various production machines;
the majority of the allocations are for 1 and 4 segments, and usage falls
off quickly at 16 or more. A 256-segment biovec allocation is really rare.
I think it makes sense to heavily bias towards smaller biovec allocations
and have a separate biovec allocation path for really large ones.

- Ken
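The "room at the end" idea Jens describes is the familiar trailing-array
trick. Below is a minimal, self-contained userspace sketch with made-up
names, not Jens's actual patch: keep inline space for the common small case
inside the descriptor itself, and fall back to a second allocation only
when the I/O needs more segments than the inline space holds.

#include <stdlib.h>

struct vec { void *page; unsigned int len, offset; };

#define INLINE_VECS 1	/* inline space for the common 1-segment case */

struct io_desc {
	unsigned short vcnt;
	struct vec *vec;	/* points at inline_vecs or a heap array */
	struct vec inline_vecs[INLINE_VECS];	/* the "room at the end" */
};

static struct io_desc *io_desc_alloc(unsigned short nr_vecs)
{
	struct io_desc *d = malloc(sizeof(*d));
	if (!d)
		return NULL;
	d->vcnt = nr_vecs;
	if (nr_vecs <= INLINE_VECS) {
		d->vec = d->inline_vecs;	/* no second allocation */
	} else {
		d->vec = malloc(nr_vecs * sizeof(struct vec));
		if (!d->vec) {
			free(d);
			return NULL;
		}
	}
	return d;
}

static void io_desc_free(struct io_desc *d)
{
	if (d->vec != d->inline_vecs)
		free(d->vec);
	free(d);
}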
RE: [-mm patch] sched remove lb_stopbalance counter
Ingo Molnar wrote on Tuesday, December 05, 2006 7:42 AM
> * Chen, Kenneth W <[EMAIL PROTECTED]> wrote:
> > > but, please:
> > >
> > > > -#define SCHEDSTAT_VERSION 13
> > > > +#define SCHEDSTAT_VERSION 12
> > >
> > > change this to 14 instead. Versions should only go upwards, even if
> > > we revert to an earlier output format.
> >
> > Really? sched-decrease-number-of-load-balances.patch has not yet hit
> > the mainline and I think it's been in -mm for only a couple of weeks.
> > I'm trying to back out the change after briefly reviewing the patch.
>
> not a big issue but it costs nothing to go to version 14 - OTOH if any
> utility has been updated to version 13 and is forgotten about, it might
> break spuriously if we again go to 13 in the future.

OK, with reluctant agreement, I've changed the version to 14.

[patch] sched: remove lb_stopbalance counter

Remove the scheduler stats lb_stopbalance counter. This counter can be
calculated as lb_balanced - lb_nobusyg - lb_nobusyq. There is no need to
create a gazillion counters when we can derive the value.

Signed-off-by: Ken Chen <[EMAIL PROTECTED]>
Signed-off-by: Suresh Siddha <[EMAIL PROTECTED]>
Acked-by: Ingo Molnar <[EMAIL PROTECTED]>

--- ./include/linux/sched.h.orig	2006-12-05 07:56:11.000000000 -0800
+++ ./include/linux/sched.h	2006-12-05 07:57:55.000000000 -0800
@@ -684,7 +684,6 @@ struct sched_domain {
 	unsigned long lb_hot_gained[MAX_IDLE_TYPES];
 	unsigned long lb_nobusyg[MAX_IDLE_TYPES];
 	unsigned long lb_nobusyq[MAX_IDLE_TYPES];
-	unsigned long lb_stopbalance[MAX_IDLE_TYPES];

 	/* Active load balancing */
 	unsigned long alb_cnt;
--- ./kernel/sched.c.orig	2006-12-05 07:56:31.000000000 -0800
+++ ./kernel/sched.c	2006-12-05 08:02:11.000000000 -0800
@@ -428,7 +428,7 @@ static inline void task_rq_unlock(struct
  * bump this up when changing the output format or the meaning of an existing
  * format, so that tools can adapt (or abort)
  */
-#define SCHEDSTAT_VERSION 13
+#define SCHEDSTAT_VERSION 14

 static int show_schedstat(struct seq_file *seq, void *v)
 {
@@ -466,7 +466,7 @@ static int show_schedstat(struct seq_fil
 		seq_printf(seq, "domain%d %s", dcnt++, mask_str);
 		for (itype = SCHED_IDLE; itype < MAX_IDLE_TYPES; itype++) {
-			seq_printf(seq, " %lu %lu %lu %lu %lu %lu %lu %lu %lu",
+			seq_printf(seq, " %lu %lu %lu %lu %lu %lu %lu %lu",
 				sd->lb_cnt[itype],
 				sd->lb_balanced[itype],
 				sd->lb_failed[itype],
@@ -474,8 +474,7 @@ static int show_schedstat(struct seq_fil
 				sd->lb_gained[itype],
 				sd->lb_hot_gained[itype],
 				sd->lb_nobusyq[itype],
-				sd->lb_nobusyg[itype],
-				sd->lb_stopbalance[itype]);
+				sd->lb_nobusyg[itype]);
 		}
 		seq_printf(seq, " %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu\n",
 			sd->alb_cnt, sd->alb_failed, sd->alb_pushed,
@@ -2579,10 +2578,8 @@ redo:
 	group = find_busiest_group(sd, this_cpu, &imbalance, idle, &sd_idle,
 				   &cpus, balance);

-	if (*balance == 0) {
-		schedstat_inc(sd, lb_stopbalance[idle]);
+	if (*balance == 0)
 		goto out_balanced;
-	}

 	if (!group) {
 		schedstat_inc(sd, lb_nobusyg[idle]);
RE: [-mm patch] sched remove lb_stopbalance counter
Ingo Molnar wrote on Tuesday, December 05, 2006 7:32 AM
> * Chen, Kenneth W <[EMAIL PROTECTED]> wrote:
> > in -mm tree: I would like to revert the change adding the
> > lb_stopbalance counter. This count can be calculated as lb_balanced
> > - lb_nobusyg - lb_nobusyq. There is no need to create a gazillion
> > counters when we can derive the value. I'm rather against changing
> > the schedstat format unless it is absolutely necessary, as every
> > userland tool parsing /proc/schedstat needs to be updated, and it's a
> > real pain trying to keep multiple versions of it.
> >
> > Signed-off-by: Ken Chen <[EMAIL PROTECTED]>
>
> Acked-by: Ingo Molnar <[EMAIL PROTECTED]>
>
> but, please:
>
> > -#define SCHEDSTAT_VERSION 13
> > +#define SCHEDSTAT_VERSION 12
>
> change this to 14 instead. Versions should only go upwards, even if we
> revert to an earlier output format.

Really? sched-decrease-number-of-load-balances.patch has not yet hit the
mainline, and I think it's been in -mm for only a couple of weeks. I'm
trying to back out the change after briefly reviewing the patch.

- Ken
[-mm patch] sched remove lb_stopbalance counter
Regarding sched-decrease-number-of-load-balances.patch, currently in the
-mm tree: I would like to revert the change adding the lb_stopbalance
counter. This count can be calculated as lb_balanced - lb_nobusyg -
lb_nobusyq. There is no need to create a gazillion counters when we can
derive the value. I'm rather against changing the schedstat format unless
it is absolutely necessary, as every userland tool parsing /proc/schedstat
needs to be updated, and it's a real pain trying to keep multiple versions
of it.

Signed-off-by: Ken Chen <[EMAIL PROTECTED]>

--- ./include/linux/sched.h.orig	2006-12-05 07:56:11.000000000 -0800
+++ ./include/linux/sched.h	2006-12-05 07:57:55.000000000 -0800
@@ -684,7 +684,6 @@ struct sched_domain {
 	unsigned long lb_hot_gained[MAX_IDLE_TYPES];
 	unsigned long lb_nobusyg[MAX_IDLE_TYPES];
 	unsigned long lb_nobusyq[MAX_IDLE_TYPES];
-	unsigned long lb_stopbalance[MAX_IDLE_TYPES];

 	/* Active load balancing */
 	unsigned long alb_cnt;
--- ./kernel/sched.c.orig	2006-12-05 07:56:31.000000000 -0800
+++ ./kernel/sched.c	2006-12-05 08:02:11.000000000 -0800
@@ -428,7 +428,7 @@ static inline void task_rq_unlock(struct
  * bump this up when changing the output format or the meaning of an existing
  * format, so that tools can adapt (or abort)
  */
-#define SCHEDSTAT_VERSION 13
+#define SCHEDSTAT_VERSION 12

 static int show_schedstat(struct seq_file *seq, void *v)
 {
@@ -466,7 +466,7 @@ static int show_schedstat(struct seq_fil
 		seq_printf(seq, "domain%d %s", dcnt++, mask_str);
 		for (itype = SCHED_IDLE; itype < MAX_IDLE_TYPES; itype++) {
-			seq_printf(seq, " %lu %lu %lu %lu %lu %lu %lu %lu %lu",
+			seq_printf(seq, " %lu %lu %lu %lu %lu %lu %lu %lu",
 				sd->lb_cnt[itype],
 				sd->lb_balanced[itype],
 				sd->lb_failed[itype],
@@ -474,8 +474,7 @@ static int show_schedstat(struct seq_fil
 				sd->lb_gained[itype],
 				sd->lb_hot_gained[itype],
 				sd->lb_nobusyq[itype],
-				sd->lb_nobusyg[itype],
-				sd->lb_stopbalance[itype]);
+				sd->lb_nobusyg[itype]);
 		}
 		seq_printf(seq, " %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu\n",
 			sd->alb_cnt, sd->alb_failed, sd->alb_pushed,
@@ -2579,10 +2578,8 @@ redo:
 	group = find_busiest_group(sd, this_cpu, &imbalance, idle, &sd_idle,
 				   &cpus, balance);

-	if (*balance == 0) {
-		schedstat_inc(sd, lb_stopbalance[idle]);
+	if (*balance == 0)
 		goto out_balanced;
-	}

 	if (!group) {
 		schedstat_inc(sd, lb_nobusyg[idle]);
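For tools that still want the dropped counter, the derivation is mechanical.
A hedged userspace sketch follows; the inputs would come from parsing one
domain line of /proc/schedstat for one idle type, and the sample values are
made up for illustration.

#include <stdio.h>

/*
 * Derive the removed counter from the fields that remain:
 *   lb_stopbalance = lb_balanced - lb_nobusyg - lb_nobusyq
 */
static unsigned long lb_stopbalance(unsigned long lb_balanced,
				    unsigned long lb_nobusyg,
				    unsigned long lb_nobusyq)
{
	return lb_balanced - lb_nobusyg - lb_nobusyq;
}

int main(void)
{
	/* made-up sample values for illustration */
	printf("lb_stopbalance = %lu\n", lb_stopbalance(1000, 300, 200));
	return 0;
}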
RE: [-mm patch] sched remove lb_stopbalance counter
Ingo Molnar wrote on Tuesday, December 05, 2006 7:42 AM > * Chen, Kenneth W <[EMAIL PROTECTED]> wrote: > > > but, please: > > > > -#define SCHEDSTAT_VERSION 13 > > > > +#define SCHEDSTAT_VERSION 12 > > > change this to 14 instead. Versions should only go upwards, even if > > > we revert to an earlier output format. > > Really? sched-decrease-number-of-load-balances.patch has not yet hit > > mainline, and I think it has been in -mm for only a couple of weeks. > > I'm trying to back out the change after briefly reviewing the patch. > not a big issue but it costs nothing to go to version 14 - OTOH if any > utility has been updated to version 13 and is forgotten about, it might > break spuriously if we again go to 13 in the future. OK, with reluctant agreement, I've changed the version to 14.

[patch] sched: remove lb_stopbalance counter

Remove the scheduler stats lb_stopbalance counter. This counter can be calculated by: lb_balanced - lb_nobusyg - lb_nobusyq. There is no need to create a gazillion counters when we can derive the value.

Signed-off-by: Ken Chen <[EMAIL PROTECTED]> Signed-off-by: Suresh Siddha <[EMAIL PROTECTED]> Acked-by: Ingo Molnar <[EMAIL PROTECTED]>

--- ./include/linux/sched.h.orig	2006-12-05 07:56:11.0 -0800
+++ ./include/linux/sched.h	2006-12-05 07:57:55.0 -0800
@@ -684,7 +684,6 @@ struct sched_domain {
 	unsigned long lb_hot_gained[MAX_IDLE_TYPES];
 	unsigned long lb_nobusyg[MAX_IDLE_TYPES];
 	unsigned long lb_nobusyq[MAX_IDLE_TYPES];
-	unsigned long lb_stopbalance[MAX_IDLE_TYPES];

 	/* Active load balancing */
 	unsigned long alb_cnt;
--- ./kernel/sched.c.orig	2006-12-05 07:56:31.0 -0800
+++ ./kernel/sched.c	2006-12-05 08:02:11.0 -0800
@@ -428,7 +428,7 @@ static inline void task_rq_unlock(struct
  * bump this up when changing the output format or the meaning of an existing
  * format, so that tools can adapt (or abort)
  */
-#define SCHEDSTAT_VERSION 13
+#define SCHEDSTAT_VERSION 14

 static int show_schedstat(struct seq_file *seq, void *v)
 {
@@ -466,7 +466,7 @@ static int show_schedstat(struct seq_fil
 		seq_printf(seq, "domain%d %s", dcnt++, mask_str);
 		for (itype = SCHED_IDLE; itype < MAX_IDLE_TYPES; itype++) {
-			seq_printf(seq, " %lu %lu %lu %lu %lu %lu %lu %lu %lu",
+			seq_printf(seq, " %lu %lu %lu %lu %lu %lu %lu %lu",
 				sd->lb_cnt[itype],
 				sd->lb_balanced[itype],
 				sd->lb_failed[itype],
@@ -474,8 +474,7 @@ static int show_schedstat(struct seq_fil
 				sd->lb_gained[itype],
 				sd->lb_hot_gained[itype],
 				sd->lb_nobusyq[itype],
-				sd->lb_nobusyg[itype],
-				sd->lb_stopbalance[itype]);
+				sd->lb_nobusyg[itype]);
 		}
 		seq_printf(seq, " %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu\n",
 			sd->alb_cnt, sd->alb_failed, sd->alb_pushed,
@@ -2579,10 +2578,8 @@ redo:
 	group = find_busiest_group(sd, this_cpu, &imbalance, idle, &sd_idle,
 				   &cpus, balance);

-	if (*balance == 0) {
-		schedstat_inc(sd, lb_stopbalance[idle]);
+	if (*balance == 0)
 		goto out_balanced;
-	}

 	if (!group) {
 		schedstat_inc(sd, lb_nobusyg[idle]);

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
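The breakage Ingo is guarding against lives on the tool side. A hypothetical parser (not any particular existing utility) pinned to one format version shows why a reused version number would fail silently while a bumped one fails loudly:

#include <stdio.h>
#include <stdlib.h>

#define SUPPORTED_SCHEDSTAT_VERSION 14  /* the only format this parser knows */

int main(void)
{
        FILE *f = fopen("/proc/schedstat", "r");
        unsigned int ver;

        if (!f) {
                perror("fopen /proc/schedstat");
                return EXIT_FAILURE;
        }
        /* the file starts with a "version NN" header line */
        if (fscanf(f, "version %u", &ver) != 1 ||
            ver != SUPPORTED_SCHEDSTAT_VERSION) {
                /* if 13 were ever reused with a different field layout, a
                 * tool still pinned to the old 13 would parse garbage
                 * without noticing -- hence "versions only go upwards" */
                fprintf(stderr, "unsupported schedstat format\n");
                fclose(f);
                return EXIT_FAILURE;
        }
        /* ... parse cpuN and domainN lines here ... */
        fclose(f);
        return 0;
}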
RE: [patch] add an iterator index in struct pagevec
Andrew Morton wrote on Monday, December 04, 2006 9:45 PM > On Mon, 4 Dec 2006 21:21:31 -0800 > "Chen, Kenneth W" <[EMAIL PROTECTED]> wrote: > > > pagevec is never expected to hold more than PAGEVEC_SIZE pages; an > > unsigned char is enough to count them. This patch makes nr and cold > > unsigned char > > Is that on the right side of the speed/space tradeoff? I haven't measured speed. Size-wise, making them char shrinks the vmlinux text size by 112 bytes on x86_64 (using the default config options). > I must say I'm a bit skeptical about the need for this. But I haven't > looked closely at the blockdev-specific dio code yet. It was suggested to declare another struct that embeds pagevec to perform the iteration, but I prefer to give pagevec the capability itself; it is more compact this way. It would be nice if you could review the blockdev-specific dio code. I would appreciate it very much. - Ken - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch] add an iterator index in struct pagevec
pagevec is never expected to hold more than PAGEVEC_SIZE pages; an unsigned char is enough to count them. This patch makes nr and cold unsigned char and also adds an iterator index. With that, PAGEVEC_SIZE can even be bumped up by 1, to 15.

Signed-off-by: Ken Chen <[EMAIL PROTECTED]>

diff -Nurp linux-2.6.19/include/linux/pagevec.h linux-2.6.19.ken/include/linux/pagevec.h
--- linux-2.6.19/include/linux/pagevec.h	2006-11-29 13:57:37.0 -0800
+++ linux-2.6.19.ken/include/linux/pagevec.h	2006-12-04 19:18:21.0 -0800
@@ -8,15 +8,16 @@
 #ifndef _LINUX_PAGEVEC_H
 #define _LINUX_PAGEVEC_H

-/* 14 pointers + two long's align the pagevec structure to a power of two */
-#define PAGEVEC_SIZE	14
+/* 15 pointers + 3 char's align the pagevec structure to a power of two */
+#define PAGEVEC_SIZE	15

 struct page;
 struct address_space;

 struct pagevec {
-	unsigned long nr;
-	unsigned long cold;
+	unsigned char nr;
+	unsigned char cold;
+	unsigned char idx;
 	struct page *pages[PAGEVEC_SIZE];
 };

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
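The power-of-two claim is easy to verify standalone. This sketch mocks the patched layout (it is not the kernel header itself) and checks the size for whatever ABI it is compiled on:

#include <stdio.h>

#define PAGEVEC_SIZE 15

struct page;                            /* opaque, pointer-only use */

struct pagevec {                        /* mock of the patched layout */
        unsigned char nr;
        unsigned char cold;
        unsigned char idx;
        struct page *pages[PAGEVEC_SIZE];
};

int main(void)
{
        size_t sz = sizeof(struct pagevec);

        /* 3 bytes + 5 padding + 15 * 8-byte pointers = 128 on 64-bit;
         * 3 bytes + 1 padding + 15 * 4-byte pointers = 64 on 32-bit */
        printf("sizeof(struct pagevec) = %zu, power of two: %s\n",
               sz, (sz & (sz - 1)) == 0 ? "yes" : "no");
        return 0;
}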
[patch] optimize o_direct on block device - v2
This patch implements a block device specific .direct_IO method instead of going through the generic direct_io_worker for block devices. direct_io_worker is fairly complex because it needs to handle O_DIRECT on file systems, where it needs to perform block allocation, hole detection, file extension on write, and tons of other corner cases. The end result is that it takes tons of CPU time to submit an I/O. For a block device, block allocation is much simpler, and a tight triple loop can be written to iterate each iovec and each page within the iovec in order to construct/prepare a bio structure and then submit it to the block layer. This significantly speeds up O_DIRECT on block devices.

Signed-off-by: Ken Chen <[EMAIL PROTECTED]>

---
Changes since v1->v2:
* add BUILD_BUG_ON to ensure bio_count fits inside iocb->private
* add comment that bio_alloc won't fail with GFP_KERNEL
* fix back-out path if get_user_pages fails
* fix back-out path if iov segment doesn't align properly

 fs/bio.c            |    2
 fs/block_dev.c      |  173
 fs/read_write.c     |    2
 include/linux/bio.h |    1
 4 files changed, 150 insertions(+), 28 deletions(-)

--- ./fs/block_dev.c.orig	2006-11-29 13:57:37.0 -0800
+++ ./fs/block_dev.c	2006-12-04 18:38:53.0 -0800
@@ -129,43 +129,164 @@ blkdev_get_block(struct inode *inode, se
 	return 0;
 }

-static int
-blkdev_get_blocks(struct inode *inode, sector_t iblock,
-		struct buffer_head *bh, int create)
+int blk_end_aio(struct bio *bio, unsigned int bytes_done, int error)
 {
-	sector_t end_block = max_block(I_BDEV(inode));
-	unsigned long max_blocks = bh->b_size >> inode->i_blkbits;
+	struct kiocb* iocb = bio->bi_private;
+	atomic_t* bio_count = (atomic_t *) &iocb->private;
+	long res;
+
+	if ((bio->bi_rw & 1) == READ)
+		bio_check_pages_dirty(bio);
+	else {
+		bio_release_pages(bio);
+		bio_put(bio);
+	}

-	if ((iblock + max_blocks) > end_block) {
-		max_blocks = end_block - iblock;
-		if ((long)max_blocks <= 0) {
-			if (create)
-				return -EIO;	/* write fully beyond EOF */
-			/*
-			 * It is a read which is fully beyond EOF. We return
-			 * a !buffer_mapped buffer
-			 */
-			max_blocks = 0;
-		}
+	if (error)
+		iocb->ki_left = -EIO;
+
+	if (atomic_dec_and_test(bio_count)) {
+		res = (iocb->ki_left < 0) ? iocb->ki_left : iocb->ki_nbytes;
+		aio_complete(iocb, res, 0);
 	}

-	bh->b_bdev = I_BDEV(inode);
-	bh->b_blocknr = iblock;
-	bh->b_size = max_blocks << inode->i_blkbits;
-	if (max_blocks)
-		set_buffer_mapped(bh);
 	return 0;
 }

+#define VEC_SIZE	16
+struct pvec {
+	unsigned short nr;
+	unsigned short idx;
+	struct page *page[VEC_SIZE];
+};
+
+
+struct page *blk_get_page(unsigned long addr, size_t count, int rw,
+			struct pvec *pvec)
+{
+	int ret, nr_pages;
+	if (pvec->idx == pvec->nr) {
+		nr_pages = (addr + count + PAGE_SIZE - 1) / PAGE_SIZE -
+			addr / PAGE_SIZE;
+		nr_pages = min(nr_pages, VEC_SIZE);
+		down_read(&current->mm->mmap_sem);
+		ret = get_user_pages(current, current->mm, addr, nr_pages,
+				     rw == READ, 0, pvec->page, NULL);
+		up_read(&current->mm->mmap_sem);
+		if (ret < 0)
+			return ERR_PTR(ret);
+		pvec->nr = ret;
+		pvec->idx = 0;
+	}
+	return pvec->page[pvec->idx++];
+}
+
 static ssize_t
 blkdev_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
-			loff_t offset, unsigned long nr_segs)
+			loff_t pos, unsigned long nr_segs)
 {
-	struct file *file = iocb->ki_filp;
-	struct inode *inode = file->f_mapping->host;
+	struct inode *inode = iocb->ki_filp->f_mapping->host;
+	unsigned blkbits = blksize_bits(bdev_hardsect_size(I_BDEV(inode)));
+	unsigned blocksize_mask = (1 << blkbits) - 1;
+	unsigned long seg, nvec, cur_off, cur_len;
+
+	unsigned long addr;
+	size_t count, nbytes = iocb->ki_nbytes;
+	loff_t size;
+	struct bio * bio;
+	atomic_t *bio_count = (atomic_t *) &iocb->private;
+	struct page *page;
+	struct pvec pvec = {.nr = 0, .idx = 0, };
+
+	BUILD_BUG_ON(sizeof(atomic_t) > sizeof(iocb->private));
+
+	size = i_size_read(inode);
+	if (pos + nbytes > size)
+		nbytes = size - pos;
+
+	seg = 0;
+	addr = (unsigned long) iov[0].iov_base;
+	count = iov[0].iov_len;
+	atomic_set(bio_count, 1);
+
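The archive cuts the quoted patch off just after the submission-side setup, so the completion scheme is worth restating: iocb->private is reused as an atomic_t that counts in-flight bios plus one reference held by the submitter, and blk_end_aio() (shown in full above) completes the iocb when the count drops to zero. A simplified sketch of that handshake follows; it shows the pattern only, not the verbatim patch code:

	atomic_t *bio_count = (atomic_t *) &iocb->private;

	atomic_set(bio_count, 1);		/* submitter holds one reference */

	while (nbytes) {
		/* ... allocate a bio, set bio->bi_private = iocb and
		 * bio->bi_end_io = blk_end_aio, fill it with pages ... */
		atomic_inc(bio_count);		/* one reference per in-flight bio */
		submit_bio(rw, bio);
	}

	/* drop the submitter's reference; whoever reaches zero first --
	 * here or in blk_end_aio() -- calls aio_complete() */
	if (atomic_dec_and_test(bio_count))
		aio_complete(iocb, iocb->ki_nbytes, 0);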
RE: [patch] speed up single bio_vec allocation
Jens Axboe wrote on Monday, December 04, 2006 12:07 PM > On Mon, Dec 04 2006, Chen, Kenneth W wrote: > > On a 64-bit arch like x86_64, struct bio is 104 bytes. Since the bio > > slab is created with the SLAB_HWCACHE_ALIGN flag, there is usually spare > > memory available at the end of a bio. I think we can utilize that memory > > for bio_vec allocation. The purpose is not so much to save memory > > consumption for bio_vec; instead, I'm attempting to optimize away a call > > to bvec_alloc_bs. > > > > So here is a patch to do just that for a 1-segment bio_vec (we currently > > only have space for 1 on 2.6.19). The detection of whether spare space > > is available is calculated at compile time. If there is no space > > available, there will be no runtime cost at all because gcc simply > > optimizes away all the code added in this patch. If there is space > > available, the only runtime check is to see what the size of the iovec > > is, and we do the appropriate assignment to bio->bi_io_vec etc. The cost > > is minimal and we gain a whole lot back from not calling the > > bvec_alloc_bs() function. > > > > I tried to use cache_line_size() to find out the alignment of struct > > bio, but stumbled on the fact that it is a runtime function on x86_64. > > So instead I made bio hint to the slab allocator to align on 32 bytes > > (slab will use the larger of the hw cache line size and the caller's > > "align" hint). I think it is a sane number for the majority of the CPUs > > out in the world. > > Any benchmarks for this one? About 0.2% on a database transaction processing benchmark. It was done a while back on top of a major Linux vendor kernel. I will retest it for 2.6.19. > [...] > > Another idea would be to kill SLAB_HWCACHE_ALIGN (it's pretty pointless, > I bet), and always alloc sizeof(*bio) + sizeof(*bvl) in one go when a > bio is allocated. It doesn't add a lot of overhead even for the case > where we do > 1 page bios, and it gets rid of the dual allocation for > the 1 page bio. I will try that too. I'm a bit touchy about sharing a cache line between different bios. But given that we are currently pushing the kernel to 200,000 I/Os per second, the chances of two CPUs working on two bios that sit in the same cache line are pretty small. - Ken - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
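Jens's suggestion -- one allocation sized for the bio plus a single bio_vec -- would look roughly like the sketch below. The wrapper struct, slab, and helper names are hypothetical illustrations of the idea, not actual kernel symbols:

	/* one slab object carries both the bio and an inline 1-entry bvec */
	struct bio_plus_bvec {
		struct bio	bio;
		struct bio_vec	inline_vec;	/* used when nr_iovecs == 1 */
	};

	static kmem_cache_t *bio_plus_slab;	/* sized for the pair above */

	static struct bio *bio_alloc_one_seg(gfp_t gfp_mask)
	{
		struct bio_plus_bvec *b;

		b = kmem_cache_alloc(bio_plus_slab, gfp_mask);
		if (!b)
			return NULL;
		bio_init(&b->bio);
		b->bio.bi_io_vec = &b->inline_vec;	/* no second allocation */
		b->bio.bi_max_vecs = 1;
		return &b->bio;
	}

It would also sidestep the cache-line sharing worry above: the inline bvec shares a line with its own bio rather than with a neighboring one.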
RE: [patch] remove redundant iov segment check
Andrew Morton wrote on Monday, December 04, 2006 11:36 AM > On Mon, 4 Dec 2006 08:26:36 -0800 > "Chen, Kenneth W" <[EMAIL PROTECTED]> wrote: > > > So it's not possible to call down to generic_file_aio_read/write with an invalid > > iov segment. This patch deletes the redundant code. > > erp, please don't touch that code. > > The writev deadlock fixes which went into 2.6.17 trashed writev() > performance and Nick and I are slowly trying to get it back, while fixing > the has-been-there-forever pagefault-in-write() deadlock. > > This is all proving very hard to do, and we don't need sweeping code > cleanups happening under our feet ;) > > I'll bring those patches back in next -mm, but not very confidently. OK, I will wait until that bug settles down and resubmit. I really don't see the value of walking the iov multiple times doing the same thing. I will also dig up the archive to see if I can help in any way. - Ken - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [patch] remove redundant iov segment check
Zach Brown wrote on Monday, December 04, 2006 11:19 AM > On Dec 4, 2006, at 8:26 AM, Chen, Kenneth W wrote: > > > The access_ok() and negative length check on each iov segment in > > function generic_file_aio_read/write are redundant. They are all > > already checked before calling down to these low level generic > > functions. > ... > > > So it's not possible to call down to generic_file_aio_read/write > > with an invalid iov segment. > > Well, generic_file_aio_{read,write}() are exported to modules, so > anything's *possible*. :) > > This change makes me nervous because it relies on our ability to > audit all code paths to ensure that it's correct. It'd be nice if > the code enforced the rules. Maybe we should create another internal generic_file_aio_read/write for in-kernel callers? fs/read_write.c and fs/aio.c cannot be built as modules, and the checks are already done there. For external modules, we can do the checks and then call down to the internal one. I hate to see the iov being walked multiple times. And this is part of my effort to bring O_DIRECT performance back to the level of a three-year-old vendor kernel based on 2.4. - Ken - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
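The split being floated here would keep validation only at the module-facing boundary. A rough sketch follows; the __-prefixed internal name follows the usual kernel convention but is hypothetical, not from an actual patch:

	/* in-kernel callers (fs/read_write.c, fs/aio.c) have already
	 * validated the iovec, so they call __generic_file_aio_read() */
	ssize_t generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
				      unsigned long nr_segs, loff_t pos)
	{
		size_t count = 0;
		unsigned long seg;

		/* validate once here, for external modules using the export */
		for (seg = 0; seg < nr_segs; seg++) {
			count += iov[seg].iov_len;
			if (unlikely((ssize_t)(count | iov[seg].iov_len) < 0))
				return -EINVAL;
			if (!access_ok(VERIFY_WRITE, iov[seg].iov_base,
				       iov[seg].iov_len))
				return -EFAULT;
		}
		return __generic_file_aio_read(iocb, iov, nr_segs, pos);
	}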
[patch] speed up single bio_vec allocation
On a 64-bit arch like x86_64, struct bio is 104 bytes. Since the bio slab is created with the SLAB_HWCACHE_ALIGN flag, there is usually spare memory available at the end of a bio. I think we can utilize that memory for bio_vec allocation. The purpose is not so much to save memory consumption for bio_vec; instead, I'm attempting to optimize away a call to bvec_alloc_bs.

So here is a patch to do just that for a 1-segment bio_vec (we currently only have space for 1 on 2.6.19). The detection of whether spare space is available is calculated at compile time. If there is no space available, there will be no runtime cost at all because gcc simply optimizes away all the code added in this patch. If there is space available, the only runtime check is to see what the size of the iovec is, and we do the appropriate assignment to bio->bi_io_vec etc. The cost is minimal and we gain a whole lot back from not calling the bvec_alloc_bs() function.

I tried to use cache_line_size() to find out the alignment of struct bio, but stumbled on the fact that it is a runtime function on x86_64. So instead I made bio hint to the slab allocator to align on 32 bytes (slab will use the larger of the hw cache line size and the caller's "align" hint). I think it is a sane number for the majority of the CPUs out in the world.

Signed-off-by: Ken Chen <[EMAIL PROTECTED]>

--- ./fs/bio.c.orig	2006-12-03 17:20:36.0 -0800
+++ ./fs/bio.c	2006-12-03 21:29:20.0 -0800
@@ -29,11 +29,14 @@
 #include <scsi/sg.h>		/* for struct sg_iovec */

 #define BIO_POOL_SIZE 256
-
+#define BIO_ALIGN 32	/* minimal bio structure alignment */
 static kmem_cache_t *bio_slab __read_mostly;

 #define BIOVEC_NR_POOLS 6
+#define BIOVEC_FIT_INSIDE_BIO_CACHE_LINE			\
+	(ALIGN(sizeof(struct bio), BIO_ALIGN) ==		\
+	 ALIGN(sizeof(struct bio) + sizeof(struct bio_vec), BIO_ALIGN))

 /*
  * a small number of entries is fine, not going to be performance critical.
  * basically we just need to survive
@@ -113,7 +116,8 @@ void bio_free(struct bio *bio, struct bi

 	BIO_BUG_ON(pool_idx >= BIOVEC_NR_POOLS);

-	mempool_free(bio->bi_io_vec, bio_set->bvec_pools[pool_idx]);
+	if (!BIOVEC_FIT_INSIDE_BIO_CACHE_LINE || pool_idx)
+		mempool_free(bio->bi_io_vec, bio_set->bvec_pools[pool_idx]);
 	mempool_free(bio, bio_set->bio_pool);
 }

@@ -166,7 +170,15 @@ struct bio *bio_alloc_bioset(gfp_t gfp_m
 		struct bio_vec *bvl = NULL;

 		bio_init(bio);
-		if (likely(nr_iovecs)) {
+
+		/*
+		 * if bio_vec can fit into remaining cache line of struct
+		 * bio, go ahead use it and skip mempool allocation.
+		 */
+		if (nr_iovecs == 1 && BIOVEC_FIT_INSIDE_BIO_CACHE_LINE) {
+			bvl = (struct bio_vec*) (bio + 1);
+			bio->bi_max_vecs = 1;
+		} else if (likely(nr_iovecs)) {
 			unsigned long idx = 0; /* shut up gcc */

 			bvl = bvec_alloc_bs(gfp_mask, nr_iovecs, &idx, bs);
@@ -1204,7 +1216,7 @@ static void __init biovec_init_slabs(voi
 		struct biovec_slab *bvs = bvec_slabs + i;

 		size = bvs->nr_vecs * sizeof(struct bio_vec);
-		bvs->slab = kmem_cache_create(bvs->name, size, 0,
+		bvs->slab = kmem_cache_create(bvs->name, size, BIO_ALIGN,
 				SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL, NULL);
 	}
 }

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
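Plugging in the numbers from the description makes the compile-time test concrete. This standalone arithmetic check assumes the quoted x86_64 sizes (struct bio 104 bytes, struct bio_vec 16 bytes); it is not kernel code:

#include <stdio.h>

#define BIO_ALIGN 32
#define ALIGN(x, a) (((x) + (a) - 1) & ~((unsigned long)(a) - 1))

int main(void)
{
        unsigned long bio_sz = 104;     /* sizeof(struct bio), x86_64 */
        unsigned long bvec_sz = 16;     /* sizeof(struct bio_vec), x86_64 */

        /* ALIGN(104, 32) == 128 and ALIGN(104 + 16, 32) == 128: the
         * 1-entry bio_vec rides in the slab padding for free */
        printf("padded bio: %lu, padded bio+bvec: %lu, fits: %s\n",
               ALIGN(bio_sz, BIO_ALIGN),
               ALIGN(bio_sz + bvec_sz, BIO_ALIGN),
               ALIGN(bio_sz, BIO_ALIGN) ==
               ALIGN(bio_sz + bvec_sz, BIO_ALIGN) ? "yes" : "no");
        return 0;
}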
[patch] remove redundant iov segment check
The access_ok() and negative length checks on each iov segment in generic_file_aio_read/write are redundant. They are all already done before calling down to these low-level generic functions. Vectored I/O (both sync and async) is checked via rw_copy_check_uvector(). For single-segment synchronous I/O, the checks are done in various places: access_ok() is checked in vfs_read/write, and the negative length is checked in rw_verify_area. For single-segment AIO, access_ok() is done in aio_setup_iocb, and the negative length is checked in io_submit_one. So it's not possible to call down to generic_file_aio_read/write with an invalid iov segment. This patch deletes the redundant code.

It also moves the negative length check for single-segment AIO into aio_setup_single_vector, to mirror aio_setup_vectored_rw so that both check for an illegal negative length. It fits better there too, because iocb->aio_nbytes now does double duty as the segment count for vectored AIO, and checking for a negative length in the entry point of all AIO isn't very obvious.

Signed-off-by: Ken Chen <[EMAIL PROTECTED]>

diff -Nurp linux-2.6.19/fs/aio.c linux-2.6.19.ken/fs/aio.c
--- linux-2.6.19/fs/aio.c	2006-11-29 13:57:37.0 -0800
+++ linux-2.6.19.ken/fs/aio.c	2006-12-03 17:16:52.0 -0800
@@ -1416,6 +1416,8 @@ static ssize_t aio_setup_single_vector(s
 	kiocb->ki_nr_segs = 1;
 	kiocb->ki_cur_seg = 0;
 	kiocb->ki_nbytes = kiocb->ki_left;
+	if (unlikely((ssize_t) kiocb->ki_nbytes < 0))
+		return -EINVAL;
 	return 0;
 }

@@ -1560,8 +1562,7 @@ int fastcall io_submit_one(struct kioctx
 	/* prevent overflows */
 	if (unlikely(
 	    (iocb->aio_buf != (unsigned long)iocb->aio_buf) ||
-	    (iocb->aio_nbytes != (size_t)iocb->aio_nbytes) ||
-	    ((ssize_t)iocb->aio_nbytes < 0)
+	    (iocb->aio_nbytes != (size_t)iocb->aio_nbytes)
 	   )) {
 		pr_debug("EINVAL: io_submit: overflow check\n");
 		return -EINVAL;
diff -Nurp linux-2.6.19/mm/filemap.c linux-2.6.19.ken/mm/filemap.c
--- linux-2.6.19/mm/filemap.c	2006-11-29 13:57:37.0 -0800
+++ linux-2.6.19.ken/mm/filemap.c	2006-12-03 17:16:10.0 -0800
@@ -1143,29 +1143,9 @@ generic_file_aio_read(struct kiocb *iocb
 	struct file *filp = iocb->ki_filp;
 	ssize_t retval;
 	unsigned long seg;
-	size_t count;
+	size_t count = iocb->ki_left;
 	loff_t *ppos = &iocb->ki_pos;

-	count = 0;
-	for (seg = 0; seg < nr_segs; seg++) {
-		const struct iovec *iv = &iov[seg];
-
-		/*
-		 * If any segment has a negative length, or the cumulative
-		 * length ever wraps negative then return -EINVAL.
-		 */
-		count += iv->iov_len;
-		if (unlikely((ssize_t)(count|iv->iov_len) < 0))
-			return -EINVAL;
-		if (access_ok(VERIFY_WRITE, iv->iov_base, iv->iov_len))
-			continue;
-		if (seg == 0)
-			return -EFAULT;
-		nr_segs = seg;
-		count -= iv->iov_len;	/* This segment is no good */
-		break;
-	}
-
 	/* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
 	if (filp->f_flags & O_DIRECT) {
 		loff_t size;
@@ -2228,32 +2208,11 @@ __generic_file_aio_write_nolock(struct k
 	size_t ocount;		/* original count */
 	size_t count;		/* after file limit checks */
 	struct inode	*inode = mapping->host;
-	unsigned long	seg;
 	loff_t		pos;
 	ssize_t		written;
 	ssize_t		err;

-	ocount = 0;
-	for (seg = 0; seg < nr_segs; seg++) {
-		const struct iovec *iv = &iov[seg];
-
-		/*
-		 * If any segment has a negative length, or the cumulative
-		 * length ever wraps negative then return -EINVAL.
-		 */
-		ocount += iv->iov_len;
-		if (unlikely((ssize_t)(ocount|iv->iov_len) < 0))
-			return -EINVAL;
-		if (access_ok(VERIFY_READ, iv->iov_base, iv->iov_len))
-			continue;
-		if (seg == 0)
-			return -EFAULT;
-		nr_segs = seg;
-		ocount -= iv->iov_len;	/* This segment is no good */
-		break;
-	}
-
-	count = ocount;
+	count = ocount = iocb->ki_left;
 	pos = *ppos;

 	vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
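For context on where the retained checks live, this is the shape of the validation the changelog points at. It is condensed from the rw_copy_check_uvector() behavior described above; the helper name and exact body are illustrative, not verbatim kernel code:

	/* every segment is checked once, before the generic functions run */
	static ssize_t check_iovec(int access_type, const struct iovec *iov,
				   unsigned long nr_segs)
	{
		ssize_t total = 0;
		unsigned long seg;

		for (seg = 0; seg < nr_segs; seg++) {
			ssize_t len = (ssize_t) iov[seg].iov_len;

			/* negative length, or cumulative length wrapping */
			if (len < 0 || total + len < total)
				return -EINVAL;
			if (!access_ok(access_type, iov[seg].iov_base, len))
				return -EFAULT;
			total += len;
		}
		return total;
	}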
[patch] kill pointless ki_nbytes assignment in aio_setup_single_vector
io_submit_one assigns ki_left = ki_nbytes = iocb->aio_nbytes, then calls down to aio_setup_iocb and on to aio_setup_single_vector. There, ki_nbytes is reassigned to the same value it got two call levels above. There is no need to do so.

Signed-off-by: Ken Chen <[EMAIL PROTECTED]>

diff -Nurp linux-2.6.19/fs/aio.c linux-2.6.19.ken/fs/aio.c
--- linux-2.6.19/fs/aio.c	2006-12-03 17:16:52.0 -0800
+++ linux-2.6.19.ken/fs/aio.c	2006-12-03 17:19:06.0 -0800
@@ -1415,7 +1415,6 @@ static ssize_t aio_setup_single_vector(s
 	kiocb->ki_iovec->iov_len = kiocb->ki_left;
 	kiocb->ki_nr_segs = 1;
 	kiocb->ki_cur_seg = 0;
-	kiocb->ki_nbytes = kiocb->ki_left;
 	if (unlikely((ssize_t) kiocb->ki_nbytes < 0))
 		return -EINVAL;
 	return 0;

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/