RE: [PATCH 1/3] aio: fix oops because of extra IO control block freeing.

2007-03-08 Thread Ananiev, Leonid I
ZB>If we really decide to remove EIOCBRETRY support we'd get rid of all
ZB>the retry infrastructure and remove the EIOCBRETRY errno so their
ZB>builds failed.

Originally EIOCBRETRY was used in fs/read_write.c for vector IO.
And EIOCBRETRY was deleted from it after.
Now EIOCBRETRY is used in drivers/usb/gadget/inode.c only. And the patch
2/3 proposes to set iocb flag instead of EIOCBRETRY errno in inode.c.
Agree that the name kiocbSetPgBusy() could be better chosen if
kiocbSetPgBusy() is just instead for vector IO. But why other than DIO
developers must continue using EIOCBRETRY? Thay could use the same way
as fs/read_write.c uses?

ZB> I was waiting for them to arrive before responding.
Sorry. I've listed "linux-kernel@vger.kernel.org;
[EMAIL PROTECTED]" in a single (that is incorrect) thunderbird
'To' line; I have got response from Stern; and thought that mailing was
OK.

ZB>- ext3_releasepage() returns failure if it races with kjournald
ZB>holding a reference while it waits for a transaction to commit.

The patch proposes iterative synchronization way to solve this problem.

ZB>- generic_file_direct_IO() causes an oops if it clobbers -EIOCBQUEUED
ZB>with the return code from invalidate_inode_pages2_range()..-
ZB>releasepage().

After patching invalidate_inode_pages2_range() does not change return
value.

ZB>This fix makes the incorrect assertion that *any* failure from
ZB>invalidate_inode_pages2_range(), which might not have anything to do
ZB>with this race you're currently seeing, is transitory and should
ZB>trigger a retry.  That's wrong, it should be returning the error.

Just patch makes to retry iocb if page is transitory busy for any
reason.
If busy -> do retry IO operation later.
We will get correct errno if mapping was deleted or retry success if
page is dirty or committed.

Leonid
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH 2/3] aio: fix oops because of extra IO control block freeing.

2007-03-05 Thread Ananiev, Leonid I


>Where is kiocbSetPgBusy() defined,
It is defined in include/linux/aio.h (introduced in [PATCH 1/3])
>  and where in the documentation is it explained?
I will add a comment for kiocbSetPgBusy()in aio.h.

> For that matter, what is the reason for changing the return value at
all?
EIORETRY is not IO return value at all. It is processing stage.
It conflicts with IO return value. That is why a flag in iocb is used.
It should be noted that loop but not EIOCBRETRY was used for vector IO
(mm/filemap.c)

Leonid
>-Original Message-
>From: Alan Stern [mailto:[EMAIL PROTECTED]
>Sent: Monday, March 05, 2007 10:36 PM
>To: Ananiev, Leonid I
>Cc: Kernel development list
>Subject: Re: [PATCH 2/3] aio: fix oops because of extra IO control
block freeing.
>
>On Mon, 5 Mar 2007, Leonid Ananiev wrote:
>
>>  From Leonid Ananiev
>>
>> This patch finishes moving from using errno EIOCBRETRY to using flag
in
>> IO control block for aio retrying. After this change the process will
be
>> kicked for direct aio as it was for sync aio.
>>
>> Signed-off-by: Leonid Ananiev <[EMAIL PROTECTED]>
>>
>> The patch is applied to 2.6.20 or 2.6.21-rc2
>> diff -uprN -X linux-2.6.20-aio21/Documentation/dontdiff
>> linux-2.6.20-aio21/drivers/usb/gadget/inode.c
>> linux-2.6.20-aio22/drivers/usb/gadget/inode.c
>> --- linux-2.6.20-aio21/drivers/usb/gadget/inode.c2007-03-04
>> 21:45:52.0 +0300
>> +++ linux-2.6.20-aio22/drivers/usb/gadget/inode.c2007-03-05
>> 18:19:35.0 +0300
>> @@ -692,7 +692,10 @@ fail:
>>  kfree(priv);
>>  put_ep(epdata);
>>  } else
>> -value = (iv ? -EIOCBRETRY : -EIOCBQUEUED);
>> +if (iv)
>> +kiocbSetPgBusy(iocb);
>> +else
>> +value = -EIOCBQUEUED;
>>  return value;
>>   }
>>
>> diff -uprN -X linux-2.6.20-aio21/Documentation/dontdiff
>> linux-2.6.20-aio21/fs/ocfs2/dlmglue.c
linux-2.6.20-aio22/fs/ocfs2/dlmglue.c
>> --- linux-2.6.20-aio21/fs/ocfs2/dlmglue.c2007-03-04
21:45:52.0
>> +0300
>> +++ linux-2.6.20-aio22/fs/ocfs2/dlmglue.c2007-03-04
22:57:50.0
>> +0300
>> @@ -1639,7 +1639,7 @@ int ocfs2_meta_lock_full(struct inode *i
>>
>>  status = ocfs2_cluster_lock(osb, lockres, level, dlm_flags,
arg_flags);
>>  if (status < 0) {
>> -if (status != -EAGAIN && status != -EIOCBRETRY)
>> +if (status != -EAGAIN)
>>  mlog_errno(status);
>>  goto bail;
>>  }
>> diff -uprN -X linux-2.6.20-aio21/Documentation/dontdiff
>> linux-2.6.20-aio21/include/linux/aio.h
>> linux-2.6.20-aio22/include/linux/aio.h
>> --- linux-2.6.20-aio21/include/linux/aio.h   2007-03-04
21:46:45.0
>> +0300
>> +++ linux-2.6.20-aio22/include/linux/aio.h   2007-03-04
22:57:50.0
>> +0300
>> @@ -79,15 +79,6 @@ struct kioctx;
>>* not ask the method again -- ki_retry must ensure forward
progress.
>>* aio_complete() must be called once and only once in the future,
>> multiple
>>* calls may result in undefined behaviour.
>> - *
>> - * If ki_retry returns -EIOCBRETRY it has made a promise that
kick_iocb()
>> - * will be called on the kiocb pointer in the future.  This may
happen
>> - * through generic helpers that associate kiocb->ki_wait with a wait
>> - * queue head that ki_retry uses via current->io_wait.  It can also
happen
>> - * with custom tracking and manual calls to kick_iocb(), though that
is
>> - * discouraged.  In either case, kick_iocb() must be called once and
only
>> - * once.  ki_retry must ensure forward progress, the AIO core will
wait
>> - * indefinitely for kick_iocb() to be called.
>>*/
>>   struct kiocb {
>>  struct list_headki_run_list;
>> diff -uprN -X linux-2.6.20-aio21/Documentation/dontdiff
>> linux-2.6.20-aio21/include/linux/errno.h
>> linux-2.6.20-aio22/include/linux/errno.h
>> --- linux-2.6.20-aio21/include/linux/errno.h 2007-03-04
>> 21:45:51.0 +0300
>> +++ linux-2.6.20-aio22/include/linux/errno.h 2007-03-04
>> 22:57:50.0 +0300
>> @@ -22,7 +22,6 @@
>>   #define EBADTYPE   527 /* Type not supported by server */
>>   #define EJUKEBOX   528 /* Request initiated, but will not
complete
>> before timeout */
>>   #define EIOCBQUEUED529 /* iocb queued, will get
completion event */
>> -#define EIOCBRETRY  530 /* iocb queued, will trigger a retry */
>>
>>   #endif
>
>Where is kiocbSetPgBusy() defined,

Cfq first request patch changes IO performance

2007-02-26 Thread Ananiev, Leonid I
The patch "cfq-iosched: defer slice activation to first request being
active"
http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=c
ommitdiff;h=44f7c16065c83060cbb9dd9b367141682a6e2b8e;hp=99f9628aba4d8fb3
b8d955c9efded0d0a1995fad

Sysbench fileio (random read/write mix) -15%
Tiobench sequential write   -3%
Tiobench random write   -5%
Iozone record re-write  -21%
Iozone mmap sequential read -20%
Iozone mmap random write+45%
Aio-stress direct sequential read   +29%
For other 3 tiobench, 7 aiostress and 13 iozone tests the result changes
are less than variation.

Considered patch does not impact on performance if all threads have
equal IO activity (tiobench).
If single thread has IO activity the performance becomes unstable and is
decreased mainly.

Leonid
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to completion event

2007-02-21 Thread Ananiev, Leonid I
> kjournald submited buffers for IO and waiting for them to finish.

It is means that the patch incorrectly moves internal kernel
synchronization
problem into user space as EIO instead of fixing a root cause or perform
iterative
synchronization.
After patching users will be surprised a lot of EIO
(remember 47% EIO in aio-stress runs after patching) will buy new
disk...
This patch is harmful.

Leonid
>-Original Message-
>From: Zach Brown [mailto:[EMAIL PROTECTED]
>Sent: Wednesday, February 21, 2007 9:25 PM
>To: Ken Chen
>Cc: Ananiev, Leonid I; Chris Mason; linux-aio; Linux Kernel Mailing
List; Benjamin LaHaise; Suparna
>bhattacharya; Andrew Morton; Badari Pulavarty
>Subject: Re: [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to
completion event
>
>
>On Feb 21, 2007, at 12:35 AM, Ken Chen wrote:
>
>> On 2/20/07, Ananiev, Leonid <[EMAIL PROTECTED]> wrote:
>>> 1) mem=1G in kernel boot param if you have more
>>> 2) unmount; mk2fs; mount
>>> 3) dd if=/dev/zero of= bs=1M count=1200
>>> 4) aiostress -s 1200m -O -o 2 -i 1 -r 16k 
>>> 5) if i++<50 goto 2).
>>
>> Would you please instrument the call chain of
>> invalidate_complete_page2() and tell us exactly where it returns zero
>> value in your failure case?
>>
>>   invalidate_complete_page2
>>  try_to_release_page
>> ext3_releasepage
>>journal_try_to_free_buffers
>>   ???
>
>For what it's worth, Badari has explained this race in the past in a
>credible way.  I'll take the liberty of pasting a mail from him:
>
>"
>kjournald submited buffers for IO and waiting for them to finish.
>Note that it has a ref. against the buffer.
>
>journal_commit_transaction()
> ...
> submited buffers for IO
> /* Waiting for IO to complete */
> while (commit_transaction->t_locked_list) {
> ...
> get_bh(bh);
> if (buffer_locked(bh)) {
> spin_unlock(&journal->j_list_lock);
> wait_on_buffer(bh);  <<<<<<
> spin_lock(&journal->j_list_lock);
> }
>
> ..
> put_bh(bh);
> }
>
>Now, DIO process comes to frees the jh through
>journal_try_to_free_buffers()
>but fails to drop_buffers() since kjournald() has a reference against
>it.
>invalidate_inode_pages2_range()
> ..
> ext3_releasepage()
> journal_try_to_free_buffers()
> journal_put_journal_head()
> __journal_try_to_free_buffer()
> <--- freed jh
>
> try_to_free_buffers()
> drop_buffers()
> if (buffer_busy(bh))
> goto failed;
>   <<--- returns EIO due to
>b_count
>
>"
>
>I don't mean to say that we shouldn't get traces to confirm the
>theory, just sharing.  And now we can point to this in the archives
>next time :).
>
>- z
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to completion event

2007-02-21 Thread Ananiev, Leonid I
> where it returns zero 
I've wrote in the mail http://lkml.org/lkml/2007/2/8/337
invalidate_inode_pages2_range() reports BUG:
warning at mm/truncate.c:398  occurs becouse of
invalidate_complete_page2()returns 0; it returns 0 because of
try_to_release_page()  returns 0; it returns 0 because of
ext3_releasepage() returns 0; it returns 0 because of
journal_try_to_free_buffers()  returns 0; it returns 0 because of
try_to_free_buffers()  returns 0; it returns 0 because of
drop_buffers() returns 0; it returns 0 because of
buffer_busy(bh)returns 1; it returns 0 because of
buffer_head count is 1 (bh->b_count==1) as additional printk reports.

Leonid
-Original Message-
From: Ken Chen [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, February 21, 2007 11:36 AM
To: Ananiev, Leonid I
Cc: Chris Mason; Zach Brown; [EMAIL PROTECTED];
linux-kernel@vger.kernel.org; Benjamin LaHaise; Suparna bhattacharya;
Andrew Morton
Subject: Re: [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to
completion event

On 2/20/07, Ananiev, Leonid <[EMAIL PROTECTED]> wrote:
> 1) mem=1G in kernel boot param if you have more
> 2) unmount; mk2fs; mount
> 3) dd if=/dev/zero of= bs=1M count=1200
> 4) aiostress -s 1200m -O -o 2 -i 1 -r 16k 
> 5) if i++<50 goto 2).

Would you please instrument the call chain of
invalidate_complete_page2() and tell us exactly where it returns zero
value in your failure case?

   invalidate_complete_page2
  try_to_release_page
 ext3_releasepage
journal_try_to_free_buffers
   ???
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to completion event

2007-02-20 Thread Ananiev, Leonid I
-O I've missed:
aiostress -s 1200m -O -o 2 -i 1 -r 16k 

You are right I've used harness scripts.
i=0; while ((i++<50)); do ~/bm/bin/runs I2 -; done &

It runs bmrun harness script which is long for different hardware
configurations and test options.
The lines in aio-stress wrapper which could be useful for you are:
gcc aiostress.c -Wall -o aiostress -lpthread -laio
2>>${file_log}
...
sudo umount -f ${pdisk} ${ppath} > /dev/null 2>&1
sudo /sbin/mke2fs -j ${pdisk} > /dev/null 2>>${file_log}
sudo mount ${pdisk} ${ppath} > /dev/null 2>>${file_log}...
export LD_LIBRARY_PATH=/lib:/usr/lib:$LD_LIBRARY_PATH
all_files_size=1200 # in megabytes
...
else
num_disks=1
((file_size = all_files_size / num_disks))
testfiles=${ppath}/test/testfile
dd if=/dev/zero of=$testfiles bs=1M count=${file_size} >
/dev/null
fi
aiostress ${*} -v -o 2 -s ${file_size}m -i 1 -r 16k $testfiles >
${file_res} 2>&1
if [ "$1" == "-S" ] ; then
info="O_SYNC"
elif [ "$1" == "-O" ] ; then
info="O_DIRECT"
fi
awk  ${file_res}

Which does 
1) mem=1G in kernel boot param if you have more
2) unmount; mk2fs; mount
3) dd if=/dev/zero of= bs=1M count=1200
4) aiostress -s 1200m -O -o 2 -i 1 -r 16k 
5) if i++<50 goto 2).

Leonid
-Original Message-
From: Chris Mason [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, February 20, 2007 8:55 PM
To: Ananiev, Leonid I
Cc: Zach Brown; [EMAIL PROTECTED]; linux-kernel@vger.kernel.org;
Benjamin LaHaise; Suparna bhattacharya; Andrew Morton
Subject: Re: [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to
completion event

On Tue, Feb 20, 2007 at 08:17:51PM +0300, Ananiev, Leonid I wrote:
> aio-stress command lines used for test
> 1) mem=1G in kernel boot param if you have more
> 2) mk2fs for test_file
> 3) dd if=/dev/zero of= bs=1M count=1200
> 4) aiostress -s 1200m -o 2 -i 1 -r 16k 
> 

Sorry, this aio-stress command line doesn't do O_DIRECT.  Do you have
this bundled into a test script that I can try?

-chris
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to completion event

2007-02-20 Thread Ananiev, Leonid I
There is change in the patch which is uncommented in preface.
Now aio_ring_insert_entry() is not called if req->ki_users>=1.
Before was called.
Could you comment it?

Leonid
-Original Message-
From: Zach Brown [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, February 20, 2007 12:39 AM
To: [EMAIL PROTECTED]; linux-kernel@vger.kernel.org
Cc: Benjamin LaHaise; Suparna bhattacharya; Andrew Morton; Ananiev,
Leonid I
Subject: [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to
completion event

aio: propogate post-EIOCBQUEUED errors to completion event

This addresses an oops reported by Leonid Ananiev
<[EMAIL PROTECTED]>
as archived at http://lkml.org/lkml/2007/2/8/337.

O_DIRECT kicks off bios and returns -EIOCBQUEUED to indicate its
intention to
call aio_complete() once the bios complete.   As we return from
submission we
must preserve the -EIOCBQUEUED return code so that fs/aio.c knows to let
the
bio completion call aio_complete().  This stops us from returning errors
after
O_DIRECT submission.

But we have a few places that are very interested in generating errors
after
bio submission.

The most critical of these is invalidating the page cache after a write.
This
avoids exposing stale data to buffered operations that are performed
after the
O_DIRECT write succeeds.  We must do this after submission because the
user
buffer might have been an mmap()ed buffer of the region being written
to.  The
get_user_pages() in the O_DIRECT completion path could have faulted in
stale
data.

So this patch introduces a helper, aio_propogate_error(), which queues
post-submission errors in the iocb so that they are given to the user
completion event when aio_complete() is finally called.

To get this working we change the aio_complete() path so that the ring
insertion is performed as the final iocb reference is dropped.  This
gives the
submission path time to queue its pending error before it drops its
reference.
This increases the space in the iocb as it has to record the two result
codes
from aio_complete() and the pending error from the submission path.

This was tested by running O_DIRECT aio-stress concurrently with
buffered reads
while triggering EIO in invalidate_inode_pages2_range() with the help of
a
debugfs bool hack.  Previously the kernel would oops as fs/aio.c and bio
completion both called aio_complete().  With this patch aio-stress sees
-EIO.

Signed-off-by: Zach Brown <[EMAIL PROTECTED]>
---

 fs/aio.c|   49 +-
 include/linux/aio.h |   30 +
 mm/filemap.c|4 +--
 3 files changed, 57 insertions(+), 26 deletions(-)

--- a/fs/aio.c  Mon Feb 19 13:12:20 2007 -0800
+++ b/fs/aio.c  Mon Feb 19 13:16:00 2007 -0800
@@ -193,8 +193,7 @@ static int aio_setup_ring(struct kioctx 
kunmap_atomic((void *)((unsigned long)__event & PAGE_MASK), km);
\
 } while(0)
 
-static void aio_ring_insert_entry(struct kioctx *ctx, struct kiocb
*iocb,
- long res, long res2)
+static void aio_ring_insert_entry(struct kioctx *ctx, struct kiocb
*iocb)
 {
struct aio_ring_info*info;
struct aio_ring *ring;
@@ -213,12 +212,12 @@ static void aio_ring_insert_entry(struct
 
event->obj = (u64)(unsigned long)iocb->ki_obj.user;
event->data = iocb->ki_user_data;
-   event->res = res;
-   event->res2 = res2;
-
-   dprintk("aio_complete: %p[%lu]: %p: %p %Lx %lx %lx\n",
+   event->res = iocb->ki_pending_err ? iocb->ki_pending_err :
iocb->ki_res;
+   event->res2 = iocb->ki_res2;
+
+   dprintk("aio_complete: %p[%lu]: %p: %p %Lx %d %lx %lx\n",
ctx, tail, iocb, iocb->ki_obj.user, iocb->ki_user_data,
-   res, res2);
+   iocb->ki_pending_err, iocb->ki_res, iocb->ki_res2);
 
/* after flagging the request as done, we
 * must never even look at it again
@@ -459,6 +458,7 @@ static struct kiocb fastcall *__aio_get_
req->ki_cancel = NULL;
req->ki_retry = NULL;
req->ki_dtor = NULL;
+   req->ki_pending_err = 0;
req->private = NULL;
req->ki_iovec = NULL;
INIT_LIST_HEAD(&req->ki_run_list);
@@ -548,10 +548,14 @@ static int __aio_put_req(struct kioctx *
 
assert_spin_locked(&ctx->ctx_lock);
 
-   req->ki_users --;
+   req->ki_users--;
BUG_ON(req->ki_users < 0);
if (likely(req->ki_users))
return 0;
+
+   if (kiocbIsInserted(req))
+   aio_ring_insert_entry(ctx, req);
+
list_del(&req->ki_list);/* remove from
active_reqs */
req->ki_cancel = NULL;
req->ki_retry = NULL;
@@ -983,27 +987,24 @@ int fastcall aio_complete(struct kiocb *
return 1;
}
 
-   /* add a completion event to the ring buffer.
-

RE: [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to completion event

2007-02-20 Thread Ananiev, Leonid I
aio-stress command lines used for test
1) mem=1G in kernel boot param if you have more
2) mk2fs for test_file
3) dd if=/dev/zero of= bs=1M count=1200
4) aiostress -s 1200m -o 2 -i 1 -r 16k 

Leonid
-Original Message-
From: Chris Mason [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, February 20, 2007 8:04 PM
To: Ananiev, Leonid I
Cc: Zach Brown; [EMAIL PROTECTED]; linux-kernel@vger.kernel.org;
Benjamin LaHaise; Suparna bhattacharya; Andrew Morton
Subject: Re: [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to
completion event

On Tue, Feb 20, 2007 at 07:57:49PM +0300, Ananiev, Leonid I wrote:
> Zach> This addresses an oops reported by Leonid Ananiev
> <[EMAIL PROTECTED]>
> Zach> as archived at http://lkml.org/lkml/2007/2/8/337.
> 
> Zach> This was tested by running O_DIRECT aio-stress concurrently with
> buffered reads
> 
> The oops was with aio-stress only running in the loop
> WITHOUT buffered or mmaped IO which are patched and discussed now.
> Actually 47% aio is finished with EIO after patch.

I looked through the thread and couldn't find the aio-stress command
line used for the whole test.  Could you please post it here?

If aio+dio is being used to extend the file, you'll get pages in the
page cache, which is hopefully why you're able to trigger the EIOs.

-chris
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to completion event

2007-02-20 Thread Ananiev, Leonid I
Zach> This addresses an oops reported by Leonid Ananiev
<[EMAIL PROTECTED]>
Zach> as archived at http://lkml.org/lkml/2007/2/8/337.

Zach> This was tested by running O_DIRECT aio-stress concurrently with
buffered reads

The oops was with aio-stress only running in the loop
WITHOUT buffered or mmaped IO which are patched and discussed now.
Actually 47% aio is finished with EIO after patch.

Leonnid

-Original Message-
From: Zach Brown [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, February 20, 2007 12:39 AM
To: [EMAIL PROTECTED]; linux-kernel@vger.kernel.org
Cc: Benjamin LaHaise; Suparna bhattacharya; Andrew Morton; Ananiev,
Leonid I
Subject: [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to
completion event

aio: propogate post-EIOCBQUEUED errors to completion event

This addresses an oops reported by Leonid Ananiev
<[EMAIL PROTECTED]>
as archived at http://lkml.org/lkml/2007/2/8/337.

O_DIRECT kicks off bios and returns -EIOCBQUEUED to indicate its
intention to
call aio_complete() once the bios complete.   As we return from
submission we
must preserve the -EIOCBQUEUED return code so that fs/aio.c knows to let
the
bio completion call aio_complete().  This stops us from returning errors
after
O_DIRECT submission.

But we have a few places that are very interested in generating errors
after
bio submission.

The most critical of these is invalidating the page cache after a write.
This
avoids exposing stale data to buffered operations that are performed
after the
O_DIRECT write succeeds.  We must do this after submission because the
user
buffer might have been an mmap()ed buffer of the region being written
to.  The
get_user_pages() in the O_DIRECT completion path could have faulted in
stale
data.

So this patch introduces a helper, aio_propogate_error(), which queues
post-submission errors in the iocb so that they are given to the user
completion event when aio_complete() is finally called.

To get this working we change the aio_complete() path so that the ring
insertion is performed as the final iocb reference is dropped.  This
gives the
submission path time to queue its pending error before it drops its
reference.
This increases the space in the iocb as it has to record the two result
codes
from aio_complete() and the pending error from the submission path.

This was tested by running O_DIRECT aio-stress concurrently with
buffered reads
while triggering EIO in invalidate_inode_pages2_range() with the help of
a
debugfs bool hack.  Previously the kernel would oops as fs/aio.c and bio
completion both called aio_complete().  With this patch aio-stress sees
-EIO.

Signed-off-by: Zach Brown <[EMAIL PROTECTED]>
---

 fs/aio.c|   49 +-
 include/linux/aio.h |   30 +
 mm/filemap.c|4 +--
 3 files changed, 57 insertions(+), 26 deletions(-)

--- a/fs/aio.c  Mon Feb 19 13:12:20 2007 -0800
+++ b/fs/aio.c  Mon Feb 19 13:16:00 2007 -0800
@@ -193,8 +193,7 @@ static int aio_setup_ring(struct kioctx 
kunmap_atomic((void *)((unsigned long)__event & PAGE_MASK), km);
\
 } while(0)
 
-static void aio_ring_insert_entry(struct kioctx *ctx, struct kiocb
*iocb,
- long res, long res2)
+static void aio_ring_insert_entry(struct kioctx *ctx, struct kiocb
*iocb)
 {
struct aio_ring_info*info;
struct aio_ring *ring;
@@ -213,12 +212,12 @@ static void aio_ring_insert_entry(struct
 
event->obj = (u64)(unsigned long)iocb->ki_obj.user;
event->data = iocb->ki_user_data;
-   event->res = res;
-   event->res2 = res2;
-
-   dprintk("aio_complete: %p[%lu]: %p: %p %Lx %lx %lx\n",
+   event->res = iocb->ki_pending_err ? iocb->ki_pending_err :
iocb->ki_res;
+   event->res2 = iocb->ki_res2;
+
+   dprintk("aio_complete: %p[%lu]: %p: %p %Lx %d %lx %lx\n",
ctx, tail, iocb, iocb->ki_obj.user, iocb->ki_user_data,
-   res, res2);
+   iocb->ki_pending_err, iocb->ki_res, iocb->ki_res2);
 
/* after flagging the request as done, we
 * must never even look at it again
@@ -459,6 +458,7 @@ static struct kiocb fastcall *__aio_get_
req->ki_cancel = NULL;
req->ki_retry = NULL;
req->ki_dtor = NULL;
+   req->ki_pending_err = 0;
req->private = NULL;
req->ki_iovec = NULL;
INIT_LIST_HEAD(&req->ki_run_list);
@@ -548,10 +548,14 @@ static int __aio_put_req(struct kioctx *
 
assert_spin_locked(&ctx->ctx_lock);
 
-   req->ki_users --;
+   req->ki_users--;
BUG_ON(req->ki_users < 0);
if (likely(req->ki_users))
return 0;
+
+   if (kiocbIsInserted(req))
+   aio_ring_insert_entry(ctx, req);
+
list_del(&req->ki_list);/* remove from
activ

RE: [PATCH] aio: propogate post-EIOCBQUEUED errors to completion event

2007-02-20 Thread Ananiev, Leonid I
I have applied the Brown's patch
and run aio-stress O_DIRECT random write 100 times
There is 4% throughput degradation.

47 of 100 aio-stress runs reported an EIO.
1 aio-stress error output example:

adding stage random write
starting with random write
file size 1200MB, record size 16KB, depth 64, ios per iteration 1
max io_submit 1, buffer alignment set to 4KB
threads 1 files 1 contexts 1 context offset 2MB verification on
adding file /mnt/stp/test/testfile thread 0
io err 18446744073709551611 (Input/output error) op 1, size 16384
io err 18446744073709551611 (Input/output error) op 1, size 16384
io err 18446744073709551611 (Input/output error) op 1, size 16384
latency min 0.01 avg 0.16 max 6923.40
 76784 < 100 13 < 250 2 < 500 0 < 1000 0 < 5000 1 < 1
completion latency min 0.21 avg 153.07 max 1103.15
 34330 < 100 26669 < 250 13653 < 500 2082 < 1000 2 < 5000 0 <
1
random write on /mnt/stp/test/testfile (6.06 MB/s) 1200.00 MB in 198.03s
3 errors on oper, last 4294967291
thread 0 random write totals (6.06 MB/s) 1200.00 MB in 198.04s
3 errors on oper, last 4294967291

Leonid
-Original Message-
From: Zach Brown [mailto:[EMAIL PROTECTED] 
Sent: Monday, February 19, 2007 11:35 PM
To: [EMAIL PROTECTED]; linux-kernel@vger.kernel.org
Cc: Benjamin LaHaise; Suparna bhattacharya; Andrew Morton; Ananiev,
Leonid I
Subject: [PATCH] aio: propogate post-EIOCBQUEUED errors to completion
event

aio: propogate post-EIOCBQUEUED errors to completion event

This addresses an oops reported by Leonid Ananiev
<[EMAIL PROTECTED]>
as archived at http://lkml.org/lkml/2007/2/8/337.

O_DIRECT kicks off bios and returns -EIOCBQUEUED to indicate its
intention to
call aio_complete() once the bios complete.   As we return from
submission we
must preserve the -EIOCBQUEUED return code so that fs/aio.c knows to let
the
bio completion call aio_complete().  This stops us from returning errors
after
O_DIRECT submission.

But we have a few places that are very interested in generating errors
after
bio submission.

The most critical of these is invalidating the page cache after a write.
This
avoids exposing stale data to buffered operations that are performed
after the
O_DIRECT write succeeds.  We must do this after submission because the
user
buffer might have been an mmap()ed buffer of the region being written
to.  The
get_user_pages() in the O_DIRECT completion path could have faulted in
stale
data.

So this patch introduces a helper, aio_propogate_error(), which queues
post-submission errors in the iocb so that they are given to the user
completion event when aio_complete() is finally called.

To get this working we change the aio_complete() path so that the ring
insertion is performed as the final iocb reference is dropped.  This
gives the
submission path time to queue its pending error before it drops its
reference.
This increases the space in the iocb as it has to record the two result
codes
from aio_complete() and the pending error from the submission path.

This was tested by running O_DIRECT aio-stress concurrently with
buffered reads
while triggering EIO in invalidate_inode_pages2_range() with the help of
a
debugfs bool hack.  Previously the kernel would oops as fs/aio.c and bio
completion both called aio_complete().  With this patch aio-stress sees
-EIO.

Signed-off-by: Zach Brown <[EMAIL PROTECTED]>
---

 fs/aio.c|  109 ++
 include/linux/aio.h |   30 +++
 mm/filemap.c|4 -
 3 files changed, 91 insertions(+), 52 deletions(-)

--- a/fs/aio.c  Mon Feb 19 11:52:27 2007 -0800
+++ b/fs/aio.c  Mon Feb 19 12:32:52 2007 -0800
@@ -192,6 +192,46 @@ static int aio_setup_ring(struct kioctx 
(void)__event;  \
kunmap_atomic((void *)((unsigned long)__event & PAGE_MASK), km);
\
 } while(0)
+
+static void aio_ring_insert_entry(struct kioctx *ctx, struct kiocb
*iocb)
+{
+   struct aio_ring_info*info;
+   struct aio_ring *ring;
+   struct io_event *event;
+   unsigned long   tail;   
+
+   assert_spin_locked(&ctx->ctx_lock);
+   
+   info = &ctx->ring_info;
+   ring = kmap_atomic(info->ring_pages[0], KM_IRQ1);
+
+   tail = info->tail;
+   event = aio_ring_event(info, tail, KM_IRQ0);
+   if (++tail >= info->nr)
+   tail = 0;
+
+   event->obj = (u64)(unsigned long)iocb->ki_obj.user;
+   event->data = iocb->ki_user_data;
+   event->res = iocb->ki_pending_err ? iocb->ki_pending_err :
iocb->ki_res;
+   event->res2 = iocb->ki_res2;
+
+   dprintk("aio_complete: %p[%lu]: %p: %p %Lx %d %lx %lx\n",
+   ctx, tail, iocb, iocb->ki_obj.user, iocb->ki_user_data,
+   iocb->ki_pending_err, iocb->ki_res, iocb->ki_res2);
+
+   /* after flagging the requ

RE: [PATCH] aio: propogate post-EIOCBQUEUED errors to completion event

2007-02-19 Thread Ananiev, Leonid I
> while triggering EIO in invalidate_inode_pages2_range() 
...
> With this patch aio-stress sees -EIO.

Actually if invalidate_inode_pages2_range() returns EIO it means
that internal kernel synchronization conflict was happen.
It is reported to user as hardware IO error.
Iteration in synchronization process could be performed instead.

Leonid
-Original Message-
From: Zach Brown [mailto:[EMAIL PROTECTED] 
Sent: Monday, February 19, 2007 11:35 PM
To: [EMAIL PROTECTED]; linux-kernel@vger.kernel.org
Cc: Benjamin LaHaise; Suparna bhattacharya; Andrew Morton; Ananiev,
Leonid I
Subject: [PATCH] aio: propogate post-EIOCBQUEUED errors to completion
event

aio: propogate post-EIOCBQUEUED errors to completion event

This addresses an oops reported by Leonid Ananiev
<[EMAIL PROTECTED]>
as archived at http://lkml.org/lkml/2007/2/8/337.

O_DIRECT kicks off bios and returns -EIOCBQUEUED to indicate its
intention to
call aio_complete() once the bios complete.   As we return from
submission we
must preserve the -EIOCBQUEUED return code so that fs/aio.c knows to let
the
bio completion call aio_complete().  This stops us from returning errors
after
O_DIRECT submission.

But we have a few places that are very interested in generating errors
after
bio submission.

The most critical of these is invalidating the page cache after a write.
This
avoids exposing stale data to buffered operations that are performed
after the
O_DIRECT write succeeds.  We must do this after submission because the
user
buffer might have been an mmap()ed buffer of the region being written
to.  The
get_user_pages() in the O_DIRECT completion path could have faulted in
stale
data.

So this patch introduces a helper, aio_propogate_error(), which queues
post-submission errors in the iocb so that they are given to the user
completion event when aio_complete() is finally called.

To get this working we change the aio_complete() path so that the ring
insertion is performed as the final iocb reference is dropped.  This
gives the
submission path time to queue its pending error before it drops its
reference.
This increases the space in the iocb as it has to record the two result
codes
from aio_complete() and the pending error from the submission path.

This was tested by running O_DIRECT aio-stress concurrently with
buffered reads
while triggering EIO in invalidate_inode_pages2_range() with the help of
a
debugfs bool hack.  Previously the kernel would oops as fs/aio.c and bio
completion both called aio_complete().  With this patch aio-stress sees
-EIO.

Signed-off-by: Zach Brown <[EMAIL PROTECTED]>
---

 fs/aio.c|  109 ++
 include/linux/aio.h |   30 +++
 mm/filemap.c|4 -
 3 files changed, 91 insertions(+), 52 deletions(-)

--- a/fs/aio.c  Mon Feb 19 11:52:27 2007 -0800
+++ b/fs/aio.c  Mon Feb 19 12:32:52 2007 -0800
@@ -192,6 +192,46 @@ static int aio_setup_ring(struct kioctx 
(void)__event;  \
kunmap_atomic((void *)((unsigned long)__event & PAGE_MASK), km);
\
 } while(0)
+
+static void aio_ring_insert_entry(struct kioctx *ctx, struct kiocb
*iocb)
+{
+   struct aio_ring_info*info;
+   struct aio_ring *ring;
+   struct io_event *event;
+   unsigned long   tail;   
+
+   assert_spin_locked(&ctx->ctx_lock);
+   
+   info = &ctx->ring_info;
+   ring = kmap_atomic(info->ring_pages[0], KM_IRQ1);
+
+   tail = info->tail;
+   event = aio_ring_event(info, tail, KM_IRQ0);
+   if (++tail >= info->nr)
+   tail = 0;
+
+   event->obj = (u64)(unsigned long)iocb->ki_obj.user;
+   event->data = iocb->ki_user_data;
+   event->res = iocb->ki_pending_err ? iocb->ki_pending_err :
iocb->ki_res;
+   event->res2 = iocb->ki_res2;
+
+   dprintk("aio_complete: %p[%lu]: %p: %p %Lx %d %lx %lx\n",
+   ctx, tail, iocb, iocb->ki_obj.user, iocb->ki_user_data,
+   iocb->ki_pending_err, iocb->ki_res, iocb->ki_res2);
+
+   /* after flagging the request as done, we
+* must never even look at it again
+*/
+   smp_wmb();  /* make event visible before updating tail */
+
+   info->tail = tail;
+   ring->tail = tail;
+
+   put_aio_ring_event(event, KM_IRQ0);
+   kunmap_atomic(ring, KM_IRQ1);
+
+   pr_debug("added to ring %p at [%lu]\n", iocb, tail);
+}
 
 /* ioctx_alloc
  * Allocates and initializes an ioctx.  Returns an ERR_PTR if it
failed.
@@ -418,6 +458,7 @@ static struct kiocb fastcall *__aio_get_
req->ki_cancel = NULL;
req->ki_retry = NULL;
req->ki_dtor = NULL;
+   req->ki_pending_err = 0;
req->private = NULL;
req->ki_iovec = NULL;
INIT_LIST_HEAD(&req->ki_run_list);
@@ -507,10 +548,14 @@ static int __aio_put_req

RE: [PATCH] aio: fix kernel bug when page is temporally busy

2007-02-16 Thread Ananiev, Leonid I
Zach Brown wrote:
> It will not return until kiocbSetKicked() is called,  
> and that is only called from kick_iocb().
There is no test or wait of Kicked in considered
for (;;) aio_write() loop.

Zach Brown wrote:
>> The proposed patch does not crate this bug if any.
> Right, and I said that in the mail you're quoting.
...
> You're introducing other bugs with the patch.
Could you list other introduced bugs?

>> It is interesting that I've not seen any EIOCBQUEUED returned
>> to aio_run_iocb() during 5 hours aiostress running.
> What arguments are you running aio-stress with?  -EIOCBQUEUED is only

> used for O_DIRECT
I wrote in the vary first mail that the panic is appearing in
random write O_DIRECT aio-stress running. Other aio-stress modes
where tested after patching as well.

> and then only in certain circumstances.
Looking closely into sources we can see that
EIOCBQUEUED never may be returned to aio_run_iocb().

include/linux/aio.h says
* If ki_retry returns -EIOCBRETRY ...
Could you point source line which "returns -EIOCBRETRY"?

Leonid
-Original Message-
From: Zach Brown [mailto:[EMAIL PROTECTED] 
Sent: Friday, February 16, 2007 3:01 AM
To: Ananiev, Leonid I
Cc: Ken Chen; [EMAIL PROTECTED]; Andrew Morton;
linux-kernel@vger.kernel.org; linux-aio; Chris Mason
Subject: Re: [PATCH] aio: fix kernel bug when page is temporally busy


On Feb 15, 2007, at 3:32 PM, Ananiev, Leonid I wrote:

>>> If EIOCBRETRY then generic_file_aio_write() will be recalled for the
>>> same iocb.
>> Only if kick_iocb() is called.  It won't be called if i_i_p2_r() was
>> the only thing to return -EIOCBRETRY.
> It is not need to call kick_iocb()
> for generic_file_aio_write() calling.
> It is recalled without any wakeup waiting:
> for (;;) {
> ret = filp->f_op->aio_write(&kiocb, &iov, 1,
> kiocb.ki_pos);
> if (ret != -EIOCBRETRY)
> break;
> wait_on_retry_sync_kiocb(&kiocb);
> }
> Note: wait_on_retry_sync_kiocb() does not wait.

Yes it does.  It will not return until kiocbSetKicked() is called,  
and that is only called from kick_iocb().

>>> It overwrites -EIOCBQUEUED
> Do you mean that there is one more kernel bug which
> overwrites -EIOCBQUEUED by any errno or number of bytes and this
> new value is returned to caller as an IO result
> while IO is not finished yet.
>
> The proposed patch does not crate this bug if any.

Right, and I said that in the mail you're quoting.

> It actually fixes a kernel panic bag when iocb.users count becomes
> incorrect. The bag " Kernel BUG at fs/aio.c:509" is there because
> aio_run_iocb() have not a chance to differ real EIO and
> EIO which is actually means EAGAYN or EIOCBRETRY.

Yes, I understand the bug you're trying to fix.  You're introducing  
other bugs with the patch.  It will not be merged.

> It is interesting that I've not seen any EIOCBQUEUED returned
> to aio_run_iocb() during 5 hours aiostress running.

What arguments are you running aio-stress with?  -EIOCBQUEUED is only  
used for O_DIRECT, and then only in certain circumstances.

- z
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH] aio: fix kernel bug when page is temporally busy

2007-02-15 Thread Ananiev, Leonid I
>> If EIOCBRETRY then generic_file_aio_write() will be recalled for the
>> same iocb.
> Only if kick_iocb() is called.  It won't be called if i_i_p2_r() was  
> the only thing to return -EIOCBRETRY.
It is not need to call kick_iocb()
for generic_file_aio_write() calling.
It is recalled without any wakeup waiting:
for (;;) {
ret = filp->f_op->aio_write(&kiocb, &iov, 1,
kiocb.ki_pos);
if (ret != -EIOCBRETRY)
break;
wait_on_retry_sync_kiocb(&kiocb);
}
Note: wait_on_retry_sync_kiocb() does not wait.
That is for dio. For aio iocb generic_file_aio_write() call
is required from ki_run_list while next io_submit or
read_events() is called.
So when an IO hang may happen?

>> It overwrites -EIOCBQUEUED
Do you mean that there is one more kernel bug which
overwrites -EIOCBQUEUED by any errno or number of bytes and this
new value is returned to caller as an IO result
while IO is not finished yet.

The proposed patch does not crate this bug if any.
It actually fixes a kernel panic bag when iocb.users count becomes
incorrect. The bag " Kernel BUG at fs/aio.c:509" is there because
aio_run_iocb() have not a chance to differ real EIO and
EIO which is actually means EAGAYN or EIOCBRETRY.
I'me sure the patch changes source code in correct direction:
to differentiate that two kinds of EIOs.

> Have you read the giant comment over the definition of struct kiocb  
> in include/linux/aio.h?
I have read. But compiler has not: it did not create an object code for
* If ki_retry returns -EIOCBRETRY ...

>>> This can lead to reference count confusion.
>> But just reference count confusion was deleted by patch. Isn't it?
>Sorry, I don't understand what you're trying to ask here.
One of reference count iocb.users confusion is deleted by the patch.
I'm not sure that there is other bag.

At least I have not see IO hang while testing.
It is interesting that I've not seen any EIOCBQUEUED returned
to aio_run_iocb() during 5 hours aiostress running.
Does it mean that EIOCBQUEUED is always reset and never returned?

Leonid

-Original Message-----
From: Zach Brown [mailto:[EMAIL PROTECTED] 
Sent: Thursday, February 15, 2007 10:23 PM
To: Ananiev, Leonid I
Cc: Ken Chen; [EMAIL PROTECTED]; Andrew Morton;
linux-kernel@vger.kernel.org; linux-aio; Chris Mason
Subject: Re: [PATCH] aio: fix kernel bug when page is temporally busy


On Feb 15, 2007, at 11:11 AM, Ananiev, Leonid I wrote:

>> It returns -EIOCBRETRY without guaranteeing that kick_iocb() will be
>> called.  This can lead to operations hanging
>
> If EIOCBRETRY then generic_file_aio_write() will be recalled for the
> same iocb.

Only if kick_iocb() is called.  It won't be called if i_i_p2_r() was  
the only thing to return -EIOCBRETRY.

>
>> It overwrites -EIOCBQUEUED, leading to an aio_complete() while a
>> retry is happening.
>
> EIOCBQUEUED or EIOCBRETRY does not lead to aio_complete() call:

Not by fs/aio.c, but *by the place that originated -EIOCBQUEUED*.   
Later.  After IO has completed.  see fs/direct-io.c:dio_bio_end_aio().

This is what -EIOCBQUEUED means!  It's a promise to call aio_complete 
() in the future.

Have you read the giant comment over the definition of struct kiocb  
in include/linux/aio.h?

>> This can lead to reference count confusion.
> But just reference count confusion was deleted by patch. Isn't it?

Sorry, I don't understand what you're trying to ask here.

- z
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH] aio: fix kernel bug when page is temporally busy

2007-02-15 Thread Ananiev, Leonid I


-Original Message-
From: Zach Brown [mailto:[EMAIL PROTECTED] 
Sent: Thursday, February 15, 2007 10:23 PM
To: Ananiev, Leonid I
Cc: Ken Chen; [EMAIL PROTECTED]; Andrew Morton;
linux-kernel@vger.kernel.org; linux-aio; Chris Mason
Subject: Re: [PATCH] aio: fix kernel bug when page is temporally busy


On Feb 15, 2007, at 11:11 AM, Ananiev, Leonid I wrote:

>> It returns -EIOCBRETRY without guaranteeing that kick_iocb() will be
>> called.  This can lead to operations hanging
>
> If EIOCBRETRY then generic_file_aio_write() will be recalled for the
> same iocb.

Only if kick_iocb() is called.  It won't be called if i_i_p2_r() was  
the only thing to return -EIOCBRETRY.

>
>> It overwrites -EIOCBQUEUED, leading to an aio_complete() while a
>> retry is happening.
>
> EIOCBQUEUED or EIOCBRETRY does not lead to aio_complete() call:

Not by fs/aio.c, but *by the place that originated -EIOCBQUEUED*.   
Later.  After IO has completed.  see fs/direct-io.c:dio_bio_end_aio().

This is what -EIOCBQUEUED means!  It's a promise to call aio_complete 
() in the future.

Have you read the giant comment over the definition of struct kiocb  
in include/linux/aio.h?

>> This can lead to reference count confusion.
> But just reference count confusion was deleted by patch. Isn't it?

Sorry, I don't understand what you're trying to ask here.

- z
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH] aio: fix kernel bug when page is temporally busy

2007-02-15 Thread Ananiev, Leonid I
> It returns -EIOCBRETRY without guaranteeing that kick_iocb() will be  
> called.  This can lead to operations hanging

If EIOCBRETRY then generic_file_aio_write() will be recalled for the
same iocb.

> It overwrites -EIOCBQUEUED, leading to an aio_complete() while a  
> retry is happening.

EIOCBQUEUED or EIOCBRETRY does not lead to aio_complete() call:
if (ret != -EIOCBRETRY && ret != -EIOCBQUEUED) {
aio_complete(iocb, ret, 0);

> This can lead to reference count confusion.
But just reference count confusion was deleted by patch. Isn't it?

Leonid

-Original Message-
From: Zach Brown [mailto:[EMAIL PROTECTED] 
Sent: Thursday, February 15, 2007 9:25 PM
To: Ananiev, Leonid I
Cc: Ken Chen; [EMAIL PROTECTED]; Andrew Morton;
linux-kernel@vger.kernel.org; linux-aio; Chris Mason
Subject: Re: [PATCH] aio: fix kernel bug when page is temporally busy

> If invalidate_inode_pages2_range() will return EIOCBRETRY as the patch
> "aio: fix kernel bug when page is temporally busy"

Sorry Leonid, this patch is not safe.

It returns -EIOCBRETRY without guaranteeing that kick_iocb() will be  
called.  This can lead to operations hanging, both AIO and calls that  
come through do_sync_{read,write}.

It overwrites -EIOCBQUEUED, leading to an aio_complete() while a  
retry is happening.  This can lead to reference count confusion.   
Double-frees, referencing freed memory, that kind of thing.  This  
isn't a new problem.  The current code that overwrites with -EIO has  
this problem.  But moving to -EIOCBRETRY does introduce new behaviour  
of aio_complete() and the retry path racing.

I'll have a candidate patch to address the problem of EIO being  
raised on the way back up from a path which has returned -EIOCBQUEUED.

- z
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH] aio: fix kernel bug when page is temporally busy

2007-02-15 Thread Ananiev, Leonid I
Ken Chen wrote
> It might shut up kernel
> panic by eliminate double calls to aio_complete(), but it will
> silently introduce data corruption.

I had got kernel panic after an hour of aiostress running.
After patching I have not got aiostress massage 
"verify error, file %s offset %Lu contents (offset:bad:good):\n"
during 5 hour aiostress running with 'verify' option.
Looking closely into aiostress.c 
ftp://ftp.suse.com/pub/people/mason/utils/aio-stress.c
we can see that this program may write in random mode THE SAME
contents to the same file offset asynchronously from different
buffers and do not cure about it. Does Ken consider that kernel
panic is the best way to prevent data corruption in such kind
of programs?

> So any error value returned from invalidate_inode_pages2_range() has
> to be taken seriously in the direct IO submit path instead of dropping
> it to the floor.
If invalidate_inode_pages2_range() will return EIOCBRETRY as the patch 
"aio: fix kernel bug when page is temporally busy"
sets then do_sync_read/write() will not drop IO submit but will retry
it:
for (;;) {
ret = filp->f_op->aio_read(&kiocb, &iov, 1, kiocb.ki_pos);
if (ret != -EIOCBRETRY)
break;
wait_on_retry_sync_kiocb(&kiocb);
}
And do_sync_read/write() will not return EIO if page is busy
as it does now, before patching.

Ken Chen wrote:
> I also think the original patch is wrong.
What do you mean saying 'also'?

Leonid

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH] aio: fix kernel bug when page is temporally busy

2007-02-14 Thread Ananiev, Leonid I
> Please, tell us what problem this is fixing so that we can look into
> alternative solutions.
This patch fixes a kernel panic "Kernel BUG at fs/aio.c:509"
http://marc.theaimsgroup.com/?l=linux-kernel&m=117031052517746&w=2
First of all it was found that the kernel panic happens after
IO error reporting. But later it was found that the actual
reason is not in real IO error but in a busy page.
EIO is returned if page has unstable state while IO completion
result is processing.
If aio_run_iocb() as well as in do_sync_read/write() get EIO result
it is considered as IO is finished and IO control block
could be free; result is reported to caller.
I suppose that aio_run_iocb() as well as in do_sync_read/write()
should have a chance to differ real EIO and EIO which is actually
means EAGAYN. That functions test for EIOCBRETRY meaning EAGAYN purpose.
do_sync_read/write() retries IO as well as aio:
for (;;) {
ret = filp->f_op->aio_read(&kiocb, &iov, 1,
kiocb.ki_pos);
if (ret != -EIOCBRETRY)
break;
wait_on_retry_sync_kiocb(&kiocb);
}
By the way EIOCBRETRY is never actually set in the kernel. So what is
the
purpose of previous lines?

More details about EIO returning:
aio_complete: res=-5 iocb=81002b38f080 ki_users=2 
[] invalidate_inode_pages2_range+0x236/0x26b
[] ext3_direct_IO+0x16c/0x19e
[] ext3_get_block+0x0/0xe2
[] generic_file_direct_IO+0xb9/0xcf
[] generic_file_direct_write+0x67/0x10e
[] __generic_file_aio_write_nolock+0x2d6/0x3fe
[] __switch_to+0x26f/0x27e
[] thread_return+0x0/0xd8
[] generic_file_aio_write+0x67/0xc7
[] ext3_file_write+0x0/0x8f
[] ext3_file_write+0x16/0x8f
[] ext3_file_write+0x0/0x8f
[] aio_rw_vect_retry+0x72/0x14f
[] aio_run_iocb+0xe6/0x187
[] io_submit_one+0x296/0x2e3
[] sys_io_submit+0x9b/0x108
[] default_wake_function+0x0/0xe
[] system_call+0x7e/0x83
Call Trace:
 [] aio_complete+0x5c/0x18c
[] finished_one_bio+0xac/0xf3
[] dio_bio_complete+0x95/0xaa
[] dio_bio_end_aio+0x20/0x25
[] __end_that_request_first+0x10e/0x3fd
[] scsi_end_request+0x27/0xc9
[] scsi_io_completion+0xf0/0x2c5
[] ahd_done+0x537/0x580
[] sd_rw_intr+0x182/0x1ad
[] blk_done_softirq+0x5c/0x6a
[] __do_softirq+0x55/0xc3
[] ack_apic_level+0x37/0x4b
[] call_softirq+0x1c/0x28
[] do_softirq+0x2c/0x7d
[] do_IRQ+0xa8/0xc8
[] mwait_idle+0x0/0x20
[] ret_from_intr+0x0/0xa
 [] mwait_idle_with_hints+0x44/0x45
[] mwait_idle+0xc/0x20
[] cpu_idle+0x8b/0xb0
[] start_secondary+0x462/0x471


-Original Message-
From: Andrew Morton [mailto:[EMAIL PROTECTED] 
Sent: Thursday, February 15, 2007 6:31 AM
To: Ananiev, Leonid I
Cc: linux-kernel@vger.kernel.org
Subject: Re: [PATCH] aio: fix kernel bug when page is temporally busy

On Wed, 14 Feb 2007 20:51:33 +0300 "Ananiev, Leonid I"
<[EMAIL PROTECTED]> wrote:

> Fix kernel bug when IO page is temporally busy:
> invalidate_inode_pages2_range() returns EIOCBRETRY but not  EIO.
> invalidate_inode_pages2() returns EIO as earlier.
> 
> Signed-off-by: Leonid Ananiev <[EMAIL PROTECTED]>
> ---
> --- linux-2.6.20/mm/truncate.c  2007-02-04 10:44:54.0 -0800
> +++ linux-2.6.20p/mm/truncate.c 2007-02-08 22:56:52.0 -0800
> @@ -366,7 +366,7 @@ static int do_launder_page(struct addres
>   * Any pages which are found to be mapped into pagetables are
unmapped prior to
>   * invalidation.
>   *
> - * Returns -EIO if any pages could not be invalidated.
> + * Returns -EIOCBRETRY if any pages could not be invalidated.
>   */
>  int invalidate_inode_pages2_range(struct address_space *mapping,
>   pgoff_t start, pgoff_t end)
> @@ -423,7 +423,7 @@ int invalidate_inode_pages2_range(struct
> }
> ret = do_launder_page(mapping, page);
> if (ret == 0 &&
!invalidate_complete_page2(mapping, page))
> -   ret = -EIO;
> +   ret = -EIOCBRETRY;
> unlock_page(page);
> }
> pagevec_release(&pvec);
> @@ -444,6 +444,7 @@ EXPORT_SYMBOL_GPL(invalidate_inode_pages
>   */
>  int invalidate_inode_pages2(struct address_space *mapping)
>  {
> -   return invalidate_inode_pages2_range(mapping, 0, -1);
> +   int ret =  invalidate_inode_pages2_range(mapping, 0, -1);
> +   return (ret < 0)?-EIO:ret;
>  }
>  EXPORT_SYMBOL_GPL(invalidate_inode_pages2);

If someone later uses invalidate_inode_pages2_range() elsewhere, they're
going to need to know to convert -EIOCBRETRY into -EIO, if they weren't
called by aio.  Or something.

Please, tell us what problem this is fixing so that we can look into
alternative solutions.

For example, one acceptable-but-ugly solution might be to do:

static inline int invalidate_inode_page

[PATCH] aio: fix kernel bug when page is temporally busy

2007-02-14 Thread Ananiev, Leonid I
-wrap lines are fixed. Sorry.

>From Leonid Ananiev

Fix kernel bug if IO page is temporally busy:
invalidate_inode_pages2_range() returns EIOCBRETRY but not  EIO.
invalidate_inode_pages2() returns EIO as earlier.

Signed-off-by: Leonid Ananiev <[EMAIL PROTECTED]>
---
--- linux-2.6.20/mm/truncate.c  2007-02-04 10:44:54.0 -0800
+++ linux-2.6.20p/mm/truncate.c 2007-02-08 22:56:52.0 -0800
@@ -366,7 +366,7 @@ static int do_launder_page(struct addres
  * Any pages which are found to be mapped into pagetables are unmapped prior to
  * invalidation.
  *This is path is modified so that invalidate_inode_pages2()
returns EIO as earlier.
Wrap lines are fixed

- From Leonid Ananiev

Fix kernel bug when IO page is temporally busy:
invalidate_inode_pages2_range() returns EIOCBRETRY but not  EIO.
invalidate_inode_pages2() returns EIO as earlier.

Signed-off-by: Leonid Ananiev <[EMAIL PROTECTED]>
---
--- linux-2.6.20/mm/truncate.c  2007-02-04 10:44:54.0 -0800
+++ linux-2.6.20p/mm/truncate.c 2007-02-08 22:56:52.0 -0800
@@ -366,7 +366,7 @@ static int do_launder_page(struct addres
  * Any pages which are found to be mapped into pagetables are unmapped prior to
  * invalidation.
  *
- * Returns -EIO if any pages could not be invalidated.
+ * Returns -EIOCBRETRY if any pages could not be invalidated.
  */
 int invalidate_inode_pages2_range(struct address_space *mapping,
  pgoff_t start, pgoff_t end)
@@ -423,7 +423,7 @@ int invalidate_inode_pages2_range(struct
}
ret = do_launder_page(mapping, page);
if (ret == 0 && !invalidate_complete_page2(mapping, 
page))
-   ret = -EIO;
+   ret = -EIOCBRETRY;
unlock_page(page);
}
pagevec_release(&pvec);
@@ -444,6 +444,7 @@ EXPORT_SYMBOL_GPL(invalidate_inode_pages
  */
 int invalidate_inode_pages2(struct address_space *mapping)
 {
-   return invalidate_inode_pages2_range(mapping, 0, -1);
+   int ret =  invalidate_inode_pages2_range(mapping, 0, -1);
+   return (ret < 0)?-EIO:ret;
 }
 EXPORT_SYMBOL_GPL(invalidate_inode_pages2);

- * Returns -EIO if any pages could not be invalidated.
+ * Returns -EIOCBRETRY if any pages could not be invalidated.
  */
 int invalidate_inode_pages2_range(struct address_space *mapping,
  pgoff_t start, pgoff_t end)
@@ -423,7 +423,7 @@ int invalidate_inode_pages2_range(struct
}
ret = do_launder_page(mapping, page);
if (ret == 0 && !invalidate_complete_page2(mapping, 
page))
-   ret = -EIO;
+   ret = -EIOCBRETRY;
unlock_page(page);
}
pagevec_release(&pvec);
@@ -444,6 +444,7 @@ EXPORT_SYMBOL_GPL(invalidate_inode_pages
  */
 int invalidate_inode_pages2(struct address_space *mapping)
 {
-   return invalidate_inode_pages2_range(mapping, 0, -1);
+   int ret =  invalidate_inode_pages2_range(mapping, 0, -1);
+   return (ret < 0)?-EIO:ret;
 }
 EXPORT_SYMBOL_GPL(invalidate_inode_pages2);



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH] aio: fix kernel bug when page is temporally busy

2007-02-12 Thread Ananiev, Leonid I
Andrew,
You wrote on Friday, February 09, 2007 8:53 AM
> invalidate_inode_pages2() has other callers.  I suspect with this
change
> we'll end up leaking EIOCBRETRY back to userspace.

The path is modified so that invalidate_inode_pages2() returns EIO as
earlier.
could you consider modified patch
The patch against 2.6.20.

Long story: The kernel panic is happening after hours of AIO benchmark
running in mcp.
First of all it was found that the kernel panic happens if IO error is
reported.
But later it was found that the actual reason is not in real IO error
but in a busy page.
While the current CPU tests if IO is completed it happens that another
CPU
at the same time processes IO completion in soft_irq.
The considered buffer page is busy now by second CPU and
invalidate_inode_pages2_range() returns EIO in this case.
First CPU reports EIO to caller ; completes IO and frees control block
in aio_complete().
Second CPU frees the same control block once more. 
The patch makes invalidate_inode_pages2_range() to return EIOCBRETRY
which is tested just in aio_run_iocb(). It retries IO competition check
if EIOCBRETRY is got.
EIOCBRETRY is tested in do_sync_read/write() functions as well.
And direct IO competition will be retested "instead of dropping it to
the floor".

>From Leonid Ananiev

Fix kernel bug when IO page is temporally busy:
invalidate_inode_pages2_range() returns EIOCBRETRY but not  EIO.
invalidate_inode_pages2() returns EIO as earlier.

Signed-off-by: Leonid Ananiev <[EMAIL PROTECTED]>
---
--- linux-2.6.20/mm/truncate.c  2007-02-04 10:44:54.0 -0800
+++ linux-2.6.20p/mm/truncate.c 2007-02-08 22:56:52.0 -0800
@@ -366,7 +366,7 @@ static int do_launder_page(struct addres
  * Any pages which are found to be mapped into pagetables are unmapped
prior to
  * invalidation.
  *
- * Returns -EIO if any pages could not be invalidated.
+ * Returns -EIOCBRETRY if any pages could not be invalidated.
  */
 int invalidate_inode_pages2_range(struct address_space *mapping,
  pgoff_t start, pgoff_t end)
@@ -423,7 +423,7 @@ int invalidate_inode_pages2_range(struct
}
ret = do_launder_page(mapping, page);
if (ret == 0 &&
!invalidate_complete_page2(mapping, page))
-   ret = -EIO;
+   ret = -EIOCBRETRY;
unlock_page(page);
}
pagevec_release(&pvec);
@@ -444,6 +444,7 @@ EXPORT_SYMBOL_GPL(invalidate_inode_pages
  */
 int invalidate_inode_pages2(struct address_space *mapping)
 {
-   return invalidate_inode_pages2_range(mapping, 0, -1);
+   int ret =  invalidate_inode_pages2_range(mapping, 0, -1);
+   return (ret < 0)?-EIO:ret;
 }
 EXPORT_SYMBOL_GPL(invalidate_inode_pages2);
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Recall: [PATCH] aio: fix kernel bug when page is temporally busy

2007-02-12 Thread Ananiev, Leonid I
Ananiev, Leonid I would like to recall the message, "[PATCH] aio: fix kernel 
bug when page is  temporally busy".
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Recall: [PATCH] aio: fix kernel bug when page is temporally busy

2007-02-12 Thread Ananiev, Leonid I
Ananiev, Leonid I would like to recall the message, "[PATCH] aio: fix kernel 
bug when page is  temporally busy".
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH] aio: fix kernel bug when page is temporally busy

2007-02-12 Thread Ananiev, Leonid I
Andrew,
You wrote on Friday, February 09, 2007 8:53 AM
> invalidate_inode_pages2() has other callers.  I suspect with this
change
> we'll end up leaking EIOCBRETRY back to userspace.

The path is modified so that invalidate_inode_pages2() returns EIO as
earlier.
could you consider modified patch
The patch against 2.6.20.

Long story: The kernel panic is happening after hours of AIO benchmark
running in mcp.
First of all it was found that the kernel panic happens if IO error is
reported.
But later it was found that the actual reason is not in real IO error
but in a busy page.
While the current CPU tests if IO is completed it happens that another
CPU
at the same time processes IO completion in soft_irq.
The considered buffer page is busy now by second CPU and
invalidate_inode_pages2_range() returns EIO in this case.
First CPU reports EIO to caller ; completes IO and frees control block
in aio_complete().
Second CPU frees the same control block once more. 
The patch makes invalidate_inode_pages2_range() to return EIOCBRETRY
which is tested just in aio_run_iocb(). It retries IO competition check
if EIOCBRETRY is got.
EIOCBRETRY is tested in do_sync_read/write() functions as well.
And direct IO competition will be retested "instead of dropping it to
the floor".

>From Leonid Ananiev

Fix kernel bug when IO page is temporally busy:
invalidate_inode_pages2_range() returns EIOCBRETRY but not  EIO.
invalidate_inode_pages2() returns EIO as earlier.

Signed-off-by: Leonid Ananiev <[EMAIL PROTECTED]>
---
--- linux-2.6.20/mm/truncate.c  2007-02-04 10:44:54.0 -0800
+++ linux-2.6.20p/mm/truncate.c 2007-02-08 22:56:52.0 -0800
@@ -366,7 +366,7 @@ static int do_launder_page(struct addres
  * Any pages which are found to be mapped into pagetables are unmapped
prior to
  * invalidation.
  *
- * Returns -EIO if any pages could not be invalidated.
+ * Returns -EIOCBRETRY if any pages could not be invalidated.
  */
 int invalidate_inode_pages2_range(struct address_space *mapping,
  pgoff_t start, pgoff_t end)
@@ -423,7 +423,7 @@ int invalidate_inode_pages2_range(struct
}
ret = do_launder_page(mapping, page);
if (ret == 0 &&
!invalidate_complete_page2(mapping, page))
-   ret = -EIO;
+   ret = -EIOCBRETRY;
unlock_page(page);
}
pagevec_release(&pvec);
@@ -444,6 +444,7 @@ EXPORT_SYMBOL_GPL(invalidate_inode_pages
  */
 int invalidate_inode_pages2(struct address_space *mapping)
 {
-   return invalidate_inode_pages2_range(mapping, 0, -1);
+   int ret =  invalidate_inode_pages2_range(mapping, 0, -1);
+   return (ret < 0)?-EIO:ret;
 }
 EXPORT_SYMBOL_GPL(invalidate_inode_pages2);
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH] aio: fix kernel bug when page is temporally busy

2007-02-10 Thread Ananiev, Leonid I
Ken Chen writes:
> So any error value returned from invalidate_inode_pages2_range() has
> to be taken seriously in the direct IO submit path instead of dropping
> it to the floor.

If invalidate_inode_pages2_range() will return EIOCBRETRY as the patch 
"aio: fix kernel bug when page is temporally busy"
serts then do_sync_read/write() will not drop IO submit but will retry
it:
for (;;) {
ret = filp->f_op->aio_read(&kiocb, &iov, 1,
kiocb.ki_pos);
if (ret != -EIOCBRETRY)
break;
wait_on_retry_sync_kiocb(&kiocb);
}

And do_sync_read/write() will not return EIO if page is busy as it does
now, befor patching.
Curretly do_sync_read/write() tests for EIOCBRETRY. But EIOCBRETRY is
not set ever.

Leonid
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH] aio: fix kernel bug when page is temporally busy

2007-02-10 Thread Ananiev, Leonid I
> If invalidate_inode_pages2_range() says it can not invalidate pages,
> while dio to the same file offset range is in flight, something is
> really wrong there.

If invalidate_inode_pages2_range() says it can not invalidate pages
It means that soft_irq does completing IO now on other cpu.
Next retry() call in aio_run_iocb() will see the IO well completed.

The patch is updated: invalidate_inode_pages2() returns EIO as earlier
for
nfs and other.
But  invalidate_inode_pages2_range() returns EIOCBRETRY for aio and dio.

The patch against 2.6.20.

>From Leonid Ananiev

Fix kernel bug when IO page is temporally busy:
 invalidate_inode_pages2_range() returns EIOCBRETRY but not  EIO.

Signed-off-by: Leonid Ananiev <[EMAIL PROTECTED]>
---
--- linux-2.6.20/mm/truncate.c  2007-02-04 10:44:54.0 -0800
+++ linux-2.6.20p/mm/truncate.c 2007-02-08 22:56:52.0 -0800
@@ -366,7 +366,7 @@ static int do_launder_page(struct addres
  * Any pages which are found to be mapped into pagetables are unmapped
prior to
  * invalidation.
  *
- * Returns -EIO if any pages could not be invalidated.
+ * Returns -EIOCBRETRY if any pages could not be invalidated.
  */
 int invalidate_inode_pages2_range(struct address_space *mapping,
  pgoff_t start, pgoff_t end)
@@ -423,7 +423,7 @@ int invalidate_inode_pages2_range(struct
}
ret = do_launder_page(mapping, page);
if (ret == 0 &&
!invalidate_complete_page2(mapping, page))
-   ret = -EIO;
+   ret = -EIOCBRETRY;
unlock_page(page);
}
pagevec_release(&pvec);
@@ -444,6 +444,7 @@ EXPORT_SYMBOL_GPL(invalidate_inode_pages
  */
 int invalidate_inode_pages2(struct address_space *mapping)
 {
-   return invalidate_inode_pages2_range(mapping, 0, -1);
+   int ret =  invalidate_inode_pages2_range(mapping, 0, -1);
+   return (ret < 0)?-EIO:ret;
 }
 EXPORT_SYMBOL_GPL(invalidate_inode_pages2);
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH] aio: fix kernel bug when page is temporally busy

2007-02-10 Thread Ananiev, Leonid I
> If invalidate_inode_pages2_range() says it can not invalidate pages,
> while dio to the same file offset range is in flight, something is
> really wrong there.

If invalidate_inode_pages2_range() says it can not invalidate pages
It means that soft_irq does completing now on other cpu.
Next retry() will see a IO well completed.
Leonid

-Original Message-
From: Ken Chen [mailto:[EMAIL PROTECTED] 
Sent: Saturday, February 10, 2007 9:05 PM
To: Ananiev, Leonid I
Cc: [EMAIL PROTECTED]; Andrew Morton; linux-kernel@vger.kernel.org;
linux-aio; Zach Brown; Chris Mason; Badari Pulavarty
Subject: Re: [PATCH] aio: fix kernel bug when page is temporally busy

On 2/9/07, Ananiev, Leonid I <[EMAIL PROTECTED]> wrote:
> I have used EIOCBRETRY in the patch to minimize source code
modification
> only.
> [...]
> A lot of errno's have different meaning in different functions or
> contexts. EAGAIN could be used instated of EIOCBRETRY for irredundant
> set.

I also think the original patch is wrong.  It might shut up kernel
panic by eliminate double calls to aio_complete(), but it will
silently introduce data corruption.

If invalidate_inode_pages2_range() says it can not invalidate pages,
while dio to the same file offset range is in flight, something is
really wrong there.  In generic_file_direct_IO, the function
explicitly flushes all dirty pages and wait on them before submits
DIO.

So any error value returned from invalidate_inode_pages2_range() has
to be taken seriously in the direct IO submit path instead of dropping
it to the floor.

- Ken
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH] aio: fix kernel bug when page is temporally busy

2007-02-10 Thread Ananiev, Leonid I
But if page is busy as invalidate_inode_pages2_range() says because of
bh_count>0 then aio_complet() is not called from aio_run_iocb() and next
retry() will get the iocb result.

Leonid

-Original Message-
From: Ken Chen [mailto:[EMAIL PROTECTED] 
Sent: Saturday, February 10, 2007 9:05 PM
To: Ananiev, Leonid I
Cc: [EMAIL PROTECTED]; Andrew Morton; linux-kernel@vger.kernel.org;
linux-aio; Zach Brown; Chris Mason; Badari Pulavarty
Subject: Re: [PATCH] aio: fix kernel bug when page is temporally busy

On 2/9/07, Ananiev, Leonid I <[EMAIL PROTECTED]> wrote:
> I have used EIOCBRETRY in the patch to minimize source code
modification
> only.
> [...]
> A lot of errno's have different meaning in different functions or
> contexts. EAGAIN could be used instated of EIOCBRETRY for irredundant
> set.

I also think the original patch is wrong.  It might shut up kernel
panic by eliminate double calls to aio_complete(), but it will
silently introduce data corruption.

If invalidate_inode_pages2_range() says it can not invalidate pages,
while dio to the same file offset range is in flight, something is
really wrong there.  In generic_file_direct_IO, the function
explicitly flushes all dirty pages and wait on them before submits
DIO.

So any error value returned from invalidate_inode_pages2_range() has
to be taken seriously in the direct IO submit path instead of dropping
it to the floor.

- Ken
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH] aio: fix kernel bug when page is temporally busy

2007-02-09 Thread Ananiev, Leonid I
>>noone seems to be setting the error values to EIOCBRETRY
> We'd want retries to act only on the remaining bytes which haven't
been
> transferred as yet, so even in the EIOCBRETRY...

OK. But any variable is not set to EIOCBRETRY  in 2.6.14, 2.6.19,
2.6.20.
Is it correct?

Leonid 

-Original Message-
From: Suparna Bhattacharya [mailto:[EMAIL PROTECTED] 
Sent: Friday, February 09, 2007 2:06 PM
To: Jiri Kosina
Cc: Andrew Morton; Ananiev, Leonid I; linux-kernel@vger.kernel.org;
linux-aio; Zach Brown; Chris Mason; Badari Pulavarty; Jan Kara
Subject: Re: [PATCH] aio: fix kernel bug when page is temporally busy

On Fri, Feb 09, 2007 at 11:40:27AM +0100, Jiri Kosina wrote:
> On Fri, 9 Feb 2007, Andrew Morton wrote:
>  
> > > @@ -1204,7 +1204,7 @@ generic_file_aio_read(struct kiocb *iocb,
const struct iovec *iov,
> > >
do_generic_file_read(filp,ppos,&desc,file_read_actor);
> > >   retval += desc.written;
> > >   if (desc.error) {
> > > - retval = retval ?: desc.error;
> > > + retval = desc.error;
> > >   break;
> > >   }
> > Nope.  On error the read() syscall must return the number of bytes
which
> > were successfully read.
>  
> You are right. 
>  
> In current mainline this even is not a problem, because noone seems to
be 
> setting the error values to EIOCBRETRY. But it still stinks a bit, as 
> there are tests for EIOCBRETRY.

We'd want retries to act only on the remaining bytes which haven't been
transferred as yet, so even in the EIOCBRETRY case the right thing to do
is to return the number of bytes that were successfully read without
blocking. The high level AIO code (see aio_rw_vect_rety) has the ability
to handle this.

So this is still correct.

Is there a real bug that you are seeing here ?

Regards
Suparna

>  
> -- 
> Jiri Kosina
>  
> --
> To unsubscribe, send a message with 'unsubscribe linux-aio' in
> the body to [EMAIL PROTECTED]  For more info on Linux AIO,
> see: http://www.kvack.org/aio/
> Don't email: mailto:"[EMAIL PROTECTED]">[EMAIL PROTECTED]

-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH] aio: fix kernel bug when page is temporally busy

2007-02-09 Thread Ananiev, Leonid I
I have used EIOCBRETRY in the patch to minimize source code modification
only.
It is notable that EIOCBRETRY is never set in kernel, but tested only.

EAGAIN - means that you may want to recall the same function with the
same argument. But user have not to recall it just now. He may want to
free or process some other his resources before recall for success.
EIOCBRETRY - means that you may not recall the same function with the
same argument because an argument has no sense already in cuurnt
context. The result is in processing already. You may want to recall
much higher level or external function to get the common result. It is
means that inernal caller should ignore or let through errno to its
caller but do not recall the function which has returned this errno.

A lot of errno's have different meaning in different functions or
contexts. EAGAIN could be used instated of EIOCBRETRY for irredundant
set.

Leonid

-Original Message-
From: Suparna Bhattacharya [mailto:[EMAIL PROTECTED] 
Sent: Friday, February 09, 2007 10:16 AM
To: Ananiev, Leonid I
Cc: Andrew Morton; linux-kernel@vger.kernel.org; linux-aio; Zach Brown;
Chris Mason; Badari Pulavarty
Subject: Re: [PATCH] aio: fix kernel bug when page is temporally busy

On Fri, Feb 09, 2007 at 08:41:41AM +0300, Ananiev, Leonid I wrote:
> > invalidate_inode_pages2() has other callers.  I suspect with this
> change
> > we'll end up leaking EIOCBRETRY back to userspace.
>  
> EIOCBRETRY is used and caught already in do_sync_read() and
> do_sync_readv_writev().
>  
> Below fixed patch against kernel 2.6.20.

I agree with Andrew, this doesn't look like the right solution.

EIOCBRETRY isn't the same as EAGAIN or "try again later". EIOCBRETRY
means
"I have queued up an action to notify (wakeup) the callback to retry the
operation once data is ready".

Regards
Suparna

>  
> >From Leonid Ananiev
>  
> Fix kernel bug when IO page is temporally busy:
> invalidate_inode_pages2() returns EIOCBRETRY but not  EIO..
>  
> Signed-off-by: Leonid Ananiev <[EMAIL PROTECTED]>
> ---
> --- linux-2.6.20/mm/truncate.c2007-02-04 10:44:54.0
-0800
> +++ linux-2.6.20p/mm/truncate.c   2007-02-08 11:38:11.0
-0800
> @@ -366,7 +366,7 @@ static int do_launder_page(struct addres
>   * Any pages which are found to be mapped into pagetables are
unmapped
> prior to
>   * invalidation.
>   *
> - * Returns -EIO if any pages could not be invalidated.
> + * Returns -EIOCBRETRY if any pages could not be invalidated.
>   */
>  int invalidate_inode_pages2_range(struct address_space *mapping,
> pgoff_t start, pgoff_t end)
> @@ -423,7 +423,7 @@ int invalidate_inode_pages2_range(struct
>   }
>   ret = do_launder_page(mapping, page);
>   if (ret == 0 &&
> !invalidate_complete_page2(mapping, page))
> - ret = -EIO;
> + ret = -EIOCBRETRY;
>   unlock_page(page);
>   }
>   pagevec_release(&pvec);
> @@ -440,7 +440,7 @@ EXPORT_SYMBOL_GPL(invalidate_inode_pages
>   * Any pages which are found to be mapped into pagetables are
unmapped
> prior to
>   * invalidation.
>   *
> - * Returns -EIO if any pages could not be invalidated.
> + * Returns -EIOCBRETRY if any pages could not be invalidated.
>   */
>  int invalidate_inode_pages2(struct address_space *mapping)
>  {
>  
> --
> To unsubscribe, send a message with 'unsubscribe linux-aio' in
> the body to [EMAIL PROTECTED]  For more info on Linux AIO,
> see: http://www.kvack.org/aio/
> Don't email: mailto:"[EMAIL PROTECTED]">[EMAIL PROTECTED]

-- 
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH] aio: fix kernel bug when page is temporally busy

2007-02-08 Thread Ananiev, Leonid I

> invalidate_inode_pages2() has other callers.  I suspect with this
change
> we'll end up leaking EIOCBRETRY back to userspace.

EIOCBRETRY is used and caught already in do_sync_read() and
do_sync_readv_writev().

Below fixed patch against kernel 2.6.20.

>From Leonid Ananiev

Fix kernel bug when IO page is temporally busy:
invalidate_inode_pages2() returns EIOCBRETRY but not  EIO..

Signed-off-by: Leonid Ananiev <[EMAIL PROTECTED]>
---
--- linux-2.6.20/mm/truncate.c  2007-02-04 10:44:54.0 -0800
+++ linux-2.6.20p/mm/truncate.c 2007-02-08 11:38:11.0 -0800
@@ -366,7 +366,7 @@ static int do_launder_page(struct addres
  * Any pages which are found to be mapped into pagetables are unmapped
prior to
  * invalidation.
  *
- * Returns -EIO if any pages could not be invalidated.
+ * Returns -EIOCBRETRY if any pages could not be invalidated.
  */
 int invalidate_inode_pages2_range(struct address_space *mapping,
  pgoff_t start, pgoff_t end)
@@ -423,7 +423,7 @@ int invalidate_inode_pages2_range(struct
}
ret = do_launder_page(mapping, page);
if (ret == 0 &&
!invalidate_complete_page2(mapping, page))
-   ret = -EIO;
+   ret = -EIOCBRETRY;
unlock_page(page);
}
pagevec_release(&pvec);
@@ -440,7 +440,7 @@ EXPORT_SYMBOL_GPL(invalidate_inode_pages
  * Any pages which are found to be mapped into pagetables are unmapped
prior to
  * invalidation.
  *
- * Returns -EIO if any pages could not be invalidated.
+ * Returns -EIOCBRETRY if any pages could not be invalidated.
  */
 int invalidate_inode_pages2(struct address_space *mapping)
 {
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] aio: fix kernel bug when page is temporally busy

2007-02-08 Thread Ananiev, Leonid I
>From Leonid Ananiev

Fix "Kernel BUG at fs/aio.c:509". Return EIOCBRETRY but not  EIO if page
is busy.

Signed-off-by: Leonid Ananiev <[EMAIL PROTECTED]>

"Kernel BUG at fs/aio.c:509"
http://marc.theaimsgroup.com/?l=linux-kernel&m=117031052517746&w=2 
is present in 2.6.20 as well as 2.6.19.
The investigation shows that the bug's panic occurs when aio_complete()
is called with argument ret=EIO which is returned from
invalidate_inode_pages2_range() because the page is temporally busy.
Call Trace:
 [] invalidate_inode_pages2_range+0x236/0x26b
 [] ext3_direct_IO+0x16c/0x19e
 [] ext3_get_block+0x0/0xe2
 [] generic_file_direct_IO+0xb9/0xcf
 [] generic_file_direct_write+0x67/0x10e
 [] __generic_file_aio_write_nolock+0x2d6/0x3fe
 [] __switch_to+0x26f/0x27e
 [] thread_return+0x0/0xd8
 [] generic_file_aio_write+0x67/0xc7
 [] ext3_file_write+0x0/0x8f
 [] ext3_file_write+0x16/0x8f
 [] ext3_file_write+0x0/0x8f
 [] aio_rw_vect_retry+0x72/0x14f
 [] aio_run_iocb+0xe6/0x187
 [] io_submit_one+0x296/0x2e3
 [] sys_io_submit+0x9b/0x108
 [] default_wake_function+0x0/0xe
 [] system_call+0x7e/0x83

As a result aio_complete() is called while it should not be.
When IO is finished  aio_complete() is called once more
and iocb->ki_users becomes negative.

Taking into account aio.c lines in aio_run_iocb()
if (ret != -EIOCBRETRY && ret != -EIOCBQUEUED) {
aio_complete(iocb, ret, 0);

proposed patch makes function invalidate_inode_pages2_range()
to return EIOCBRETRY but not  EIO if page is busy.
The patch is tested on 2.6.20.

--- linux-2.6.20/mm/truncate.c  2007-02-04 10:44:54.0 -0800
+++ linux-2.6.20p/mm/truncate.c 2007-02-08 11:38:11.0 -0800
@@ -366,7 +366,7 @@ static int do_launder_page(struct addres
  * Any pages which are found to be mapped into pagetables are unmapped
prior to
  * invalidation.
  *
- * Returns -EIO if any pages could not be invalidated.
+ * Returns -EIOCBRETRY if any pages could not be invalidated.
  */
 int invalidate_inode_pages2_range(struct address_space *mapping,
  pgoff_t start, pgoff_t end)
@@ -423,7 +423,7 @@ int invalidate_inode_pages2_range(struct
}
ret = do_launder_page(mapping, page);
if (ret == 0 &&
!invalidate_complete_page2(mapping, page))
-   ret = -EIO;
+   ret = -EIOCBRETRY;
unlock_page(page);
}
pagevec_release(&pvec);
@@ -440,7 +440,7 @@ EXPORT_SYMBOL_GPL(invalidate_inode_pages
  * Any pages which are found to be mapped into pagetables are unmapped
prior to
  * invalidation.
  *
- * Returns -EIO if any pages could not be invalidated.
+ * Returns -EIOCBRETRY if any pages could not be invalidated.
  */
 int invalidate_inode_pages2(struct address_space *mapping)
 {
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: Kernel BUG at fs/aio.c:509

2007-02-01 Thread Ananiev, Leonid I
>> BUG: warning at mm/truncate.c:398/invalidate_inode_pages2_range()
>  That warning got removed from -rc7.

2.6.20-rc-7: After 57 minutes and after 25 minutes of aiostress running
I've got 2 system hangs without any message in log.
On the third time I opened the console and got kernel panic massage:

$ [ cut here ]
kernel BUG at fs/aio.c:507!
invalid opcode:  [1] SMP 
CPU 2 
Modules linked in: firmware_class
Pid: 0, comm: swapper Not tainted 2.6.20-rc7 #1
RIP: 0010:[]  []
__aio_put_req+0x22/0x146
RSP: 0018:810037e4fd70  EFLAGS: 00010086
RAX:  RBX: 810037142ec0 RCX: 25934000
RDX:  RSI: 810037142ec0 RDI: 810005f8be40
RBP: 810005f8be40 R08: 6db6db6db6db6db7 R09: 81000100
R10:  R11: 802531b0 R12: 810005f8bea8
R13: 4000 R14:  R15: 0213
FS:  () GS:81000225dec0()
knlGS:
CS:  0010 DS: 0018 ES: 0018 CR0: 8005003b
CR2: 003cdca945d0 CR3: 39002000 CR4: 06e0
Process swapper (pid: 0, threadinfo 810037e48000, task
810037e47560)
Stack:  810037142ec0 810005f8be40 810005f8bea8
80284958
   8100381f2c00 8100381f2f24
 0246   802912a1
Call Trace:
   [] aio_complete+0x16b/0x1c7
 [] dio_bio_end_aio+0x9e/0xbe
 [] __end_that_request_first+0x10e/0x3f7
 [] blk_run_queue+0x41/0x72
 [] scsi_end_request+0x27/0xc9
 [] scsi_io_completion+0xf0/0x2c5
 [] ahd_done+0x52b/0x574
 [] sd_rw_intr+0x182/0x1ad
 [] blk_done_softirq+0x5c/0x6a
 [] __do_softirq+0x55/0xc3
 [] ack_apic_level+0x37/0x4b
 [] call_softirq+0x1c/0x28
 [] do_softirq+0x2c/0x7d
 [] do_IRQ+0xb1/0xd0
 [] mwait_idle+0x0/0x46
 [] ret_from_intr+0x0/0xa
   [] scsi_request_fn+0x0/0x33e
 [] mwait_idle+0x42/0x46
 [] cpu_idle+0x8c/0xaf
 [] start_secondary+0x462/0x471
 
 
Code: 0f 0b eb fe 74 07 31 c0 e9 12 01 00 00 4c 8d a6 e0 00 00 00 
RIP  [] __aio_put_req+0x22/0x146
 RSP 
 <0>Kernel panic - not syncing: Aiee, killing interrupt handler!

> Oh.  2.6.19.  Did you try 2.6.19.latest?
The latest 2.6.19.2 functions exactly as 2.6.19 does.
I should add that I see on user screen:

...kernel: invalid opcode:  [1] SMP

Leonid
-Original Message-
From: Andrew Morton [mailto:[EMAIL PROTECTED] 
Sent: Thursday, February 01, 2007 10:16 AM
To: Ananiev, Leonid I
Cc: linux-kernel@vger.kernel.org; [EMAIL PROTECTED]
Subject: Re: Kernel BUG at fs/aio.c:509

On Thu, 1 Feb 2007 09:13:51 +0300 "Ananiev, Leonid I"
<[EMAIL PROTECTED]> wrote:

> While repeatedly running 'aiostress -O -o 2' test I've got in
> /var/log/messages:
> 
> BUG: warning at mm/truncate.c:398/invalidate_inode_pages2_range()
> Call Trace:
>  [] invalidate_inode_pages2_range+0x236/0x26b
>  [] ext3_direct_IO+0x16c/0x19e
>  [] ext3_get_block+0x0/0xe2
>  [] generic_file_direct_IO+0xb9/0xcf
>  [] generic_file_direct_write+0x67/0x10e
>  [] __generic_file_aio_write_nolock+0x2d6/0x3fe
>  [] __switch_to+0x26f/0x27e
>  [] thread_return+0x0/0xd8
>  [] generic_file_aio_write+0x67/0xc7
>  [] ext3_file_write+0x0/0x8f
>  [] ext3_file_write+0x16/0x8f
>  [] ext3_file_write+0x0/0x8f
>  [] aio_rw_vect_retry+0x72/0x14f
>  [] aio_run_iocb+0xe6/0x187
>  [] io_submit_one+0x296/0x2e3
>  [] sys_io_submit+0x9b/0x108
>  [] default_wake_function+0x0/0xe
>  [] system_call+0x7e/0x83

That warning got removed from -rc7.  It's harmless, in most cases.  We
think.

It's a bit of a worry that aio-on-ext3 is triggering it.  That's
unexpected.

> --- [cut here ] - [please bite here ] -
> Kernel BUG at fs/aio.c:509
> invalid opcode:  [1] SMP 
> CPU 1 
> Modules linked in: firmware_class
> Pid: 11362, comm: aiostress Not tainted 2.6.19 #1

Oh.  2.6.19.  Did you try 2.6.19.latest?

> RIP: 0010:[]  []
> __aio_put_req+0x27/0x162
> RSP: 0018:81000c167e88  EFLAGS: 00010096
> RAX:  RBX: 8100184f8dc0 RCX: 1d44f000
> RDX: 000d RSI: 8100184f8dc0 RDI: 81001d44ee80
> RBP: 81001d44ee80 R08: 6db6db6db6db6db7 R09: 81000100
> R10:  R11: 0046 R12: 810037341c80
> R13: 81001d44ee80 R14: 00606998 R15: 
> FS:  2ab979b0ee90() GS:810037e896c0()
> knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 2aea207c3950 CR3: 1009c000 CR4: 06e0
> Process aiostress (pid: 11362, threadinfo 81000c166000, task
> 81003f49f5a0)
> Stack:  8100184f8dc0 81001d44ee80 810037341c80
> 80284ff6
>  8100184f8dc0 8100184f8dc0  80285409
>  00607fb0 81001d44ee80 

Kernel BUG at fs/aio.c:509

2007-01-31 Thread Ananiev, Leonid I
While repeatedly running 'aiostress -O -o 2' test I've got in
/var/log/messages:

BUG: warning at mm/truncate.c:398/invalidate_inode_pages2_range()
Call Trace:
 [] invalidate_inode_pages2_range+0x236/0x26b
 [] ext3_direct_IO+0x16c/0x19e
 [] ext3_get_block+0x0/0xe2
 [] generic_file_direct_IO+0xb9/0xcf
 [] generic_file_direct_write+0x67/0x10e
 [] __generic_file_aio_write_nolock+0x2d6/0x3fe
 [] __switch_to+0x26f/0x27e
 [] thread_return+0x0/0xd8
 [] generic_file_aio_write+0x67/0xc7
 [] ext3_file_write+0x0/0x8f
 [] ext3_file_write+0x16/0x8f
 [] ext3_file_write+0x0/0x8f
 [] aio_rw_vect_retry+0x72/0x14f
 [] aio_run_iocb+0xe6/0x187
 [] io_submit_one+0x296/0x2e3
 [] sys_io_submit+0x9b/0x108
 [] default_wake_function+0x0/0xe
 [] system_call+0x7e/0x83

--- [cut here ] - [please bite here ] -
Kernel BUG at fs/aio.c:509
invalid opcode:  [1] SMP 
CPU 1 
Modules linked in: firmware_class
Pid: 11362, comm: aiostress Not tainted 2.6.19 #1
RIP: 0010:[]  []
__aio_put_req+0x27/0x162
RSP: 0018:81000c167e88  EFLAGS: 00010096
RAX:  RBX: 8100184f8dc0 RCX: 1d44f000
RDX: 000d RSI: 8100184f8dc0 RDI: 81001d44ee80
RBP: 81001d44ee80 R08: 6db6db6db6db6db7 R09: 81000100
R10:  R11: 0046 R12: 810037341c80
R13: 81001d44ee80 R14: 00606998 R15: 
FS:  2ab979b0ee90() GS:810037e896c0()
knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 2aea207c3950 CR3: 1009c000 CR4: 06e0
Process aiostress (pid: 11362, threadinfo 81000c166000, task
81003f49f5a0)
Stack:  8100184f8dc0 81001d44ee80 810037341c80
80284ff6
 8100184f8dc0 8100184f8dc0  80285409
 00607fb0 81001d44ee80 00606998 0001
Call Trace:
 [] aio_put_req+0x21/0x59
80284ff1:   e8 cd f1 ff ff  callq  802841c3
<__aio_put_req>
 [] io_submit_one+0x2bd/0x2e3
 [] sys_io_submit+0x9b/0x108
 [] default_wake_function+0x0/0xe
 [] system_call+0x7e/0x83


Code: 0f 0b 68 76 e6 4a 80 c2 fd 01 85 c0 74 07 31 c0 e9 21 01 00 
RIP  [] __aio_put_req+0x27/0x162
 RSP 
NMI Watchdog detected LOCKUP on CPU 1
CPU 1 
Modules linked in: firmware_class
Pid: 11362, comm: aiostress Not tainted 2.6.19 #1
RIP: 0010:[]  []
_spin_lock_irq+0x8/0x10
RSP: 0018:81000c167c10  EFLAGS: 0086
RAX: 81003e0f0001 RBX: 81001d44ee80 RCX: 2ab979b0ef40
RDX:  RSI: 0246 RDI: 81001d44eeb8
RBP: 81001d44ee80 R08: 0286 R09: 0018
R10: 0286 R11: 0046 R12: 0001
R13: 000b R14: 804a5904 R15: 
FS:  () GS:810037e896c0()
knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 2aea207c3950 CR3: 1009c000 CR4: 06e0
Process aiostress (pid: 11362, threadinfo 81000c166000, task
81003f49f5a0)
Stack:  80284050 0286 802441cc
0018
8047cebe 0246 81001d44ee80 
80285880 81003e0f12c0 81003e0f12c0 81003f49f5a0
Call Trace:
[] aio_cancel_all+0x12/0x87
0x12:   e8 9d 93 1f 00  callq  8047d3ed
<_spin_lock_irq>
[] lock_hrtimer_base+0x26/0x4c
[] __down_read+0x12/0x9a
[] exit_aio+0x2e/0x8c
[] mmput+0x19/0x98
[] do_exit+0x1f7/0x825
[] vgacon_set_cursor_size+0x36/0xd1
[] kernel_math_error+0x0/0x90
[] do_invalid_op+0xad/0xb7
[] __aio_put_req+0x27/0x162
[] __generic_file_aio_write_nolock+0x2d6/0x3fe
[] __switch_to+0x26f/0x27e
[] error_exit+0x0/0x84
[] __aio_put_req+0x27/0x162
[] aio_put_req+0x21/0x59
[] io_submit_one+0x2bd/0x2e3
[] sys_io_submit+0x9b/0x108
[] default_wake_function+0x0/0xe
[] system_call+0x7e/0x83
Code: 83 3f 00 7e f9 eb f2 c3 53 48 89 fb e8 06 7b db ff f0 ff 0b 
<1>Fixing recursive fault but reboot is needed!

===

It stable happens after 2-5 aiosterss runs on 2.6.20-rc6 as well as on
2.6.19 in 4 processor Xeon.

fs/aio.c:509 line is  BUG_ON(req->ki_users < 0);
It may be an effect of return 1 from invalidate_inode_pages2_range() one
second earlier as you can see in log.
To investigate a root cause of ret=0 in invalidate_inode_pages2_range()
I have added printk() calls.
As a result I can say that
 invalidate_inode_pages2_range() reports BUG: warning at
mm/truncate.c:398  occurs becouse of
 invalidate_complete_page2()returns 0; it
returns 0 because of
 try_to_release_page()returns 0; it
returns 0 because of
 ext3_releasepage()returns 0; it
returns 0 because of
 journal_try_to_free_buffers() returns 0; it returns
0 because of
 try_to_free_buffers() retur

RE: [patch 1/] timers: tsc using for cpu scheduling

2005-07-23 Thread Ananiev, Leonid I
[EMAIL PROTECTED] "math is hard, lets go
shopping!"
- */
-static unsigned long cyc2ns_scale;
-#define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */
-
-static inline void set_cyc2ns_scale(unsigned long cpu_mhz)
-{
-   cyc2ns_scale = (1000 << CYC2NS_SCALE_FACTOR)/cpu_mhz;
-}
-
-static inline unsigned long long cycles_2_ns(unsigned long long cyc)
-{
-   return (cyc * cyc2ns_scale) >> CYC2NS_SCALE_FACTOR;
-}
-
 static unsigned long long monotonic_clock_hpet(void)
 {
unsigned long long last_offset, this_offset, base;
diff -ur a/arch/i386/kernel/timers/timer_tsc.c
b/arch/i386/kernel/timers/timer_tsc.c
--- a/arch/i386/kernel/timers/timer_tsc.c   2005-06-17
23:48:29.0 +0400
+++ b/arch/i386/kernel/timers/timer_tsc.c   2005-07-17
22:28:00.0 +0400
@@ -37,7 +37,6 @@
 
 extern spinlock_t i8253_lock;
 
-static int use_tsc;
 /* Number of usecs that the last interrupt was delayed */
 static int delay_at_last_interrupt;
 
@@ -46,34 +45,6 @@
 static unsigned long long monotonic_base;
 static seqlock_t monotonic_lock = SEQLOCK_UNLOCKED;
 
-/* convert from cycles(64bits) => nanoseconds (64bits)
- *  basic equation:
- * ns = cycles / (freq / ns_per_sec)
- * ns = cycles * (ns_per_sec / freq)
- * ns = cycles * (10^9 / (cpu_mhz * 10^6))
- * ns = cycles * (10^3 / cpu_mhz)
- *
- * Then we use scaling math (suggested by george@mvista.com) to
get:
- * ns = cycles * (10^3 * SC / cpu_mhz) / SC
- * ns = cycles * cyc2ns_scale / SC
- *
- * And since SC is a constant power of two, we can convert the div
- *  into a shift.   
- * [EMAIL PROTECTED] "math is hard, lets go
shopping!"
- */
-static unsigned long cyc2ns_scale; 
-#define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */
-
-static inline void set_cyc2ns_scale(unsigned long cpu_mhz)
-{
-   cyc2ns_scale = (1000 << CYC2NS_SCALE_FACTOR)/cpu_mhz;
-}
-
-static inline unsigned long long cycles_2_ns(unsigned long long cyc)
-{
-   return (cyc * cyc2ns_scale) >> CYC2NS_SCALE_FACTOR;
-}
-
 static int count2; /* counter for mark_offset_tsc() */
 
 /* Cached *multiplier* to convert TSC counts to microseconds.
@@ -131,30 +102,6 @@
return base + cycles_2_ns(this_offset - last_offset);
 }
 
-/*
- * Scheduler clock - returns current time in nanosec units.
- */
-unsigned long long sched_clock(void)
-{
-   unsigned long long this_offset;
-
-   /*
-* In the NUMA case we dont use the TSC as they are not
-* synchronized across all CPUs.
-*/
-#ifndef CONFIG_NUMA
-   if (!use_tsc)
-#endif
-   /* no locking but a rare wrong value is not a big deal
*/
-   return jiffies_64 * (10 / HZ);
-
-   /* Read the Time Stamp Counter */
-   rdtscll(this_offset);
-
-   /* return the value in ns */
-   return cycles_2_ns(this_offset);
-}
-
 static void delay_tsc(unsigned long loops)
 {
unsigned long bclock, now;
@@ -284,7 +231,7 @@
 #ifndef CONFIG_SMP
if (cpu_khz)
cpu_khz = cpufreq_scale(cpu_khz_ref, ref_freq,
freq->new);
-   if (use_tsc) {
+   if (cyc2ns_scale) {
if (!(freq->flags & CPUFREQ_CONST_LOOPS)) {
fast_gettimeoffset_quotient =
cpufreq_scale(fast_gettimeoffset_ref, freq->new, ref_freq);
set_cyc2ns_scale(cpu_khz/1000);
@@ -520,7 +467,6 @@
 
if (tsc_quotient) {
fast_gettimeoffset_quotient = tsc_quotient;
-   use_tsc = 1;
/*
 *  We could be more selective here I
suspect
 *  and just enable this for the next intel
chips ?



-Original Message-
From: john stultz [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, July 06, 2005 5:43 AM
To: Ananiev, Leonid I
Cc: lkml; Dominik Brodowski; Semenikhin, Sergey V
Subject: RE: [patch 1/] timers: tsc using for cpu scheduling

On Tue, 2005-07-05 at 21:08 +0400, Ananiev, Leonid I wrote:
>   Not only page faults may increase process priority. Someone can
> write two threads with mutexes so that each thread will spent much
less
> than 1 msec for calculations on cpu than lock-unlock mutexes and yield
> cpu to brother which wait mutex unlock and will do the same. Both
> threads will have high priority according to other threads during
> infinite time. The scheduler will not see the time spent by both
> considered threads on cpu.

Oh, I don't doubt there is a problem. I'm just asking if using the TSC
is really the only way to properly ding the process? 

I'm not a scheduler guy, so forgive my ignorance, but since the TSC may
not be available everywhere, might there be an alternative way to ding
the process? Surely something is keeping track 

RE: [patch 1/] timers: tsc using for cpu scheduling

2005-07-05 Thread Ananiev, Leonid I
John,

Not only page faults may increase process priority. Someone can
write two threads with mutexes so that each thread will spent much less
than 1 msec for calculations on cpu than lock-unlock mutexes and yield
cpu to brother which wait mutex unlock and will do the same. Both
threads will have high priority according to other threads during
infinite time. The scheduler will not see the time spent by both
considered threads on cpu.
The cpu scheduler does not need in real time value. It is need
the number of cpu clocks spent for considered task/thread for priority
calculation. It is not need to modify TSC tick rate for cpu scheduling.
The TSC can be used for priority calculation in NUMA because we
do not compare TSCs of different cpu's. 

> there are other cases where the TSC cannot be used for
> sched_clock, such as on systems that do not have TSCs...

> You're patch removes any fallback for the case where the TSC cannot be
used.
No. Now there is two global kernel values: cyc2ns_scale and use_tsc.
We may say that
use_tsc = (cyc2ns_scale != 0);
Now instead of 
 'if (use_tsc) than ...'
I propose to write
 'if ((cyc2ns_scale != 0) than...'

> This I don't agree with because there are situations where we cannot
use
the TSC.

The patch says that if there are PMT and TSC timers concurrently than
Linux will use TSC for CPU scheduler priority calculatin.
1 millisecond jiffies on the base of PMT were used patch in Linux before
this. So user can see that if CPU has TSC it is worst than CPU which has
not TSC because Linux choose slightly more precise but exactly 100
times more gross variant in this case.

Leonid


-Original Message-
From: john stultz [mailto:[EMAIL PROTECTED] 
Sent: Thursday, June 30, 2005 11:08 PM
To: Ananiev, Leonid I
Cc: lkml; Dominik Brodowski
Subject: Re: [patch 1/] timers: tsc using for cpu scheduling

On Thu, 2005-06-30 at 15:43 +0400, Ananiev, Leonid I wrote:
>   Patch for using TSC but not PMT in cpu scheduler.
> 
>   It was noted that under high memory load the process which
> generates a lot page faults does not lose own priority at all if
> processor has Power Management Timer (PMT) and for this reason Linux
> uses jiffies_64 for process priority calculation instead of Time Stamp
> Clock (TSC). As a result time is measured with 100 nsec
granularity:

[snip]

>   If process regularly uses processor much less than 1 msec
> scheduler does not see processor using by this process and does not
> decrease priority of process as it is performed on platforms which
have
> not PMT or PMT using is suppressed in .config file. The "list of
timers,
> ordered by preference" in linux kernel dictates to use Power
Management
> timer, if it is on processor and measure process run time in
> milliseconds (jiffies) actually but not in nanoseconds. As a result
the
> process which provoke to memory threshing and bad cpu and cash using
has
> much higher priority than regular interactive process. You can see
that
> cpu using time grows (it is measured by other way - not by function
> sched_clock()) but priority is not bring down and the process is
> considered as high priority interactive one.


So let me try to restate the problem, and you can let me know if I'm
misunderstanding you.

When the TSC is not used for timekeeping, sched_clock() uses jiffies to
measure time. This results in very course 1ms (or really 1/HZ)
granularity from sched_clock().

This course granularity is causing issues in the scheduler, as best that
I can understand, where processes that cause lots of pagefaults are not
properly dinged for their time.

Is that about right?


> It is known that TSC is incorrect according to astronomical real time
as
> a result of PM throttling. But for scheduler purposes the value of
work
> executed for each process but not real time spent on cpu makes sense.
> The scheduler actually does not consider process run time as a real
> time: it uses division of variable run_time  on CURRENT_BONUS before
> comparison it with sleep average time:
>   run_time /= CURRENT_BONUS;
>   task->sleep_avg -= run_time;
>   So run_time is not a real time but some measure of cpu work
> which is performed for current process and we have the right and have
to
> use TSC for scheduler purpose if TSC is there on processor and does
> function. 


Well, not quite. First of all, I believe (Dominik would know better)
that not all CPUs that support frequency scaling actually modify their
TSC frequency. So in some cases the TSC is time and in others it is
work.

Further, there are other cases where the TSC cannot be used for
sched_clock, such as on systems that do not have TSCs, and NUMA systems
where the TSCs are not synched. 



&g