RE: [PATCH 1/3] aio: fix oops because of extra IO control block freeing.
ZB>If we really decide to remove EIOCBRETRY support we'd get rid of all ZB>the retry infrastructure and remove the EIOCBRETRY errno so their ZB>builds failed. Originally EIOCBRETRY was used in fs/read_write.c for vector IO. And EIOCBRETRY was deleted from it after. Now EIOCBRETRY is used in drivers/usb/gadget/inode.c only. And the patch 2/3 proposes to set iocb flag instead of EIOCBRETRY errno in inode.c. Agree that the name kiocbSetPgBusy() could be better chosen if kiocbSetPgBusy() is just instead for vector IO. But why other than DIO developers must continue using EIOCBRETRY? Thay could use the same way as fs/read_write.c uses? ZB> I was waiting for them to arrive before responding. Sorry. I've listed "linux-kernel@vger.kernel.org; [EMAIL PROTECTED]" in a single (that is incorrect) thunderbird 'To' line; I have got response from Stern; and thought that mailing was OK. ZB>- ext3_releasepage() returns failure if it races with kjournald ZB>holding a reference while it waits for a transaction to commit. The patch proposes iterative synchronization way to solve this problem. ZB>- generic_file_direct_IO() causes an oops if it clobbers -EIOCBQUEUED ZB>with the return code from invalidate_inode_pages2_range()..- ZB>releasepage(). After patching invalidate_inode_pages2_range() does not change return value. ZB>This fix makes the incorrect assertion that *any* failure from ZB>invalidate_inode_pages2_range(), which might not have anything to do ZB>with this race you're currently seeing, is transitory and should ZB>trigger a retry. That's wrong, it should be returning the error. Just patch makes to retry iocb if page is transitory busy for any reason. If busy -> do retry IO operation later. We will get correct errno if mapping was deleted or retry success if page is dirty or committed. Leonid - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH 2/3] aio: fix oops because of extra IO control block freeing.
>Where is kiocbSetPgBusy() defined, It is defined in include/linux/aio.h (introduced in [PATCH 1/3]) > and where in the documentation is it explained? I will add a comment for kiocbSetPgBusy()in aio.h. > For that matter, what is the reason for changing the return value at all? EIORETRY is not IO return value at all. It is processing stage. It conflicts with IO return value. That is why a flag in iocb is used. It should be noted that loop but not EIOCBRETRY was used for vector IO (mm/filemap.c) Leonid >-Original Message- >From: Alan Stern [mailto:[EMAIL PROTECTED] >Sent: Monday, March 05, 2007 10:36 PM >To: Ananiev, Leonid I >Cc: Kernel development list >Subject: Re: [PATCH 2/3] aio: fix oops because of extra IO control block freeing. > >On Mon, 5 Mar 2007, Leonid Ananiev wrote: > >> From Leonid Ananiev >> >> This patch finishes moving from using errno EIOCBRETRY to using flag in >> IO control block for aio retrying. After this change the process will be >> kicked for direct aio as it was for sync aio. >> >> Signed-off-by: Leonid Ananiev <[EMAIL PROTECTED]> >> >> The patch is applied to 2.6.20 or 2.6.21-rc2 >> diff -uprN -X linux-2.6.20-aio21/Documentation/dontdiff >> linux-2.6.20-aio21/drivers/usb/gadget/inode.c >> linux-2.6.20-aio22/drivers/usb/gadget/inode.c >> --- linux-2.6.20-aio21/drivers/usb/gadget/inode.c2007-03-04 >> 21:45:52.0 +0300 >> +++ linux-2.6.20-aio22/drivers/usb/gadget/inode.c2007-03-05 >> 18:19:35.0 +0300 >> @@ -692,7 +692,10 @@ fail: >> kfree(priv); >> put_ep(epdata); >> } else >> -value = (iv ? -EIOCBRETRY : -EIOCBQUEUED); >> +if (iv) >> +kiocbSetPgBusy(iocb); >> +else >> +value = -EIOCBQUEUED; >> return value; >> } >> >> diff -uprN -X linux-2.6.20-aio21/Documentation/dontdiff >> linux-2.6.20-aio21/fs/ocfs2/dlmglue.c linux-2.6.20-aio22/fs/ocfs2/dlmglue.c >> --- linux-2.6.20-aio21/fs/ocfs2/dlmglue.c2007-03-04 21:45:52.0 >> +0300 >> +++ linux-2.6.20-aio22/fs/ocfs2/dlmglue.c2007-03-04 22:57:50.0 >> +0300 >> @@ -1639,7 +1639,7 @@ int ocfs2_meta_lock_full(struct inode *i >> >> status = ocfs2_cluster_lock(osb, lockres, level, dlm_flags, arg_flags); >> if (status < 0) { >> -if (status != -EAGAIN && status != -EIOCBRETRY) >> +if (status != -EAGAIN) >> mlog_errno(status); >> goto bail; >> } >> diff -uprN -X linux-2.6.20-aio21/Documentation/dontdiff >> linux-2.6.20-aio21/include/linux/aio.h >> linux-2.6.20-aio22/include/linux/aio.h >> --- linux-2.6.20-aio21/include/linux/aio.h 2007-03-04 21:46:45.0 >> +0300 >> +++ linux-2.6.20-aio22/include/linux/aio.h 2007-03-04 22:57:50.0 >> +0300 >> @@ -79,15 +79,6 @@ struct kioctx; >>* not ask the method again -- ki_retry must ensure forward progress. >>* aio_complete() must be called once and only once in the future, >> multiple >>* calls may result in undefined behaviour. >> - * >> - * If ki_retry returns -EIOCBRETRY it has made a promise that kick_iocb() >> - * will be called on the kiocb pointer in the future. This may happen >> - * through generic helpers that associate kiocb->ki_wait with a wait >> - * queue head that ki_retry uses via current->io_wait. It can also happen >> - * with custom tracking and manual calls to kick_iocb(), though that is >> - * discouraged. In either case, kick_iocb() must be called once and only >> - * once. ki_retry must ensure forward progress, the AIO core will wait >> - * indefinitely for kick_iocb() to be called. >>*/ >> struct kiocb { >> struct list_headki_run_list; >> diff -uprN -X linux-2.6.20-aio21/Documentation/dontdiff >> linux-2.6.20-aio21/include/linux/errno.h >> linux-2.6.20-aio22/include/linux/errno.h >> --- linux-2.6.20-aio21/include/linux/errno.h 2007-03-04 >> 21:45:51.0 +0300 >> +++ linux-2.6.20-aio22/include/linux/errno.h 2007-03-04 >> 22:57:50.0 +0300 >> @@ -22,7 +22,6 @@ >> #define EBADTYPE 527 /* Type not supported by server */ >> #define EJUKEBOX 528 /* Request initiated, but will not complete >> before timeout */ >> #define EIOCBQUEUED529 /* iocb queued, will get completion event */ >> -#define EIOCBRETRY 530 /* iocb queued, will trigger a retry */ >> >> #endif > >Where is kiocbSetPgBusy() defined,
Cfq first request patch changes IO performance
The patch "cfq-iosched: defer slice activation to first request being active" http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=c ommitdiff;h=44f7c16065c83060cbb9dd9b367141682a6e2b8e;hp=99f9628aba4d8fb3 b8d955c9efded0d0a1995fad Sysbench fileio (random read/write mix) -15% Tiobench sequential write -3% Tiobench random write -5% Iozone record re-write -21% Iozone mmap sequential read -20% Iozone mmap random write+45% Aio-stress direct sequential read +29% For other 3 tiobench, 7 aiostress and 13 iozone tests the result changes are less than variation. Considered patch does not impact on performance if all threads have equal IO activity (tiobench). If single thread has IO activity the performance becomes unstable and is decreased mainly. Leonid - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to completion event
> kjournald submited buffers for IO and waiting for them to finish. It is means that the patch incorrectly moves internal kernel synchronization problem into user space as EIO instead of fixing a root cause or perform iterative synchronization. After patching users will be surprised a lot of EIO (remember 47% EIO in aio-stress runs after patching) will buy new disk... This patch is harmful. Leonid >-Original Message- >From: Zach Brown [mailto:[EMAIL PROTECTED] >Sent: Wednesday, February 21, 2007 9:25 PM >To: Ken Chen >Cc: Ananiev, Leonid I; Chris Mason; linux-aio; Linux Kernel Mailing List; Benjamin LaHaise; Suparna >bhattacharya; Andrew Morton; Badari Pulavarty >Subject: Re: [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to completion event > > >On Feb 21, 2007, at 12:35 AM, Ken Chen wrote: > >> On 2/20/07, Ananiev, Leonid <[EMAIL PROTECTED]> wrote: >>> 1) mem=1G in kernel boot param if you have more >>> 2) unmount; mk2fs; mount >>> 3) dd if=/dev/zero of= bs=1M count=1200 >>> 4) aiostress -s 1200m -O -o 2 -i 1 -r 16k >>> 5) if i++<50 goto 2). >> >> Would you please instrument the call chain of >> invalidate_complete_page2() and tell us exactly where it returns zero >> value in your failure case? >> >> invalidate_complete_page2 >> try_to_release_page >> ext3_releasepage >>journal_try_to_free_buffers >> ??? > >For what it's worth, Badari has explained this race in the past in a >credible way. I'll take the liberty of pasting a mail from him: > >" >kjournald submited buffers for IO and waiting for them to finish. >Note that it has a ref. against the buffer. > >journal_commit_transaction() > ... > submited buffers for IO > /* Waiting for IO to complete */ > while (commit_transaction->t_locked_list) { > ... > get_bh(bh); > if (buffer_locked(bh)) { > spin_unlock(&journal->j_list_lock); > wait_on_buffer(bh); <<<<<< > spin_lock(&journal->j_list_lock); > } > > .. > put_bh(bh); > } > >Now, DIO process comes to frees the jh through >journal_try_to_free_buffers() >but fails to drop_buffers() since kjournald() has a reference against >it. >invalidate_inode_pages2_range() > .. > ext3_releasepage() > journal_try_to_free_buffers() > journal_put_journal_head() > __journal_try_to_free_buffer() > <--- freed jh > > try_to_free_buffers() > drop_buffers() > if (buffer_busy(bh)) > goto failed; > <<--- returns EIO due to >b_count > >" > >I don't mean to say that we shouldn't get traces to confirm the >theory, just sharing. And now we can point to this in the archives >next time :). > >- z - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to completion event
> where it returns zero I've wrote in the mail http://lkml.org/lkml/2007/2/8/337 invalidate_inode_pages2_range() reports BUG: warning at mm/truncate.c:398 occurs becouse of invalidate_complete_page2()returns 0; it returns 0 because of try_to_release_page() returns 0; it returns 0 because of ext3_releasepage() returns 0; it returns 0 because of journal_try_to_free_buffers() returns 0; it returns 0 because of try_to_free_buffers() returns 0; it returns 0 because of drop_buffers() returns 0; it returns 0 because of buffer_busy(bh)returns 1; it returns 0 because of buffer_head count is 1 (bh->b_count==1) as additional printk reports. Leonid -Original Message- From: Ken Chen [mailto:[EMAIL PROTECTED] Sent: Wednesday, February 21, 2007 11:36 AM To: Ananiev, Leonid I Cc: Chris Mason; Zach Brown; [EMAIL PROTECTED]; linux-kernel@vger.kernel.org; Benjamin LaHaise; Suparna bhattacharya; Andrew Morton Subject: Re: [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to completion event On 2/20/07, Ananiev, Leonid <[EMAIL PROTECTED]> wrote: > 1) mem=1G in kernel boot param if you have more > 2) unmount; mk2fs; mount > 3) dd if=/dev/zero of= bs=1M count=1200 > 4) aiostress -s 1200m -O -o 2 -i 1 -r 16k > 5) if i++<50 goto 2). Would you please instrument the call chain of invalidate_complete_page2() and tell us exactly where it returns zero value in your failure case? invalidate_complete_page2 try_to_release_page ext3_releasepage journal_try_to_free_buffers ??? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to completion event
-O I've missed: aiostress -s 1200m -O -o 2 -i 1 -r 16k You are right I've used harness scripts. i=0; while ((i++<50)); do ~/bm/bin/runs I2 -; done & It runs bmrun harness script which is long for different hardware configurations and test options. The lines in aio-stress wrapper which could be useful for you are: gcc aiostress.c -Wall -o aiostress -lpthread -laio 2>>${file_log} ... sudo umount -f ${pdisk} ${ppath} > /dev/null 2>&1 sudo /sbin/mke2fs -j ${pdisk} > /dev/null 2>>${file_log} sudo mount ${pdisk} ${ppath} > /dev/null 2>>${file_log}... export LD_LIBRARY_PATH=/lib:/usr/lib:$LD_LIBRARY_PATH all_files_size=1200 # in megabytes ... else num_disks=1 ((file_size = all_files_size / num_disks)) testfiles=${ppath}/test/testfile dd if=/dev/zero of=$testfiles bs=1M count=${file_size} > /dev/null fi aiostress ${*} -v -o 2 -s ${file_size}m -i 1 -r 16k $testfiles > ${file_res} 2>&1 if [ "$1" == "-S" ] ; then info="O_SYNC" elif [ "$1" == "-O" ] ; then info="O_DIRECT" fi awk ${file_res} Which does 1) mem=1G in kernel boot param if you have more 2) unmount; mk2fs; mount 3) dd if=/dev/zero of= bs=1M count=1200 4) aiostress -s 1200m -O -o 2 -i 1 -r 16k 5) if i++<50 goto 2). Leonid -Original Message- From: Chris Mason [mailto:[EMAIL PROTECTED] Sent: Tuesday, February 20, 2007 8:55 PM To: Ananiev, Leonid I Cc: Zach Brown; [EMAIL PROTECTED]; linux-kernel@vger.kernel.org; Benjamin LaHaise; Suparna bhattacharya; Andrew Morton Subject: Re: [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to completion event On Tue, Feb 20, 2007 at 08:17:51PM +0300, Ananiev, Leonid I wrote: > aio-stress command lines used for test > 1) mem=1G in kernel boot param if you have more > 2) mk2fs for test_file > 3) dd if=/dev/zero of= bs=1M count=1200 > 4) aiostress -s 1200m -o 2 -i 1 -r 16k > Sorry, this aio-stress command line doesn't do O_DIRECT. Do you have this bundled into a test script that I can try? -chris - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to completion event
There is change in the patch which is uncommented in preface. Now aio_ring_insert_entry() is not called if req->ki_users>=1. Before was called. Could you comment it? Leonid -Original Message- From: Zach Brown [mailto:[EMAIL PROTECTED] Sent: Tuesday, February 20, 2007 12:39 AM To: [EMAIL PROTECTED]; linux-kernel@vger.kernel.org Cc: Benjamin LaHaise; Suparna bhattacharya; Andrew Morton; Ananiev, Leonid I Subject: [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to completion event aio: propogate post-EIOCBQUEUED errors to completion event This addresses an oops reported by Leonid Ananiev <[EMAIL PROTECTED]> as archived at http://lkml.org/lkml/2007/2/8/337. O_DIRECT kicks off bios and returns -EIOCBQUEUED to indicate its intention to call aio_complete() once the bios complete. As we return from submission we must preserve the -EIOCBQUEUED return code so that fs/aio.c knows to let the bio completion call aio_complete(). This stops us from returning errors after O_DIRECT submission. But we have a few places that are very interested in generating errors after bio submission. The most critical of these is invalidating the page cache after a write. This avoids exposing stale data to buffered operations that are performed after the O_DIRECT write succeeds. We must do this after submission because the user buffer might have been an mmap()ed buffer of the region being written to. The get_user_pages() in the O_DIRECT completion path could have faulted in stale data. So this patch introduces a helper, aio_propogate_error(), which queues post-submission errors in the iocb so that they are given to the user completion event when aio_complete() is finally called. To get this working we change the aio_complete() path so that the ring insertion is performed as the final iocb reference is dropped. This gives the submission path time to queue its pending error before it drops its reference. This increases the space in the iocb as it has to record the two result codes from aio_complete() and the pending error from the submission path. This was tested by running O_DIRECT aio-stress concurrently with buffered reads while triggering EIO in invalidate_inode_pages2_range() with the help of a debugfs bool hack. Previously the kernel would oops as fs/aio.c and bio completion both called aio_complete(). With this patch aio-stress sees -EIO. Signed-off-by: Zach Brown <[EMAIL PROTECTED]> --- fs/aio.c| 49 +- include/linux/aio.h | 30 + mm/filemap.c|4 +-- 3 files changed, 57 insertions(+), 26 deletions(-) --- a/fs/aio.c Mon Feb 19 13:12:20 2007 -0800 +++ b/fs/aio.c Mon Feb 19 13:16:00 2007 -0800 @@ -193,8 +193,7 @@ static int aio_setup_ring(struct kioctx kunmap_atomic((void *)((unsigned long)__event & PAGE_MASK), km); \ } while(0) -static void aio_ring_insert_entry(struct kioctx *ctx, struct kiocb *iocb, - long res, long res2) +static void aio_ring_insert_entry(struct kioctx *ctx, struct kiocb *iocb) { struct aio_ring_info*info; struct aio_ring *ring; @@ -213,12 +212,12 @@ static void aio_ring_insert_entry(struct event->obj = (u64)(unsigned long)iocb->ki_obj.user; event->data = iocb->ki_user_data; - event->res = res; - event->res2 = res2; - - dprintk("aio_complete: %p[%lu]: %p: %p %Lx %lx %lx\n", + event->res = iocb->ki_pending_err ? iocb->ki_pending_err : iocb->ki_res; + event->res2 = iocb->ki_res2; + + dprintk("aio_complete: %p[%lu]: %p: %p %Lx %d %lx %lx\n", ctx, tail, iocb, iocb->ki_obj.user, iocb->ki_user_data, - res, res2); + iocb->ki_pending_err, iocb->ki_res, iocb->ki_res2); /* after flagging the request as done, we * must never even look at it again @@ -459,6 +458,7 @@ static struct kiocb fastcall *__aio_get_ req->ki_cancel = NULL; req->ki_retry = NULL; req->ki_dtor = NULL; + req->ki_pending_err = 0; req->private = NULL; req->ki_iovec = NULL; INIT_LIST_HEAD(&req->ki_run_list); @@ -548,10 +548,14 @@ static int __aio_put_req(struct kioctx * assert_spin_locked(&ctx->ctx_lock); - req->ki_users --; + req->ki_users--; BUG_ON(req->ki_users < 0); if (likely(req->ki_users)) return 0; + + if (kiocbIsInserted(req)) + aio_ring_insert_entry(ctx, req); + list_del(&req->ki_list);/* remove from active_reqs */ req->ki_cancel = NULL; req->ki_retry = NULL; @@ -983,27 +987,24 @@ int fastcall aio_complete(struct kiocb * return 1; } - /* add a completion event to the ring buffer. -
RE: [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to completion event
aio-stress command lines used for test 1) mem=1G in kernel boot param if you have more 2) mk2fs for test_file 3) dd if=/dev/zero of= bs=1M count=1200 4) aiostress -s 1200m -o 2 -i 1 -r 16k Leonid -Original Message- From: Chris Mason [mailto:[EMAIL PROTECTED] Sent: Tuesday, February 20, 2007 8:04 PM To: Ananiev, Leonid I Cc: Zach Brown; [EMAIL PROTECTED]; linux-kernel@vger.kernel.org; Benjamin LaHaise; Suparna bhattacharya; Andrew Morton Subject: Re: [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to completion event On Tue, Feb 20, 2007 at 07:57:49PM +0300, Ananiev, Leonid I wrote: > Zach> This addresses an oops reported by Leonid Ananiev > <[EMAIL PROTECTED]> > Zach> as archived at http://lkml.org/lkml/2007/2/8/337. > > Zach> This was tested by running O_DIRECT aio-stress concurrently with > buffered reads > > The oops was with aio-stress only running in the loop > WITHOUT buffered or mmaped IO which are patched and discussed now. > Actually 47% aio is finished with EIO after patch. I looked through the thread and couldn't find the aio-stress command line used for the whole test. Could you please post it here? If aio+dio is being used to extend the file, you'll get pages in the page cache, which is hopefully why you're able to trigger the EIOs. -chris - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to completion event
Zach> This addresses an oops reported by Leonid Ananiev <[EMAIL PROTECTED]> Zach> as archived at http://lkml.org/lkml/2007/2/8/337. Zach> This was tested by running O_DIRECT aio-stress concurrently with buffered reads The oops was with aio-stress only running in the loop WITHOUT buffered or mmaped IO which are patched and discussed now. Actually 47% aio is finished with EIO after patch. Leonnid -Original Message- From: Zach Brown [mailto:[EMAIL PROTECTED] Sent: Tuesday, February 20, 2007 12:39 AM To: [EMAIL PROTECTED]; linux-kernel@vger.kernel.org Cc: Benjamin LaHaise; Suparna bhattacharya; Andrew Morton; Ananiev, Leonid I Subject: [PATCH 2/2] aio: propogate post-EIOCBQUEUED errors to completion event aio: propogate post-EIOCBQUEUED errors to completion event This addresses an oops reported by Leonid Ananiev <[EMAIL PROTECTED]> as archived at http://lkml.org/lkml/2007/2/8/337. O_DIRECT kicks off bios and returns -EIOCBQUEUED to indicate its intention to call aio_complete() once the bios complete. As we return from submission we must preserve the -EIOCBQUEUED return code so that fs/aio.c knows to let the bio completion call aio_complete(). This stops us from returning errors after O_DIRECT submission. But we have a few places that are very interested in generating errors after bio submission. The most critical of these is invalidating the page cache after a write. This avoids exposing stale data to buffered operations that are performed after the O_DIRECT write succeeds. We must do this after submission because the user buffer might have been an mmap()ed buffer of the region being written to. The get_user_pages() in the O_DIRECT completion path could have faulted in stale data. So this patch introduces a helper, aio_propogate_error(), which queues post-submission errors in the iocb so that they are given to the user completion event when aio_complete() is finally called. To get this working we change the aio_complete() path so that the ring insertion is performed as the final iocb reference is dropped. This gives the submission path time to queue its pending error before it drops its reference. This increases the space in the iocb as it has to record the two result codes from aio_complete() and the pending error from the submission path. This was tested by running O_DIRECT aio-stress concurrently with buffered reads while triggering EIO in invalidate_inode_pages2_range() with the help of a debugfs bool hack. Previously the kernel would oops as fs/aio.c and bio completion both called aio_complete(). With this patch aio-stress sees -EIO. Signed-off-by: Zach Brown <[EMAIL PROTECTED]> --- fs/aio.c| 49 +- include/linux/aio.h | 30 + mm/filemap.c|4 +-- 3 files changed, 57 insertions(+), 26 deletions(-) --- a/fs/aio.c Mon Feb 19 13:12:20 2007 -0800 +++ b/fs/aio.c Mon Feb 19 13:16:00 2007 -0800 @@ -193,8 +193,7 @@ static int aio_setup_ring(struct kioctx kunmap_atomic((void *)((unsigned long)__event & PAGE_MASK), km); \ } while(0) -static void aio_ring_insert_entry(struct kioctx *ctx, struct kiocb *iocb, - long res, long res2) +static void aio_ring_insert_entry(struct kioctx *ctx, struct kiocb *iocb) { struct aio_ring_info*info; struct aio_ring *ring; @@ -213,12 +212,12 @@ static void aio_ring_insert_entry(struct event->obj = (u64)(unsigned long)iocb->ki_obj.user; event->data = iocb->ki_user_data; - event->res = res; - event->res2 = res2; - - dprintk("aio_complete: %p[%lu]: %p: %p %Lx %lx %lx\n", + event->res = iocb->ki_pending_err ? iocb->ki_pending_err : iocb->ki_res; + event->res2 = iocb->ki_res2; + + dprintk("aio_complete: %p[%lu]: %p: %p %Lx %d %lx %lx\n", ctx, tail, iocb, iocb->ki_obj.user, iocb->ki_user_data, - res, res2); + iocb->ki_pending_err, iocb->ki_res, iocb->ki_res2); /* after flagging the request as done, we * must never even look at it again @@ -459,6 +458,7 @@ static struct kiocb fastcall *__aio_get_ req->ki_cancel = NULL; req->ki_retry = NULL; req->ki_dtor = NULL; + req->ki_pending_err = 0; req->private = NULL; req->ki_iovec = NULL; INIT_LIST_HEAD(&req->ki_run_list); @@ -548,10 +548,14 @@ static int __aio_put_req(struct kioctx * assert_spin_locked(&ctx->ctx_lock); - req->ki_users --; + req->ki_users--; BUG_ON(req->ki_users < 0); if (likely(req->ki_users)) return 0; + + if (kiocbIsInserted(req)) + aio_ring_insert_entry(ctx, req); + list_del(&req->ki_list);/* remove from activ
RE: [PATCH] aio: propogate post-EIOCBQUEUED errors to completion event
I have applied the Brown's patch and run aio-stress O_DIRECT random write 100 times There is 4% throughput degradation. 47 of 100 aio-stress runs reported an EIO. 1 aio-stress error output example: adding stage random write starting with random write file size 1200MB, record size 16KB, depth 64, ios per iteration 1 max io_submit 1, buffer alignment set to 4KB threads 1 files 1 contexts 1 context offset 2MB verification on adding file /mnt/stp/test/testfile thread 0 io err 18446744073709551611 (Input/output error) op 1, size 16384 io err 18446744073709551611 (Input/output error) op 1, size 16384 io err 18446744073709551611 (Input/output error) op 1, size 16384 latency min 0.01 avg 0.16 max 6923.40 76784 < 100 13 < 250 2 < 500 0 < 1000 0 < 5000 1 < 1 completion latency min 0.21 avg 153.07 max 1103.15 34330 < 100 26669 < 250 13653 < 500 2082 < 1000 2 < 5000 0 < 1 random write on /mnt/stp/test/testfile (6.06 MB/s) 1200.00 MB in 198.03s 3 errors on oper, last 4294967291 thread 0 random write totals (6.06 MB/s) 1200.00 MB in 198.04s 3 errors on oper, last 4294967291 Leonid -Original Message- From: Zach Brown [mailto:[EMAIL PROTECTED] Sent: Monday, February 19, 2007 11:35 PM To: [EMAIL PROTECTED]; linux-kernel@vger.kernel.org Cc: Benjamin LaHaise; Suparna bhattacharya; Andrew Morton; Ananiev, Leonid I Subject: [PATCH] aio: propogate post-EIOCBQUEUED errors to completion event aio: propogate post-EIOCBQUEUED errors to completion event This addresses an oops reported by Leonid Ananiev <[EMAIL PROTECTED]> as archived at http://lkml.org/lkml/2007/2/8/337. O_DIRECT kicks off bios and returns -EIOCBQUEUED to indicate its intention to call aio_complete() once the bios complete. As we return from submission we must preserve the -EIOCBQUEUED return code so that fs/aio.c knows to let the bio completion call aio_complete(). This stops us from returning errors after O_DIRECT submission. But we have a few places that are very interested in generating errors after bio submission. The most critical of these is invalidating the page cache after a write. This avoids exposing stale data to buffered operations that are performed after the O_DIRECT write succeeds. We must do this after submission because the user buffer might have been an mmap()ed buffer of the region being written to. The get_user_pages() in the O_DIRECT completion path could have faulted in stale data. So this patch introduces a helper, aio_propogate_error(), which queues post-submission errors in the iocb so that they are given to the user completion event when aio_complete() is finally called. To get this working we change the aio_complete() path so that the ring insertion is performed as the final iocb reference is dropped. This gives the submission path time to queue its pending error before it drops its reference. This increases the space in the iocb as it has to record the two result codes from aio_complete() and the pending error from the submission path. This was tested by running O_DIRECT aio-stress concurrently with buffered reads while triggering EIO in invalidate_inode_pages2_range() with the help of a debugfs bool hack. Previously the kernel would oops as fs/aio.c and bio completion both called aio_complete(). With this patch aio-stress sees -EIO. Signed-off-by: Zach Brown <[EMAIL PROTECTED]> --- fs/aio.c| 109 ++ include/linux/aio.h | 30 +++ mm/filemap.c|4 - 3 files changed, 91 insertions(+), 52 deletions(-) --- a/fs/aio.c Mon Feb 19 11:52:27 2007 -0800 +++ b/fs/aio.c Mon Feb 19 12:32:52 2007 -0800 @@ -192,6 +192,46 @@ static int aio_setup_ring(struct kioctx (void)__event; \ kunmap_atomic((void *)((unsigned long)__event & PAGE_MASK), km); \ } while(0) + +static void aio_ring_insert_entry(struct kioctx *ctx, struct kiocb *iocb) +{ + struct aio_ring_info*info; + struct aio_ring *ring; + struct io_event *event; + unsigned long tail; + + assert_spin_locked(&ctx->ctx_lock); + + info = &ctx->ring_info; + ring = kmap_atomic(info->ring_pages[0], KM_IRQ1); + + tail = info->tail; + event = aio_ring_event(info, tail, KM_IRQ0); + if (++tail >= info->nr) + tail = 0; + + event->obj = (u64)(unsigned long)iocb->ki_obj.user; + event->data = iocb->ki_user_data; + event->res = iocb->ki_pending_err ? iocb->ki_pending_err : iocb->ki_res; + event->res2 = iocb->ki_res2; + + dprintk("aio_complete: %p[%lu]: %p: %p %Lx %d %lx %lx\n", + ctx, tail, iocb, iocb->ki_obj.user, iocb->ki_user_data, + iocb->ki_pending_err, iocb->ki_res, iocb->ki_res2); + + /* after flagging the requ
RE: [PATCH] aio: propogate post-EIOCBQUEUED errors to completion event
> while triggering EIO in invalidate_inode_pages2_range() ... > With this patch aio-stress sees -EIO. Actually if invalidate_inode_pages2_range() returns EIO it means that internal kernel synchronization conflict was happen. It is reported to user as hardware IO error. Iteration in synchronization process could be performed instead. Leonid -Original Message- From: Zach Brown [mailto:[EMAIL PROTECTED] Sent: Monday, February 19, 2007 11:35 PM To: [EMAIL PROTECTED]; linux-kernel@vger.kernel.org Cc: Benjamin LaHaise; Suparna bhattacharya; Andrew Morton; Ananiev, Leonid I Subject: [PATCH] aio: propogate post-EIOCBQUEUED errors to completion event aio: propogate post-EIOCBQUEUED errors to completion event This addresses an oops reported by Leonid Ananiev <[EMAIL PROTECTED]> as archived at http://lkml.org/lkml/2007/2/8/337. O_DIRECT kicks off bios and returns -EIOCBQUEUED to indicate its intention to call aio_complete() once the bios complete. As we return from submission we must preserve the -EIOCBQUEUED return code so that fs/aio.c knows to let the bio completion call aio_complete(). This stops us from returning errors after O_DIRECT submission. But we have a few places that are very interested in generating errors after bio submission. The most critical of these is invalidating the page cache after a write. This avoids exposing stale data to buffered operations that are performed after the O_DIRECT write succeeds. We must do this after submission because the user buffer might have been an mmap()ed buffer of the region being written to. The get_user_pages() in the O_DIRECT completion path could have faulted in stale data. So this patch introduces a helper, aio_propogate_error(), which queues post-submission errors in the iocb so that they are given to the user completion event when aio_complete() is finally called. To get this working we change the aio_complete() path so that the ring insertion is performed as the final iocb reference is dropped. This gives the submission path time to queue its pending error before it drops its reference. This increases the space in the iocb as it has to record the two result codes from aio_complete() and the pending error from the submission path. This was tested by running O_DIRECT aio-stress concurrently with buffered reads while triggering EIO in invalidate_inode_pages2_range() with the help of a debugfs bool hack. Previously the kernel would oops as fs/aio.c and bio completion both called aio_complete(). With this patch aio-stress sees -EIO. Signed-off-by: Zach Brown <[EMAIL PROTECTED]> --- fs/aio.c| 109 ++ include/linux/aio.h | 30 +++ mm/filemap.c|4 - 3 files changed, 91 insertions(+), 52 deletions(-) --- a/fs/aio.c Mon Feb 19 11:52:27 2007 -0800 +++ b/fs/aio.c Mon Feb 19 12:32:52 2007 -0800 @@ -192,6 +192,46 @@ static int aio_setup_ring(struct kioctx (void)__event; \ kunmap_atomic((void *)((unsigned long)__event & PAGE_MASK), km); \ } while(0) + +static void aio_ring_insert_entry(struct kioctx *ctx, struct kiocb *iocb) +{ + struct aio_ring_info*info; + struct aio_ring *ring; + struct io_event *event; + unsigned long tail; + + assert_spin_locked(&ctx->ctx_lock); + + info = &ctx->ring_info; + ring = kmap_atomic(info->ring_pages[0], KM_IRQ1); + + tail = info->tail; + event = aio_ring_event(info, tail, KM_IRQ0); + if (++tail >= info->nr) + tail = 0; + + event->obj = (u64)(unsigned long)iocb->ki_obj.user; + event->data = iocb->ki_user_data; + event->res = iocb->ki_pending_err ? iocb->ki_pending_err : iocb->ki_res; + event->res2 = iocb->ki_res2; + + dprintk("aio_complete: %p[%lu]: %p: %p %Lx %d %lx %lx\n", + ctx, tail, iocb, iocb->ki_obj.user, iocb->ki_user_data, + iocb->ki_pending_err, iocb->ki_res, iocb->ki_res2); + + /* after flagging the request as done, we +* must never even look at it again +*/ + smp_wmb(); /* make event visible before updating tail */ + + info->tail = tail; + ring->tail = tail; + + put_aio_ring_event(event, KM_IRQ0); + kunmap_atomic(ring, KM_IRQ1); + + pr_debug("added to ring %p at [%lu]\n", iocb, tail); +} /* ioctx_alloc * Allocates and initializes an ioctx. Returns an ERR_PTR if it failed. @@ -418,6 +458,7 @@ static struct kiocb fastcall *__aio_get_ req->ki_cancel = NULL; req->ki_retry = NULL; req->ki_dtor = NULL; + req->ki_pending_err = 0; req->private = NULL; req->ki_iovec = NULL; INIT_LIST_HEAD(&req->ki_run_list); @@ -507,10 +548,14 @@ static int __aio_put_req
RE: [PATCH] aio: fix kernel bug when page is temporally busy
Zach Brown wrote: > It will not return until kiocbSetKicked() is called, > and that is only called from kick_iocb(). There is no test or wait of Kicked in considered for (;;) aio_write() loop. Zach Brown wrote: >> The proposed patch does not crate this bug if any. > Right, and I said that in the mail you're quoting. ... > You're introducing other bugs with the patch. Could you list other introduced bugs? >> It is interesting that I've not seen any EIOCBQUEUED returned >> to aio_run_iocb() during 5 hours aiostress running. > What arguments are you running aio-stress with? -EIOCBQUEUED is only > used for O_DIRECT I wrote in the vary first mail that the panic is appearing in random write O_DIRECT aio-stress running. Other aio-stress modes where tested after patching as well. > and then only in certain circumstances. Looking closely into sources we can see that EIOCBQUEUED never may be returned to aio_run_iocb(). include/linux/aio.h says * If ki_retry returns -EIOCBRETRY ... Could you point source line which "returns -EIOCBRETRY"? Leonid -Original Message- From: Zach Brown [mailto:[EMAIL PROTECTED] Sent: Friday, February 16, 2007 3:01 AM To: Ananiev, Leonid I Cc: Ken Chen; [EMAIL PROTECTED]; Andrew Morton; linux-kernel@vger.kernel.org; linux-aio; Chris Mason Subject: Re: [PATCH] aio: fix kernel bug when page is temporally busy On Feb 15, 2007, at 3:32 PM, Ananiev, Leonid I wrote: >>> If EIOCBRETRY then generic_file_aio_write() will be recalled for the >>> same iocb. >> Only if kick_iocb() is called. It won't be called if i_i_p2_r() was >> the only thing to return -EIOCBRETRY. > It is not need to call kick_iocb() > for generic_file_aio_write() calling. > It is recalled without any wakeup waiting: > for (;;) { > ret = filp->f_op->aio_write(&kiocb, &iov, 1, > kiocb.ki_pos); > if (ret != -EIOCBRETRY) > break; > wait_on_retry_sync_kiocb(&kiocb); > } > Note: wait_on_retry_sync_kiocb() does not wait. Yes it does. It will not return until kiocbSetKicked() is called, and that is only called from kick_iocb(). >>> It overwrites -EIOCBQUEUED > Do you mean that there is one more kernel bug which > overwrites -EIOCBQUEUED by any errno or number of bytes and this > new value is returned to caller as an IO result > while IO is not finished yet. > > The proposed patch does not crate this bug if any. Right, and I said that in the mail you're quoting. > It actually fixes a kernel panic bag when iocb.users count becomes > incorrect. The bag " Kernel BUG at fs/aio.c:509" is there because > aio_run_iocb() have not a chance to differ real EIO and > EIO which is actually means EAGAYN or EIOCBRETRY. Yes, I understand the bug you're trying to fix. You're introducing other bugs with the patch. It will not be merged. > It is interesting that I've not seen any EIOCBQUEUED returned > to aio_run_iocb() during 5 hours aiostress running. What arguments are you running aio-stress with? -EIOCBQUEUED is only used for O_DIRECT, and then only in certain circumstances. - z - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH] aio: fix kernel bug when page is temporally busy
>> If EIOCBRETRY then generic_file_aio_write() will be recalled for the >> same iocb. > Only if kick_iocb() is called. It won't be called if i_i_p2_r() was > the only thing to return -EIOCBRETRY. It is not need to call kick_iocb() for generic_file_aio_write() calling. It is recalled without any wakeup waiting: for (;;) { ret = filp->f_op->aio_write(&kiocb, &iov, 1, kiocb.ki_pos); if (ret != -EIOCBRETRY) break; wait_on_retry_sync_kiocb(&kiocb); } Note: wait_on_retry_sync_kiocb() does not wait. That is for dio. For aio iocb generic_file_aio_write() call is required from ki_run_list while next io_submit or read_events() is called. So when an IO hang may happen? >> It overwrites -EIOCBQUEUED Do you mean that there is one more kernel bug which overwrites -EIOCBQUEUED by any errno or number of bytes and this new value is returned to caller as an IO result while IO is not finished yet. The proposed patch does not crate this bug if any. It actually fixes a kernel panic bag when iocb.users count becomes incorrect. The bag " Kernel BUG at fs/aio.c:509" is there because aio_run_iocb() have not a chance to differ real EIO and EIO which is actually means EAGAYN or EIOCBRETRY. I'me sure the patch changes source code in correct direction: to differentiate that two kinds of EIOs. > Have you read the giant comment over the definition of struct kiocb > in include/linux/aio.h? I have read. But compiler has not: it did not create an object code for * If ki_retry returns -EIOCBRETRY ... >>> This can lead to reference count confusion. >> But just reference count confusion was deleted by patch. Isn't it? >Sorry, I don't understand what you're trying to ask here. One of reference count iocb.users confusion is deleted by the patch. I'm not sure that there is other bag. At least I have not see IO hang while testing. It is interesting that I've not seen any EIOCBQUEUED returned to aio_run_iocb() during 5 hours aiostress running. Does it mean that EIOCBQUEUED is always reset and never returned? Leonid -Original Message----- From: Zach Brown [mailto:[EMAIL PROTECTED] Sent: Thursday, February 15, 2007 10:23 PM To: Ananiev, Leonid I Cc: Ken Chen; [EMAIL PROTECTED]; Andrew Morton; linux-kernel@vger.kernel.org; linux-aio; Chris Mason Subject: Re: [PATCH] aio: fix kernel bug when page is temporally busy On Feb 15, 2007, at 11:11 AM, Ananiev, Leonid I wrote: >> It returns -EIOCBRETRY without guaranteeing that kick_iocb() will be >> called. This can lead to operations hanging > > If EIOCBRETRY then generic_file_aio_write() will be recalled for the > same iocb. Only if kick_iocb() is called. It won't be called if i_i_p2_r() was the only thing to return -EIOCBRETRY. > >> It overwrites -EIOCBQUEUED, leading to an aio_complete() while a >> retry is happening. > > EIOCBQUEUED or EIOCBRETRY does not lead to aio_complete() call: Not by fs/aio.c, but *by the place that originated -EIOCBQUEUED*. Later. After IO has completed. see fs/direct-io.c:dio_bio_end_aio(). This is what -EIOCBQUEUED means! It's a promise to call aio_complete () in the future. Have you read the giant comment over the definition of struct kiocb in include/linux/aio.h? >> This can lead to reference count confusion. > But just reference count confusion was deleted by patch. Isn't it? Sorry, I don't understand what you're trying to ask here. - z - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH] aio: fix kernel bug when page is temporally busy
-Original Message- From: Zach Brown [mailto:[EMAIL PROTECTED] Sent: Thursday, February 15, 2007 10:23 PM To: Ananiev, Leonid I Cc: Ken Chen; [EMAIL PROTECTED]; Andrew Morton; linux-kernel@vger.kernel.org; linux-aio; Chris Mason Subject: Re: [PATCH] aio: fix kernel bug when page is temporally busy On Feb 15, 2007, at 11:11 AM, Ananiev, Leonid I wrote: >> It returns -EIOCBRETRY without guaranteeing that kick_iocb() will be >> called. This can lead to operations hanging > > If EIOCBRETRY then generic_file_aio_write() will be recalled for the > same iocb. Only if kick_iocb() is called. It won't be called if i_i_p2_r() was the only thing to return -EIOCBRETRY. > >> It overwrites -EIOCBQUEUED, leading to an aio_complete() while a >> retry is happening. > > EIOCBQUEUED or EIOCBRETRY does not lead to aio_complete() call: Not by fs/aio.c, but *by the place that originated -EIOCBQUEUED*. Later. After IO has completed. see fs/direct-io.c:dio_bio_end_aio(). This is what -EIOCBQUEUED means! It's a promise to call aio_complete () in the future. Have you read the giant comment over the definition of struct kiocb in include/linux/aio.h? >> This can lead to reference count confusion. > But just reference count confusion was deleted by patch. Isn't it? Sorry, I don't understand what you're trying to ask here. - z - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH] aio: fix kernel bug when page is temporally busy
> It returns -EIOCBRETRY without guaranteeing that kick_iocb() will be > called. This can lead to operations hanging If EIOCBRETRY then generic_file_aio_write() will be recalled for the same iocb. > It overwrites -EIOCBQUEUED, leading to an aio_complete() while a > retry is happening. EIOCBQUEUED or EIOCBRETRY does not lead to aio_complete() call: if (ret != -EIOCBRETRY && ret != -EIOCBQUEUED) { aio_complete(iocb, ret, 0); > This can lead to reference count confusion. But just reference count confusion was deleted by patch. Isn't it? Leonid -Original Message- From: Zach Brown [mailto:[EMAIL PROTECTED] Sent: Thursday, February 15, 2007 9:25 PM To: Ananiev, Leonid I Cc: Ken Chen; [EMAIL PROTECTED]; Andrew Morton; linux-kernel@vger.kernel.org; linux-aio; Chris Mason Subject: Re: [PATCH] aio: fix kernel bug when page is temporally busy > If invalidate_inode_pages2_range() will return EIOCBRETRY as the patch > "aio: fix kernel bug when page is temporally busy" Sorry Leonid, this patch is not safe. It returns -EIOCBRETRY without guaranteeing that kick_iocb() will be called. This can lead to operations hanging, both AIO and calls that come through do_sync_{read,write}. It overwrites -EIOCBQUEUED, leading to an aio_complete() while a retry is happening. This can lead to reference count confusion. Double-frees, referencing freed memory, that kind of thing. This isn't a new problem. The current code that overwrites with -EIO has this problem. But moving to -EIOCBRETRY does introduce new behaviour of aio_complete() and the retry path racing. I'll have a candidate patch to address the problem of EIO being raised on the way back up from a path which has returned -EIOCBQUEUED. - z - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH] aio: fix kernel bug when page is temporally busy
Ken Chen wrote > It might shut up kernel > panic by eliminate double calls to aio_complete(), but it will > silently introduce data corruption. I had got kernel panic after an hour of aiostress running. After patching I have not got aiostress massage "verify error, file %s offset %Lu contents (offset:bad:good):\n" during 5 hour aiostress running with 'verify' option. Looking closely into aiostress.c ftp://ftp.suse.com/pub/people/mason/utils/aio-stress.c we can see that this program may write in random mode THE SAME contents to the same file offset asynchronously from different buffers and do not cure about it. Does Ken consider that kernel panic is the best way to prevent data corruption in such kind of programs? > So any error value returned from invalidate_inode_pages2_range() has > to be taken seriously in the direct IO submit path instead of dropping > it to the floor. If invalidate_inode_pages2_range() will return EIOCBRETRY as the patch "aio: fix kernel bug when page is temporally busy" sets then do_sync_read/write() will not drop IO submit but will retry it: for (;;) { ret = filp->f_op->aio_read(&kiocb, &iov, 1, kiocb.ki_pos); if (ret != -EIOCBRETRY) break; wait_on_retry_sync_kiocb(&kiocb); } And do_sync_read/write() will not return EIO if page is busy as it does now, before patching. Ken Chen wrote: > I also think the original patch is wrong. What do you mean saying 'also'? Leonid - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH] aio: fix kernel bug when page is temporally busy
> Please, tell us what problem this is fixing so that we can look into > alternative solutions. This patch fixes a kernel panic "Kernel BUG at fs/aio.c:509" http://marc.theaimsgroup.com/?l=linux-kernel&m=117031052517746&w=2 First of all it was found that the kernel panic happens after IO error reporting. But later it was found that the actual reason is not in real IO error but in a busy page. EIO is returned if page has unstable state while IO completion result is processing. If aio_run_iocb() as well as in do_sync_read/write() get EIO result it is considered as IO is finished and IO control block could be free; result is reported to caller. I suppose that aio_run_iocb() as well as in do_sync_read/write() should have a chance to differ real EIO and EIO which is actually means EAGAYN. That functions test for EIOCBRETRY meaning EAGAYN purpose. do_sync_read/write() retries IO as well as aio: for (;;) { ret = filp->f_op->aio_read(&kiocb, &iov, 1, kiocb.ki_pos); if (ret != -EIOCBRETRY) break; wait_on_retry_sync_kiocb(&kiocb); } By the way EIOCBRETRY is never actually set in the kernel. So what is the purpose of previous lines? More details about EIO returning: aio_complete: res=-5 iocb=81002b38f080 ki_users=2 [] invalidate_inode_pages2_range+0x236/0x26b [] ext3_direct_IO+0x16c/0x19e [] ext3_get_block+0x0/0xe2 [] generic_file_direct_IO+0xb9/0xcf [] generic_file_direct_write+0x67/0x10e [] __generic_file_aio_write_nolock+0x2d6/0x3fe [] __switch_to+0x26f/0x27e [] thread_return+0x0/0xd8 [] generic_file_aio_write+0x67/0xc7 [] ext3_file_write+0x0/0x8f [] ext3_file_write+0x16/0x8f [] ext3_file_write+0x0/0x8f [] aio_rw_vect_retry+0x72/0x14f [] aio_run_iocb+0xe6/0x187 [] io_submit_one+0x296/0x2e3 [] sys_io_submit+0x9b/0x108 [] default_wake_function+0x0/0xe [] system_call+0x7e/0x83 Call Trace: [] aio_complete+0x5c/0x18c [] finished_one_bio+0xac/0xf3 [] dio_bio_complete+0x95/0xaa [] dio_bio_end_aio+0x20/0x25 [] __end_that_request_first+0x10e/0x3fd [] scsi_end_request+0x27/0xc9 [] scsi_io_completion+0xf0/0x2c5 [] ahd_done+0x537/0x580 [] sd_rw_intr+0x182/0x1ad [] blk_done_softirq+0x5c/0x6a [] __do_softirq+0x55/0xc3 [] ack_apic_level+0x37/0x4b [] call_softirq+0x1c/0x28 [] do_softirq+0x2c/0x7d [] do_IRQ+0xa8/0xc8 [] mwait_idle+0x0/0x20 [] ret_from_intr+0x0/0xa [] mwait_idle_with_hints+0x44/0x45 [] mwait_idle+0xc/0x20 [] cpu_idle+0x8b/0xb0 [] start_secondary+0x462/0x471 -Original Message- From: Andrew Morton [mailto:[EMAIL PROTECTED] Sent: Thursday, February 15, 2007 6:31 AM To: Ananiev, Leonid I Cc: linux-kernel@vger.kernel.org Subject: Re: [PATCH] aio: fix kernel bug when page is temporally busy On Wed, 14 Feb 2007 20:51:33 +0300 "Ananiev, Leonid I" <[EMAIL PROTECTED]> wrote: > Fix kernel bug when IO page is temporally busy: > invalidate_inode_pages2_range() returns EIOCBRETRY but not EIO. > invalidate_inode_pages2() returns EIO as earlier. > > Signed-off-by: Leonid Ananiev <[EMAIL PROTECTED]> > --- > --- linux-2.6.20/mm/truncate.c 2007-02-04 10:44:54.0 -0800 > +++ linux-2.6.20p/mm/truncate.c 2007-02-08 22:56:52.0 -0800 > @@ -366,7 +366,7 @@ static int do_launder_page(struct addres > * Any pages which are found to be mapped into pagetables are unmapped prior to > * invalidation. > * > - * Returns -EIO if any pages could not be invalidated. > + * Returns -EIOCBRETRY if any pages could not be invalidated. > */ > int invalidate_inode_pages2_range(struct address_space *mapping, > pgoff_t start, pgoff_t end) > @@ -423,7 +423,7 @@ int invalidate_inode_pages2_range(struct > } > ret = do_launder_page(mapping, page); > if (ret == 0 && !invalidate_complete_page2(mapping, page)) > - ret = -EIO; > + ret = -EIOCBRETRY; > unlock_page(page); > } > pagevec_release(&pvec); > @@ -444,6 +444,7 @@ EXPORT_SYMBOL_GPL(invalidate_inode_pages > */ > int invalidate_inode_pages2(struct address_space *mapping) > { > - return invalidate_inode_pages2_range(mapping, 0, -1); > + int ret = invalidate_inode_pages2_range(mapping, 0, -1); > + return (ret < 0)?-EIO:ret; > } > EXPORT_SYMBOL_GPL(invalidate_inode_pages2); If someone later uses invalidate_inode_pages2_range() elsewhere, they're going to need to know to convert -EIOCBRETRY into -EIO, if they weren't called by aio. Or something. Please, tell us what problem this is fixing so that we can look into alternative solutions. For example, one acceptable-but-ugly solution might be to do: static inline int invalidate_inode_page
[PATCH] aio: fix kernel bug when page is temporally busy
-wrap lines are fixed. Sorry. >From Leonid Ananiev Fix kernel bug if IO page is temporally busy: invalidate_inode_pages2_range() returns EIOCBRETRY but not EIO. invalidate_inode_pages2() returns EIO as earlier. Signed-off-by: Leonid Ananiev <[EMAIL PROTECTED]> --- --- linux-2.6.20/mm/truncate.c 2007-02-04 10:44:54.0 -0800 +++ linux-2.6.20p/mm/truncate.c 2007-02-08 22:56:52.0 -0800 @@ -366,7 +366,7 @@ static int do_launder_page(struct addres * Any pages which are found to be mapped into pagetables are unmapped prior to * invalidation. *This is path is modified so that invalidate_inode_pages2() returns EIO as earlier. Wrap lines are fixed - From Leonid Ananiev Fix kernel bug when IO page is temporally busy: invalidate_inode_pages2_range() returns EIOCBRETRY but not EIO. invalidate_inode_pages2() returns EIO as earlier. Signed-off-by: Leonid Ananiev <[EMAIL PROTECTED]> --- --- linux-2.6.20/mm/truncate.c 2007-02-04 10:44:54.0 -0800 +++ linux-2.6.20p/mm/truncate.c 2007-02-08 22:56:52.0 -0800 @@ -366,7 +366,7 @@ static int do_launder_page(struct addres * Any pages which are found to be mapped into pagetables are unmapped prior to * invalidation. * - * Returns -EIO if any pages could not be invalidated. + * Returns -EIOCBRETRY if any pages could not be invalidated. */ int invalidate_inode_pages2_range(struct address_space *mapping, pgoff_t start, pgoff_t end) @@ -423,7 +423,7 @@ int invalidate_inode_pages2_range(struct } ret = do_launder_page(mapping, page); if (ret == 0 && !invalidate_complete_page2(mapping, page)) - ret = -EIO; + ret = -EIOCBRETRY; unlock_page(page); } pagevec_release(&pvec); @@ -444,6 +444,7 @@ EXPORT_SYMBOL_GPL(invalidate_inode_pages */ int invalidate_inode_pages2(struct address_space *mapping) { - return invalidate_inode_pages2_range(mapping, 0, -1); + int ret = invalidate_inode_pages2_range(mapping, 0, -1); + return (ret < 0)?-EIO:ret; } EXPORT_SYMBOL_GPL(invalidate_inode_pages2); - * Returns -EIO if any pages could not be invalidated. + * Returns -EIOCBRETRY if any pages could not be invalidated. */ int invalidate_inode_pages2_range(struct address_space *mapping, pgoff_t start, pgoff_t end) @@ -423,7 +423,7 @@ int invalidate_inode_pages2_range(struct } ret = do_launder_page(mapping, page); if (ret == 0 && !invalidate_complete_page2(mapping, page)) - ret = -EIO; + ret = -EIOCBRETRY; unlock_page(page); } pagevec_release(&pvec); @@ -444,6 +444,7 @@ EXPORT_SYMBOL_GPL(invalidate_inode_pages */ int invalidate_inode_pages2(struct address_space *mapping) { - return invalidate_inode_pages2_range(mapping, 0, -1); + int ret = invalidate_inode_pages2_range(mapping, 0, -1); + return (ret < 0)?-EIO:ret; } EXPORT_SYMBOL_GPL(invalidate_inode_pages2); - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH] aio: fix kernel bug when page is temporally busy
Andrew, You wrote on Friday, February 09, 2007 8:53 AM > invalidate_inode_pages2() has other callers. I suspect with this change > we'll end up leaking EIOCBRETRY back to userspace. The path is modified so that invalidate_inode_pages2() returns EIO as earlier. could you consider modified patch The patch against 2.6.20. Long story: The kernel panic is happening after hours of AIO benchmark running in mcp. First of all it was found that the kernel panic happens if IO error is reported. But later it was found that the actual reason is not in real IO error but in a busy page. While the current CPU tests if IO is completed it happens that another CPU at the same time processes IO completion in soft_irq. The considered buffer page is busy now by second CPU and invalidate_inode_pages2_range() returns EIO in this case. First CPU reports EIO to caller ; completes IO and frees control block in aio_complete(). Second CPU frees the same control block once more. The patch makes invalidate_inode_pages2_range() to return EIOCBRETRY which is tested just in aio_run_iocb(). It retries IO competition check if EIOCBRETRY is got. EIOCBRETRY is tested in do_sync_read/write() functions as well. And direct IO competition will be retested "instead of dropping it to the floor". >From Leonid Ananiev Fix kernel bug when IO page is temporally busy: invalidate_inode_pages2_range() returns EIOCBRETRY but not EIO. invalidate_inode_pages2() returns EIO as earlier. Signed-off-by: Leonid Ananiev <[EMAIL PROTECTED]> --- --- linux-2.6.20/mm/truncate.c 2007-02-04 10:44:54.0 -0800 +++ linux-2.6.20p/mm/truncate.c 2007-02-08 22:56:52.0 -0800 @@ -366,7 +366,7 @@ static int do_launder_page(struct addres * Any pages which are found to be mapped into pagetables are unmapped prior to * invalidation. * - * Returns -EIO if any pages could not be invalidated. + * Returns -EIOCBRETRY if any pages could not be invalidated. */ int invalidate_inode_pages2_range(struct address_space *mapping, pgoff_t start, pgoff_t end) @@ -423,7 +423,7 @@ int invalidate_inode_pages2_range(struct } ret = do_launder_page(mapping, page); if (ret == 0 && !invalidate_complete_page2(mapping, page)) - ret = -EIO; + ret = -EIOCBRETRY; unlock_page(page); } pagevec_release(&pvec); @@ -444,6 +444,7 @@ EXPORT_SYMBOL_GPL(invalidate_inode_pages */ int invalidate_inode_pages2(struct address_space *mapping) { - return invalidate_inode_pages2_range(mapping, 0, -1); + int ret = invalidate_inode_pages2_range(mapping, 0, -1); + return (ret < 0)?-EIO:ret; } EXPORT_SYMBOL_GPL(invalidate_inode_pages2); - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Recall: [PATCH] aio: fix kernel bug when page is temporally busy
Ananiev, Leonid I would like to recall the message, "[PATCH] aio: fix kernel bug when page is temporally busy". - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Recall: [PATCH] aio: fix kernel bug when page is temporally busy
Ananiev, Leonid I would like to recall the message, "[PATCH] aio: fix kernel bug when page is temporally busy". - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH] aio: fix kernel bug when page is temporally busy
Andrew, You wrote on Friday, February 09, 2007 8:53 AM > invalidate_inode_pages2() has other callers. I suspect with this change > we'll end up leaking EIOCBRETRY back to userspace. The path is modified so that invalidate_inode_pages2() returns EIO as earlier. could you consider modified patch The patch against 2.6.20. Long story: The kernel panic is happening after hours of AIO benchmark running in mcp. First of all it was found that the kernel panic happens if IO error is reported. But later it was found that the actual reason is not in real IO error but in a busy page. While the current CPU tests if IO is completed it happens that another CPU at the same time processes IO completion in soft_irq. The considered buffer page is busy now by second CPU and invalidate_inode_pages2_range() returns EIO in this case. First CPU reports EIO to caller ; completes IO and frees control block in aio_complete(). Second CPU frees the same control block once more. The patch makes invalidate_inode_pages2_range() to return EIOCBRETRY which is tested just in aio_run_iocb(). It retries IO competition check if EIOCBRETRY is got. EIOCBRETRY is tested in do_sync_read/write() functions as well. And direct IO competition will be retested "instead of dropping it to the floor". >From Leonid Ananiev Fix kernel bug when IO page is temporally busy: invalidate_inode_pages2_range() returns EIOCBRETRY but not EIO. invalidate_inode_pages2() returns EIO as earlier. Signed-off-by: Leonid Ananiev <[EMAIL PROTECTED]> --- --- linux-2.6.20/mm/truncate.c 2007-02-04 10:44:54.0 -0800 +++ linux-2.6.20p/mm/truncate.c 2007-02-08 22:56:52.0 -0800 @@ -366,7 +366,7 @@ static int do_launder_page(struct addres * Any pages which are found to be mapped into pagetables are unmapped prior to * invalidation. * - * Returns -EIO if any pages could not be invalidated. + * Returns -EIOCBRETRY if any pages could not be invalidated. */ int invalidate_inode_pages2_range(struct address_space *mapping, pgoff_t start, pgoff_t end) @@ -423,7 +423,7 @@ int invalidate_inode_pages2_range(struct } ret = do_launder_page(mapping, page); if (ret == 0 && !invalidate_complete_page2(mapping, page)) - ret = -EIO; + ret = -EIOCBRETRY; unlock_page(page); } pagevec_release(&pvec); @@ -444,6 +444,7 @@ EXPORT_SYMBOL_GPL(invalidate_inode_pages */ int invalidate_inode_pages2(struct address_space *mapping) { - return invalidate_inode_pages2_range(mapping, 0, -1); + int ret = invalidate_inode_pages2_range(mapping, 0, -1); + return (ret < 0)?-EIO:ret; } EXPORT_SYMBOL_GPL(invalidate_inode_pages2); - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH] aio: fix kernel bug when page is temporally busy
Ken Chen writes: > So any error value returned from invalidate_inode_pages2_range() has > to be taken seriously in the direct IO submit path instead of dropping > it to the floor. If invalidate_inode_pages2_range() will return EIOCBRETRY as the patch "aio: fix kernel bug when page is temporally busy" serts then do_sync_read/write() will not drop IO submit but will retry it: for (;;) { ret = filp->f_op->aio_read(&kiocb, &iov, 1, kiocb.ki_pos); if (ret != -EIOCBRETRY) break; wait_on_retry_sync_kiocb(&kiocb); } And do_sync_read/write() will not return EIO if page is busy as it does now, befor patching. Curretly do_sync_read/write() tests for EIOCBRETRY. But EIOCBRETRY is not set ever. Leonid - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH] aio: fix kernel bug when page is temporally busy
> If invalidate_inode_pages2_range() says it can not invalidate pages, > while dio to the same file offset range is in flight, something is > really wrong there. If invalidate_inode_pages2_range() says it can not invalidate pages It means that soft_irq does completing IO now on other cpu. Next retry() call in aio_run_iocb() will see the IO well completed. The patch is updated: invalidate_inode_pages2() returns EIO as earlier for nfs and other. But invalidate_inode_pages2_range() returns EIOCBRETRY for aio and dio. The patch against 2.6.20. >From Leonid Ananiev Fix kernel bug when IO page is temporally busy: invalidate_inode_pages2_range() returns EIOCBRETRY but not EIO. Signed-off-by: Leonid Ananiev <[EMAIL PROTECTED]> --- --- linux-2.6.20/mm/truncate.c 2007-02-04 10:44:54.0 -0800 +++ linux-2.6.20p/mm/truncate.c 2007-02-08 22:56:52.0 -0800 @@ -366,7 +366,7 @@ static int do_launder_page(struct addres * Any pages which are found to be mapped into pagetables are unmapped prior to * invalidation. * - * Returns -EIO if any pages could not be invalidated. + * Returns -EIOCBRETRY if any pages could not be invalidated. */ int invalidate_inode_pages2_range(struct address_space *mapping, pgoff_t start, pgoff_t end) @@ -423,7 +423,7 @@ int invalidate_inode_pages2_range(struct } ret = do_launder_page(mapping, page); if (ret == 0 && !invalidate_complete_page2(mapping, page)) - ret = -EIO; + ret = -EIOCBRETRY; unlock_page(page); } pagevec_release(&pvec); @@ -444,6 +444,7 @@ EXPORT_SYMBOL_GPL(invalidate_inode_pages */ int invalidate_inode_pages2(struct address_space *mapping) { - return invalidate_inode_pages2_range(mapping, 0, -1); + int ret = invalidate_inode_pages2_range(mapping, 0, -1); + return (ret < 0)?-EIO:ret; } EXPORT_SYMBOL_GPL(invalidate_inode_pages2); - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH] aio: fix kernel bug when page is temporally busy
> If invalidate_inode_pages2_range() says it can not invalidate pages, > while dio to the same file offset range is in flight, something is > really wrong there. If invalidate_inode_pages2_range() says it can not invalidate pages It means that soft_irq does completing now on other cpu. Next retry() will see a IO well completed. Leonid -Original Message- From: Ken Chen [mailto:[EMAIL PROTECTED] Sent: Saturday, February 10, 2007 9:05 PM To: Ananiev, Leonid I Cc: [EMAIL PROTECTED]; Andrew Morton; linux-kernel@vger.kernel.org; linux-aio; Zach Brown; Chris Mason; Badari Pulavarty Subject: Re: [PATCH] aio: fix kernel bug when page is temporally busy On 2/9/07, Ananiev, Leonid I <[EMAIL PROTECTED]> wrote: > I have used EIOCBRETRY in the patch to minimize source code modification > only. > [...] > A lot of errno's have different meaning in different functions or > contexts. EAGAIN could be used instated of EIOCBRETRY for irredundant > set. I also think the original patch is wrong. It might shut up kernel panic by eliminate double calls to aio_complete(), but it will silently introduce data corruption. If invalidate_inode_pages2_range() says it can not invalidate pages, while dio to the same file offset range is in flight, something is really wrong there. In generic_file_direct_IO, the function explicitly flushes all dirty pages and wait on them before submits DIO. So any error value returned from invalidate_inode_pages2_range() has to be taken seriously in the direct IO submit path instead of dropping it to the floor. - Ken - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH] aio: fix kernel bug when page is temporally busy
But if page is busy as invalidate_inode_pages2_range() says because of bh_count>0 then aio_complet() is not called from aio_run_iocb() and next retry() will get the iocb result. Leonid -Original Message- From: Ken Chen [mailto:[EMAIL PROTECTED] Sent: Saturday, February 10, 2007 9:05 PM To: Ananiev, Leonid I Cc: [EMAIL PROTECTED]; Andrew Morton; linux-kernel@vger.kernel.org; linux-aio; Zach Brown; Chris Mason; Badari Pulavarty Subject: Re: [PATCH] aio: fix kernel bug when page is temporally busy On 2/9/07, Ananiev, Leonid I <[EMAIL PROTECTED]> wrote: > I have used EIOCBRETRY in the patch to minimize source code modification > only. > [...] > A lot of errno's have different meaning in different functions or > contexts. EAGAIN could be used instated of EIOCBRETRY for irredundant > set. I also think the original patch is wrong. It might shut up kernel panic by eliminate double calls to aio_complete(), but it will silently introduce data corruption. If invalidate_inode_pages2_range() says it can not invalidate pages, while dio to the same file offset range is in flight, something is really wrong there. In generic_file_direct_IO, the function explicitly flushes all dirty pages and wait on them before submits DIO. So any error value returned from invalidate_inode_pages2_range() has to be taken seriously in the direct IO submit path instead of dropping it to the floor. - Ken - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH] aio: fix kernel bug when page is temporally busy
>>noone seems to be setting the error values to EIOCBRETRY > We'd want retries to act only on the remaining bytes which haven't been > transferred as yet, so even in the EIOCBRETRY... OK. But any variable is not set to EIOCBRETRY in 2.6.14, 2.6.19, 2.6.20. Is it correct? Leonid -Original Message- From: Suparna Bhattacharya [mailto:[EMAIL PROTECTED] Sent: Friday, February 09, 2007 2:06 PM To: Jiri Kosina Cc: Andrew Morton; Ananiev, Leonid I; linux-kernel@vger.kernel.org; linux-aio; Zach Brown; Chris Mason; Badari Pulavarty; Jan Kara Subject: Re: [PATCH] aio: fix kernel bug when page is temporally busy On Fri, Feb 09, 2007 at 11:40:27AM +0100, Jiri Kosina wrote: > On Fri, 9 Feb 2007, Andrew Morton wrote: > > > > @@ -1204,7 +1204,7 @@ generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov, > > > do_generic_file_read(filp,ppos,&desc,file_read_actor); > > > retval += desc.written; > > > if (desc.error) { > > > - retval = retval ?: desc.error; > > > + retval = desc.error; > > > break; > > > } > > Nope. On error the read() syscall must return the number of bytes which > > were successfully read. > > You are right. > > In current mainline this even is not a problem, because noone seems to be > setting the error values to EIOCBRETRY. But it still stinks a bit, as > there are tests for EIOCBRETRY. We'd want retries to act only on the remaining bytes which haven't been transferred as yet, so even in the EIOCBRETRY case the right thing to do is to return the number of bytes that were successfully read without blocking. The high level AIO code (see aio_rw_vect_rety) has the ability to handle this. So this is still correct. Is there a real bug that you are seeing here ? Regards Suparna > > -- > Jiri Kosina > > -- > To unsubscribe, send a message with 'unsubscribe linux-aio' in > the body to [EMAIL PROTECTED] For more info on Linux AIO, > see: http://www.kvack.org/aio/ > Don't email: mailto:"[EMAIL PROTECTED]">[EMAIL PROTECTED] -- Suparna Bhattacharya ([EMAIL PROTECTED]) Linux Technology Center IBM Software Lab, India - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH] aio: fix kernel bug when page is temporally busy
I have used EIOCBRETRY in the patch to minimize source code modification only. It is notable that EIOCBRETRY is never set in kernel, but tested only. EAGAIN - means that you may want to recall the same function with the same argument. But user have not to recall it just now. He may want to free or process some other his resources before recall for success. EIOCBRETRY - means that you may not recall the same function with the same argument because an argument has no sense already in cuurnt context. The result is in processing already. You may want to recall much higher level or external function to get the common result. It is means that inernal caller should ignore or let through errno to its caller but do not recall the function which has returned this errno. A lot of errno's have different meaning in different functions or contexts. EAGAIN could be used instated of EIOCBRETRY for irredundant set. Leonid -Original Message- From: Suparna Bhattacharya [mailto:[EMAIL PROTECTED] Sent: Friday, February 09, 2007 10:16 AM To: Ananiev, Leonid I Cc: Andrew Morton; linux-kernel@vger.kernel.org; linux-aio; Zach Brown; Chris Mason; Badari Pulavarty Subject: Re: [PATCH] aio: fix kernel bug when page is temporally busy On Fri, Feb 09, 2007 at 08:41:41AM +0300, Ananiev, Leonid I wrote: > > invalidate_inode_pages2() has other callers. I suspect with this > change > > we'll end up leaking EIOCBRETRY back to userspace. > > EIOCBRETRY is used and caught already in do_sync_read() and > do_sync_readv_writev(). > > Below fixed patch against kernel 2.6.20. I agree with Andrew, this doesn't look like the right solution. EIOCBRETRY isn't the same as EAGAIN or "try again later". EIOCBRETRY means "I have queued up an action to notify (wakeup) the callback to retry the operation once data is ready". Regards Suparna > > >From Leonid Ananiev > > Fix kernel bug when IO page is temporally busy: > invalidate_inode_pages2() returns EIOCBRETRY but not EIO.. > > Signed-off-by: Leonid Ananiev <[EMAIL PROTECTED]> > --- > --- linux-2.6.20/mm/truncate.c2007-02-04 10:44:54.0 -0800 > +++ linux-2.6.20p/mm/truncate.c 2007-02-08 11:38:11.0 -0800 > @@ -366,7 +366,7 @@ static int do_launder_page(struct addres > * Any pages which are found to be mapped into pagetables are unmapped > prior to > * invalidation. > * > - * Returns -EIO if any pages could not be invalidated. > + * Returns -EIOCBRETRY if any pages could not be invalidated. > */ > int invalidate_inode_pages2_range(struct address_space *mapping, > pgoff_t start, pgoff_t end) > @@ -423,7 +423,7 @@ int invalidate_inode_pages2_range(struct > } > ret = do_launder_page(mapping, page); > if (ret == 0 && > !invalidate_complete_page2(mapping, page)) > - ret = -EIO; > + ret = -EIOCBRETRY; > unlock_page(page); > } > pagevec_release(&pvec); > @@ -440,7 +440,7 @@ EXPORT_SYMBOL_GPL(invalidate_inode_pages > * Any pages which are found to be mapped into pagetables are unmapped > prior to > * invalidation. > * > - * Returns -EIO if any pages could not be invalidated. > + * Returns -EIOCBRETRY if any pages could not be invalidated. > */ > int invalidate_inode_pages2(struct address_space *mapping) > { > > -- > To unsubscribe, send a message with 'unsubscribe linux-aio' in > the body to [EMAIL PROTECTED] For more info on Linux AIO, > see: http://www.kvack.org/aio/ > Don't email: mailto:"[EMAIL PROTECTED]">[EMAIL PROTECTED] -- Suparna Bhattacharya ([EMAIL PROTECTED]) Linux Technology Center IBM Software Lab, India - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH] aio: fix kernel bug when page is temporally busy
> invalidate_inode_pages2() has other callers. I suspect with this change > we'll end up leaking EIOCBRETRY back to userspace. EIOCBRETRY is used and caught already in do_sync_read() and do_sync_readv_writev(). Below fixed patch against kernel 2.6.20. >From Leonid Ananiev Fix kernel bug when IO page is temporally busy: invalidate_inode_pages2() returns EIOCBRETRY but not EIO.. Signed-off-by: Leonid Ananiev <[EMAIL PROTECTED]> --- --- linux-2.6.20/mm/truncate.c 2007-02-04 10:44:54.0 -0800 +++ linux-2.6.20p/mm/truncate.c 2007-02-08 11:38:11.0 -0800 @@ -366,7 +366,7 @@ static int do_launder_page(struct addres * Any pages which are found to be mapped into pagetables are unmapped prior to * invalidation. * - * Returns -EIO if any pages could not be invalidated. + * Returns -EIOCBRETRY if any pages could not be invalidated. */ int invalidate_inode_pages2_range(struct address_space *mapping, pgoff_t start, pgoff_t end) @@ -423,7 +423,7 @@ int invalidate_inode_pages2_range(struct } ret = do_launder_page(mapping, page); if (ret == 0 && !invalidate_complete_page2(mapping, page)) - ret = -EIO; + ret = -EIOCBRETRY; unlock_page(page); } pagevec_release(&pvec); @@ -440,7 +440,7 @@ EXPORT_SYMBOL_GPL(invalidate_inode_pages * Any pages which are found to be mapped into pagetables are unmapped prior to * invalidation. * - * Returns -EIO if any pages could not be invalidated. + * Returns -EIOCBRETRY if any pages could not be invalidated. */ int invalidate_inode_pages2(struct address_space *mapping) { - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] aio: fix kernel bug when page is temporally busy
>From Leonid Ananiev Fix "Kernel BUG at fs/aio.c:509". Return EIOCBRETRY but not EIO if page is busy. Signed-off-by: Leonid Ananiev <[EMAIL PROTECTED]> "Kernel BUG at fs/aio.c:509" http://marc.theaimsgroup.com/?l=linux-kernel&m=117031052517746&w=2 is present in 2.6.20 as well as 2.6.19. The investigation shows that the bug's panic occurs when aio_complete() is called with argument ret=EIO which is returned from invalidate_inode_pages2_range() because the page is temporally busy. Call Trace: [] invalidate_inode_pages2_range+0x236/0x26b [] ext3_direct_IO+0x16c/0x19e [] ext3_get_block+0x0/0xe2 [] generic_file_direct_IO+0xb9/0xcf [] generic_file_direct_write+0x67/0x10e [] __generic_file_aio_write_nolock+0x2d6/0x3fe [] __switch_to+0x26f/0x27e [] thread_return+0x0/0xd8 [] generic_file_aio_write+0x67/0xc7 [] ext3_file_write+0x0/0x8f [] ext3_file_write+0x16/0x8f [] ext3_file_write+0x0/0x8f [] aio_rw_vect_retry+0x72/0x14f [] aio_run_iocb+0xe6/0x187 [] io_submit_one+0x296/0x2e3 [] sys_io_submit+0x9b/0x108 [] default_wake_function+0x0/0xe [] system_call+0x7e/0x83 As a result aio_complete() is called while it should not be. When IO is finished aio_complete() is called once more and iocb->ki_users becomes negative. Taking into account aio.c lines in aio_run_iocb() if (ret != -EIOCBRETRY && ret != -EIOCBQUEUED) { aio_complete(iocb, ret, 0); proposed patch makes function invalidate_inode_pages2_range() to return EIOCBRETRY but not EIO if page is busy. The patch is tested on 2.6.20. --- linux-2.6.20/mm/truncate.c 2007-02-04 10:44:54.0 -0800 +++ linux-2.6.20p/mm/truncate.c 2007-02-08 11:38:11.0 -0800 @@ -366,7 +366,7 @@ static int do_launder_page(struct addres * Any pages which are found to be mapped into pagetables are unmapped prior to * invalidation. * - * Returns -EIO if any pages could not be invalidated. + * Returns -EIOCBRETRY if any pages could not be invalidated. */ int invalidate_inode_pages2_range(struct address_space *mapping, pgoff_t start, pgoff_t end) @@ -423,7 +423,7 @@ int invalidate_inode_pages2_range(struct } ret = do_launder_page(mapping, page); if (ret == 0 && !invalidate_complete_page2(mapping, page)) - ret = -EIO; + ret = -EIOCBRETRY; unlock_page(page); } pagevec_release(&pvec); @@ -440,7 +440,7 @@ EXPORT_SYMBOL_GPL(invalidate_inode_pages * Any pages which are found to be mapped into pagetables are unmapped prior to * invalidation. * - * Returns -EIO if any pages could not be invalidated. + * Returns -EIOCBRETRY if any pages could not be invalidated. */ int invalidate_inode_pages2(struct address_space *mapping) { - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: Kernel BUG at fs/aio.c:509
>> BUG: warning at mm/truncate.c:398/invalidate_inode_pages2_range() > That warning got removed from -rc7. 2.6.20-rc-7: After 57 minutes and after 25 minutes of aiostress running I've got 2 system hangs without any message in log. On the third time I opened the console and got kernel panic massage: $ [ cut here ] kernel BUG at fs/aio.c:507! invalid opcode: [1] SMP CPU 2 Modules linked in: firmware_class Pid: 0, comm: swapper Not tainted 2.6.20-rc7 #1 RIP: 0010:[] [] __aio_put_req+0x22/0x146 RSP: 0018:810037e4fd70 EFLAGS: 00010086 RAX: RBX: 810037142ec0 RCX: 25934000 RDX: RSI: 810037142ec0 RDI: 810005f8be40 RBP: 810005f8be40 R08: 6db6db6db6db6db7 R09: 81000100 R10: R11: 802531b0 R12: 810005f8bea8 R13: 4000 R14: R15: 0213 FS: () GS:81000225dec0() knlGS: CS: 0010 DS: 0018 ES: 0018 CR0: 8005003b CR2: 003cdca945d0 CR3: 39002000 CR4: 06e0 Process swapper (pid: 0, threadinfo 810037e48000, task 810037e47560) Stack: 810037142ec0 810005f8be40 810005f8bea8 80284958 8100381f2c00 8100381f2f24 0246 802912a1 Call Trace: [] aio_complete+0x16b/0x1c7 [] dio_bio_end_aio+0x9e/0xbe [] __end_that_request_first+0x10e/0x3f7 [] blk_run_queue+0x41/0x72 [] scsi_end_request+0x27/0xc9 [] scsi_io_completion+0xf0/0x2c5 [] ahd_done+0x52b/0x574 [] sd_rw_intr+0x182/0x1ad [] blk_done_softirq+0x5c/0x6a [] __do_softirq+0x55/0xc3 [] ack_apic_level+0x37/0x4b [] call_softirq+0x1c/0x28 [] do_softirq+0x2c/0x7d [] do_IRQ+0xb1/0xd0 [] mwait_idle+0x0/0x46 [] ret_from_intr+0x0/0xa [] scsi_request_fn+0x0/0x33e [] mwait_idle+0x42/0x46 [] cpu_idle+0x8c/0xaf [] start_secondary+0x462/0x471 Code: 0f 0b eb fe 74 07 31 c0 e9 12 01 00 00 4c 8d a6 e0 00 00 00 RIP [] __aio_put_req+0x22/0x146 RSP <0>Kernel panic - not syncing: Aiee, killing interrupt handler! > Oh. 2.6.19. Did you try 2.6.19.latest? The latest 2.6.19.2 functions exactly as 2.6.19 does. I should add that I see on user screen: ...kernel: invalid opcode: [1] SMP Leonid -Original Message- From: Andrew Morton [mailto:[EMAIL PROTECTED] Sent: Thursday, February 01, 2007 10:16 AM To: Ananiev, Leonid I Cc: linux-kernel@vger.kernel.org; [EMAIL PROTECTED] Subject: Re: Kernel BUG at fs/aio.c:509 On Thu, 1 Feb 2007 09:13:51 +0300 "Ananiev, Leonid I" <[EMAIL PROTECTED]> wrote: > While repeatedly running 'aiostress -O -o 2' test I've got in > /var/log/messages: > > BUG: warning at mm/truncate.c:398/invalidate_inode_pages2_range() > Call Trace: > [] invalidate_inode_pages2_range+0x236/0x26b > [] ext3_direct_IO+0x16c/0x19e > [] ext3_get_block+0x0/0xe2 > [] generic_file_direct_IO+0xb9/0xcf > [] generic_file_direct_write+0x67/0x10e > [] __generic_file_aio_write_nolock+0x2d6/0x3fe > [] __switch_to+0x26f/0x27e > [] thread_return+0x0/0xd8 > [] generic_file_aio_write+0x67/0xc7 > [] ext3_file_write+0x0/0x8f > [] ext3_file_write+0x16/0x8f > [] ext3_file_write+0x0/0x8f > [] aio_rw_vect_retry+0x72/0x14f > [] aio_run_iocb+0xe6/0x187 > [] io_submit_one+0x296/0x2e3 > [] sys_io_submit+0x9b/0x108 > [] default_wake_function+0x0/0xe > [] system_call+0x7e/0x83 That warning got removed from -rc7. It's harmless, in most cases. We think. It's a bit of a worry that aio-on-ext3 is triggering it. That's unexpected. > --- [cut here ] - [please bite here ] - > Kernel BUG at fs/aio.c:509 > invalid opcode: [1] SMP > CPU 1 > Modules linked in: firmware_class > Pid: 11362, comm: aiostress Not tainted 2.6.19 #1 Oh. 2.6.19. Did you try 2.6.19.latest? > RIP: 0010:[] [] > __aio_put_req+0x27/0x162 > RSP: 0018:81000c167e88 EFLAGS: 00010096 > RAX: RBX: 8100184f8dc0 RCX: 1d44f000 > RDX: 000d RSI: 8100184f8dc0 RDI: 81001d44ee80 > RBP: 81001d44ee80 R08: 6db6db6db6db6db7 R09: 81000100 > R10: R11: 0046 R12: 810037341c80 > R13: 81001d44ee80 R14: 00606998 R15: > FS: 2ab979b0ee90() GS:810037e896c0() > knlGS: > CS: 0010 DS: ES: CR0: 80050033 > CR2: 2aea207c3950 CR3: 1009c000 CR4: 06e0 > Process aiostress (pid: 11362, threadinfo 81000c166000, task > 81003f49f5a0) > Stack: 8100184f8dc0 81001d44ee80 810037341c80 > 80284ff6 > 8100184f8dc0 8100184f8dc0 80285409 > 00607fb0 81001d44ee80
Kernel BUG at fs/aio.c:509
While repeatedly running 'aiostress -O -o 2' test I've got in /var/log/messages: BUG: warning at mm/truncate.c:398/invalidate_inode_pages2_range() Call Trace: [] invalidate_inode_pages2_range+0x236/0x26b [] ext3_direct_IO+0x16c/0x19e [] ext3_get_block+0x0/0xe2 [] generic_file_direct_IO+0xb9/0xcf [] generic_file_direct_write+0x67/0x10e [] __generic_file_aio_write_nolock+0x2d6/0x3fe [] __switch_to+0x26f/0x27e [] thread_return+0x0/0xd8 [] generic_file_aio_write+0x67/0xc7 [] ext3_file_write+0x0/0x8f [] ext3_file_write+0x16/0x8f [] ext3_file_write+0x0/0x8f [] aio_rw_vect_retry+0x72/0x14f [] aio_run_iocb+0xe6/0x187 [] io_submit_one+0x296/0x2e3 [] sys_io_submit+0x9b/0x108 [] default_wake_function+0x0/0xe [] system_call+0x7e/0x83 --- [cut here ] - [please bite here ] - Kernel BUG at fs/aio.c:509 invalid opcode: [1] SMP CPU 1 Modules linked in: firmware_class Pid: 11362, comm: aiostress Not tainted 2.6.19 #1 RIP: 0010:[] [] __aio_put_req+0x27/0x162 RSP: 0018:81000c167e88 EFLAGS: 00010096 RAX: RBX: 8100184f8dc0 RCX: 1d44f000 RDX: 000d RSI: 8100184f8dc0 RDI: 81001d44ee80 RBP: 81001d44ee80 R08: 6db6db6db6db6db7 R09: 81000100 R10: R11: 0046 R12: 810037341c80 R13: 81001d44ee80 R14: 00606998 R15: FS: 2ab979b0ee90() GS:810037e896c0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 2aea207c3950 CR3: 1009c000 CR4: 06e0 Process aiostress (pid: 11362, threadinfo 81000c166000, task 81003f49f5a0) Stack: 8100184f8dc0 81001d44ee80 810037341c80 80284ff6 8100184f8dc0 8100184f8dc0 80285409 00607fb0 81001d44ee80 00606998 0001 Call Trace: [] aio_put_req+0x21/0x59 80284ff1: e8 cd f1 ff ff callq 802841c3 <__aio_put_req> [] io_submit_one+0x2bd/0x2e3 [] sys_io_submit+0x9b/0x108 [] default_wake_function+0x0/0xe [] system_call+0x7e/0x83 Code: 0f 0b 68 76 e6 4a 80 c2 fd 01 85 c0 74 07 31 c0 e9 21 01 00 RIP [] __aio_put_req+0x27/0x162 RSP NMI Watchdog detected LOCKUP on CPU 1 CPU 1 Modules linked in: firmware_class Pid: 11362, comm: aiostress Not tainted 2.6.19 #1 RIP: 0010:[] [] _spin_lock_irq+0x8/0x10 RSP: 0018:81000c167c10 EFLAGS: 0086 RAX: 81003e0f0001 RBX: 81001d44ee80 RCX: 2ab979b0ef40 RDX: RSI: 0246 RDI: 81001d44eeb8 RBP: 81001d44ee80 R08: 0286 R09: 0018 R10: 0286 R11: 0046 R12: 0001 R13: 000b R14: 804a5904 R15: FS: () GS:810037e896c0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 2aea207c3950 CR3: 1009c000 CR4: 06e0 Process aiostress (pid: 11362, threadinfo 81000c166000, task 81003f49f5a0) Stack: 80284050 0286 802441cc 0018 8047cebe 0246 81001d44ee80 80285880 81003e0f12c0 81003e0f12c0 81003f49f5a0 Call Trace: [] aio_cancel_all+0x12/0x87 0x12: e8 9d 93 1f 00 callq 8047d3ed <_spin_lock_irq> [] lock_hrtimer_base+0x26/0x4c [] __down_read+0x12/0x9a [] exit_aio+0x2e/0x8c [] mmput+0x19/0x98 [] do_exit+0x1f7/0x825 [] vgacon_set_cursor_size+0x36/0xd1 [] kernel_math_error+0x0/0x90 [] do_invalid_op+0xad/0xb7 [] __aio_put_req+0x27/0x162 [] __generic_file_aio_write_nolock+0x2d6/0x3fe [] __switch_to+0x26f/0x27e [] error_exit+0x0/0x84 [] __aio_put_req+0x27/0x162 [] aio_put_req+0x21/0x59 [] io_submit_one+0x2bd/0x2e3 [] sys_io_submit+0x9b/0x108 [] default_wake_function+0x0/0xe [] system_call+0x7e/0x83 Code: 83 3f 00 7e f9 eb f2 c3 53 48 89 fb e8 06 7b db ff f0 ff 0b <1>Fixing recursive fault but reboot is needed! === It stable happens after 2-5 aiosterss runs on 2.6.20-rc6 as well as on 2.6.19 in 4 processor Xeon. fs/aio.c:509 line is BUG_ON(req->ki_users < 0); It may be an effect of return 1 from invalidate_inode_pages2_range() one second earlier as you can see in log. To investigate a root cause of ret=0 in invalidate_inode_pages2_range() I have added printk() calls. As a result I can say that invalidate_inode_pages2_range() reports BUG: warning at mm/truncate.c:398 occurs becouse of invalidate_complete_page2()returns 0; it returns 0 because of try_to_release_page()returns 0; it returns 0 because of ext3_releasepage()returns 0; it returns 0 because of journal_try_to_free_buffers() returns 0; it returns 0 because of try_to_free_buffers() retur
RE: [patch 1/] timers: tsc using for cpu scheduling
[EMAIL PROTECTED] "math is hard, lets go shopping!" - */ -static unsigned long cyc2ns_scale; -#define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */ - -static inline void set_cyc2ns_scale(unsigned long cpu_mhz) -{ - cyc2ns_scale = (1000 << CYC2NS_SCALE_FACTOR)/cpu_mhz; -} - -static inline unsigned long long cycles_2_ns(unsigned long long cyc) -{ - return (cyc * cyc2ns_scale) >> CYC2NS_SCALE_FACTOR; -} - static unsigned long long monotonic_clock_hpet(void) { unsigned long long last_offset, this_offset, base; diff -ur a/arch/i386/kernel/timers/timer_tsc.c b/arch/i386/kernel/timers/timer_tsc.c --- a/arch/i386/kernel/timers/timer_tsc.c 2005-06-17 23:48:29.0 +0400 +++ b/arch/i386/kernel/timers/timer_tsc.c 2005-07-17 22:28:00.0 +0400 @@ -37,7 +37,6 @@ extern spinlock_t i8253_lock; -static int use_tsc; /* Number of usecs that the last interrupt was delayed */ static int delay_at_last_interrupt; @@ -46,34 +45,6 @@ static unsigned long long monotonic_base; static seqlock_t monotonic_lock = SEQLOCK_UNLOCKED; -/* convert from cycles(64bits) => nanoseconds (64bits) - * basic equation: - * ns = cycles / (freq / ns_per_sec) - * ns = cycles * (ns_per_sec / freq) - * ns = cycles * (10^9 / (cpu_mhz * 10^6)) - * ns = cycles * (10^3 / cpu_mhz) - * - * Then we use scaling math (suggested by george@mvista.com) to get: - * ns = cycles * (10^3 * SC / cpu_mhz) / SC - * ns = cycles * cyc2ns_scale / SC - * - * And since SC is a constant power of two, we can convert the div - * into a shift. - * [EMAIL PROTECTED] "math is hard, lets go shopping!" - */ -static unsigned long cyc2ns_scale; -#define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */ - -static inline void set_cyc2ns_scale(unsigned long cpu_mhz) -{ - cyc2ns_scale = (1000 << CYC2NS_SCALE_FACTOR)/cpu_mhz; -} - -static inline unsigned long long cycles_2_ns(unsigned long long cyc) -{ - return (cyc * cyc2ns_scale) >> CYC2NS_SCALE_FACTOR; -} - static int count2; /* counter for mark_offset_tsc() */ /* Cached *multiplier* to convert TSC counts to microseconds. @@ -131,30 +102,6 @@ return base + cycles_2_ns(this_offset - last_offset); } -/* - * Scheduler clock - returns current time in nanosec units. - */ -unsigned long long sched_clock(void) -{ - unsigned long long this_offset; - - /* -* In the NUMA case we dont use the TSC as they are not -* synchronized across all CPUs. -*/ -#ifndef CONFIG_NUMA - if (!use_tsc) -#endif - /* no locking but a rare wrong value is not a big deal */ - return jiffies_64 * (10 / HZ); - - /* Read the Time Stamp Counter */ - rdtscll(this_offset); - - /* return the value in ns */ - return cycles_2_ns(this_offset); -} - static void delay_tsc(unsigned long loops) { unsigned long bclock, now; @@ -284,7 +231,7 @@ #ifndef CONFIG_SMP if (cpu_khz) cpu_khz = cpufreq_scale(cpu_khz_ref, ref_freq, freq->new); - if (use_tsc) { + if (cyc2ns_scale) { if (!(freq->flags & CPUFREQ_CONST_LOOPS)) { fast_gettimeoffset_quotient = cpufreq_scale(fast_gettimeoffset_ref, freq->new, ref_freq); set_cyc2ns_scale(cpu_khz/1000); @@ -520,7 +467,6 @@ if (tsc_quotient) { fast_gettimeoffset_quotient = tsc_quotient; - use_tsc = 1; /* * We could be more selective here I suspect * and just enable this for the next intel chips ? -Original Message- From: john stultz [mailto:[EMAIL PROTECTED] Sent: Wednesday, July 06, 2005 5:43 AM To: Ananiev, Leonid I Cc: lkml; Dominik Brodowski; Semenikhin, Sergey V Subject: RE: [patch 1/] timers: tsc using for cpu scheduling On Tue, 2005-07-05 at 21:08 +0400, Ananiev, Leonid I wrote: > Not only page faults may increase process priority. Someone can > write two threads with mutexes so that each thread will spent much less > than 1 msec for calculations on cpu than lock-unlock mutexes and yield > cpu to brother which wait mutex unlock and will do the same. Both > threads will have high priority according to other threads during > infinite time. The scheduler will not see the time spent by both > considered threads on cpu. Oh, I don't doubt there is a problem. I'm just asking if using the TSC is really the only way to properly ding the process? I'm not a scheduler guy, so forgive my ignorance, but since the TSC may not be available everywhere, might there be an alternative way to ding the process? Surely something is keeping track
RE: [patch 1/] timers: tsc using for cpu scheduling
John, Not only page faults may increase process priority. Someone can write two threads with mutexes so that each thread will spent much less than 1 msec for calculations on cpu than lock-unlock mutexes and yield cpu to brother which wait mutex unlock and will do the same. Both threads will have high priority according to other threads during infinite time. The scheduler will not see the time spent by both considered threads on cpu. The cpu scheduler does not need in real time value. It is need the number of cpu clocks spent for considered task/thread for priority calculation. It is not need to modify TSC tick rate for cpu scheduling. The TSC can be used for priority calculation in NUMA because we do not compare TSCs of different cpu's. > there are other cases where the TSC cannot be used for > sched_clock, such as on systems that do not have TSCs... > You're patch removes any fallback for the case where the TSC cannot be used. No. Now there is two global kernel values: cyc2ns_scale and use_tsc. We may say that use_tsc = (cyc2ns_scale != 0); Now instead of 'if (use_tsc) than ...' I propose to write 'if ((cyc2ns_scale != 0) than...' > This I don't agree with because there are situations where we cannot use the TSC. The patch says that if there are PMT and TSC timers concurrently than Linux will use TSC for CPU scheduler priority calculatin. 1 millisecond jiffies on the base of PMT were used patch in Linux before this. So user can see that if CPU has TSC it is worst than CPU which has not TSC because Linux choose slightly more precise but exactly 100 times more gross variant in this case. Leonid -Original Message- From: john stultz [mailto:[EMAIL PROTECTED] Sent: Thursday, June 30, 2005 11:08 PM To: Ananiev, Leonid I Cc: lkml; Dominik Brodowski Subject: Re: [patch 1/] timers: tsc using for cpu scheduling On Thu, 2005-06-30 at 15:43 +0400, Ananiev, Leonid I wrote: > Patch for using TSC but not PMT in cpu scheduler. > > It was noted that under high memory load the process which > generates a lot page faults does not lose own priority at all if > processor has Power Management Timer (PMT) and for this reason Linux > uses jiffies_64 for process priority calculation instead of Time Stamp > Clock (TSC). As a result time is measured with 100 nsec granularity: [snip] > If process regularly uses processor much less than 1 msec > scheduler does not see processor using by this process and does not > decrease priority of process as it is performed on platforms which have > not PMT or PMT using is suppressed in .config file. The "list of timers, > ordered by preference" in linux kernel dictates to use Power Management > timer, if it is on processor and measure process run time in > milliseconds (jiffies) actually but not in nanoseconds. As a result the > process which provoke to memory threshing and bad cpu and cash using has > much higher priority than regular interactive process. You can see that > cpu using time grows (it is measured by other way - not by function > sched_clock()) but priority is not bring down and the process is > considered as high priority interactive one. So let me try to restate the problem, and you can let me know if I'm misunderstanding you. When the TSC is not used for timekeeping, sched_clock() uses jiffies to measure time. This results in very course 1ms (or really 1/HZ) granularity from sched_clock(). This course granularity is causing issues in the scheduler, as best that I can understand, where processes that cause lots of pagefaults are not properly dinged for their time. Is that about right? > It is known that TSC is incorrect according to astronomical real time as > a result of PM throttling. But for scheduler purposes the value of work > executed for each process but not real time spent on cpu makes sense. > The scheduler actually does not consider process run time as a real > time: it uses division of variable run_time on CURRENT_BONUS before > comparison it with sleep average time: > run_time /= CURRENT_BONUS; > task->sleep_avg -= run_time; > So run_time is not a real time but some measure of cpu work > which is performed for current process and we have the right and have to > use TSC for scheduler purpose if TSC is there on processor and does > function. Well, not quite. First of all, I believe (Dominik would know better) that not all CPUs that support frequency scaling actually modify their TSC frequency. So in some cases the TSC is time and in others it is work. Further, there are other cases where the TSC cannot be used for sched_clock, such as on systems that do not have TSCs, and NUMA systems where the TSCs are not synched. &g