Re: [OOPS, 3.13-rc2] null ptr in dio_complete()
On 12/05/2013 04:16 PM, Dave Chinner wrote:
> On Fri, Dec 06, 2013 at 08:26:51AM +1100, Dave Chinner wrote:
>> On Thu, Dec 05, 2013 at 10:22:20PM +0800, Ming Lei wrote:
>>> On Wed, 4 Dec 2013 12:58:38 +1100
>>> Dave Chinner wrote:
>>>>
>>>> And I just hit this from running xfs_repair which is doing
>>>> multithreaded direct IO directly on /dev/vdc:
>> .
>>>> So this is looking like another virtio+blk_mq problem
>>>
>>> This one might be caused by a use-after-free of the request, because the freed
>>> request may be allocated to another CPU immediately, and the below patch
>>> may be helpful, would you mind testing it?
>>>
>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>> index 6875736..2358bdf 100644
>>> --- a/block/blk-mq.c
>>> +++ b/block/blk-mq.c
>>> @@ -311,12 +311,12 @@ void blk_mq_complete_request(struct request *rq, int error)
>>>
>>>  	blk_account_io_completion(rq, bytes);
>>>
>>> +	blk_account_io_done(rq);
>>> +
>>>  	if (rq->end_io)
>>>  		rq->end_io(rq, error);
>>>  	else
>>>  		blk_mq_free_request(rq);
>>> -
>>> -	blk_account_io_done(rq);
>>>  }
>>>
>>>  void __blk_mq_end_io(struct request *rq, int error)
>>
>> Testing it now. I'll let you know how it goes in a couple of hours.
>
> Ok, this looks like it is working correctly. The read/write io stats
> are now working properly, too. Thanks for finding the problem
> quickly!

Great, thanks for testing, Dave. The fix is now in Linus' tree.

-- 
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [OOPS, 3.13-rc2] null ptr in dio_complete()
On Fri, Dec 06, 2013 at 08:26:51AM +1100, Dave Chinner wrote:
> On Thu, Dec 05, 2013 at 10:22:20PM +0800, Ming Lei wrote:
>> On Wed, 4 Dec 2013 12:58:38 +1100
>> Dave Chinner wrote:
>>>
>>> And I just hit this from running xfs_repair which is doing
>>> multithreaded direct IO directly on /dev/vdc:
> .
>>> So this is looking like another virtio+blk_mq problem
>>
>> This one might be caused by a use-after-free of the request, because the freed
>> request may be allocated to another CPU immediately, and the below patch
>> may be helpful, would you mind testing it?
>>
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index 6875736..2358bdf 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -311,12 +311,12 @@ void blk_mq_complete_request(struct request *rq, int error)
>>
>>  	blk_account_io_completion(rq, bytes);
>>
>> +	blk_account_io_done(rq);
>> +
>>  	if (rq->end_io)
>>  		rq->end_io(rq, error);
>>  	else
>>  		blk_mq_free_request(rq);
>> -
>> -	blk_account_io_done(rq);
>>  }
>>
>>  void __blk_mq_end_io(struct request *rq, int error)
>
> Testing it now. I'll let you know how it goes in a couple of hours.

Ok, this looks like it is working correctly. The read/write io stats
are now working properly, too. Thanks for finding the problem
quickly!

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [OOPS, 3.13-rc2] null ptr in dio_complete()
On Thu, Dec 05, 2013 at 10:22:20PM +0800, Ming Lei wrote:
> On Wed, 4 Dec 2013 12:58:38 +1100
> Dave Chinner wrote:
>>
>> And I just hit this from running xfs_repair which is doing
>> multithreaded direct IO directly on /dev/vdc:
.
>> So this is looking like another virtio+blk_mq problem
>
> This one might be caused by a use-after-free of the request, because the freed
> request may be allocated to another CPU immediately, and the below patch
> may be helpful, would you mind testing it?
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 6875736..2358bdf 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -311,12 +311,12 @@ void blk_mq_complete_request(struct request *rq, int error)
>
>  	blk_account_io_completion(rq, bytes);
>
> +	blk_account_io_done(rq);
> +
>  	if (rq->end_io)
>  		rq->end_io(rq, error);
>  	else
>  		blk_mq_free_request(rq);
> -
> -	blk_account_io_done(rq);
>  }
>
>  void __blk_mq_end_io(struct request *rq, int error)

Testing it now. I'll let you know how it goes in a couple of hours.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [OOPS, 3.13-rc2] null ptr in dio_complete()
On 12/05/2013 07:22 AM, Ming Lei wrote:
> On Wed, 4 Dec 2013 12:58:38 +1100
> Dave Chinner wrote:
>>
>> And I just hit this from running xfs_repair which is doing
>> multithreaded direct IO directly on /dev/vdc:
>>
>> [ 1776.508599] BUG: unable to handle kernel NULL pointer dereference at 0328
>> [ 1776.510446] IP: [] blk_account_io_done+0x6a/0x180
>> [ 1776.511762] PGD 38e75b067 PUD 41987d067 PMD 0
>> [ 1776.512577] Oops: [#1] SMP
>> [ 1776.512577] Modules linked in:
>> [ 1776.512577] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.13.0-rc2-dgc+ #75
>> [ 1776.512577] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
>> [ 1776.512577] task: 81f37480 ti: 81f26000 task.ti: 81f26000
>> [ 1776.512577] RIP: 0010:[] [] blk_account_io_done+0x6a/0x180
>> [ 1776.512577] RSP: :88011bc03da8 EFLAGS: 00010046
>> [ 1776.512577] RAX: 1000 RBX: RCX:
>> [ 1776.512577] RDX: RSI: 0002 RDI: 8800dba59fc0
>> [ 1776.512577] RBP: 88011bc03db8 R08: R09: 88021a828928
>> [ 1776.512577] R10: R11: 0001 R12:
>> [ 1776.512577] R13: R14: 2000 R15:
>> [ 1776.512577] FS: () GS:88011bc0() knlGS:
>> [ 1776.512577] CS: 0010 DS: ES: CR0: 8005003b
>> [ 1776.512577] CR2: 0328 CR3: 00041a4b CR4: 06f0
>> [ 1776.512577] Stack:
>> [ 1776.512577]  8800dba59fc0 88011bc03df8 8175e4b8
>> [ 1776.512577]  01c0 8800dba59fc0 e8fcfbc03a80 88041aac0800
>> [ 1776.512577]  88011ad53b10 0096 88011bc03e08 8175e536
>> [ 1776.512577] Call Trace:
>> [ 1776.512577]  <IRQ>
>> [ 1776.512577]  [] blk_mq_complete_request+0xb8/0x120
>> [ 1776.512577]  [] __blk_mq_end_io+0x16/0x20
>> [ 1776.512577]  [] blk_mq_end_io+0x68/0xd0
>> [ 1776.512577]  [] virtblk_done+0x67/0x110
>> [ 1776.512577]  [] vring_interrupt+0x35/0x60
>> [ 1776.512577]  [] handle_irq_event_percpu+0x54/0x1e0
>> [ 1776.512577]  [] handle_irq_event+0x48/0x70
>> [ 1776.512577]  [] handle_edge_irq+0x77/0x110
>> [ 1776.512577]  [] handle_irq+0xbf/0x150
>> [ 1776.512577]  [] ? __atomic_notifier_call_chain+0x12/0x20
>> [ 1776.512577]  [] ? atomic_notifier_call_chain+0x16/0x20
>> [ 1776.512577]  [] do_IRQ+0x5a/0xe0
>> [ 1776.512577]  [] common_interrupt+0x6d/0x6d
>> [ 1776.512577]  <EOI>
>> [ 1776.512577]  [] ? native_safe_halt+0x6/0x10
>> [ 1776.512577]  [] default_idle+0x1f/0xc0
>> [ 1776.512577]  [] arch_cpu_idle+0x26/0x30
>> [ 1776.512577]  [] cpu_startup_entry+0x81/0x240
>> [ 1776.512577]  [] rest_init+0x77/0x80
>> [ 1776.512577]  [] start_kernel+0x3c8/0x3d5
>> [ 1776.512577]  [] ? repair_env_string+0x5e/0x5e
>> [ 1776.512577]  [] ? early_idt_handlers+0x117/0x120
>> [ 1776.512577]  [] x86_64_start_reservations+0x2a/0x2c
>> [ 1776.512577]  [] x86_64_start_kernel+0x146/0x155
>> [ 1776.512577] Code: 54 41 89 c4 53 41 83 e4 01 48 2b 97 d8 00 00 00 48 8b 9f d0 00 00 00 4d 63 e4 65 44 8b 04 25 1c b0 00 00 49 8d 74 24 02 49 63 c8 <48> 8b 83 28 03
>> [ 1776.512577] RIP [] blk_account_io_done+0x6a/0x180
>> [ 1776.512577] RSP
>> [ 1776.512577] CR2: 0328
>> [ 1776.512577] ---[ end trace 33a5787221ff13b4 ]---
>> [ 1776.512577] Kernel panic - not syncing: Fatal exception in interrupt
>> [ 1776.512577] Shutting down cpus with NMI
>>
>> So this is looking like another virtio+blk_mq problem
>
> This one might be caused by a use-after-free of the request, because the freed
> request may be allocated to another CPU immediately, and the below patch
> may be helpful, would you mind testing it?
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 6875736..2358bdf 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -311,12 +311,12 @@ void blk_mq_complete_request(struct request *rq, int error)
>
>  	blk_account_io_completion(rq, bytes);
>
> +	blk_account_io_done(rq);
> +
>  	if (rq->end_io)
>  		rq->end_io(rq, error);
>  	else
>  		blk_mq_free_request(rq);
> -
> -	blk_account_io_done(rq);
>  }
>
>  void __blk_mq_end_io(struct request *rq, int error)

That's definitely a bug... Good catch, Ming. Applied.
-- 
Jens Axboe
Re: [OOPS, 3.13-rc2] null ptr in dio_complete()
On Wed, 4 Dec 2013 12:58:38 +1100
Dave Chinner wrote:
>
> And I just hit this from running xfs_repair which is doing
> multithreaded direct IO directly on /dev/vdc:
>
> [ 1776.508599] BUG: unable to handle kernel NULL pointer dereference at 0328
> [ 1776.510446] IP: [] blk_account_io_done+0x6a/0x180
> [ 1776.511762] PGD 38e75b067 PUD 41987d067 PMD 0
> [ 1776.512577] Oops: [#1] SMP
> [ 1776.512577] Modules linked in:
> [ 1776.512577] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.13.0-rc2-dgc+ #75
> [ 1776.512577] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> [ 1776.512577] task: 81f37480 ti: 81f26000 task.ti: 81f26000
> [ 1776.512577] RIP: 0010:[] [] blk_account_io_done+0x6a/0x180
> [ 1776.512577] RSP: :88011bc03da8 EFLAGS: 00010046
> [ 1776.512577] RAX: 1000 RBX: RCX:
> [ 1776.512577] RDX: RSI: 0002 RDI: 8800dba59fc0
> [ 1776.512577] RBP: 88011bc03db8 R08: R09: 88021a828928
> [ 1776.512577] R10: R11: 0001 R12:
> [ 1776.512577] R13: R14: 2000 R15:
> [ 1776.512577] FS: () GS:88011bc0() knlGS:
> [ 1776.512577] CS: 0010 DS: ES: CR0: 8005003b
> [ 1776.512577] CR2: 0328 CR3: 00041a4b CR4: 06f0
> [ 1776.512577] Stack:
> [ 1776.512577]  8800dba59fc0 88011bc03df8 8175e4b8
> [ 1776.512577]  01c0 8800dba59fc0 e8fcfbc03a80 88041aac0800
> [ 1776.512577]  88011ad53b10 0096 88011bc03e08 8175e536
> [ 1776.512577] Call Trace:
> [ 1776.512577]  <IRQ>
> [ 1776.512577]  [] blk_mq_complete_request+0xb8/0x120
> [ 1776.512577]  [] __blk_mq_end_io+0x16/0x20
> [ 1776.512577]  [] blk_mq_end_io+0x68/0xd0
> [ 1776.512577]  [] virtblk_done+0x67/0x110
> [ 1776.512577]  [] vring_interrupt+0x35/0x60
> [ 1776.512577]  [] handle_irq_event_percpu+0x54/0x1e0
> [ 1776.512577]  [] handle_irq_event+0x48/0x70
> [ 1776.512577]  [] handle_edge_irq+0x77/0x110
> [ 1776.512577]  [] handle_irq+0xbf/0x150
> [ 1776.512577]  [] ? __atomic_notifier_call_chain+0x12/0x20
> [ 1776.512577]  [] ? atomic_notifier_call_chain+0x16/0x20
> [ 1776.512577]  [] do_IRQ+0x5a/0xe0
> [ 1776.512577]  [] common_interrupt+0x6d/0x6d
> [ 1776.512577]  <EOI>
> [ 1776.512577]  [] ? native_safe_halt+0x6/0x10
> [ 1776.512577]  [] default_idle+0x1f/0xc0
> [ 1776.512577]  [] arch_cpu_idle+0x26/0x30
> [ 1776.512577]  [] cpu_startup_entry+0x81/0x240
> [ 1776.512577]  [] rest_init+0x77/0x80
> [ 1776.512577]  [] start_kernel+0x3c8/0x3d5
> [ 1776.512577]  [] ? repair_env_string+0x5e/0x5e
> [ 1776.512577]  [] ? early_idt_handlers+0x117/0x120
> [ 1776.512577]  [] x86_64_start_reservations+0x2a/0x2c
> [ 1776.512577]  [] x86_64_start_kernel+0x146/0x155
> [ 1776.512577] Code: 54 41 89 c4 53 41 83 e4 01 48 2b 97 d8 00 00 00 48 8b 9f d0 00 00 00 4d 63 e4 65 44 8b 04 25 1c b0 00 00 49 8d 74 24 02 49 63 c8 <48> 8b 83 28 03
> [ 1776.512577] RIP [] blk_account_io_done+0x6a/0x180
> [ 1776.512577] RSP
> [ 1776.512577] CR2: 0328
> [ 1776.512577] ---[ end trace 33a5787221ff13b4 ]---
> [ 1776.512577] Kernel panic - not syncing: Fatal exception in interrupt
> [ 1776.512577] Shutting down cpus with NMI
>
> So this is looking like another virtio+blk_mq problem

This one might be caused by a use-after-free of the request, because the freed
request may be allocated to another CPU immediately, and the below patch
may be helpful, would you mind testing it?

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6875736..2358bdf 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -311,12 +311,12 @@ void blk_mq_complete_request(struct request *rq, int error)

 	blk_account_io_completion(rq, bytes);

+	blk_account_io_done(rq);
+
 	if (rq->end_io)
 		rq->end_io(rq, error);
 	else
 		blk_mq_free_request(rq);
-
-	blk_account_io_done(rq);
 }

 void __blk_mq_end_io(struct request *rq, int error)

Thanks,
-- 
Ming Lei
Re: [OOPS, 3.13-rc2] null ptr in dio_complete()
On Wed, Dec 04, 2013 at 08:41:43PM -0700, Jens Axboe wrote:
> On Thu, Dec 05 2013, Dave Chinner wrote:
> > On Wed, Dec 04, 2013 at 03:17:49PM +1100, Dave Chinner wrote:
> > > On Tue, Dec 03, 2013 at 08:47:12PM -0700, Jens Axboe wrote:
> > > > On Wed, Dec 04 2013, Dave Chinner wrote:
> > > > > On Wed, Dec 04, 2013 at 12:58:38PM +1100, Dave Chinner wrote:
> > > > > > On Wed, Dec 04, 2013 at 08:59:40AM +1100, Dave Chinner wrote:
> > > > > > > Hi Jens,
> > > > > > >
> > > > > > > Not sure who to direct this to or CC, so I figured you are the
> > > > > > > person to do that. I just had xfstests generic/299 (an AIO/DIO test)
> > > > > > > oops in dio_complete() like so:
> > > > > > >
> > > > > > > [ 9650.590630]
> > > > > > > [ 9650.590630]  [] dio_complete+0xa3/0x140
> > > > > > > [ 9650.590630]  [] dio_bio_end_aio+0x7a/0x110
> > > > > > > [ 9650.590630]  [] ? dio_bio_end_aio+0x5/0x110
> > > > > > > [ 9650.590630]  [] bio_endio+0x1d/0x30
> > > > > > > [ 9650.590630]  [] blk_mq_complete_request+0x5f/0x120
> > > > > > > [ 9650.590630]  [] __blk_mq_end_io+0x16/0x20
> > > > > > > [ 9650.590630]  [] blk_mq_end_io+0x68/0xd0
> > > > > > > [ 9650.590630]  [] virtblk_done+0x67/0x110
> > > > > > > [ 9650.590630]  [] vring_interrupt+0x35/0x60
> > > .
> > > > > > And I just hit this from running xfs_repair which is doing
> > > > > > multithreaded direct IO directly on /dev/vdc:
> > > > > >
> > > > > > [ 1776.510446] IP: [] blk_account_io_done+0x6a/0x180
> > > > > >
> > > > > > [ 1776.512577]  [] blk_mq_complete_request+0xb8/0x120
> > > > > > [ 1776.512577]  [] __blk_mq_end_io+0x16/0x20
> > > > > > [ 1776.512577]  [] blk_mq_end_io+0x68/0xd0
> > > > > > [ 1776.512577]  [] virtblk_done+0x67/0x110
> > > > > > [ 1776.512577]  [] vring_interrupt+0x35/0x60
> > > > > > [ 1776.512577]  [] handle_irq_event_percpu+0x54/0x1e0
> > > .
> > > > > > So this is looking like another virtio+blk_mq problem
> > > > >
> > > > > This one is definitely reproducible. Just hit it again...
> > > >
> > > > I'll take a look at this. You don't happen to have gdb dumps of the
> > > > lines associated with those crashes? Just to save me some digging
> > > > time...
> > >
> > > Only this:
> > >
> > > (gdb) l *(dio_complete+0xa3)
> > > 0x811ddae3 is in dio_complete (fs/direct-io.c:282).
> > > 277		}
> > > 278
> > > 279		aio_complete(dio->iocb, ret, 0);
> > > 280	}
> > > 281
> > > 282	kmem_cache_free(dio_cache, dio);
> > > 283	return ret;
> > > 284	}
> > > 285
> > > 286	static void dio_aio_complete_work(struct work_struct *work)
> > >
> > > And this:
> > >
> > > (gdb) l *(blk_account_io_done+0x6a)
> > > 0x81755b6a is in blk_account_io_done (block/blk-core.c:2049).
> > > 2044		int cpu;
> > > 2045
> > > 2046		cpu = part_stat_lock();
> > > 2047		part = req->part;
> > > 2048
> > > 2049		part_stat_inc(cpu, part, ios[rw]);
> > > 2050		part_stat_add(cpu, part, ticks[rw], duration);
> > > 2051		part_round_stats(cpu, part);
> > > 2052		part_dec_in_flight(part, rw);
> > > 2053
> > >
> > > as I've rebuilt the kernel with different patches since the one
> > > running on the machine that is triggering the problem.
> >
> > Any update on this, Jens? I've hit this blk_account_io_done() panic
> > 10 times in the past 2 hours while trying to do xfs_repair
> > testing
>
> No, sorry, no updates yet... I haven't had time to look into it today.
> To reproduce tomorrow, can you mail me your exact setup (kvm invocation,
> etc) and how your guest is set up and if there's any special way I need
> to run xfstest or xfs_repair?

The virtio device that I'm hitting is "/mnt/fast-ssd/vm-100TB-sparse.img"
which is a 100TB file on a 160GB XFS filesystem on an SSD, created with:

$ xfs_io -f -c "truncate 100t" -c "extsize 1m" /mnt/fast-ssd/vm-100TB-sparse.img

If I stat it, I get:

$ xfs_io -c stat /mnt/fast-ssd/vm-100TB-sparse.img
fd.path = "/mnt/fast-ssd/vm-100TB-sparse.img"
fd.flags = non-sync,non-direct,read-write
stat.ino = 131
stat.type = regular file
stat.size = 109951162777600
stat.blocks = 259333400
fsxattr.xflags = 0x800 [--e---]
fsxattr.projid = 0
fsxattr.extsize = 1048576
fsxattr.nextents = 83108
fsxattr.naextents = 0
dioattr.mem = 0x200
dioattr.miniosz = 512
dioattr.maxiosz = 2147483136
$

The VM is run by this script:

$ cat run-vm-4.sh
#!/bin/sh
sudo /usr/bin/qemu-system-x86_64 \
	-no-fd-bootchk \
	-localtime \
	-boot c \
	-serial pty \
	-nographic \
	-alt-grab \
	-smp 16 -m 16384 \
	-machine accel=kvm \
	-hda /vm-images/vm-4/root.img \
	-drive file=/vm-images/vm-4/vm-4-test.img,if=virtio,cache=none \
	-drive
Re: [OOPS, 3.13-rc2] null ptr in dio_complete()
On Thu, Dec 05 2013, Dave Chinner wrote:
> On Wed, Dec 04, 2013 at 03:17:49PM +1100, Dave Chinner wrote:
> > On Tue, Dec 03, 2013 at 08:47:12PM -0700, Jens Axboe wrote:
> > > On Wed, Dec 04 2013, Dave Chinner wrote:
> > > > On Wed, Dec 04, 2013 at 12:58:38PM +1100, Dave Chinner wrote:
> > > > > On Wed, Dec 04, 2013 at 08:59:40AM +1100, Dave Chinner wrote:
> > > > > > Hi Jens,
> > > > > >
> > > > > > Not sure who to direct this to or CC, so I figured you are the
> > > > > > person to do that. I just had xfstests generic/299 (an AIO/DIO test)
> > > > > > oops in dio_complete() like so:
> > > > > >
> > > > > > [ 9650.590630]
> > > > > > [ 9650.590630]  [] dio_complete+0xa3/0x140
> > > > > > [ 9650.590630]  [] dio_bio_end_aio+0x7a/0x110
> > > > > > [ 9650.590630]  [] ? dio_bio_end_aio+0x5/0x110
> > > > > > [ 9650.590630]  [] bio_endio+0x1d/0x30
> > > > > > [ 9650.590630]  [] blk_mq_complete_request+0x5f/0x120
> > > > > > [ 9650.590630]  [] __blk_mq_end_io+0x16/0x20
> > > > > > [ 9650.590630]  [] blk_mq_end_io+0x68/0xd0
> > > > > > [ 9650.590630]  [] virtblk_done+0x67/0x110
> > > > > > [ 9650.590630]  [] vring_interrupt+0x35/0x60
> > .
> > > > > And I just hit this from running xfs_repair which is doing
> > > > > multithreaded direct IO directly on /dev/vdc:
> > > > >
> > > > > [ 1776.510446] IP: [] blk_account_io_done+0x6a/0x180
> > > > >
> > > > > [ 1776.512577]  [] blk_mq_complete_request+0xb8/0x120
> > > > > [ 1776.512577]  [] __blk_mq_end_io+0x16/0x20
> > > > > [ 1776.512577]  [] blk_mq_end_io+0x68/0xd0
> > > > > [ 1776.512577]  [] virtblk_done+0x67/0x110
> > > > > [ 1776.512577]  [] vring_interrupt+0x35/0x60
> > > > > [ 1776.512577]  [] handle_irq_event_percpu+0x54/0x1e0
> > .
> > > > > So this is looking like another virtio+blk_mq problem
> > > >
> > > > This one is definitely reproducible. Just hit it again...
> > >
> > > I'll take a look at this. You don't happen to have gdb dumps of the
> > > lines associated with those crashes? Just to save me some digging
> > > time...
> >
> > Only this:
> >
> > (gdb) l *(dio_complete+0xa3)
> > 0x811ddae3 is in dio_complete (fs/direct-io.c:282).
> > 277		}
> > 278
> > 279		aio_complete(dio->iocb, ret, 0);
> > 280	}
> > 281
> > 282	kmem_cache_free(dio_cache, dio);
> > 283	return ret;
> > 284	}
> > 285
> > 286	static void dio_aio_complete_work(struct work_struct *work)
> >
> > And this:
> >
> > (gdb) l *(blk_account_io_done+0x6a)
> > 0x81755b6a is in blk_account_io_done (block/blk-core.c:2049).
> > 2044		int cpu;
> > 2045
> > 2046		cpu = part_stat_lock();
> > 2047		part = req->part;
> > 2048
> > 2049		part_stat_inc(cpu, part, ios[rw]);
> > 2050		part_stat_add(cpu, part, ticks[rw], duration);
> > 2051		part_round_stats(cpu, part);
> > 2052		part_dec_in_flight(part, rw);
> > 2053
> >
> > as I've rebuilt the kernel with different patches since the one
> > running on the machine that is triggering the problem.
>
> Any update on this, Jens? I've hit this blk_account_io_done() panic
> 10 times in the past 2 hours while trying to do xfs_repair
> testing

No, sorry, no updates yet... I haven't had time to look into it today.
To reproduce tomorrow, can you mail me your exact setup (kvm invocation,
etc) and how your guest is set up and if there's any special way I need
to run xfstest or xfs_repair?

-- 
Jens Axboe
Re: [OOPS, 3.13-rc2] null ptr in dio_complete()
On Wed, Dec 04, 2013 at 03:17:49PM +1100, Dave Chinner wrote: > On Tue, Dec 03, 2013 at 08:47:12PM -0700, Jens Axboe wrote: > > On Wed, Dec 04 2013, Dave Chinner wrote: > > > On Wed, Dec 04, 2013 at 12:58:38PM +1100, Dave Chinner wrote: > > > > On Wed, Dec 04, 2013 at 08:59:40AM +1100, Dave Chinner wrote: > > > > > Hi Jens, > > > > > > > > > > Not sure who to direct this to or CC, so I figured you are the > > > > > person to do that. I just had xfstests generic/299 (an AIO/DIO test) > > > > > oops in dio_complete() like so: > > > > > > > > > > > [ 9650.590630] > > > > > [ 9650.590630] [] dio_complete+0xa3/0x140 > > > > > [ 9650.590630] [] dio_bio_end_aio+0x7a/0x110 > > > > > [ 9650.590630] [] ? dio_bio_end_aio+0x5/0x110 > > > > > [ 9650.590630] [] bio_endio+0x1d/0x30 > > > > > [ 9650.590630] [] > > > > > blk_mq_complete_request+0x5f/0x120 > > > > > [ 9650.590630] [] __blk_mq_end_io+0x16/0x20 > > > > > [ 9650.590630] [] blk_mq_end_io+0x68/0xd0 > > > > > [ 9650.590630] [] virtblk_done+0x67/0x110 > > > > > [ 9650.590630] [] vring_interrupt+0x35/0x60 > . > > > > And I just hit this from running xfs_repair which is doing > > > > multithreaded direct IO directly on /dev/vdc: > > > > > > > > > [ 1776.510446] IP: [] blk_account_io_done+0x6a/0x180 > > > > > [ 1776.512577] [] blk_mq_complete_request+0xb8/0x120 > > > > [ 1776.512577] [] __blk_mq_end_io+0x16/0x20 > > > > [ 1776.512577] [] blk_mq_end_io+0x68/0xd0 > > > > [ 1776.512577] [] virtblk_done+0x67/0x110 > > > > [ 1776.512577] [] vring_interrupt+0x35/0x60 > > > > [ 1776.512577] [] handle_irq_event_percpu+0x54/0x1e0 > . > > > > So this is looking like another virtio+blk_mq problem > > > > > > This one is definitely reproducable. Just hit it again... > > > > I'll take a look at this. You don't happen to have gdb dumps of the > > lines associated with those crashes? Just to save me some digging > > time... > > Only this: > > (gdb) l *(dio_complete+0xa3) > 0x811ddae3 is in dio_complete (fs/direct-io.c:282). 
> 277 } > 278 > 279 aio_complete(dio->iocb, ret, 0); > 280 } > 281 > 282 kmem_cache_free(dio_cache, dio); > 283 return ret; > 284 } > 285 > 286 static void dio_aio_complete_work(struct work_struct *work) > > And this: > > (gdb) l *(blk_account_io_done+0x6a) > 0x81755b6a is in blk_account_io_done (block/blk-core.c:2049). > 2044int cpu; > 2045 > 2046cpu = part_stat_lock(); > 2047part = req->part; > 2048 > 2049part_stat_inc(cpu, part, ios[rw]); > 2050part_stat_add(cpu, part, ticks[rw], duration); > 2051part_round_stats(cpu, part); > 2052part_dec_in_flight(part, rw); > 2053 > > as I've rebuild the kernel with different patches since the one > running on the machine that is triggering the problem. Any update on this, Jens? I've hit this blk_account_io_done() panic 10 times in the past 2 hours while trying to do xfs_repair testing Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [OOPS, 3.13-rc2] null ptr in dio_complete()
On Wed, Dec 04, 2013 at 08:41:43PM -0700, Jens Axboe wrote: On Thu, Dec 05 2013, Dave Chinner wrote: On Wed, Dec 04, 2013 at 03:17:49PM +1100, Dave Chinner wrote: On Tue, Dec 03, 2013 at 08:47:12PM -0700, Jens Axboe wrote: On Wed, Dec 04 2013, Dave Chinner wrote: On Wed, Dec 04, 2013 at 12:58:38PM +1100, Dave Chinner wrote: On Wed, Dec 04, 2013 at 08:59:40AM +1100, Dave Chinner wrote: Hi Jens, Not sure who to direct this to or CC, so I figured you are the person to do that. I just had xfstests generic/299 (an AIO/DIO test) oops in dio_complete() like so: [ 9650.590630] IRQ [ 9650.590630] [811ddae3] dio_complete+0xa3/0x140 [ 9650.590630] [811ddc2a] dio_bio_end_aio+0x7a/0x110 [ 9650.590630] [811ddbb5] ? dio_bio_end_aio+0x5/0x110 [ 9650.590630] [811d8a9d] bio_endio+0x1d/0x30 [ 9650.590630] [8175d65f] blk_mq_complete_request+0x5f/0x120 [ 9650.590630] [8175d736] __blk_mq_end_io+0x16/0x20 [ 9650.590630] [8175d7a8] blk_mq_end_io+0x68/0xd0 [ 9650.590630] [818539a7] virtblk_done+0x67/0x110 [ 9650.590630] [817f74c5] vring_interrupt+0x35/0x60 . And I just hit this from running xfs_repair which is doing multithreaded direct IO directly on /dev/vdc: [ 1776.510446] IP: [81755b6a] blk_account_io_done+0x6a/0x180 [ 1776.512577] [8175e4b8] blk_mq_complete_request+0xb8/0x120 [ 1776.512577] [8175e536] __blk_mq_end_io+0x16/0x20 [ 1776.512577] [8175e5a8] blk_mq_end_io+0x68/0xd0 [ 1776.512577] [81852e47] virtblk_done+0x67/0x110 [ 1776.512577] [817f7925] vring_interrupt+0x35/0x60 [ 1776.512577] [810e48a4] handle_irq_event_percpu+0x54/0x1e0 . So this is looking like another virtio+blk_mq problem This one is definitely reproducable. Just hit it again... I'll take a look at this. You don't happen to have gdb dumps of the lines associated with those crashes? Just to save me some digging time... Only this: (gdb) l *(dio_complete+0xa3) 0x811ddae3 is in dio_complete (fs/direct-io.c:282). 
277 } 278 279 aio_complete(dio->iocb, ret, 0); 280 } 281 282 kmem_cache_free(dio_cache, dio); 283 return ret; 284 } 285 286 static void dio_aio_complete_work(struct work_struct *work) And this: (gdb) l *(blk_account_io_done+0x6a) 0x81755b6a is in blk_account_io_done (block/blk-core.c:2049). 2044int cpu; 2045 2046cpu = part_stat_lock(); 2047part = req->part; 2048 2049part_stat_inc(cpu, part, ios[rw]); 2050part_stat_add(cpu, part, ticks[rw], duration); 2051part_round_stats(cpu, part); 2052part_dec_in_flight(part, rw); 2053 as I've rebuilt the kernel with different patches since the one running on the machine that is triggering the problem. Any update on this, Jens? I've hit this blk_account_io_done() panic 10 times in the past 2 hours while trying to do xfs_repair testing No, sorry, no updates yet... I haven't had time to look into it today. To reproduce tomorrow, can you mail me your exact setup (kvm invocation, etc) and how your guest is setup and if there's any special way I need to run xfstest or xfs_repair?
The virtio device that I'm hitting is /mnt/fast-ssd/vm-100TB-sparse.img which is a 100TB file on a 160GB XFS filesystem on an SSD, created with $ xfs_io -f -c truncate 100t -c extsize 1m /mnt/fast-ssd/vm-100TB-sparse.img If I stat it, I get: $ xfs_io -c stat /mnt/fast-ssd/vm-100TB-sparse.img fd.path = /mnt/fast-ssd/vm-100TB-sparse.img fd.flags = non-sync,non-direct,read-write stat.ino = 131 stat.type = regular file stat.size = 109951162777600 stat.blocks = 259333400 fsxattr.xflags = 0x800 [--e---] fsxattr.projid = 0 fsxattr.extsize = 1048576 fsxattr.nextents = 83108 fsxattr.naextents = 0 dioattr.mem = 0x200 dioattr.miniosz = 512 dioattr.maxiosz = 2147483136 $ The VM is run by this script: $ cat run-vm-4.sh #!/bin/sh sudo /usr/bin/qemu-system-x86_64 \ -no-fd-bootchk \ -localtime \ -boot c \ -serial pty \ -nographic \ -alt-grab \ -smp 16 -m 16384 \ -machine accel=kvm \ -hda /vm-images/vm-4/root.img \ -drive file=/vm-images/vm-4/vm-4-test.img,if=virtio,cache=none \ -drive file=/vm-images/vm-4/vm-4-scratch.img,if=virtio,cache=none \ -drive file=/mnt/fast-ssd/vm-100TB-sparse.img,if=virtio,cache=none \ -drive
Re: [OOPS, 3.13-rc2] null ptr in dio_complete()
On Tue, Dec 03, 2013 at 08:47:12PM -0700, Jens Axboe wrote: > On Wed, Dec 04 2013, Dave Chinner wrote: > > On Wed, Dec 04, 2013 at 12:58:38PM +1100, Dave Chinner wrote: > > > On Wed, Dec 04, 2013 at 08:59:40AM +1100, Dave Chinner wrote: > > > > Hi Jens, > > > > > > > > Not sure who to direct this to or CC, so I figured you are the > > > > person to do that. I just had xfstests generic/299 (an AIO/DIO test) > > > > oops in dio_complete() like so: > > > > > > > > [ 9650.590630] > > > > [ 9650.590630] [] dio_complete+0xa3/0x140 > > > > [ 9650.590630] [] dio_bio_end_aio+0x7a/0x110 > > > > [ 9650.590630] [] ? dio_bio_end_aio+0x5/0x110 > > > > [ 9650.590630] [] bio_endio+0x1d/0x30 > > > > [ 9650.590630] [] blk_mq_complete_request+0x5f/0x120 > > > > [ 9650.590630] [] __blk_mq_end_io+0x16/0x20 > > > > [ 9650.590630] [] blk_mq_end_io+0x68/0xd0 > > > > [ 9650.590630] [] virtblk_done+0x67/0x110 > > > > [ 9650.590630] [] vring_interrupt+0x35/0x60 . > > > And I just hit this from running xfs_repair which is doing > > > multithreaded direct IO directly on /dev/vdc: > > > > > > [ 1776.510446] IP: [] blk_account_io_done+0x6a/0x180 > > > [ 1776.512577] [] blk_mq_complete_request+0xb8/0x120 > > > [ 1776.512577] [] __blk_mq_end_io+0x16/0x20 > > > [ 1776.512577] [] blk_mq_end_io+0x68/0xd0 > > > [ 1776.512577] [] virtblk_done+0x67/0x110 > > > [ 1776.512577] [] vring_interrupt+0x35/0x60 > > > [ 1776.512577] [] handle_irq_event_percpu+0x54/0x1e0 . > > > So this is looking like another virtio+blk_mq problem > > > > This one is definitely reproducable. Just hit it again... > > I'll take a look at this. You don't happen to have gdb dumps of the > lines associated with those crashes? Just to save me some digging > time... Only this: (gdb) l *(dio_complete+0xa3) 0x811ddae3 is in dio_complete (fs/direct-io.c:282). 
277 } 278 279 aio_complete(dio->iocb, ret, 0); 280 } 281 282 kmem_cache_free(dio_cache, dio); 283 return ret; 284 } 285 286 static void dio_aio_complete_work(struct work_struct *work) And this: (gdb) l *(blk_account_io_done+0x6a) 0x81755b6a is in blk_account_io_done (block/blk-core.c:2049). 2044int cpu; 2045 2046cpu = part_stat_lock(); 2047part = req->part; 2048 2049part_stat_inc(cpu, part, ios[rw]); 2050part_stat_add(cpu, part, ticks[rw], duration); 2051part_round_stats(cpu, part); 2052part_dec_in_flight(part, rw); 2053 as I've rebuild the kernel with different patches since the one running on the machine that is triggering the problem. Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [OOPS, 3.13-rc2] null ptr in dio_complete()
On Wed, Dec 04 2013, Dave Chinner wrote: > On Wed, Dec 04, 2013 at 12:58:38PM +1100, Dave Chinner wrote: > > On Wed, Dec 04, 2013 at 08:59:40AM +1100, Dave Chinner wrote: > > > Hi Jens, > > > > > > Not sure who to direct this to or CC, so I figured you are the > > > person to do that. I just had xfstests generic/299 (an AIO/DIO test) > > > oops in dio_complete() like so: > > > > > > [ 9650.586724] general protection fault: [#1] SMP DEBUG_PAGEALLOC > > > [ 9650.590131] Modules linked in: > > > [ 9650.590630] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.13.0-rc2-dgc+ > > > #73 > > > [ 9650.590630] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 > > > [ 9650.590630] task: 81f35480 ti: 81f24000 task.ti: > > > 81f24000 > > > [ 9650.590630] RIP: 0010:[] [] > > > aio_complete+0xb1/0x1e0 > > > [ 9650.590630] RSP: 0018:88007f803cf8 EFLAGS: 00010003 > > > [ 9650.590630] RAX: 0086 RBX: 8800688ae000 RCX: > > > 00d6d6d6 > > > [ 9650.590630] RDX: 6b6b6b6b6b6b6b6b RSI: 1000 RDI: > > > 8800688ae1c4 > > > [ 9650.590630] RBP: 88007f803d28 R08: R09: > > > 6b6b6b6c > > > [ 9650.590630] R10: ff79a1a35779b009 R11: 0010 R12: > > > 88006953a540 > > > [ 9650.590630] R13: 1000 R14: R15: > > > 8800688ae1c4 > > > [ 9650.590630] FS: () GS:88007f80() > > > knlGS: > > > [ 9650.590630] CS: 0010 DS: ES: CR0: 8005003b > > > [ 9650.590630] CR2: 7ffa9c48448f CR3: 28061000 CR4: > > > 06f0 > > > [ 9650.590630] Stack: > > > [ 9650.590630] 88007f803d18 1000 880068b80800 > > > 1000 > > > [ 9650.590630] 0001 31f6d000 88007f803d68 > > > 811ddae3 > > > [ 9650.590630] 0086 880068b80800 > > > 880068b80828 > > > [ 9650.590630] Call Trace: > > > [ 9650.590630] > > > [ 9650.590630] [] dio_complete+0xa3/0x140 > > > [ 9650.590630] [] dio_bio_end_aio+0x7a/0x110 > > > [ 9650.590630] [] ? 
dio_bio_end_aio+0x5/0x110 > > > [ 9650.590630] [] bio_endio+0x1d/0x30 > > > [ 9650.590630] [] blk_mq_complete_request+0x5f/0x120 > > > [ 9650.590630] [] __blk_mq_end_io+0x16/0x20 > > > [ 9650.590630] [] blk_mq_end_io+0x68/0xd0 > > > [ 9650.590630] [] virtblk_done+0x67/0x110 > > > [ 9650.590630] [] vring_interrupt+0x35/0x60 > > > [ 9650.590630] [] handle_irq_event_percpu+0x54/0x1e0 > > > [ 9650.590630] [] handle_irq_event+0x48/0x70 > > > [ 9650.590630] [] ? kvm_guest_apic_eoi_write+0x5/0x50 > > > [ 9650.590630] [] handle_edge_irq+0x77/0x110 > > > [ 9650.590630] [] handle_irq+0xbf/0x150 > > > [ 9650.590630] [] ? handle_irq+0x5/0x150 > > > [ 9650.590630] [] ? > > > atomic_notifier_call_chain+0x16/0x20 > > > [ 9650.590630] [] do_IRQ+0x5a/0xe0 > > > [ 9650.590630] [] common_interrupt+0x6d/0x6d > > > [ 9650.590630] > > > [ 9650.590630] [] ? native_safe_halt+0x6/0x10 > > > [ 9650.590630] [] ? default_idle+0x5/0xc0 > > > [ 9650.590630] [] default_idle+0x1f/0xc0 > > > [ 9650.590630] [] arch_cpu_idle+0x26/0x30 > > > [ 9650.590630] [] cpu_startup_entry+0x81/0x240 > > > [ 9650.590630] [] rest_init+0x77/0x80 > > > [ 9650.590630] [] start_kernel+0x3cd/0x3da > > > [ 9650.590630] [] ? repair_env_string+0x5e/0x5e > > > [ 9650.590630] [] ? early_idt_handlers+0x117/0x120 > > > [ 9650.590630] [] x86_64_start_reservations+0x2a/0x2c > > > [ 9650.590630] [] x86_64_start_kernel+0x146/0x155 > > > [ 9650.590630] Code: e8 05 b5 8f 00 44 8b 8b c0 01 00 00 45 31 c0 48 8b > > > 93 98 00 00 00 41 83 c1 01 44 3b 8b 80 00 00 00 44 89 c9 45 0f 42 c1 c1 > > > e9 07 <48> 8b 0c ca 65 > > > [ 9650.590630] RIP [] aio_complete+0xb1/0x1e0 > > > [ 9650.590630] RSP > > > > > > I'm not sure if it is related to blk_mq, virtio, dio or bio changes > > > (or even somethign else), but I haven't seen anything like this > > > before. I've only seen it once so far (haven't rerun the test yet at > > > all). 
> > > > And I just hit this from running xfs_repair which is doing > > multithreaded direct IO directly on /dev/vdc: > > > > [ 1776.508599] BUG: unable to handle kernel NULL pointer dereference at > > 0328 > > [ 1776.510446] IP: [] blk_account_io_done+0x6a/0x180 > > [ 1776.511762] PGD 38e75b067 PUD 41987d067 PMD 0 > > [ 1776.512577] Oops: [#1] SMP > > [ 1776.512577] Modules linked in: > > [ 1776.512577] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.13.0-rc2-dgc+ #75 > > [ 1776.512577] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 > > [ 1776.512577] task: 81f37480 ti: 81f26000 task.ti: > > 81f26000 > > [ 1776.512577] RIP: 0010:[] [] > > blk_account_io_done+0x6a/0x180 > > [ 1776.512577] RSP: :88011bc03da8 EFLAGS: 00010046 > > [ 1776.512577] RAX: 1000 RBX: RCX: > > > > [ 1776.512577] RDX: RSI:
Re: [OOPS, 3.13-rc2] null ptr in dio_complete()
On Wed, Dec 04, 2013 at 12:58:38PM +1100, Dave Chinner wrote: > On Wed, Dec 04, 2013 at 08:59:40AM +1100, Dave Chinner wrote: > > Hi Jens, > > > > Not sure who to direct this to or CC, so I figured you are the > > person to do that. I just had xfstests generic/299 (an AIO/DIO test) > > oops in dio_complete() like so: > > > > [ 9650.586724] general protection fault: [#1] SMP DEBUG_PAGEALLOC > > [ 9650.590131] Modules linked in: > > [ 9650.590630] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.13.0-rc2-dgc+ #73 > > [ 9650.590630] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 > > [ 9650.590630] task: 81f35480 ti: 81f24000 task.ti: > > 81f24000 > > [ 9650.590630] RIP: 0010:[] [] > > aio_complete+0xb1/0x1e0 > > [ 9650.590630] RSP: 0018:88007f803cf8 EFLAGS: 00010003 > > [ 9650.590630] RAX: 0086 RBX: 8800688ae000 RCX: > > 00d6d6d6 > > [ 9650.590630] RDX: 6b6b6b6b6b6b6b6b RSI: 1000 RDI: > > 8800688ae1c4 > > [ 9650.590630] RBP: 88007f803d28 R08: R09: > > 6b6b6b6c > > [ 9650.590630] R10: ff79a1a35779b009 R11: 0010 R12: > > 88006953a540 > > [ 9650.590630] R13: 1000 R14: R15: > > 8800688ae1c4 > > [ 9650.590630] FS: () GS:88007f80() > > knlGS: > > [ 9650.590630] CS: 0010 DS: ES: CR0: 8005003b > > [ 9650.590630] CR2: 7ffa9c48448f CR3: 28061000 CR4: > > 06f0 > > [ 9650.590630] Stack: > > [ 9650.590630] 88007f803d18 1000 880068b80800 > > 1000 > > [ 9650.590630] 0001 31f6d000 88007f803d68 > > 811ddae3 > > [ 9650.590630] 0086 880068b80800 > > 880068b80828 > > [ 9650.590630] Call Trace: > > [ 9650.590630] > > [ 9650.590630] [] dio_complete+0xa3/0x140 > > [ 9650.590630] [] dio_bio_end_aio+0x7a/0x110 > > [ 9650.590630] [] ? 
dio_bio_end_aio+0x5/0x110 > > [ 9650.590630] [] bio_endio+0x1d/0x30 > > [ 9650.590630] [] blk_mq_complete_request+0x5f/0x120 > > [ 9650.590630] [] __blk_mq_end_io+0x16/0x20 > > [ 9650.590630] [] blk_mq_end_io+0x68/0xd0 > > [ 9650.590630] [] virtblk_done+0x67/0x110 > > [ 9650.590630] [] vring_interrupt+0x35/0x60 > > [ 9650.590630] [] handle_irq_event_percpu+0x54/0x1e0 > > [ 9650.590630] [] handle_irq_event+0x48/0x70 > > [ 9650.590630] [] ? kvm_guest_apic_eoi_write+0x5/0x50 > > [ 9650.590630] [] handle_edge_irq+0x77/0x110 > > [ 9650.590630] [] handle_irq+0xbf/0x150 > > [ 9650.590630] [] ? handle_irq+0x5/0x150 > > [ 9650.590630] [] ? atomic_notifier_call_chain+0x16/0x20 > > [ 9650.590630] [] do_IRQ+0x5a/0xe0 > > [ 9650.590630] [] common_interrupt+0x6d/0x6d > > [ 9650.590630] > > [ 9650.590630] [] ? native_safe_halt+0x6/0x10 > > [ 9650.590630] [] ? default_idle+0x5/0xc0 > > [ 9650.590630] [] default_idle+0x1f/0xc0 > > [ 9650.590630] [] arch_cpu_idle+0x26/0x30 > > [ 9650.590630] [] cpu_startup_entry+0x81/0x240 > > [ 9650.590630] [] rest_init+0x77/0x80 > > [ 9650.590630] [] start_kernel+0x3cd/0x3da > > [ 9650.590630] [] ? repair_env_string+0x5e/0x5e > > [ 9650.590630] [] ? early_idt_handlers+0x117/0x120 > > [ 9650.590630] [] x86_64_start_reservations+0x2a/0x2c > > [ 9650.590630] [] x86_64_start_kernel+0x146/0x155 > > [ 9650.590630] Code: e8 05 b5 8f 00 44 8b 8b c0 01 00 00 45 31 c0 48 8b 93 > > 98 00 00 00 41 83 c1 01 44 3b 8b 80 00 00 00 44 89 c9 45 0f 42 c1 c1 e9 07 > > <48> 8b 0c ca 65 > > [ 9650.590630] RIP [] aio_complete+0xb1/0x1e0 > > [ 9650.590630] RSP > > > > I'm not sure if it is related to blk_mq, virtio, dio or bio changes > > (or even somethign else), but I haven't seen anything like this > > before. I've only seen it once so far (haven't rerun the test yet at > > all). 
> > And I just hit this from running xfs_repair which is doing > multithreaded direct IO directly on /dev/vdc: > > [ 1776.508599] BUG: unable to handle kernel NULL pointer dereference at > 0328 > [ 1776.510446] IP: [] blk_account_io_done+0x6a/0x180 > [ 1776.511762] PGD 38e75b067 PUD 41987d067 PMD 0 > [ 1776.512577] Oops: [#1] SMP > [ 1776.512577] Modules linked in: > [ 1776.512577] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.13.0-rc2-dgc+ #75 > [ 1776.512577] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 > [ 1776.512577] task: 81f37480 ti: 81f26000 task.ti: > 81f26000 > [ 1776.512577] RIP: 0010:[] [] > blk_account_io_done+0x6a/0x180 > [ 1776.512577] RSP: :88011bc03da8 EFLAGS: 00010046 > [ 1776.512577] RAX: 1000 RBX: RCX: > > [ 1776.512577] RDX: RSI: 0002 RDI: > 8800dba59fc0 > [ 1776.512577] RBP: 88011bc03db8 R08: R09: > 88021a828928 > [ 1776.512577] R10: R11: 0001 R12: > > [ 1776.512577] R13: R14:
Re: [OOPS, 3.13-rc2] null ptr in dio_complete()
On Wed, Dec 04, 2013 at 08:59:40AM +1100, Dave Chinner wrote: > Hi Jens, > > Not sure who to direct this to or CC, so I figured you are the > person to do that. I just had xfstests generic/299 (an AIO/DIO test) > oops in dio_complete() like so: > > [ 9650.586724] general protection fault: [#1] SMP DEBUG_PAGEALLOC > [ 9650.590131] Modules linked in: > [ 9650.590630] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.13.0-rc2-dgc+ #73 > [ 9650.590630] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 > [ 9650.590630] task: 81f35480 ti: 81f24000 task.ti: > 81f24000 > [ 9650.590630] RIP: 0010:[] [] > aio_complete+0xb1/0x1e0 > [ 9650.590630] RSP: 0018:88007f803cf8 EFLAGS: 00010003 > [ 9650.590630] RAX: 0086 RBX: 8800688ae000 RCX: > 00d6d6d6 > [ 9650.590630] RDX: 6b6b6b6b6b6b6b6b RSI: 1000 RDI: > 8800688ae1c4 > [ 9650.590630] RBP: 88007f803d28 R08: R09: > 6b6b6b6c > [ 9650.590630] R10: ff79a1a35779b009 R11: 0010 R12: > 88006953a540 > [ 9650.590630] R13: 1000 R14: R15: > 8800688ae1c4 > [ 9650.590630] FS: () GS:88007f80() > knlGS: > [ 9650.590630] CS: 0010 DS: ES: CR0: 8005003b > [ 9650.590630] CR2: 7ffa9c48448f CR3: 28061000 CR4: > 06f0 > [ 9650.590630] Stack: > [ 9650.590630] 88007f803d18 1000 880068b80800 > 1000 > [ 9650.590630] 0001 31f6d000 88007f803d68 > 811ddae3 > [ 9650.590630] 0086 880068b80800 > 880068b80828 > [ 9650.590630] Call Trace: > [ 9650.590630] > [ 9650.590630] [] dio_complete+0xa3/0x140 > [ 9650.590630] [] dio_bio_end_aio+0x7a/0x110 > [ 9650.590630] [] ? dio_bio_end_aio+0x5/0x110 > [ 9650.590630] [] bio_endio+0x1d/0x30 > [ 9650.590630] [] blk_mq_complete_request+0x5f/0x120 > [ 9650.590630] [] __blk_mq_end_io+0x16/0x20 > [ 9650.590630] [] blk_mq_end_io+0x68/0xd0 > [ 9650.590630] [] virtblk_done+0x67/0x110 > [ 9650.590630] [] vring_interrupt+0x35/0x60 > [ 9650.590630] [] handle_irq_event_percpu+0x54/0x1e0 > [ 9650.590630] [] handle_irq_event+0x48/0x70 > [ 9650.590630] [] ? 
kvm_guest_apic_eoi_write+0x5/0x50 > [ 9650.590630] [] handle_edge_irq+0x77/0x110 > [ 9650.590630] [] handle_irq+0xbf/0x150 > [ 9650.590630] [] ? handle_irq+0x5/0x150 > [ 9650.590630] [] ? atomic_notifier_call_chain+0x16/0x20 > [ 9650.590630] [] do_IRQ+0x5a/0xe0 > [ 9650.590630] [] common_interrupt+0x6d/0x6d > [ 9650.590630] > [ 9650.590630] [] ? native_safe_halt+0x6/0x10 > [ 9650.590630] [] ? default_idle+0x5/0xc0 > [ 9650.590630] [] default_idle+0x1f/0xc0 > [ 9650.590630] [] arch_cpu_idle+0x26/0x30 > [ 9650.590630] [] cpu_startup_entry+0x81/0x240 > [ 9650.590630] [] rest_init+0x77/0x80 > [ 9650.590630] [] start_kernel+0x3cd/0x3da > [ 9650.590630] [] ? repair_env_string+0x5e/0x5e > [ 9650.590630] [] ? early_idt_handlers+0x117/0x120 > [ 9650.590630] [] x86_64_start_reservations+0x2a/0x2c > [ 9650.590630] [] x86_64_start_kernel+0x146/0x155 > [ 9650.590630] Code: e8 05 b5 8f 00 44 8b 8b c0 01 00 00 45 31 c0 48 8b 93 98 > 00 00 00 41 83 c1 01 44 3b 8b 80 00 00 00 44 89 c9 45 0f 42 c1 c1 e9 07 <48> > 8b 0c ca 65 > [ 9650.590630] RIP [] aio_complete+0xb1/0x1e0 > [ 9650.590630] RSP > > I'm not sure if it is related to blk_mq, virtio, dio or bio changes > (or even somethign else), but I haven't seen anything like this > before. I've only seen it once so far (haven't rerun the test yet at > all). 
And I just hit this from running xfs_repair which is doing multithreaded direct IO directly on /dev/vdc: [ 1776.508599] BUG: unable to handle kernel NULL pointer dereference at 0328 [ 1776.510446] IP: [] blk_account_io_done+0x6a/0x180 [ 1776.511762] PGD 38e75b067 PUD 41987d067 PMD 0 [ 1776.512577] Oops: [#1] SMP [ 1776.512577] Modules linked in: [ 1776.512577] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.13.0-rc2-dgc+ #75 [ 1776.512577] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 1776.512577] task: 81f37480 ti: 81f26000 task.ti: 81f26000 [ 1776.512577] RIP: 0010:[] [] blk_account_io_done+0x6a/0x180 [ 1776.512577] RSP: :88011bc03da8 EFLAGS: 00010046 [ 1776.512577] RAX: 1000 RBX: RCX: [ 1776.512577] RDX: RSI: 0002 RDI: 8800dba59fc0 [ 1776.512577] RBP: 88011bc03db8 R08: R09: 88021a828928 [ 1776.512577] R10: R11: 0001 R12: [ 1776.512577] R13: R14: 2000 R15: [ 1776.512577] FS: () GS:88011bc0() knlGS: [ 1776.512577] CS: 0010 DS: ES: CR0: 8005003b [ 1776.512577] CR2: 0328 CR3: 00041a4b CR4: 06f0 [
[OOPS, 3.13-rc2] null ptr in dio_complete()
Hi Jens,

Not sure who to direct this to or CC, so I figured you are the person to do that. I just had xfstests generic/299 (an AIO/DIO test) oops in dio_complete() like so:

[ 9650.586724] general protection fault: [#1] SMP DEBUG_PAGEALLOC
[ 9650.590131] Modules linked in:
[ 9650.590630] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.13.0-rc2-dgc+ #73
[ 9650.590630] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 9650.590630] task: 81f35480 ti: 81f24000 task.ti: 81f24000
[ 9650.590630] RIP: 0010:[811ee5f1] [811ee5f1] aio_complete+0xb1/0x1e0
[ 9650.590630] RSP: 0018:88007f803cf8 EFLAGS: 00010003
[ 9650.590630] RAX: 0086 RBX: 8800688ae000 RCX: 00d6d6d6
[ 9650.590630] RDX: 6b6b6b6b6b6b6b6b RSI: 1000 RDI: 8800688ae1c4
[ 9650.590630] RBP: 88007f803d28 R08: R09: 6b6b6b6c
[ 9650.590630] R10: ff79a1a35779b009 R11: 0010 R12: 88006953a540
[ 9650.590630] R13: 1000 R14: R15: 8800688ae1c4
[ 9650.590630] FS: () GS:88007f80() knlGS:
[ 9650.590630] CS: 0010 DS: ES: CR0: 8005003b
[ 9650.590630] CR2: 7ffa9c48448f CR3: 28061000 CR4: 06f0
[ 9650.590630] Stack:
[ 9650.590630]  88007f803d18 1000 880068b80800 1000
[ 9650.590630]  0001 31f6d000 88007f803d68 811ddae3
[ 9650.590630]  0086 880068b80800 880068b80828
[ 9650.590630] Call Trace:
[ 9650.590630]  <IRQ>
[ 9650.590630]  [811ddae3] dio_complete+0xa3/0x140
[ 9650.590630]  [811ddc2a] dio_bio_end_aio+0x7a/0x110
[ 9650.590630]  [811ddbb5] ? dio_bio_end_aio+0x5/0x110
[ 9650.590630]  [811d8a9d] bio_endio+0x1d/0x30
[ 9650.590630]  [8175d65f] blk_mq_complete_request+0x5f/0x120
[ 9650.590630]  [8175d736] __blk_mq_end_io+0x16/0x20
[ 9650.590630]  [8175d7a8] blk_mq_end_io+0x68/0xd0
[ 9650.590630]  [818539a7] virtblk_done+0x67/0x110
[ 9650.590630]  [817f74c5] vring_interrupt+0x35/0x60
[ 9650.590630]  [810e4e34] handle_irq_event_percpu+0x54/0x1e0
[ 9650.590630]  [810e5008] handle_irq_event+0x48/0x70
[ 9650.590630]  [8107d6f5] ? kvm_guest_apic_eoi_write+0x5/0x50
[ 9650.590630]  [810e7807] handle_edge_irq+0x77/0x110
[ 9650.590630]  [8104796f] handle_irq+0xbf/0x150
[ 9650.590630]  [810478b5] ? handle_irq+0x5/0x150
[ 9650.590630]  [81aee136] ? atomic_notifier_call_chain+0x16/0x20
[ 9650.590630]  [81af43ca] do_IRQ+0x5a/0xe0
[ 9650.590630]  [81aea16d] common_interrupt+0x6d/0x6d
[ 9650.590630]  <EOI>
[ 9650.590630]  [8107e616] ? native_safe_halt+0x6/0x10
[ 9650.590630]  [8104e4f5] ? default_idle+0x5/0xc0
[ 9650.590630]  [8104e50f] default_idle+0x1f/0xc0
[ 9650.590630]  [8104edb6] arch_cpu_idle+0x26/0x30
[ 9650.590630]  [810e4331] cpu_startup_entry+0x81/0x240
[ 9650.590630]  [81ad5d67] rest_init+0x77/0x80
[ 9650.590630]  [8203ceec] start_kernel+0x3cd/0x3da
[ 9650.590630]  [8203c91e] ? repair_env_string+0x5e/0x5e
[ 9650.590630]  [8203c117] ? early_idt_handlers+0x117/0x120
[ 9650.590630]  [8203c5c2] x86_64_start_reservations+0x2a/0x2c
[ 9650.590630]  [8203c70a] x86_64_start_kernel+0x146/0x155
[ 9650.590630] Code: e8 05 b5 8f 00 44 8b 8b c0 01 00 00 45 31 c0 48 8b 93 98 00 00 00 41 83 c1 01 44 3b 8b 80 00 00 00 44 89 c9 45 0f 42 c1 c1 e9 07 <48> 8b 0c ca 65
[ 9650.590630] RIP  [811ee5f1] aio_complete+0xb1/0x1e0
[ 9650.590630]  RSP 88007f803cf8

I'm not sure if it is related to blk_mq, virtio, dio or bio changes (or even something else), but I haven't seen anything like this before. I've only seen it once so far (haven't rerun the test yet at all).

FWIW, I'd also like to point out that I'm seeing XFS on the virtio stack get perilously close to blowing the stack on 3.13-rc2:

[ 387.048771] fill (11196) used greatest stack depth: 1048 bytes left

This is new - on 3.12 the output would generally be around 2500 bytes left, so somewhere below XFS a lot more stack is being consumed in the IO path on 3.13. I'll do some more tracing of this to try to find out where it is all going.

Cheers,

Dave.
-- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
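[Editorial aside: the "used greatest stack depth" line above comes from the kernel's per-task stack accounting (CONFIG_DEBUG_STACK_USAGE). For the follow-up tracing Dave mentions, one common approach is the ftrace stack tracer; the sketch below assumes root, a mounted debugfs, and a kernel built with CONFIG_STACK_TRACER — it is an illustration of the technique, not something from this thread.]

```shell
# Sketch: find where kernel stack is being consumed with the ftrace
# stack tracer (CONFIG_STACK_TRACER).  It records the deepest stack
# observed and the call chain that produced it.
TRACING=/sys/kernel/debug/tracing
if [ -w /proc/sys/kernel/stack_tracer_enabled ]; then
    echo 1 > /proc/sys/kernel/stack_tracer_enabled  # start watching
    sleep 5                                         # run the I/O workload here
    echo "deepest stack usage (bytes):"
    cat "$TRACING/stack_max_size"
    echo "worst-case stack, frame by frame:"
    cat "$TRACING/stack_trace"
    echo 0 > /proc/sys/kernel/stack_tracer_enabled  # stop watching
else
    echo "stack tracer not available (need root + CONFIG_STACK_TRACER)"
fi
```

The `stack_trace` file attributes the worst-case depth frame by frame, which is what you need to tell whether the extra consumption is below XFS in the block/virtio layers.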
Re: [OOPS, 3.13-rc2] null ptr in dio_complete()
On Wed, Dec 04, 2013 at 08:59:40AM +1100, Dave Chinner wrote:
> Hi Jens,
> Not sure who to direct this to or CC, so I figured you are the person to do that. I just had xfstests generic/299 (an AIO/DIO test) oops in dio_complete() like so:
>
> [ 9650.586724] general protection fault: [#1] SMP DEBUG_PAGEALLOC
> ....
> [ 9650.590630] RIP  [811ee5f1] aio_complete+0xb1/0x1e0
>
> I'm not sure if it is related to blk_mq, virtio, dio or bio changes (or even something else), but I haven't seen anything like this before. I've only seen it once so far (haven't rerun the test yet at all).
And I just hit this from running xfs_repair which is doing multithreaded direct IO directly on /dev/vdc:

[ 1776.508599] BUG: unable to handle kernel NULL pointer dereference at 0328
[ 1776.510446] IP: [81755b6a] blk_account_io_done+0x6a/0x180
[ 1776.511762] PGD 38e75b067 PUD 41987d067 PMD 0
[ 1776.512577] Oops: [#1] SMP
[ 1776.512577] Modules linked in:
[ 1776.512577] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.13.0-rc2-dgc+ #75
[ 1776.512577] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 1776.512577] task: 81f37480 ti: 81f26000 task.ti: 81f26000
[ 1776.512577] RIP: 0010:[81755b6a] [81755b6a] blk_account_io_done+0x6a/0x180
[ 1776.512577] RSP: :88011bc03da8 EFLAGS: 00010046
[ 1776.512577] RAX: 1000 RBX: RCX:
[ 1776.512577] RDX: RSI: 0002 RDI:
....

So this is looking like another virtio+blk_mq problem
Re: [OOPS, 3.13-rc2] null ptr in dio_complete()
On Wed, Dec 04, 2013 at 12:58:38PM +1100, Dave Chinner wrote:
> On Wed, Dec 04, 2013 at 08:59:40AM +1100, Dave Chinner wrote:
> > I just had xfstests generic/299 (an AIO/DIO test) oops in dio_complete() like so:
> > ....
>
> And I just hit this from running xfs_repair which is doing multithreaded direct IO directly on /dev/vdc:
>
> [ 1776.508599] BUG: unable to handle kernel NULL pointer dereference at 0328
> [ 1776.510446] IP: [81755b6a] blk_account_io_done+0x6a/0x180
> ....
>
> So this is looking like another virtio+blk_mq problem

This one is definitely reproducable. Just hit it again...
Re: [OOPS, 3.13-rc2] null ptr in dio_complete()
On Wed, Dec 04 2013, Dave Chinner wrote:
> On Wed, Dec 04, 2013 at 12:58:38PM +1100, Dave Chinner wrote:
> > On Wed, Dec 04, 2013 at 08:59:40AM +1100, Dave Chinner wrote:
> > > I just had xfstests generic/299 (an AIO/DIO test) oops in dio_complete() like so:
> > > ....
> >
> > And I just hit this from running xfs_repair which is doing multithreaded direct IO directly on /dev/vdc:
> > ....
> >
> > So this is looking like another virtio+blk_mq problem
>
> This one is definitely reproducable. Just hit it again...

I'll take a look at this. You don't happen to have gdb dumps of the lines associated with those crashes? Just to save me some digging time...
Re: [OOPS, 3.13-rc2] null ptr in dio_complete()
On Tue, Dec 03, 2013 at 08:47:12PM -0700, Jens Axboe wrote:
> On Wed, Dec 04 2013, Dave Chinner wrote:
> > On Wed, Dec 04, 2013 at 12:58:38PM +1100, Dave Chinner wrote:
> > > On Wed, Dec 04, 2013 at 08:59:40AM +1100, Dave Chinner wrote:
> > > > Hi Jens,
> > > > Not sure who to direct this to or CC, so I figured you are the person to do that. I just had xfstests generic/299 (an AIO/DIO test) oops in dio_complete() like so:
> > > >
> > > > [ 9650.590630]  <IRQ>
> > > > [ 9650.590630]  [811ddae3] dio_complete+0xa3/0x140
> > > > [ 9650.590630]  [811ddc2a] dio_bio_end_aio+0x7a/0x110
> > > > [ 9650.590630]  [811ddbb5] ? dio_bio_end_aio+0x5/0x110
> > > > [ 9650.590630]  [811d8a9d] bio_endio+0x1d/0x30
> > > > [ 9650.590630]  [8175d65f] blk_mq_complete_request+0x5f/0x120
> > > > [ 9650.590630]  [8175d736] __blk_mq_end_io+0x16/0x20
> > > > [ 9650.590630]  [8175d7a8] blk_mq_end_io+0x68/0xd0
> > > > [ 9650.590630]  [818539a7] virtblk_done+0x67/0x110
> > > > [ 9650.590630]  [817f74c5] vring_interrupt+0x35/0x60
> > > ....
> > >
> > > And I just hit this from running xfs_repair which is doing multithreaded direct IO directly on /dev/vdc:
> > >
> > > [ 1776.510446] IP: [81755b6a] blk_account_io_done+0x6a/0x180
> > > [ 1776.512577]  [8175e4b8] blk_mq_complete_request+0xb8/0x120
> > > [ 1776.512577]  [8175e536] __blk_mq_end_io+0x16/0x20
> > > [ 1776.512577]  [8175e5a8] blk_mq_end_io+0x68/0xd0
> > > [ 1776.512577]  [81852e47] virtblk_done+0x67/0x110
> > > [ 1776.512577]  [817f7925] vring_interrupt+0x35/0x60
> > > [ 1776.512577]  [810e48a4] handle_irq_event_percpu+0x54/0x1e0
> > ....
> >
> > So this is looking like another virtio+blk_mq problem
>
> This one is definitely reproducable. Just hit it again...
>
> I'll take a look at this. You don't happen to have gdb dumps of the lines associated with those crashes? Just to save me some digging time...

Only this:

(gdb) l *(dio_complete+0xa3)
0x811ddae3 is in dio_complete (fs/direct-io.c:282).
277			}
278
279			aio_complete(dio->iocb, ret, 0);
280		}
281
282		kmem_cache_free(dio_cache, dio);
283		return ret;
284	}
285
286	static void dio_aio_complete_work(struct work_struct *work)

And this:

(gdb) l *(blk_account_io_done+0x6a)
0x81755b6a is in blk_account_io_done (block/blk-core.c:2049).
2044		int cpu;
2045
2046		cpu = part_stat_lock();
2047		part = req->part;
2048
2049		part_stat_inc(cpu, part, ios[rw]);
2050		part_stat_add(cpu, part, ticks[rw], duration);
2051		part_round_stats(cpu, part);
2052		part_dec_in_flight(part, rw);
2053

as I've rebuilt the kernel with different patches since the one running on the machine that is triggering the problem.

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com