Re: [OOPS, 3.13-rc2] null ptr in dio_complete()

2013-12-06 Thread Jens Axboe
On 12/05/2013 04:16 PM, Dave Chinner wrote:
> On Fri, Dec 06, 2013 at 08:26:51AM +1100, Dave Chinner wrote:
>> On Thu, Dec 05, 2013 at 10:22:20PM +0800, Ming Lei wrote:
>>> On Wed, 4 Dec 2013 12:58:38 +1100
>>> Dave Chinner  wrote:
>>>

 And I just hit this from running xfs_repair which is doing
 multithreaded direct IO directly on /dev/vdc:
>> .
 So this is looking like another virtio+blk_mq problem
>>>
>>> This one might be caused by a use-after-free of the request: the freed
>>> request may be reallocated to another CPU immediately. The patch below
>>> may help; would you mind testing it?
>>>
>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>> index 6875736..2358bdf 100644
>>> --- a/block/blk-mq.c
>>> +++ b/block/blk-mq.c
>>> @@ -311,12 +311,12 @@ void blk_mq_complete_request(struct request *rq, int error)
>>>  
>>> blk_account_io_completion(rq, bytes);
>>>  
>>> +   blk_account_io_done(rq);
>>> +
>>> if (rq->end_io)
>>> rq->end_io(rq, error);
>>> else
>>> blk_mq_free_request(rq);
>>> -
>>> -   blk_account_io_done(rq);
>>>  }
>>>  
>>>  void __blk_mq_end_io(struct request *rq, int error)
>>
>> Testing it now. I'll let you know how it goes in a couple of hours.
> 
> Ok, this looks like it is working correctly. The read/write io stats
> are now working properly, too. Thanks for finding the problem
> quickly!

Great, thanks for testing, Dave. The fix is now in Linus' tree.

-- 
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
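
The read/write I/O stats Dave mentions are the per-device counters that
blk_account_io_done() maintains, and they can be checked from userspace.
A sketch of the check (the vdc device name is the one from this thread):

# reads completed is field 1, writes completed is field 5:
cat /sys/block/vdc/stat
# the same counters, with the device name attached:
grep 'vdc ' /proc/diskstats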


Re: [OOPS, 3.13-rc2] null ptr in dio_complete()

2013-12-05 Thread Dave Chinner
On Fri, Dec 06, 2013 at 08:26:51AM +1100, Dave Chinner wrote:
> On Thu, Dec 05, 2013 at 10:22:20PM +0800, Ming Lei wrote:
> > On Wed, 4 Dec 2013 12:58:38 +1100
> > Dave Chinner  wrote:
> > 
> > > 
> > > And I just hit this from running xfs_repair which is doing
> > > multithreaded direct IO directly on /dev/vdc:
> .
> > > So this is looking like another virtio+blk_mq problem
> > 
> > This one might be caused by a use-after-free of the request: the freed
> > request may be reallocated to another CPU immediately. The patch below
> > may help; would you mind testing it?
> > 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 6875736..2358bdf 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -311,12 +311,12 @@ void blk_mq_complete_request(struct request *rq, int error)
> >  
> > blk_account_io_completion(rq, bytes);
> >  
> > +   blk_account_io_done(rq);
> > +
> > if (rq->end_io)
> > rq->end_io(rq, error);
> > else
> > blk_mq_free_request(rq);
> > -
> > -   blk_account_io_done(rq);
> >  }
> >  
> >  void __blk_mq_end_io(struct request *rq, int error)
> 
> Testing it now. I'll let you know how it goes in a couple of hours.

Ok, this looks like it is working correctly. The read/write io stats
are now working properly, too. Thanks for finding the problem
quickly!

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [OOPS, 3.13-rc2] null ptr in dio_complete()

2013-12-05 Thread Dave Chinner
On Thu, Dec 05, 2013 at 10:22:20PM +0800, Ming Lei wrote:
> On Wed, 4 Dec 2013 12:58:38 +1100
> Dave Chinner  wrote:
> 
> > 
> > And I just hit this from running xfs_repair which is doing
> > multithreaded direct IO directly on /dev/vdc:
.
> > So this is looking like another virtio+blk_mq problem
> 
> This one might be caused by a use-after-free of the request: the freed
> request may be reallocated to another CPU immediately. The patch below
> may help; would you mind testing it?
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 6875736..2358bdf 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -311,12 +311,12 @@ void blk_mq_complete_request(struct request *rq, int error)
>  
>   blk_account_io_completion(rq, bytes);
>  
> + blk_account_io_done(rq);
> +
>   if (rq->end_io)
>   rq->end_io(rq, error);
>   else
>   blk_mq_free_request(rq);
> -
> - blk_account_io_done(rq);
>  }
>  
>  void __blk_mq_end_io(struct request *rq, int error)

Testing it now. I'll let you know how it goes in a couple of hours.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [OOPS, 3.13-rc2] null ptr in dio_complete()

2013-12-05 Thread Jens Axboe
On 12/05/2013 07:22 AM, Ming Lei wrote:
> On Wed, 4 Dec 2013 12:58:38 +1100
> Dave Chinner  wrote:
> 
>>
>> And I just hit this from running xfs_repair which is doing
>> multithreaded direct IO directly on /dev/vdc:
>>
>> [ 1776.508599] BUG: unable to handle kernel NULL pointer dereference at 
>> 0328
>> [ 1776.510446] IP: [] blk_account_io_done+0x6a/0x180
>> [ 1776.511762] PGD 38e75b067 PUD 41987d067 PMD 0
>> [ 1776.512577] Oops:  [#1] SMP
>> [ 1776.512577] Modules linked in:
>> [ 1776.512577] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.13.0-rc2-dgc+ #75
>> [ 1776.512577] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
>> [ 1776.512577] task: 81f37480 ti: 81f26000 task.ti: 
>> 81f26000
>> [ 1776.512577] RIP: 0010:[]  [] 
>> blk_account_io_done+0x6a/0x180
>> [ 1776.512577] RSP: :88011bc03da8  EFLAGS: 00010046
>> [ 1776.512577] RAX: 1000 RBX:  RCX: 
>> 
>> [ 1776.512577] RDX:  RSI: 0002 RDI: 
>> 8800dba59fc0
>> [ 1776.512577] RBP: 88011bc03db8 R08:  R09: 
>> 88021a828928
>> [ 1776.512577] R10:  R11: 0001 R12: 
>> 
>> [ 1776.512577] R13:  R14: 2000 R15: 
>> 
>> [ 1776.512577] FS:  () GS:88011bc0() 
>> knlGS:
>> [ 1776.512577] CS:  0010 DS:  ES:  CR0: 8005003b
>> [ 1776.512577] CR2: 0328 CR3: 00041a4b CR4: 
>> 06f0
>> [ 1776.512577] Stack:
>> [ 1776.512577]   8800dba59fc0 88011bc03df8 
>> 8175e4b8
>> [ 1776.512577]  01c0 8800dba59fc0 e8fcfbc03a80 
>> 88041aac0800
>> [ 1776.512577]  88011ad53b10 0096 88011bc03e08 
>> 8175e536
>> [ 1776.512577] Call Trace:
>> [ 1776.512577]  
>> [ 1776.512577]  [] blk_mq_complete_request+0xb8/0x120
>> [ 1776.512577]  [] __blk_mq_end_io+0x16/0x20
>> [ 1776.512577]  [] blk_mq_end_io+0x68/0xd0
>> [ 1776.512577]  [] virtblk_done+0x67/0x110
>> [ 1776.512577]  [] vring_interrupt+0x35/0x60
>> [ 1776.512577]  [] handle_irq_event_percpu+0x54/0x1e0
>> [ 1776.512577]  [] handle_irq_event+0x48/0x70
>> [ 1776.512577]  [] handle_edge_irq+0x77/0x110
>> [ 1776.512577]  [] handle_irq+0xbf/0x150
>> [ 1776.512577]  [] ? __atomic_notifier_call_chain+0x12/0x20
>> [ 1776.512577]  [] ? atomic_notifier_call_chain+0x16/0x20
>> [ 1776.512577]  [] do_IRQ+0x5a/0xe0
>> [ 1776.512577]  [] common_interrupt+0x6d/0x6d
>> [ 1776.512577]  
>> [ 1776.512577]  [] ? native_safe_halt+0x6/0x10
>> [ 1776.512577]  [] default_idle+0x1f/0xc0
>> [ 1776.512577]  [] arch_cpu_idle+0x26/0x30
>> [ 1776.512577]  [] cpu_startup_entry+0x81/0x240
>> [ 1776.512577]  [] rest_init+0x77/0x80
>> [ 1776.512577]  [] start_kernel+0x3c8/0x3d5
>> [ 1776.512577]  [] ? repair_env_string+0x5e/0x5e
>> [ 1776.512577]  [] ? early_idt_handlers+0x117/0x120
>> [ 1776.512577]  [] x86_64_start_reservations+0x2a/0x2c
>> [ 1776.512577]  [] x86_64_start_kernel+0x146/0x155
>> [ 1776.512577] Code: 54 41 89 c4 53 41 83 e4 01 48 2b 97 d8 00 00 00 48 8b 
>> 9f d0 00 00 00 4d 63 e4 65 44 8b 04 25 1c b0 00 00 49 8d 74 24 02 49 63 c8 
>> <48> 8b 83 28 03
>> [ 1776.512577] RIP  [] blk_account_io_done+0x6a/0x180
>> [ 1776.512577]  RSP 
>> [ 1776.512577] CR2: 0328
>> [ 1776.512577] ---[ end trace 33a5787221ff13b4 ]---
>> [ 1776.512577] Kernel panic - not syncing: Fatal exception in interrupt
>> [ 1776.512577] Shutting down cpus with NMI
>>
>> So this is looking like another virtio+blk_mq problem
> 
> This one might be caused by a use-after-free of the request: the freed
> request may be reallocated to another CPU immediately. The patch below
> may help; would you mind testing it?
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 6875736..2358bdf 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -311,12 +311,12 @@ void blk_mq_complete_request(struct request *rq, int error)
>  
>   blk_account_io_completion(rq, bytes);
>  
> + blk_account_io_done(rq);
> +
>   if (rq->end_io)
>   rq->end_io(rq, error);
>   else
>   blk_mq_free_request(rq);
> -
> - blk_account_io_done(rq);
>  }
>  
>  void __blk_mq_end_io(struct request *rq, int error)

That's definitely a bug... Good catch, Ming. Applied.

-- 
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [OOPS, 3.13-rc2] null ptr in dio_complete()

2013-12-05 Thread Ming Lei
On Wed, 4 Dec 2013 12:58:38 +1100
Dave Chinner  wrote:

> 
> And I just hit this from running xfs_repair which is doing
> multithreaded direct IO directly on /dev/vdc:
> 
> [ 1776.508599] BUG: unable to handle kernel NULL pointer dereference at 
> 0328
> [ 1776.510446] IP: [] blk_account_io_done+0x6a/0x180
> [ 1776.511762] PGD 38e75b067 PUD 41987d067 PMD 0
> [ 1776.512577] Oops:  [#1] SMP
> [ 1776.512577] Modules linked in:
> [ 1776.512577] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.13.0-rc2-dgc+ #75
> [ 1776.512577] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> [ 1776.512577] task: 81f37480 ti: 81f26000 task.ti: 
> 81f26000
> [ 1776.512577] RIP: 0010:[]  [] 
> blk_account_io_done+0x6a/0x180
> [ 1776.512577] RSP: :88011bc03da8  EFLAGS: 00010046
> [ 1776.512577] RAX: 1000 RBX:  RCX: 
> 
> [ 1776.512577] RDX:  RSI: 0002 RDI: 
> 8800dba59fc0
> [ 1776.512577] RBP: 88011bc03db8 R08:  R09: 
> 88021a828928
> [ 1776.512577] R10:  R11: 0001 R12: 
> 
> [ 1776.512577] R13:  R14: 2000 R15: 
> 
> [ 1776.512577] FS:  () GS:88011bc0() 
> knlGS:
> [ 1776.512577] CS:  0010 DS:  ES:  CR0: 8005003b
> [ 1776.512577] CR2: 0328 CR3: 00041a4b CR4: 
> 06f0
> [ 1776.512577] Stack:
> [ 1776.512577]   8800dba59fc0 88011bc03df8 
> 8175e4b8
> [ 1776.512577]  01c0 8800dba59fc0 e8fcfbc03a80 
> 88041aac0800
> [ 1776.512577]  88011ad53b10 0096 88011bc03e08 
> 8175e536
> [ 1776.512577] Call Trace:
> [ 1776.512577]  
> [ 1776.512577]  [] blk_mq_complete_request+0xb8/0x120
> [ 1776.512577]  [] __blk_mq_end_io+0x16/0x20
> [ 1776.512577]  [] blk_mq_end_io+0x68/0xd0
> [ 1776.512577]  [] virtblk_done+0x67/0x110
> [ 1776.512577]  [] vring_interrupt+0x35/0x60
> [ 1776.512577]  [] handle_irq_event_percpu+0x54/0x1e0
> [ 1776.512577]  [] handle_irq_event+0x48/0x70
> [ 1776.512577]  [] handle_edge_irq+0x77/0x110
> [ 1776.512577]  [] handle_irq+0xbf/0x150
> [ 1776.512577]  [] ? __atomic_notifier_call_chain+0x12/0x20
> [ 1776.512577]  [] ? atomic_notifier_call_chain+0x16/0x20
> [ 1776.512577]  [] do_IRQ+0x5a/0xe0
> [ 1776.512577]  [] common_interrupt+0x6d/0x6d
> [ 1776.512577]  
> [ 1776.512577]  [] ? native_safe_halt+0x6/0x10
> [ 1776.512577]  [] default_idle+0x1f/0xc0
> [ 1776.512577]  [] arch_cpu_idle+0x26/0x30
> [ 1776.512577]  [] cpu_startup_entry+0x81/0x240
> [ 1776.512577]  [] rest_init+0x77/0x80
> [ 1776.512577]  [] start_kernel+0x3c8/0x3d5
> [ 1776.512577]  [] ? repair_env_string+0x5e/0x5e
> [ 1776.512577]  [] ? early_idt_handlers+0x117/0x120
> [ 1776.512577]  [] x86_64_start_reservations+0x2a/0x2c
> [ 1776.512577]  [] x86_64_start_kernel+0x146/0x155
> [ 1776.512577] Code: 54 41 89 c4 53 41 83 e4 01 48 2b 97 d8 00 00 00 48 8b 9f 
> d0 00 00 00 4d 63 e4 65 44 8b 04 25 1c b0 00 00 49 8d 74 24 02 49 63 c8 <48> 
> 8b 83 28 03
> [ 1776.512577] RIP  [] blk_account_io_done+0x6a/0x180
> [ 1776.512577]  RSP 
> [ 1776.512577] CR2: 0328
> [ 1776.512577] ---[ end trace 33a5787221ff13b4 ]---
> [ 1776.512577] Kernel panic - not syncing: Fatal exception in interrupt
> [ 1776.512577] Shutting down cpus with NMI
> 
> So this is looking like another virtio+blk_mq problem

This one might be caused by a use-after-free of the request: the freed
request may be reallocated to another CPU immediately. The patch below
may help; would you mind testing it?

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6875736..2358bdf 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -311,12 +311,12 @@ void blk_mq_complete_request(struct request *rq, int error)
 
blk_account_io_completion(rq, bytes);
 
+   blk_account_io_done(rq);
+
if (rq->end_io)
rq->end_io(rq, error);
else
blk_mq_free_request(rq);
-
-   blk_account_io_done(rq);
 }
 
 void __blk_mq_end_io(struct request *rq, int error)


Thanks,
-- 
Ming Lei
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
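
A note on why the reordering in the patch above is the whole fix:
blk_mq_free_request() returns the request to the tag pool, where another
CPU can reallocate and reinitialize it immediately, so calling
blk_account_io_done() after the free reads fields of a request the
completing CPU no longer owns. Below is a minimal userspace sketch of
that ordering bug and of the fix; it is illustrative only, with invented
names and a one-slot pool standing in for the blk-mq tag allocator.

/* cc -o rq-order rq-order.c && ./rq-order */
#include <stdio.h>

struct part { long ios; };

struct request {
        struct part *part;      /* valid only while this CPU owns the request */
};

static struct request pool[1];  /* one-slot stand-in for the tag pool */

static struct request *alloc_request(struct part *p)
{
        struct request *rq = &pool[0];
        rq->part = p;           /* fields are reinitialized on allocation */
        return rq;
}

static void free_request(struct request *rq)
{
        rq->part = NULL;        /* ownership returns to the pool */
}

static void account_io_done(struct request *rq)
{
        rq->part->ios++;        /* NULL deref if rq was already freed */
}

/* Pre-patch ordering: the request is freed first, so by the time the
 * accounting runs another CPU may already own it. The sketch simulates
 * that loss of ownership by having free_request() clear rq->part. */
static void complete_request_buggy(struct request *rq)
{
        free_request(rq);
        account_io_done(rq);    /* use-after-free: crashes on rq->part */
}

/* Post-patch ordering: account while this CPU still owns the request. */
static void complete_request_fixed(struct request *rq)
{
        account_io_done(rq);
        free_request(rq);
}

int main(void)
{
        struct part p = { 0 };

        complete_request_fixed(alloc_request(&p));
        printf("ios after fixed completion: %ld\n", p.ios);  /* prints 1 */

        /* Uncomment to reproduce the class of crash in the oops above,
         * a NULL pointer dereference in the accounting path:
         * complete_request_buggy(alloc_request(&p));
         */
        return 0;
}

This is likely also why the read/write stats started working after the
fix: accounting that runs after the free can be charged against a request
already recycled for a different I/O.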


Re: [OOPS, 3.13-rc2] null ptr in dio_complete()

2013-12-04 Thread Dave Chinner
On Wed, Dec 04, 2013 at 08:41:43PM -0700, Jens Axboe wrote:
> On Thu, Dec 05 2013, Dave Chinner wrote:
> > On Wed, Dec 04, 2013 at 03:17:49PM +1100, Dave Chinner wrote:
> > > On Tue, Dec 03, 2013 at 08:47:12PM -0700, Jens Axboe wrote:
> > > > On Wed, Dec 04 2013, Dave Chinner wrote:
> > > > > On Wed, Dec 04, 2013 at 12:58:38PM +1100, Dave Chinner wrote:
> > > > > > On Wed, Dec 04, 2013 at 08:59:40AM +1100, Dave Chinner wrote:
> > > > > > > Hi Jens,
> > > > > > > 
> > > > > > > Not sure who to direct this to or CC, so I figured you are the
> > > > > > > person to do that. I just had xfstests generic/299 (an AIO/DIO 
> > > > > > > test)
> > > > > > > oops in dio_complete() like so:
> > > > > > > 
> > > 
> > > > > > > [ 9650.590630]  
> > > > > > > [ 9650.590630]  [] dio_complete+0xa3/0x140
> > > > > > > [ 9650.590630]  [] dio_bio_end_aio+0x7a/0x110
> > > > > > > [ 9650.590630]  [] ? dio_bio_end_aio+0x5/0x110
> > > > > > > [ 9650.590630]  [] bio_endio+0x1d/0x30
> > > > > > > [ 9650.590630]  [] 
> > > > > > > blk_mq_complete_request+0x5f/0x120
> > > > > > > [ 9650.590630]  [] __blk_mq_end_io+0x16/0x20
> > > > > > > [ 9650.590630]  [] blk_mq_end_io+0x68/0xd0
> > > > > > > [ 9650.590630]  [] virtblk_done+0x67/0x110
> > > > > > > [ 9650.590630]  [] vring_interrupt+0x35/0x60
> > > .
> > > > > > And I just hit this from running xfs_repair which is doing
> > > > > > multithreaded direct IO directly on /dev/vdc:
> > > > > > 
> > > 
> > > > > > [ 1776.510446] IP: [] 
> > > > > > blk_account_io_done+0x6a/0x180
> > > 
> > > > > > [ 1776.512577]  [] 
> > > > > > blk_mq_complete_request+0xb8/0x120
> > > > > > [ 1776.512577]  [] __blk_mq_end_io+0x16/0x20
> > > > > > [ 1776.512577]  [] blk_mq_end_io+0x68/0xd0
> > > > > > [ 1776.512577]  [] virtblk_done+0x67/0x110
> > > > > > [ 1776.512577]  [] vring_interrupt+0x35/0x60
> > > > > > [ 1776.512577]  [] 
> > > > > > handle_irq_event_percpu+0x54/0x1e0
> > > .
> > > > > > So this is looking like another virtio+blk_mq problem
> > > > > 
> > > > > This one is definitely reproducible. Just hit it again...
> > > > 
> > > > I'll take a look at this. You don't happen to have gdb dumps of the
> > > > lines associated with those crashes? Just to save me some digging
> > > > time...
> > > 
> > > Only this:
> > > 
> > > (gdb) l *(dio_complete+0xa3)
> > > 0x811ddae3 is in dio_complete (fs/direct-io.c:282).
> > > 277 }
> > > 278
> > > 279 aio_complete(dio->iocb, ret, 0);
> > > 280 }
> > > 281
> > > 282 kmem_cache_free(dio_cache, dio);
> > > 283 return ret;
> > > 284 }
> > > 285
> > > 286 static void dio_aio_complete_work(struct work_struct *work)
> > > 
> > > And this:
> > > 
> > > (gdb) l *(blk_account_io_done+0x6a)
> > > 0x81755b6a is in blk_account_io_done (block/blk-core.c:2049).
> > > 2044        int cpu;
> > > 2045
> > > 2046        cpu = part_stat_lock();
> > > 2047        part = req->part;
> > > 2048
> > > 2049        part_stat_inc(cpu, part, ios[rw]);
> > > 2050        part_stat_add(cpu, part, ticks[rw], duration);
> > > 2051        part_round_stats(cpu, part);
> > > 2052        part_dec_in_flight(part, rw);
> > > 2053
> > > 
> > > as I've rebuilt the kernel with different patches since the one
> > > running on the machine that is triggering the problem.
> > 
> > Any update on this, Jens? I've hit this blk_account_io_done() panic
> > 10 times in the past 2 hours while trying to do xfs_repair
> > testing
> 
> No, sorry, no updates yet... I haven't had time to look into it today.
> To reproduce tomorrow, can you mail me your exact setup (kvm invocation,
> etc) and how your guest is setup and if there's any special way I need
> to run xfstest or xfs_repair?

The virtio device that I'm hitting is "/mnt/fast-ssd/vm-100TB-sparse.img",
which is a 100TB file on a 160GB XFS filesystem on an SSD, created
with

$ xfs_io -f -c "truncate 100t" -c "extsize 1m" /mnt/fast-ssd/vm-100TB-sparse.img

If I stat it, I get:

$ xfs_io -c stat /mnt/fast-ssd/vm-100TB-sparse.img 
fd.path = "/mnt/fast-ssd/vm-100TB-sparse.img"
fd.flags = non-sync,non-direct,read-write
stat.ino = 131
stat.type = regular file
stat.size = 109951162777600
stat.blocks = 259333400
fsxattr.xflags = 0x800 [--e---]
fsxattr.projid = 0
fsxattr.extsize = 1048576
fsxattr.nextents = 83108
fsxattr.naextents = 0
dioattr.mem = 0x200
dioattr.miniosz = 512
dioattr.maxiosz = 2147483136
$

The VM is run by this script:

$ cat run-vm-4.sh 
#!/bin/sh
sudo /usr/bin/qemu-system-x86_64 \
-no-fd-bootchk \
-localtime \
-boot c \
-serial pty \
-nographic \
-alt-grab \
-smp 16 -m 16384 \
-machine accel=kvm \
-hda /vm-images/vm-4/root.img \
-drive file=/vm-images/vm-4/vm-4-test.img,if=virtio,cache=none \
-drive file=/vm-images/vm-4/vm-4-scratch.img,if=virtio,cache=none \
-drive file=/mnt/fast-ssd/vm-100TB-sparse.img,if=virtio,cache=none \
-drive 

Re: [OOPS, 3.13-rc2] null ptr in dio_complete()

2013-12-04 Thread Jens Axboe
On Thu, Dec 05 2013, Dave Chinner wrote:
> On Wed, Dec 04, 2013 at 03:17:49PM +1100, Dave Chinner wrote:
> > On Tue, Dec 03, 2013 at 08:47:12PM -0700, Jens Axboe wrote:
> > > On Wed, Dec 04 2013, Dave Chinner wrote:
> > > > On Wed, Dec 04, 2013 at 12:58:38PM +1100, Dave Chinner wrote:
> > > > > On Wed, Dec 04, 2013 at 08:59:40AM +1100, Dave Chinner wrote:
> > > > > > Hi Jens,
> > > > > > 
> > > > > > Not sure who to direct this to or CC, so I figured you are the
> > > > > > person to do that. I just had xfstests generic/299 (an AIO/DIO test)
> > > > > > oops in dio_complete() like so:
> > > > > > 
> > 
> > > > > > [ 9650.590630]  
> > > > > > [ 9650.590630]  [] dio_complete+0xa3/0x140
> > > > > > [ 9650.590630]  [] dio_bio_end_aio+0x7a/0x110
> > > > > > [ 9650.590630]  [] ? dio_bio_end_aio+0x5/0x110
> > > > > > [ 9650.590630]  [] bio_endio+0x1d/0x30
> > > > > > [ 9650.590630]  [] 
> > > > > > blk_mq_complete_request+0x5f/0x120
> > > > > > [ 9650.590630]  [] __blk_mq_end_io+0x16/0x20
> > > > > > [ 9650.590630]  [] blk_mq_end_io+0x68/0xd0
> > > > > > [ 9650.590630]  [] virtblk_done+0x67/0x110
> > > > > > [ 9650.590630]  [] vring_interrupt+0x35/0x60
> > .
> > > > > And I just hit this from running xfs_repair which is doing
> > > > > multithreaded direct IO directly on /dev/vdc:
> > > > > 
> > 
> > > > > [ 1776.510446] IP: [] blk_account_io_done+0x6a/0x180
> > 
> > > > > [ 1776.512577]  [] 
> > > > > blk_mq_complete_request+0xb8/0x120
> > > > > [ 1776.512577]  [] __blk_mq_end_io+0x16/0x20
> > > > > [ 1776.512577]  [] blk_mq_end_io+0x68/0xd0
> > > > > [ 1776.512577]  [] virtblk_done+0x67/0x110
> > > > > [ 1776.512577]  [] vring_interrupt+0x35/0x60
> > > > > [ 1776.512577]  [] 
> > > > > handle_irq_event_percpu+0x54/0x1e0
> > .
> > > > > So this is looking like another virtio+blk_mq problem
> > > > 
> > > > This one is definitely reproducible. Just hit it again...
> > > 
> > > I'll take a look at this. You don't happen to have gdb dumps of the
> > > lines associated with those crashes? Just to save me some digging
> > > time...
> > 
> > Only this:
> > 
> > (gdb) l *(dio_complete+0xa3)
> > 0x811ddae3 is in dio_complete (fs/direct-io.c:282).
> > 277 }
> > 278
> > 279 aio_complete(dio->iocb, ret, 0);
> > 280 }
> > 281
> > 282 kmem_cache_free(dio_cache, dio);
> > 283 return ret;
> > 284 }
> > 285
> > 286 static void dio_aio_complete_work(struct work_struct *work)
> > 
> > And this:
> > 
> > (gdb) l *(blk_account_io_done+0x6a)
> > 0x81755b6a is in blk_account_io_done (block/blk-core.c:2049).
> > 2044        int cpu;
> > 2045
> > 2046        cpu = part_stat_lock();
> > 2047        part = req->part;
> > 2048
> > 2049        part_stat_inc(cpu, part, ios[rw]);
> > 2050        part_stat_add(cpu, part, ticks[rw], duration);
> > 2051        part_round_stats(cpu, part);
> > 2052        part_dec_in_flight(part, rw);
> > 2053
> > 
> > as I've rebuilt the kernel with different patches since the one
> > running on the machine that is triggering the problem.
> 
> Any update on this, Jens? I've hit this blk_account_io_done() panic
> 10 times in the past 2 hours while trying to do xfs_repair
> testing

No, sorry, no updates yet... I haven't had time to look into it today.
To reproduce it tomorrow, can you mail me your exact setup (kvm
invocation, etc.), how your guest is set up, and whether there's any
special way I need to run xfstests or xfs_repair?

-- 
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [OOPS, 3.13-rc2] null ptr in dio_complete()

2013-12-04 Thread Dave Chinner
On Wed, Dec 04, 2013 at 03:17:49PM +1100, Dave Chinner wrote:
> On Tue, Dec 03, 2013 at 08:47:12PM -0700, Jens Axboe wrote:
> > On Wed, Dec 04 2013, Dave Chinner wrote:
> > > On Wed, Dec 04, 2013 at 12:58:38PM +1100, Dave Chinner wrote:
> > > > On Wed, Dec 04, 2013 at 08:59:40AM +1100, Dave Chinner wrote:
> > > > > Hi Jens,
> > > > > 
> > > > > Not sure who to direct this to or CC, so I figured you are the
> > > > > person to do that. I just had xfstests generic/299 (an AIO/DIO test)
> > > > > oops in dio_complete() like so:
> > > > > 
> 
> > > > > [ 9650.590630]  
> > > > > [ 9650.590630]  [] dio_complete+0xa3/0x140
> > > > > [ 9650.590630]  [] dio_bio_end_aio+0x7a/0x110
> > > > > [ 9650.590630]  [] ? dio_bio_end_aio+0x5/0x110
> > > > > [ 9650.590630]  [] bio_endio+0x1d/0x30
> > > > > [ 9650.590630]  [] 
> > > > > blk_mq_complete_request+0x5f/0x120
> > > > > [ 9650.590630]  [] __blk_mq_end_io+0x16/0x20
> > > > > [ 9650.590630]  [] blk_mq_end_io+0x68/0xd0
> > > > > [ 9650.590630]  [] virtblk_done+0x67/0x110
> > > > > [ 9650.590630]  [] vring_interrupt+0x35/0x60
> .
> > > > And I just hit this from running xfs_repair which is doing
> > > > multithreaded direct IO directly on /dev/vdc:
> > > > 
> 
> > > > [ 1776.510446] IP: [] blk_account_io_done+0x6a/0x180
> 
> > > > [ 1776.512577]  [] blk_mq_complete_request+0xb8/0x120
> > > > [ 1776.512577]  [] __blk_mq_end_io+0x16/0x20
> > > > [ 1776.512577]  [] blk_mq_end_io+0x68/0xd0
> > > > [ 1776.512577]  [] virtblk_done+0x67/0x110
> > > > [ 1776.512577]  [] vring_interrupt+0x35/0x60
> > > > [ 1776.512577]  [] handle_irq_event_percpu+0x54/0x1e0
> .
> > > > So this is looking like another virtio+blk_mq problem
> > > 
> > > This one is definitely reproducible. Just hit it again...
> > 
> > I'll take a look at this. You don't happen to have gdb dumps of the
> > lines associated with those crashes? Just to save me some digging
> > time...
> 
> Only this:
> 
> (gdb) l *(dio_complete+0xa3)
> 0x811ddae3 is in dio_complete (fs/direct-io.c:282).
> 277 }
> 278
> 279 aio_complete(dio->iocb, ret, 0);
> 280 }
> 281
> 282 kmem_cache_free(dio_cache, dio);
> 283 return ret;
> 284 }
> 285
> 286 static void dio_aio_complete_work(struct work_struct *work)
> 
> And this:
> 
> (gdb) l *(blk_account_io_done+0x6a)
> 0x81755b6a is in blk_account_io_done (block/blk-core.c:2049).
> 2044        int cpu;
> 2045
> 2046        cpu = part_stat_lock();
> 2047        part = req->part;
> 2048
> 2049        part_stat_inc(cpu, part, ios[rw]);
> 2050        part_stat_add(cpu, part, ticks[rw], duration);
> 2051        part_round_stats(cpu, part);
> 2052        part_dec_in_flight(part, rw);
> 2053
> 
> as I've rebuilt the kernel with different patches since the one
> running on the machine that is triggering the problem.

Any update on this, Jens? I've hit this blk_account_io_done() panic
10 times in the past 2 hours while trying to do xfs_repair
testing

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [OOPS, 3.13-rc2] null ptr in dio_complete()

2013-12-03 Thread Dave Chinner
On Tue, Dec 03, 2013 at 08:47:12PM -0700, Jens Axboe wrote:
> On Wed, Dec 04 2013, Dave Chinner wrote:
> > On Wed, Dec 04, 2013 at 12:58:38PM +1100, Dave Chinner wrote:
> > > On Wed, Dec 04, 2013 at 08:59:40AM +1100, Dave Chinner wrote:
> > > > Hi Jens,
> > > > 
> > > > Not sure who to direct this to or CC, so I figured you are the
> > > > person to do that. I just had xfstests generic/299 (an AIO/DIO test)
> > > > oops in dio_complete() like so:
> > > > 

> > > > [ 9650.590630]  
> > > > [ 9650.590630]  [] dio_complete+0xa3/0x140
> > > > [ 9650.590630]  [] dio_bio_end_aio+0x7a/0x110
> > > > [ 9650.590630]  [] ? dio_bio_end_aio+0x5/0x110
> > > > [ 9650.590630]  [] bio_endio+0x1d/0x30
> > > > [ 9650.590630]  [] blk_mq_complete_request+0x5f/0x120
> > > > [ 9650.590630]  [] __blk_mq_end_io+0x16/0x20
> > > > [ 9650.590630]  [] blk_mq_end_io+0x68/0xd0
> > > > [ 9650.590630]  [] virtblk_done+0x67/0x110
> > > > [ 9650.590630]  [] vring_interrupt+0x35/0x60
.
> > > And I just hit this from running xfs_repair which is doing
> > > multithreaded direct IO directly on /dev/vdc:
> > > 

> > > [ 1776.510446] IP: [] blk_account_io_done+0x6a/0x180

> > > [ 1776.512577]  [] blk_mq_complete_request+0xb8/0x120
> > > [ 1776.512577]  [] __blk_mq_end_io+0x16/0x20
> > > [ 1776.512577]  [] blk_mq_end_io+0x68/0xd0
> > > [ 1776.512577]  [] virtblk_done+0x67/0x110
> > > [ 1776.512577]  [] vring_interrupt+0x35/0x60
> > > [ 1776.512577]  [] handle_irq_event_percpu+0x54/0x1e0
.
> > > So this is looking like another virtio+blk_mq problem
> > 
> > This one is definitely reproducible. Just hit it again...
> 
> I'll take a look at this. You don't happen to have gdb dumps of the
> lines associated with those crashes? Just to save me some digging
> time...

Only this:

(gdb) l *(dio_complete+0xa3)
0x811ddae3 is in dio_complete (fs/direct-io.c:282).
277 }
278
279 aio_complete(dio->iocb, ret, 0);
280 }
281
282 kmem_cache_free(dio_cache, dio);
283 return ret;
284 }
285
286 static void dio_aio_complete_work(struct work_struct *work)

And this:

(gdb) l *(blk_account_io_done+0x6a)
0x81755b6a is in blk_account_io_done (block/blk-core.c:2049).
2044        int cpu;
2045
2046        cpu = part_stat_lock();
2047        part = req->part;
2048
2049        part_stat_inc(cpu, part, ios[rw]);
2050        part_stat_add(cpu, part, ticks[rw], duration);
2051        part_round_stats(cpu, part);
2052        part_dec_in_flight(part, rw);
2053

as I've rebuilt the kernel with different patches since the one
running on the machine that is triggering the problem.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
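
For reference, the gdb listings in the message above can be reproduced
non-interactively. A sketch of the lookup, assuming ./vmlinux is the
image the crashing kernel was built from and CONFIG_DEBUG_INFO is
enabled:

# Resolve an oops "symbol+offset" to file:line in the sources:
gdb -batch -ex 'l *(blk_account_io_done+0x6a)' ./vmlinux
gdb -batch -ex 'l *(dio_complete+0xa3)' ./vmlinux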


Re: [OOPS, 3.13-rc2] null ptr in dio_complete()

2013-12-03 Thread Jens Axboe
On Wed, Dec 04 2013, Dave Chinner wrote:
> On Wed, Dec 04, 2013 at 12:58:38PM +1100, Dave Chinner wrote:
> > On Wed, Dec 04, 2013 at 08:59:40AM +1100, Dave Chinner wrote:
> > > Hi Jens,
> > > 
> > > Not sure who to direct this to or CC, so I figured you are the
> > > person to do that. I just had xfstests generic/299 (an AIO/DIO test)
> > > oops in dio_complete() like so:
> > > 
> > > [ 9650.586724] general protection fault:  [#1] SMP DEBUG_PAGEALLOC
> > > [ 9650.590131] Modules linked in:
> > > [ 9650.590630] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.13.0-rc2-dgc+ 
> > > #73
> > > [ 9650.590630] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> > > [ 9650.590630] task: 81f35480 ti: 81f24000 task.ti: 
> > > 81f24000
> > > [ 9650.590630] RIP: 0010:[]  [] 
> > > aio_complete+0xb1/0x1e0
> > > [ 9650.590630] RSP: 0018:88007f803cf8  EFLAGS: 00010003
> > > [ 9650.590630] RAX: 0086 RBX: 8800688ae000 RCX: 
> > > 00d6d6d6
> > > [ 9650.590630] RDX: 6b6b6b6b6b6b6b6b RSI: 1000 RDI: 
> > > 8800688ae1c4
> > > [ 9650.590630] RBP: 88007f803d28 R08:  R09: 
> > > 6b6b6b6c
> > > [ 9650.590630] R10: ff79a1a35779b009 R11: 0010 R12: 
> > > 88006953a540
> > > [ 9650.590630] R13: 1000 R14:  R15: 
> > > 8800688ae1c4
> > > [ 9650.590630] FS:  () GS:88007f80() 
> > > knlGS:
> > > [ 9650.590630] CS:  0010 DS:  ES:  CR0: 8005003b
> > > [ 9650.590630] CR2: 7ffa9c48448f CR3: 28061000 CR4: 
> > > 06f0
> > > [ 9650.590630] Stack:
> > > [ 9650.590630]  88007f803d18 1000 880068b80800 
> > > 1000
> > > [ 9650.590630]  0001 31f6d000 88007f803d68 
> > > 811ddae3
> > > [ 9650.590630]  0086 880068b80800  
> > > 880068b80828
> > > [ 9650.590630] Call Trace:
> > > [ 9650.590630]  
> > > [ 9650.590630]  [] dio_complete+0xa3/0x140
> > > [ 9650.590630]  [] dio_bio_end_aio+0x7a/0x110
> > > [ 9650.590630]  [] ? dio_bio_end_aio+0x5/0x110
> > > [ 9650.590630]  [] bio_endio+0x1d/0x30
> > > [ 9650.590630]  [] blk_mq_complete_request+0x5f/0x120
> > > [ 9650.590630]  [] __blk_mq_end_io+0x16/0x20
> > > [ 9650.590630]  [] blk_mq_end_io+0x68/0xd0
> > > [ 9650.590630]  [] virtblk_done+0x67/0x110
> > > [ 9650.590630]  [] vring_interrupt+0x35/0x60
> > > [ 9650.590630]  [] handle_irq_event_percpu+0x54/0x1e0
> > > [ 9650.590630]  [] handle_irq_event+0x48/0x70
> > > [ 9650.590630]  [] ? kvm_guest_apic_eoi_write+0x5/0x50
> > > [ 9650.590630]  [] handle_edge_irq+0x77/0x110
> > > [ 9650.590630]  [] handle_irq+0xbf/0x150
> > > [ 9650.590630]  [] ? handle_irq+0x5/0x150
> > > [ 9650.590630]  [] ? 
> > > atomic_notifier_call_chain+0x16/0x20
> > > [ 9650.590630]  [] do_IRQ+0x5a/0xe0
> > > [ 9650.590630]  [] common_interrupt+0x6d/0x6d
> > > [ 9650.590630]  
> > > [ 9650.590630]  [] ? native_safe_halt+0x6/0x10
> > > [ 9650.590630]  [] ? default_idle+0x5/0xc0
> > > [ 9650.590630]  [] default_idle+0x1f/0xc0
> > > [ 9650.590630]  [] arch_cpu_idle+0x26/0x30
> > > [ 9650.590630]  [] cpu_startup_entry+0x81/0x240
> > > [ 9650.590630]  [] rest_init+0x77/0x80
> > > [ 9650.590630]  [] start_kernel+0x3cd/0x3da
> > > [ 9650.590630]  [] ? repair_env_string+0x5e/0x5e
> > > [ 9650.590630]  [] ? early_idt_handlers+0x117/0x120
> > > [ 9650.590630]  [] x86_64_start_reservations+0x2a/0x2c
> > > [ 9650.590630]  [] x86_64_start_kernel+0x146/0x155
> > > [ 9650.590630] Code: e8 05 b5 8f 00 44 8b 8b c0 01 00 00 45 31 c0 48 8b 
> > > 93 98 00 00 00 41 83 c1 01 44 3b 8b 80 00 00 00 44 89 c9 45 0f 42 c1 c1 
> > > e9 07 <48> 8b 0c ca 65
> > > [ 9650.590630] RIP  [] aio_complete+0xb1/0x1e0
> > > [ 9650.590630]  RSP 
> > > 
> > > I'm not sure if it is related to blk_mq, virtio, dio, or bio changes
> > > (or even something else), but I haven't seen anything like this
> > > before. I've only seen it once so far (haven't rerun the test yet at
> > > all).
> > 
> > And I just hit this from running xfs_repair which is doing
> > multithreaded direct IO directly on /dev/vdc:
> > 
> > [ 1776.508599] BUG: unable to handle kernel NULL pointer dereference at 
> > 0328
> > [ 1776.510446] IP: [] blk_account_io_done+0x6a/0x180
> > [ 1776.511762] PGD 38e75b067 PUD 41987d067 PMD 0
> > [ 1776.512577] Oops:  [#1] SMP
> > [ 1776.512577] Modules linked in:
> > [ 1776.512577] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.13.0-rc2-dgc+ #75
> > [ 1776.512577] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> > [ 1776.512577] task: 81f37480 ti: 81f26000 task.ti: 
> > 81f26000
> > [ 1776.512577] RIP: 0010:[]  [] 
> > blk_account_io_done+0x6a/0x180
> > [ 1776.512577] RSP: :88011bc03da8  EFLAGS: 00010046
> > [ 1776.512577] RAX: 1000 RBX:  RCX: 
> > 
> > [ 1776.512577] RDX:  RSI: 

Re: [OOPS, 3.13-rc2] null ptr in dio_complete()

2013-12-03 Thread Dave Chinner
On Wed, Dec 04, 2013 at 12:58:38PM +1100, Dave Chinner wrote:
> On Wed, Dec 04, 2013 at 08:59:40AM +1100, Dave Chinner wrote:
> > Hi Jens,
> > 
> > Not sure who to direct this to or CC, so I figured you are the
> > person to do that. I just had xfstests generic/299 (an AIO/DIO test)
> > oops in dio_complete() like so:
> > 
> > [ 9650.586724] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
> > ....
> 
> And I just hit this from running xfs_repair which is doing
> multithreaded direct IO directly on /dev/vdc:
> 
> [ 1776.508599] BUG: unable to handle kernel NULL pointer dereference at 
> 0000000000000328
> [ 1776.510446] IP: [<ffffffff81755b6a>] blk_account_io_done+0x6a/0x180
> ....
> 
> So this is looking like another virtio+blk_mq problem

This one is definitely reproducible. Just hit it again...

Re: [OOPS, 3.13-rc2] null ptr in dio_complete()

2013-12-03 Thread Dave Chinner
On Wed, Dec 04, 2013 at 08:59:40AM +1100, Dave Chinner wrote:
> Hi Jens,
> 
> Not sure who to direct this to or CC, so I figured you are the
> person to do that. I just had xfstests generic/299 (an AIO/DIO test)
> oops in dio_complete() like so:
> 
> [ 9650.586724] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
> ....
> 
> I'm not sure if it is related to blk_mq, virtio, dio or bio changes
> (or even something else), but I haven't seen anything like this
> before. I've only seen it once so far (haven't rerun the test yet at
> all).

And I just hit this from running xfs_repair which is doing
multithreaded direct IO directly on /dev/vdc:

[ 1776.508599] BUG: unable to handle kernel NULL pointer dereference at 
0000000000000328
[ 1776.510446] IP: [<ffffffff81755b6a>] blk_account_io_done+0x6a/0x180
[ 1776.511762] PGD 38e75b067 PUD 41987d067 PMD 0
[ 1776.512577] Oops: 0000 [#1] SMP
[ 1776.512577] Modules linked in:
[ 1776.512577] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.13.0-rc2-dgc+ #75
[ 1776.512577] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 1776.512577] task: 81f37480 ti: 81f26000 task.ti: 
81f26000
[ 1776.512577] RIP: 0010:[<ffffffff81755b6a>]  [<ffffffff81755b6a>] 
blk_account_io_done+0x6a/0x180
[ 1776.512577] RSP: :88011bc03da8  EFLAGS: 00010046
[ 1776.512577] RAX: 1000 RBX:  RCX: 
[ 1776.512577] RDX:  RSI: 0002 RDI: 8800dba59fc0
[ 1776.512577] RBP: 88011bc03db8 R08:  R09: 88021a828928
[ 1776.512577] R10:  R11: 0001 R12: 
[ 1776.512577] R13:  R14: 2000 R15: 
[ 1776.512577] FS:  () GS:88011bc0() 
knlGS:
[ 1776.512577] CS:  0010 DS:  ES:  CR0: 8005003b
[ 1776.512577] CR2: 0328 CR3: 00041a4b CR4: 06f0
....

So this is looking like another virtio+blk_mq problem
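
If a reproducer outside of xfs_repair helps: the workload is just a
handful of threads doing concurrent O_DIRECT reads of the same block
device. A hypothetical skeleton of that pattern (this is not the
xfs_repair prefetch code; the device name, thread count and block
size are invented for illustration):

#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

#define NTHREADS	8
#define BLKSZ		4096
#define NBLOCKS		(1 << 20)

static void *reader(void *unused)
{
	void *buf;
	long i;
	int fd;

	fd = open("/dev/vdc", O_RDONLY | O_DIRECT);
	if (fd < 0 || posix_memalign(&buf, BLKSZ, BLKSZ))
		return NULL;

	/* random single-block reads, aligned as O_DIRECT requires */
	for (i = 0; i < 100000; i++)
		if (pread(fd, buf, BLKSZ,
			  (random() % NBLOCKS) * (off_t)BLKSZ) < 0)
			break;

	free(buf);
	close(fd);
	return NULL;
}

int main(void)
{
	pthread_t t[NTHREADS];
	int i;

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&t[i], NULL, reader, NULL);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(t[i], NULL);
	return 0;
}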


[OOPS, 3.13-rc2] null ptr in dio_complete()

2013-12-03 Thread Dave Chinner
Hi Jens,

Not sure who to direct this to or CC, so I figured you are the
person to do that. I just had xfstests generic/299 (an AIO/DIO test)
oops in dio_complete() like so:

[ 9650.586724] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 9650.590131] Modules linked in:
[ 9650.590630] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.13.0-rc2-dgc+ #73
[ 9650.590630] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 9650.590630] task: 81f35480 ti: 81f24000 task.ti: 
81f24000
[ 9650.590630] RIP: 0010:[<ffffffff811ee5f1>]  [<ffffffff811ee5f1>] 
aio_complete+0xb1/0x1e0
[ 9650.590630] RSP: 0018:88007f803cf8  EFLAGS: 00010003
[ 9650.590630] RAX: 0086 RBX: 8800688ae000 RCX: 00d6d6d6
[ 9650.590630] RDX: 6b6b6b6b6b6b6b6b RSI: 1000 RDI: 8800688ae1c4
[ 9650.590630] RBP: 88007f803d28 R08:  R09: 6b6b6b6c
[ 9650.590630] R10: ff79a1a35779b009 R11: 0010 R12: 88006953a540
[ 9650.590630] R13: 1000 R14:  R15: 8800688ae1c4
[ 9650.590630] FS:  () GS:88007f80() 
knlGS:
[ 9650.590630] CS:  0010 DS:  ES:  CR0: 8005003b
[ 9650.590630] CR2: 7ffa9c48448f CR3: 28061000 CR4: 06f0
[ 9650.590630] Stack:
[ 9650.590630]  88007f803d18 1000 880068b80800 
1000
[ 9650.590630]  0001 31f6d000 88007f803d68 
811ddae3
[ 9650.590630]  0086 880068b80800  
880068b80828
[ 9650.590630] Call Trace:
[ 9650.590630]  <IRQ> 
[ 9650.590630]  [<ffffffff811ddae3>] dio_complete+0xa3/0x140
[ 9650.590630]  [<ffffffff811ddc2a>] dio_bio_end_aio+0x7a/0x110
[ 9650.590630]  [<ffffffff811ddbb5>] ? dio_bio_end_aio+0x5/0x110
[ 9650.590630]  [<ffffffff811d8a9d>] bio_endio+0x1d/0x30
[ 9650.590630]  [<ffffffff8175d65f>] blk_mq_complete_request+0x5f/0x120
[ 9650.590630]  [<ffffffff8175d736>] __blk_mq_end_io+0x16/0x20
[ 9650.590630]  [<ffffffff8175d7a8>] blk_mq_end_io+0x68/0xd0
[ 9650.590630]  [<ffffffff818539a7>] virtblk_done+0x67/0x110
[ 9650.590630]  [<ffffffff817f74c5>] vring_interrupt+0x35/0x60
[ 9650.590630]  [<ffffffff810e4e34>] handle_irq_event_percpu+0x54/0x1e0
[ 9650.590630]  [<ffffffff810e5008>] handle_irq_event+0x48/0x70
[ 9650.590630]  [<ffffffff8107d6f5>] ? kvm_guest_apic_eoi_write+0x5/0x50
[ 9650.590630]  [<ffffffff810e7807>] handle_edge_irq+0x77/0x110
[ 9650.590630]  [<ffffffff8104796f>] handle_irq+0xbf/0x150
[ 9650.590630]  [<ffffffff810478b5>] ? handle_irq+0x5/0x150
[ 9650.590630]  [<ffffffff81aee136>] ? atomic_notifier_call_chain+0x16/0x20
[ 9650.590630]  [<ffffffff81af43ca>] do_IRQ+0x5a/0xe0
[ 9650.590630]  [<ffffffff81aea16d>] common_interrupt+0x6d/0x6d
[ 9650.590630]  <EOI> 
[ 9650.590630]  [<ffffffff8107e616>] ? native_safe_halt+0x6/0x10
[ 9650.590630]  [<ffffffff8104e4f5>] ? default_idle+0x5/0xc0
[ 9650.590630]  [<ffffffff8104e50f>] default_idle+0x1f/0xc0
[ 9650.590630]  [<ffffffff8104edb6>] arch_cpu_idle+0x26/0x30
[ 9650.590630]  [<ffffffff810e4331>] cpu_startup_entry+0x81/0x240
[ 9650.590630]  [<ffffffff81ad5d67>] rest_init+0x77/0x80
[ 9650.590630]  [<ffffffff8203ceec>] start_kernel+0x3cd/0x3da
[ 9650.590630]  [<ffffffff8203c91e>] ? repair_env_string+0x5e/0x5e
[ 9650.590630]  [<ffffffff8203c117>] ? early_idt_handlers+0x117/0x120
[ 9650.590630]  [<ffffffff8203c5c2>] x86_64_start_reservations+0x2a/0x2c
[ 9650.590630]  [<ffffffff8203c70a>] x86_64_start_kernel+0x146/0x155
[ 9650.590630] Code: e8 05 b5 8f 00 44 8b 8b c0 01 00 00 45 31 c0 48 8b 93 98 
00 00 00 41 83 c1 01 44 3b 8b 80 00 00 00 44 89 c9 45 0f 42 c1 c1 e9 07 <48> 8b 
0c ca 65
[ 9650.590630] RIP  [<ffffffff811ee5f1>] aio_complete+0xb1/0x1e0
[ 9650.590630]  RSP <ffff88007f803cf8>

I'm not sure if it is related to blk_mq, virtio, dio or bio changes
(or even something else), but I haven't seen anything like this
before. I've only seen it once so far (haven't rerun the test yet at
all).

FWIW, I'd also like to point out that I'm seeing XFS on the
virtio stack get perilously close to blowing the stack on 3.13-rc2:

[  387.048771] fill (11196) used greatest stack depth: 1048 bytes left

This is new - on 3.12 the output would generally show around 2500
bytes left, so somewhere below XFS there is a lot more stack being
consumed in the IO path on 3.13. I'll do some more tracing of this
to try to find out where it is all going
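
For reference, that message comes from CONFIG_DEBUG_STACK_USAGE, which
at task exit scans the thread stack for the lowest byte that was ever
dirtied; "bytes left" is the untouched headroom below it. A rough
userspace model of that accounting - purely illustrative, not the
kernel code:

#include <stdio.h>

/* scan from the low end of the stack for the first dirtied byte */
static unsigned long stack_not_used(const unsigned char *stack,
				    unsigned long size)
{
	unsigned long n = 0;

	while (n < size && stack[n] == 0)
		n++;
	return n;
}

int main(void)
{
	/* model an 8k thread stack where all but 1048 bytes got used */
	static unsigned char stack[8192];

	stack[1048] = 0xff;
	printf("%lu bytes left\n", stack_not_used(stack, sizeof(stack)));
	return 0;
}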

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [OOPS, 3.13-rc2] null ptr in dio_complete()

2013-12-03 Thread Jens Axboe
On Wed, Dec 04 2013, Dave Chinner wrote:
> On Wed, Dec 04, 2013 at 12:58:38PM +1100, Dave Chinner wrote:
> > On Wed, Dec 04, 2013 at 08:59:40AM +1100, Dave Chinner wrote:
> > > Hi Jens,
> > > 
> > > Not sure who to direct this to or CC, so I figured you are the
> > > person to do that. I just had xfstests generic/299 (an AIO/DIO test)
> > > oops in dio_complete() like so:
> > > 
> > > [ 9650.586724] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
> > > ....
> > 
> > And I just hit this from running xfs_repair which is doing
> > multithreaded direct IO directly on /dev/vdc:
> > 
> > [ 1776.510446] IP: [<ffffffff81755b6a>] blk_account_io_done+0x6a/0x180
> > ....
> > 
> > So this is looking like another virtio+blk_mq problem
> 
> This one is definitely reproducible. Just hit it again...

I'll take a look at this. You don't happen to have gdb dumps of the
lines associated with those crashes? Just to save me some digging
time...

Re: [OOPS, 3.13-rc2] null ptr in dio_complete()

2013-12-03 Thread Dave Chinner
On Tue, Dec 03, 2013 at 08:47:12PM -0700, Jens Axboe wrote:
> On Wed, Dec 04 2013, Dave Chinner wrote:
> > On Wed, Dec 04, 2013 at 12:58:38PM +1100, Dave Chinner wrote:
> > > And I just hit this from running xfs_repair which is doing
> > > multithreaded direct IO directly on /dev/vdc:
> > > 
> > > [ 1776.510446] IP: [<ffffffff81755b6a>] blk_account_io_done+0x6a/0x180
> > > ....
> > > 
> > > So this is looking like another virtio+blk_mq problem
> > 
> > This one is definitely reproducible. Just hit it again...
> 
> I'll take a look at this. You don't happen to have gdb dumps of the
> lines associated with those crashes? Just to save me some digging
> time...

Only this:

(gdb) l *(dio_complete+0xa3)
0xffffffff811ddae3 is in dio_complete (fs/direct-io.c:282).
277		}
278	
279		aio_complete(dio->iocb, ret, 0);
280	}
281	
282	kmem_cache_free(dio_cache, dio);
283	return ret;
284	}
285	
286	static void dio_aio_complete_work(struct work_struct *work)

And this:

(gdb) l *(blk_account_io_done+0x6a)
0xffffffff81755b6a is in blk_account_io_done (block/blk-core.c:2049).
2044		int cpu;
2045	
2046		cpu = part_stat_lock();
2047		part = req->part;
2048	
2049		part_stat_inc(cpu, part, ios[rw]);
2050		part_stat_add(cpu, part, ticks[rw], duration);
2051		part_round_stats(cpu, part);
2052		part_dec_in_flight(part, rw);
2053	

as I've rebuilt the kernel with different patches since the one
running on the machine that is triggering the problem.
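
FWIW, both signatures look to me like completion code touching memory
that has already been freed: the RDX value of 0x6b6b6b6b6b6b6b6b in
the aio_complete() GPF is the slab poison pattern, and a fault address
of 0x328 reads like a NULL struct pointer plus the offset of a late
field. A quick userspace sketch of that arithmetic (the layout below
is invented for illustration; it is not the real struct request):

#include <stdio.h>
#include <stddef.h>
#include <string.h>

#define POISON_FREE	0x6b	/* byte written over freed slab objects */

/* invented stand-in layout, sized so the pointer lands at 0x328 */
struct fake_request {
	char earlier_fields[0x328];
	void *part;		/* the pointer the accounting chases */
};

int main(void)
{
	unsigned long long ptr;

	/* a pointer loaded from a freed, poisoned object */
	memset(&ptr, POISON_FREE, sizeof(ptr));
	printf("poisoned pointer: 0x%llx\n", ptr);

	/* a NULL base plus the offset of a late field */
	printf("NULL deref offset: %#zx\n",
	       offsetof(struct fake_request, part));
	return 0;
}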

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/