Re: bio linked list corruption.

2016-12-06 Thread Linus Torvalds
On Tue, Dec 6, 2016 at 12:16 AM, Peter Zijlstra wrote: >> >> Of course, I'm really hoping that this shmem.c use is the _only_ such >> case. But I doubt it. > > $ git grep DECLARE_WAIT_QUEUE_HEAD_ONSTACK | wc -l > 28 Hmm. Most of them seem to be ok, because they use "wait_event()", which will

Re: bio linked list corruption.

2016-12-06 Thread Vegard Nossum
On 5 December 2016 at 22:33, Vegard Nossum wrote: > On 5 December 2016 at 21:35, Linus Torvalds > wrote: >> Note for Ingo and Peter: this patch has not been tested at all. But >> Vegard did test an earlier patch of mine that just verified that yes, >> the issue really was that wait queue entries

Re: bio linked list corruption.

2016-12-06 Thread Ingo Molnar
* Peter Zijlstra wrote: > $ git grep DECLARE_WAIT_QUEUE_HEAD_ONSTACK | wc -l > 28 This debug facility looks sensible. A couple of minor suggestions: > --- a/include/linux/wait.h > +++ b/include/linux/wait.h > @@ -39,6 +39,9 @@ struct wait_bit_queue { > struct __wait_queue_head { >

Re: bio linked list corruption.

2016-12-06 Thread Peter Zijlstra
On Mon, Dec 05, 2016 at 12:35:52PM -0800, Linus Torvalds wrote: > Adding the scheduler people to the participants list, and re-attaching > the patch, because while this patch is internal to the VM code, the > issue itself is not. > > There might well be other cases where somebody goes

Re: bio linked list corruption.

2016-12-05 Thread Vegard Nossum
On 5 December 2016 at 21:35, Linus Torvalds wrote: > Note for Ingo and Peter: this patch has not been tested at all. But > Vegard did test an earlier patch of mine that just verified that yes, > the issue really was that wait queue entries remained on the wait > queue head just as we were about

Re: bio linked list corruption.

2016-12-05 Thread Linus Torvalds
Adding the scheduler people to the participants list, and re-attaching the patch, because while this patch is internal to the VM code, the issue itself is not. There might well be other cases where somebody goes "wake_up_all()" will wake everybody up, so I can put the wait queue head on the

Re: bio linked list corruption.

2016-12-05 Thread Vegard Nossum
On 5 December 2016 at 20:11, Vegard Nossum wrote: > On 5 December 2016 at 18:55, Linus Torvalds > wrote: >> On Mon, Dec 5, 2016 at 9:09 AM, Vegard Nossum >> wrote: >> Since you apparently can recreate this fairly easily, how about trying >> this stupid patch? >> >> NOTE! This is entirely

Re: bio linked list corruption.

2016-12-05 Thread Linus Torvalds
On Mon, Dec 5, 2016 at 11:11 AM, Vegard Nossum wrote: > > [ cut here ] > WARNING: CPU: 22 PID: 14012 at mm/shmem.c:2668 shmem_fallocate+0x9a7/0xac0 Ok, good. So that's confirmed as the cause of this problem. And the call chain that I wanted is obviously completely

Re: bio linked list corruption.

2016-12-05 Thread Vegard Nossum
On 5 December 2016 at 18:55, Linus Torvalds wrote: > On Mon, Dec 5, 2016 at 9:09 AM, Vegard Nossum wrote: >> >> The warning shows that it made it past the list_empty_careful() check >> in finish_wait() but then bugs out on the >task_list >> dereference. >> >> Anything stick out? > > I hate that

Re: bio linked list corruption.

2016-12-05 Thread Vegard Nossum
On 5 December 2016 at 19:11, Andy Lutomirski wrote: > On Sun, Dec 4, 2016 at 3:04 PM, Vegard Nossum wrote: >> On 23 November 2016 at 20:58, Dave Jones wrote: >>> On Wed, Nov 23, 2016 at 02:34:19PM -0500, Dave Jones wrote: >>> >>> > [ 317.689216] BUG: Bad page state in process kworker/u8:8

Re: bio linked list corruption.

2016-12-05 Thread Linus Torvalds
On Mon, Dec 5, 2016 at 10:11 AM, Andy Lutomirski wrote: > > So your kernel has been smp-alternatived. That 3e comes from > alternatives_smp_unlock. If you're running on SMP with UP > alternatives, things will break. I'm assuming he's just running in a VM with a single CPU. The problem that I

Re: bio linked list corruption.

2016-12-05 Thread Andy Lutomirski
On Sun, Dec 4, 2016 at 3:04 PM, Vegard Nossum wrote: > On 23 November 2016 at 20:58, Dave Jones wrote: >> On Wed, Nov 23, 2016 at 02:34:19PM -0500, Dave Jones wrote: >> >> > [ 317.689216] BUG: Bad page state in process kworker/u8:8 pfn:4d8fd4 >> > trace from just before this happened. Does

Re: bio linked list corruption.

2016-12-05 Thread Linus Torvalds
On Mon, Dec 5, 2016 at 9:09 AM, Vegard Nossum wrote: > > The warning shows that it made it past the list_empty_careful() check > in finish_wait() but then bugs out on the >task_list > dereference. > > Anything stick out? I hate that shmem waitqueue garbage. It's really subtle. I think the

Re: bio linked list corruption.

2016-12-05 Thread Dave Jones
On Mon, Dec 05, 2016 at 06:09:29PM +0100, Vegard Nossum wrote: > On 5 December 2016 at 12:10, Vegard Nossum wrote: > > On 5 December 2016 at 00:04, Vegard Nossum wrote: > >> FWIW I hit this as well: > >> > >> BUG: unable to handle kernel paging request at 81ff08b7 > >> IP: []

Re: bio linked list corruption.

2016-12-05 Thread Vegard Nossum
On 5 December 2016 at 12:10, Vegard Nossum wrote: > On 5 December 2016 at 00:04, Vegard Nossum wrote: >> FWIW I hit this as well: >> >> BUG: unable to handle kernel paging request at 81ff08b7 >> IP: [] __lock_acquire.isra.32+0xda/0x1a30 >> CPU: 0 PID: 21744 Comm: trinity-c56 Tainted: G

Re: bio linked list corruption.

2016-12-05 Thread Vegard Nossum
On 5 December 2016 at 00:04, Vegard Nossum wrote: > FWIW I hit this as well: > > BUG: unable to handle kernel paging request at 81ff08b7 > IP: [] __lock_acquire.isra.32+0xda/0x1a30 > CPU: 0 PID: 21744 Comm: trinity-c56 Tainted: GB 4.9.0-rc7+ #217 [...] > I think you can

Re: bio linked list corruption.

2016-12-04 Thread Vegard Nossum
On 23 November 2016 at 20:58, Dave Jones wrote: > On Wed, Nov 23, 2016 at 02:34:19PM -0500, Dave Jones wrote: > > > [ 317.689216] BUG: Bad page state in process kworker/u8:8 pfn:4d8fd4 > > trace from just before this happened. Does this shed any light ? > > > >

Re: bio linked list corruption.

2016-11-23 Thread Dave Jones
On Wed, Nov 23, 2016 at 02:34:19PM -0500, Dave Jones wrote: > [ 317.689216] BUG: Bad page state in process kworker/u8:8 pfn:4d8fd4 > trace from just before this happened. Does this shed any light ? > > https://codemonkey.org.uk/junk/trace.txt crap, I just noticed the timestamps in the

Re: bio linked list corruption.

2016-11-23 Thread Dave Jones
On Mon, Oct 31, 2016 at 01:44:55PM -0600, Chris Mason wrote: > On Mon, Oct 31, 2016 at 12:35:16PM -0700, Linus Torvalds wrote: > >On Mon, Oct 31, 2016 at 11:55 AM, Dave Jones > >wrote: > >> > >> BUG: Bad page state in process kworker/u8:12 pfn:4e0e39 > >> page:ea0013838e40 count:0

Re: bio linked list corruption.

2016-10-31 Thread Chris Mason
On Mon, Oct 31, 2016 at 12:35:16PM -0700, Linus Torvalds wrote: On Mon, Oct 31, 2016 at 11:55 AM, Dave Jones wrote: BUG: Bad page state in process kworker/u8:12 pfn:4e0e39 page:ea0013838e40 count:0 mapcount:0 mapping:8804a20310e0 index:0x100c flags:

Re: bio linked list corruption.

2016-10-31 Thread Linus Torvalds
On Mon, Oct 31, 2016 at 11:55 AM, Dave Jones wrote: > > BUG: Bad page state in process kworker/u8:12 pfn:4e0e39 > page:ea0013838e40 count:0 mapcount:0 mapping:8804a20310e0 index:0x100c > flags: 0x400c(referenced|uptodate) > page dumped because: non-NULL mapping Hmm. So this

Re: bio linked list corruption.

2016-10-31 Thread Dave Jones
On Wed, Oct 26, 2016 at 07:47:51PM -0400, Dave Jones wrote: > On Wed, Oct 26, 2016 at 07:38:08PM -0400, Chris Mason wrote: > > > >-hctx->queued++; > > >-data->hctx = hctx; > > >-data->ctx = ctx; > > >+data->hctx = alloc_data.hctx; > > >+

Re: bio linked list corruption.

2016-10-27 Thread Dave Jones
On Thu, Oct 27, 2016 at 04:41:33PM +1100, Dave Chinner wrote: > And that's indicative of a delalloc metadata reservation being > being too small and so we're allocating unreserved blocks. > > Different symptoms, same underlying cause, I think. > > I see the latter assert from time to

Re: bio linked list corruption.

2016-10-27 Thread Jens Axboe
On 10/27/2016 10:34 AM, Linus Torvalds wrote: On Wed, Oct 26, 2016 at 11:33 PM, Christoph Hellwig wrote: Dave, can you hit the warnings with this? Totally untested... Can we just kill off the unhelpful blk_map_ctx structure, e.g.: Yeah, I found that hard to read too. The difference between

Re: bio linked list corruption.

2016-10-27 Thread Linus Torvalds
On Wed, Oct 26, 2016 at 11:33 PM, Christoph Hellwig wrote: >> Dave, can you hit the warnings with this? Totally untested... > > Can we just kill off the unhelpful blk_map_ctx structure, e.g.: Yeah, I found that hard to read too. The difference between blk_map_ctx and blk_mq_alloc_data is

Re: bio linked list corruption.

2016-10-27 Thread Chris Mason
On 10/26/2016 08:00 PM, Jens Axboe wrote: > On 10/26/2016 05:47 PM, Dave Jones wrote: >> On Wed, Oct 26, 2016 at 07:38:08PM -0400, Chris Mason wrote: >> >> > >-hctx->queued++; >> > >-data->hctx = hctx; >> > >-data->ctx = ctx; >> > >+data->hctx = alloc_data.hctx; >> > >+

Re: bio linked list corruption.

2016-10-27 Thread Christoph Hellwig
> Dave, can you hit the warnings with this? Totally untested... Can we just kill off the unhelpful blk_map_ctx structure, e.g.: diff --git a/block/blk-mq.c b/block/blk-mq.c index ddc2eed..d74a74a 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -1190,21 +1190,15 @@ static inline bool

Re: bio linked list corruption.

2016-10-26 Thread Dave Chinner
On Tue, Oct 25, 2016 at 08:27:52PM -0400, Dave Jones wrote: > DaveC: Do these look like real problems, or is this more "looks like > random memory corruption" ? It's been a while since I did some stress > testing on XFS, so these might not be new.. > > XFS: Assertion failed: oldlen > newlen,

Re: bio linked list corruption.

2016-10-26 Thread Jens Axboe
On 10/26/2016 05:47 PM, Dave Jones wrote: On Wed, Oct 26, 2016 at 07:38:08PM -0400, Chris Mason wrote: > >- hctx->queued++; > >- data->hctx = hctx; > >- data->ctx = ctx; > >+ data->hctx = alloc_data.hctx; > >+ data->ctx = alloc_data.ctx; > >+ data->hctx->queued++;

Re: bio linked list corruption.

2016-10-26 Thread Dave Jones
On Wed, Oct 26, 2016 at 07:38:08PM -0400, Chris Mason wrote: > >- hctx->queued++; > >- data->hctx = hctx; > >- data->ctx = ctx; > >+ data->hctx = alloc_data.hctx; > >+ data->ctx = alloc_data.ctx; > >+ data->hctx->queued++; > >return rq; > > } > > This made it through

Re: bio linked list corruption.

2016-10-26 Thread Chris Mason
On Wed, Oct 26, 2016 at 05:20:01PM -0600, Jens Axboe wrote: On 10/26/2016 05:08 PM, Linus Torvalds wrote: On Wed, Oct 26, 2016 at 4:03 PM, Jens Axboe wrote: Actually, I think I see what might trigger it. You are on nvme, iirc, and that has a deep queue. Yes. I have long since moved on from

Re: bio linked list corruption.

2016-10-26 Thread Jens Axboe
On 10/26/2016 05:19 PM, Chris Mason wrote: On Wed, Oct 26, 2016 at 05:03:45PM -0600, Jens Axboe wrote: On 10/26/2016 04:58 PM, Linus Torvalds wrote: On Wed, Oct 26, 2016 at 3:51 PM, Linus Torvalds wrote: Dave: it might be a good idea to split that "WARN_ON_ONCE()" in blk_mq_merge_queue_io()

Re: bio linked list corruption.

2016-10-26 Thread Chris Mason
On Wed, Oct 26, 2016 at 05:03:45PM -0600, Jens Axboe wrote: On 10/26/2016 04:58 PM, Linus Torvalds wrote: On Wed, Oct 26, 2016 at 3:51 PM, Linus Torvalds wrote: Dave: it might be a good idea to split that "WARN_ON_ONCE()" in blk_mq_merge_queue_io() into two I did that myself too, since

Re: bio linked list corruption.

2016-10-26 Thread Jens Axboe
On 10/26/2016 05:08 PM, Linus Torvalds wrote: On Wed, Oct 26, 2016 at 4:03 PM, Jens Axboe wrote: Actually, I think I see what might trigger it. You are on nvme, iirc, and that has a deep queue. Yes. I have long since moved on from slow disks, so all my systems are not just flash, but m.2

Re: bio linked list corruption.

2016-10-26 Thread Chris Mason
On Wed, Oct 26, 2016 at 03:07:10PM -0700, Linus Torvalds wrote: On Wed, Oct 26, 2016 at 1:00 PM, Chris Mason wrote: Today I turned off every CONFIG_DEBUG_* except for list debugging, and ran dbench 2048: [ 2759.118711] WARNING: CPU: 2 PID: 31039 at lib/list_debug.c:33 __list_add+0xbe/0xd0 [

Re: bio linked list corruption.

2016-10-26 Thread Linus Torvalds
On Wed, Oct 26, 2016 at 4:03 PM, Jens Axboe wrote: > > Actually, I think I see what might trigger it. You are on nvme, iirc, > and that has a deep queue. Yes. I have long since moved on from slow disks, so all my systems are not just flash, but m.2 nvme ssd's. So at least that could explain why

Re: bio linked list corruption.

2016-10-26 Thread Dave Jones
On Wed, Oct 26, 2016 at 05:03:45PM -0600, Jens Axboe wrote: > On 10/26/2016 04:58 PM, Linus Torvalds wrote: > > On Wed, Oct 26, 2016 at 3:51 PM, Linus Torvalds > > wrote: > >> > >> Dave: it might be a good idea to split that "WARN_ON_ONCE()" in > >> blk_mq_merge_queue_io() into two > > >

Re: bio linked list corruption.

2016-10-26 Thread Jens Axboe
On 10/26/2016 04:58 PM, Linus Torvalds wrote: On Wed, Oct 26, 2016 at 3:51 PM, Linus Torvalds wrote: Dave: it might be a good idea to split that "WARN_ON_ONCE()" in blk_mq_merge_queue_io() into two I did that myself too, since Dave sees this during boot. But I'm not getting the warning ;(

Re: bio linked list corruption.

2016-10-26 Thread Jens Axboe
On 10/26/2016 05:01 PM, Dave Jones wrote: On Wed, Oct 26, 2016 at 03:51:01PM -0700, Linus Torvalds wrote: > Dave: it might be a good idea to split that "WARN_ON_ONCE()" in > blk_mq_merge_queue_io() into two, since right now it can trigger both > for the > >

Re: bio linked list corruption.

2016-10-26 Thread Dave Jones
On Wed, Oct 26, 2016 at 03:51:01PM -0700, Linus Torvalds wrote: > Dave: it might be a good idea to split that "WARN_ON_ONCE()" in > blk_mq_merge_queue_io() into two, since right now it can trigger both > for the > > blk_mq_bio_to_request(rq, bio); > > path _and_ for the >

Re: bio linked list corruption.

2016-10-26 Thread Linus Torvalds
On Wed, Oct 26, 2016 at 3:51 PM, Linus Torvalds wrote: > > Dave: it might be a good idea to split that "WARN_ON_ONCE()" in > blk_mq_merge_queue_io() into two I did that myself too, since Dave sees this during boot. But I'm not getting the warning ;( Dave gets it with ext4, and thats' what I

Re: bio linked list corruption.

2016-10-26 Thread Jens Axboe
On 10/26/2016 04:51 PM, Linus Torvalds wrote: On Wed, Oct 26, 2016 at 3:40 PM, Dave Jones wrote: I gave it a shot too for shits & giggles. This falls out during boot. [9.278420] WARNING: CPU: 0 PID: 1 at block/blk-mq.c:1181 blk_sq_make_request+0x465/0x4a0 Hmm. That's the

Re: bio linked list corruption.

2016-10-26 Thread Jens Axboe
On 10/26/2016 04:40 PM, Dave Jones wrote: On Wed, Oct 26, 2016 at 03:21:53PM -0700, Linus Torvalds wrote: > Could you try the attached patch? It adds a couple of sanity tests: > > - a number of tests to verify that 'rq->queuelist' isn't already on > some queue when it is added to a queue

Re: bio linked list corruption.

2016-10-26 Thread Linus Torvalds
On Wed, Oct 26, 2016 at 3:40 PM, Dave Jones wrote: > > I gave it a shot too for shits & giggles. > This falls out during boot. > > [9.278420] WARNING: CPU: 0 PID: 1 at block/blk-mq.c:1181 > blk_sq_make_request+0x465/0x4a0 Hmm. That's the WARN_ON_ONCE(rq->mq_ctx != ctx); that I added

Re: bio linked list corruption.

2016-10-26 Thread Dave Jones
On Wed, Oct 26, 2016 at 03:21:53PM -0700, Linus Torvalds wrote: > Could you try the attached patch? It adds a couple of sanity tests: > > - a number of tests to verify that 'rq->queuelist' isn't already on > some queue when it is added to a queue > > - one test to verify that rq->mq_ctx

Re: bio linked list corruption.

2016-10-26 Thread Linus Torvalds
On Wed, Oct 26, 2016 at 2:52 PM, Chris Mason wrote: > > This one is special because CONFIG_VMAP_STACK is not set. Btrfs triggers in > < 10 minutes. > I've done 30 minutes each with XFS and Ext4 without luck. Ok, see the email I wrote that crossed yours - if it's really some list corruption on

Re: bio linked list corruption.

2016-10-26 Thread Linus Torvalds
On Wed, Oct 26, 2016 at 1:00 PM, Chris Mason wrote: > > Today I turned off every CONFIG_DEBUG_* except for list debugging, and > ran dbench 2048: > > [ 2759.118711] WARNING: CPU: 2 PID: 31039 at lib/list_debug.c:33 > __list_add+0xbe/0xd0 > [ 2759.119652] list_add corruption. prev->next should be

Re: bio linked list corruption.

2016-10-26 Thread Chris Mason
On 10/26/2016 04:00 PM, Chris Mason wrote: > > > On 10/26/2016 03:06 PM, Linus Torvalds wrote: >> On Wed, Oct 26, 2016 at 11:42 AM, Dave Jones wrote: >>> >>> The stacks show nearly all of them are stuck in sync_inodes_sb >> >> That's just wb_wait_for_completion(), and it means that some IO

Re: bio linked list corruption.

2016-10-26 Thread Chris Mason
On 10/26/2016 03:06 PM, Linus Torvalds wrote: > On Wed, Oct 26, 2016 at 11:42 AM, Dave Jones wrote: >> >> The stacks show nearly all of them are stuck in sync_inodes_sb > > That's just wb_wait_for_completion(), and it means that some IO isn't > completing. > > There's also a lot of processes

Re: bio linked list corruption.

2016-10-26 Thread Linus Torvalds
On Wed, Oct 26, 2016 at 11:42 AM, Dave Jones wrote: > > The stacks show nearly all of them are stuck in sync_inodes_sb That's just wb_wait_for_completion(), and it means that some IO isn't completing. There's also a lot of processes waiting for inode_lock(), and a few waiting for
