Re: NULL deref around blkmq in v4.0-rc1–rc7

2015-04-11 Thread Jan Engelhardt
On Saturday 2015-04-11 19:46, Linus Torvalds wrote:

>On Thu, Apr 9, 2015 at 5:07 PM, Linus Torvalds
> wrote:
>> On Thu, Apr 9, 2015 at 3:29 PM, Jan Engelhardt  wrote:
>>>
>>> Yes, this seems to solve it for me.
>>
>> Can you humor me, and try that patch on top of a few different kernel
>> versions that you had trouble with.
>
>Jan? I'm inclined to commit the patch for 4.0 even though it is late
>(it's not like it can hurt), but since you seem to be the only one
>seeing this issue, I'd really like you to double-check it..

Yeah, just commit. I'm currently "stuck" in the weekend.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: NULL deref around blkmq in v4.0-rc1–rc7

2015-04-11 Thread Jens Axboe

On 04/11/2015 11:46 AM, Linus Torvalds wrote:

On Thu, Apr 9, 2015 at 5:07 PM, Linus Torvalds
 wrote:

On Thu, Apr 9, 2015 at 3:29 PM, Jan Engelhardt  wrote:


Yes, this seems to solve it for me.


Can you humor me, and try that patch on top of a few different kernel
versions that you had trouble with.


Jan? I'm inclined to commit the patch for 4.0 even though it is late
(it's not like it can hurt), but since you seem to be the only one
seeing this issue, I'd really like you to double-check it..


I already committed for 4.1 (marked stable), but I'd be fine with adding 
it to this series as well obviously.


--
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: NULL deref around blkmq in v4.0-rc1–rc7

2015-04-11 Thread Linus Torvalds
On Thu, Apr 9, 2015 at 5:07 PM, Linus Torvalds
 wrote:
> On Thu, Apr 9, 2015 at 3:29 PM, Jan Engelhardt  wrote:
>>
>> Yes, this seems to solve it for me.
>
> Can you humor me, and try that patch on top of a few different kernel
> versions that you had trouble with.

Jan? I'm inclined to commit the patch for 4.0 even though it is late
(it's not like it can hurt), but since you seem to be the only one
seeing this issue, I'd really like you to double-check it..

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: NULL deref around blkmq in v4.0-rc1–rc7

2015-04-11 Thread Jan Engelhardt
On Saturday 2015-04-11 19:46, Linus Torvalds wrote:

On Thu, Apr 9, 2015 at 5:07 PM, Linus Torvalds
torva...@linux-foundation.org wrote:
 On Thu, Apr 9, 2015 at 3:29 PM, Jan Engelhardt jeng...@inai.de wrote:

 Yes, this seems to solve it for me.

 Can you humor me, and try that patch on top of a few different kernel
 versions that you had trouble with.

Jan? I'm inclined to commit the patch for 4.0 even though it is late
(it's not like it can hurt), but since you seem to be the only one
seeing this issue, I'd really like you to double-check it..

Yeah, just commit. I'm currently stuck in the weekend.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: NULL deref around blkmq in v4.0-rc1–rc7

2015-04-11 Thread Linus Torvalds
On Thu, Apr 9, 2015 at 5:07 PM, Linus Torvalds
torva...@linux-foundation.org wrote:
 On Thu, Apr 9, 2015 at 3:29 PM, Jan Engelhardt jeng...@inai.de wrote:

 Yes, this seems to solve it for me.

 Can you humor me, and try that patch on top of a few different kernel
 versions that you had trouble with.

Jan? I'm inclined to commit the patch for 4.0 even though it is late
(it's not like it can hurt), but since you seem to be the only one
seeing this issue, I'd really like you to double-check it..

  Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: NULL deref around blkmq in v4.0-rc1–rc7

2015-04-11 Thread Jens Axboe

On 04/11/2015 11:46 AM, Linus Torvalds wrote:

On Thu, Apr 9, 2015 at 5:07 PM, Linus Torvalds
torva...@linux-foundation.org wrote:

On Thu, Apr 9, 2015 at 3:29 PM, Jan Engelhardt jeng...@inai.de wrote:


Yes, this seems to solve it for me.


Can you humor me, and try that patch on top of a few different kernel
versions that you had trouble with.


Jan? I'm inclined to commit the patch for 4.0 even though it is late
(it's not like it can hurt), but since you seem to be the only one
seeing this issue, I'd really like you to double-check it..


I already committed for 4.1 (marked stable), but I'd be fine with adding 
it to this series as well obviously.


--
Jens Axboe

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: NULL deref around blkmq in v4.0-rc1–rc7

2015-04-09 Thread Jens Axboe

On 04/09/2015 03:12 PM, Linus Torvalds wrote:

On Thu, Apr 9, 2015 at 11:24 AM, Jan Engelhardt  wrote:


It's fairly consistent (reproducible?). Only 1 in 15 or so (have not kept track
really) attempts does it not die.

With frame pointers:
  [] scsi_queue_rq+0x2e8/0x3d2
  [] __blk_mq_run_hw_queue+0x19b/0x2a2
  [] ? blk_mq_merge_queue_io+0x75/0x147
  [] ? __xfs_get_blocks+0x2f9/0x2f9 [xfs]
  [] blk_mq_run_hw_queue+0x4f/0x99
  [] blk_sq_make_request+0x163/0x170


Ok, good.

So the cmd comes from

 struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);

which in turn is just

 return (void *) rq + sizeof(*rq);

which in turn is written by some crazy monkey on crack. That's some
shit code. Why the hell you'd write it that way, when the natural
thing to do would be just

 return rq + 1;

without the sizeof, and without the cast.

The particular crazy monkey on crack is Jens Axboe, in commit 320ae51feed5c.

Jens, really. This code is shit.


That's a bit rough on a single line of code like that, don't you think? 
But yes, rq + 1 is identical and cleaner...



That ->sense_buffer thing is supposed to be initialized by the
blk_mq_ops.init_request() function, which is called - if it exists =
when the array of requests ('->rqs[]') is initialized.

And that code too looks like crap. It seems to be very clever, trying
to allocaet big contiguous chunks of RAM for the requests, but then
the initialization sequence is questionable as hell. It takes that
nonzeroed allocation, and zeroes a few fields randomly. The rest will
contain whatever garbage data they used to.

Does this entirely untested patch make any difference?

And Jens, this all really looks very fishy. When I look at these kinds
of core functions, and find just *stupid* code like this, it makes me
unhappy.


Not sure why it isn't all zeroed, definitely the saner thing to do at 
init time. So patch looks fine, should be applied regardless of whether 
or not it fixes this issue.


But it's _always_ been like that, so it's not a change in behavior. It 
is fragile, so perhaps some SCSI change modified alloc behavior and it's 
not causing and issue.


And if this is mpt, we recently ran into some list corruption issues due 
to a bug in the driver. It hit on reboot, but it was scan related, so 
could be a boot issue as well.


--
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: NULL deref around blkmq in v4.0-rc1–rc7

2015-04-09 Thread Linus Torvalds
On Thu, Apr 9, 2015 at 11:24 AM, Jan Engelhardt  wrote:
>
> It's fairly consistent (reproducible?). Only 1 in 15 or so (have not kept 
> track
> really) attempts does it not die.
>
> With frame pointers:
>  [] scsi_queue_rq+0x2e8/0x3d2
>  [] __blk_mq_run_hw_queue+0x19b/0x2a2
>  [] ? blk_mq_merge_queue_io+0x75/0x147
>  [] ? __xfs_get_blocks+0x2f9/0x2f9 [xfs]
>  [] blk_mq_run_hw_queue+0x4f/0x99
>  [] blk_sq_make_request+0x163/0x170

Ok, good.

So the cmd comes from

struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);

which in turn is just

return (void *) rq + sizeof(*rq);

which in turn is written by some crazy monkey on crack. That's some
shit code. Why the hell you'd write it that way, when the natural
thing to do would be just

return rq + 1;

without the sizeof, and without the cast.

The particular crazy monkey on crack is Jens Axboe, in commit 320ae51feed5c.

Jens, really. This code is shit.

That ->sense_buffer thing is supposed to be initialized by the
blk_mq_ops.init_request() function, which is called - if it exists =
when the array of requests ('->rqs[]') is initialized.

And that code too looks like crap. It seems to be very clever, trying
to allocaet big contiguous chunks of RAM for the requests, but then
the initialization sequence is questionable as hell. It takes that
nonzeroed allocation, and zeroes a few fields randomly. The rest will
contain whatever garbage data they used to.

Does this entirely untested patch make any difference?

And Jens, this all really looks very fishy. When I look at these kinds
of core functions, and find just *stupid* code like this, it makes me
unhappy.

Anyway, I assume that the actual "disk" in question is mpt fusion?

 Linus
 block/blk-mq.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index b7b8933ec241..33c428530193 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1457,7 +1457,7 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct 
blk_mq_tag_set *set,
 
do {
page = alloc_pages_node(set->numa_node,
-   GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY,
+   GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY | 
__GFP_ZERO,
this_order);
if (page)
break;
@@ -1479,8 +1479,6 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct 
blk_mq_tag_set *set,
left -= to_do * rq_size;
for (j = 0; j < to_do; j++) {
tags->rqs[i] = p;
-   tags->rqs[i]->atomic_flags = 0;
-   tags->rqs[i]->cmd_flags = 0;
if (set->ops->init_request) {
if (set->ops->init_request(set->driver_data,
tags->rqs[i], hctx_idx, i,


Re: NULL deref around blkmq in v4.0-rc1–rc7

2015-04-09 Thread Jan Engelhardt

On Thursday 2015-04-09 19:38, Linus Torvalds wrote:
>>
>> I reran bisect just to be sure.
>> It now shows v4.0-rc1~9 is bad, v4.0-rc1~9^1 is ok, and v4.0-rc~9^2 is
>> ok too. So this means that the combination of the both ~9 childs work
>> badly together.
>
>Ok, that's just _odd_.
>[...]
>So I get the feeling that the oops you are seeing is likely not
>consistent, and may depend on allocation patterns or similar.

It's fairly consistent (reproducible?). Only 1 in 15 or so (have not kept track
really) attempts does it not die.

With frame pointers:
BUG: unable to handle kernel paging request at 1000
IP: [] scsi_init_cmd_errh+0x2a/0x62
PGD 0 
Oops: 0002 [#1] SMP 
Modules linked in: xfs crc32c_generic libcrc32c dm_crypt xts gf128mul 
algif_skcipher af_alg sd_mod mptsas scsi_transport_sas mptscsih mptbase dm_mod 
sg ipv6
CPU: 0 PID: 403 Comm: kworker/u2:1 Not tainted 4.0.0-rc7+ #55
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
task: 88007b686f60 ti: 88007bcb4000 task.ti: 88007bcb4000
RIP: 0010:[]  [] 
scsi_init_cmd_errh+0x2a/0x62
RSP: 0018:88007bcb77a8  EFLAGS: 00010246
RAX:  RBX: 88007bf8d800 RCX: 0018
RDX: 88007bf7ab70 RSI:  RDI: 1000
RBP: 88007bcb77a8 R08: 88007beb9c40 R09: 
R10:  R11: ea0001fe17c0 R12: 88007bf7ab70
R13:  R14: 88007bf8d800 R15: 88007bf7aa00
FS:  () GS:88007fc0() knlGS:
CS:  0010 DS:  ES:  CR0: 8005003b
CR2: 1000 CR3: 7cb0d000 CR4: 07f0
Stack:
 88007bcb7818 81286d59 88007b686f60 88007bc24000
 88007bf7ab78 88007bf8d968 88007be56c00 88007bc24000
 88007cbfb400 88007bcb7850 88007be56c08 88007bf7aa00
Call Trace:
 [] scsi_queue_rq+0x2e8/0x3d2
 [] __blk_mq_run_hw_queue+0x19b/0x2a2
 [] ? blk_mq_merge_queue_io+0x75/0x147
 [] ? __xfs_get_blocks+0x2f9/0x2f9 [xfs]
 [] blk_mq_run_hw_queue+0x4f/0x99
 [] blk_sq_make_request+0x163/0x170
 [] generic_make_request+0x97/0xd6
 [] submit_bio+0x10d/0x12c
 [] ? __lru_cache_add+0x1e/0x3f
 [] mpage_bio_submit+0x25/0x2c
 [] mpage_readpages+0xf8/0x10c
 [] ? __xfs_get_blocks+0x2f9/0x2f9 [xfs]
 [] xfs_vm_readpages+0x18/0x1a [xfs]
 [] __do_page_cache_readahead+0x137/0x1d3
 [] ondemand_readahead+0x20a/0x21b
 [] page_cache_sync_readahead+0x38/0x3a
 [] generic_file_read_iter+0x191/0x4fb
 [] ? xfs_ilock+0x32/0x5d [xfs]
 [] xfs_file_read_iter+0x1c2/0x213 [xfs]
 [] new_sync_read+0x74/0x98
 [] __vfs_read+0x14/0x3b
 [] vfs_read+0x74/0xc1
 [] kernel_read+0x3c/0x4a
 [] prepare_binprm+0x117/0x11f
 [] do_execveat_common.isra.31+0x3b2/0x5d8
 [] do_execve+0x27/0x29
 [] call_usermodehelper+0x10a/0x138
 [] ? call_usermodehelper+0x49/0x49
 [] ret_from_fork+0x58/0x90
 [] ? call_usermodehelper+0x49/0x49
Code: c3 55 48 89 fa 48 c7 87 b0 00 00 00 00 00 00 00 c7 87 f4 00 00 00 00 00 
00 00 48 8b bf 10 01 00 00 31 c0 b9 18 00 00 00 48 89 e5  ab 66 83 ba cc 00 
00 00 00 75 2a 48 8b 8a d8 00 00 00 8a 01 
RIP  [] scsi_init_cmd_errh+0x2a/0x62
 RSP 
CR2: 1000
---[ end trace fbec0fe487830b1d ]---



>and %rdi is 0x1000. It seems to be simply
>
> memset(cmd->sense_buffer, 0, SCSI_SENSE_BUFFERSIZE);
>
>where 'cmd->sense_buffer' has some insane value ("PAGE_SIZE" or just a
>flipped bit, or whatever)

Having been observed on two isolated different systems, I don't
think so much that it would be a broken HW-induced bitflip.

Oh yeah, if anybody likes, I can hand out the virtualbox image.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: NULL deref around blkmq in v4.0-rc1–rc7

2015-04-09 Thread Jan Engelhardt

On Thursday 2015-04-09 19:38, Linus Torvalds wrote:

 I reran bisect just to be sure.
 It now shows v4.0-rc1~9 is bad, v4.0-rc1~9^1 is ok, and v4.0-rc~9^2 is
 ok too. So this means that the combination of the both ~9 childs work
 badly together.

Ok, that's just _odd_.
[...]
So I get the feeling that the oops you are seeing is likely not
consistent, and may depend on allocation patterns or similar.

It's fairly consistent (reproducible?). Only 1 in 15 or so (have not kept track
really) attempts does it not die.

With frame pointers:
BUG: unable to handle kernel paging request at 1000
IP: [812853c9] scsi_init_cmd_errh+0x2a/0x62
PGD 0 
Oops: 0002 [#1] SMP 
Modules linked in: xfs crc32c_generic libcrc32c dm_crypt xts gf128mul 
algif_skcipher af_alg sd_mod mptsas scsi_transport_sas mptscsih mptbase dm_mod 
sg ipv6
CPU: 0 PID: 403 Comm: kworker/u2:1 Not tainted 4.0.0-rc7+ #55
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
task: 88007b686f60 ti: 88007bcb4000 task.ti: 88007bcb4000
RIP: 0010:[812853c9]  [812853c9] 
scsi_init_cmd_errh+0x2a/0x62
RSP: 0018:88007bcb77a8  EFLAGS: 00010246
RAX:  RBX: 88007bf8d800 RCX: 0018
RDX: 88007bf7ab70 RSI:  RDI: 1000
RBP: 88007bcb77a8 R08: 88007beb9c40 R09: 
R10:  R11: ea0001fe17c0 R12: 88007bf7ab70
R13:  R14: 88007bf8d800 R15: 88007bf7aa00
FS:  () GS:88007fc0() knlGS:
CS:  0010 DS:  ES:  CR0: 8005003b
CR2: 1000 CR3: 7cb0d000 CR4: 07f0
Stack:
 88007bcb7818 81286d59 88007b686f60 88007bc24000
 88007bf7ab78 88007bf8d968 88007be56c00 88007bc24000
 88007cbfb400 88007bcb7850 88007be56c08 88007bf7aa00
Call Trace:
 [81286d59] scsi_queue_rq+0x2e8/0x3d2
 [8119e64d] __blk_mq_run_hw_queue+0x19b/0x2a2
 [8119e901] ? blk_mq_merge_queue_io+0x75/0x147
 [a00fa34a] ? __xfs_get_blocks+0x2f9/0x2f9 [xfs]
 [8119edeb] blk_mq_run_hw_queue+0x4f/0x99
 [8119fab9] blk_sq_make_request+0x163/0x170
 [81196a8b] generic_make_request+0x97/0xd6
 [81196bd7] submit_bio+0x10d/0x12c
 [810d5e15] ? __lru_cache_add+0x1e/0x3f
 [81142af5] mpage_bio_submit+0x25/0x2c
 [8114387d] mpage_readpages+0xf8/0x10c
 [a00fa34a] ? __xfs_get_blocks+0x2f9/0x2f9 [xfs]
 [a00f9c45] xfs_vm_readpages+0x18/0x1a [xfs]
 [810d4e5c] __do_page_cache_readahead+0x137/0x1d3
 [810d5102] ondemand_readahead+0x20a/0x21b
 [810d5262] page_cache_sync_readahead+0x38/0x3a
 [810cd1c5] generic_file_read_iter+0x191/0x4fb
 [a010b2cf] ? xfs_ilock+0x32/0x5d [xfs]
 [a01023c2] xfs_file_read_iter+0x1c2/0x213 [xfs]
 [81118e63] new_sync_read+0x74/0x98
 [81119aef] __vfs_read+0x14/0x3b
 [81119b8a] vfs_read+0x74/0xc1
 [8111d977] kernel_read+0x3c/0x4a
 [8111dbd2] prepare_binprm+0x117/0x11f
 [8111f10d] do_execveat_common.isra.31+0x3b2/0x5d8
 [8111f35a] do_execve+0x27/0x29
 [81050e07] call_usermodehelper+0x10a/0x138
 [81050cfd] ? call_usermodehelper+0x49/0x49
 [8133b3d8] ret_from_fork+0x58/0x90
 [81050cfd] ? call_usermodehelper+0x49/0x49
Code: c3 55 48 89 fa 48 c7 87 b0 00 00 00 00 00 00 00 c7 87 f4 00 00 00 00 00 
00 00 48 8b bf 10 01 00 00 31 c0 b9 18 00 00 00 48 89 e5 f3 ab 66 83 ba cc 00 
00 00 00 75 2a 48 8b 8a d8 00 00 00 8a 01 
RIP  [812853c9] scsi_init_cmd_errh+0x2a/0x62
 RSP 88007bcb77a8
CR2: 1000
---[ end trace fbec0fe487830b1d ]---



and %rdi is 0x1000. It seems to be simply

 memset(cmd-sense_buffer, 0, SCSI_SENSE_BUFFERSIZE);

where 'cmd-sense_buffer' has some insane value (PAGE_SIZE or just a
flipped bit, or whatever)

Having been observed on two isolated different systems, I don't
think so much that it would be a broken HW-induced bitflip.

Oh yeah, if anybody likes, I can hand out the virtualbox image.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: NULL deref around blkmq in v4.0-rc1–rc7

2015-04-09 Thread Jens Axboe

On 04/09/2015 03:12 PM, Linus Torvalds wrote:

On Thu, Apr 9, 2015 at 11:24 AM, Jan Engelhardt jeng...@inai.de wrote:


It's fairly consistent (reproducible?). Only 1 in 15 or so (have not kept track
really) attempts does it not die.

With frame pointers:
  [81286d59] scsi_queue_rq+0x2e8/0x3d2
  [8119e64d] __blk_mq_run_hw_queue+0x19b/0x2a2
  [8119e901] ? blk_mq_merge_queue_io+0x75/0x147
  [a00fa34a] ? __xfs_get_blocks+0x2f9/0x2f9 [xfs]
  [8119edeb] blk_mq_run_hw_queue+0x4f/0x99
  [8119fab9] blk_sq_make_request+0x163/0x170


Ok, good.

So the cmd comes from

 struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);

which in turn is just

 return (void *) rq + sizeof(*rq);

which in turn is written by some crazy monkey on crack. That's some
shit code. Why the hell you'd write it that way, when the natural
thing to do would be just

 return rq + 1;

without the sizeof, and without the cast.

The particular crazy monkey on crack is Jens Axboe, in commit 320ae51feed5c.

Jens, really. This code is shit.


That's a bit rough on a single line of code like that, don't you think? 
But yes, rq + 1 is identical and cleaner...



That -sense_buffer thing is supposed to be initialized by the
blk_mq_ops.init_request() function, which is called - if it exists =
when the array of requests ('-rqs[]') is initialized.

And that code too looks like crap. It seems to be very clever, trying
to allocaet big contiguous chunks of RAM for the requests, but then
the initialization sequence is questionable as hell. It takes that
nonzeroed allocation, and zeroes a few fields randomly. The rest will
contain whatever garbage data they used to.

Does this entirely untested patch make any difference?

And Jens, this all really looks very fishy. When I look at these kinds
of core functions, and find just *stupid* code like this, it makes me
unhappy.


Not sure why it isn't all zeroed, definitely the saner thing to do at 
init time. So patch looks fine, should be applied regardless of whether 
or not it fixes this issue.


But it's _always_ been like that, so it's not a change in behavior. It 
is fragile, so perhaps some SCSI change modified alloc behavior and it's 
not causing and issue.


And if this is mpt, we recently ran into some list corruption issues due 
to a bug in the driver. It hit on reboot, but it was scan related, so 
could be a boot issue as well.


--
Jens Axboe

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: NULL deref around blkmq in v4.0-rc1–rc7

2015-04-09 Thread Linus Torvalds
On Thu, Apr 9, 2015 at 11:24 AM, Jan Engelhardt jeng...@inai.de wrote:

 It's fairly consistent (reproducible?). Only 1 in 15 or so (have not kept 
 track
 really) attempts does it not die.

 With frame pointers:
  [81286d59] scsi_queue_rq+0x2e8/0x3d2
  [8119e64d] __blk_mq_run_hw_queue+0x19b/0x2a2
  [8119e901] ? blk_mq_merge_queue_io+0x75/0x147
  [a00fa34a] ? __xfs_get_blocks+0x2f9/0x2f9 [xfs]
  [8119edeb] blk_mq_run_hw_queue+0x4f/0x99
  [8119fab9] blk_sq_make_request+0x163/0x170

Ok, good.

So the cmd comes from

struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);

which in turn is just

return (void *) rq + sizeof(*rq);

which in turn is written by some crazy monkey on crack. That's some
shit code. Why the hell you'd write it that way, when the natural
thing to do would be just

return rq + 1;

without the sizeof, and without the cast.

The particular crazy monkey on crack is Jens Axboe, in commit 320ae51feed5c.

Jens, really. This code is shit.

That -sense_buffer thing is supposed to be initialized by the
blk_mq_ops.init_request() function, which is called - if it exists =
when the array of requests ('-rqs[]') is initialized.

And that code too looks like crap. It seems to be very clever, trying
to allocaet big contiguous chunks of RAM for the requests, but then
the initialization sequence is questionable as hell. It takes that
nonzeroed allocation, and zeroes a few fields randomly. The rest will
contain whatever garbage data they used to.

Does this entirely untested patch make any difference?

And Jens, this all really looks very fishy. When I look at these kinds
of core functions, and find just *stupid* code like this, it makes me
unhappy.

Anyway, I assume that the actual disk in question is mpt fusion?

 Linus
 block/blk-mq.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index b7b8933ec241..33c428530193 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1457,7 +1457,7 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct 
blk_mq_tag_set *set,
 
do {
page = alloc_pages_node(set-numa_node,
-   GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY,
+   GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY | 
__GFP_ZERO,
this_order);
if (page)
break;
@@ -1479,8 +1479,6 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct 
blk_mq_tag_set *set,
left -= to_do * rq_size;
for (j = 0; j  to_do; j++) {
tags-rqs[i] = p;
-   tags-rqs[i]-atomic_flags = 0;
-   tags-rqs[i]-cmd_flags = 0;
if (set-ops-init_request) {
if (set-ops-init_request(set-driver_data,
tags-rqs[i], hctx_idx, i,