Re: v4.16-rc1 + dm-mpath + BFQ

2018-05-10 Thread Paolo Valente


> On 10 May 2018, at 18:12, Bart Van Assche wrote:
> 
> On Fri, 2018-05-04 at 22:11 +0200, Paolo Valente wrote:
>>> On 30 Mar 2018, at 18:57, Bart Van Assche wrote:
>>> 
>>> On Fri, 2018-03-30 at 10:23 +0200, Paolo Valente wrote:
>>>> Still 4.16-rc1, being that the version for which you reported this
>>>> issue in the first place.
>>> 
>>> A vanilla v4.16-rc1 kernel is not sufficient to run the srp-test software
>>> since RDMA/CM support for the SRP target driver is missing from that kernel.
>>> That's why I asked you to use the for-next branch from my github repository
>>> in a previous e-mail. Anyway, since the necessary patches are now in
>>> linux-next, the srp-test software can also be run against linux-next. Here
>>> are the results that I obtained with label next-20180329 and the kernel
>>> config attached to your previous e-mail:
>>> 
>>> # while ./srp-test/run_tests -c -d -r 10 -e bfq; do :; done
>>> 
>>> BUG: unable to handle kernel NULL pointer dereference at 0200
>>> PGD 0 P4D 0 
>>> Oops: 0002 [#1] SMP PTI
>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
>>> 1.0.0-prebuilt.qemu-project.org 04/01/2014
>>> RIP: 0010:rb_erase+0x284/0x380
>>> Call Trace:
>>> 
>>> elv_rb_del+0x24/0x30
>>> bfq_remove_request+0x9a/0x2e0 [bfq]
>>> ? rcu_read_lock_sched_held+0x64/0x70
>>> ? update_load_avg+0x72b/0x760
>>> bfq_finish_requeue_request+0x2e1/0x3b0 [bfq]
>>> ? __lock_is_held+0x5a/0xa0
>>> blk_mq_free_request+0x5f/0x1a0
>>> blk_put_request+0x23/0x60
>>> multipath_release_clone+0xe/0x10
>>> dm_softirq_done+0xe3/0x270
>>> __blk_mq_complete_request_remote+0x18/0x20
>>> flush_smp_call_function_queue+0xa1/0x150
>>> generic_smp_call_function_single_interrupt+0x13/0x30
>>> smp_call_function_single_interrupt+0x4d/0x220
>>> call_function_single_interrupt+0xf/0x20
>>> 
>> 
>> I suspect my recent fix [1] might fix your failure too.
>> 
>> [1] https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1682264.html
> 
> Hello Paolo,
> 
> With patch [1] applied I can't reproduce the aforementioned crash. I will add
> my Tested-by.
> 

Great, thanks!

Paolo

> Thanks,
> 
> Bart.
> 
> 



Re: v4.16-rc1 + dm-mpath + BFQ

2018-05-10 Thread Bart Van Assche
On Fri, 2018-05-04 at 22:11 +0200, Paolo Valente wrote:
> > On 30 Mar 2018, at 18:57, Bart Van Assche wrote:
> > 
> > On Fri, 2018-03-30 at 10:23 +0200, Paolo Valente wrote:
> > > Still 4.16-rc1, being that the version for which you reported this
> > > issue in the first place.
> > 
> > A vanilla v4.16-rc1 kernel is not sufficient to run the srp-test software
> > since RDMA/CM support for the SRP target driver is missing from that kernel.
> > That's why I asked you to use the for-next branch from my github repository
> > in a previous e-mail. Anyway, since the necessary patches are now in
> > linux-next, the srp-test software can also be run against linux-next. Here
> > are the results that I obtained with label next-20180329 and the kernel
> > config attached to your previous e-mail:
> > 
> > # while ./srp-test/run_tests -c -d -r 10 -e bfq; do :; done
> > 
> > BUG: unable to handle kernel NULL pointer dereference at 0200
> > PGD 0 P4D 0 
> > Oops: 0002 [#1] SMP PTI
> > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> > 1.0.0-prebuilt.qemu-project.org 04/01/2014
> > RIP: 0010:rb_erase+0x284/0x380
> > Call Trace:
> > 
> > elv_rb_del+0x24/0x30
> > bfq_remove_request+0x9a/0x2e0 [bfq]
> > ? rcu_read_lock_sched_held+0x64/0x70
> > ? update_load_avg+0x72b/0x760
> > bfq_finish_requeue_request+0x2e1/0x3b0 [bfq]
> > ? __lock_is_held+0x5a/0xa0
> > blk_mq_free_request+0x5f/0x1a0
> > blk_put_request+0x23/0x60
> > multipath_release_clone+0xe/0x10
> > dm_softirq_done+0xe3/0x270
> > __blk_mq_complete_request_remote+0x18/0x20
> > flush_smp_call_function_queue+0xa1/0x150
> > generic_smp_call_function_single_interrupt+0x13/0x30
> > smp_call_function_single_interrupt+0x4d/0x220
> > call_function_single_interrupt+0xf/0x20
> > 
> 
> I suspect my recent fix [1] might fix your failure too.
> 
> [1] https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1682264.html

Hello Paolo,

With patch [1] applied I can't reproduce the aforementioned crash. I will add
my Tested-by.

Thanks,

Bart.




Re: v4.16-rc1 + dm-mpath + BFQ

2018-05-10 Thread Laurence Oberman
On Thu, 2018-05-10 at 15:16 +0000, Bart Van Assche wrote:
> On Fri, 2018-05-04 at 16:42 -0400, Laurence Oberman wrote:
> > I was never able to reproduce Bart's original issue using his tree and
> > actual mlx5/cx4 hardware and ibsrp.
> > I enabled BFQ with no other special tuning for the mpath and subpaths.
> > I was waiting for him to come back from vacation to check with him.
> 
> (back in the office)
> 
> Hello Laurence,
> 
> What I understood from off-list communication is that you tried to find
> a way to reproduce what I reported without using the srp-test software.
> My understanding is that both Paolo and I can reproduce the reported issue
> with the srp-test software.
> 
> Bart.
> 
> 
> 

Hello Bart

using your kernel
4.17.0-rc2.bart+

CONFIG_IOSCHED_BFQ=y
CONFIG_BFQ_GROUP_IOSCHED=y

These are all SRP LUNS

36001405b2b5c6c24c084b6fa4d55da2f dm-27 LIO-ORG ,block-10
size=3.9G features='2 queue_mode mq' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 2:0:0:9  sdap 66:144 active ready running
  `- 1:0:0:9  sdaz 67:48  active ready running

36001405b26ebe76dcb94a489f6f245f8 dm-18 LIO-ORG ,block-21
size=3.9G features='2 queue_mode mq' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 2:0:0:20 sdx  65:112 active ready running
  `- 1:0:0:20 sdaa 65:160 active ready running

[root@ibclient ~]# cd /sys/block
[root@ibclient block]# cat /sys/block/dm-18/queue/scheduler
mq-deadline kyber [bfq] none
[root@ibclient block]# cat /sys/block/sdaa/queue/scheduler
mq-deadline kyber [bfq] none
[root@ibclient block]# cat /sys/block/sdx/queue/scheduler
mq-deadline kyber [bfq] none
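
Editorial note: in the sysfs output above, the active I/O scheduler is the name in square brackets. The following is a minimal sketch of how that check could be scripted; the helper names `active_scheduler` and `read_scheduler` are illustrative assumptions and not part of srp-test or any tool mentioned in this thread.

```python
import re
from typing import Optional


def active_scheduler(sysfs_text: str) -> Optional[str]:
    """Return the scheduler name shown in square brackets, or None if absent."""
    match = re.search(r"\[([^\]]+)\]", sysfs_text)
    return match.group(1) if match else None


def read_scheduler(device: str) -> Optional[str]:
    """Hypothetical helper: read /sys/block/<device>/queue/scheduler."""
    with open(f"/sys/block/{device}/queue/scheduler") as f:
        return active_scheduler(f.read())


# The line reported for dm-18, sdaa and sdx in the message above.
print(active_scheduler("mq-deadline kyber [bfq] none"))  # bfq
```

Calling `read_scheduler` for a multipath device and each of its paths (for example `dm-18`, `sdx` and `sdaa`) would confirm that BFQ is selected at both levels, which is what the transcript shows.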

Not using the test software, just exercising the LUNs via my own tests, I
am unable to get the oops.

I guess something in the srp-test software triggers it then.

Doing plenty of IO to 5 mpath devices (1.3Gbytes/sec)

#Time    cpu sys inter ctxsw Free Buff Cach Inac Slab  Map KBRead Reads KBWrit Writes KBIn PktIn KBOut PktOut
12:08:32   0   0  1437  1107  88G   5M   1G 902M 300M 178M  1380K   345      0      0    6    74     0      4

Thanks
Laurence



Re: v4.16-rc1 + dm-mpath + BFQ

2018-05-10 Thread Paolo Valente


> On 10 May 2018, at 17:16, Bart Van Assche wrote:
> 
> On Fri, 2018-05-04 at 16:42 -0400, Laurence Oberman wrote:
>> I was never able to reproduce Bart's original issue using his tree and
>> actual mlx5/cx4 hardware and ibsrp.
>> I enabled BFQ with no other special tuning for the mpath and subpaths.
>> I was waiting for him to come back from vacation to check with him.
> 
> (back in the office)
> 
> Hello Laurence,
> 
> What I understood from off-list communication is that you tried to find
> a way to reproduce what I reported without using the srp-test software.
> My understanding is that both Paolo and I can reproduce the reported issue
> with the srp-test software.
> 

Thanks for chiming in, Bart.

Above all, with my fix [1] it should be gone.

Looking forward to your feedback,
Paolo

[1] https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1682264.html
> Bart.
> 
> 
> 



Re: v4.16-rc1 + dm-mpath + BFQ

2018-05-10 Thread Bart Van Assche
On Fri, 2018-05-04 at 16:42 -0400, Laurence Oberman wrote:
> I was never able to reproduce Bart's original issue using his tree and
> actual mlx5/cx4 hardware and ibsrp.
> I enabled BFQ with no other special tuning for the mpath and subpaths.
> I was waiting for him to come back from vacation to check with him.

(back in the office)

Hello Laurence,

What I understood from off-list communication is that you tried to find
a way to reproduce what I reported without using the srp-test software.
My understanding is that both Paolo and I can reproduce the reported issue
with the srp-test software.

Bart.





Re: v4.16-rc1 + dm-mpath + BFQ

2018-05-04 Thread Laurence Oberman
On Fri, 2018-05-04 at 22:11 +0200, Paolo Valente wrote:
> > On 30 Mar 2018, at 18:57, Bart Van Assche wrote:
> > 
> > On Fri, 2018-03-30 at 10:23 +0200, Paolo Valente wrote:
> > > Still 4.16-rc1, being that the version for which you reported this
> > > issue in the first place.
> > 
> > A vanilla v4.16-rc1 kernel is not sufficient to run the srp-test software
> > since RDMA/CM support for the SRP target driver is missing from that kernel.
> > That's why I asked you to use the for-next branch from my github repository
> > in a previous e-mail. Anyway, since the necessary patches are now in
> > linux-next, the srp-test software can also be run against linux-next. Here
> > are the results that I obtained with label next-20180329 and the kernel
> > config attached to your previous e-mail:
> > 
> > # while ./srp-test/run_tests -c -d -r 10 -e bfq; do :; done
> > 
> > BUG: unable to handle kernel NULL pointer dereference at 0200
> > PGD 0 P4D 0 
> > Oops: 0002 [#1] SMP PTI
> > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> > 1.0.0-prebuilt.qemu-project.org 04/01/2014
> > RIP: 0010:rb_erase+0x284/0x380
> > Call Trace:
> > 
> > elv_rb_del+0x24/0x30
> > bfq_remove_request+0x9a/0x2e0 [bfq]
> > ? rcu_read_lock_sched_held+0x64/0x70
> > ? update_load_avg+0x72b/0x760
> > bfq_finish_requeue_request+0x2e1/0x3b0 [bfq]
> > ? __lock_is_held+0x5a/0xa0
> > blk_mq_free_request+0x5f/0x1a0
> > blk_put_request+0x23/0x60
> > multipath_release_clone+0xe/0x10
> > dm_softirq_done+0xe3/0x270
> > __blk_mq_complete_request_remote+0x18/0x20
> > flush_smp_call_function_queue+0xa1/0x150
> > generic_smp_call_function_single_interrupt+0x13/0x30
> > smp_call_function_single_interrupt+0x4d/0x220
> > call_function_single_interrupt+0xf/0x20
> > 
> > 
> 
> Hi Bart,
> I suspect my recent fix [1] might fix your failure too.
> 
> Thanks,
> Paolo
> 
> [1] https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1682264.html
> 
> > Bart.
> > 
> > 
> > 
> 
> 
I was never able to reproduce Bart's original issue using his tree and
actual mlx5/cx4 hardware and ibsrp.
I enabled BFQ with no other special tuning for the mpath and subpaths.
I was waiting for him to come back from vacation to check with him.

Thanks
Laurence


Re: v4.16-rc1 + dm-mpath + BFQ

2018-05-04 Thread Paolo Valente


> On 30 Mar 2018, at 18:57, Bart Van Assche wrote:
> 
> On Fri, 2018-03-30 at 10:23 +0200, Paolo Valente wrote:
>> Still 4.16-rc1, being that the version for which you reported this
>> issue in the first place.
> 
> A vanilla v4.16-rc1 kernel is not sufficient to run the srp-test software
> since RDMA/CM support for the SRP target driver is missing from that kernel.
> That's why I asked you to use the for-next branch from my github repository
> in a previous e-mail. Anyway, since the necessary patches are now in
> linux-next, the srp-test software can also be run against linux-next. Here
> are the results that I obtained with label next-20180329 and the kernel
> config attached to your previous e-mail:
> 
> # while ./srp-test/run_tests -c -d -r 10 -e bfq; do :; done
> 
> BUG: unable to handle kernel NULL pointer dereference at 0200
> PGD 0 P4D 0 
> Oops: 0002 [#1] SMP PTI
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> 1.0.0-prebuilt.qemu-project.org 04/01/2014
> RIP: 0010:rb_erase+0x284/0x380
> Call Trace:
> 
> elv_rb_del+0x24/0x30
> bfq_remove_request+0x9a/0x2e0 [bfq]
> ? rcu_read_lock_sched_held+0x64/0x70
> ? update_load_avg+0x72b/0x760
> bfq_finish_requeue_request+0x2e1/0x3b0 [bfq]
> ? __lock_is_held+0x5a/0xa0
> blk_mq_free_request+0x5f/0x1a0
> blk_put_request+0x23/0x60
> multipath_release_clone+0xe/0x10
> dm_softirq_done+0xe3/0x270
> __blk_mq_complete_request_remote+0x18/0x20
> flush_smp_call_function_queue+0xa1/0x150
> generic_smp_call_function_single_interrupt+0x13/0x30
> smp_call_function_single_interrupt+0x4d/0x220
> call_function_single_interrupt+0xf/0x20
> 
> 

Hi Bart,
I suspect my recent fix [1] might fix your failure too.

Thanks,
Paolo

[1] https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1682264.html

> Bart.
> 
> 
> 



Re: v4.16-rc1 + dm-mpath + BFQ

2018-04-16 Thread Paolo Valente


> On 1 Apr 2018, at 10:56, Paolo Valente wrote:
> 
> 
> 
>> On 30 Mar 2018, at 18:57, Bart Van Assche wrote:
>> 
>> On Fri, 2018-03-30 at 10:23 +0200, Paolo Valente wrote:
>>> Still 4.16-rc1, being that the version for which you reported this
>>> issue in the first place.
>> 
>> A vanilla v4.16-rc1 kernel is not sufficient to run the srp-test software
>> since RDMA/CM support for the SRP target driver is missing from that kernel.
>> That's why I asked you to use the for-next branch from my github repository
>> in a previous e-mail.
> 
> Yep, that's the branch/top commit I used (as you suggested):
> 190943ce1824 [bvanassche/for-next] scsi: mpt3sas: fix oops in error handlers after shutdown/unload
> with
> bvanassche  https://github.com/bvanassche/linux.git
> 
> The kernel in that branch presents itself as 4.16-rc1, but, as you
> point out, it should contain the needed support.
> 
>> Anyway, since the necessary patches are now in
>> linux-next, the srp-test software can also be run against linux-next. Here
>> are the results that I obtained with label next-20180329 and the kernel
>> config attached to your previous e-mail:
>> 
>> # while ./srp-test/run_tests -c -d -r 10 -e bfq; do :; done
>> 
>> BUG: unable to handle kernel NULL pointer dereference at 0200
>> PGD 0 P4D 0 
>> Oops: 0002 [#1] SMP PTI
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
>> 1.0.0-prebuilt.qemu-project.org 04/01/2014
>> RIP: 0010:rb_erase+0x284/0x380
>> Call Trace:
>> 
>> elv_rb_del+0x24/0x30
>> bfq_remove_request+0x9a/0x2e0 [bfq]
>> ? rcu_read_lock_sched_held+0x64/0x70
>> ? update_load_avg+0x72b/0x760
>> bfq_finish_requeue_request+0x2e1/0x3b0 [bfq]
>> ? __lock_is_held+0x5a/0xa0
>> blk_mq_free_request+0x5f/0x1a0
>> blk_put_request+0x23/0x60
>> multipath_release_clone+0xe/0x10
>> dm_softirq_done+0xe3/0x270
>> __blk_mq_complete_request_remote+0x18/0x20
>> flush_smp_call_function_queue+0xa1/0x150
>> generic_smp_call_function_single_interrupt+0x13/0x30
>> smp_call_function_single_interrupt+0x4d/0x220
>> call_function_single_interrupt+0xf/0x20
>> 
>> 
> 
> This new trace just confirms my suspicions.  Looking forward to some
> feedback from Mike or Jens.  Otherwise I'll try to look into it
> myself, although I don't think I am the right person to suggest the
> best cure for this cloning issue.
> 

Hi Bart,
I tried to investigate this further, but the corruption of a cloned
request (or some other mishap) that causes this failure occurs
somewhere earlier, in the cloning phase; and, as I feared, I was not
able to spot the mistake in that part of the code, especially because
I cannot reproduce the failure itself.

I might have more luck after some hints from knowledgeable people.

Otherwise, if, in your test, this failure occurs immediately after you
start the test, and if you are willing to repeat this test with my
development version of bfq, then we may have hope to get a detailed
trace of what happens under the hood.

Thanks,
Paolo



> Thanks,
> Paolo
> 
>> Bart.



Re: v4.16-rc1 + dm-mpath + BFQ

2018-04-01 Thread Paolo Valente


> On 30 Mar 2018, at 18:57, Bart Van Assche wrote:
> 
> On Fri, 2018-03-30 at 10:23 +0200, Paolo Valente wrote:
>> Still 4.16-rc1, being that the version for which you reported this
>> issue in the first place.
> 
> A vanilla v4.16-rc1 kernel is not sufficient to run the srp-test software
> since RDMA/CM support for the SRP target driver is missing from that kernel.
> That's why I asked you to use the for-next branch from my github repository
> in a previous e-mail.

Yep, that's the branch/top commit I used (as you suggested):
190943ce1824 [bvanassche/for-next] scsi: mpt3sas: fix oops in error handlers after shutdown/unload
with
bvanassche  https://github.com/bvanassche/linux.git

The kernel in that branch presents itself as 4.16-rc1, but, as you
point out, it should contain the needed support.

> Anyway, since the necessary patches are now in
> linux-next, the srp-test software can also be run against linux-next. Here
> are the results that I obtained with label next-20180329 and the kernel
> config attached to your previous e-mail:
> 
> # while ./srp-test/run_tests -c -d -r 10 -e bfq; do :; done
> 
> BUG: unable to handle kernel NULL pointer dereference at 0200
> PGD 0 P4D 0 
> Oops: 0002 [#1] SMP PTI
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> 1.0.0-prebuilt.qemu-project.org 04/01/2014
> RIP: 0010:rb_erase+0x284/0x380
> Call Trace:
> 
> elv_rb_del+0x24/0x30
> bfq_remove_request+0x9a/0x2e0 [bfq]
> ? rcu_read_lock_sched_held+0x64/0x70
> ? update_load_avg+0x72b/0x760
> bfq_finish_requeue_request+0x2e1/0x3b0 [bfq]
> ? __lock_is_held+0x5a/0xa0
> blk_mq_free_request+0x5f/0x1a0
> blk_put_request+0x23/0x60
> multipath_release_clone+0xe/0x10
> dm_softirq_done+0xe3/0x270
> __blk_mq_complete_request_remote+0x18/0x20
> flush_smp_call_function_queue+0xa1/0x150
> generic_smp_call_function_single_interrupt+0x13/0x30
> smp_call_function_single_interrupt+0x4d/0x220
> call_function_single_interrupt+0xf/0x20
> 
> 

This new trace just confirms my suspicions.  Looking forward to some
feedback from Mike or Jens.  Otherwise I'll try to look into it
myself, although I don't think I am the right person to suggest the
best cure for this cloning issue.

Thanks,
Paolo

> Bart.
> 
> 
> 



Re: v4.16-rc1 + dm-mpath + BFQ

2018-03-30 Thread Bart Van Assche
On Fri, 2018-03-30 at 10:23 +0200, Paolo Valente wrote:
> Still 4.16-rc1, being that the version for which you reported this
> issue in the first place.

A vanilla v4.16-rc1 kernel is not sufficient to run the srp-test software
since RDMA/CM support for the SRP target driver is missing from that kernel.
That's why I asked you to use the for-next branch from my github repository
in a previous e-mail. Anyway, since the necessary patches are now in
linux-next, the srp-test software can also be run against linux-next. Here
are the results that I obtained with label next-20180329 and the kernel
config attached to your previous e-mail:

# while ./srp-test/run_tests -c -d -r 10 -e bfq; do :; done

BUG: unable to handle kernel NULL pointer dereference at 0200
PGD 0 P4D 0 
Oops: 0002 [#1] SMP PTI
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.0.0-prebuilt.qemu-project.org 04/01/2014
RIP: 0010:rb_erase+0x284/0x380
Call Trace:
 
 elv_rb_del+0x24/0x30
 bfq_remove_request+0x9a/0x2e0 [bfq]
 ? rcu_read_lock_sched_held+0x64/0x70
 ? update_load_avg+0x72b/0x760
 bfq_finish_requeue_request+0x2e1/0x3b0 [bfq]
 ? __lock_is_held+0x5a/0xa0
 blk_mq_free_request+0x5f/0x1a0
 blk_put_request+0x23/0x60
 multipath_release_clone+0xe/0x10
 dm_softirq_done+0xe3/0x270
 __blk_mq_complete_request_remote+0x18/0x20
 flush_smp_call_function_queue+0xa1/0x150
 generic_smp_call_function_single_interrupt+0x13/0x30
 smp_call_function_single_interrupt+0x4d/0x220
 call_function_single_interrupt+0xf/0x20
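
Editorial note: each call-trace entry above has the form `symbol+0xOFFSET/0xSIZE`, optionally followed by the owning module in square brackets; a leading `?` marks an entry that the kernel's stack unwinder could not verify. A minimal sketch of a parser for such lines follows. The names `Frame` and `parse_frame` are illustrative assumptions, not an existing tool.

```python
import re
from typing import NamedTuple, Optional


class Frame(NamedTuple):
    symbol: str
    offset: int            # byte offset of the address inside the function
    size: int              # total size of the function in bytes
    module: Optional[str]  # None when no [module] tag is present
    reliable: bool         # False when the entry is prefixed with "?"


_FRAME_RE = re.compile(
    r"^\s*(?P<q>\?\s+)?(?P<sym>[\w.]+)\+0x(?P<off>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<mod>\w+)\])?\s*$"
)


def parse_frame(line: str) -> Optional[Frame]:
    """Parse one call-trace line; return None for lines that are not frames."""
    m = _FRAME_RE.match(line)
    if m is None:
        return None
    return Frame(
        symbol=m["sym"],
        offset=int(m["off"], 16),
        size=int(m["size"], 16),
        module=m["mod"],
        reliable=m["q"] is None,
    )


for text in (" bfq_remove_request+0x9a/0x2e0 [bfq]", " ? __lock_is_held+0x5a/0xa0"):
    print(parse_frame(text))
```

Filtering on `reliable` and `module == "bfq"` leaves `bfq_remove_request` and `bfq_finish_requeue_request`, the two BFQ functions on the path to `rb_erase` in this trace.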
 

Bart.
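
Editorial note: the shell loop in the message above (`while ...; do :; done`) reruns the test suite for as long as it exits with status 0, so it stops at the first failing run. A rough Python equivalent is sketched below; `run_until_failure` and its `max_runs` parameter are illustrative assumptions and are not part of srp-test.

```python
import subprocess
import sys


def run_until_failure(cmd, max_runs=None):
    """Rerun cmd while it exits with status 0.

    Returns the number of successful runs before the first failure,
    or max_runs if that many runs succeeded (max_runs=None means no limit).
    """
    successes = 0
    while max_runs is None or successes < max_runs:
        if subprocess.run(cmd).returncode != 0:
            break
        successes += 1
    return successes


# Harmless demonstration with a command that always succeeds.
print(run_until_failure([sys.executable, "-c", "pass"], max_runs=3))  # 3

# The invocation from the thread would look like this (not executed here):
# run_until_failure(["./srp-test/run_tests", "-c", "-d", "-r", "10", "-e", "bfq"])
```

A bound such as `max_runs` is useful when the failure is rare, as it lets an automated run give up after a fixed number of clean passes. When the kernel panics, as it does in this report, the loop never returns and the count has to be read from the console log instead.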





Re: v4.16-rc1 + dm-mpath + BFQ

2018-03-30 Thread Paolo Valente
+Jens, Mike

> On 30 Mar 2018, at 01:16, Bart Van Assche wrote:
> 
> On Thu, 2018-03-29 at 11:02 +0200, Paolo Valente wrote:
>>> On 1 Mar 2018, at 02:35, Bart Van Assche wrote:
>>> Thank you for having shared your kernel config off-list. After having
>>> made the following changes to your kernel config I was able to run the
>>> srp-test software:
>>> * Enable CONFIG_DM_MULTIPATH_QL, CONFIG_DM_MULTIPATH_ST,
>>> CONFIG_SCSI_DH_RDAC, CONFIG_SCSI_DH_EMC and CONFIG_SCSI_DH_ALUA.
>>> * Disable CONFIG_KASAN. Apparently there is an incompatibility between the
>>> rdma_rxe driver and KASAN. I'm still analyzing this.
>>> 
>>> Please let me know whether these changes also allow you to run the srp-test
>>> software and whether you can reproduce what I reported at the start of this
>>> e-mail thread.
>>> 
>> 
>> Thanks for these new directives and sorry for my long delay.  I've
>> modified the config as per your suggestions (you can find my new
>> config attached), and retried.
>> 
>> Unfortunately, same failure:
>> $ sudo ./run_tests -c -d -r 10 -t 02-mq -e bfq
>> Unloaded the ib_srpt kernel module
>> Unloaded the rdma_rxe kernel module
>> SoftRoCE network interfaces: rxe0
>> Zero-initializing /dev/ram0 ... done
>> Zero-initializing /dev/ram1 ... done
>> mkdir: impossibile creare la directory "021c:42ff:fe4c:fac9": Invalid 
>> argument
>> Retrying with old port name format
>> mkdir: impossibile creare la directory "0xfe80021c42fffe4cfac9": 
>> Invalid argument
> 
> Hello Paolo,
> 

Hi

> With your kernel config and I/O scheduler "none" srp-test runs reliably
> on my test setup.

I tried with none too, but:
$ sudo ./run_tests -c -d -r 10 -t 02-mq -e none
[sudo] password di paolo: 
Unloaded the ib_srpt kernel module
Unloaded the rdma_rxe kernel module
SoftRoCE network interfaces: rxe0
insmod: ERROR: could not insert module 
/lib/modules/4.16.0-rc1+/kernel/drivers/infiniband/ulp/srpt/ib_srpt.ko: File 
exists

> The result for the BFQ scheduler is available below.


Thanks for pasting it.

According to the stack trace, the cause of the problem may still be
some missing initialization in request cloning, like the one I
reported in [1], a thread that you initiated after a failure rather
similar to the present one.

Mike and Jens took care of solving that issue (which had more general
implications than just driving BFQ crazy).  Unfortunately I can't
remember how that story ended, and I got somehow lost among threads
while trying to reconstruct it.

Mike, Jens, I guess you ended up making a fix; if so, do you have any
idea how your fix relates to this new (?) issue?  This one
occurs after an end_clone_request, instead of a dm_mq_queue_rq, like
the previous one did.  Or, more generally, does this issue ring a
bell?

[1] https://www.spinics.net/lists/dm-devel/msg32088.html

> If
> the srp-test software did not start on your setup I assume that you are
> using another kernel version? Which kernel version did you use?
> 

Still 4.16-rc1, since that is the version for which you reported this
issue in the first place.


Thanks,
Paolo

> Thanks,
> 
> Bart.
> 
> 
> 
> 
> BUG: unable to handle kernel NULL pointer dereference at 0200
> IP: rb_erase+0x284/0x380
> PGD 0 P4D 0 
> Oops: 0002 [#1] SMP PTI
> Modules linked in: ib_srp libcrc32c scsi_transport_srp ib_srpt 
> target_core_iblock target_core_mod rdma_cm iw_cm ib_cm scsi_debug brd 
> rdma_rxe ip6_udp_tunnel udp_tunnel ib_umad ib_uverbs ib_core
> kyber_iosched bfq crct10dif_pclmul crc32_pclmul ghash_clmulni_intel serio_raw 
> virtio_balloon virtio_console multipath virtio_net virtio_blk virtio_scsi 
> ata_generic crc32c_intel virtio_pci virtio_ring
> virtio pata_acpi [last unloaded: ip6_udp_tunnel]
> CPU: 3 PID: 28 Comm: ksoftirqd/3 Not tainted 4.16.0-rc7-dbg+ #2
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> 1.0.0-prebuilt.qemu-project.org 04/01/2014
> RIP: 0010:rb_erase+0x284/0x380
> RSP: :a5ad0040f908 EFLAGS: 00010206
> RAX: de9f81e9b700 RBX: 9445775b1380 RCX: 
> RDX: de9f81e9b700 RSI: 9445652e1380 RDI: 9445775b13e0
> RBP: a5ad0040f908 R08: 0200 R09: 0002
> R10: 0001 R11: af25f020 R12: 9445775b13e0
> R13: 944564376800 R14: 944576328000 R15: 0001
> FS:  () GS:94457fd8() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 0200 CR3: 6b210001 CR4: 003606e0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> Call Trace:
> elv_rb_del+0x24/0x30
> bfq_remove_request+0x9a/0x2e0 [bfq]
> bfq_finish_requeue_request+0x2e1/0x3b0 [bfq]
> blk_mq_free_request+0x5f/0x1a0
> blk_put_request+0x23/0x60
> multipath_release_clone+0xe/0x10
> dm_softirq_done+0xe3/0x270
> __blk_mq_complete_request+0xfd/0x190
> blk_

Re: v4.16-rc1 + dm-mpath + BFQ

2018-03-29 Thread Bart Van Assche
On Thu, 2018-03-29 at 11:02 +0200, Paolo Valente wrote:
> > On 1 Mar 2018, at 02:35, Bart Van Assche wrote:
> > Thank you for having shared your kernel config off-list. After having
> > made the following changes to your kernel config I was able to run the
> > srp-test software:
> > * Enable CONFIG_DM_MULTIPATH_QL, CONFIG_DM_MULTIPATH_ST,
> >  CONFIG_SCSI_DH_RDAC, CONFIG_SCSI_DH_EMC and CONFIG_SCSI_DH_ALUA.
> > * Disable CONFIG_KASAN. Apparently there is an incompatibility between the
> >  rdma_rxe driver and KASAN. I'm still analyzing this.
> > 
> > Please let me know whether these changes also allow you to run the srp-test
> > software and whether you can reproduce what I reported at the start of this
> > e-mail thread.
> > 
> 
> Thanks for these new directives and sorry for my long delay.  I've
> modified the config as per your suggestions (you can find my new
> config attached), and retried.
> 
> Unfortunately, same failure:
> $ sudo ./run_tests -c -d -r 10 -t 02-mq -e bfq
> Unloaded the ib_srpt kernel module
> Unloaded the rdma_rxe kernel module
> SoftRoCE network interfaces: rxe0
> Zero-initializing /dev/ram0 ... done
> Zero-initializing /dev/ram1 ... done
> mkdir: impossibile creare la directory "021c:42ff:fe4c:fac9": Invalid argument
> Retrying with old port name format
> mkdir: impossibile creare la directory "0xfe80021c42fffe4cfac9": 
> Invalid argument

Hello Paolo,

With your kernel config and I/O scheduler "none" srp-test runs reliably
on my test setup. The result for the BFQ scheduler is available below. If
the srp-test software did not start on your setup I assume that you are
using another kernel version? Which kernel version did you use?

Thanks,

Bart.




BUG: unable to handle kernel NULL pointer dereference at 0200
IP: rb_erase+0x284/0x380
PGD 0 P4D 0 
Oops: 0002 [#1] SMP PTI
Modules linked in: ib_srp libcrc32c scsi_transport_srp ib_srpt 
target_core_iblock target_core_mod rdma_cm iw_cm ib_cm scsi_debug brd rdma_rxe 
ip6_udp_tunnel udp_tunnel ib_umad ib_uverbs ib_core
kyber_iosched bfq crct10dif_pclmul crc32_pclmul ghash_clmulni_intel serio_raw 
virtio_balloon virtio_console multipath virtio_net virtio_blk virtio_scsi 
ata_generic crc32c_intel virtio_pci virtio_ring
virtio pata_acpi [last unloaded: ip6_udp_tunnel]
CPU: 3 PID: 28 Comm: ksoftirqd/3 Not tainted 4.16.0-rc7-dbg+ #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.0.0-prebuilt.qemu-project.org 04/01/2014
RIP: 0010:rb_erase+0x284/0x380
RSP: :a5ad0040f908 EFLAGS: 00010206
RAX: de9f81e9b700 RBX: 9445775b1380 RCX: 
RDX: de9f81e9b700 RSI: 9445652e1380 RDI: 9445775b13e0
RBP: a5ad0040f908 R08: 0200 R09: 0002
R10: 0001 R11: af25f020 R12: 9445775b13e0
R13: 944564376800 R14: 944576328000 R15: 0001
FS:  () GS:94457fd8() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 0200 CR3: 6b210001 CR4: 003606e0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
Call Trace:
 elv_rb_del+0x24/0x30
 bfq_remove_request+0x9a/0x2e0 [bfq]
 bfq_finish_requeue_request+0x2e1/0x3b0 [bfq]
 blk_mq_free_request+0x5f/0x1a0
 blk_put_request+0x23/0x60
 multipath_release_clone+0xe/0x10
 dm_softirq_done+0xe3/0x270
 __blk_mq_complete_request+0xfd/0x190
 blk_mq_complete_request+0x69/0xa0
 dm_complete_request+0x22/0x30
 end_clone_request+0x1d/0x20
 __blk_mq_end_request+0x5b/0x70
 scsi_end_request+0xba/0x220
 scsi_io_completion+0x4f1/0x700
 ? scsi_dec_host_busy+0xa6/0x130
 scsi_finish_command+0xef/0x140
 scsi_softirq_done+0x11f/0x170
 __blk_mq_complete_request+0xfd/0x190
 blk_mq_complete_request+0x69/0xa0
 scsi_mq_done+0x34/0x100
 srp_recv_done+0x2f6/0xa40 [ib_srp]
 ? rxe_poll_cq+0x13a/0x150 [rdma_rxe]
 __ib_process_cq+0x83/0xc0 [ib_core]
 ib_poll_handler+0x2b/0x80 [ib_core]
 irq_poll_softirq+0x90/0x140
 __do_softirq+0xcf/0x4b1
 run_ksoftirqd+0x33/0x50
 smpboot_thread_fn+0xfc/0x170
 kthread+0x121/0x140
 ? sort_range+0x30/0x30
 ? kthread_create_worker_on_cpu+0x70/0x70
 ret_from_fork+0x3a/0x50
Code: 83 e2 01 0f 85 45 fe ff ff 5d c3 4c 89 0e 4d 85 d2 0f 84 28 fe ff ff 48 
83 c8 01 48 89 0a 49 89 02 5d c3 4d 85 c0 4c 89 06 74 9c <49> 89 10 5d c3 48 89 
0e 5d c3 4d 89 48 10 eb d3 4d 8b 50 08
4c 
RIP: rb_erase+0x284/0x380 RSP: a5ad0040f908
CR2: 0200
---[ end trace 29e2f703ddaa3232 ]---
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: 0x2d00 from 0x8100 (relocation range: 
0x8000-0xbfff)
---[ end Kernel panic - not syncing: Fatal exception in interrupt
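
Editorial note: on x86, the number after `Oops:` is the hexadecimal page-fault error code. Its low bits say whether the page was present, whether the access was a write, and whether it came from user mode. The sketch below decodes those bits; the function name `decode_page_fault_code` is an illustrative assumption.

```python
def decode_page_fault_code(code: int) -> dict:
    """Decode the low bits of an x86 page-fault error code."""
    return {
        "page_present": bool(code & 0x01),       # 0: page not present, 1: protection violation
        "write": bool(code & 0x02),              # 0: read access, 1: write access
        "user_mode": bool(code & 0x04),          # 0: kernel mode, 1: user mode
        "reserved_bit": bool(code & 0x08),       # reserved bit set in a page-table entry
        "instruction_fetch": bool(code & 0x10),  # fault happened on an instruction fetch
    }


# "Oops: 0002" from the report above.
info = decode_page_fault_code(int("0002", 16))
print(info)
```

For code `0002` this gives a kernel-mode write to a page that is not present, which fits the "NULL pointer dereference" line and the small faulting address in CR2: `rb_erase` appears to be storing through a pointer that is not a valid tree link.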



Re: v4.16-rc1 + dm-mpath + BFQ

2018-02-28 Thread Bart Van Assche
On Fri, 2018-02-16 at 08:39 +0100, Paolo Valente wrote:
> After enabling the options in your list, and a few other
> related options, such as iblock support, I get this:
> 
> $ sudo ./run_tests -c -d -r 10 -t 02-mq -e bfq
> Unloaded the ib_srpt kernel module
> Unloaded the rdma_rxe kernel module
> SoftRoCE network interfaces: rxe0
> Zero-initializing /dev/ram0 ... done
> Zero-initializing /dev/ram1 ... done
> mkdir: impossibile creare la directory "021c:42ff:fe4c:fac9": Invalid argument
> Retrying with old port name format
> mkdir: impossibile creare la directory "0xfe80021c42fffe4cfac9": 
> Invalid argument

Hello Paolo,

Thank you for having shared your kernel config off-list. After having
made the following changes to your kernel config I was able to run the
srp-test software:
* Enable CONFIG_DM_MULTIPATH_QL, CONFIG_DM_MULTIPATH_ST,
  CONFIG_SCSI_DH_RDAC, CONFIG_SCSI_DH_EMC and CONFIG_SCSI_DH_ALUA.
* Disable CONFIG_KASAN. Apparently there is an incompatibility between the
  rdma_rxe driver and KASAN. I'm still analyzing this.

Please let me know whether these changes also allow you to run the srp-test
software and whether you can reproduce what I reported at the start of this
e-mail thread.
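
Editorial note: the config changes listed above can be checked mechanically against a kernel `.config` file. The following is a minimal sketch under the assumption that an option counts as enabled when it is set to `y` or `m`; the function names and the sample text are illustrative and not taken from srp-test.

```python
REQUIRED = [
    "CONFIG_DM_MULTIPATH_QL",
    "CONFIG_DM_MULTIPATH_ST",
    "CONFIG_SCSI_DH_RDAC",
    "CONFIG_SCSI_DH_EMC",
    "CONFIG_SCSI_DH_ALUA",
]
MUST_BE_OFF = ["CONFIG_KASAN"]


def enabled_options(config_text: str) -> set:
    """Collect options set to y or m; '# CONFIG_X is not set' lines are skipped."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if value in ("y", "m"):
            enabled.add(name)
    return enabled


def check_config(config_text: str):
    """Return (missing required options, options that should be disabled)."""
    on = enabled_options(config_text)
    missing = [opt for opt in REQUIRED if opt not in on]
    unwanted = [opt for opt in MUST_BE_OFF if opt in on]
    return missing, unwanted


# Made-up sample, for illustration only.
sample = """\
CONFIG_DM_MULTIPATH_QL=y
CONFIG_DM_MULTIPATH_ST=m
# CONFIG_SCSI_DH_RDAC is not set
CONFIG_SCSI_DH_EMC=y
CONFIG_SCSI_DH_ALUA=y
CONFIG_KASAN=y
"""
print(check_config(sample))
```

For the sample text the check reports `CONFIG_SCSI_DH_RDAC` as missing and `CONFIG_KASAN` as still enabled.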

Thanks,

Bart.




Re: v4.16-rc1 + dm-mpath + BFQ

2018-02-21 Thread Bart Van Assche
On Fri, 2018-02-16 at 08:39 +0100, Paolo Valente wrote:
> After enabling the options in your list, and a few other
> related options, such as iblock support, I get this:
> 
> $ sudo ./run_tests -c -d -r 10 -t 02-mq -e bfq
> Unloaded the ib_srpt kernel module
> Unloaded the rdma_rxe kernel module
> SoftRoCE network interfaces: rxe0
> Zero-initializing /dev/ram0 ... done
> Zero-initializing /dev/ram1 ... done
> mkdir: impossibile creare la directory "021c:42ff:fe4c:fac9": Invalid argument
> Retrying with old port name format
> mkdir: impossibile creare la directory "0xfe80021c42fffe4cfac9": 
> Invalid argument

Hello Paolo,

That probably means that there is still something missing from the kernel
config that you are using. Please send that kernel-config to me (off-list)
such that I can have a look at it.

Thanks,

Bart.





Re: v4.16-rc1 + dm-mpath + BFQ

2018-02-15 Thread Paolo Valente


> On 14 Feb 2018, at 19:11, Bart Van Assche wrote:
> 
> On 02/14/18 09:55, Paolo Valente wrote:
>> After following all of them (and taking some other needed steps), I
>> invoked:
>> sudo ./run_tests -c -d -r 10 -t 02-mq -e bfq
>> But I got the following:
>> ./lib/functions: riga 34: /sys/class/block/ram0/size: No such file or 
>> directory
>> ./lib/functions: riga 34: * 512: errore di sintassi: atteso un operando (il 
>> token dell'errore è "* 512")
>> Unloaded the ib_srpt kernel module
>> Unloaded the rdma_rxe kernel module
>> modprobe: FATAL: Module ib_uverbs not found in directory 
>> /lib/modules/4.16.0-rc1+
>> modprobe: FATAL: Module ib_umad not found in directory 
>> /lib/modules/4.16.0-rc1+
>> SoftRoCE network interfaces: rxe0
>> modprobe: FATAL: Module target_core_iblock not found in directory 
>> /lib/modules/4.16.0-rc1+
>> So I think I need a little more help, to have this working in a
>> reasonable amount of time.  In particular, could you tell me
>> everything that is missing?
> 
> Hello Paolo,
> 
> Can you check whether CONFIG_BLK_DEV_RAM, CONFIG_INFINIBAND, 
> CONFIG_INFINIBAND_USER_MAD, CONFIG_INFINIBAND_USER_ACCESS, 
> CONFIG_INFINIBAND_USER_MEM, CONFIG_INFINIBAND_IPOIB, CONFIG_INFINIBAND_SRP, 
> CONFIG_INFINIBAND_SRPT and CONFIG_RDMA_RXE were enabled in your kernel config?
> 

(+Linus, Ulf)

Hi Bart,
after enabling the options in your list, and a few other
related options, such as iblock support, I get this:

$ sudo ./run_tests -c -d -r 10 -t 02-mq -e bfq
Unloaded the ib_srpt kernel module
Unloaded the rdma_rxe kernel module
SoftRoCE network interfaces: rxe0
Zero-initializing /dev/ram0 ... done
Zero-initializing /dev/ram1 ... done
mkdir: cannot create directory "021c:42ff:fe4c:fac9": Invalid argument
Retrying with old port name format
mkdir: cannot create directory "0xfe80021c42fffe4cfac9": Invalid argument

Thanks for your patience and collaboration,
Paolo

> Thanks,
> 
> Bart.



Re: v4.16-rc1 + dm-mpath + BFQ

2018-02-14 Thread Bart Van Assche

On 02/14/18 09:55, Paolo Valente wrote:
> After following all of them (and taking some other step needed), I
> invoked:
> sudo ./run_tests -c -d -r 10 -t 02-mq -e bfq
>
> But I got the following:
> ./lib/functions: line 34: /sys/class/block/ram0/size: No such file or directory
> ./lib/functions: line 34: * 512: syntax error: operand expected (error token is "* 512")
> Unloaded the ib_srpt kernel module
> Unloaded the rdma_rxe kernel module
> modprobe: FATAL: Module ib_uverbs not found in directory /lib/modules/4.16.0-rc1+
> modprobe: FATAL: Module ib_umad not found in directory /lib/modules/4.16.0-rc1+
> SoftRoCE network interfaces: rxe0
> modprobe: FATAL: Module target_core_iblock not found in directory /lib/modules/4.16.0-rc1+
>
> So I think I need a little more help, to have this working in a
> reasonable amount of time.  In particular, could you tell me everything
> that is missing?


Hello Paolo,

Can you check whether CONFIG_BLK_DEV_RAM, CONFIG_INFINIBAND, 
CONFIG_INFINIBAND_USER_MAD, CONFIG_INFINIBAND_USER_ACCESS, 
CONFIG_INFINIBAND_USER_MEM, CONFIG_INFINIBAND_IPOIB, 
CONFIG_INFINIBAND_SRP, CONFIG_INFINIBAND_SRPT and CONFIG_RDMA_RXE were 
enabled in your kernel config?


Thanks,

Bart.
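
[Editor's note] The check Bart asks for can be scripted. The following is only a sketch: missing_kernel_options is a hypothetical helper, not part of srp-test, and it assumes the kernel config is available as a plain-text file (for example the .config in the build tree). It prints every option from the argument list that is not set to y or m.

```shell
#!/bin/sh
# Hypothetical helper (not part of srp-test): print each CONFIG_* option
# that is neither built in (=y) nor built as a module (=m) in a config file.
missing_kernel_options() {
    config_file=$1
    shift
    for opt in "$@"; do
        # Enabled options appear as CONFIG_FOO=y or CONFIG_FOO=m;
        # disabled ones appear as "# CONFIG_FOO is not set" or not at all.
        if ! grep -Eq "^${opt}=(y|m)\$" "$config_file"; then
            echo "$opt"
        fi
    done
}

# Example call (the config path is an assumption, adjust it to your tree):
#   missing_kernel_options .config CONFIG_BLK_DEV_RAM CONFIG_INFINIBAND \
#       CONFIG_INFINIBAND_USER_MAD CONFIG_INFINIBAND_USER_ACCESS \
#       CONFIG_INFINIBAND_USER_MEM CONFIG_INFINIBAND_IPOIB \
#       CONFIG_INFINIBAND_SRP CONFIG_INFINIBAND_SRPT CONFIG_RDMA_RXE
```

An empty output would mean that all listed options are enabled; it does not prove that nothing else is missing from the config.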


Re: v4.16-rc1 + dm-mpath + BFQ

2018-02-14 Thread Paolo Valente


> On 13 Feb 2018, at 19:47, Bart Van Assche wrote:
> 
> On Tue, 2018-02-13 at 19:38 +0100, Paolo Valente wrote:
>> as a first attempt, I've followed your steps, but got:
>> Error: could not find sg_reset
> 
> Please install the sg3_utils package. Every Linux distro I know of supports
> that package.

I happened to do this test on a Fedora.

> And in case you would like to install it from source, the source code of
> that package is available from http://sg.danny.cz/sg/sg3_utils.html.
> 
>> For ib_srp-backport, I get a lot of warnings like the following one,
>> at "make install" (preceded by corresponding warnings at the end of
>> the compilation):
>> depmod: WARNING: /lib/modules/4.16.0-rc1+/extra/ib_srp.ko needs unknown symbol rdma_resolve_addr
>> 
>> Unfortunately, it gets worse while executing "make scst srpt":
> 
> Please neither install the ib_srp-backport driver nor SCST. These drivers have
> not yet been tested against kernel v4.16-rc1. I provided you a kernel tree in
> which both the SRP initiator and target drivers support RoCE such that you
> don't need to install these out-of-tree drivers. I think all that you need
> from the
> srp-test README document are the instructions to configure /etc/multipath.conf
> and the instructions for installing the required packages. From that README
> document:
> 
> Install the following software packages if these have not yet been installed:
> fio, gcc-c++, make, multipath-tools or device-mapper-multipath, sg3_utils,
> srptools, e2fsprogs and xfsprogs.
> 

Thank you very much for these instructions Bart.

After following all of them (and taking some other step needed), I
invoked:
sudo ./run_tests -c -d -r 10 -t 02-mq -e bfq

But I got the following:
./lib/functions: line 34: /sys/class/block/ram0/size: No such file or directory
./lib/functions: line 34: * 512: syntax error: operand expected (error token is "* 512")
Unloaded the ib_srpt kernel module
Unloaded the rdma_rxe kernel module
modprobe: FATAL: Module ib_uverbs not found in directory /lib/modules/4.16.0-rc1+
modprobe: FATAL: Module ib_umad not found in directory /lib/modules/4.16.0-rc1+
SoftRoCE network interfaces: rxe0
modprobe: FATAL: Module target_core_iblock not found in directory /lib/modules/4.16.0-rc1+

So I think I need a little more help, to have this working in a
reasonable amount of time.  In particular, could you tell me everything
that is missing?

Thanks,
Paolo

> Thanks,
> 
> Bart.
> 
> 
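
[Editor's note] The first two errors in the output above are consistent with /dev/ram0 not existing: reading /sys/class/block/ram0/size yields nothing, so the multiplication by 512 is left without a left-hand operand. What line 34 of lib/functions actually contains is not shown in this thread, so the following is only a guess at a guarded version of such a computation. block_dev_size_bytes and the SYSFS_BLOCK_DIR override are hypothetical names used for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the size of a block device in bytes.
# The sysfs "size" attribute is expressed in 512-byte sectors.
block_dev_size_bytes() {
    # SYSFS_BLOCK_DIR is an illustrative override for testing; it is not
    # a variable that srp-test or the kernel knows about.
    size_file="${SYSFS_BLOCK_DIR:-/sys/class/block}/$1/size"
    if [ ! -r "$size_file" ]; then
        echo "$1: cannot read $size_file (is the driver built and loaded?)" >&2
        return 1
    fi
    sectors=$(cat "$size_file")
    echo $((sectors * 512))
}

# Example call:
#   block_dev_size_bytes ram0 || echo "ram0 is missing"
```

Failing early with a clear message points at the missing RAM disk directly instead of at a shell arithmetic error.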



Re: v4.16-rc1 + dm-mpath + BFQ

2018-02-13 Thread Bart Van Assche
On Tue, 2018-02-13 at 19:38 +0100, Paolo Valente wrote:
> as a first attempt, I've followed your steps, but got:
> Error: could not find sg_reset

Please install the sg3_utils package. Every Linux distro I know of supports that
package. And in case you would like to install it from source, the source code
of that package is available from http://sg.danny.cz/sg/sg3_utils.html.

> For ib_srp-backport, I get a lot of warnings like the following one,
> at "make install" (preceded by corresponding warnings at the end of
> the compilation):
> depmod: WARNING: /lib/modules/4.16.0-rc1+/extra/ib_srp.ko needs unknown symbol rdma_resolve_addr
> 
> Unfortunately, it gets worse while executing "make scst srpt":

Please neither install the ib_srp-backport driver nor SCST. These drivers have
not yet been tested against kernel v4.16-rc1. I provided you a kernel tree in
which both the SRP initiator and target drivers support RoCE such that you don't
need to install these out-of-tree drivers. I think all that you need from the
srp-test README document are the instructions to configure /etc/multipath.conf
and the instructions for installing the required packages. From that README
document:

Install the following software packages if these have not yet been installed:
fio, gcc-c++, make, multipath-tools or device-mapper-multipath, sg3_utils,
srptools, e2fsprogs and xfsprogs.

Thanks,

Bart.
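
[Editor's note] A failure such as "could not find sg_reset" can be caught before the tests start by checking that the required commands are on PATH. This is a sketch only: missing_commands is a hypothetical helper, and the command names in the example call are the editor's choice of one representative binary per package (sg_reset from sg3_utils, fio, multipath from multipath-tools, mkfs.ext4 from e2fsprogs, mkfs.xfs from xfsprogs), not a list taken from the srp-test README.

```shell
#!/bin/sh
# Hypothetical helper: print every command from the argument list
# that cannot be found on PATH.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}

# Example call (command names are illustrative):
#   missing_commands sg_reset fio multipath mkfs.ext4 mkfs.xfs
```

Each name that is printed indicates a package that still has to be installed.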




Re: v4.16-rc1 + dm-mpath + BFQ

2018-02-13 Thread Paolo Valente


> On 12 Feb 2018, at 17:31, Bart Van Assche wrote:
> 
> On 02/11/18 23:35, Paolo Valente wrote:
>> Also this smells a little bit like some spurious elevator call.
>> Unfortunately I have no clue on the cause.  To go on, I need at least
>> to reproduce it.  In this respect: Bart, could you please tell me how
>> to set up the offending configuration, and to cause the failure?
>> Possibly with just one, or at most two PCs.  I don't have fancier hw
>> at the moment.
> 
> Hello Paolo,
> 
> Although I expect that it is possible to reproduce this with an unmodified 
> v4.16-rc1 kernel, this is how I ran into this issue:
> * Clone the for-next branch of https://github.com/bvanassche/linux.
> * Build and install that kernel in a virtual machine.
> * Clone https://github.com/bvanassche/srp-test.
> * Run the following command:
>  srp-test/run_tests -c -d -r 10 -t 02-mq -e bfq
> 

Hi Bart,
as a first attempt, I've followed your steps, but got:
Error: could not find sg_reset
as expected, because of dependencies that your steps take for granted.

So, I have followed the instructions in the srp-test README for the
case "Running the Tests on an Ethernet Setup", directly on a 4.16-rc1.

For ib_srp-backport, I get a lot of warnings like the following one,
at "make install" (preceded by corresponding warnings at the end of
the compilation):
depmod: WARNING: /lib/modules/4.16.0-rc1+/extra/ib_srp.ko needs unknown symbol rdma_resolve_addr

Unfortunately, it gets worse while executing "make scst srpt":

  CC [M]  /home/paolo/scst/srpt/src/ib_srpt.o
In file included from /home/paolo/scst/srpt/src/ib_srpt.c:62:0:
/home/paolo/scst/srpt/src/ib_srpt.h:481:8: error: redefinition of ‘struct srp_login_req_rdma’
 struct srp_login_req_rdma {
^~
In file included from /home/paolo/scst/srpt/src/ib_srpt.h:44:0,
 from /home/paolo/scst/srpt/src/ib_srpt.c:62:
/mnt/linux-dev/linux/include/scsi/srp.h:139:8: note: originally defined here
 struct srp_login_req_rdma {
^~

Could you please give me some help, so as to not get lost among these issues?

Thanks,
Paolo

> Thanks,
> 
> Bart.



Re: v4.16-rc1 + dm-mpath + BFQ

2018-02-12 Thread Bart Van Assche

On 02/11/18 23:35, Paolo Valente wrote:
> Also this smells a little bit like some spurious elevator call.
> Unfortunately I have no clue on the cause.  To go on, I need at least
> to reproduce it.  In this respect: Bart, could you please tell me how
> to set up the offending configuration, and to cause the failure?
> Possibly with just one, or at most two PCs.  I don't have fancier hw
> at the moment.


Hello Paolo,

Although I expect that it is possible to reproduce this with an 
unmodified v4.16-rc1 kernel, this is how I ran into this issue:

* Clone the for-next branch of https://github.com/bvanassche/linux.
* Build and install that kernel in a virtual machine.
* Clone https://github.com/bvanassche/srp-test.
* Run the following command:
  srp-test/run_tests -c -d -r 10 -t 02-mq -e bfq

Thanks,

Bart.


Re: v4.16-rc1 + dm-mpath + BFQ

2018-02-11 Thread Paolo Valente


> On 09 Feb 2018, at 20:18, Jens Axboe wrote:
> 
> On 2/9/18 12:14 PM, Bart Van Assche wrote:
>> On 02/09/18 10:58, Jens Axboe wrote:
>>> On 2/9/18 11:54 AM, Bart Van Assche wrote:
>>>> Hello Paolo,
>>>>
>>>> If I enable the BFQ scheduler for a dm-mpath device then a kernel oops
>>>> appears (see also below). This happens systematically with Linus' tree from
>>>> this morning (commit 54ce685cae30) merged with Jens' for-linus branch (commit
>>>> a78773906147 ("block, bfq: add requeue-request hook")) and for-next branch
>>>> (commit 88455ad7f928). Is this a known issue?
>>> 
>>> Does it happen on Linus -git as well, or just with my for-linus merged in?
>>> What I'm getting at is if a78773906147 caused this or not.
>> 
>> Hello Jens,
>> 
>> Thanks for chiming in. After having reverted commit a78773906147, after 
>> having rebuilt the BFQ scheduler, after having rebooted and after having 
>> repeated the test I see the same kernel oops being reported. I think 
>> that means that this regression is not caused by commit a78773906147. In 
>> case it would be useful, here is how gdb translates the crash address:
>> 
>> $ gdb block/bfq*ko
>> (gdb) list *(bfq_remove_request+0x8d)
>> 0x280d is in bfq_remove_request (block/bfq-iosched.c:1760).
>> 1755        list_del_init(&rq->queuelist);
>> 1756        bfqq->queued[sync]--;
>> 1757        bfqd->queued--;
>> 1758        elv_rb_del(&bfqq->sort_list, rq);
>> 1759
>> 1760        elv_rqhash_del(q, rq);
>> 1761        if (q->last_merge == rq)
>> 1762                q->last_merge = NULL;
>> 1763
>> 1764        if (RB_EMPTY_ROOT(&bfqq->sort_list)) {
> 
> Looks very odd. So clearly RQF_HASHED is set, but we're blowing up on
> the hash list pointers. I'll let Paolo take a look at this one. Thanks
> for testing without that commit, I want to push out my pending fixes
> today and this would have thrown a wrench in the works.
> 

Also this smells a little bit like some spurious elevator call.
Unfortunately I have no clue on the cause.  To go on, I need at least
to reproduce it.  In this respect: Bart, could you please tell me how
to set up the offending configuration, and to cause the failure?
Possibly with just one, or at most two PCs.  I don't have fancier hw
at the moment.

Thanks,
Paolo

> -- 
> Jens Axboe



Re: v4.16-rc1 + dm-mpath + BFQ

2018-02-09 Thread Jens Axboe
On 2/9/18 12:14 PM, Bart Van Assche wrote:
> On 02/09/18 10:58, Jens Axboe wrote:
>> On 2/9/18 11:54 AM, Bart Van Assche wrote:
>>> Hello Paolo,
>>>
>>> If I enable the BFQ scheduler for a dm-mpath device then a kernel oops
>>> appears (see also below). This happens systematically with Linus' tree from
>>> this morning (commit 54ce685cae30) merged with Jens' for-linus branch (commit
>>> a78773906147 ("block, bfq: add requeue-request hook")) and for-next branch
>>> (commit 88455ad7f928). Is this a known issue?
>>
>> Does it happen on Linus -git as well, or just with my for-linus merged in?
>> What I'm getting at is if a78773906147 caused this or not.
> 
> Hello Jens,
> 
> Thanks for chiming in. After having reverted commit a78773906147, after 
> having rebuilt the BFQ scheduler, after having rebooted and after having 
> repeated the test I see the same kernel oops being reported. I think 
> that means that this regression is not caused by commit a78773906147. In 
> case it would be useful, here is how gdb translates the crash address:
> 
> $ gdb block/bfq*ko
> (gdb) list *(bfq_remove_request+0x8d)
> 0x280d is in bfq_remove_request (block/bfq-iosched.c:1760).
> 1755        list_del_init(&rq->queuelist);
> 1756        bfqq->queued[sync]--;
> 1757        bfqd->queued--;
> 1758        elv_rb_del(&bfqq->sort_list, rq);
> 1759
> 1760        elv_rqhash_del(q, rq);
> 1761        if (q->last_merge == rq)
> 1762                q->last_merge = NULL;
> 1763
> 1764        if (RB_EMPTY_ROOT(&bfqq->sort_list)) {

Looks very odd. So clearly RQF_HASHED is set, but we're blowing up on
the hash list pointers. I'll let Paolo take a look at this one. Thanks
for testing without that commit, I want to push out my pending fixes
today and this would have thrown a wrench in the works.

-- 
Jens Axboe



Re: v4.16-rc1 + dm-mpath + BFQ

2018-02-09 Thread Bart Van Assche

On 02/09/18 10:58, Jens Axboe wrote:
> On 2/9/18 11:54 AM, Bart Van Assche wrote:
>> Hello Paolo,
>>
>> If I enable the BFQ scheduler for a dm-mpath device then a kernel oops
>> appears (see also below). This happens systematically with Linus' tree from
>> this morning (commit 54ce685cae30) merged with Jens' for-linus branch (commit
>> a78773906147 ("block, bfq: add requeue-request hook")) and for-next branch
>> (commit 88455ad7f928). Is this a known issue?
>
> Does it happen on Linus -git as well, or just with my for-linus merged in?
> What I'm getting at is if a78773906147 caused this or not.


Hello Jens,

Thanks for chiming in. After having reverted commit a78773906147, after 
having rebuilt the BFQ scheduler, after having rebooted and after having 
repeated the test I see the same kernel oops being reported. I think 
that means that this regression is not caused by commit a78773906147. In 
case it would be useful, here is how gdb translates the crash address:


$ gdb block/bfq*ko
(gdb) list *(bfq_remove_request+0x8d)
0x280d is in bfq_remove_request (block/bfq-iosched.c:1760).
1755        list_del_init(&rq->queuelist);
1756        bfqq->queued[sync]--;
1757        bfqd->queued--;
1758        elv_rb_del(&bfqq->sort_list, rq);
1759
1760        elv_rqhash_del(q, rq);
1761        if (q->last_merge == rq)
1762                q->last_merge = NULL;
1763
1764        if (RB_EMPTY_ROOT(&bfqq->sort_list)) {

Bart.
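
[Editor's note] The two numbers in the gdb output above are related by simple arithmetic: gdb resolved bfq_remove_request+0x8d to address 0x280d within the module, so the function starts at the resolved address minus the offset. The sketch below only redoes that subtraction; symbol_base is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: given a resolved address and the offset into the
# symbol (both as printed by gdb), print the start address of the symbol.
symbol_base() {
    printf '0x%x\n' $(( $1 - $2 ))
}

# Example call with the values from the gdb session above:
#   symbol_base 0x280d 0x8d    # prints 0x2780
```

The same subtraction can be applied to any symbol+offset pair from an oops, provided the address comes from the same module build as the one that crashed.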


Re: v4.16-rc1 + dm-mpath + BFQ

2018-02-09 Thread Jens Axboe
On 2/9/18 11:54 AM, Bart Van Assche wrote:
> Hello Paolo,
> 
> If I enable the BFQ scheduler for a dm-mpath device then a kernel oops
> appears (see also below). This happens systematically with Linus' tree from
> this morning (commit 54ce685cae30) merged with Jens' for-linus branch (commit
> a78773906147 ("block, bfq: add requeue-request hook")) and for-next branch
> (commit 88455ad7f928). Is this a known issue?

Does it happen on Linus -git as well, or just with my for-linus merged in?
What I'm getting at is if a78773906147 caused this or not.



-- 
Jens Axboe