Re: regression 4.15-rc: kernel oops in dm-multipath

2017-12-22 Thread Christian Borntraeger
I will give that commit a try. Thanks.

On 12/22/2017 05:00 PM, Bart Van Assche wrote:
> On Fri, 2017-12-22 at 10:53 +0100, Christian Borntraeger wrote:
>> any quick ideas?
> 
> Hello Christian,
> 
> Was commit afc567a4977b ("dm table: fix regression from improper
> dm_dev_internal.count refcount_t conversion") included in your test kernel?
> 
> Bart.
> 



Re: [for-4.16 PATCH 0/5] block, nvme, dm: allow DM multipath to use NVMe's error handler

2017-12-22 Thread Mike Snitzer
On Tue, Dec 19 2017 at  4:05pm -0500,
Mike Snitzer  wrote:

> These patches enable DM multipath to work well on NVMe over Fabrics
> devices.  Currently that implies CONFIG_NVME_MULTIPATH is _not_ set.
> 
> But follow-on work will be to make it so that native NVMe multipath
> and DM multipath can co-exist (e.g. blacklisting certain NVMe devices
> from being consumed by native NVMe multipath?)
> 
> Patch 1 updates block core to formalize a recent construct that
> Christoph embedded into NVMe core (and native NVMe multipath):
> callback into a bio-based driver from the blk-mq driver's .complete
> hook to blk_steal_bios() a request's bios.
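For reference, blk_steal_bios() already exists in block core with the
signature void blk_steal_bios(struct bio_list *list, struct request *rq);
it detaches a request's bios onto a caller-owned list. A minimal sketch of
how a stacked bio-based driver could use it from such a completion callback
(the context struct and function names are illustrative only, not the
actual patch):

#include <linux/blkdev.h>
#include <linux/bio.h>

/* Hypothetical per-clone context kept by the upper, bio-based driver. */
struct example_clone_ctx {
	struct bio_list bios;	/* bios reclaimed from the failed clone */
};

/*
 * Invoked from the blk-mq driver's ->complete path when the lower request
 * failed on this path: detach its bios so the bio-based driver can
 * resubmit them via another path.
 */
static void example_steal_back_bios(struct request *clone,
				    struct example_clone_ctx *ctx)
{
	blk_steal_bios(&ctx->bios, clone);
}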
> 
> Patch 2 switches NVMe over to using the block infrastructure
> established by Patch 1.
> 
> Patch 3 moves nvme_req_needs_failover() from NVMe multipath to core,
> which allows stacked devices (like DM multipath) to make use of
> NVMe's enhanced error handling.
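As a rough illustration of what that enables (the helper name below is
hypothetical; the real check is whatever the moved nvme_req_needs_failover()
logic ends up exposing), a request-based DM target's completion hook can ask
the lower layer whether the failure was path-related and requeue rather than
fail the I/O upward:

/* Sketch only; multipath_end_io() in dm-mpath.c has this general shape. */
static int example_end_io(struct dm_target *ti, struct request *clone,
			  blk_status_t error, union map_info *map_context)
{
	if (error && example_is_path_error(clone, error))  /* hypothetical helper */
		return DM_ENDIO_REQUEUE;	/* retry the I/O on another path */

	return DM_ENDIO_DONE;			/* complete with the original status */
}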
> 
> Patch 4 updates DM multipath to also make use of the block
> infrastructure established by Patch 1.
> 
> Patch 5 can be largely ignored, but it illustrates that Patches 1 - 4
> enable DM multipath to avoid extra DM endio callbacks.
> 
> These patches have been developed on top of numerous DM changes I've
> staged for 4.16, see:
> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-4.16
> (which happens to include these 5 patches at the end, purely for
> interim linux-next coverage purposes as these changes land in the
> appropriate maintainer tree).
> 
> I've updated the "mptest" DM multipath testsuite to provide NVMe test
> coverage (using NVMe fcloop), see: https://github.com/snitm/mptest
> 
> The tree I've been testing includes all of 'dm-4.16' and all but one
> of the commits from 'nvme-4.16', see:
> https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/log/?h=dm-4.16_nvme-4.16
> (I've let James Smart know that commit a0b69cc8 causes "nvme connect"
> to not work on my fcloop testbed).
> 
> Jens, provided review is favorable, I'd very much appreciate it if you'd
> pick up patches 1 - 3 for 4.16.

BTW, Christoph, if you're open to picking up patches 1 - 3 into
'nvme-4.16', that works too.  I just figured since there is a block core
dependency Jens would want to take them directly.

Thanks,
Mike


Re: regression 4.15-rc: kernel oops in dm-multipath

2017-12-22 Thread Mike Snitzer
On Fri, Dec 22 2017 at 11:00am -0500,
Bart Van Assche  wrote:

> On Fri, 2017-12-22 at 10:53 +0100, Christian Borntraeger wrote:
> > any quick ideas?
> 
> Hello Christian,
> 
> Was commit afc567a4977b ("dm table: fix regression from improper
> dm_dev_internal.count refcount_t conversion") included in your test kernel?

Good call.  Unlikely, since he is running 4.15-rc3 and afc567a4977b
wasn't merged until after rc3, for inclusion in 4.15-rc4.


Re: regression 4.15-rc: kernel oops in dm-multipath

2017-12-22 Thread Bart Van Assche
On Fri, 2017-12-22 at 10:53 +0100, Christian Borntraeger wrote:
> any quick ideas?

Hello Christian,

Was commit afc567a4977b ("dm table: fix regression from improper
dm_dev_internal.count refcount_t conversion") included in your test kernel?

Bart.


Re: [PATCH 3/3] block: Polling completion performance optimization

2017-12-22 Thread Jens Axboe
On 12/21/17 4:10 PM, Keith Busch wrote:
> On Thu, Dec 21, 2017 at 03:17:41PM -0700, Jens Axboe wrote:
>> On 12/21/17 2:34 PM, Keith Busch wrote:
>>> It would be nice, but the driver doesn't know a request's completion
>>> is going to be polled.
>>
>> That's trivially solvable though, since the information is available
>> at submission time.
>>
>>> Even if it did, we don't have a spec-defined
>>> way to tell the controller not to send an interrupt with this command's
>>> completion, which would be negated anyway if any interrupt-driven IO
>>> is mixed in the same queue. We could possibly create a special queue
>>> with interrupts disabled for this purpose if we can pass the HIPRI hint
>>> through the request.
>>
>> There's no way to do it per IO, right. But you can create an sq/cq pair
>> without interrupts enabled. This would also allow you to scale better
>> with multiple users of polling, a case where we currently don't
>> perform as well as spdk, for instance.
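For what it's worth, the Create I/O Completion Queue command already carries
interrupt enablement as a flag, so an interrupt-less CQ is mostly a matter of
not setting that flag at queue allocation time. A simplified sketch modeled
on the driver's adapter_alloc_cq() (illustrative only, not an actual patch):

static int example_alloc_polled_cq(struct nvme_dev *dev, u16 qid,
				   struct nvme_queue *nvmeq)
{
	struct nvme_command c = { };

	/*
	 * No NVME_CQ_IRQ_ENABLED and no irq_vector: completions on this
	 * queue never raise an interrupt and can only be reaped by polling.
	 */
	c.create_cq.opcode	= nvme_admin_create_cq;
	c.create_cq.prp1	= cpu_to_le64(nvmeq->cq_dma_addr);
	c.create_cq.cqid	= cpu_to_le16(qid);
	c.create_cq.qsize	= cpu_to_le16(nvmeq->q_depth - 1);
	c.create_cq.cq_flags	= cpu_to_le16(NVME_QUEUE_PHYS_CONTIG);

	return nvme_submit_sync_cmd(dev->ctrl.admin_q, &c, NULL, 0);
}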
> 
> Would you be open to having blk-mq provide special hi-pri hardware contexts
> for all these requests to come through? Maybe one per NUMA node? If not,
> I don't think we have enough unused bits in the NVMe command id to stash
> the hctx id to extract the original request.

Yeah, in fact I think we HAVE to do it this way. I've been thinking about
this for a while, and ideally I'd really like blk-mq to support multiple
queue "pools". It's basically just a mapping thing. Right now you hand
blk-mq all your queues, and the mappings are defined for one set of
queues. It'd be nifty to support multiple sets, so we could do things
like "reserve X for polling", for example, and just have the mappings
magically work. blk_mq_map_queue() then just needs to take the bio
or request (or just cmd flags) to be able to decide what set the
request belongs to, making the mapping a function of {cpu,type}.
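As a purely hypothetical sketch of that shape (every name below is invented,
nothing like this exists today), the lookup would take a per-type set of
CPU-to-hctx tables plus a "wants polling" bit derived from the bio/request
flags:

/* Hypothetical: one CPU -> hardware-queue table per queue set ("pool"). */
enum example_hctx_type {
	EXAMPLE_HCTX_DEFAULT,	/* normal interrupt-driven queues */
	EXAMPLE_HCTX_POLL,	/* queues reserved for polled I/O */
	EXAMPLE_HCTX_NR_TYPES,
};

struct example_queue_map {
	unsigned int *cpu_to_hctx;	/* indexed by raw CPU number */
	unsigned int nr_queues;
};

/* The mapping becomes a function of {cpu, type}, as described above. */
static inline unsigned int
example_map_queue(const struct example_queue_map maps[EXAMPLE_HCTX_NR_TYPES],
		  bool polled, unsigned int cpu)
{
	enum example_hctx_type type = polled ? EXAMPLE_HCTX_POLL
					     : EXAMPLE_HCTX_DEFAULT;

	return maps[type].cpu_to_hctx[cpu];
}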

I originally played with this in the context of isolating writes on
a single queue, to reduce the amount of resources they can grab. And
it'd work nicely on this as well.

Completions could be configurable to where the submitter would do it
(like now, best for sync single thread), or to where you have one or more
kernel threads doing it (spdk'ish, best for high qd / thread count).

-- 
Jens Axboe



Re: Regression with a0747a859ef6 ("bdi: add error handle for bdi_debug_register")

2017-12-22 Thread Bruno Wolff III

On Fri, Dec 22, 2017 at 21:20:10 +0800,
 weiping zhang  wrote:
> 2017-12-22 12:53 GMT+08:00 Bruno Wolff III :
>> On Thu, Dec 21, 2017 at 17:16:03 -0600,
>>  Bruno Wolff III  wrote:
>>>
>>> Enforcing mode alone isn't enough as I tested that on one machine at home
>>> and it didn't trigger the problem. I'll try another machine late tonight.
>>
>> I got the problem to occur on my i686 machine when booting in enforcing
>> mode. This machine uses raid 1 via mdraid, which may or may not be a factor
>> in this problem. The boot log has a trace at the end and might be helpful,
>> so I'm attaching it here.
>
> Hi Bruno,
> I can reproduce this issue in my QEMU test VM easily, just add a soft
> RAID1; it always triggers that warning. I'll debug it later.

Great. When you have a fix, I can test it.


Re: Regression with a0747a859ef6 ("bdi: add error handle for bdi_debug_register")

2017-12-22 Thread weiping zhang
2017-12-22 12:53 GMT+08:00 Bruno Wolff III :
> On Thu, Dec 21, 2017 at 17:16:03 -0600,
>  Bruno Wolff III  wrote:
>>
>>
>> Enforcing mode alone isn't enough as I tested that on one machine at home
>> and it didn't trigger the problem. I'll try another machine late tonight.
>
>
> I got the problem to occur on my i686 machine when booting in enforcing
> mode. This machine uses raid 1 via mdraid, which may or may not be a factor
> in this problem. The boot log has a trace at the end and might be helpful,
> so I'm attaching it here.
Hi Bruno,
I can reproduce this issue in my QEMU test VM easily, just add a soft
RAID1; it always triggers that warning. I'll debug it later.


regression 4.15-rc: kernel oops in dm-multipath

2017-12-22 Thread Christian Borntraeger
Since 4.15-rc1 I get the following during boot relatively often (but not 100%
reproducible).


Seems to be 2 oopses...


"[5.851954] device-mapper: multipath service-time: version 0.3.0 loaded
"[5.902244] Unable to handle kernel pointer dereference in virtual kernel 
address space
"[5.902272] Failing address: 03ff82196000 TEID: 03ff82196803
"[5.902275] Fault in home space mode while using kernel ASCE.
"[5.902283] AS:0135c007 R3:0002105e0007 S:0020 
"[5.902390] Oops: 0010 ilc:3 [#1] SMP 
"[5.902437] Modules linked in: dm_service_time mlx4_ib mlx4_en ptp ib_core 
pp
"s_core ghash_s390 prng aes_s390 des_s390 des_generic sha512_s390 sha256_s390 
sha
"1_s390 sha_common mlx4_core eadm_sch dm_multipath dm_mod zcrypt_cex4 zcrypt 
rng_
"core
"[5.902818] Unable to handle kernel pointer dereference in virtual kernel 
address space
"[5.902829] Failing address: 03ff8218e000 TEID: 03ff8218e803
"[5.902840] Fault in home space mode while using 
"[5.902867]  vhost_net sch_fq_codel tun
"[5.902899] kernel 
"[5.902917]  vhost tap ip_tables
"[5.902940] ASCE.
"[5.902955] AS:0135c007 R3:0002105e0007 
"[5.902972]  x_tables autofs4
"[5.902987] S:0020 
"[5.903012] CPU: 0 PID: 742 Comm: systemd-udevd Not tainted 4.15.0-rc3+ #11
"[5.903024] Hardware name: IBM 2964 NC9 704 (LPAR)
"[5.903035] Krnl PSW : 47407382 702c2011 
(multipath_busy+0x9a
"/0x128 [dm_multipath])
"[5.903085]R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:3 PM:0 
RI: 0 EA:3
"[5.903112] Krnl GPRS: 0001 03ff82195a72  

"[5.903133]03ff800cff9c  0800 
0001fa508730
"[5.903154]0001f1f48000 03e0 0001f808c030 
0001e76afb00
"[5.903173]0001f1f48000 0001f89efc58 0001f89efa08 
0001f89ef9c8
"[5.903191] Krnl Code: 03ff800f4e30: e310b024lg  
%r1,32(%r11)
"[5.903191]03ff800f4e36: e3101004lg  
%r1,0(%r1)
"[5.903191]   #03ff800f4e3c: e3101114lg  
%r1,272(%r1)
"[5.903191]   >03ff800f4e42: e32016980004lg  
%r2,1688(%r1)
"[5.903191]03ff800f4e48: c0e5f972brasl   
%r14,3ff800f412c
"[5.903191]03ff800f4e4e: ec28000d007ecij 
%r2,0,8,3ff800f4e68
"[5.903191]03ff800f4e54: a7180001lhi %r1,1
"[5.903191]03ff800f4e58: e3b0b004lg  
%r11,0(%r11)
"[5.903308] Call Trace:
"[5.903319] ([<0001f89ef9c0>] 0x1f89ef9c0)
"[5.903342]  [<03ff800cff3e>] dm_old_request_fn+0x56/0x1d0 [dm_mod] 
"[5.903367]  [<00734f66>] __blk_run_queue+0x86/0x108 
"[5.903385]  [<00736132>] queue_unplugged+0x8a/0x200 
"[5.903404]  [<0073ca0c>] blk_flush_plug_list+0x284/0x2f0 
"[5.903417]  [<0073d234>] blk_finish_plug+0x3c/0x60 
"[5.903426]  [<00313dd8>] __do_page_cache_readahead+0x2e8/0x3d0 
"[5.903441]  [<00314512>] force_page_cache_readahead+0xb2/0x150 
"[5.903454]  [<002ff1f0>] generic_file_read_iter+0x6b0/0xa28 
"[5.903477]  [<003b7e98>] __vfs_read+0x100/0x178 
"[5.903490]  [<003b7f9a>] vfs_read+0x8a/0x148 
"[5.903506]  [<003b864e>] SyS_read+0x66/0xd8 
"[5.903520]  [<00ae9144>] system_call+0x290/0x2b0 
"[5.903523] INFO: lockdep is turned off.
"[5.903527] Last Breaking-Event-Address:
"[5.903541]  [<03ff800f4e18>] multipath_busy+0x70/0x128 [dm_multipath]
"[5.903552]  
"[5.903562] Oops: 0010 ilc:3 [#2] 
"[5.903566] Kernel panic - not syncing: Fatal exception: panic_on_oops



The faulting code seems to be

list_for_each_entry(pgpath, &pg->pgpaths, list) {
 854:   e3 b0 b0 00 00 04   lg  %r11,0(%r11)
 85a:   ec ba 00 21 80 64   cgrje   %r11,%r10,89c 

if (pgpath->is_active) {
 860:   91 80 b0 f8 tm  248(%r11),128
 864:   a7 84 ff f8 je  854 
struct request_queue *q = bdev_get_queue(pgpath->path.dev->bdev);
 868:   e3 10 b0 20 00 04   lg  %r1,32(%r11)

bool blk_poll(struct request_queue *q, blk_qc_t cookie);

static inline struct request_queue *bdev_get_queue(struct block_device *bdev)
{
return bdev->bd_disk->queue;/* this is never NULL */
 86e:   e3 10 10 00 00 04   lg  %r1,0(%r1)
 874:   e3 10 11 10 00 04   lg  %r1,272(%r1) 
return blk_lld_busy(q);
 87a:   e3 20 16 98 00 04   lg  %r2,1688(%r1)
 880:   c0 e5 00 00 00 00   brasl   %r14,880 




any quick ideas?