Re: [dm-devel] [PATCH 1/1] block: Convert hd_struct in_flight from atomic to percpu

2017-06-30 Thread Jens Axboe
On 06/30/2017 10:17 PM, Jens Axboe wrote:
> On 06/30/2017 08:08 AM, Jens Axboe wrote:
>> On 06/30/2017 07:05 AM, Brian King wrote:
>>> On 06/29/2017 09:17 PM, Jens Axboe wrote:
 On 06/29/2017 07:20 PM, Ming Lei wrote:
> On Fri, Jun 30, 2017 at 2:42 AM, Jens Axboe  wrote:
>> On 06/29/2017 10:00 AM, Jens Axboe wrote:
>>> On 06/29/2017 09:58 AM, Jens Axboe wrote:
 On 06/29/2017 02:40 AM, Ming Lei wrote:
> On Thu, Jun 29, 2017 at 5:49 AM, Jens Axboe  wrote:
>> On 06/28/2017 03:12 PM, Brian King wrote:
>>> This patch converts the in_flight counter in struct hd_struct from a
>>> pair of atomics to a pair of percpu counters. This eliminates a couple
>>> of atomics from the hot path. When running this on a Power system, to
>>> a single null_blk device with 80 submission queues, irq mode 0, with
>>> 80 fio jobs, I saw IOPs go from 1.5M IO/s to 11.4 IO/s.
>>
>> This has been done before, but I've never really liked it. The reason is
>> that it means that reading the part stat inflight count now has to
>> iterate over every possible CPU. Did you use partitions in your testing?
>> How many CPUs were configured? When I last tested this a few years ago
>> on even a quad core nehalem (which is notoriously shitty for cross-node
>> latencies), it was a net loss.
>
> One year ago, I saw null_blk's IOPS can be decreased to 10%
> of non-RQF_IO_STAT on a dual socket ARM64(each CPU has
> 96 cores, and dual numa nodes) too, the performance can be
> recovered basically if per numa-node counter is introduced and
> used in this case, but the patch was never posted out.
> If anyone is interested in that, I can rebase the patch on current
> block tree and post out. I guess the performance issue might be
> related with system cache coherency implementation more or less.
> This issue on ARM64 can be observed with the following userspace
> atomic counting test too:
>
>http://kernel.ubuntu.com/~ming/test/cache/

 How well did the per-node thing work? Doesn't seem to me like it would
 go far enough. And per CPU is too much. One potential improvement would
 be to change the part_stat_read() to just loop online CPUs, instead of
 all possible CPUs. When CPUs go on/offline, use that as the slow path to
 ensure the stats are sane. Often there's a huge difference between
 NR_CPUS configured and what the system has. As Brian states, RH ships
 with 2048, while I doubt a lot of customers actually run that...

 Outside of coming up with a more clever data structure that is fully
 CPU topology aware, one thing that could work is just having X cache
 line separated read/write inflight counters per node, where X is some
 suitable value (like 4). That prevents us from having cross node
 traffic, and it also keeps the cross cpu traffic fairly low. That should
 provide a nice balance between cost of incrementing the inflight
 counting, and the cost of looping for reading it.
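
 [To make that layout concrete, here is a rough, self-contained C sketch of
 the scheme described above: X cache-line separated read/write in-flight
 counters per node, summed on the (rare) read side. The node count, names,
 and CPU-to-slot hashing are all assumptions for illustration; this is not
 code from any posted patch.]

/*
 * Illustrative sketch only: X cache-line separated in-flight counters per
 * NUMA node (X = 4 here, as suggested above).  Topology and names are
 * assumptions for the example, not kernel code.
 */
#include <stdatomic.h>
#include <stdio.h>

#define NR_NODES          4	/* assumed number of NUMA nodes */
#define COUNTERS_PER_NODE 4	/* the "X" from the description above */

struct inflight_slot {
	_Alignas(64) atomic_long inflight[2];	/* [0] = read, [1] = write */
};

struct part_inflight {
	struct inflight_slot slot[NR_NODES][COUNTERS_PER_NODE];
};

/* inc/dec always touch a slot on the local node, picked by CPU id */
static void inflight_inc(struct part_inflight *p, int node, int cpu, int rw)
{
	atomic_fetch_add(&p->slot[node][cpu % COUNTERS_PER_NODE].inflight[rw], 1);
}

static void inflight_dec(struct part_inflight *p, int node, int cpu, int rw)
{
	atomic_fetch_sub(&p->slot[node][cpu % COUNTERS_PER_NODE].inflight[rw], 1);
}

/* the read side loops NR_NODES * X counters, far fewer than NR_CPUS */
static long inflight_read(struct part_inflight *p, int rw)
{
	long sum = 0;

	for (int n = 0; n < NR_NODES; n++)
		for (int x = 0; x < COUNTERS_PER_NODE; x++)
			sum += atomic_load(&p->slot[n][x].inflight[rw]);
	return sum;
}

int main(void)
{
	static struct part_inflight pi;

	inflight_inc(&pi, 0, 3, 0);	/* read started on node 0, cpu 3 */
	inflight_inc(&pi, 1, 70, 1);	/* write started on node 1, cpu 70 */
	inflight_dec(&pi, 0, 3, 0);	/* the read completed */
	printf("in flight: %ld reads, %ld writes\n",
	       inflight_read(&pi, 0), inflight_read(&pi, 1));
	return 0;
}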

 And that brings me to the next part...

>> I do agree that we should do something about it, and it's one of those
>> items I've highlighted in talks about blk-mq on pending issues to fix
>> up. It's just not great as it currently stands, but I don't think per
>> CPU counters is the right way to fix it, at least not for the inflight
>> counter.
>
> Yeah, it won't be a issue for non-mq path, and for blk-mq path, maybe
> we can use some blk-mq knowledge(tagset?) to figure out the
> 'in_flight' counter. I thought about it before, but never got a
> perfect solution, and looks it is a bit hard, :-)

 The tags are already a bit spread out, so it's worth a shot. That would
 remove the need to do anything in the inc/dec path, as the tags already
 do that. The inflight count could be easily retrieved with
 sbitmap_weight(). The only issue here is that we need separate read and
 write counters, and the weight would obviously only get us the total
 count. But we can have a slower path for that, just iterate the tags and
 count them. The fast path only cares about total count.

 Let me try that out real quick.
>>>
>>> Well, that only works for whole disk stats, of course... There's no way
>>> around iterating the tags and checking for this to truly work.
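
 [As a toy, self-contained illustration of the idea being sketched here --
 not Jens' actual proof of concept: the tag bitmap itself already tracks
 what is in flight, so the whole-disk count is just a bitmap weight (what
 sbitmap_weight() would give in the kernel), while the read/write split or
 a per-partition count needs the slower walk over allocated tags. The
 structures below are invented for the example.]

/*
 * Toy model of tag-based in-flight accounting: a bit set in the tag bitmap
 * means that tag's request is in flight.  Names and layout are made up.
 */
#include <stdbool.h>
#include <stdio.h>

#define NR_TAGS 64

struct toy_request {
	bool is_write;
	int  partno;		/* partition the request was accounted to */
};

struct toy_tagset {
	unsigned long long bitmap;	/* bit set => tag allocated */
	struct toy_request rq[NR_TAGS];
};

/* fast path: total in-flight for the whole disk is just a popcount */
static int tags_in_flight(const struct toy_tagset *t)
{
	return __builtin_popcountll(t->bitmap);
}

/* slow path: walk allocated tags to split by direction and partition */
static void tags_in_flight_rw(const struct toy_tagset *t, int partno,
			      int *reads, int *writes)
{
	*reads = *writes = 0;
	for (int tag = 0; tag < NR_TAGS; tag++) {
		if (!(t->bitmap & (1ULL << tag)))
			continue;
		if (t->rq[tag].partno != partno)
			continue;
		if (t->rq[tag].is_write)
			(*writes)++;
		else
			(*reads)++;
	}
}

int main(void)
{
	struct toy_tagset t = { 0 };
	int r, w;

	t.bitmap = (1ULL << 3) | (1ULL << 17);
	t.rq[3]  = (struct toy_request){ .is_write = false, .partno = 0 };
	t.rq[17] = (struct toy_request){ .is_write = true,  .partno = 0 };

	tags_in_flight_rw(&t, 0, &r, &w);
	printf("total=%d reads=%d writes=%d\n", tags_in_flight(&t), r, w);
	return 0;
}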
>>
>> Totally untested proof of concept for using the tags for this. I based
>> this on top of Brian's patch, 

Re: [dm-devel] [PATCH 1/1] block: Convert hd_struct in_flight from atomic to percpu

2017-06-30 Thread Jens Axboe
On 06/30/2017 08:08 AM, Jens Axboe wrote:
> On 06/30/2017 07:05 AM, Brian King wrote:
>> On 06/29/2017 09:17 PM, Jens Axboe wrote:
>>> On 06/29/2017 07:20 PM, Ming Lei wrote:
 On Fri, Jun 30, 2017 at 2:42 AM, Jens Axboe  wrote:
> On 06/29/2017 10:00 AM, Jens Axboe wrote:
>> On 06/29/2017 09:58 AM, Jens Axboe wrote:
>>> On 06/29/2017 02:40 AM, Ming Lei wrote:
 On Thu, Jun 29, 2017 at 5:49 AM, Jens Axboe  wrote:
> On 06/28/2017 03:12 PM, Brian King wrote:
>> This patch converts the in_flight counter in struct hd_struct from a
>> pair of atomics to a pair of percpu counters. This eliminates a 
>> couple
>> of atomics from the hot path. When running this on a Power system, to
>> a single null_blk device with 80 submission queues, irq mode 0, with
>> 80 fio jobs, I saw IOPs go from 1.5M IO/s to 11.4 IO/s.
>
> This has been done before, but I've never really liked it. The reason 
> is
> that it means that reading the part stat inflight count now has to
> iterate over every possible CPU. Did you use partitions in your 
> testing?
> How many CPUs were configured? When I last tested this a few years ago
> on even a quad core nehalem (which is notoriously shitty for 
> cross-node
> latencies), it was a net loss.

 One year ago, I saw null_blk's IOPS can be decreased to 10%
 of non-RQF_IO_STAT on a dual socket ARM64(each CPU has
 96 cores, and dual numa nodes) too, the performance can be
 recovered basically if per numa-node counter is introduced and
 used in this case, but the patch was never posted out.
 If anyone is interested in that, I can rebase the patch on current
 block tree and post out. I guess the performance issue might be
 related with system cache coherency implementation more or less.
 This issue on ARM64 can be observed with the following userspace
 atomic counting test too:

http://kernel.ubuntu.com/~ming/test/cache/
>>>
>>> How well did the per-node thing work? Doesn't seem to me like it would
>>> go far enough. And per CPU is too much. One potential improvement would
>>> be to change the part_stat_read() to just loop online CPUs, instead of
>>> all possible CPUs. When CPUs go on/offline, use that as the slow path to
>>> ensure the stats are sane. Often there's a huge difference between
>>> NR_CPUS configured and what the system has. As Brian states, RH ships
>>> with 2048, while I doubt a lot of customers actually run that...
>>>
>>> Outside of coming up with a more clever data structure that is fully
>>> CPU topology aware, one thing that could work is just having X cache
>>> line separated read/write inflight counters per node, where X is some
>>> suitable value (like 4). That prevents us from having cross node
>>> traffic, and it also keeps the cross cpu traffic fairly low. That should
>>> provide a nice balance between cost of incrementing the inflight
>>> counting, and the cost of looping for reading it.
>>>
>>> And that brings me to the next part...
>>>
> I do agree that we should do something about it, and it's one of those
> items I've highlighted in talks about blk-mq on pending issues to fix
> up. It's just not great as it currently stands, but I don't think per
> CPU counters is the right way to fix it, at least not for the inflight
> counter.

 Yeah, it won't be a issue for non-mq path, and for blk-mq path, maybe
 we can use some blk-mq knowledge(tagset?) to figure out the
 'in_flight' counter. I thought about it before, but never got a
 perfect solution, and looks it is a bit hard, :-)
>>>
>>> The tags are already a bit spread out, so it's worth a shot. That would
>>> remove the need to do anything in the inc/dec path, as the tags already
>>> do that. The inflight count could be easily retrieved with
>>> sbitmap_weight(). The only issue here is that we need separate read and
>>> write counters, and the weight would obviously only get us the total
>>> count. But we can have a slower path for that, just iterate the tags and
>>> count them. The fast path only cares about total count.
>>>
>>> Let me try that out real quick.
>>
>> Well, that only works for whole disk stats, of course... There's no way
>> around iterating the tags and checking for this to truly work.
>
> Totally untested proof of concept for using the tags for this. I based
> this on top of Brian's patch, so it includes his patch plus the
> _double() stuff I did which is no longer really needed.
>
>
> diff --git a/block/bio.c b/block/bio.c
> index 9cf98b29588a..ec99d9ba0f33 100644
> 

Re: [dm-devel] [PATCH 1/1] block: Convert hd_struct in_flight from atomic to percpu

2017-06-30 Thread Brian King
On 06/30/2017 06:26 PM, Jens Axboe wrote:
> On 06/30/2017 05:23 PM, Ming Lei wrote:
>> Hi Brian,
>>
>> On Sat, Jul 1, 2017 at 2:33 AM, Brian King  wrote:
>>> On 06/30/2017 09:08 AM, Jens Axboe wrote:
>>> Compared with the totally percpu approach, this way might help 1:M or
>>> N:M mapping, but won't help 1:1 map(NVMe), when hctx is mapped to
>>> each CPU(especially there are huge hw queues on a big system), :-(
>>
>> Not disagreeing with that, without having some mechanism to only
>> loop queues that have pending requests. That would be similar to the
>> ctx_map for sw to hw queues. But I don't think that would be worthwhile
>> doing, I like your pnode approach better. However, I'm still not fully
>> convinced that one per node is enough to get the scalability we need.
>>
>> Would be great if Brian could re-test with your updated patch, so we
>> know how it works for him at least.
>
> I'll try running with both approaches today and see how they compare.

 Focus on Ming's, a variant of that is the most likely path forward,
 imho. It'd be great to do a quick run on mine as well, just to establish
 how it compares to mainline, though.
>>>
>>> On my initial runs, the one from you Jens, appears to perform a bit
>>> better, although both are a huge improvement from what I was seeing before.
>>>
>>> I ran 4k random reads using fio to nullblk in two configurations on my
>>> 20 core system with 4 NUMA nodes and 4-way SMT, so 80 logical CPUs. I ran
>>> both 80 threads to a single null_blk as well as 80 threads to 80
>>> null_block devices, so one thread
>>
>> Could you share what the '80 null_block devices' is?  It means you create
>> 80 null_blk devices? Or you create one null_blk and make its hw queue
>> number as 80 via the module parameter "submit_queues"?
> 
> That's a valid question, was going to ask that too. But I assumed that Brian
> used submit_queues to set as many queues as he has logical CPUs in the system.
>>
>> I guess we should focus on multi-queue case since it is the normal way of 
>> NVMe.
>>
>>> per null_blk. This is what I saw on this machine:
>>>
>>> Using the Per node atomic change from Ming Lei
>>> 1 null_blk, 80 threads
>>> iops=9376.5K
>>>
>>> 80 null_blk, 1 thread
>>> iops=9523.5K
>>>
>>>
>>> Using the alternate patch from Jens using the tags
>>> 1 null_blk, 80 threads
>>> iops=9725.8K
>>>
>>> 80 null_blk, 1 thread
>>> iops=9569.4K
>>
>> If 1 thread means single fio job, looks the number is too too high, that
>> means one random IO can complete in about 0.1us(100ns) on single CPU, not
>> sure if it is possible, :-)
> 
> It means either 1 null_blk device, 80 threads running IO to it. Or 80 null_blk
> devices, each with a thread running IO to it. See above, he details that it's
> 80 threads on 80 devices for that case.

Right. So the two modes I'm running in are:

1. 1 null_blk device, with 80 submit_queues, 80 fio jobs, 80 logical CPUs.
2. 80 null_blk devices, each with one submit_queue, with one fio job per
   null_blk device, so 80 threads total. 80 logical CPUs.

In theory, the two should result in similar numbers. 

Here are the commands and fio configurations:

Scenario #1 (1 null_blk device, 80 submit_queues, 80 fio jobs)
modprobe null_blk submit_queues=80 nr_devices=1 irqmode=0

FIO config:
[global]
buffered=0
invalidate=1
bs=4k
iodepth=64
numjobs=80
group_reporting=1
rw=randrw
rwmixread=100
rwmixwrite=0
ioengine=libaio
runtime=60
time_based

[job1]
filename=/dev/nullb0


Scenario #2 (80 null_blk devices, 1 submit_queue each, 1 fio job per device)
modprobe null_blk submit_queues=1 nr_devices=80 irqmode=0

FIO config
[global]
buffered=0
invalidate=1
bs=4k
iodepth=64
numjobs=1
group_reporting=1
rw=randrw
rwmixread=100
rwmixwrite=0
ioengine=libaio
runtime=60
time_based

[job1]
filename=/dev/nullb0
[job2]
filename=/dev/nullb1
[job3]
filename=/dev/nullb2
[job4]
filename=/dev/nullb3
[job5]
filename=/dev/nullb4
[job6]
filename=/dev/nullb5
[job7]
filename=/dev/nullb6
[job8]
filename=/dev/nullb7
[job9]
filename=/dev/nullb8
[job10]
filename=/dev/nullb9
[job11]
filename=/dev/nullb10
[job12]
filename=/dev/nullb11
[job13]
filename=/dev/nullb12
[job14]
filename=/dev/nullb13
[job15]
filename=/dev/nullb14
[job16]
filename=/dev/nullb15
[job17]
filename=/dev/nullb16
[job18]
filename=/dev/nullb17
[job19]
filename=/dev/nullb18
[job20]
filename=/dev/nullb19
[job21]
filename=/dev/nullb20
[job22]
filename=/dev/nullb21
[job23]
filename=/dev/nullb22
[job24]
filename=/dev/nullb23
[job25]
filename=/dev/nullb24
[job26]
filename=/dev/nullb25
[job27]
filename=/dev/nullb26
[job28]
filename=/dev/nullb27
[job29]
filename=/dev/nullb28
[job30]
filename=/dev/nullb29
[job31]
filename=/dev/nullb30
[job32]
filename=/dev/nullb31
[job33]
filename=/dev/nullb32
[job34]
filename=/dev/nullb33
[job35]
filename=/dev/nullb34
[job36]
filename=/dev/nullb35
[job37]
filename=/dev/nullb36
[job38]
filename=/dev/nullb37
[job39]
filename=/dev/nullb38
[job40]
filename=/dev/nullb39
[job41]
filename=/dev/nullb40
[job42]

Re: [dm-devel] [PATCH 1/1] block: Convert hd_struct in_flight from atomic to percpu

2017-06-30 Thread Jens Axboe
On 06/30/2017 05:23 PM, Ming Lei wrote:
> Hi Brian,
> 
> On Sat, Jul 1, 2017 at 2:33 AM, Brian King  wrote:
>> On 06/30/2017 09:08 AM, Jens Axboe wrote:
>> Compared with the totally percpu approach, this way might help 1:M or
>> N:M mapping, but won't help 1:1 map(NVMe), when hctx is mapped to
>> each CPU(especially there are huge hw queues on a big system), :-(
>
> Not disagreeing with that, without having some mechanism to only
> loop queues that have pending requests. That would be similar to the
> ctx_map for sw to hw queues. But I don't think that would be worthwhile
> doing, I like your pnode approach better. However, I'm still not fully
> convinced that one per node is enough to get the scalability we need.
>
> Would be great if Brian could re-test with your updated patch, so we
> know how it works for him at least.

 I'll try running with both approaches today and see how they compare.
>>>
>>> Focus on Ming's, a variant of that is the most likely path forward,
>>> imho. It'd be great to do a quick run on mine as well, just to establish
>>> how it compares to mainline, though.
>>
>> On my initial runs, the one from you Jens, appears to perform a bit better, 
>> although
>> both are a huge improvement from what I was seeing before.
>>
>> I ran 4k random reads using fio to nullblk in two configurations on my 20 
>> core
>> system with 4 NUMA nodes and 4-way SMT, so 80 logical CPUs. I ran both 80 
>> threads
>> to a single null_blk as well as 80 threads to 80 null_block devices, so one 
>> thread
> 
> Could you share what the '80 null_block devices' is?  It means you
> create 80 null_blk
> devices? Or you create one null_blk and make its hw queue number as 80
> via module
> parameter of ''submit_queues"?

That's a valid question, was going to ask that too. But I assumed that Brian
used submit_queues to set as many queues as he has logical CPUs in the system.
> 
> I guess we should focus on multi-queue case since it is the normal way of 
> NVMe.
> 
>> per null_blk. This is what I saw on this machine:
>>
>> Using the Per node atomic change from Ming Lei
>> 1 null_blk, 80 threads
>> iops=9376.5K
>>
>> 80 null_blk, 1 thread
>> iops=9523.5K
>>
>>
>> Using the alternate patch from Jens using the tags
>> 1 null_blk, 80 threads
>> iops=9725.8K
>>
>> 80 null_blk, 1 thread
>> iops=9569.4K
> 
> If 1 thread means single fio job, looks the number is too too high, that means
> one random IO can complete in about 0.1us(100ns) on single CPU, not sure if it
> is possible, :-)

It means either 1 null_blk device, 80 threads running IO to it. Or 80 null_blk
devices, each with a thread running IO to it. See above, he details that it's
80 threads on 80 devices for that case.

-- 
Jens Axboe

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH 1/1] block: Convert hd_struct in_flight from atomic to percpu

2017-06-30 Thread Brian King
On 06/30/2017 09:08 AM, Jens Axboe wrote:
 Compared with the totally percpu approach, this way might help 1:M or
 N:M mapping, but won't help 1:1 map(NVMe), when hctx is mapped to
 each CPU(especially there are huge hw queues on a big system), :-(
>>>
>>> Not disagreeing with that, without having some mechanism to only
>>> loop queues that have pending requests. That would be similar to the
>>> ctx_map for sw to hw queues. But I don't think that would be worthwhile
>>> doing, I like your pnode approach better. However, I'm still not fully
>>> convinced that one per node is enough to get the scalability we need.
>>>
>>> Would be great if Brian could re-test with your updated patch, so we
>>> know how it works for him at least.
>>
>> I'll try running with both approaches today and see how they compare.
> 
> Focus on Ming's, a variant of that is the most likely path forward,
> imho. It'd be great to do a quick run on mine as well, just to establish
> how it compares to mainline, though.

On my initial runs, the one from you Jens, appears to perform a bit better,
although both are a huge improvement from what I was seeing before.

I ran 4k random reads using fio to nullblk in two configurations on my 20 core
system with 4 NUMA nodes and 4-way SMT, so 80 logical CPUs. I ran both 80
threads to a single null_blk as well as 80 threads to 80 null_block devices,
so one thread per null_blk. This is what I saw on this machine:

Using the Per node atomic change from Ming Lei
1 null_blk, 80 threads
iops=9376.5K

80 null_blk, 1 thread
iops=9523.5K


Using the alternate patch from Jens using the tags
1 null_blk, 80 threads
iops=9725.8K

80 null_blk, 1 thread
iops=9569.4K

It's interesting that with this change the single device, 80 threads scenario
actually got better than the 80 null_blk scenario. I'll try on a larger machine
as well. I've got a 32 core machine I can try this on too. Next week I can
work with our performance team on running this on a system with a bunch of nvme
devices so we can then test the disk partition case as well and see if there is
any noticeable overhead.

Thanks,

Brian


-- 
Brian King
Power Linux I/O
IBM Linux Technology Center

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH 2/2] multipath: attempt at common multipath.rules

2017-06-30 Thread Benjamin Marzinski
On Tue, Jun 27, 2017 at 11:41:32PM +0200, Martin Wilck wrote:
> Hello Ben,
> 
> Thanks again for your effort on this issue.
> I finally found time to look into this proposal more deeply, sorry that
> it took so long.
> 
> Please find my comments below.
> 
> Best regards,
> Martin
> 
> On Wed, 2017-04-12 at 17:15 -0500, Benjamin Marzinski wrote:
> > This is a proposal to try and bring the Redhat and SuSE multipath.rules
> > closer. There are a couple of changes that I'd like some input on.
> > 
> > The big change is moving the kpartx call into the multipath rules.  Half
> > of the current kpartx.rules file is about creating symlinks for multiple
> > types of dm devices. The other half auto-creates kpartx devices on top
> > of multipath devices. Since it is only creating kpartx devices on top of
> > multipath devices, I've moved these rules into multipath.rules, or
> > rather, I've replaced them with the redhat rules in multipath.rules.
> 
> I don't like the move of the rules too much, for two reasons:
> 
> 1) Even if multipath is the only use case today, I see no deeper reason
> why kpartx would be used only for multipath devices. It could be used
> to create partitions on top of other dm targets as well.
> 
> 2) multipath.rules now consists of two completely unrelated parts, one
> for detecting paths that are part of multipath maps, and one for
> running kpartx on the maps themselves. That's confusing and sort of
> against the "spirit" of udev rules, as far as I understand it.
> Logically, if you really want to merge this into another .rules file,
> the kpartx part would rather belong into 11-dm-mpath.rules (which deals
> with multipath maps) than in 56-multipath.rules (which deals with
> paths).

That's a fair complaint.  I'm not sure if the kpartx rules belong in
11-dm-mpath.rules. That works a lot like 11-dm-lvm.rules. Both are just
setting up flags for the other udev rules to use.  Perhaps the best
answer is to move what is the kpartx part of my multipath.rules to
kpartx.rules, and move the generic dm device labelling stuff from
kpartx.rules to some other rules file.
 
> > The biggest difference is the kpartx isn't run on every reload.  It works
> > with the 11-dm-mpath.rules code to not run kpartx on multipathd
> > generated reloads or when there aren't any working paths. It does
> > remember if it didn't get to run kpartx when it was supposed to (because
> > there were no valid paths or the device was suspended) and will make
> > sure to run it on the next possible uevent.
> 
> I like this part, but i have some suggestions, see below.
> 
> > The other change is the redhat multipath rules remove the partition
> > device nodes for devices claimed by multipath. The udev rule will only
> > do this one time (both to keep from running partx on every event, and so
> > that if users manually reread the partition table, we don't keep
> > removing them when clearly they are wanted). Redhat does this because we
> > had multiple customer issues where they were using the scsi partitions
> > instead of the kpartx devices. Obviously, with setting the partition
> > devices to not ready and clearing their fs_type, this isn't essential,
> > but it has helped make customers do the right thing.
> 
> Isn't this racy? You call "partx -d" on the parent device (e.g. sda).
> At this point in time, the kernel will already have detected the
> partitions and "add" uevents for them are likely to follow up quickly.
> You generate "remove" events that may race with the "add" - or am I
> overlooking something?

Yes this is racey. However, when the partition devices do come up, they
are marked with ENV{SYSTEMD_READY}="0" and ENV{ID_FS_TYPE}="none". This
means that systemd isn't going to do anything with them. So the devices
may appear for an instant, and then get pulled, but nothing should
auto-anything on top of them.

 
> I'd feel better if we'd call "partx -d" while processing the "add"
> uevent for the _partitions_, ideally late in that process. That would
> make (almost) sure that "add" was finished when "remove" was generated.
> See below.

My goal was to pull them before they ever appeared at all.  I should
point out that although my way is racey insofar as the partition device
may appear for a brief moment or not. The partition device will always
get removed, regardless of whether the "partx -d" call comes before or
after it appears.

-Ben

> > 
> > Cc: Martin Wilck 
> > Cc: Hannes Reinecke 
> > Signed-off-by: Benjamin Marzinski 
> > ---
> >  kpartx/kpartx.rules   |  8 
> >  multipath/multipath.rules | 27 ---
> >  2 files changed, 24 insertions(+), 11 deletions(-)
> > 
> > diff --git a/kpartx/kpartx.rules b/kpartx/kpartx.rules
> > index 48a4d6c..906e320 100644
> > --- a/kpartx/kpartx.rules
> > +++ b/kpartx/kpartx.rules
> > @@ -34,12 +34,4 @@ ENV{ID_FS_LABEL_ENC}=="?*",
> > 

Re: [dm-devel] [PATCH 1/1] block: Convert hd_struct in_flight from atomic to percpu

2017-06-30 Thread Jens Axboe
On 06/30/2017 07:05 AM, Brian King wrote:
> On 06/29/2017 09:17 PM, Jens Axboe wrote:
>> On 06/29/2017 07:20 PM, Ming Lei wrote:
>>> On Fri, Jun 30, 2017 at 2:42 AM, Jens Axboe  wrote:
 On 06/29/2017 10:00 AM, Jens Axboe wrote:
> On 06/29/2017 09:58 AM, Jens Axboe wrote:
>> On 06/29/2017 02:40 AM, Ming Lei wrote:
>>> On Thu, Jun 29, 2017 at 5:49 AM, Jens Axboe  wrote:
 On 06/28/2017 03:12 PM, Brian King wrote:
> This patch converts the in_flight counter in struct hd_struct from a
> pair of atomics to a pair of percpu counters. This eliminates a couple
> of atomics from the hot path. When running this on a Power system, to
> a single null_blk device with 80 submission queues, irq mode 0, with
> 80 fio jobs, I saw IOPs go from 1.5M IO/s to 11.4 IO/s.

 This has been done before, but I've never really liked it. The reason 
 is
 that it means that reading the part stat inflight count now has to
 iterate over every possible CPU. Did you use partitions in your 
 testing?
 How many CPUs were configured? When I last tested this a few years ago
 on even a quad core nehalem (which is notoriously shitty for cross-node
 latencies), it was a net loss.
>>>
>>> One year ago, I saw null_blk's IOPS can be decreased to 10%
>>> of non-RQF_IO_STAT on a dual socket ARM64(each CPU has
>>> 96 cores, and dual numa nodes) too, the performance can be
>>> recovered basically if per numa-node counter is introduced and
>>> used in this case, but the patch was never posted out.
>>> If anyone is interested in that, I can rebase the patch on current
>>> block tree and post out. I guess the performance issue might be
>>> related with system cache coherency implementation more or less.
>>> This issue on ARM64 can be observed with the following userspace
>>> atomic counting test too:
>>>
>>>http://kernel.ubuntu.com/~ming/test/cache/
>>
>> How well did the per-node thing work? Doesn't seem to me like it would
>> go far enough. And per CPU is too much. One potential improvement would
>> be to change the part_stat_read() to just loop online CPUs, instead of
>> all possible CPUs. When CPUs go on/offline, use that as the slow path to
>> ensure the stats are sane. Often there's a huge difference between
>> NR_CPUS configured and what the system has. As Brian states, RH ships
>> with 2048, while I doubt a lot of customers actually run that...
>>
>> Outside of coming up with a more clever data structure that is fully
>> CPU topology aware, one thing that could work is just having X cache
>> line separated read/write inflight counters per node, where X is some
>> suitable value (like 4). That prevents us from having cross node
>> traffic, and it also keeps the cross cpu traffic fairly low. That should
>> provide a nice balance between cost of incrementing the inflight
>> counting, and the cost of looping for reading it.
>>
>> And that brings me to the next part...
>>
 I do agree that we should do something about it, and it's one of those
 items I've highlighted in talks about blk-mq on pending issues to fix
 up. It's just not great as it currently stands, but I don't think per
 CPU counters is the right way to fix it, at least not for the inflight
 counter.
>>>
>>> Yeah, it won't be a issue for non-mq path, and for blk-mq path, maybe
>>> we can use some blk-mq knowledge(tagset?) to figure out the
>>> 'in_flight' counter. I thought about it before, but never got a
>>> perfect solution, and looks it is a bit hard, :-)
>>
>> The tags are already a bit spread out, so it's worth a shot. That would
>> remove the need to do anything in the inc/dec path, as the tags already
>> do that. The inflight count could be easily retrieved with
>> sbitmap_weight(). The only issue here is that we need separate read and
>> write counters, and the weight would obviously only get us the total
>> count. But we can have a slower path for that, just iterate the tags and
>> count them. The fast path only cares about total count.
>>
>> Let me try that out real quick.
>
> Well, that only works for whole disk stats, of course... There's no way
> around iterating the tags and checking for this to truly work.

 Totally untested proof of concept for using the tags for this. I based
 this on top of Brian's patch, so it includes his patch plus the
 _double() stuff I did which is no longer really needed.


 diff --git a/block/bio.c b/block/bio.c
 index 9cf98b29588a..ec99d9ba0f33 100644
 --- a/block/bio.c
 +++ b/block/bio.c
 @@ -1737,7 +1737,7 @@ void generic_start_io_acct(int rw, unsigned long 
 sectors,
 

[dm-devel] [PATCH v2] dm raid: avoid BUG() in __rdev_sectors()

2017-06-30 Thread heinzm
From: Heinz Mauelshagen 

Avoid the BUG() and catch invalid rdev size in the constructor.

Signed-off-by: Heinz Mauelshagen 
---
 drivers/md/dm-raid.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
index b4b75da..2e10c2f 100644
--- a/drivers/md/dm-raid.c
+++ b/drivers/md/dm-raid.c
@@ -1571,7 +1571,7 @@ static sector_t __rdev_sectors(struct raid_set *rs)
return rdev->sectors;
}
 
-   BUG(); /* Constructor ensures we got some. */
+   return 0;
 }
 
 /* Calculate the sectors per device and per array used for @rs */
@@ -2941,7 +2941,7 @@ static int raid_ctr(struct dm_target *ti, unsigned int 
argc, char **argv)
bool resize;
struct raid_type *rt;
unsigned int num_raid_params, num_raid_devs;
-   sector_t calculated_dev_sectors;
+   sector_t calculated_dev_sectors, rdev_sectors;
struct raid_set *rs = NULL;
const char *arg;
struct rs_layout rs_layout;
@@ -3017,7 +3017,14 @@ static int raid_ctr(struct dm_target *ti, unsigned int 
argc, char **argv)
if (r)
goto bad;
 
-   resize = calculated_dev_sectors != __rdev_sectors(rs);
+   rdev_sectors = __rdev_sectors(rs);
+   if (!rdev_sectors) {
+   ti->error = "Invalid rdev size";
+   r = -EINVAL;
+   goto bad;
+   }
+
+   resize = calculated_dev_sectors != rdev_sectors;
 
	INIT_WORK(&rs->md.event_work, do_table_event);
ti->private = rs;
-- 
2.9.4

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH 1/1] block: Convert hd_struct in_flight from atomic to percpu

2017-06-30 Thread Brian King
On 06/29/2017 09:17 PM, Jens Axboe wrote:
> On 06/29/2017 07:20 PM, Ming Lei wrote:
>> On Fri, Jun 30, 2017 at 2:42 AM, Jens Axboe  wrote:
>>> On 06/29/2017 10:00 AM, Jens Axboe wrote:
 On 06/29/2017 09:58 AM, Jens Axboe wrote:
> On 06/29/2017 02:40 AM, Ming Lei wrote:
>> On Thu, Jun 29, 2017 at 5:49 AM, Jens Axboe  wrote:
>>> On 06/28/2017 03:12 PM, Brian King wrote:
 This patch converts the in_flight counter in struct hd_struct from a
 pair of atomics to a pair of percpu counters. This eliminates a couple
 of atomics from the hot path. When running this on a Power system, to
 a single null_blk device with 80 submission queues, irq mode 0, with
 80 fio jobs, I saw IOPs go from 1.5M IO/s to 11.4 IO/s.
>>>
>>> This has been done before, but I've never really liked it. The reason is
>>> that it means that reading the part stat inflight count now has to
>>> iterate over every possible CPU. Did you use partitions in your testing?
>>> How many CPUs were configured? When I last tested this a few years ago
>>> on even a quad core nehalem (which is notoriously shitty for cross-node
>>> latencies), it was a net loss.
>>
>> One year ago, I saw null_blk's IOPS can be decreased to 10%
>> of non-RQF_IO_STAT on a dual socket ARM64(each CPU has
>> 96 cores, and dual numa nodes) too, the performance can be
>> recovered basically if per numa-node counter is introduced and
>> used in this case, but the patch was never posted out.
>> If anyone is interested in that, I can rebase the patch on current
>> block tree and post out. I guess the performance issue might be
>> related with system cache coherency implementation more or less.
>> This issue on ARM64 can be observed with the following userspace
>> atomic counting test too:
>>
>>http://kernel.ubuntu.com/~ming/test/cache/
>
> How well did the per-node thing work? Doesn't seem to me like it would
> go far enough. And per CPU is too much. One potential improvement would
> be to change the part_stat_read() to just loop online CPUs, instead of
> all possible CPUs. When CPUs go on/offline, use that as the slow path to
> ensure the stats are sane. Often there's a huge difference between
> NR_CPUS configured and what the system has. As Brian states, RH ships
> with 2048, while I doubt a lot of customers actually run that...
>
> Outside of coming up with a more clever data structure that is fully
> CPU topology aware, one thing that could work is just having X cache
> line separated read/write inflight counters per node, where X is some
> suitable value (like 4). That prevents us from having cross node
> traffic, and it also keeps the cross cpu traffic fairly low. That should
> provide a nice balance between cost of incrementing the inflight
> counting, and the cost of looping for reading it.
>
> And that brings me to the next part...
>
>>> I do agree that we should do something about it, and it's one of those
>>> items I've highlighted in talks about blk-mq on pending issues to fix
>>> up. It's just not great as it currently stands, but I don't think per
>>> CPU counters is the right way to fix it, at least not for the inflight
>>> counter.
>>
>> Yeah, it won't be a issue for non-mq path, and for blk-mq path, maybe
>> we can use some blk-mq knowledge(tagset?) to figure out the
>> 'in_flight' counter. I thought about it before, but never got a
>> perfect solution, and looks it is a bit hard, :-)
>
> The tags are already a bit spread out, so it's worth a shot. That would
> remove the need to do anything in the inc/dec path, as the tags already
> do that. The inflight count could be easily retrieved with
> sbitmap_weight(). The only issue here is that we need separate read and
> write counters, and the weight would obviously only get us the total
> count. But we can have a slower path for that, just iterate the tags and
> count them. The fast path only cares about total count.
>
> Let me try that out real quick.

 Well, that only works for whole disk stats, of course... There's no way
 around iterating the tags and checking for this to truly work.
>>>
>>> Totally untested proof of concept for using the tags for this. I based
>>> this on top of Brian's patch, so it includes his patch plus the
>>> _double() stuff I did which is no longer really needed.
>>>
>>>
>>> diff --git a/block/bio.c b/block/bio.c
>>> index 9cf98b29588a..ec99d9ba0f33 100644
>>> --- a/block/bio.c
>>> +++ b/block/bio.c
>>> @@ -1737,7 +1737,7 @@ void generic_start_io_acct(int rw, unsigned long 
>>> sectors,
>>> part_round_stats(cpu, part);
>>> part_stat_inc(cpu, part, ios[rw]);
>>> part_stat_add(cpu, part, sectors[rw], sectors);
>>> -   

Re: [dm-devel] [PATCH 1/1] block: Convert hd_struct in_flight from atomic to percpu

2017-06-30 Thread Ming Lei
On Fri, Jun 30, 2017 at 2:42 AM, Jens Axboe  wrote:
> On 06/29/2017 10:00 AM, Jens Axboe wrote:
>> On 06/29/2017 09:58 AM, Jens Axboe wrote:
>>> On 06/29/2017 02:40 AM, Ming Lei wrote:
 On Thu, Jun 29, 2017 at 5:49 AM, Jens Axboe  wrote:
> On 06/28/2017 03:12 PM, Brian King wrote:
>> This patch converts the in_flight counter in struct hd_struct from a
>> pair of atomics to a pair of percpu counters. This eliminates a couple
>> of atomics from the hot path. When running this on a Power system, to
>> a single null_blk device with 80 submission queues, irq mode 0, with
>> 80 fio jobs, I saw IOPs go from 1.5M IO/s to 11.4 IO/s.
>
> This has been done before, but I've never really liked it. The reason is
> that it means that reading the part stat inflight count now has to
> iterate over every possible CPU. Did you use partitions in your testing?
> How many CPUs were configured? When I last tested this a few years ago
> on even a quad core nehalem (which is notoriously shitty for cross-node
> latencies), it was a net loss.

 One year ago, I saw null_blk's IOPS can be decreased to 10%
 of non-RQF_IO_STAT on a dual socket ARM64(each CPU has
 96 cores, and dual numa nodes) too, the performance can be
 recovered basically if per numa-node counter is introduced and
 used in this case, but the patch was never posted out.
 If anyone is interested in that, I can rebase the patch on current
 block tree and post out. I guess the performance issue might be
 related with system cache coherency implementation more or less.
 This issue on ARM64 can be observed with the following userspace
 atomic counting test too:

http://kernel.ubuntu.com/~ming/test/cache/
>>>
>>> How well did the per-node thing work? Doesn't seem to me like it would
>>> go far enough. And per CPU is too much. One potential improvement would
>>> be to change the part_stat_read() to just loop online CPUs, instead of
>>> all possible CPUs. When CPUs go on/offline, use that as the slow path to
>>> ensure the stats are sane. Often there's a huge difference between
>>> NR_CPUS configured and what the system has. As Brian states, RH ships
>>> with 2048, while I doubt a lot of customers actually run that...
>>>
>>> Outside of coming up with a more clever data structure that is fully
>>> CPU topology aware, one thing that could work is just having X cache
>>> line separated read/write inflight counters per node, where X is some
>>> suitable value (like 4). That prevents us from having cross node
>>> traffic, and it also keeps the cross cpu traffic fairly low. That should
>>> provide a nice balance between cost of incrementing the inflight
>>> counting, and the cost of looping for reading it.
>>>
>>> And that brings me to the next part...
>>>
> I do agree that we should do something about it, and it's one of those
> items I've highlighted in talks about blk-mq on pending issues to fix
> up. It's just not great as it currently stands, but I don't think per
> CPU counters is the right way to fix it, at least not for the inflight
> counter.

 Yeah, it won't be a issue for non-mq path, and for blk-mq path, maybe
 we can use some blk-mq knowledge(tagset?) to figure out the
 'in_flight' counter. I thought about it before, but never got a
 perfect solution, and looks it is a bit hard, :-)
>>>
>>> The tags are already a bit spread out, so it's worth a shot. That would
>>> remove the need to do anything in the inc/dec path, as the tags already
>>> do that. The inlight count could be easily retrieved with
>>> sbitmap_weight(). The only issue here is that we need separate read and
>>> write counters, and the weight would obviously only get us the total
>>> count. But we can have a slower path for that, just iterate the tags and
>>> count them. The fast path only cares about total count.
>>>
>>> Let me try that out real quick.
>>
>> Well, that only works for whole disk stats, of course... There's no way
>> around iterating the tags and checking for this to truly work.
>
> Totally untested proof of concept for using the tags for this. I based
> this on top of Brian's patch, so it includes his patch plus the
> _double() stuff I did which is no longer really needed.
>
>
> diff --git a/block/bio.c b/block/bio.c
> index 9cf98b29588a..ec99d9ba0f33 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1737,7 +1737,7 @@ void generic_start_io_acct(int rw, unsigned long 
> sectors,
> part_round_stats(cpu, part);
> part_stat_inc(cpu, part, ios[rw]);
> part_stat_add(cpu, part, sectors[rw], sectors);
> -   part_inc_in_flight(part, rw);
> +   part_inc_in_flight(cpu, part, rw);
>
> part_stat_unlock();
>  }
> @@ -1751,7 +1751,7 @@ void generic_end_io_acct(int rw, struct hd_struct *part,
>
> part_stat_add(cpu, part, ticks[rw], duration);
> 

Re: [dm-devel] [PATCH 1/1] block: Convert hd_struct in_flight from atomic to percpu

2017-06-30 Thread Ming Lei
On Thu, Jun 29, 2017 at 12:31:05PM -0500, Brian King wrote:
> On 06/29/2017 11:25 AM, Ming Lei wrote:
> > On Thu, Jun 29, 2017 at 11:58 PM, Jens Axboe  wrote:
> >> On 06/29/2017 02:40 AM, Ming Lei wrote:
> >>> On Thu, Jun 29, 2017 at 5:49 AM, Jens Axboe  wrote:
>  On 06/28/2017 03:12 PM, Brian King wrote:
> > This patch converts the in_flight counter in struct hd_struct from a
> > pair of atomics to a pair of percpu counters. This eliminates a couple
> > of atomics from the hot path. When running this on a Power system, to
> > a single null_blk device with 80 submission queues, irq mode 0, with
> > 80 fio jobs, I saw IOPs go from 1.5M IO/s to 11.4 IO/s.
> 
>  This has been done before, but I've never really liked it. The reason is
>  that it means that reading the part stat inflight count now has to
>  iterate over every possible CPU. Did you use partitions in your testing?
>  How many CPUs were configured? When I last tested this a few years ago
>  on even a quad core nehalem (which is notoriously shitty for cross-node
>  latencies), it was a net loss.
> >>>
> >>> One year ago, I saw null_blk's IOPS can be decreased to 10%
> >>> of non-RQF_IO_STAT on a dual socket ARM64(each CPU has
> >>> 96 cores, and dual numa nodes) too, the performance can be
> >>> recovered basically if per numa-node counter is introduced and
> >>> used in this case, but the patch was never posted out.
> >>> If anyone is interested in that, I can rebase the patch on current
> >>> block tree and post out. I guess the performance issue might be
> >>> related with system cache coherency implementation more or less.
> >>> This issue on ARM64 can be observed with the following userspace
> >>> atomic counting test too:
> >>>
> >>>http://kernel.ubuntu.com/~ming/test/cache/
> >>
> >> How well did the per-node thing work? Doesn't seem to me like it would
> > 
> > Last time, on ARM64, I remembered that the IOPS was basically recovered,
> > but now I don't have a such machine to test. Could Brian test the attached 
> > patch
> > to see if it works on big Power machine?
> > 
> > And the idea is simple, just make the atomic counter per-node.
> 
> I tried loading the patch and get an oops right away on boot. Haven't been
> able to debug anything yet. This is on top of an older kernel, so not sure
> if that is the issue or not. I can try upstream and see if I have different
> results...
> 
> Ubuntu 16.04 . . . . Unable to handle kernel paging request 
> for data at address 0xc00031313a333532
> Faulting instruction address: 0xc02552c4
> Oops: Kernel access of bad area, sig: 11 [#1]
> SMP NR_CPUS=1024 NUMA 
> PowerNV
> Modules linked in: dm_round_robin vmx_crypto powernv_rng leds_powernv 
> powernv_op_panel led_class rng_core dm_multipath autofs4 raid10 raid456 
> async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq 
> multipath dm_mirror dm_region_hash dm_log cxlflash bnx2x mdio libcrc32c nvme 
> nvme_core lpfc cxl crc_t10dif crct10dif_generic crct10dif_common
> CPU: 9 PID: 3485 Comm: multipathd Not tainted 4.9.10-dirty #7
> task: c00fba0d task.stack: c00fb8a1c000
> NIP: c02552c4 LR: c0255274 CTR: 
> REGS: c00fb8a1f350 TRAP: 0300   Not tainted  (4.9.10-dirty)
> MSR: 9280b033   CR: 24028848  
> XER: 
> CFAR: c0008a60 DAR: c00031313a333532 DSISR: 4000 SOFTE: 1 
> GPR00: c01f8980 c00fb8a1f5d0 c0f24300 c00fc4007c00 
> GPR04: 024000c0 c02fc0ac 025d c00fc5d346e0 
> GPR08: 000fc50d6000  c00031313a333532 c00fbc836240 
> GPR12: 2200 cff02400 3fff79cfebf0  
> GPR16: c138fd03 01003a12bf70 3fff79ae75e0 3fff79d269e8 
> GPR20: 0001 3fff79cfebf0 01003a393aa0 3fff79d067b8 
> GPR24:  c0b04948 a1ff c02fc0ac 
> GPR28: 024000c0 0007 c02fc0ac c00fc4007c00 
> NIP [c02552c4] __kmalloc_track_caller+0x94/0x2f0
> LR [c0255274] __kmalloc_track_caller+0x44/0x2f0
> Call Trace:
> [c00fb8a1f5d0] [c00fb8a1f610] 0xc00fb8a1f610 (unreliable)
> [c00fb8a1f620] [c01f8980] kstrdup+0x50/0xb0
> [c00fb8a1f660] [c02fc0ac] __kernfs_new_node+0x4c/0x140
> [c00fb8a1f6b0] [c02fd9f4] kernfs_new_node+0x34/0x80
> [c00fb8a1f6e0] [c0300708] kernfs_create_link+0x38/0xf0
> [c00fb8a1f720] [c0301cb8] 
> sysfs_do_create_link_sd.isra.0+0xa8/0x160
> [c00fb8a1f770] [c054a658] device_add+0x2b8/0x740
> [c00fb8a1f830] [c054ae58] device_create_groups_vargs+0x178/0x190
> [c00fb8a1f890] [c01fcd70] bdi_register+0x80/0x1d0
> [c00fb8a1f8c0] [c01fd244] bdi_register_owner+0x44/0x80
> [c00fb8a1f940] 

Re: [dm-devel] [PATCH 1/1] block: Convert hd_struct in_flight from atomic to percpu

2017-06-30 Thread Ming Lei
On Thu, Jun 29, 2017 at 11:58 PM, Jens Axboe  wrote:
> On 06/29/2017 02:40 AM, Ming Lei wrote:
>> On Thu, Jun 29, 2017 at 5:49 AM, Jens Axboe  wrote:
>>> On 06/28/2017 03:12 PM, Brian King wrote:
 This patch converts the in_flight counter in struct hd_struct from a
 pair of atomics to a pair of percpu counters. This eliminates a couple
 of atomics from the hot path. When running this on a Power system, to
 a single null_blk device with 80 submission queues, irq mode 0, with
 80 fio jobs, I saw IOPs go from 1.5M IO/s to 11.4 IO/s.
>>>
>>> This has been done before, but I've never really liked it. The reason is
>>> that it means that reading the part stat inflight count now has to
>>> iterate over every possible CPU. Did you use partitions in your testing?
>>> How many CPUs were configured? When I last tested this a few years ago
>>> on even a quad core nehalem (which is notoriously shitty for cross-node
>>> latencies), it was a net loss.
>>
>> One year ago, I saw null_blk's IOPS can be decreased to 10%
>> of non-RQF_IO_STAT on a dual socket ARM64(each CPU has
>> 96 cores, and dual numa nodes) too, the performance can be
>> recovered basically if per numa-node counter is introduced and
>> used in this case, but the patch was never posted out.
>> If anyone is interested in that, I can rebase the patch on current
>> block tree and post out. I guess the performance issue might be
>> related with system cache coherency implementation more or less.
>> This issue on ARM64 can be observed with the following userspace
>> atomic counting test too:
>>
>>http://kernel.ubuntu.com/~ming/test/cache/
>
> How well did the per-node thing work? Doesn't seem to me like it would

Last time, on ARM64, I remembered that the IOPS was basically recovered,
but now I don't have a such machine to test. Could Brian test the attached patch
to see if it works on big Power machine?

And the idea is simple, just make the atomic counter per-node.

> go far enough. And per CPU is too much. One potential improvement would
> be to change the part_stat_read() to just loop online CPUs, instead of
> all possible CPUs. When CPUs go on/offline, use that as the slow path to
> ensure the stats are sane. Often there's a huge difference between
> NR_CPUS configured and what the system has. As Brian states, RH ships
> with 2048, while I doubt a lot of customers actually run that...

One observation I saw on arm64 dual socket before is that atomic inc/dec on
counter stored in local numa node is much cheaper than cross-node, that is
why I tried the per-node counter. And wrt. in-flight atomic counter, both inc
and dec should happen on CPUs belonging to same numa node in case of
blk-mq.
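
[For reference, a minimal userspace sketch of the per-node counter idea
described here. The pnode_counter_* names below merely mirror the helper
referenced in the attached diff; the actual attached patch defines its own
implementation, so this is only an assumed, illustrative shape.]

#include <stdatomic.h>
#include <stdio.h>

#define NR_NODES 2	/* assumed dual-socket topology for the example */

/* one counter per node, each padded out to its own cache line */
struct pnode_slot {
	_Alignas(64) atomic_long val;
};

struct pnode_counter {
	struct pnode_slot node[NR_NODES];
};

/* inc/dec touch only the local node's cache line */
static void pnode_counter_inc(struct pnode_counter *c, int node)
{
	atomic_fetch_add(&c->node[node].val, 1);
}

static void pnode_counter_dec(struct pnode_counter *c, int node)
{
	atomic_fetch_sub(&c->node[node].val, 1);
}

/* the (rare) read side sums across all nodes */
static unsigned long pnode_counter_read_all(struct pnode_counter *c)
{
	unsigned long sum = 0;

	for (int n = 0; n < NR_NODES; n++)
		sum += atomic_load(&c->node[n].val);
	return sum;
}

int main(void)
{
	static struct pnode_counter in_flight[2];	/* [0] = read, [1] = write */

	pnode_counter_inc(&in_flight[0], 0);		/* read issued on node 0 */
	pnode_counter_inc(&in_flight[1], 1);		/* write issued on node 1 */
	pnode_counter_dec(&in_flight[1], 1);		/* the write completed */
	printf("%8lu %8lu\n", pnode_counter_read_all(&in_flight[0]),
	       pnode_counter_read_all(&in_flight[1]));
	return 0;
}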

>
> Outside of coming up with a more clever data structure that is fully
> CPU topology aware, one thing that could work is just having X cache
> line separated read/write inflight counters per node, where X is some
> suitable value (like 4). That prevents us from having cross node
> traffic, and it also keeps the cross cpu traffic fairly low. That should
> provide a nice balance between cost of incrementing the inflight
> counting, and the cost of looping for reading it.
>
> And that brings me to the next part...
>
>>> I do agree that we should do something about it, and it's one of those
>>> items I've highlighted in talks about blk-mq on pending issues to fix
>>> up. It's just not great as it currently stands, but I don't think per
>>> CPU counters is the right way to fix it, at least not for the inflight
>>> counter.
>>
>> Yeah, it won't be a issue for non-mq path, and for blk-mq path, maybe
>> we can use some blk-mq knowledge(tagset?) to figure out the
>> 'in_flight' counter. I thought about it before, but never got a
>> perfect solution, and looks it is a bit hard, :-)
>
> The tags are already a bit spread out, so it's worth a shot. That would
> remove the need to do anything in the inc/dec path, as the tags already
> do that. The inflight count could be easily retrieved with
> sbitmap_weight(). The only issue here is that we need separate read and
> write counters, and the weight would obviously only get us the total
> count. But we can have a slower path for that, just iterate the tags and
> count them. The fast path only cares about total count.
>
> Let me try that out real quick.
>
> --
> Jens Axboe
>



Thanks,
Ming Lei
diff --git a/block/partition-generic.c b/block/partition-generic.c
index c5ec8246e25e..bd6644bf9643 100644
--- a/block/partition-generic.c
+++ b/block/partition-generic.c
@@ -140,8 +140,9 @@ ssize_t part_inflight_show(struct device *dev,
 {
 	struct hd_struct *p = dev_to_part(dev);
 
-	return sprintf(buf, "%8u %8u\n", atomic_read(&p->in_flight[0]),
-		atomic_read(&p->in_flight[1]));
+	return sprintf(buf, "%8u %8u\n",
+		pnode_counter_read_all(&p->in_flight[0]),
+		pnode_counter_read_all(&p->in_flight[1]));
 }
 
 #ifdef CONFIG_FAIL_MAKE_REQUEST
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 

Re: [dm-devel] [PATCH RESEND] dm-raid: Do not call BUG() in __rdev_sectors()

2017-06-30 Thread Heinz Mauelshagen



On 06/30/2017 09:33 AM, Johannes Thumshirn wrote:
> On Thu, Jun 29, 2017 at 04:09:38PM +0200, Heinz Mauelshagen wrote:
> > Hi Hannes,
> > 
> > thanks for your patch.
> > 
> > BUG() is intentional there to catch what I've already fixed with patch
> > 
> > "[dm-devel] [PATCH] dm raid: fix oops on upgrading to extended superblock
> > format"
> > 
> > on 06/23/2017.
> > 
> > IOW: returning 0 is never right and would hide bugs like that one fixed.
> 
> OTOH panicing systems to catch errors which could've been mitigated is
> never right as well. Not a single time. Never. BUG() and BUG_ON() are not
> for error handling.

Point taken. We'll change that.





--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH RESEND] dm-raid: Do not call BUG() in __rdev_sectors()

2017-06-30 Thread Johannes Thumshirn
On Thu, Jun 29, 2017 at 04:09:38PM +0200, Heinz Mauelshagen wrote:
> 
> Hi Hannes,
> 
> thanks for your patch.
> 
> BUG() is intentional there to catch what I've already fixed with patch
> 
> "[dm-devel] [PATCH] dm raid: fix oops on upgrading to extended superblock
> format"
> 
> on 06/23/2017.
> 
> IOW: returning 0 is never right and would hide bugs like that one fixed.

OTOH panicing systems to catch errors which could've been mitigated is
never right as well. Not a single time. Never. BUG() and BUG_ON() are not
for error handling.

-- 
Johannes Thumshirn                                          Storage
jthumsh...@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel