Re: [dm-devel] [PATCH 1/1] block: Convert hd_struct in_flight from atomic to percpu
On 06/29/2017 07:20 PM, Ming Lei wrote:
> On Fri, Jun 30, 2017 at 2:42 AM, Jens Axboe wrote:
>> On 06/29/2017 10:00 AM, Jens Axboe wrote:
>>> On 06/29/2017 09:58 AM, Jens Axboe wrote:
On 06/29/2017 02:40 AM, Ming Lei wrote:
> On Thu, Jun 29, 2017 at 5:49 AM, Jens Axboe wrote:
>> On 06/28/2017 03:12 PM, Brian King wrote:

>>> This patch converts the in_flight counter in struct hd_struct from a >>> pair of atomics to a pair of percpu counters. This eliminates a couple >>> of atomics from the hot path. When running this on a Power system, to >>> a single null_blk device with 80 submission queues, irq mode 0, with >>> 80 fio jobs, I saw IOPs go from 1.5M IO/s to 11.4M IO/s.
>>
>> This has been done before, but I've never really liked it. The reason is >> that it means that reading the part stat inflight count now has to >> iterate over every possible CPU. Did you use partitions in your testing? >> How many CPUs were configured? When I last tested this a few years ago >> on even a quad core nehalem (which is notoriously shitty for cross-node >> latencies), it was a net loss.
>
> One year ago, I saw null_blk's IOPS can be decreased to 10% > of non-RQF_IO_STAT on a dual socket ARM64 (each CPU has > 96 cores, and dual numa nodes) too, the performance can be > recovered basically if per numa-node counter is introduced and > used in this case, but the patch was never posted out. > If anyone is interested in that, I can rebase the patch on current > block tree and post out. I guess the performance issue might be > related with system cache coherency implementation more or less. > This issue on ARM64 can be observed with the following userspace > atomic counting test too:
>
> http://kernel.ubuntu.com/~ming/test/cache/

How well did the per-node thing work? Doesn't seem to me like it would go far enough. And per CPU is too much. One potential improvement would be to change the part_stat_read() to just loop online CPUs, instead of all possible CPUs.
When CPUs go on/offline, use that as the slow path to ensure the stats are sane. Often there's a huge difference between NR_CPUS configured and what the system has. As Brian states, RH ships with 2048, while I doubt a lot of customers actually run that...

Outside of coming up with a more clever data structure that is fully CPU topology aware, one thing that could work is just having X cache line separated read/write inflight counters per node, where X is some suitable value (like 4). That prevents us from having cross node traffic, and it also keeps the cross cpu traffic fairly low. That should provide a nice balance between cost of incrementing the inflight counting, and the cost of looping for reading it.

And that brings me to the next part...

>> I do agree that we should do something about it, and it's one of those >> items I've highlighted in talks about blk-mq on pending issues to fix >> up. It's just not great as it currently stands, but I don't think per >> CPU counters is the right way to fix it, at least not for the inflight >> counter.
>
> Yeah, it won't be an issue for non-mq path, and for blk-mq path, maybe > we can use some blk-mq knowledge (tagset?) to figure out the > 'in_flight' counter. I thought about it before, but never got a > perfect solution, and it looks a bit hard, :-)

The tags are already a bit spread out, so it's worth a shot. That would remove the need to do anything in the inc/dec path, as the tags already do that. The inflight count could be easily retrieved with sbitmap_weight(). The only issue here is that we need separate read and write counters, and the weight would obviously only get us the total count. But we can have a slower path for that, just iterate the tags and count them. The fast path only cares about total count. Let me try that out real quick.

>>>
>>> Well, that only works for whole disk stats, of course... There's no way >>> around iterating the tags and checking for this to truly work.
>> Totally untested proof of concept for using the tags for this. I based
>> this on top of Brian's patch, so it includes his patch plus the
>> _double() stuff I did which is no longer really needed.
>>
>> diff --git a/block/bio.c b/block/bio.c
>> index 9cf98b29588a..ec99d9ba0f33 100644
>> --- a/block/bio.c
>> +++ b/block/bio.c
>> @@ -1737,7 +1737,7 @@ void generic_start_io_acct(int rw, unsigned long sectors,
>>  	part_round_stats(cpu, part);
>>  	part_stat_inc(cpu, part, ios[rw]);
>>  	part_stat_add(cpu, part, sectors[rw], sectors);
>> -	part_inc_in_flight(part, rw);
>> +	part_inc_in_flight(cpu, part, rw);
>>
>>  	part_stat_unlock();
>>  }
>> @@ -1751,7 +1751,7 @@ void
[dm-devel] [git pull] device mapper fixes for 4.12 final
Hi Linus,

The following changes since commit feb7695fe9fb83084aa29de0094774f4c9d4c9fc:

  dm io: fix duplicate bio completion due to missing ref count (2017-06-21 12:04:50 -0400)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git tags/for-4.12/dm-fixes-5

for you to fetch changes up to 00a0ea33b495ee6149bf5a77ac5807ce87323abb:

  dm thin: do not queue freed thin mapping for next stage processing (2017-06-27 15:14:34 -0400)

Please pull, thanks.
Mike

- A dm thinp fix for a crash that will occur when metadata device failure races with discard passdown to the underlying data device.
- A dm raid fix to not access the superblock's >= 1.9.0 'sectors' member unconditionally.

Heinz Mauelshagen (1):
      dm raid: fix oops on upgrading to extended superblock format

Vallish Vaidyeshwara (1):
      dm thin: do not queue freed thin mapping for next stage processing

 drivers/md/dm-raid.c | 17 ++---
 drivers/md/dm-thin.c | 26 +-
 2 files changed, 27 insertions(+), 16 deletions(-)

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [PATCH 1/1] block: Convert hd_struct in_flight from atomic to percpu
On 06/29/2017 10:00 AM, Jens Axboe wrote:
> On 06/29/2017 09:58 AM, Jens Axboe wrote:
>> On 06/29/2017 02:40 AM, Ming Lei wrote:
>>> On Thu, Jun 29, 2017 at 5:49 AM, Jens Axboe wrote:
On 06/28/2017 03:12 PM, Brian King wrote:

> This patch converts the in_flight counter in struct hd_struct from a > pair of atomics to a pair of percpu counters. This eliminates a couple > of atomics from the hot path. When running this on a Power system, to > a single null_blk device with 80 submission queues, irq mode 0, with > 80 fio jobs, I saw IOPs go from 1.5M IO/s to 11.4M IO/s.

This has been done before, but I've never really liked it. The reason is that it means that reading the part stat inflight count now has to iterate over every possible CPU. Did you use partitions in your testing? How many CPUs were configured? When I last tested this a few years ago on even a quad core nehalem (which is notoriously shitty for cross-node latencies), it was a net loss.

>>>
>>> One year ago, I saw null_blk's IOPS can be decreased to 10% >>> of non-RQF_IO_STAT on a dual socket ARM64 (each CPU has >>> 96 cores, and dual numa nodes) too, the performance can be >>> recovered basically if per numa-node counter is introduced and >>> used in this case, but the patch was never posted out. >>> If anyone is interested in that, I can rebase the patch on current >>> block tree and post out. I guess the performance issue might be >>> related with system cache coherency implementation more or less. >>> This issue on ARM64 can be observed with the following userspace >>> atomic counting test too:
>>>
>>> http://kernel.ubuntu.com/~ming/test/cache/
>>
>> How well did the per-node thing work? Doesn't seem to me like it would >> go far enough. And per CPU is too much. One potential improvement would >> be to change the part_stat_read() to just loop online CPUs, instead of >> all possible CPUs. When CPUs go on/offline, use that as the slow path to >> ensure the stats are sane.
Often there's a huge difference between >> NR_CPUS configured and what the system has. As Brian states, RH ships >> with 2048, while I doubt a lot of customers actually run that...
>>
>> Outside of coming up with a more clever data structure that is fully >> CPU topology aware, one thing that could work is just having X cache >> line separated read/write inflight counters per node, where X is some >> suitable value (like 4). That prevents us from having cross node >> traffic, and it also keeps the cross cpu traffic fairly low. That should >> provide a nice balance between cost of incrementing the inflight >> counting, and the cost of looping for reading it.
>>
>> And that brings me to the next part...
>>
I do agree that we should do something about it, and it's one of those items I've highlighted in talks about blk-mq on pending issues to fix up. It's just not great as it currently stands, but I don't think per CPU counters is the right way to fix it, at least not for the inflight counter.

>>>
>>> Yeah, it won't be an issue for non-mq path, and for blk-mq path, maybe >>> we can use some blk-mq knowledge (tagset?) to figure out the >>> 'in_flight' counter. I thought about it before, but never got a >>> perfect solution, and it looks a bit hard, :-)
>>
>> The tags are already a bit spread out, so it's worth a shot. That would >> remove the need to do anything in the inc/dec path, as the tags already >> do that. The inflight count could be easily retrieved with >> sbitmap_weight(). The only issue here is that we need separate read and >> write counters, and the weight would obviously only get us the total >> count. But we can have a slower path for that, just iterate the tags and >> count them. The fast path only cares about total count.
>>
>> Let me try that out real quick.
>
> Well, that only works for whole disk stats, of course... There's no way > around iterating the tags and checking for this to truly work.
Totally untested proof of concept for using the tags for this. I based this on top of Brian's patch, so it includes his patch plus the _double() stuff I did which is no longer really needed.

diff --git a/block/bio.c b/block/bio.c
index 9cf98b29588a..ec99d9ba0f33 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1737,7 +1737,7 @@ void generic_start_io_acct(int rw, unsigned long sectors,
 	part_round_stats(cpu, part);
 	part_stat_inc(cpu, part, ios[rw]);
 	part_stat_add(cpu, part, sectors[rw], sectors);
-	part_inc_in_flight(part, rw);
+	part_inc_in_flight(cpu, part, rw);
 
 	part_stat_unlock();
 }
@@ -1751,7 +1751,7 @@ void generic_end_io_acct(int rw, struct hd_struct *part,
 	part_stat_add(cpu, part, ticks[rw], duration);
 	part_round_stats(cpu, part);
-	part_dec_in_flight(part, rw);
+	part_dec_in_flight(cpu, part, rw);
 
 	part_stat_unlock();
 }

diff --git a/block/blk-core.c b/block/blk-core.c
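[Editor's note] The tag-based idea Jens describes above can be illustrated with a plain userspace sketch. This is a hypothetical toy model, not the kernel's sbitmap API: the busy driver tags are modeled as one 64-bit word, the total in-flight count falls out of a popcount (the analogue of sbitmap_weight()), and the read/write split takes the slower walk over the set bits.

```c
#include <assert.h>
#include <stdint.h>

/* Fast path: total in-flight requests fall out of the tag bitmap for
 * free, so the submit/complete hot path needs no extra inc/dec at all.
 * This popcount stands in for what sbitmap_weight() does in the kernel. */
static unsigned int tags_in_flight(uint64_t tag_bitmap)
{
	return (unsigned int)__builtin_popcountll(tag_bitmap);
}

/* Slow path: split the total into reads and writes by walking the set
 * bits and consulting a per-tag direction table (1 = write, 0 = read).
 * The table is an assumption of this sketch; in the kernel you would
 * inspect the request a tag maps to. */
static void tags_in_flight_rw(uint64_t tag_bitmap, const unsigned char *dir,
			      unsigned int *reads, unsigned int *writes)
{
	*reads = *writes = 0;
	while (tag_bitmap) {
		int tag = __builtin_ctzll(tag_bitmap); /* lowest set bit = busy tag */
		if (dir[tag])
			(*writes)++;
		else
			(*reads)++;
		tag_bitmap &= tag_bitmap - 1; /* clear that bit */
	}
}
```

This mirrors the trade-off in the thread: the common "total count" read is O(words of bitmap), while the rarer per-direction read iterates the busy tags.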
Re: [dm-devel] [PATCH 1/1] block: Convert hd_struct in_flight from atomic to percpu
On 06/29/2017 11:25 AM, Ming Lei wrote:
> On Thu, Jun 29, 2017 at 11:58 PM, Jens Axboe wrote:
>> On 06/29/2017 02:40 AM, Ming Lei wrote:
>>> On Thu, Jun 29, 2017 at 5:49 AM, Jens Axboe wrote:
On 06/28/2017 03:12 PM, Brian King wrote:

> This patch converts the in_flight counter in struct hd_struct from a > pair of atomics to a pair of percpu counters. This eliminates a couple > of atomics from the hot path. When running this on a Power system, to > a single null_blk device with 80 submission queues, irq mode 0, with > 80 fio jobs, I saw IOPs go from 1.5M IO/s to 11.4M IO/s.

This has been done before, but I've never really liked it. The reason is that it means that reading the part stat inflight count now has to iterate over every possible CPU. Did you use partitions in your testing? How many CPUs were configured? When I last tested this a few years ago on even a quad core nehalem (which is notoriously shitty for cross-node latencies), it was a net loss.

>>>
>>> One year ago, I saw null_blk's IOPS can be decreased to 10% >>> of non-RQF_IO_STAT on a dual socket ARM64 (each CPU has >>> 96 cores, and dual numa nodes) too, the performance can be >>> recovered basically if per numa-node counter is introduced and >>> used in this case, but the patch was never posted out. >>> If anyone is interested in that, I can rebase the patch on current >>> block tree and post out. I guess the performance issue might be >>> related with system cache coherency implementation more or less. >>> This issue on ARM64 can be observed with the following userspace >>> atomic counting test too:
>>>
>>> http://kernel.ubuntu.com/~ming/test/cache/
>>
>> How well did the per-node thing work? Doesn't seem to me like it would
>
> Last time, on ARM64, I remembered that the IOPS was basically recovered, > but now I don't have such a machine to test. Could Brian test the attached > patch to see if it works on a big Power machine?
>
> And the idea is simple, just make the atomic counter per-node.
I tried loading the patch and get an oops right away on boot. Haven't been able to debug anything yet. This is on top of an older kernel, so not sure if that is the issue or not. I can try upstream and see if I have different results...

Unable to handle kernel paging request for data at address 0xc00031313a333532
Faulting instruction address: 0xc02552c4
Oops: Kernel access of bad area, sig: 11 [#1]
SMP NR_CPUS=1024 NUMA PowerNV
Modules linked in: dm_round_robin vmx_crypto powernv_rng leds_powernv powernv_op_panel led_class rng_core dm_multipath autofs4 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq multipath dm_mirror dm_region_hash dm_log cxlflash bnx2x mdio libcrc32c nvme nvme_core lpfc cxl crc_t10dif crct10dif_generic crct10dif_common
CPU: 9 PID: 3485 Comm: multipathd Not tainted 4.9.10-dirty #7
task: c00fba0d task.stack: c00fb8a1c000
NIP: c02552c4 LR: c0255274 CTR:
REGS: c00fb8a1f350 TRAP: 0300 Not tainted (4.9.10-dirty)
MSR: 9280b033 CR: 24028848 XER:
CFAR: c0008a60 DAR: c00031313a333532 DSISR: 4000 SOFTE: 1
GPR00: c01f8980 c00fb8a1f5d0 c0f24300 c00fc4007c00
GPR04: 024000c0 c02fc0ac 025d c00fc5d346e0
GPR08: 000fc50d6000 c00031313a333532 c00fbc836240
GPR12: 2200 cff02400 3fff79cfebf0
GPR16: c138fd03 01003a12bf70 3fff79ae75e0 3fff79d269e8
GPR20: 0001 3fff79cfebf0 01003a393aa0 3fff79d067b8
GPR24: c0b04948 a1ff c02fc0ac
GPR28: 024000c0 0007 c02fc0ac c00fc4007c00
NIP [c02552c4] __kmalloc_track_caller+0x94/0x2f0
LR [c0255274] __kmalloc_track_caller+0x44/0x2f0
Call Trace:
[c00fb8a1f5d0] [c00fb8a1f610] 0xc00fb8a1f610 (unreliable)
[c00fb8a1f620] [c01f8980] kstrdup+0x50/0xb0
[c00fb8a1f660] [c02fc0ac] __kernfs_new_node+0x4c/0x140
[c00fb8a1f6b0] [c02fd9f4] kernfs_new_node+0x34/0x80
[c00fb8a1f6e0] [c0300708] kernfs_create_link+0x38/0xf0
[c00fb8a1f720] [c0301cb8] sysfs_do_create_link_sd.isra.0+0xa8/0x160
[c00fb8a1f770] [c054a658] device_add+0x2b8/0x740
[c00fb8a1f830] [c054ae58] device_create_groups_vargs+0x178/0x190
[c00fb8a1f890] [c01fcd70] bdi_register+0x80/0x1d0
[c00fb8a1f8c0] [c01fd244] bdi_register_owner+0x44/0x80
[c00fb8a1f940] [c0453bbc] device_add_disk+0x1cc/0x500
[c00fb8a1f9f0] [c0764dec] dm_create+0x33c/0x5f0
[c00fb8a1fa90] [c076d9ac] dev_create+0x8c/0x3e0
[c00fb8a1fb40] [c076d1fc] ctl_ioctl+0x38c/0x580
Re: [dm-devel] [PATCH 1/1] block: Convert hd_struct in_flight from atomic to percpu
On 06/29/2017 09:58 AM, Jens Axboe wrote:
> On 06/29/2017 02:40 AM, Ming Lei wrote:
>> On Thu, Jun 29, 2017 at 5:49 AM, Jens Axboe wrote:
>>> On 06/28/2017 03:12 PM, Brian King wrote:

This patch converts the in_flight counter in struct hd_struct from a pair of atomics to a pair of percpu counters. This eliminates a couple of atomics from the hot path. When running this on a Power system, to a single null_blk device with 80 submission queues, irq mode 0, with 80 fio jobs, I saw IOPs go from 1.5M IO/s to 11.4M IO/s.

>>>
>>> This has been done before, but I've never really liked it. The reason is >>> that it means that reading the part stat inflight count now has to >>> iterate over every possible CPU. Did you use partitions in your testing? >>> How many CPUs were configured? When I last tested this a few years ago >>> on even a quad core nehalem (which is notoriously shitty for cross-node >>> latencies), it was a net loss.
>>
>> One year ago, I saw null_blk's IOPS can be decreased to 10% >> of non-RQF_IO_STAT on a dual socket ARM64 (each CPU has >> 96 cores, and dual numa nodes) too, the performance can be >> recovered basically if per numa-node counter is introduced and >> used in this case, but the patch was never posted out. >> If anyone is interested in that, I can rebase the patch on current >> block tree and post out. I guess the performance issue might be >> related with system cache coherency implementation more or less. >> This issue on ARM64 can be observed with the following userspace >> atomic counting test too:
>>
>> http://kernel.ubuntu.com/~ming/test/cache/
>
> How well did the per-node thing work? Doesn't seem to me like it would > go far enough. And per CPU is too much. One potential improvement would > be to change the part_stat_read() to just loop online CPUs, instead of > all possible CPUs. When CPUs go on/offline, use that as the slow path to > ensure the stats are sane.
Often there's a huge difference between > NR_CPUS configured and what the system has. As Brian states, RH ships > with 2048, while I doubt a lot of customers actually run that...
>
> Outside of coming up with a more clever data structure that is fully > CPU topology aware, one thing that could work is just having X cache > line separated read/write inflight counters per node, where X is some > suitable value (like 4). That prevents us from having cross node > traffic, and it also keeps the cross cpu traffic fairly low. That should > provide a nice balance between cost of incrementing the inflight > counting, and the cost of looping for reading it.
>
> And that brings me to the next part...
>
>>> I do agree that we should do something about it, and it's one of those >>> items I've highlighted in talks about blk-mq on pending issues to fix >>> up. It's just not great as it currently stands, but I don't think per >>> CPU counters is the right way to fix it, at least not for the inflight >>> counter.
>>
>> Yeah, it won't be an issue for non-mq path, and for blk-mq path, maybe >> we can use some blk-mq knowledge (tagset?) to figure out the >> 'in_flight' counter. I thought about it before, but never got a >> perfect solution, and it looks a bit hard, :-)
>
> The tags are already a bit spread out, so it's worth a shot. That would > remove the need to do anything in the inc/dec path, as the tags already > do that. The inflight count could be easily retrieved with > sbitmap_weight(). The only issue here is that we need separate read and > write counters, and the weight would obviously only get us the total > count. But we can have a slower path for that, just iterate the tags and > count them. The fast path only cares about total count.
>
> Let me try that out real quick.

Well, that only works for whole disk stats, of course... There's no way around iterating the tags and checking for this to truly work.
-- Jens Axboe
Re: [dm-devel] [PATCH 1/1] block: Convert hd_struct in_flight from atomic to percpu
On 06/29/2017 02:40 AM, Ming Lei wrote:
> On Thu, Jun 29, 2017 at 5:49 AM, Jens Axboe wrote:
>> On 06/28/2017 03:12 PM, Brian King wrote:

>>> This patch converts the in_flight counter in struct hd_struct from a >>> pair of atomics to a pair of percpu counters. This eliminates a couple >>> of atomics from the hot path. When running this on a Power system, to >>> a single null_blk device with 80 submission queues, irq mode 0, with >>> 80 fio jobs, I saw IOPs go from 1.5M IO/s to 11.4M IO/s.
>>
>> This has been done before, but I've never really liked it. The reason is >> that it means that reading the part stat inflight count now has to >> iterate over every possible CPU. Did you use partitions in your testing? >> How many CPUs were configured? When I last tested this a few years ago >> on even a quad core nehalem (which is notoriously shitty for cross-node >> latencies), it was a net loss.
>
> One year ago, I saw null_blk's IOPS can be decreased to 10% > of non-RQF_IO_STAT on a dual socket ARM64 (each CPU has > 96 cores, and dual numa nodes) too, the performance can be > recovered basically if per numa-node counter is introduced and > used in this case, but the patch was never posted out. > If anyone is interested in that, I can rebase the patch on current > block tree and post out. I guess the performance issue might be > related with system cache coherency implementation more or less. > This issue on ARM64 can be observed with the following userspace > atomic counting test too:
>
> http://kernel.ubuntu.com/~ming/test/cache/

How well did the per-node thing work? Doesn't seem to me like it would go far enough. And per CPU is too much. One potential improvement would be to change the part_stat_read() to just loop online CPUs, instead of all possible CPUs. When CPUs go on/offline, use that as the slow path to ensure the stats are sane. Often there's a huge difference between NR_CPUS configured and what the system has.
As Brian states, RH ships with 2048, while I doubt a lot of customers actually run that...

Outside of coming up with a more clever data structure that is fully CPU topology aware, one thing that could work is just having X cache line separated read/write inflight counters per node, where X is some suitable value (like 4). That prevents us from having cross node traffic, and it also keeps the cross cpu traffic fairly low. That should provide a nice balance between cost of incrementing the inflight counting, and the cost of looping for reading it.

And that brings me to the next part...

>> I do agree that we should do something about it, and it's one of those >> items I've highlighted in talks about blk-mq on pending issues to fix >> up. It's just not great as it currently stands, but I don't think per >> CPU counters is the right way to fix it, at least not for the inflight >> counter.
>
> Yeah, it won't be an issue for non-mq path, and for blk-mq path, maybe > we can use some blk-mq knowledge (tagset?) to figure out the > 'in_flight' counter. I thought about it before, but never got a > perfect solution, and it looks a bit hard, :-)

The tags are already a bit spread out, so it's worth a shot. That would remove the need to do anything in the inc/dec path, as the tags already do that. The inflight count could be easily retrieved with sbitmap_weight(). The only issue here is that we need separate read and write counters, and the weight would obviously only get us the total count. But we can have a slower path for that, just iterate the tags and count them. The fast path only cares about total count.

Let me try that out real quick.

-- Jens Axboe
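[Editor's note] The "X cache line separated read/write inflight counters per node" idea above can be sketched in userspace C. This is a hypothetical model only: the slot layout, names, and padding are assumptions, a real kernel implementation would use per-CPU/per-node APIs with proper atomicity, and NODES/X would come from the topology. Writers hash onto one of X padded slots on their own node (no cross-node traffic, little cross-CPU traffic); readers sum NODES * X slots instead of NR_CPUS buckets.

```c
#include <assert.h>

#define NODES 2
#define COUNTERS_PER_NODE 4	/* the "X" in the text */

/* Pad each slot out to its own cache line so two slots never share one;
 * 64 bytes is an assumed line size. */
struct inflight_slot {
	long rw[2];				/* [0] = read, [1] = write */
	char pad[64 - 2 * sizeof(long)];
};

static struct inflight_slot slots[NODES][COUNTERS_PER_NODE];

/* Hot path: bump a slot on the caller's own node, chosen by CPU id. */
static void inflight_inc(int node, int cpu, int rw)
{
	slots[node][cpu % COUNTERS_PER_NODE].rw[rw]++;
}

static void inflight_dec(int node, int cpu, int rw)
{
	slots[node][cpu % COUNTERS_PER_NODE].rw[rw]--;
}

/* Read side: only NODES * X slots to walk, not one bucket per possible CPU. */
static long inflight_read(int rw)
{
	long sum = 0;
	for (int n = 0; n < NODES; n++)
		for (int i = 0; i < COUNTERS_PER_NODE; i++)
			sum += slots[n][i].rw[rw];
	return sum;
}
```

With NR_CPUS=2048 (the RHEL/SLES Power config mentioned above) the read side drops from 2048 iterations to NODES * 4, which is the balance the thread is after.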
Re: [dm-devel] [PATCH] libmultipath: update 3PARdata builtin config
On Wed, Jun 28, 2017 at 07:48:38PM +0200, Xose Vazquez Perez wrote:
> On 06/26/2017 09:03 PM, Benjamin Marzinski wrote:
>
> > This updated config comes from hp.
>
> It would be nice to have more information.
> Why and when is this needed?

I assume the change to dev_loss_tmo is simply a preference issue. Like Netapp, they don't want their devices to get auto-removed when they go down. I also assume that in their internal testing, they hit cases where 5 seconds wasn't enough time to wait for some transient issue with the array to resolve. At any rate, I'm simply passing along their request, which seems like a perfectly reasonable one to me.

-Ben

> > BTW: HPE 'MSA 205x' and 'StoreVirtual 3200' (LeftHand) are missing.

> > Signed-off-by: Benjamin Marzinski
> > ---
> >  libmultipath/hwtable.c | 2 ++
> >  1 file changed, 2 insertions(+)
> >
> > diff --git a/libmultipath/hwtable.c b/libmultipath/hwtable.c
> > index 390d143..54bdcfc 100644
> > --- a/libmultipath/hwtable.c
> > +++ b/libmultipath/hwtable.c
> > @@ -49,6 +49,8 @@ static struct hwentry default_hw[] = {
> >  		.hwhandler = "1 alua",
> >  		.prio_name = PRIO_ALUA,
> >  		.no_path_retry = 18,
> > +		.fast_io_fail = 10,
> > +		.dev_loss = MAX_DEV_LOSS_TMO,
> >  	},
> >  	{
> >  		/* RA8000 / ESA12000 */
Re: [dm-devel] [PATCH RESEND] dm-raid: Do not call BUG() in __rdev_sectors()
Hi Hannes, thanks for your patch. BUG() is intentional there to catch what I've already fixed with patch "[dm-devel] [PATCH] dm raid: fix oops on upgrading to extended superblock format" on 06/23/2017. IOW: returning 0 is never right and would hide bugs like that one fixed.

Heinz

On 06/28/2017 10:04 AM, Hannes Reinecke wrote:

__rdev_sectors() might be called for an invalid/non-existing RAID set during raid_ctr(), which is perfectly fine and no reason to panic.

Signed-off-by: Hannes Reinecke
---
 drivers/md/dm-raid.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
index 7d89322..9bbb596 100644
--- a/drivers/md/dm-raid.c
+++ b/drivers/md/dm-raid.c
@@ -1571,7 +1571,8 @@ static sector_t __rdev_sectors(struct raid_set *rs)
 		return rdev->sectors;
 	}
 
-	BUG(); /* Constructor ensures we got some. */
+	/* No valid raid devices */
+	return 0;
 }
 
 /* Calculate the sectors per device and per array used for @rs */
Re: [dm-devel] [PATCH 1/1] block: Convert hd_struct in_flight from atomic to percpu
On 06/28/2017 05:19 PM, Jens Axboe wrote:
> On 06/28/2017 04:07 PM, Brian King wrote:
>> On 06/28/2017 04:59 PM, Jens Axboe wrote:
>>> On 06/28/2017 03:54 PM, Jens Axboe wrote:
On 06/28/2017 03:12 PM, Brian King wrote:

> -static inline int part_in_flight(struct hd_struct *part)
> +static inline unsigned long part_in_flight(struct hd_struct *part)
> {
> -	return atomic_read(&part->in_flight[0]) + atomic_read(&part->in_flight[1]);
> +	return part_stat_read(part, in_flight[0]) + part_stat_read(part, in_flight[1]);

One obvious improvement would be to not do this twice, but only have to loop once. Instead of making this an array, make it a structure with a read and write count. It still doesn't really fix the issue of someone running on a kernel with a ton of possible CPUs configured. But it does reduce the overhead by 50%.

>>> Or something as simple as this:
>>>
>>> #define part_stat_read_double(part, field1, field2)			\
>>> ({									\
>>> 	typeof((part)->dkstats->field1) res = 0;			\
>>> 	unsigned int _cpu;						\
>>> 	for_each_possible_cpu(_cpu) {					\
>>> 		res += per_cpu_ptr((part)->dkstats, _cpu)->field1;	\
>>> 		res += per_cpu_ptr((part)->dkstats, _cpu)->field2;	\
>>> 	}								\
>>> 	res;								\
>>> })
>>>
>>> static inline unsigned long part_in_flight(struct hd_struct *part)
>>> {
>>> 	return part_stat_read_double(part, in_flight[0], in_flight[1]);
>>> }

>> I'll give this a try and also see about running some more exhaustive >> runs to see if there are any cases where we go backwards in performance. >> I'll also run with partitions and see how that impacts this.

> And do something nuts, like setting NR_CPUS to 512 or whatever. What do > distros ship with?

Both RHEL and SLES set NR_CPUS=2048 for the Power architecture. I can easily switch the SMT mode of the machine I used for this from 4 to 8 to have a total of 160 online logical CPUs and see how that affects the performance. I'll see if I can find a larger machine as well.
Thanks,

Brian

-- Brian King Power Linux I/O IBM Linux Technology Center
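[Editor's note] The saving behind part_stat_read_double() above is simply folding two sweeps over the per-CPU buckets into one. A minimal userspace model of that (struct name, bucket count, and plain longs are assumptions of this sketch; the kernel version walks real per-CPU pointers):

```c
#include <assert.h>

#define NCPUS 8

struct disk_stats {
	long in_flight[2];	/* [0] = read, [1] = write */
};

/* Stand-in for the per-CPU dkstats buckets. */
static struct disk_stats per_cpu_stats[NCPUS];

/* What part_stat_read(part, f0) + part_stat_read(part, f1) does:
 * two full sweeps over the buckets. */
static long part_in_flight_two_pass(void)
{
	long res = 0;
	for (int cpu = 0; cpu < NCPUS; cpu++)
		res += per_cpu_stats[cpu].in_flight[0];
	for (int cpu = 0; cpu < NCPUS; cpu++)
		res += per_cpu_stats[cpu].in_flight[1];
	return res;
}

/* The part_stat_read_double() idea: pick up both fields in one sweep,
 * halving the iteration cost for the same result. */
static long part_in_flight_one_pass(void)
{
	long res = 0;
	for (int cpu = 0; cpu < NCPUS; cpu++) {
		res += per_cpu_stats[cpu].in_flight[0];
		res += per_cpu_stats[cpu].in_flight[1];
	}
	return res;
}
```

As Jens notes, this halves the read cost but does nothing about the number of buckets itself, which is why NR_CPUS=2048 configs still hurt.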
Re: [dm-devel] [PATCH 1/1] block: Convert hd_struct in_flight from atomic to percpu
On Thu, Jun 29, 2017 at 5:49 AM, Jens Axboe wrote:
> On 06/28/2017 03:12 PM, Brian King wrote:

>> This patch converts the in_flight counter in struct hd_struct from a >> pair of atomics to a pair of percpu counters. This eliminates a couple >> of atomics from the hot path. When running this on a Power system, to >> a single null_blk device with 80 submission queues, irq mode 0, with >> 80 fio jobs, I saw IOPs go from 1.5M IO/s to 11.4M IO/s.
>
> This has been done before, but I've never really liked it. The reason is > that it means that reading the part stat inflight count now has to > iterate over every possible CPU. Did you use partitions in your testing? > How many CPUs were configured? When I last tested this a few years ago > on even a quad core nehalem (which is notoriously shitty for cross-node > latencies), it was a net loss.

One year ago, I saw null_blk's IOPS can be decreased to 10% of non-RQF_IO_STAT on a dual socket ARM64 (each CPU has 96 cores, and dual numa nodes) too, the performance can be recovered basically if per numa-node counter is introduced and used in this case, but the patch was never posted out. If anyone is interested in that, I can rebase the patch on current block tree and post out. I guess the performance issue might be related with system cache coherency implementation more or less. This issue on ARM64 can be observed with the following userspace atomic counting test too:

http://kernel.ubuntu.com/~ming/test/cache/

> I do agree that we should do something about it, and it's one of those > items I've highlighted in talks about blk-mq on pending issues to fix > up. It's just not great as it currently stands, but I don't think per > CPU counters is the right way to fix it, at least not for the inflight > counter.

Yeah, it won't be an issue for non-mq path, and for blk-mq path, maybe we can use some blk-mq knowledge (tagset?) to figure out the 'in_flight' counter.
I thought about it before, but never got a perfect solution, and it looks a bit hard, :-)

Thanks,
Ming Lei
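[Editor's note] For reference, a userspace test along the lines of the atomic-counting one Ming links to might look like the following. This is a guess at its shape, not the actual test from that URL: every thread hammers one shared atomic counter, so the cache line holding it ping-pongs between cores on every increment, which is exactly the coherency cost the percpu/per-node conversions in this thread are trying to avoid.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define NTHREADS 4
#define ITERS 100000

/* One shared counter: the worst case for cache-line contention. */
static atomic_long shared_counter;

static void *bump(void *arg)
{
	(void)arg;
	for (int i = 0; i < ITERS; i++)
		atomic_fetch_add(&shared_counter, 1);
	return NULL;
}

/* Run NTHREADS contending incrementers and return the final count.
 * Timing this loop (vs. a per-thread-counter variant) is what exposes
 * the coherency cost Ming describes on the big ARM64 box. */
static long run_contended_count(void)
{
	pthread_t t[NTHREADS];

	atomic_store(&shared_counter, 0);
	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&t[i], NULL, bump, NULL);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(t[i], NULL);
	return atomic_load(&shared_counter);
}
```

The percpu variant under discussion replaces the single `shared_counter` with one counter per CPU and sums on read, trading a cheap hot path for a more expensive read side.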