Re: [dm-devel] [PATCH 1/1] block: Convert hd_struct in_flight from atomic to percpu

2017-07-04 Thread Jens Axboe
On 07/04/2017 02:55 PM, Brian King wrote:
> On 07/01/2017 11:43 AM, Jens Axboe wrote:
>> Now:
>>
>> http://git.kernel.dk/cgit/linux-block/commit/?h=mq-inflight&id=87f73ef2b9edb6834001df8f7cb48c7a116e8cd3
>>
>>> And updated the branch here:
>>>
>>> http://git.kernel.dk/cgit/linux-block/log/?h=mq-inflight
>>>
>>> to include that, and be based on top of for-4.13/block. If you prefer just
>>> pulling a branch, pull:
>>>
>>> git://git.kernel.dk/linux-block mq-inflight
>>
>> I've tested it, and made a few adjustments and fixes. The branches are
>> the same, just rebased. Works fine for me, even with partitions now.
>>
>> Let me know how it works for you.
> 
> This is looking very nice. Here is what I'm seeing now:
> 
> 1 null_blk device / 80 submit_queues / fio numjobs=80
> 9.4M IOPS
> 
> 80 null_blk devices / 1 submit_queue per null_blk / fio numjobs=1 (1 per null_blk)
> 9.3M IOPS
> 
> 1 null_blk device / 160 submit_queues / fio numjobs=160
> 9.8M IOPS
> 
> 160 null_blk devices / 1 submit_queue per null_blk / fio numjobs=1 (1 per null_blk)
> 8.9M IOPS
> 
> What is queued in mq-inflight looks like it resolves the issue I've been seeing.

Good to hear! I'll test some corner cases and post a "real" version
for review. Thanks for all your testing.

-- 
Jens Axboe

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
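
[A minimal sketch, not taken from the thread itself, of how the single-device run
reported above can be reproduced and how the in-flight accounting this patch converts
can be watched while the load runs. It assumes an 80-CPU system with null_blk and fio
available, and uses the command-line equivalent of the job file quoted later in the
thread:]

  # one null_blk device, one submit queue per logical CPU
  modprobe null_blk submit_queues=80 nr_devices=1 irqmode=0

  # 4k random reads, 80 jobs against the single device
  fio --name=job1 --filename=/dev/nullb0 --direct=1 --rw=randread --bs=4k \
      --iodepth=64 --numjobs=80 --ioengine=libaio --runtime=60 --time_based \
      --group_reporting &

  # in-flight reads/writes for the whole device while fio runs
  watch -n1 cat /sys/block/nullb0/inflight

[The two numbers in the inflight file are the reads and writes currently in flight;
with the percpu and per-node schemes discussed here, producing them means summing the
per-CPU or per-node counters rather than reading a single atomic.]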


Re: [dm-devel] [PATCH 1/1] block: Convert hd_struct in_flight from atomic to percpu

2017-07-04 Thread Brian King
On 07/03/2017 08:20 PM, Ming Lei wrote:
>> Right. So the two modes I'm running in are:
>>
>> 1. 1 null_blk device, with 80 submit_queues, 80 fio jobs, 80 logical CPUs.
>> 2. 80 null_blk devices, each with one submit_queue, with one fio job per
>>    null_blk device, so 80 threads total. 80 logical CPUs.
>>
>> In theory, the two should result in similar numbers.
>>
>> Here are the commands and fio configurations:
>>
>> Scenario #1
>> modprobe null_blk submit_queues=80 nr_devices=1 irqmode=0
>>
>> FIO config:
>> [global]
>> buffered=0
>> invalidate=1
>> bs=4k
>> iodepth=64
>> numjobs=80
>> group_reporting=1
>> rw=randrw
>> rwmixread=100
>> rwmixwrite=0
>> ioengine=libaio
>> runtime=60
>> time_based
>>
>> [job1]
>> filename=/dev/nullb0

> 
> IMO it is more reasonable to use a single null_blk with 80 queues (by setting
> submit_queues to 80) than simply 80 null_blks, so I suggest switching to the
> 80-queue test in your future testing.

That should be what my Scenario #1 covers. 

Thanks,

Brian

-- 
Brian King
Power Linux I/O
IBM Linux Technology Center

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
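
[Since the comparison hinges on how many submit queues the single null_blk instance
actually gets, a quick sanity check of the Scenario #1 setup can help. A sketch,
assuming a blk-mq device named nullb0 and a readable null_blk parameter directory:]

  # number of hardware queues (hctx directories) the device exposes; expect 80
  ls -d /sys/block/nullb0/mq/*/ | wc -l

  # submit_queues as the driver actually accepted it
  cat /sys/module/null_blk/parameters/submit_queues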


Re: [dm-devel] [PATCH 1/1] block: Convert hd_struct in_flight from atomic to percpu

2017-07-04 Thread Brian King
On 07/01/2017 11:43 AM, Jens Axboe wrote:
> Now:
> 
> http://git.kernel.dk/cgit/linux-block/commit/?h=mq-inflight&id=87f73ef2b9edb6834001df8f7cb48c7a116e8cd3
> 
>> And updated the branch here:
>>
>> http://git.kernel.dk/cgit/linux-block/log/?h=mq-inflight
>>
>> to include that, and be based on top of for-4.13/block. If you prefer just
>> pulling a branch, pull:
>>
>> git://git.kernel.dk/linux-block mq-inflight
> 
> I've tested it, and made a few adjustments and fixes. The branches are
> the same, just rebased. Works fine for me, even with partitions now.
> 
> Let me know how it works for you.

This is looking very nice. Here is what I'm seeing now:

1 null_blk device / 80 submit_queues / fio numjobs=80
9.4M IOPS

80 null_blk devices / 1 submit_queue per null_blk / fio numjobs=1 (1 per null_blk)
9.3M IOPS

1 null_blk device / 160 submit_queues / fio numjobs=160
9.8M IOPS

160 null_blk devices / 1 submit_queue per null_blk / fio numjobs=1 (1 per null_blk)
8.9M IOPS

What is queued in mq-inflight looks like it resolves the issue I've been seeing.

Thanks!!!

Brian

-- 
Brian King
Power Linux I/O
IBM Linux Technology Center

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
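
[Jens' note above that the rebased branch works "even with partitions" concerns the
per-partition hd_struct counters the patch converts. A hedged sketch of how whole-disk
and per-partition in-flight counts can be compared under load; /dev/sdX and sdX1 are
placeholders for any partitioned test disk, not devices from this thread:]

  # drive I/O at a partition, then compare partition and whole-disk counters
  fio --name=partjob --filename=/dev/sdX1 --direct=1 --rw=randread --bs=4k \
      --iodepth=32 --ioengine=libaio --runtime=30 --time_based &

  cat /sys/block/sdX/sdX1/inflight   # in-flight reads/writes charged to the partition
  cat /sys/block/sdX/inflight        # in-flight reads/writes for the whole disk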


Re: [dm-devel] [PATCH 1/1] block: Convert hd_struct in_flight from atomic to percpu

2017-07-04 Thread Ming Lei
On Sat, Jul 1, 2017 at 10:18 AM, Brian King wrote:
> On 06/30/2017 06:26 PM, Jens Axboe wrote:
>> On 06/30/2017 05:23 PM, Ming Lei wrote:
>>> Hi Brian,
>>>
>>> On Sat, Jul 1, 2017 at 2:33 AM, Brian King wrote:
>>>> On 06/30/2017 09:08 AM, Jens Axboe wrote:
>>>> Compared with the totally percpu approach, this way might help 1:M or
>>>> N:M mapping, but won't help the 1:1 mapping (NVMe), where an hctx is mapped
>>>> to each CPU (especially when there are huge numbers of hw queues on a big system), :-(
>>>
>>> Not disagreeing with that, without having some mechanism to only
>>> loop queues that have pending requests. That would be similar to the
>>> ctx_map for sw to hw queues. But I don't think that would be worthwhile
>>> doing, I like your pnode approach better. However, I'm still not fully
>>> convinced that one per node is enough to get the scalability we need.
>>>
>>> Would be great if Brian could re-test with your updated patch, so we
>>> know how it works for him at least.
>>
>> I'll try running with both approaches today and see how they compare.
>
> Focus on Ming's, a variant of that is the most likely path forward,
> imho. It'd be great to do a quick run on mine as well, just to establish
> how it compares to mainline, though.

>>>> On my initial runs, the one from you Jens, appears to perform a bit better,
>>>> although both are a huge improvement from what I was seeing before.
>>>>
>>>> I ran 4k random reads using fio to nullblk in two configurations on my 20 core
>>>> system with 4 NUMA nodes and 4-way SMT, so 80 logical CPUs. I ran both 80 threads
>>>> to a single null_blk as well as 80 threads to 80 null_block devices, so one thread
>>>
>>> Could you share what the '80 null_block devices' means?  Did you create
>>> 80 null_blk devices, or did you create one null_blk and set its hw queue
>>> count to 80 via the 'submit_queues' module parameter?
>>
>> That's a valid question, was going to ask that too. But I assumed that Brian
>> used submit_queues to set as many queues as he has logical CPUs in the 
>> system.
>>>
>>> I guess we should focus on the multi-queue case since that is how NVMe
>>> normally operates.
>>>
>>>> per null_blk. This is what I saw on this machine:
>>>>
>>>> Using the per-node atomic change from Ming Lei
>>>> 1 null_blk, 80 threads
>>>> iops=9376.5K
>>>>
>>>> 80 null_blk, 1 thread
>>>> iops=9523.5K
>>>>
>>>> Using the alternate patch from Jens using the tags
>>>> 1 null_blk, 80 threads
>>>> iops=9725.8K
>>>>
>>>> 80 null_blk, 1 thread
>>>> iops=9569.4K
>>>
>>> If 1 thread means a single fio job, the number looks far too high: it would
>>> mean one random IO completes in about 0.1us (100ns) on a single CPU, and I'm
>>> not sure that is possible, :-)
>>
>> It means either 1 null_blk device, 80 threads running IO to it. Or 80 
>> null_blk
>> devices, each with a thread running IO to it. See above, he details that it's
>> 80 threads on 80 devices for that case.
>
> Right. So the two modes I'm running in are:
>
> 1. 1 null_blk device, with 80 submit_queues, 80 fio jobs, 80 logical CPUs.
> 2. 80 null_blk devices, each with one submit_queue, with one fio job per
>    null_blk device, so 80 threads total. 80 logical CPUs.
>
> In theory, the two should result in similar numbers.
>
> Here are the commands and fio configurations:
>
> Scenario #1
> modprobe null_blk submit_queues=80 nr_devices=1 irqmode=0
>
> FIO config:
> [global]
> buffered=0
> invalidate=1
> bs=4k
> iodepth=64
> numjobs=80
> group_reporting=1
> rw=randrw
> rwmixread=100
> rwmixwrite=0
> ioengine=libaio
> runtime=60
> time_based
>
> [job1]
> filename=/dev/nullb0
>
>
> Scenario #2
> modprobe null_blk submit_queues=1 nr_devices=80 irqmode=0
>
> FIO config
> [global]
> buffered=0
> invalidate=1
> bs=4k
> iodepth=64
> numjobs=1
> group_reporting=1
> rw=randrw
> rwmixread=100
> rwmixwrite=0
> ioengine=libaio
> runtime=60
> time_based
>
> [job1]
> filename=/dev/nullb0
> [job2]
> filename=/dev/nullb1
> [job3]
> filename=/dev/nullb2
> [job4]
> filename=/dev/nullb3
> [job5]
> filename=/dev/nullb4
> [job6]
> filename=/dev/nullb5
> [job7]
> filename=/dev/nullb6
> [job8]
> filename=/dev/nullb7
> [job9]
> filename=/dev/nullb8
> [job10]
> filename=/dev/nullb9
> [job11]
> filename=/dev/nullb10
> [job12]
> filename=/dev/nullb11
> [job13]
> filename=/dev/nullb12
> [job14]
> filename=/dev/nullb13
> [job15]
> filename=/dev/nullb14
> [job16]
> filename=/dev/nullb15
> [job17]
> filename=/dev/nullb16
> [job18]
> filename=/dev/nullb17
> [job19]
> filename=/dev/nullb18
> [job20]
> filename=/dev/nullb19
> [job21]
> filename=/dev/nullb20
> [job22]
> filename=/dev/nullb21
> [job23]
> filename=/dev/nullb22
> [job24]
> filename=/dev/nullb23
> [job25]
> filename=/dev/nullb24
> [job26]
> filename=/dev/nullb25
> [job27]
> filename=/dev/nullb26
> [job28]
> filename=/dev/nullb27
> [job29]
> filename=/dev/nullb28
>