On 2020/12/15 2:58, Eric Farman wrote:
>
>
> On 12/10/20 8:56 AM, xuxiaoyang (C) wrote:
>>
>>
>> On 2020/12/9 22:42, Eric Farman wrote:
>>>
>>>
>>> On 12/9/20 6:54 AM, Cornelia Huck wrote:
>>>> On Tue, 8 Dec 2020 21:55:53 +0800
>>>> "xuxiaoyang (C)" <xuxiaoya...@huawei.com> wrote:
>>>>
>>>>> On 2020/11/21 15:58, xuxiaoyang (C) wrote:
>>>>>> vfio_pin_pages() accepts an array of unrelated iova pfns and processes
>>>>>> each to return the physical pfn. When dealing with large arrays of
>>>>>> contiguous iovas, vfio_iommu_type1_pin_pages is very inefficient because
>>>>>> it is processed page by page. In this case, we can divide the iova pfn
>>>>>> array into multiple contiguous ranges and optimize them. For example,
>>>>>> when the iova pfn array is {1,5,6,7,9}, it will be divided into three
>>>>>> groups {1}, {5,6,7}, {9} for processing. When processing {5,6,7}, the
>>>>>> number of calls to pin_user_pages_remote is reduced from 3 times to once.
>>>>>> For a single page or a large array of discontiguous iovas, we still use
>>>>>> vfio_pin_page_external to limit the performance loss caused by the
>>>>>> refactoring.
>>>>>>
>>>>>> Signed-off-by: Xiaoyang Xu <xuxiaoya...@huawei.com>
>>>>
>>>> (...)
>>>>
>>>>>
>>>>> Hi Cornelia Huck, Eric Farman, Zhenyu Wang, Zhi Wang,
>>>>>
>>>>> vfio_pin_pages() accepts an array of unrelated iova pfns and processes
>>>>> each to return the physical pfn. When dealing with large arrays of
>>>>> contiguous iovas, vfio_iommu_type1_pin_pages is very inefficient because
>>>>> it is processed page by page. In this case, we can divide the iova pfn
>>>>> array into multiple contiguous ranges and optimize them. I have a set
>>>>> of performance test data for reference.
>>>>>
>>>>> The patch was not applied:
>>>>>                   1 page     512 pages
>>>>> no huge pages:    1638ns     223651ns
>>>>> THP:              1668ns     222330ns
>>>>> HugeTLB:          1526ns     208151ns
>>>>>
>>>>> The patch was applied:
>>>>>                   1 page     512 pages
>>>>> no huge pages:    1735ns     167286ns
>>>>> THP:              1934ns     126900ns
>>>>> HugeTLB:          1713ns     102188ns
>>>>>
>>>>> As Alex Williamson said, this patch lacks proof that it works in the
>>>>> real world. I think you will have some valuable opinions.
>>>>
>>>> Looking at this from the vfio-ccw angle, I'm not sure how much this
>>>> would buy us, as we deal with IDAWs, which are designed so that they
>>>> can be non-contiguous. I guess this depends a lot on what the guest
>>>> does.
>>>
>>> This would be my concern too, but I don't have data off the top of my head
>>> to say one way or another...
>>>
>>>>
>>>> Eric, any opinion? Do you maybe also happen to have a test setup that
>>>> mimics workloads actually seen in the real world?
>>>>
>>>
>>> ...I do have some test setups, which I will try to get some data from in a
>>> couple days. At the moment I've broken most of those setups trying to
>>> implement some other stuff, and can't revert back at the moment. Will get
>>> back to this.
>>>
>>> Eric
>>> .
>>
>> Thank you for your reply. Looking forward to your test data.
>
> Xu,
>
> The scenario I ran was a host kernel 5.10.0-rc7 with qemu 5.2.0, with a
> Fedora 32 guest with 4 VCPU and 4GB memory. I tried this a handful of times
> across a couple different hosts, so the likelihood that these numbers are
> outliers is pretty low. The histograms below come from a simple bpftrace script,
> recording the number of pages asked to be pinned, and the length of time (in
> nanoseconds) it took to pin all those pages. I separated out the length of
> time for a request of one page versus a request of multiple pages, because as
> you will see the former far outnumbers the latter.
>
> The first thing I tried was simply to boot the guest via vfio-ccw, to see how
> the patch itself behaved:
>
> @1_page_ns       BASE            +PATCH
> 256, 512        12531  42.50%     12744  42.26%
> 512, 1K          5660  19.20%      5611  18.61%
> 1K, 2K           8416  28.54%      8947  29.67%
> 2K, 4K           2694   9.14%      2669   8.85%
> 4K, 8K            164   0.56%       169   0.56%
> 8K, 16K            14   0.05%        14   0.05%
> 16K, 32K            2   0.01%         3   0.01%
> 32K, 64K            0   0.00%         0   0.00%
> 64K, 128K           0   0.00%         0   0.00%
>
> @n_pages_ns      BASE            +PATCH
> 256, 512            0   0.00%         0   0.00%
> 512, 1K            67   0.97%        48   0.68%
> 1K, 2K           1598  23.13%      1036  14.71%
> 2K, 4K           2784  40.30%      3112  44.17%
> 4K, 8K           1288  18.64%      1579  22.41%
> 8K, 16K          1011  14.63%      1032  14.65%
> 16K, 32K          159   2.30%       234   3.32%
> 32K, 64K            1   0.01%         2   0.03%
> 64K, 128K           0   0.00%         2   0.03%
>
> @npage           BASE            +PATCH
> 1               29484  81.02%     30157  81.06%
> 2, 4             3298   9.06%      3385   9.10%
> 4, 8             1011   2.78%      1029   2.77%
> 8, 16            2600   7.14%      2631   7.07%
>
>
> The second thing I tried was simply fio, running it for about 10 minutes with
> a few minutes each for sequential read, sequential write, random read, and
> random write. (I tried this with both the guest booted off vfio-ccw and
> virtio-blk, but the difference was negligible.) The results in this space are
> similar as well:
>
> @1_page_ns       BASE            +PATCH
> 256, 512      5648104  66.79%   6615878  66.75%
> 512, 1K       1784047  21.10%   2082852  21.01%
> 1K, 2K         648877   7.67%    771964   7.79%
> 2K, 4K         339551   4.01%    396381   4.00%
> 4K, 8K          32513   0.38%     40359   0.41%
> 8K, 16K          2602   0.03%      2884   0.03%
> 16K, 32K          758   0.01%       762   0.01%
> 32K, 64K          434   0.01%       352   0.00%
>
> @n_pages_ns      BASE            +PATCH
> 256, 512            0   0.00%         0   0.00%
> 512, 1K        470803  12.18%    360524   7.95%
> 1K, 2K        1305166  33.75%   1739183  38.37%
> 2K, 4K        1338277  34.61%   1471161  32.46%
> 4K, 8K         733480  18.97%    937341  20.68%
> 8K, 16K         16954   0.44%     20708   0.46%
> 16K, 32K         1278   0.03%      2197   0.05%
> 32K, 64K          707   0.02%       703   0.02%
>
> @npage           BASE            +PATCH
> 1             8457107  68.62%   9911624  68.62%
> 2, 4          2066957  16.77%   2446462  16.94%
> 4, 8           359989   2.92%    417188   2.89%
> 8, 16         1440006  11.68%   1668482  11.55%
>
>
> I tried a smattering of other tests that might be more realistic, but the
> results were all pretty similar so there's no point in appending them here.
> Across the board, the amount of time spent on a multi-page request grows with
> the supplied patch. It doesn't get me very excited.
>
> If you are wondering why this might be, Conny's initial take about IDAWs
> being non-contiguous by design is spot on. Let's observe the page counts
> given to vfio_iommu_type1_pin_contiguous_pages() in addition to the counts in
> vfio_iommu_type1_pin_pages(). The following is an example of one guest boot
> PLUS an fio run:
>
> vfio_iommu_type1_pin_pages npage:
> 1             9890332  68.64%
> 2, 4          2438213  16.92%
> 4, 8           416278   2.89%
> 8, 16         1663201  11.54%
> Total        14408024
>
> vfio_iommu_type1_pin_contiguous_pages npage:
> 1            16384925  86.89%
> 2, 4          1327548   7.04%
> 4, 8           727564   3.86%
> 8, 16          417182   2.21%
> Total        18857219
>
> Yup... 87% of the calls to vfio_iommu_type1_pin_contiguous_pages() are made
> with a length of just a single page.
>
> Happy to provide more data if desired, but it doesn't look like a benefit to
> vfio-ccw's use.
>
> Thanks,
> Eric
>
>
Eric, in your vfio-ccw data single-page requests account for 87% of the
calls, and the runs of contiguous pages are very short. In my test data,
the contiguous run is 512 pages long, which is a huge difference. So it is
easy to understand why this patch does not benefit vfio-ccw.
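
To illustrate why the run length matters so much, here is a minimal
user-space sketch of the grouping idea from the patch description, where
{1,5,6,7,9} is split into {1}, {5,6,7}, {9}. It is not the code from the
patch: contiguous_run_len() and count_runs() are hypothetical helper names
used only for illustration, and the sketch assumes that each run would map
to one batched pin call.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not from the patch): return the length of the
 * contiguous run that starts at pfn[start], i.e. how many consecutive
 * entries satisfy pfn[i + 1] == pfn[i] + 1.
 */
static size_t contiguous_run_len(const unsigned long *pfn, size_t npage,
                                 size_t start)
{
    size_t len = 1;

    assert(start < npage);
    while (start + len < npage &&
           pfn[start + len] == pfn[start + len - 1] + 1)
        len++;
    return len;
}

/*
 * Hypothetical helper: count the contiguous runs in the array. Under the
 * assumption of one batched pin call per run, this is the number of calls
 * needed instead of one call per page.
 */
static size_t count_runs(const unsigned long *pfn, size_t npage)
{
    size_t runs = 0;
    size_t i = 0;

    while (i < npage) {
        i += contiguous_run_len(pfn, npage, i);
        runs++;
    }
    return runs;
}
```

For the array {1,5,6,7,9}, count_runs() returns 3, so only the middle run
saves any calls. For 512 consecutive pfns it returns 1, while for an array
with no adjacent entries it returns npage and nothing is saved.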
Finally, thank you very much for your test data.
Regards,
Xu
>>
>> Regards,
>> Xu
>>
> .