Re: virtio-blk: should num_vqs be limited by num_possible_cpus()?

2019-03-13 Thread Dongli Zhang



On 3/13/19 5:39 PM, Cornelia Huck wrote:
> On Wed, 13 Mar 2019 11:26:04 +0800
> Dongli Zhang  wrote:
> 
>> On 3/13/19 1:33 AM, Cornelia Huck wrote:
>>> On Tue, 12 Mar 2019 10:22:46 -0700 (PDT)
>>> Dongli Zhang  wrote:
>>>   
 I observed that there is one msix vector for config and one shared vector
 for all queues with the qemu cmdline below, when num-queues for virtio-blk
 is more than the number of possible cpus:

 qemu: "-smp 4" while "-device 
 virtio-blk-pci,drive=drive-0,id=virtblk0,num-queues=6"

 # cat /proc/interrupts 
CPU0   CPU1   CPU2   CPU3
 ... ...
  24:  0  0  0  0   PCI-MSI 65536-edge  
 virtio0-config
  25:  0  0  0 59   PCI-MSI 65537-edge  
 virtio0-virtqueues
 ... ...


 However, when num-queues is the same as number of possible cpus:

 qemu: "-smp 4" while "-device 
 virtio-blk-pci,drive=drive-0,id=virtblk0,num-queues=4"

 # cat /proc/interrupts 
CPU0   CPU1   CPU2   CPU3
 ... ... 
  24:  0  0  0  0   PCI-MSI 65536-edge  
 virtio0-config
  25:  2  0  0  0   PCI-MSI 65537-edge  
 virtio0-req.0
  26:  0 35  0  0   PCI-MSI 65538-edge  
 virtio0-req.1
  27:  0  0 32  0   PCI-MSI 65539-edge  
 virtio0-req.2
  28:  0  0  0  0   PCI-MSI 65540-edge  
 virtio0-req.3
 ... ...

 In the above case, there is one msix vector per queue.
>>>
>>> Please note that this is pci-specific...
>>>   


 This is because the max number of queues is not limited by the number of
 possible cpus.

 By default, nvme (regardless of write_queues and poll_queues) and
 xen-blkfront limit the number of queues with num_possible_cpus().
>>>
>>> ...and these are probably pci-specific as well.  
>>
>> Not pci-specific, but per-cpu as well.
> 
> Ah, I meant that those are pci devices.
> 
>>
>>>   


 Is this by design, or can we fix it with the change below?


 diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
 index 4bc083b..df95ce3 100644
 --- a/drivers/block/virtio_blk.c
 +++ b/drivers/block/virtio_blk.c
 @@ -513,6 +513,8 @@ static int init_vq(struct virtio_blk *vblk)
  	if (err)
  		num_vqs = 1;
  
 +	num_vqs = min(num_possible_cpus(), num_vqs);
 +
  	vblk->vqs = kmalloc_array(num_vqs, sizeof(*vblk->vqs), GFP_KERNEL);
  	if (!vblk->vqs)
  		return -ENOMEM;
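
 (An aside: if num_vqs is a narrower type than the unsigned int returned by
 num_possible_cpus(), the kernel's min() macro will warn about the mismatched
 types; a min_t() form of the hunk above sidesteps that. A sketch only,
 assuming such a type mismatch exists:

 	/* clamp the requested queues to the number of possible CPUs;
 	 * min_t() avoids min()'s strict type check */
 	num_vqs = min_t(unsigned int, num_possible_cpus(), num_vqs);
 )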
>>>
>>> virtio-blk, however, is not pci-specific.
>>>
>>> If we are using the ccw transport on s390, a completely different
>>> interrupt mechanism is in use ('floating' interrupts, which are not
>>> per-cpu). A check like that should therefore not go into the generic
>>> driver.
>>>   
>>
>> So far there seem to be two options.
>>
>> The first option is to ask the qemu user to always specify "num-queues" with
>> the same number as vcpus when running an x86 guest with pci for virtio-blk or
>> virtio-scsi, in order to assign a vector to each queue.
> 
> That does seem like an extra burden for the user: IIUC, things work
> even if you have too many queues, it's just not optimal. It sounds like
> something that can be done by a management layer (e.g. libvirt), though.
> 
>> Or, is it fine for virtio folks to add a new hook to 'struct virtio_config_ops'
>> so that different platforms (e.g., pci or ccw) would use different ways to
>> limit the max number of queues in the guest, with something like below?
> 
> That sounds better, as both transports and drivers can opt-in here.
> 
> However, maybe it would be even better to try to come up with a better
> strategy of allocating msix vectors in virtio-pci. More vectors in the
> num_queues > num_cpus case, even if they still need to be shared?
> Individual vectors for n-1 cpus and then a shared one for the remaining
> queues?
> 
> It might even be device-specific: Have some low-traffic status queues
> share a vector, and provide an individual vector for high-traffic
> queues. Would need some device<->transport interface, obviously.
> 

This sounds a little bit similar to multiple hctx maps?

So far, as virtio-blk only supports set->nr_maps = 1, blk_mq_alloc_tag_set()
would use at most nr_cpu_ids hw queues, no matter how many hw queues are
assigned to virtio-blk.

2981 int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
... ...
3021 /*
3022  * There is no use for more h/w queues than cpus if we just have
3023  * a single map
3024  */
3025 if (set->nr_maps == 1 && set->nr_hw_queues > nr_cpu_ids)
3026 set->nr_hw_queues = nr_cpu_ids;

Even the block layer would limit the number of hw queues by nr_cpu_ids when
set->nr_maps == 1.
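
As an illustration of the virtio_config_ops hook floated above, here is a
rough sketch (hypothetical: the op name, signature and placement are
assumptions, not an existing interface). A transport such as virtio-pci could
cap the queue count at num_possible_cpus(), since per-queue MSI-X vectors
beyond the CPU count buy nothing, while virtio-ccw, with its floating
(non per-cpu) interrupts, could leave the request untouched:

	/* hypothetical new, optional member of struct virtio_config_ops */
	unsigned int (*max_nr_vqs)(struct virtio_device *vdev);

	/* a driver such as virtio-blk could then clamp its request: */
	static unsigned int virtio_limit_nr_vqs(struct virtio_device *vdev,
						unsigned int requested)
	{
		if (vdev->config->max_nr_vqs)
			return min_t(unsigned int,
				     vdev->config->max_nr_vqs(vdev), requested);
		return requested;
	}

	/* and the virtio-pci implementation might simply be: */
	static unsigned int vp_max_nr_vqs(struct virtio_device *vdev)
	{
		return num_possible_cpus();
	}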

[RFC] vhost: select TAP if VHOST is configured

2019-03-13 Thread Stephen Hemminger
If VHOST_NET is configured but TUN and TAP are not, then the
kernel will build but vhost will not work correctly since it can't
set up the necessary tap device.

A solution is to select it.

Fixes: 9a393b5d5988 ("tap: tap as an independent module")
Signed-off-by: Stephen Hemminger 
---
 drivers/vhost/Kconfig | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index b580885243f7..a24c69598241 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -1,7 +1,8 @@
 config VHOST_NET
tristate "Host kernel accelerator for virtio net"
-   depends on NET && EVENTFD && (TUN || !TUN) && (TAP || !TAP)
+   depends on NET && EVENTFD
select VHOST
+   select TAP
---help---
  This kernel module can be loaded in host kernel to accelerate
  guest networking with virtio_net. Not to be confused with virtio_net
-- 
2.17.1



Re: [RFC PATCH V2 0/5] vhost: accelerate metadata access through vmap()

2019-03-13 Thread Christoph Hellwig
On Tue, Mar 12, 2019 at 01:53:37PM -0700, James Bottomley wrote:
> I've got to say: optimize what?  What code do we ever have in the
> kernel that kmap's a page and then doesn't do anything with it? You can
> guarantee that on kunmap the page is either referenced (needs
> invalidating) or updated (needs flushing). The in-kernel use of kmap is
> always
> 
> kmap
> do something with the mapped page
> kunmap
> 
> In a very short interval.  It seems just a simplification to make
> kunmap do the flush if needed rather than try to have the users
> remember.  The thing which makes this really simple is that on most
> architectures flush and invalidate are the same operation.  If you
> really want to optimize you can use the referenced and dirty bits on
> the kmapped pte to tell you what operation to do, but if your flush is
> your invalidate, you simply assume the data needs flushing on kunmap
> without checking anything.

I agree that this would be a good way to simplify the API.  Now
we'd just need volunteers to implement this for all architectures
that need cache flushing and then remove the explicit flushing in
the callers.

> > Which means after we fix vhost to add the flush_dcache_page after
> > kunmap, Parisc will get a double hit (but it also means Parisc was
> > the only one of those archs that needed explicit cache flushes, where
> > vhost worked correctly so far... so it kind of proves your point about
> > giving up being the safe choice).
> 
> What double hit?  If there's no cache to flush then cache flush is a
> no-op.  It's also a highly pipelineable no-op because the CPU has the L1
> cache within easy reach.  The only time a flush takes a large amount of
> time is if we actually have dirty data to write back to main memory.

I've heard people complaining that on some microarchitectures even
no-op cache flushes are relatively expensive.  Don't ask me why,
but if we can easily avoid double flushes we should do that.
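
As an illustration of that idea, here is a minimal sketch of a kunmap variant
that uses the referenced/dirty bits of the kmap pte to decide whether any
cache maintenance is needed (hypothetical: the function name is an assumption,
not an existing kernel interface, and flush and invalidate are assumed to
collapse into flush_dcache_page(), as they do on most architectures):

	#include <linux/highmem.h>
	#include <asm/pgtable.h>

	static inline void kunmap_and_flush(struct page *page, pte_t kmap_pte)
	{
		/*
		 * Referenced or dirty: the kernel touched the page through
		 * this mapping, so flush before tearing the mapping down.
		 * Untouched pages need no cache maintenance at all.
		 */
		if (pte_young(kmap_pte) || pte_dirty(kmap_pte))
			flush_dcache_page(page);
		kunmap(page);
	}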

