Re: virtio-blk: should num_vqs be limited by num_possible_cpus()?

2019-03-14 Thread Jason Wang


On 2019/3/14 2:12 PM, Dongli Zhang wrote:


On 3/13/19 5:39 PM, Cornelia Huck wrote:

On Wed, 13 Mar 2019 11:26:04 +0800
Dongli Zhang  wrote:


On 3/13/19 1:33 AM, Cornelia Huck wrote:

On Tue, 12 Mar 2019 10:22:46 -0700 (PDT)
Dongli Zhang  wrote:
   

I observed that there is one msix vector for config and one shared vector
for all queues in below qemu cmdline, when the num-queues for virtio-blk
is more than the number of possible cpus:

qemu: "-smp 4" while "-device 
virtio-blk-pci,drive=drive-0,id=virtblk0,num-queues=6"

# cat /proc/interrupts
CPU0   CPU1   CPU2   CPU3
... ...
  24:  0  0  0  0   PCI-MSI 65536-edge  
virtio0-config
  25:  0  0  0 59   PCI-MSI 65537-edge  
virtio0-virtqueues
... ...


However, when num-queues is the same as number of possible cpus:

qemu: "-smp 4" while "-device 
virtio-blk-pci,drive=drive-0,id=virtblk0,num-queues=4"

# cat /proc/interrupts
CPU0   CPU1   CPU2   CPU3
... ...
  24:  0  0  0  0   PCI-MSI 65536-edge  
virtio0-config
  25:  2  0  0  0   PCI-MSI 65537-edge  
virtio0-req.0
  26:  0 35  0  0   PCI-MSI 65538-edge  
virtio0-req.1
  27:  0  0 32  0   PCI-MSI 65539-edge  
virtio0-req.2
  28:  0  0  0  0   PCI-MSI 65540-edge  
virtio0-req.3
... ...

In above case, there is one msix vector per queue.

Please note that this is pci-specific...
   


This is because the max number of queues is not limited by the number of
possible cpus.

By default, nvme (regardless about write_queues and poll_queues) and
xen-blkfront limit the number of queues with num_possible_cpus().

...and these are probably pci-specific as well.

Not pci-specific, but per-cpu as well.

Ah, I meant that those are pci devices.

   


Is this by design on purpose, or can we fix with below?


diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 4bc083b..df95ce3 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -513,6 +513,8 @@ static int init_vq(struct virtio_blk *vblk)
if (err)
num_vqs = 1;
  
+	num_vqs = min(num_possible_cpus(), num_vqs);
+
vblk->vqs = kmalloc_array(num_vqs, sizeof(*vblk->vqs), GFP_KERNEL);
if (!vblk->vqs)
return -ENOMEM;

virtio-blk, however, is not pci-specific.

If we are using the ccw transport on s390, a completely different
interrupt mechanism is in use ('floating' interrupts, which are not
per-cpu). A check like that should therefore not go into the generic
driver.
   

So far there seems two options.

The 1st option is to ask the qemu user to always specify "-num-queues" equal to the
number of vcpus when running an x86 guest with pci for virtio-blk or
virtio-scsi, in order to assign a vector to each queue.

That does seem like an extra burden for the user: IIUC, things work
even if you have too many queues, it's just not optimal. It sounds like
something that can be done by a management layer (e.g. libvirt), though.


Or, is it fine for virtio folks to add a new hook to 'struct virtio_config_ops'
so that different platforms (e.g., pci or ccw) would use different ways to limit
the max number of queues in guest, with something like below?
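
As a purely hypothetical sketch of what such an opt-in hook might look like (the
callback name and the virtio-pci behaviour below are assumptions, not code from
this thread):

/*
 * Hypothetical sketch only: an optional per-transport callback that caps
 * the number of request virtqueues a driver asks for.
 */
typedef unsigned int (*vq_limit_fn)(struct virtio_device *vdev,
				    unsigned int requested);

/* virtio-pci could plug in something like this ... */
static unsigned int vp_limit_num_vqs(struct virtio_device *vdev,
				     unsigned int requested)
{
	return min_t(unsigned int, num_possible_cpus(), requested);
}

/* ... while virtio-ccw would leave the callback NULL, so floating
 * interrupts keep whatever queue count the device offers. */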

That sounds better, as both transports and drivers can opt-in here.

However, maybe it would be even better to try to come up with a better
strategy of allocating msix vectors in virtio-pci. More vectors in the
num_queues > num_cpus case, even if they still need to be shared?
Individual vectors for n-1 cpus and then a shared one for the remaining
queues?

It might even be device-specific: Have some low-traffic status queues
share a vector, and provide an individual vector for high-traffic
queues. Would need some device<->transport interface, obviously.
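
A rough sketch of the "dedicated vectors for n-1 cpus plus one shared vector"
idea, assuming vector 0 stays reserved for config as it is today (this is not
existing virtio-pci code):

/*
 * Hypothetical vector assignment: queues 0..n-2 get their own vector,
 * every remaining queue shares the last one.
 */
static u16 pick_msix_vector(unsigned int qindex, unsigned int num_vcpus)
{
	if (qindex < num_vcpus - 1)
		return 1 + qindex;	/* dedicated vector */
	return num_vcpus;		/* shared by the tail queues */
}

With 4 vcpus and 6 queues this would give req.0-req.2 their own vectors and let
req.3-req.5 share one, instead of all six queues sharing a single vector as in
the /proc/interrupts output above.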


This sounds a little bit similar to multiple hctx maps?

So far, as virtio-blk only supports set->nr_maps = 1, no matter how many hw
queues are assigned for virtio-blk, blk_mq_alloc_tag_set() would use at most
nr_cpu_ids hw queues.

2981 int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
... ...
3021 /*
3022  * There is no use for more h/w queues than cpus if we just have
3023  * a single map
3024  */
3025 if (set->nr_maps == 1 && set->nr_hw_queues > nr_cpu_ids)
3026 set->nr_hw_queues = nr_cpu_ids;

Even the block layer would limit the number of hw queues by nr_cpu_ids when
(set->nr_maps == 1).

That's why I think virtio-blk should use a similar solution to nvme
(regardless of write_queues and poll_queues) and xen-blkfront.

Added Jason again. I do not know why the mailing list of
virtualization@lists.linux-foundation.org always filters out Jason's email...


Dongli Zhang



Or something like what I proposed several years ago? 
https://do

Re: [RFC PATCH V2 0/5] vhost: accelerate metadata access through vmap()

2019-03-14 Thread Jason Wang


On 2019/3/15 3:33 AM, Andrea Arcangeli wrote:

Hello Jason,

On Thu, Mar 14, 2019 at 09:49:03PM +0800, Jason Wang wrote:

Yes since we don't want to slow down 32bit.

If you've a lot of ram there's no justification to stick to a 32bit
kernel, so I don't think there's need to maintain a separate model
just for 32bit. I really wouldn't care about the performance of 32bit
with >700MB of RAM if that would cause any maintenance burden. Let's
focus on the best 64bit implementation that will work equally
optimally on 32bit with <= 700M of RAM.



Yes, but probably there are still some reasons for keeping copy_user() 
and friends as a fallback. When we have a large virtqueue, the ring may 
occupy more than one page. This means the VA might not be contiguous 
when using kmap(). Instead of doing tricks in the accessors, maybe 
it's simpler or better to just fall back to copy_user() in this case. We 
meet a similar issue when the software device IOTLB is used for vhost. 
And in the following example for gup, we can simply do a fallback when 
we race with the invalidation.


Michael also tends to keep copy_user(); he suggested using 
copy_user() for VIVT archs so that there's no need for an explicit 
flush_dcache_page(). He also wants a module parameter for falling 
back to copy_user(), e.g. for debugging purposes.





Talking to Jerome about the set_page_dirty issue, he raised the point
of what happens if two threads call an mmu notifier invalidate
simultaneously. The first mmu notifier could call set_page_dirty and
then proceed in try_to_free_buffers or page_mkclean, and then the
concurrent mmu notifier that arrives second must not call
set_page_dirty a second time.

With KVM sptes mappings and vhost mappings you would call
set_page_dirty (if you invoked gup with FOLL_WRITE) only when
effectively tearing down any secondary mapping (you've got pointers in
both cases for the mapping). So there's no way to risk a double
set_page_dirty from concurrent mmu notifier invalidate because the
invalidate takes a lock when it has to teardown the mapping and so
set_page_dirty is only run in the first invalidate method and not in
the second. In the spte case even better, as you wouldn't need to call
it even at teardown time unless the spte is dirty (if shadow mask
allows dirty sptes with EPT2 or NPT or shadow MMU).



I see, that sounds indeed better.




If you instead had to invalidate a secondary MMU mapping that isn't
tracked by the driver (again: not vhost nor KVM case), you could have
used the dirty bit of the kernel pagetable to call set_page_dirty and
disambiguate but that's really messy, and it would prevent the use of
gigapages in the direct mapping too and it'd require vmap for 4k
tracking.

To make sure set_page_dirty is run a single time no matter whether the
invalidate knows when a mapping is torn down, I suggested the below
model:

   access = FOLL_WRITE

repeat:
   page = gup_fast(access)
   put_page(page) /* need a way to drop FOLL_GET from gup_fast instead! */

   spin_lock(mmu_notifier_lock);
   if (race with invalidate) {
 spin_unlock..
 goto repeat;
   }
   if (access == FOLL_WRITE)
 set_page_dirty(page)
   establish writable mapping in secondary MMU on page
   spin_unlock

(replace spin_lock with mutex_lock for vhost of course if you stick to
a mutex and _start/_end instead of non-sleepable ->invalidate_range)



Yes, I'll probably stick to the vq mutex since the invalidation needs to be 
synchronized with the whole packet processing routine.




"race with invalidate" is the usual "mmu_notifier_retry" in kvm_host.h
to be implemented for vhost.

We could add a FOLL_DIRTY flag to add to FOLL_TOUCH to move the
set_page_dirty inside GUP forced (currently it's not forced if the
linux pte is already dirty). And we could remove FOLL_GET.

Then if you have the ability to disambiguate which is the first
invalidate that tears down a mapping to any given page (vhost can do
that trivially, it's just a pointer to a page struct to kmap), in the
mmu notifier invalidate just before dropping the spinlock you would
do this check:

def vhost_invalidate_range_start():
[..]
spin_lock(mmu_notifier_lock);
[..]
if (vhost->page_pointer) {
   if (access == FOLL_WRITE)
VM_WARN_ON(!PageDirty(vhost->page_pointer));
   vhost->page_pointer = NULL;
   /* no put_page, already done at gup time */
}
spin_unlock(..

Generally speaking set_page_dirty is safer done after the last
modification to the data of the page. However the way stable page
works, as long as the mmu notifier invalidate didn't run, the PG_dirty
cannot go away.



Ok.




So this model solves the issue with guaranteeing a single
set_page_dirty is run before page_mkclean or try_to_free_buffers can
run, even for drivers that implement the invalidate as a generic "TLB
flush" across the whole secondary MMU and that cannot disambiguate the
first invalidate from a second invalidate if they're issued
concurrently on the same address by two different CPUs.

Re: [RFC PATCH V2 0/5] vhost: accelerate metadata access through vmap()

2019-03-14 Thread Andrea Arcangeli
Hello Jason,

On Thu, Mar 14, 2019 at 09:49:03PM +0800, Jason Wang wrote:
> Yes since we don't want to slow down 32bit.

If you've a lot of ram there's no justification to stick to a 32bit
kernel, so I don't think there's need to maintain a separate model
just for 32bit. I really wouldn't care about the performance of 32bit
with >700MB of RAM if that would cause any maintenance burden. Let's
focus on the best 64bit implementation that will work equally
optimally on 32bit with <= 700M of RAM.

Talking to Jerome about the set_page_dirty issue, he raised the point
of what happens if two threads call an mmu notifier invalidate
simultaneously. The first mmu notifier could call set_page_dirty and
then proceed in try_to_free_buffers or page_mkclean, and then the
concurrent mmu notifier that arrives second must not call
set_page_dirty a second time.

With KVM sptes mappings and vhost mappings you would call
set_page_dirty (if you invoked gup with FOLL_WRITE) only when
effectively tearing down any secondary mapping (you've got pointers in
both cases for the mapping). So there's no way to risk a double
set_page_dirty from concurrent mmu notifier invalidate because the
invalidate takes a lock when it has to teardown the mapping and so
set_page_dirty is only run in the first invalidate method and not in
the second. In the spte case even better, as you wouldn't need to call
it even at teardown time unless the spte is dirty (if shadow mask
allows dirty sptes with EPT2 or NPT or shadow MMU).

If you instead had to invalidate a secondary MMU mapping that isn't
tracked by the driver (again: not vhost nor KVM case), you could have
used the dirty bit of the kernel pagetable to call set_page_dirty and
disambiguate but that's really messy, and it would prevent the use of
gigapages in the direct mapping too and it'd require vmap for 4k
tracking.

To make sure set_page_dirty is run a single time no matter whether the
invalidate knows when a mapping is torn down, I suggested the below
model:

  access = FOLL_WRITE

repeat:
  page = gup_fast(access)
  put_page(page) /* need a way to drop FOLL_GET from gup_fast instead! */

  spin_lock(mmu_notifier_lock);
  if (race with invalidate) {
spin_unlock..
goto repeat;
  }
  if (access == FOLL_WRITE)
set_page_dirty(page)
  establish writable mapping in secondary MMU on page
  spin_unlock

(replace spin_lock with mutex_lock for vhost of course if you stick to
a mutex and _start/_end instead of non-sleepable ->invalidate_range)

"race with invalidate" is the usual "mmu_notifier_retry" in kvm_host.h
to be implemented for vhost.
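
For reference, a minimal sketch of how such a retry check might look on the
vhost side, modeled on kvm's mmu_notifier_seq/mmu_notifier_count pattern (the
structure and field names are assumptions, not actual vhost code):

/*
 * Hypothetical sketch: none of these fields exist in vhost today.
 */
struct vhost_map_track {
	unsigned long invalidate_count;	/* nonzero while _start().._end() runs */
	unsigned long invalidate_seq;	/* bumped by every invalidate_start() */
};

/* Called under the vq mutex after gup_fast(), before installing the map. */
static bool vhost_range_retry(struct vhost_map_track *t, unsigned long seq)
{
	if (t->invalidate_count)	/* an invalidation is in flight */
		return true;
	if (t->invalidate_seq != seq)	/* an invalidation completed meanwhile */
		return true;
	return false;
}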

We could add a FOLL_DIRTY flag to add to FOLL_TOUCH to move the
set_page_dirty inside GUP forced (currently it's not forced if the
linux pte is already dirty). And we could remove FOLL_GET.

Then if you have the ability to disambiguate which is the first
invalidate that tears down a mapping to any given page (vhost can do
that trivially, it's just a pointer to a page struct to kmap), in the
mmu notifier invalidate just before dropping the spinlock you would
do this check:

def vhost_invalidate_range_start():
   [..]
   spin_lock(mmu_notifier_lock);
   [..]
   if (vhost->page_pointer) {
  if (access == FOLL_WRITE)
VM_WARN_ON(!PageDirty(vhost->page_pointer));
  vhost->page_pointer = NULL;
  /* no put_page, already done at gup time */
   }
   spin_unlock(..

Generally speaking set_page_dirty is safer done after the last
modification to the data of the page. However the way stable page
works, as long as the mmu notifier invalidate didn't run, the PG_dirty
cannot go away.

So this model solves the issue with guaranteeing a single
set_page_dirty is run before page_mkclean or try_to_free_buffers can
run, even for drivers that implement the invalidate as a generic "TLB
flush" across the whole secondary MMU and that cannot disambiguate the
first invalidate from a second invalidate if they're issued
concurrently on the same address by two different CPUs.

So for those drivers that can disambiguate trivially (like KVM and
vhost) we'll just add a debug check in the invalidate to validate the
common code for all mmu notifier users.

This is the solution for RDMA hardware and everything else that can
support mmu notifiers too, and they can take infinitely long secondary
MMU mappings without interfering with stable pages at all (i.e. long
term pins but without pinning), perfectly safe and transparent to the
whole stable page code.

I think O_DIRECT for stable pages shall be solved taking the page lock
or writeback lock or a new rwsem in the inode that is taken for
writing by page_mkclean and try_to_free_buffers and for reading by
outstanding O_DIRECT in flight I/O, like I suggested probably ages ago
but then we only made GUP take the page pin, which is fine for mmu
notifier actually (except those didn't exist back then). To solve
O_DIRECT we can leverage the 100% guarantee that the pin will be
dropped ASAP and stop page_mkclean and stop or trylock in
try_to_fr

[PATCH 15/38] vfs: Convert virtio_balloon to fs_context

2019-03-14 Thread David Howells
Signed-off-by: David Howells 
cc: "Michael S. Tsirkin" 
cc: Jason Wang 
cc: virtualization@lists.linux-foundation.org
---

 drivers/virtio/virtio_balloon.c |   19 ++-
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index f19061b585a4..89d67c8aa719 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -757,21 +757,22 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 
return MIGRATEPAGE_SUCCESS;
 }
+#include 
 
-static struct dentry *balloon_mount(struct file_system_type *fs_type,
-   int flags, const char *dev_name, void *data)
-{
-   static const struct dentry_operations ops = {
-   .d_dname = simple_dname,
-   };
+static const struct dentry_operations balloon_dops = {
+   .d_dname = simple_dname,
+};
 
-   return mount_pseudo(fs_type, "balloon-kvm:", NULL, &ops,
-   BALLOON_KVM_MAGIC);
+static int balloon_init_fs_context(struct fs_context *fc)
+{
+   return vfs_init_pseudo_fs_context(fc, "balloon-kvm:",
+ NULL, NULL,
+ &balloon_dops, BALLOON_KVM_MAGIC);
 }
 
 static struct file_system_type balloon_fs = {
.name   = "balloon-kvm",
-   .mount  = balloon_mount,
+   .init_fs_context = balloon_init_fs_context,
.kill_sb= kill_anon_super,
 };
 



[PATCH 00/38] VFS: Convert trivial filesystems and more

2019-03-14 Thread David Howells


Hi Al,

Here's a set of patches that:

 (1) Provides a convenience member in struct fs_context that is OR'd into
 sb->s_iflags by sget_fc().

 (2) Provides a convenience vfs_init_pseudo_fs_context() helper function
 for doing most of the work in mounting a pseudo filesystem.

 (3) Converts all the trivial filesystems that have no arguments to
 fs_context.

 (4) Converts binderfs (which was trivial before January).

 (5) Converts ramfs, tmpfs, rootfs and devtmpfs.

 (6) Kills off mount_pseudo(), mount_pseudo_xattr(), mount_ns(),
 sget_userns().

The patches can be found here also:

https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git

on branch:

mount-api-viro

David
---
David Howells (38):
  vfs: Provide sb->s_iflags settings in fs_context struct
  vfs: Provide a mount_pseudo-replacement for fs_context
  vfs: Convert aio to fs_context
  vfs: Convert anon_inodes to fs_context
  vfs: Convert bdev to fs_context
  vfs: Convert nsfs to fs_context
  vfs: Convert pipe to fs_context
  vfs: Convert zsmalloc to fs_context
  vfs: Convert sockfs to fs_context
  vfs: Convert dax to fs_context
  vfs: Convert drm to fs_context
  vfs: Convert ia64 perfmon to fs_context
  vfs: Convert cxl to fs_context
  vfs: Convert ocxlflash to fs_context
  vfs: Convert virtio_balloon to fs_context
  vfs: Convert btrfs_test to fs_context
  vfs: Kill off mount_pseudo() and mount_pseudo_xattr()
  vfs: Use sget_fc() for pseudo-filesystems
  vfs: Convert binderfs to fs_context
  vfs: Convert nfsctl to fs_context
  vfs: Convert rpc_pipefs to fs_context
  vfs: Kill off mount_ns()
  vfs: Kill sget_userns()
  vfs: Convert binfmt_misc to fs_context
  vfs: Convert configfs to fs_context
  vfs: Convert efivarfs to fs_context
  vfs: Convert fusectl to fs_context
  vfs: Convert qib_fs/ipathfs to fs_context
  vfs: Convert ibmasmfs to fs_context
  vfs: Convert oprofilefs to fs_context
  vfs: Convert gadgetfs to fs_context
  vfs: Convert xenfs to fs_context
  vfs: Convert openpromfs to fs_context
  vfs: Convert apparmorfs to fs_context
  vfs: Convert securityfs to fs_context
  vfs: Convert selinuxfs to fs_context
  vfs: Convert smackfs to fs_context
  tmpfs, devtmpfs, ramfs, rootfs: Convert to fs_context


 arch/ia64/kernel/perfmon.c |   14 +
 drivers/android/binderfs.c |  173 +---
 drivers/base/devtmpfs.c|   16 +
 drivers/dax/super.c|   13 +
 drivers/gpu/drm/drm_drv.c  |   14 +
 drivers/infiniband/hw/qib/qib_fs.c |   26 ++
 drivers/misc/cxl/api.c |   10 -
 drivers/misc/ibmasm/ibmasmfs.c |   21 +-
 drivers/oprofile/oprofilefs.c  |   20 +-
 drivers/scsi/cxlflash/ocxl_hw.c|   21 +-
 drivers/usb/gadget/legacy/inode.c  |   21 +-
 drivers/virtio/virtio_balloon.c|   19 +-
 drivers/xen/xenfs/super.c  |   21 +-
 fs/aio.c   |   15 +
 fs/anon_inodes.c   |   12 +
 fs/binfmt_misc.c   |   20 +-
 fs/block_dev.c |   14 +
 fs/btrfs/tests/btrfs-tests.c   |   13 +
 fs/configfs/mount.c|   20 +-
 fs/efivarfs/super.c|   20 +-
 fs/fuse/control.c  |   20 +-
 fs/libfs.c |   91 ++--
 fs/nfsd/nfsctl.c   |   33 ++-
 fs/nsfs.c  |   13 +
 fs/openpromfs/inode.c  |   20 +-
 fs/pipe.c  |   12 +
 fs/ramfs/inode.c   |  104 ++---
 fs/super.c |  106 ++
 include/linux/fs.h |   21 --
 include/linux/fs_context.h |8 +
 include/linux/ramfs.h  |6 -
 include/linux/shmem_fs.h   |4 
 init/do_mounts.c   |   12 -
 mm/shmem.c |  396 
 mm/zsmalloc.c  |   19 +-
 net/socket.c   |   14 +
 net/sunrpc/rpc_pipe.c  |   34 ++-
 security/apparmor/apparmorfs.c |   20 +-
 security/inode.c   |   21 +-
 security/selinux/selinuxfs.c   |   20 +-
 security/smack/smackfs.c   |   34 ++-
 41 files changed, 902 insertions(+), 609 deletions(-)



Re: virtio-blk: should num_vqs be limited by num_possible_cpus()?

2019-03-14 Thread Dongli Zhang



On 03/14/2019 08:13 PM, Cornelia Huck wrote:
> On Thu, 14 Mar 2019 14:12:32 +0800
> Dongli Zhang  wrote:
> 
>> On 3/13/19 5:39 PM, Cornelia Huck wrote:
>>> On Wed, 13 Mar 2019 11:26:04 +0800
>>> Dongli Zhang  wrote:
>>>   
 On 3/13/19 1:33 AM, Cornelia Huck wrote:  
> On Tue, 12 Mar 2019 10:22:46 -0700 (PDT)
> Dongli Zhang  wrote:
> 
>> Is this by design on purpose, or can we fix with below?
>>
>>
>> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
>> index 4bc083b..df95ce3 100644
>> --- a/drivers/block/virtio_blk.c
>> +++ b/drivers/block/virtio_blk.c
>> @@ -513,6 +513,8 @@ static int init_vq(struct virtio_blk *vblk)
>>  if (err)
>>  num_vqs = 1;
>>  
>> +num_vqs = min(num_possible_cpus(), num_vqs);
>> +
>>  vblk->vqs = kmalloc_array(num_vqs, sizeof(*vblk->vqs), 
>> GFP_KERNEL);
>>  if (!vblk->vqs)
>>  return -ENOMEM;
>
> virtio-blk, however, is not pci-specific.
>
> If we are using the ccw transport on s390, a completely different
> interrupt mechanism is in use ('floating' interrupts, which are not
> per-cpu). A check like that should therefore not go into the generic
> driver.
> 

 So far there seems two options.

 The 1st option is to ask the qemu user to always specify "-num-queues" 
 with the
 same number of vcpus when running x86 guest with pci for virtio-blk or
 virtio-scsi, in order to assign a vector for each queue.  
>>>
>>> That does seem like an extra burden for the user: IIUC, things work
>>> even if you have too many queues, it's just not optimal. It sounds like
>>> something that can be done by a management layer (e.g. libvirt), though.
>>>   
 Or, is it fine for virtio folks to add a new hook to 'struct 
 virtio_config_ops'
 so that different platforms (e.g., pci or ccw) would use different ways to 
 limit
 the max number of queues in guest, with something like below?  
>>>
>>> That sounds better, as both transports and drivers can opt-in here.
>>>
>>> However, maybe it would be even better to try to come up with a better
>>> strategy of allocating msix vectors in virtio-pci. More vectors in the
>>> num_queues > num_cpus case, even if they still need to be shared?
>>> Individual vectors for n-1 cpus and then a shared one for the remaining
>>> queues?
>>>
>>> It might even be device-specific: Have some low-traffic status queues
>>> share a vector, and provide an individual vector for high-traffic
>>> queues. Would need some device<->transport interface, obviously.
>>>   
>>
>> This sounds a little bit similar to multiple hctx maps?
>>
>> So far, as virtio-blk only supports set->nr_maps = 1, no matter how many hw
>> queues are assigned for virtio-blk, blk_mq_alloc_tag_set() would use at most
>> nr_cpu_ids hw queues.
>>
>> 2981 int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
>> ... ...
>> 3021 /*
>> 3022  * There is no use for more h/w queues than cpus if we just have
>> 3023  * a single map
>> 3024  */
>> 3025 if (set->nr_maps == 1 && set->nr_hw_queues > nr_cpu_ids)
>> 3026 set->nr_hw_queues = nr_cpu_ids;
>>
>> Even the block layer would limit the number of hw queues by nr_cpu_ids when
>> (set->nr_maps == 1).
> 
> Correct me if I'm wrong, but there seem to be two kinds of limitations
> involved here:
> - Allocation of msix vectors by the virtio-pci transport. We end up
>   with shared vectors if we have more virtqueues than vcpus. Other
>   transports may or may not have similar issues, but essentially, this
>   is something that applies to all kind of virtio devices attached via
>   the virtio-pci transport.

It depends.

For virtio-net, we need to specify the number of available vectors on qemu side,
e.g.,:

-device virtio-net-pci,netdev=tapnet,mq=true,vectors=16

This parameter is specific for virtio-net.

Suppose 'queues=8' while 'vectors=16': as 2*8+1 > 16, there will be a lack of
vectors and the guest would not be able to assign one vector for each queue.

I was tortured by this a long time ago; it seems qemu minimizes the memory
allocation and the default 'vectors' is 3.

BTW, why can't we have a more consistent configuration for most qemu devices?
E.g., so far:

virtio-blk uses 'num-queues'
nvme uses 'num_queues'
virtio-net uses 'queues' for tap :)

> - The block layer limits the number of hw queues to the number of
>   vcpus. This applies only to virtio devices that interact with the
>   block layer, but regardless of the virtio transport.

Yes: virtio-blk and virtio-scsi.

> 
>> That's why I think virtio-blk should use the similar solution as nvme
>> (regardless about write_queues and poll_queues) and xen-blkfront.
> 
> Ok, the hw queues limit from above would be an argument to limit to
> #vcpus in the virtio-blk driver, regardless of the transport used. (No
> idea if th

Re: virtio-blk: should num_vqs be limited by num_possible_cpus()?

2019-03-14 Thread Dongli Zhang



On 03/14/2019 08:32 PM, Michael S. Tsirkin wrote:
> On Tue, Mar 12, 2019 at 10:22:46AM -0700, Dongli Zhang wrote:
>> I observed that there is one msix vector for config and one shared vector
>> for all queues in below qemu cmdline, when the num-queues for virtio-blk
>> is more than the number of possible cpus:
>>
>> qemu: "-smp 4" while "-device 
>> virtio-blk-pci,drive=drive-0,id=virtblk0,num-queues=6"
> 
> So why do this?

I observed this when I was testing virtio-blk and block layer.

I just assign a very large 'num-queues' to virtio-blk and keep changing the
number of vcpus in order to study blk-mq.

The num-queues for nvme (qemu) is by default 64, while it is 1 for virtio-blk.

> 
>> # cat /proc/interrupts 
>>CPU0   CPU1   CPU2   CPU3
>> ... ...
>>  24:  0  0  0  0   PCI-MSI 65536-edge  
>> virtio0-config
>>  25:  0  0  0 59   PCI-MSI 65537-edge  
>> virtio0-virtqueues
>> ... ...
>>
>>
>> However, when num-queues is the same as number of possible cpus:
>>
>> qemu: "-smp 4" while "-device 
>> virtio-blk-pci,drive=drive-0,id=virtblk0,num-queues=4"
>>
>> # cat /proc/interrupts 
>>CPU0   CPU1   CPU2   CPU3
>> ... ... 
>>  24:  0  0  0  0   PCI-MSI 65536-edge  
>> virtio0-config
>>  25:  2  0  0  0   PCI-MSI 65537-edge  
>> virtio0-req.0
>>  26:  0 35  0  0   PCI-MSI 65538-edge  
>> virtio0-req.1
>>  27:  0  0 32  0   PCI-MSI 65539-edge  
>> virtio0-req.2
>>  28:  0  0  0  0   PCI-MSI 65540-edge  
>> virtio0-req.3
>> ... ...
>>
>> In above case, there is one msix vector per queue.
>>
>>
>> This is because the max number of queues is not limited by the number of
>> possible cpus.
>>
>> By default, nvme (regardless about write_queues and poll_queues) and
>> xen-blkfront limit the number of queues with num_possible_cpus().
>>
>>
>> Is this by design on purpose, or can we fix with below?
>>
>>
>> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
>> index 4bc083b..df95ce3 100644
>> --- a/drivers/block/virtio_blk.c
>> +++ b/drivers/block/virtio_blk.c
>> @@ -513,6 +513,8 @@ static int init_vq(struct virtio_blk *vblk)
>>  if (err)
>>  num_vqs = 1;
>>  
>> +num_vqs = min(num_possible_cpus(), num_vqs);
>> +
>>  vblk->vqs = kmalloc_array(num_vqs, sizeof(*vblk->vqs), GFP_KERNEL);
>>  if (!vblk->vqs)
>>  return -ENOMEM;
>> --
>>
>>
>> PS: The same issue is applicable to virtio-scsi as well.
>>
>> Thank you very much!
>>
>> Dongli Zhang
> 
> I don't think this will address the issue if there's vcpu hotplug though.
> Because it's not about num_possible_cpus it's about the # of active VCPUs,
> right? Does block hangle CPU hotplug generally?
> We could maybe address that by switching vq to msi vector mapping in
> a cpu hotplug notifier...
> 

It looks like it is about num_possible_cpus/nr_cpu_ids for cpu hotplug.


For instance, below is when only 2 out of 6 cpus are initialized while
virtio-blk has 6 queues.

"-smp 2,maxcpus=6" and "-device
virtio-blk-pci,drive=drive0,id=disk0,num-queues=6,iothread=io1"

# cat /sys/devices/system/cpu/present
0-1
# cat /sys/devices/system/cpu/possible
0-5
# cat /proc/interrupts  | grep virtio
 24:  0  0   PCI-MSI 65536-edge  virtio0-config
 25:   1864  0   PCI-MSI 65537-edge  virtio0-req.0
 26:  0   1069   PCI-MSI 65538-edge  virtio0-req.1
 27:  0  0   PCI-MSI 65539-edge  virtio0-req.2
 28:  0  0   PCI-MSI 65540-edge  virtio0-req.3
 29:  0  0   PCI-MSI 65541-edge  virtio0-req.4
 30:  0  0   PCI-MSI 65542-edge  virtio0-req.5

6 + 1 irqs are assigned even though 4 out of 6 cpus are still offline.


Below is about the nvme emulated by qemu. While only 2 out of 6 cpus are initially
assigned, nvme has 64 queues by default.

"-smp 2,maxcpus=6" and "-device nvme,drive=drive1,serial=deadbeaf1"

# cat /sys/devices/system/cpu/present
0-1
# cat /sys/devices/system/cpu/possible
0-5
# cat /proc/interrupts | grep nvme
 31:  0 16   PCI-MSI 81920-edge  nvme0q0
 32: 35  0   PCI-MSI 81921-edge  nvme0q1
 33:  0 42   PCI-MSI 81922-edge  nvme0q2
 34:  0  0   PCI-MSI 81923-edge  nvme0q3
 35:  0  0   PCI-MSI 81924-edge  nvme0q4
 36:  0  0   PCI-MSI 81925-edge  nvme0q5
 37:  0  0   PCI-MSI 81926-edge  nvme0q6

6 io queues are assigned irqs, although only 2 cpus are online.


When only 2 out of 48 cpus are online, there are 48 hctx created by block layer.

"-smp 2,maxcpus=48" and "-device
virtio-blk-pci,drive=drive0,id=disk0,num-queues=48,iothread=io1"

# ls /sys/kernel/debug/block/vda/ | grep hctx | wc -l
4

Re: [RFC PATCH V2 0/5] vhost: accelerate metadata access through vmap()

2019-03-14 Thread Jason Wang


On 2019/3/14 6:42 PM, Michael S. Tsirkin wrote:

Which means after we fix vhost to add the flush_dcache_page after
kunmap, Parisc will get a double hit (but it also means Parisc
was the only one of those archs that needed explicit cache flushes,
where vhost worked correctly so far.. so it kind of proves your
point of giving up being the safe choice).

What double hit?  If there's no cache to flush then cache flush is
a no-op.  It's also a highly piplineable no-op because the CPU has
the L1 cache within easy reach.  The only event when flush takes a
large amount time is if we actually have dirty data to write back
to main memory.

I've heard people complaining that on some microarchitectures even
no-op cache flushes are relatively expensive.  Don't ask me why,
but if we can easily avoid double flushes we should do that.

It's still not entirely free for us.  Our internal cache line is around
32 bytes (some have 16 and some have 64) but that means we need 128
flushes for a page ... we definitely can't pipeline them all.  So I
agree duplicate flush elimination would be a small improvement.

James

I suspect we'll keep the copyXuser path around for 32 bit anyway -
right Jason?



Yes since we don't want to slow down 32bit.

Thanks



So we can also keep using that on parisc...

--


Re: virtio-blk: should num_vqs be limited by num_possible_cpus()?

2019-03-14 Thread Michael S. Tsirkin
On Tue, Mar 12, 2019 at 10:22:46AM -0700, Dongli Zhang wrote:
> I observed that there is one msix vector for config and one shared vector
> for all queues in below qemu cmdline, when the num-queues for virtio-blk
> is more than the number of possible cpus:
> 
> qemu: "-smp 4" while "-device 
> virtio-blk-pci,drive=drive-0,id=virtblk0,num-queues=6"

So why do this?

> # cat /proc/interrupts 
>CPU0   CPU1   CPU2   CPU3
> ... ...
>  24:  0  0  0  0   PCI-MSI 65536-edge  
> virtio0-config
>  25:  0  0  0 59   PCI-MSI 65537-edge  
> virtio0-virtqueues
> ... ...
> 
> 
> However, when num-queues is the same as number of possible cpus:
> 
> qemu: "-smp 4" while "-device 
> virtio-blk-pci,drive=drive-0,id=virtblk0,num-queues=4"
> 
> # cat /proc/interrupts 
>CPU0   CPU1   CPU2   CPU3
> ... ... 
>  24:  0  0  0  0   PCI-MSI 65536-edge  
> virtio0-config
>  25:  2  0  0  0   PCI-MSI 65537-edge  
> virtio0-req.0
>  26:  0 35  0  0   PCI-MSI 65538-edge  
> virtio0-req.1
>  27:  0  0 32  0   PCI-MSI 65539-edge  
> virtio0-req.2
>  28:  0  0  0  0   PCI-MSI 65540-edge  
> virtio0-req.3
> ... ...
> 
> In above case, there is one msix vector per queue.
> 
> 
> This is because the max number of queues is not limited by the number of
> possible cpus.
> 
> By default, nvme (regardless about write_queues and poll_queues) and
> xen-blkfront limit the number of queues with num_possible_cpus().
> 
> 
> Is this by design on purpose, or can we fix with below?
> 
> 
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> index 4bc083b..df95ce3 100644
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -513,6 +513,8 @@ static int init_vq(struct virtio_blk *vblk)
>   if (err)
>   num_vqs = 1;
>  
> + num_vqs = min(num_possible_cpus(), num_vqs);
> +
>   vblk->vqs = kmalloc_array(num_vqs, sizeof(*vblk->vqs), GFP_KERNEL);
>   if (!vblk->vqs)
>   return -ENOMEM;
> --
> 
> 
> PS: The same issue is applicable to virtio-scsi as well.
> 
> Thank you very much!
> 
> Dongli Zhang

I don't think this will address the issue if there's vcpu hotplug though.
Because it's not about num_possible_cpus it's about the # of active VCPUs,
right? Does block hangle CPU hotplug generally?
We could maybe address that by switching vq to msi vector mapping in
a cpu hotplug notifier...
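
A rough sketch of that hotplug-notifier direction, using the cpuhp state
machine; the rebalance helper below is made up for illustration and the actual
re-mapping work (irq affinity, blk-mq map update) is elided:

/*
 * Hypothetical sketch only: react to vcpu hotplug and re-spread the
 * vq <-> MSI-X vector mapping over the online CPUs.
 * virtblk_rebalance_vectors() does not exist; it stands in for the real work.
 */
static int virtblk_cpu_up(unsigned int cpu)
{
	virtblk_rebalance_vectors(num_online_cpus());
	return 0;
}

static int virtblk_cpu_down(unsigned int cpu)
{
	virtblk_rebalance_vectors(num_online_cpus() - 1);
	return 0;
}

static int __init virtblk_hotplug_init(void)
{
	int ret;

	/* Dynamic hotplug state; callbacks run as CPUs come and go. */
	ret = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
					"block/virtio:hotplug",
					virtblk_cpu_up, virtblk_cpu_down);
	return ret < 0 ? ret : 0;
}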

-- 
MST


Re: virtio-blk: should num_vqs be limited by num_possible_cpus()?

2019-03-14 Thread Cornelia Huck
On Thu, 14 Mar 2019 14:12:32 +0800
Dongli Zhang  wrote:

> On 3/13/19 5:39 PM, Cornelia Huck wrote:
> > On Wed, 13 Mar 2019 11:26:04 +0800
> > Dongli Zhang  wrote:
> >   
> >> On 3/13/19 1:33 AM, Cornelia Huck wrote:  
> >>> On Tue, 12 Mar 2019 10:22:46 -0700 (PDT)
> >>> Dongli Zhang  wrote:

>  Is this by design on purpose, or can we fix with below?
> 
> 
>  diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
>  index 4bc083b..df95ce3 100644
>  --- a/drivers/block/virtio_blk.c
>  +++ b/drivers/block/virtio_blk.c
>  @@ -513,6 +513,8 @@ static int init_vq(struct virtio_blk *vblk)
>   if (err)
>   num_vqs = 1;
>   
>  +num_vqs = min(num_possible_cpus(), num_vqs);
>  +
>   vblk->vqs = kmalloc_array(num_vqs, sizeof(*vblk->vqs), 
>  GFP_KERNEL);
>   if (!vblk->vqs)
>   return -ENOMEM;
> >>>
> >>> virtio-blk, however, is not pci-specific.
> >>>
> >>> If we are using the ccw transport on s390, a completely different
> >>> interrupt mechanism is in use ('floating' interrupts, which are not
> >>> per-cpu). A check like that should therefore not go into the generic
> >>> driver.
> >>> 
> >>
> >> So far there seems two options.
> >>
> >> The 1st option is to ask the qemu user to always specify "-num-queues" 
> >> with the
> >> same number of vcpus when running x86 guest with pci for virtio-blk or
> >> virtio-scsi, in order to assign a vector for each queue.  
> > 
> > That does seem like an extra burden for the user: IIUC, things work
> > even if you have too many queues, it's just not optimal. It sounds like
> > something that can be done by a management layer (e.g. libvirt), though.
> >   
> >> Or, is it fine for virtio folks to add a new hook to 'struct 
> >> virtio_config_ops'
> >> so that different platforms (e.g., pci or ccw) would use different ways to 
> >> limit
> >> the max number of queues in guest, with something like below?  
> > 
> > That sounds better, as both transports and drivers can opt-in here.
> > 
> > However, maybe it would be even better to try to come up with a better
> > strategy of allocating msix vectors in virtio-pci. More vectors in the
> > num_queues > num_cpus case, even if they still need to be shared?
> > Individual vectors for n-1 cpus and then a shared one for the remaining
> > queues?
> > 
> > It might even be device-specific: Have some low-traffic status queues
> > share a vector, and provide an individual vector for high-traffic
> > queues. Would need some device<->transport interface, obviously.
> >   
> 
> This sounds a little bit similar to multiple hctx maps?
> 
> So far, as virtio-blk only supports set->nr_maps = 1, no matter how many hw
> queues are assigned for virtio-blk, blk_mq_alloc_tag_set() would use at most
> nr_cpu_ids hw queues.
> 
> 2981 int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
> ... ...
> 3021 /*
> 3022  * There is no use for more h/w queues than cpus if we just have
> 3023  * a single map
> 3024  */
> 3025 if (set->nr_maps == 1 && set->nr_hw_queues > nr_cpu_ids)
> 3026 set->nr_hw_queues = nr_cpu_ids;
> 
> Even the block layer would limit the number of hw queues by nr_cpu_ids when
> (set->nr_maps == 1).

Correct me if I'm wrong, but there seem to be two kinds of limitations
involved here:
- Allocation of msix vectors by the virtio-pci transport. We end up
  with shared vectors if we have more virtqueues than vcpus. Other
  transports may or may not have similar issues, but essentially, this
  is something that applies to all kind of virtio devices attached via
  the virtio-pci transport.
- The block layer limits the number of hw queues to the number of
  vcpus. This applies only to virtio devices that interact with the
  block layer, but regardless of the virtio transport.

> That's why I think virtio-blk should use the similar solution as nvme
> (regardless about write_queues and poll_queues) and xen-blkfront.

Ok, the hw queues limit from above would be an argument to limit to
#vcpus in the virtio-blk driver, regardless of the transport used. (No
idea if there are better ways to deal with this, I'm not familiar with
the interface.)

For virtio devices that don't interact with the block layer and are
attached via the virtio-pci transport, it might still make sense to
revisit vector allocation.


Re: [RFC PATCH V2 0/5] vhost: accelerate metadata access through vmap()

2019-03-14 Thread Michael S. Tsirkin
On Wed, Mar 13, 2019 at 09:37:08AM -0700, James Bottomley wrote:
> On Wed, 2019-03-13 at 09:05 -0700, Christoph Hellwig wrote:
> > On Tue, Mar 12, 2019 at 01:53:37PM -0700, James Bottomley wrote:
> > > I've got to say: optimize what?  What code do we ever have in the
> > > kernel that kmap's a page and then doesn't do anything with it? You
> > > can
> > > guarantee that on kunmap the page is either referenced (needs
> > > invalidating) or updated (needs flushing). The in-kernel use of
> > > kmap is
> > > always
> > > 
> > > kmap
> > > do something with the mapped page
> > > kunmap
> > > 
> > > In a very short interval.  It seems just a simplification to make
> > > kunmap do the flush if needed rather than try to have the users
> > > remember.  The thing which makes this really simple is that on most
> > > architectures flush and invalidate is the same operation.  If you
> > > really want to optimize you can use the referenced and dirty bits
> > > on the kmapped pte to tell you what operation to do, but if your
> > > flush is your invalidate, you simply assume the data needs flushing
> > > on kunmap without checking anything.
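
A minimal sketch of the pattern being contrasted here, assuming an architecture
where flush and invalidate are the same operation (the helper name is
illustrative only):

/*
 * Illustration only.  Today each caller must remember the explicit flush;
 * the proposal above is for kunmap() itself to do it where needed.
 */
static void copy_to_highmem_page(struct page *page, const void *src, size_t len)
{
	void *dst = kmap(page);

	memcpy(dst, src, len);
	flush_kernel_dcache_page(page);	/* explicit flush the caller must remember */
	kunmap(page);			/* proposal: fold this flush into kunmap() */
}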
> > 
> > I agree that this would be a good way to simplify the API.   Now
> > we'd just need volunteers to implement this for all architectures
> > that need cache flushing and then remove the explicit flushing in
> > the callers..
> 
> Well, it's already done on parisc ...  I can help with this if we agree
> it's the best way forward.  It's really only architectures that
> implement flush_dcache_page that would need modifying.
> 
> It may also improve performance because some kmap/use/flush/kunmap
> sequences have flush_dcache_page() instead of
> flush_kernel_dcache_page() and the former is hugely expensive and
> usually unnecessary because GUP already flushed all the user aliases.
> 
> In the interests of full disclosure the reason we do it for parisc is
> because our later machines have problems even with clean aliases.  So
> on most VIPT systems, doing kmap/read/kunmap creates a fairly harmless
> clean alias.  Technically it should be invalidated, because if you
> remap the same page to the same colour you get cached stale data, but
> in practice the data is expired from the cache long before that
> happens, so the problem is almost never seen if the flush is forgotten.
>  Our problem is on the P9xxx processor: they have a L1/L2 VIPT L3 PIPT
> cache.  As the L1/L2 caches expire clean data, they place the expiring
> contents into L3, but because L3 is PIPT, the stale alias suddenly
> becomes the default for any read of the physical page because any
> update which dirtied the cache line often gets written to main memory
> and placed into the L3 as clean *before* the clean alias in L1/L2 gets
> expired, so the older clean alias replaces it.
> 
> Our only recourse is to kill all aliases with prejudice before the
> kernel loses ownership.
> 
> > > > Which means after we fix vhost to add the flush_dcache_page after
> > > > kunmap, Parisc will get a double hit (but it also means Parisc
> > > > was the only one of those archs that needed explicit cache flushes,
> > > > where vhost worked correctly so far.. so it kind of proves your
> > > > point of giving up being the safe choice).
> > > 
> > > What double hit?  If there's no cache to flush then cache flush is
> > > a no-op.  It's also a highly piplineable no-op because the CPU has
> > > the L1 cache within easy reach.  The only event when flush takes a
> > > large amount time is if we actually have dirty data to write back
> > > to main memory.
> > 
> > I've heard people complaining that on some microarchitectures even
> > no-op cache flushes are relatively expensive.  Don't ask me why,
> > but if we can easily avoid double flushes we should do that.
> 
> It's still not entirely free for us.  Our internal cache line is around
> 32 bytes (some have 16 and some have 64) but that means we need 128
> flushes for a page ... we definitely can't pipeline them all.  So I
> agree duplicate flush elimination would be a small improvement.
> 
> James

I suspect we'll keep the copyXuser path around for 32 bit anyway -
right Jason?
So we can also keep using that on parisc...

-- 
MST


Re: [RFC] vhost: select TAP if VHOST is configured

2019-03-14 Thread Michael S. Tsirkin
On Wed, Mar 13, 2019 at 10:49:32AM -0700, Stephen Hemminger wrote:
> If VHOST_NET is configured but TUN and TAP are not, then the
> kernel will build but vhost will not work correctly since it can't
> setup the necessary tap device.
> 
> A solution is to select it.
> 
> Fixes: 9a393b5d5988 ("tap: tap as an independent module")
> Signed-off-by: Stephen Hemminger 

Well vhost can also work with a SOCK_RAW socket.
QEMU userspace does not seem to use that interface
probably because we never got around to adding
offload support, but it's there and we don't know
that no one else does.


> ---
>  drivers/vhost/Kconfig | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
> index b580885243f7..a24c69598241 100644
> --- a/drivers/vhost/Kconfig
> +++ b/drivers/vhost/Kconfig
> @@ -1,7 +1,8 @@
>  config VHOST_NET
>   tristate "Host kernel accelerator for virtio net"
> - depends on NET && EVENTFD && (TUN || !TUN) && (TAP || !TAP)
> + depends on NET && EVENTFD
>   select VHOST
> + select TAP
>   ---help---
> This kernel module can be loaded in host kernel to accelerate
> guest networking with virtio_net. Not to be confused with virtio_net
> -- 
> 2.17.1