Re: [PATCH v3 kvm/queue 14/16] KVM: Handle page fault for private memory

2022-01-13 Thread Yan Zhao
hi Sean,
Sorry for the late reply. I just saw this mail in my mailbox.

On Wed, Jan 05, 2022 at 08:52:39PM +0000, Sean Christopherson wrote:
> On Wed, Jan 05, 2022, Yan Zhao wrote:
> > Sorry, maybe I didn't express it clearly.
> > 
> > As in kvm_faultin_pfn_private():
> > static bool kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> >                                     struct kvm_page_fault *fault,
> >                                     bool *is_private_pfn, int *r)
> > {
> >         int order;
> >         int mem_convert_type;
> >         struct kvm_memory_slot *slot = fault->slot;
> >         long pfn = kvm_memfd_get_pfn(slot, fault->gfn, &order);
> >         ...
> > }
> > Currently, kvm_memfd_get_pfn() is called unconditionally.
> > However, if the backend of a private memslot is not memfd but is, for
> > example, a device fd, a different xxx_get_pfn() is required here.
> 
> Ya, I've complained about this in a different thread[*].  This should
> really be something like kvm_private_fd_get_pfn(), where the underlying
> ops struct can point at any compatible backing store.
> 
> https://lore.kernel.org/all/ycumuemybxfyy...@google.com/
>
ok. 
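
For illustration, a minimal sketch of the ops-struct dispatch Sean describes; every name below (struct kvm_private_fd_ops, slot->private_ops) is hypothetical and not from the posted series:

/*
 * Hypothetical: each backing store (memfd, device fd, ...) registers an
 * ops table when its fd is bound to a private memslot, so the fault path
 * dispatches through it instead of hard-coding kvm_memfd_get_pfn().
 */
struct kvm_private_fd_ops {
        long (*get_pfn)(struct kvm_memory_slot *slot, gfn_t gfn, int *order);
        void (*put_pfn)(kvm_pfn_t pfn);
};

static long kvm_private_fd_get_pfn(struct kvm_memory_slot *slot,
                                   gfn_t gfn, int *order)
{
        /* slot->private_ops would be filled in at memslot creation time */
        return slot->private_ops->get_pfn(slot, gfn, order);
}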

> > Further, though mapped to a private gfn, it might be ok for QEMU to
> > access the device fd in an hva-based way (or call it the MMU access way,
> > e.g. read/write/mmap); it's desired that it could use the traditional way
> > to get the pfn without converting the range to a shared one.
> 
> No, this is expressly forbidden.  The backing store for a private gfn must
> not be accessible by userspace.  It's possible a backing store could
> support both, but not concurrently, and any conversion must be done without
> KVM being involved.  In other words, resolving a private gfn must either
> succeed or fail (exit to userspace); KVM cannot initiate any conversions.
>
When it comes to device passthrough via VFIO, there might be more work
related to the device fd as a backend.

First, unlike memfd, which can allocate one private fd for a set of PFNs
and one shared fd for another set of PFNs, a device fd requires opening
the same physical device twice: once for the shared fd, and once for the
private fd.

Then, for the private device fd, its ramblock now has to be allocated
with qemu_ram_alloc_from_fd() instead of the current
qemu_ram_alloc_from_ptr().
And as in VFIO this private fd is shared by several ramblocks (each
located at a different base offset), the base offsets also need to be
kept somewhere in order to call get_pfn successfully. (Previously this
info was kept in the vma through mmap(), so without mmap() a new
interface might be required; see the sketch below.)
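
To illustrate the bookkeeping gap, a hypothetical QEMU-side record (all names invented here) of the base offset each ramblock occupies within the shared private device fd, which mmap() would otherwise have carried in a vma:

#include <stdint.h>

/*
 * Hypothetical: one private device fd is shared by several ramblocks,
 * each starting at a different base offset within the fd.  Without
 * mmap() there is no vma to carry this offset, so it has to be stored
 * explicitly and passed to whatever get_pfn-style interface the private
 * fd exposes.
 */
typedef struct PrivateFdRegion {
    int      fd;          /* the private device fd, opened once */
    uint64_t base_offset; /* where this ramblock starts within the fd */
    uint64_t size;        /* length of the region */
} PrivateFdRegion;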

Also, for the shared device fd, mmap() is required in order to allocate
the ramblock with qemu_ram_alloc_from_ptr(), and more importantly to make
the future gfn_to_hva and hva_to_pfn lookups possible.
But as the shared and private fds are backed by the same physical device,
the vfio driver needs to record which vma ranges are allowed to take the
actual mmap_fault and which are not.

With the above changes, host user space is only prevented from accessing
the device ranges mapped to private GFNs.
For memory backends, host kernel access is prevented via MKTME.
And for a device, the device itself needs to do the work of disallowing
host kernel access.
However, unlike the memory side, the device side would not cause any MCE.
Thereby, host user space access to the device would not cause MCEs
either.

So, I'm not sure the above work is worthwhile for the device fd.


> > pfn = __gfn_to_pfn_memslot(slot, fault->gfn, ...)
> > |->addr = __gfn_to_hva_many (slot, gfn,...)
> > |  pfn = hva_to_pfn (addr,...)
> > 
> > 
> > So, is it possible to recognize such kinds of backends in KVM, and to
> > get the pfn in the traditional way without converting them to shared?
> > e.g.
> > - specify KVM_MEM_PRIVATE_NONPROTECT for memory regions with such kinds
> >   of backends, or
> > - detect the fd type and check if get_pfn is provided. if not, go the
> >   traditional way.
> 
> No, because the whole point of this is to make guest private memory
> inaccessible to host userspace.  Or did I misinterpret your questions?
I think the host unmap series is based on the assumption that host user
space access to memory backing private guest GFNs would cause fatal
MCEs.
So, I hope backends that will not trigger this fatal error can keep
using the traditional way to get the pfn while being mapped to private
GFNs at the same time.

Thanks
Yan



Re: [PATCH v3 kvm/queue 14/16] KVM: Handle page fault for private memory

2022-01-05 Thread Yan Zhao
On Wed, Jan 05, 2022 at 02:28:10PM +0800, Chao Peng wrote:
> On Tue, Jan 04, 2022 at 06:06:12PM +0800, Yan Zhao wrote:
> > On Tue, Jan 04, 2022 at 05:10:08PM +0800, Chao Peng wrote:
<...> 
> > Thanks. So QEMU will re-generate memslots and set KVM_MEM_PRIVATE
> > accordingly? Will it involve slot deletion and creation?
> 
> KVM will not re-generate memslots when doing the conversion; instead, it
> unmaps/maps a range on the same memslot. A memslot with the
> KVM_MEM_PRIVATE tag always has two mappings (private/shared), but at a
> time only one is effective. What conversion does is turn off the
> existing mapping and turn on the other mapping for the specified range
> in that slot.
>
got it. thanks!
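
Restating Chao's description in code form, a conceptual sketch only (field names invented):

/*
 * Conceptual: a KVM_MEM_PRIVATE slot carries both a shared (hva-based)
 * view and a private (fd-based) view; conversion flips which view is
 * effective for a gfn range instead of deleting/recreating the slot.
 */
struct private_memslot_views {
        unsigned long shared_userspace_addr; /* hva path, resolved via GUP    */
        int private_fd;                      /* fd path, resolved via get_pfn */
        /*
         * Whether a given gfn is currently private is derived from whether
         * the private fd has memory allocated at the corresponding offset.
         */
};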

<...>
> > > > > +static bool kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> > > > > + struct kvm_page_fault *fault,
> > > > > + bool *is_private_pfn, int *r)
> > > > > +{
> > > > > + int order;
> > > > > + int mem_convert_type;
> > > > > + struct kvm_memory_slot *slot = fault->slot;
> > > > > + long pfn = kvm_memfd_get_pfn(slot, fault->gfn, &order);
> > > > For private memory slots, it's possible to have pfns backed by
> > > > backends other than memfd, e.g. devicefd.
> > > 
> > > Surely yes; although this patch only supports memfd, it's designed
> > > to be extensible to support memory backing stores other than memfd. There
> > > is one assumption in this design however: one private memslot can be
> > > backed by only one type of such memory backing store, e.g. if the
> > > devicefd you mentioned can independently provide memory for a memslot
> > > then that's no issue.
> > > 
> > > > So is it possible to let those
> > > > private memslots stay private and use the traditional hva-based way?
> > > 
> > > Typically this fd-based private memory uses the 'offset' as the
> > > userspace address to get a pfn from the backing store fd. But I believe
> > > the current code does not prevent you from using the hva as the
> > By hva-based way, I mean mmap is required for this fd.
> > 
> > > userspace address, as long as your memory backing store understands that
> > > address and can provide the pfn based on it. But since you already have
> > > the hva, you probably already mmap-ed the fd to userspace; that seems
> > > not to be something this private memory patch can protect. Probably I didn't quite
> > Yes, for this fd, though mapped in a private memslot, there's no need to
> > prevent QEMU/host from accessing it, as it will not cause the severe
> > machine check.
> > 
> > > understand the 'keep private' you mentioned here.
> > 'keep private' means allowing this kind of private memslot which does not
> > require protection from this private memory patch :)
> 
> Then I think such memory can be the shared part of memory of the
> KVM_MEM_PRIVATE memslot. As said above, this is supported initially :)
>
Sorry, maybe I didn't express it clearly.

As in kvm_faultin_pfn_private():
static bool kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
                                    struct kvm_page_fault *fault,
                                    bool *is_private_pfn, int *r)
{
        int order;
        int mem_convert_type;
        struct kvm_memory_slot *slot = fault->slot;
        long pfn = kvm_memfd_get_pfn(slot, fault->gfn, &order);
        ...
}
Currently, kvm_memfd_get_pfn() is called unconditionally.
However, if the backend of a private memslot is not memfd but is, for
example, a device fd, a different xxx_get_pfn() is required here.

Further, though mapped to a private gfn, it might be ok for QEMU to
access the device fd in an hva-based way (or call it the MMU access way,
e.g. read/write/mmap); it's desired that it could use the traditional way
to get the pfn without converting the range to a shared one:
pfn = __gfn_to_pfn_memslot(slot, fault->gfn, ...)
|->addr = __gfn_to_hva_many (slot, gfn,...)
|  pfn = hva_to_pfn (addr,...)


So, is it possible to recognize such kinds of backends in KVM, and to get
the pfn in the traditional way without converting them to shared?
e.g.
- specify KVM_MEM_PRIVATE_NONPROTECT for memory regions with such kinds
  of backends, or
- detect the fd type and check if get_pfn is provided. if not, go the
  traditional way (see the sketch below).
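
To make the second option concrete, a hedged sketch (the private_ops field and both helpers are hypothetical, not from the posted series) of the dispatch being asked about:

/*
 * Hypothetical: if the backing fd provides get_pfn, treat the slot as
 * protected private memory; otherwise fall back to the traditional
 * gfn -> hva -> pfn path without forcing a shared conversion.
 */
static bool kvm_slot_has_private_get_pfn(struct kvm_memory_slot *slot)
{
        return slot->private_ops && slot->private_ops->get_pfn;
}

static kvm_pfn_t faultin_pfn_for_slot(struct kvm_memory_slot *slot,
                                      gfn_t gfn, int *order)
{
        if (kvm_slot_has_private_get_pfn(slot))
                return slot->private_ops->get_pfn(slot, gfn, order);

        /* traditional path: gfn -> hva -> pfn via get_user_pages() */
        return __gfn_to_pfn_memslot(slot, gfn, false, NULL, true, NULL, NULL);
}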

Thanks
Yan

> > > > Reasons below:
> > > > 1. only memfd is supported in this patch set.
> > > > 2. qemu/host reads/writes to those private memslots backed by
> > > > devicefd may not cause a machine check.




Re: [PATCH v3 kvm/queue 14/16] KVM: Handle page fault for private memory

2022-01-04 Thread Yan Zhao
On Tue, Jan 04, 2022 at 05:10:08PM +0800, Chao Peng wrote:
> On Tue, Jan 04, 2022 at 09:46:35AM +0800, Yan Zhao wrote:
> > On Thu, Dec 23, 2021 at 08:30:09PM +0800, Chao Peng wrote:
> > > When a page fault from the secondary page table happens in a memslot
> > > with KVM_MEM_PRIVATE while the guest is running, we need to go down
> > > different paths for private access and shared access.
> > > 
> > >   - For private access, KVM checks if the page is already allocated in
> > > the memory backend; if yes, KVM establishes the mapping, otherwise it
> > > exits to userspace to convert a shared page to a private one.
> > >
> > will this conversion be atomic or not?
> > For example, after punching a hole in a private memory slot, will KVM
> > see two notifications: one for invalidation of the whole private memory
> > slot, and one for fallocate of the remaining ranges besides the hole?
> > Or will KVM only see one invalidation notification, for the hole?
> 
> Punching a hole doesn't need to invalidate the whole memory slot. It only
> sends one invalidation notification to KVM, for the 'hole' part.
good :)

> 
> Taking shared-to-private conversion as an example: it only invalidates
> the 'hole' part (usually only a portion of the whole memory) on the
> shared fd, and then fallocates the private memory in the private fd at
> the 'hole'. The KVM invalidation notification happens when the shared
> hole gets invalidated. The establishment of the private mapping happens
> in subsequent KVM page fault handlers.
> 
> > Could you please show the QEMU code for this conversion?
> 
> See below for the QEMU-side conversion code. The above-described
> invalidation and fallocation will be two steps in this conversion. If an
> error happens in the middle, it will be propagated to kvm_run to take
> the proper action (e.g. maybe kill the guest?).
> 
> int ram_block_convert_range(RAMBlock *rb, uint64_t start, size_t length,
> bool shared_to_private)
> {
> int ret; 
> int fd_from, fd_to;
> 
> if (!rb || rb->private_fd <= 0) { 
> return -1;
> }
> 
> if (!QEMU_PTR_IS_ALIGNED(start, rb->page_size) ||
> !QEMU_PTR_IS_ALIGNED(length, rb->page_size)) {
> return -1;
> }
> 
> if (length > rb->max_length) {
> return -1;
> }
> 
> if (shared_to_private) {
> fd_from = rb->fd;
> fd_to = rb->private_fd;
> } else {
> fd_from = rb->private_fd;
> fd_to = rb->fd;
> }
> 
> ret = ram_block_discard_range_fd(rb, start, length, fd_from);
> if (ret) {
> return ret; 
> }
> 
> if (fd_to > 0) { 
> return fallocate(fd_to, 0, start, length);
> }
> 
> return 0;
> }
> 
Thanks. So QEMU will re-generate memslots and set KVM_MEM_PRIVATE
accordingly? Will it involve slot deletion and creation?

> > 
> > 
> > >   - For shared access, KVM also checks if the page is already allocated
> > > in the memory backend; if yes, it exits to userspace to convert a
> > > private page to a shared one, otherwise it's treated as traditional
> > > hva-based shared memory: KVM lets the existing code obtain a pfn with
> > > get_user_pages() and establish the mapping.
> > > 
> > > The above code assumes private memory is persistent and pre-allocated
> > > in the memory backend, so KVM can use this information as an indicator
> > > of whether a page is private or shared. The above check is then
> > > performed by calling kvm_memfd_get_pfn(), which currently is
> > > implemented as a pagecache search, but in theory it can be implemented
> > > differently (i.e. when the page is not even mapped into the host
> > > pagecache there should be some different implementation).
> > > 
> > > Signed-off-by: Yu Zhang 
> > > Signed-off-by: Chao Peng 
> > > ---
> > >  arch/x86/kvm/mmu/mmu.c | 73 --
> > >  arch/x86/kvm/mmu/paging_tmpl.h | 11 +++--
> > >  2 files changed, 77 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 2856eb662a21..fbcdf62f8281 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -2920,6 +2920,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
> > >   if (max_level == PG_LEVEL_4K)
> > >   return PG_LEVEL_4K;
> > >  
> > > + if (kvm_slot_is_private(slot))
<...>

Re: [PATCH v3 kvm/queue 14/16] KVM: Handle page fault for private memory

2022-01-03 Thread Yan Zhao
On Thu, Dec 23, 2021 at 08:30:09PM +0800, Chao Peng wrote:
> When a page fault from the secondary page table happens in a memslot
> with KVM_MEM_PRIVATE while the guest is running, we need to go down
> different paths for private access and shared access.
> 
>   - For private access, KVM checks if the page is already allocated in
> the memory backend; if yes, KVM establishes the mapping, otherwise it
> exits to userspace to convert a shared page to a private one.
>
will this conversion be atomic or not?
For example, after punching a hole in a private memory slot, will KVM
see two notifications: one for invalidation of the whole private memory
slot, and one for fallocate of the remaining ranges besides the hole?
Or will KVM only see one invalidation notification, for the hole?
Could you please show the QEMU code for this conversion?


>   - For shared access, KVM also checks if the page is already allocated
> in the memory backend; if yes, it exits to userspace to convert a
> private page to a shared one, otherwise it's treated as traditional
> hva-based shared memory: KVM lets the existing code obtain a pfn with
> get_user_pages() and establish the mapping.
> 
> The above code assumes private memory is persistent and pre-allocated
> in the memory backend, so KVM can use this information as an indicator
> of whether a page is private or shared. The above check is then
> performed by calling kvm_memfd_get_pfn(), which currently is
> implemented as a pagecache search, but in theory it can be implemented
> differently (i.e. when the page is not even mapped into the host
> pagecache there should be some different implementation).
> 
> Signed-off-by: Yu Zhang 
> Signed-off-by: Chao Peng 
> ---
>  arch/x86/kvm/mmu/mmu.c | 73 --
>  arch/x86/kvm/mmu/paging_tmpl.h | 11 +++--
>  2 files changed, 77 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 2856eb662a21..fbcdf62f8281 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2920,6 +2920,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
>   if (max_level == PG_LEVEL_4K)
>   return PG_LEVEL_4K;
>  
> + if (kvm_slot_is_private(slot))
> + return max_level;
> +
>   host_level = host_pfn_mapping_level(kvm, gfn, pfn, slot);
>   return min(host_level, max_level);
>  }
> @@ -3950,7 +3953,59 @@ static bool kvm_arch_setup_async_pf(struct kvm_vcpu 
> *vcpu, gpa_t cr2_or_gpa,
> kvm_vcpu_gfn_to_hva(vcpu, gfn), &arch);
>  }
>  
> -static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault 
> *fault, int *r)
> +static bool kvm_vcpu_is_private_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
> +{
> + /*
> +  * At this time private gfn has not been supported yet. Other patch
> +  * that enables it should change this.
> +  */
> + return false;
> +}
> +
> +static bool kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
> + struct kvm_page_fault *fault,
> + bool *is_private_pfn, int *r)
> +{
> + int order;
> + int mem_convert_type;
> + struct kvm_memory_slot *slot = fault->slot;
> > > > > + long pfn = kvm_memfd_get_pfn(slot, fault->gfn, &order);
For private memory slots, it's possible to have pfns backed by
backends other than memfd, e.g. devicefd. So is it possible to let those
private memslots stay private and use the traditional hva-based way?
Reasons below:
1. only memfd is supported in this patch set.
2. qemu/host reads/writes to those private memslots backed by devicefd may
not cause a machine check.

Thanks
Yan


> +
> + if (kvm_vcpu_is_private_gfn(vcpu, fault->addr >> PAGE_SHIFT)) {
> + if (pfn < 0)
> + mem_convert_type = KVM_EXIT_MEM_MAP_PRIVATE;
> + else {
> + fault->pfn = pfn;
> + if (slot->flags & KVM_MEM_READONLY)
> + fault->map_writable = false;
> + else
> + fault->map_writable = true;
> +
> + if (order == 0)
> + fault->max_level = PG_LEVEL_4K;
> + *is_private_pfn = true;
> + *r = RET_PF_FIXED;
> + return true;
> + }
> + } else {
> + if (pfn < 0)
> + return false;
> +
> + kvm_memfd_put_pfn(pfn);
> + mem_convert_type = KVM_EXIT_MEM_MAP_SHARED;
> + }
> +
> + vcpu->run->exit_reason = KVM_EXIT_MEMORY_ERROR;
> + vcpu->run->mem.type = mem_convert_type;
> + vcpu->run->mem.u.map.gpa = fault->gfn << PAGE_SHIFT;
> + vcpu->run->mem.u.map.size = PAGE_SIZE;
> + fault->pfn = -1;
> + *r = -1;
> + return true;
> +}
> +
> +static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault 
> *fault,
> + bool *is_private_pfn, int *r)
>  {
>
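
The commit message above says kvm_memfd_get_pfn() is currently implemented as a pagecache search. As a rough, hypothetical sketch of that idea (the slot->private_file and slot->private_offset fields are invented here; the posted series may wire this up differently):

/*
 * Sketch only: presence of a page in the backing fd's pagecache is the
 * indicator that the gfn is private in this design.
 */
long kvm_memfd_get_pfn(struct kvm_memory_slot *slot, gfn_t gfn, int *order)
{
        /* hypothetical fields recording the fd binding of this slot */
        struct address_space *mapping = slot->private_file->f_mapping;
        pgoff_t index = gfn - slot->base_gfn +
                        (slot->private_offset >> PAGE_SHIFT);
        struct page *page;

        page = find_get_page(mapping, index);   /* pagecache search */
        if (!page)
                return -ENOENT;         /* not allocated => not private */

        *order = compound_order(compound_head(page));
        return page_to_pfn(page);
}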

Re: VFIO Migration

2020-11-05 Thread Yan Zhao
On Tue, Nov 03, 2020 at 10:13:05AM -0700, Alex Williamson wrote:
> On Tue, 3 Nov 2020 11:03:24 +
> Stefan Hajnoczi  wrote:

<...>
>  
> > Management tools need to match the device model/configuration from the
> > source device against the destination device. If the destination is
> > capable of supporting the source's device model/configuration then
> > migration can proceed safely.
> > 
> > Let's look at the case where we are migrating from an older version of a
> > device to a newer version. On the source we have:
> > 
> >   model = https://vendor-a.com/my-nic
> > 
> > On the destination we have:
> > 
> >   model = https://vendor-a.com/my-nic
> >   rss = on
> > 
> > The two devices are incompatible because the destination exposes the RSS
> > feature that is not present on the source. The RSS feature involves
> > guest-visible hardware interface changes and a change to the device
> > state representation. It is not safe to migrate!
> > 
> > In this case an extra configuration step is necessary so that the
> > destination device can accept the device state from the source. The
> > management tool invokes a vendor-specific tool to put the device into
> > the right configuration:
> > 
> >   # vendor-tool set-migration-config --device 0000:00:04.0 \
> >  --model https://vendor-a.com/my-nic
> > 
> > (This tool only succeeds when the device is bound to VFIO but not yet
> > opened.)
> > 
> > The tool invokes ioctls on the vendor-specific VFIO driver that does two
> > things:
> > 1. Tells the device to present the old hardware interface without RSS
> > 2. Uses the old device state representation without RSS support
> > 
> > Does this approach fit?
> 
> 
> Should we not require that any sort of configuration like this occurs
> through sysfs?  We must be able to create an instance with a specific
> configuration without using vendor specific tools, therefore in the
> worst case we should be able to remove and recreate an instance as we
> desire without invoking vendor specific tools.  Thanks,
> 
hi Alex,
could mdevctl serve as a general configuration tool to
create/destroy/configure mdev devices?

I think the main debate previously was about an easy way for a management
tool to find and create a compatible target mdev device according to the
sysfs info of the source mdev device, is that right?
As in [1], we have simplified the method to 1:1 matching of mdev_type
on src and target. And we can further force 1:1 matching of
vendor-specific attributes (e.g. pci id) and dynamic resources
(e.g. aggregator, fps, ...), and have mdevctl create a compatible target
for management tools.

Given that management tools like openstack are still in the preliminary
stage of supporting mdev devices, could we first settle down the
compatibility sysfs protocol and treat mdevctl as the userspace tool
for now?

[1]: https://lists.gnu.org/archive/html/qemu-devel/2020-09/msg03273.html

Thanks
Yan



Re: [PATCH v28 03/17] vfio: Add save and load functions for VFIO PCI devices

2020-10-24 Thread Yan Zhao
On Sat, Oct 24, 2020 at 08:16:30AM -0600, Alex Williamson wrote:
> On Sat, 24 Oct 2020 19:53:39 +0800
> Yan Zhao  wrote:
> 
> > hi
> > when migrating VFs, the PCI_COMMAND is not properly saved, and the
> > target side hits the bug below
> > root@tester:~# [  189.360671] ++>> reset starts here: 
> > iavf_reset_task !!!
> > [  199.360798] iavf :00:04.0: Reset never finished (0)
> > [  199.380504] kernel BUG at drivers/pci/msi.c:352!
> > [  199.382957] invalid opcode:  [#1] SMP PTI
> > [  199.384855] CPU: 1 PID: 419 Comm: kworker/1:2 Tainted: G   OE
> >  5.0.0-13-generic #14-Ubuntu
> > [  199.388204] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> > rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
> > [  199.392401] Workqueue: events iavf_reset_task [iavf]
> > [  199.393586] RIP: 0010:free_msi_irqs+0x17b/0x1b0
> > [  199.394659] Code: 84 e1 fe ff ff 45 31 f6 eb 11 41 83 c6 01 44 39 73 14 
> > 0f 86 ce fe ff ff 8b 7b 10 44 01 f7 e8 3c 7a ba ff 48 83 78 70 00 74 e0 
> > <0f> 0b 49 8d b5 b0 00 00 00 e8 07 27 bb ff e9 cf fe ff ff 48 8b 78
> > [  199.399056] RSP: 0018:abd1006cfdb8 EFLAGS: 00010282
> > [  199.400302] RAX: 9e336d8a2800 RBX: 9eb006c0 RCX: 
> > 
> > [  199.402000] RDX:  RSI: 0019 RDI: 
> > baa68100
> > [  199.403168] RBP: abd1006cfde8 R08: 9e3375000248 R09: 
> > 9e3375000338
> > [  199.404343] R10:  R11: baa68108 R12: 
> > 9e3374ef12c0
> > [  199.405526] R13: 9e3374ef1000 R14:  R15: 
> > 9e3371f2d018
> > [  199.406702] FS:  () GS:9e3375b0() 
> > knlGS:
> > [  199.408027] CS:  0010 DS:  ES:  CR0: 80050033
> > [  199.408987] CR2:  CR3: 33266000 CR4: 
> > 06e0
> > [  199.410155] DR0:  DR1:  DR2: 
> > 
> > [  199.411321] DR3:  DR6: fffe0ff0 DR7: 
> > 0400
> > [  199.412437] Call Trace:
> > [  199.412750]  pci_disable_msix+0xf3/0x120
> > [  199.413227]  iavf_reset_interrupt_capability.part.40+0x19/0x40 [iavf]
> > [  199.413998]  iavf_reset_task+0x4b3/0x9d0 [iavf]
> > [  199.414544]  process_one_work+0x20f/0x410
> > [  199.415026]  worker_thread+0x34/0x400
> > [  199.415486]  kthread+0x120/0x140
> > [  199.415876]  ? process_one_work+0x410/0x410
> > [  199.416380]  ? __kthread_parkme+0x70/0x70
> > [  199.416864]  ret_from_fork+0x35/0x40
> > 
> > I fixed it with the patch below.
> > 
> > 
> > commit ad3efa0eeea7edb352294bfce35b904b8d3c759c
> > Author: Yan Zhao 
> > Date:   Sat Oct 24 19:45:01 2020 +0800
> > 
> > msix fix.
> > 
> > Signed-off-by: Yan Zhao 
> > 
> > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > index f63f15b553..92f71bf933 100644
> > --- a/hw/vfio/pci.c
> > +++ b/hw/vfio/pci.c
> > @@ -2423,8 +2423,14 @@ const VMStateDescription vmstate_vfio_pci_config = {
> >  static void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
> >  {
> >  VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> > +PCIDevice *pdev = &vdev->pdev;
> > +uint16_t pci_cmd;
> > +
> > +pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
> > +qemu_put_be16(f, pci_cmd);
> >  
> >  vmstate_save_state(f, &vmstate_vfio_pci_config, vdev, NULL);
> > +
> >  }
> >  
> >  static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
> > @@ -2432,6 +2438,10 @@ static int vfio_pci_load_config(VFIODevice 
> > *vbasedev, QEMUFile *f)
> >  VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> >  PCIDevice *pdev = &vdev->pdev;
> >  int ret;
> > +uint16_t pci_cmd;
> > +
> > +pci_cmd = qemu_get_be16(f);
> > +vfio_pci_write_config(pdev, PCI_COMMAND, pci_cmd, 2);
> >  
> >  ret = vmstate_load_state(f, &vmstate_vfio_pci_config, vdev, 1);
> >  if (ret) {
> > 
> 
> 
> We need to avoid this sort of ad-hoc stuffing of random fields into the
> config stream.  The command register is already migrated in vconfig, it
> only needs to be written through vfio:
> 
> vfio_pci_write_config(pdev, PCI_COMMAND,
> pci_get_word(pdev->config, PCI_COMMAND), 2);
> 
yes, it should work. previously we just relied on qemu to save and load
the common fields.

Thanks
Yan
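
For reference, a minimal sketch of what Alex's suggestion could look like folded into vfio_pci_load_config(), replacing the ad-hoc qemu_get_be16() field; the body is abridged and purely illustrative, not the committed fix:

static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
{
    VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
    PCIDevice *pdev = &vdev->pdev;
    int ret;

    ret = vmstate_load_state(f, &vmstate_vfio_pci_config, vdev, 1);
    if (ret) {
        return ret;
    }

    /*
     * The command register already arrived via vconfig in the vmstate
     * above; it only needs to be written back through vfio so the device
     * actually sees it.
     */
    vfio_pci_write_config(pdev, PCI_COMMAND,
                          pci_get_word(pdev->config, PCI_COMMAND), 2);

    /* ... remainder of the existing load path ... */
    return 0;
}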

> 
> 
> > On Fri, Oct 23, 20

Re: [PATCH v28 10/17] memory: Set DIRTY_MEMORY_MIGRATION when IOMMU is enabled

2020-10-24 Thread Yan Zhao
Reviewed-by: Yan Zhao 
On Fri, Oct 23, 2020 at 04:10:36PM +0530, Kirti Wankhede wrote:
> mr->ram_block is NULL when mr->is_iommu is true, so fr.dirty_log_mask
> wasn't set correctly, due to which the memory listener's log_sync doesn't
> get called.
> This patch returns the log_mask with DIRTY_MEMORY_MIGRATION set when
> the IOMMU is enabled.
> 
> Signed-off-by: Kirti Wankhede 
> ---
>  softmmu/memory.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/softmmu/memory.c b/softmmu/memory.c
> index 403ff3abc99b..94f606e9d9d9 100644
> --- a/softmmu/memory.c
> +++ b/softmmu/memory.c
> @@ -1792,7 +1792,7 @@ bool memory_region_is_ram_device(MemoryRegion *mr)
>  uint8_t memory_region_get_dirty_log_mask(MemoryRegion *mr)
>  {
>  uint8_t mask = mr->dirty_log_mask;
> -if (global_dirty_log && mr->ram_block) {
> +if (global_dirty_log && (mr->ram_block || memory_region_is_iommu(mr))) {
>  mask |= (1 << DIRTY_MEMORY_MIGRATION);
>  }
>  return mask;
> -- 
> 2.7.0
> 



Re: [PATCH v28 09/17] vfio: Add load state functions to SaveVMHandlers

2020-10-24 Thread Yan Zhao
Reviewed-by: Yan Zhao 

On Fri, Oct 23, 2020 at 04:10:35PM +0530, Kirti Wankhede wrote:
> Sequence  during _RESUMING device state:
> While data for this device is available, repeat below steps:
> a. read data_offset from where user application should write data.
> b. write data of data_size to migration region from data_offset.
> c. write data_size which indicates vendor driver that data is written in
>staging buffer.
> 
> For user, data is opaque. User should write data in the same order as
> received.
> 
> Signed-off-by: Kirti Wankhede 
> Reviewed-by: Neo Jia 
> Reviewed-by: Dr. David Alan Gilbert 
> ---
>  hw/vfio/migration.c  | 195 
> +++
>  hw/vfio/trace-events |   4 ++
>  2 files changed, 199 insertions(+)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index be9e4aba541d..240646592b39 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -257,6 +257,77 @@ static int vfio_save_buffer(QEMUFile *f, VFIODevice 
> *vbasedev, uint64_t *size)
>  return ret;
>  }
>  
> +static int vfio_load_buffer(QEMUFile *f, VFIODevice *vbasedev,
> +uint64_t data_size)
> +{
> +VFIORegion *region = &vbasedev->migration->region;
> +uint64_t data_offset = 0, size, report_size;
> +int ret;
> +
> +do {
> +ret = vfio_mig_read(vbasedev, &data_offset, sizeof(data_offset),
> +  region->fd_offset + VFIO_MIG_STRUCT_OFFSET(data_offset));
> +if (ret < 0) {
> +return ret;
> +}
> +
> +if (data_offset + data_size > region->size) {
> +/*
> + * If data_size is greater than the data section of the migration
> + * region, then iterate the write buffer operation. This case can
> + * occur if the size of the migration region at the destination is
> + * smaller than the size of the migration region at the source.
> + */
> +report_size = size = region->size - data_offset;
> +data_size -= size;
> +} else {
> +report_size = size = data_size;
> +data_size = 0;
> +}
> +
> +trace_vfio_load_state_device_data(vbasedev->name, data_offset, size);
> +
> +while (size) {
> +void *buf;
> +uint64_t sec_size;
> +bool buf_alloc = false;
> +
> +buf = get_data_section_size(region, data_offset, size, &sec_size);
> +
> +if (!buf) {
> +buf = g_try_malloc(sec_size);
> +if (!buf) {
> +error_report("%s: Error allocating buffer ", __func__);
> +return -ENOMEM;
> +}
> +buf_alloc = true;
> +}
> +
> +qemu_get_buffer(f, buf, sec_size);
> +
> +if (buf_alloc) {
> +ret = vfio_mig_write(vbasedev, buf, sec_size,
> +region->fd_offset + data_offset);
> +g_free(buf);
> +
> +if (ret < 0) {
> +return ret;
> +}
> +}
> +size -= sec_size;
> +data_offset += sec_size;
> +}
> +
> +ret = vfio_mig_write(vbasedev, &report_size, sizeof(report_size),
> +region->fd_offset + VFIO_MIG_STRUCT_OFFSET(data_size));
> +if (ret < 0) {
> +return ret;
> +}
> +} while (data_size);
> +
> +return 0;
> +}
> +
>  static int vfio_update_pending(VFIODevice *vbasedev)
>  {
>  VFIOMigration *migration = vbasedev->migration;
> @@ -293,6 +364,33 @@ static int vfio_save_device_config_state(QEMUFile *f, 
> void *opaque)
>  return qemu_file_get_error(f);
>  }
>  
> +static int vfio_load_device_config_state(QEMUFile *f, void *opaque)
> +{
> +VFIODevice *vbasedev = opaque;
> +uint64_t data;
> +
> +if (vbasedev->ops && vbasedev->ops->vfio_load_config) {
> +int ret;
> +
> +ret = vbasedev->ops->vfio_load_config(vbasedev, f);
> +if (ret) {
> +error_report("%s: Failed to load device config space",
> + vbasedev->name);
> +return ret;
> +}
> +}
> +
> +data = qemu_get_be64(f);
> +if (data != VFIO_MIG_FLAG_END_OF_STATE) {
> +error_report("%s: Failed loading device config space, "
> + "end flag incorrect 0x%"PRIx64, vbasedev->name, data);
> +return -EINVAL;
>

Re: [PATCH v28 08/17] vfio: Add save state functions to SaveVMHandlers

2020-10-24 Thread Yan Zhao
Reviewed-by: Yan Zhao 

On Fri, Oct 23, 2020 at 04:10:34PM +0530, Kirti Wankhede wrote:
> Added .save_live_pending, .save_live_iterate and .save_live_complete_precopy
> functions. These functions handle the pre-copy and stop-and-copy phases.
> 
> In _SAVING|_RUNNING device state or pre-copy phase:
> - read pending_bytes. If pending_bytes > 0, go through below steps.
> - read data_offset - indicates kernel driver to write data to staging
>   buffer.
> - read data_size - amount of data in bytes written by vendor driver in
>   migration region.
> - read data_size bytes of data from data_offset in the migration region.
> - Write data packet to file stream as below:
> {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data,
> VFIO_MIG_FLAG_END_OF_STATE }
> 
> In _SAVING device state or stop-and-copy phase
> a. read config space of device and save to migration file stream. This
>doesn't need to be from vendor driver. Any other special config state
>from driver can be saved as data in following iteration.
> b. read pending_bytes. If pending_bytes > 0, go through below steps.
> c. read data_offset - indicates kernel driver to write data to staging
>buffer.
> d. read data_size - amount of data in bytes written by vendor driver in
>migration region.
> e. read data_size bytes of data from data_offset in the migration region.
> f. Write data packet as below:
>{VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data}
> g. iterate through steps b to f while (pending_bytes > 0)
> h. Write {VFIO_MIG_FLAG_END_OF_STATE}
> 
> When the data region is mapped, it's the user's responsibility to read
> the data of data_size from data_offset before moving to the next steps.
> 
> Added fix suggested by Artem Polyakov to reset pending_bytes in
> vfio_save_iterate().
> Added fix suggested by Zhi Wang to add 0 as data size in migration stream and
> add END_OF_STATE delimiter to indicate phase complete.
> 
> Suggested-by: Artem Polyakov 
> Suggested-by: Zhi Wang 
> Signed-off-by: Kirti Wankhede 
> Reviewed-by: Neo Jia 
> ---
>  hw/vfio/migration.c   | 276 
> ++
>  hw/vfio/trace-events  |   6 +
>  include/hw/vfio/vfio-common.h |   1 +
>  3 files changed, 283 insertions(+)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 94d2bdae5c54..be9e4aba541d 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -148,6 +148,151 @@ static int vfio_migration_set_state(VFIODevice 
> *vbasedev, uint32_t mask,
>  return 0;
>  }
>  
> +static void *get_data_section_size(VFIORegion *region, uint64_t data_offset,
> +   uint64_t data_size, uint64_t *size)
> +{
> +void *ptr = NULL;
> +uint64_t limit = 0;
> +int i;
> +
> +if (!region->mmaps) {
> +if (size) {
> +*size = MIN(data_size, region->size - data_offset);
> +}
> +return ptr;
> +}
> +
> +for (i = 0; i < region->nr_mmaps; i++) {
> +VFIOMmap *map = region->mmaps + i;
> +
> +if ((data_offset >= map->offset) &&
> +(data_offset < map->offset + map->size)) {
> +
> +/* check if data_offset is within sparse mmap areas */
> +ptr = map->mmap + data_offset - map->offset;
> +if (size) {
> +*size = MIN(data_size, map->offset + map->size - data_offset);
> +}
> +break;
> +} else if ((data_offset < map->offset) &&
> +   (!limit || limit > map->offset)) {
> +/*
> + * data_offset is not within sparse mmap areas, find size of
> + * non-mapped area. Check through the whole list since the
> + * region->mmaps list is not sorted.
> + * is not sorted.
> + */
> +limit = map->offset;
> +}
> +}
> +
> +if (!ptr && size) {
> +*size = limit ? MIN(data_size, limit - data_offset) : data_size;
> +}
> +return ptr;
> +}
> +
> +static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev, uint64_t *size)
> +{
> +VFIOMigration *migration = vbasedev->migration;
> +VFIORegion *region = &migration->region;
> +uint64_t data_offset = 0, data_size = 0, sz;
> +int ret;
> +
> +ret = vfio_mig_read(vbasedev, &data_offset, sizeof(data_offset),
> +  region->fd_offset + VFIO_MIG_STRUCT_OFFSET(data_offset));
> +if (ret < 0) {
> +return ret;
> +}
> +
> +ret = vfio_mig_read(vbasedev, &data_size, sizeof(data_size),
> +region->fd_offset + VFIO_MIG_STRUCT_OFFSET(data_size));
<...>

Re: [PATCH v28 14/17] vfio: Dirty page tracking when vIOMMU is enabled

2020-10-24 Thread Yan Zhao
Reviewed-by: Yan Zhao 

On Fri, Oct 23, 2020 at 04:10:40PM +0530, Kirti Wankhede wrote:
> When vIOMMU is enabled, add MAP notifier from log_sync when all
> devices in container are in stop and copy phase of migration. Call replay
> and then from notifier callback, get dirty pages.
> 
> Suggested-by: Alex Williamson 
> Signed-off-by: Kirti Wankhede 
> ---
>  hw/vfio/common.c | 88 
> 
>  hw/vfio/trace-events |  1 +
>  2 files changed, 83 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 2634387df948..c0b5b6245a47 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -442,8 +442,8 @@ static bool 
> vfio_listener_skipped_section(MemoryRegionSection *section)
>  }
>  
>  /* Called with rcu_read_lock held.  */
> -static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr,
> -   bool *read_only)
> +static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
> +   ram_addr_t *ram_addr, bool *read_only)
>  {
>  MemoryRegion *mr;
>  hwaddr xlat;
> @@ -474,8 +474,17 @@ static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void 
> **vaddr,
>  return false;
>  }
>  
> -*vaddr = memory_region_get_ram_ptr(mr) + xlat;
> -*read_only = !writable || mr->readonly;
> +if (vaddr) {
> +*vaddr = memory_region_get_ram_ptr(mr) + xlat;
> +}
> +
> +if (ram_addr) {
> +*ram_addr = memory_region_get_ram_addr(mr) + xlat;
> +}
> +
> +if (read_only) {
> +*read_only = !writable || mr->readonly;
> +}
>  
>  return true;
>  }
> @@ -485,7 +494,6 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, 
> IOMMUTLBEntry *iotlb)
>  VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
>  VFIOContainer *container = giommu->container;
>  hwaddr iova = iotlb->iova + giommu->iommu_offset;
> -bool read_only;
>  void *vaddr;
>  int ret;
>  
> @@ -501,7 +509,9 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, 
> IOMMUTLBEntry *iotlb)
>  rcu_read_lock();
>  
>  if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
> -if (!vfio_get_vaddr(iotlb, &vaddr, &read_only)) {
> +bool read_only;
> +
> +if (!vfio_get_xlat_addr(iotlb, &vaddr, NULL, &read_only)) {
>  goto out;
>  }
>  /*
> @@ -899,11 +909,77 @@ err_out:
>  return ret;
>  }
>  
> +typedef struct {
> +IOMMUNotifier n;
> +VFIOGuestIOMMU *giommu;
> +} vfio_giommu_dirty_notifier;
> +
> +static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry 
> *iotlb)
> +{
> +vfio_giommu_dirty_notifier *gdn = container_of(n,
> +vfio_giommu_dirty_notifier, n);
> +VFIOGuestIOMMU *giommu = gdn->giommu;
> +VFIOContainer *container = giommu->container;
> +hwaddr iova = iotlb->iova + giommu->iommu_offset;
> +ram_addr_t translated_addr;
> +
> +trace_vfio_iommu_map_dirty_notify(iova, iova + iotlb->addr_mask);
> +
> +if (iotlb->target_as != _space_memory) {
> +error_report("Wrong target AS \"%s\", only system memory is allowed",
> + iotlb->target_as->name ? iotlb->target_as->name : 
> "none");
> +return;
> +}
> +
> +rcu_read_lock();
> +if (vfio_get_xlat_addr(iotlb, NULL, &translated_addr, NULL)) {
> +int ret;
> +
> +ret = vfio_get_dirty_bitmap(container, iova, iotlb->addr_mask + 1,
> +translated_addr);
> +if (ret) {
> +error_report("vfio_iommu_map_dirty_notify(%p, 0x%"HWADDR_PRIx", "
> + "0x%"HWADDR_PRIx") = %d (%m)",
> + container, iova,
> + iotlb->addr_mask + 1, ret);
> +}
> +}
> +rcu_read_unlock();
> +}
> +
>  static int vfio_sync_dirty_bitmap(VFIOContainer *container,
>MemoryRegionSection *section)
>  {
>  ram_addr_t ram_addr;
>  
> +if (memory_region_is_iommu(section->mr)) {
> +VFIOGuestIOMMU *giommu;
> +
> +QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
> +if (MEMORY_REGION(giommu->iommu) == section->mr &&
> +giommu->n.start == section->offset_within_region) {
> +Int128 llend;
> +vfio_giommu_dirty_notifier gdn = { .giommu = giommu };
<...>

Re: [PATCH v28 03/17] vfio: Add save and load functions for VFIO PCI devices

2020-10-24 Thread Yan Zhao
hi
when migrating VFs, the PCI_COMMAND is not properly saved, and the
target side hits the bug below
root@tester:~# [  189.360671] ++>> reset starts here: iavf_reset_task 
!!!
[  199.360798] iavf :00:04.0: Reset never finished (0)
[  199.380504] kernel BUG at drivers/pci/msi.c:352!
[  199.382957] invalid opcode:  [#1] SMP PTI
[  199.384855] CPU: 1 PID: 419 Comm: kworker/1:2 Tainted: G   OE 
5.0.0-13-generic #14-Ubuntu
[  199.388204] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[  199.392401] Workqueue: events iavf_reset_task [iavf]
[  199.393586] RIP: 0010:free_msi_irqs+0x17b/0x1b0
[  199.394659] Code: 84 e1 fe ff ff 45 31 f6 eb 11 41 83 c6 01 44 39 73 14 0f 
86 ce fe ff ff 8b 7b 10 44 01 f7 e8 3c 7a ba ff 48 83 78 70 00 74 e0 <0f> 0b 49 
8d b5 b0 00 00 00 e8 07 27 bb ff e9 cf fe ff ff 48 8b 78
[  199.399056] RSP: 0018:abd1006cfdb8 EFLAGS: 00010282
[  199.400302] RAX: 9e336d8a2800 RBX: 9eb006c0 RCX: 
[  199.402000] RDX:  RSI: 0019 RDI: baa68100
[  199.403168] RBP: abd1006cfde8 R08: 9e3375000248 R09: 9e3375000338
[  199.404343] R10:  R11: baa68108 R12: 9e3374ef12c0
[  199.405526] R13: 9e3374ef1000 R14:  R15: 9e3371f2d018
[  199.406702] FS:  () GS:9e3375b0() 
knlGS:
[  199.408027] CS:  0010 DS:  ES:  CR0: 80050033
[  199.408987] CR2:  CR3: 33266000 CR4: 06e0
[  199.410155] DR0:  DR1:  DR2: 
[  199.411321] DR3:  DR6: fffe0ff0 DR7: 0400
[  199.412437] Call Trace:
[  199.412750]  pci_disable_msix+0xf3/0x120
[  199.413227]  iavf_reset_interrupt_capability.part.40+0x19/0x40 [iavf]
[  199.413998]  iavf_reset_task+0x4b3/0x9d0 [iavf]
[  199.414544]  process_one_work+0x20f/0x410
[  199.415026]  worker_thread+0x34/0x400
[  199.415486]  kthread+0x120/0x140
[  199.415876]  ? process_one_work+0x410/0x410
[  199.416380]  ? __kthread_parkme+0x70/0x70
[  199.416864]  ret_from_fork+0x35/0x40

I fixed it with the patch below.


commit ad3efa0eeea7edb352294bfce35b904b8d3c759c
Author: Yan Zhao 
Date:   Sat Oct 24 19:45:01 2020 +0800

msix fix.

Signed-off-by: Yan Zhao 

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index f63f15b553..92f71bf933 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2423,8 +2423,14 @@ const VMStateDescription vmstate_vfio_pci_config = {
 static void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
 {
 VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
+PCIDevice *pdev = &vdev->pdev;
+uint16_t pci_cmd;
+
+pci_cmd = pci_default_read_config(pdev, PCI_COMMAND, 2);
+qemu_put_be16(f, pci_cmd);
 
 vmstate_save_state(f, &vmstate_vfio_pci_config, vdev, NULL);
+
 }
 
 static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
@@ -2432,6 +2438,10 @@ static int vfio_pci_load_config(VFIODevice *vbasedev, 
QEMUFile *f)
 VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
 PCIDevice *pdev = &vdev->pdev;
 int ret;
+uint16_t pci_cmd;
+
+pci_cmd = qemu_get_be16(f);
+vfio_pci_write_config(pdev, PCI_COMMAND, pci_cmd, 2);
 
 ret = vmstate_load_state(f, &vmstate_vfio_pci_config, vdev, 1);
 if (ret) {


On Fri, Oct 23, 2020 at 04:10:29PM +0530, Kirti Wankhede wrote:
> Added functions to save and restore PCI device specific data,
> specifically config space of PCI device.
> 
> Signed-off-by: Kirti Wankhede 
> Reviewed-by: Neo Jia 
> ---
>  hw/vfio/pci.c | 48 
> +++
>  include/hw/vfio/vfio-common.h |  2 ++
>  2 files changed, 50 insertions(+)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index bffd5bfe3b78..92cc25a5489f 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -41,6 +41,7 @@
>  #include "trace.h"
>  #include "qapi/error.h"
>  #include "migration/blocker.h"
> +#include "migration/qemu-file.h"
>  
>  #define TYPE_VFIO_PCI_NOHOTPLUG "vfio-pci-nohotplug"
>  
> @@ -2401,11 +2402,58 @@ static Object *vfio_pci_get_object(VFIODevice 
> *vbasedev)
>  return OBJECT(vdev);
>  }
>  
> +static bool vfio_msix_present(void *opaque, int version_id)
> +{
> +PCIDevice *pdev = opaque;
> +
> +return msix_present(pdev);
> +}
> +
> +const VMStateDescription vmstate_vfio_pci_config = {
> +.name = "VFIOPCIDevice",
> +.version_id = 1,
> +.minimum_version_id = 1,
> +.fields = (VMStateField[]) {
> +VMSTATE_PCI_DEVICE(pdev, VFIOPCIDevice),
> +VMSTATE_MSIX_TEST(pdev, VFIOPCIDevice, vfio_msix_present),
> +VMSTATE_END_OF_LIST

Re: [PATCH v28 07/17] vfio: Register SaveVMHandlers for VFIO device

2020-10-24 Thread Yan Zhao
On Fri, Oct 23, 2020 at 04:10:33PM +0530, Kirti Wankhede wrote:
> Define flags to be used as delimiter in migration stream for VFIO devices.
> Added .save_setup and .save_cleanup functions. Map & unmap migration
> region from these functions at source during saving or pre-copy phase.
> 
> Set VFIO device state depending on VM's state. During live migration, VM is
> running when .save_setup is called, _SAVING | _RUNNING state is set for VFIO
> device. During save-restore, VM is paused, _SAVING state is set for VFIO 
> device.
> 
> Signed-off-by: Kirti Wankhede 
> Reviewed-by: Neo Jia 
> ---
>  hw/vfio/migration.c  | 102 
> +++
>  hw/vfio/trace-events |   2 +
>  2 files changed, 104 insertions(+)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index a0f0e79b9b73..94d2bdae5c54 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -8,12 +8,15 @@
>   */
>  
>  #include "qemu/osdep.h"
> +#include "qemu/main-loop.h"
> +#include "qemu/cutils.h"
>  #include 
>  
>  #include "sysemu/runstate.h"
>  #include "hw/vfio/vfio-common.h"
>  #include "cpu.h"
>  #include "migration/migration.h"
> +#include "migration/vmstate.h"
>  #include "migration/qemu-file.h"
>  #include "migration/register.h"
>  #include "migration/blocker.h"
> @@ -25,6 +28,22 @@
>  #include "trace.h"
>  #include "hw/hw.h"
>  
> +/*
> + * Flags to be used as unique delimiters for VFIO devices in the migration
> + * stream. These flags are composed as:
> + * 0xffffffff => MSB 32-bit all 1s
> + * 0xef10     => Magic ID, represents emulated (virtual) function IO
> + * 0x0000     => 16-bits reserved for flags
> + *
> + * The beginning of state information is marked by _DEV_CONFIG_STATE,
> + * _DEV_SETUP_STATE, or _DEV_DATA_STATE, respectively. The end of a
> + * certain state information is marked by _END_OF_STATE.
> + */
> +#define VFIO_MIG_FLAG_END_OF_STATE  (0xffffffffef100001ULL)
> +#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
> +#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
> +#define VFIO_MIG_FLAG_DEV_DATA_STATE(0xffffffffef100004ULL)
> +
>  static inline int vfio_mig_access(VFIODevice *vbasedev, void *val, int count,
>off_t off, bool iswrite)
>  {
> @@ -129,6 +148,75 @@ static int vfio_migration_set_state(VFIODevice 
> *vbasedev, uint32_t mask,
>  return 0;
>  }
>  
> +static void vfio_migration_cleanup(VFIODevice *vbasedev)
> +{
> +VFIOMigration *migration = vbasedev->migration;
> +
> +if (migration->region.mmaps) {
> +vfio_region_unmap(>region);
> +}
> +}
> +
> +/* -- */
> +
> +static int vfio_save_setup(QEMUFile *f, void *opaque)
> +{
> +VFIODevice *vbasedev = opaque;
> +VFIOMigration *migration = vbasedev->migration;
> +int ret;
> +
> +trace_vfio_save_setup(vbasedev->name);
> +
> +qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
> +
> +if (migration->region.mmaps) {
> +/*
> + * Calling vfio_region_mmap() from migration thread. Memory API 
> called
> + * from this function require locking the iothread when called from
> + * outside the main loop thread.
> + */
> +qemu_mutex_lock_iothread();
> +ret = vfio_region_mmap(>region);
> +qemu_mutex_unlock_iothread();
> +if (ret) {
> +error_report("%s: Failed to mmap VFIO migration region: %s",
> + vbasedev->name, strerror(-ret));
> +error_report("%s: Falling back to slow path", vbasedev->name);
> +}
> +}
> +
> +ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_MASK,
> +       VFIO_DEVICE_STATE_SAVING);
> +if (ret) {
> +error_report("%s: Failed to set state SAVING", vbasedev->name);
> +return ret;
> +}
> +

is it possible to call vfio_update_pending() and vfio_save_buffer() here?
so that the vendor driver has a chance to put a compatibility-checking
string early, in the save_setup stage, and can avoid emitting the string
in both the precopy iteration stage and the stop-and-copy stage.

But I think it's ok if we agree to add this later.
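
For concreteness, a rough sketch (assumption-laden, not from the posted series) of what that early call could look like at the end of vfio_save_setup(), before the END_OF_STATE marker is written:

/*
 * Hypothetical addition to vfio_save_setup(): give the vendor driver one
 * chance to emit an early data packet (e.g. a compatibility-check string)
 * during setup, reusing the helpers this series already defines.
 */
ret = vfio_update_pending(vbasedev);
if (!ret && migration->pending_bytes) {
    uint64_t size;

    ret = vfio_save_buffer(f, vbasedev, &size);
}
if (ret < 0) {
    error_report("%s: Failed to save early setup data", vbasedev->name);
    return ret;
}

qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);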

Besides that,
Reviewed-by: Yan Zhao 

> +qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +ret = qemu_file_get_error(f);
> +if (ret) {
> +return ret;
> +}
> 

Re: device compatibility interface for live migration with assigned devices

2020-09-10 Thread Yan Zhao
On Thu, Sep 10, 2020 at 12:02:44PM -0600, Alex Williamson wrote:
> On Thu, 10 Sep 2020 13:50:11 +0100
> Sean Mooney  wrote:
> 
> > On Thu, 2020-09-10 at 14:38 +0200, Cornelia Huck wrote:
> > > On Wed, 9 Sep 2020 10:13:09 +0800
> > > Yan Zhao  wrote:
> > >   
> > > > > > still, I'd like to put it more explicitly to make sure it's not
> > > > > > missed:
> > > > > > the reason we want to specify compatible_type as a trait and check
> > > > > > whether the target compatible_type is a superset of the source
> > > > > > compatible_type is for the consideration of backward compatibility.
> > > > > > e.g.
> > > > > > an old-generation device may have mdev type xxx-v4-yyy, while a
> > > > > > newer-generation device may be of mdev type xxx-v5-yyy.
> > > > > > with the compatible_type traits, the old-generation device can still
> > > > > > be regarded as compatible with the newer-generation device even if
> > > > > > their mdev types are not equal.
> > > > > 
> > > > > If you want to support migration from v4 to v5, can't the (presumably
> > > > > newer) driver that supports v5 simply register the v4 type as well, so
> > > > > that the mdev can be created as v4? (Just like QEMU versioned machine
> > > > > types work.)
> > > > 
> > > > yes, it should work in some conditions.
> > > > but it may not be that good in some cases when v5 and v4 in the name 
> > > > string
> > > > of mdev type identify hardware generation (e.g. v4 for gen8, and v5 for
> > > > gen9)
> > > > 
> > > > e.g.
> > > > (1). when src mdev type is v4 and target mdev type is v5 as
> > > > software does not support it initially, and v4 and v5 identify hardware
> > > > differences.  
> > > 
> > > My first hunch here is: Don't introduce types that may be compatible
> > > later. Either make them compatible, or make them distinct by design,
> > > and possibly add a different, compatible type later.
> > >   
> > > > then after software upgrade, v5 is now compatible to v4, should the
> > > > software now downgrade mdev type from v5 to v4?
> > > > not sure if moving hardware generation info into a separate attribute
> > > > from mdev type name is better. e.g. remove v4, v5 in mdev type, while 
> > > > use
> > > > compatible_pci_ids to identify compatibility.  
> > > 
> > > If the generations are compatible, don't mention it in the mdev type.
> > > If they aren't, use distinct types, so that management software doesn't
> > > have to guess. At least that would be my naive approach here.  
> > yep that is what i would prefer to see too.
> > >   
> > > > 
> > > > (2) name string of mdev type is composed by "driver_name + type_name".
> > > > in some devices, e.g. qat, different generations of devices are binding 
> > > > to
> > > > drivers of different names, e.g. "qat-v4", "qat-v5".
> > > > then though type_name is equal, mdev type is not equal. e.g.
> > > > "qat-v4-type1", "qat-v5-type1".  
> > > 
> > > I guess that shows a shortcoming of that "driver_name + type_name"
> > > approach? Or maybe I'm just confused.  
> > yes i really don't like having the version in the mdev-type name.
> > i would strongly prefer just qat-type-1, where qat is just there as a way
> > of namespacing.
> > although symmetric-crypto, asymmetric-crypto and compression would be
> > better names than type-1, type-2, type-3 if that is what they would end
> > up mapping to. e.g. qat-compression or qat-aes is a much better name
> > than type-1.
> > higher layers of software are unlikely to parse the mdev names but as a
> > human looking at them it's much easier to understand if the names are
> > meaningful. the qat prefix i think is important however to make sure
> > that your mdev-types don't collide with other vendors' mdev types. so i
> > would encourage all vendors to prefix their mdev types with either the
> > device name or the vendor.
> 
> +1 to all this, the mdev type is meant to indicate a software
> compatible interface, if different hardware versions can be software
> compatible, then don't make the job of finding a compatible device
> harder.  The full type is a combination of the vendor driver name plus
> the vendor provided type name specifically in order to provide a type
> namespace per vendor driver.  That's done at the mdev core level.
> Thanks,

hi Alex,
got it. so do you suggest that vendors use a consistent driver name
across generations of devices?
for qat, they create different modules for each generation. This
practice is not good if they want to support migration between devices
of different generations, right?

and am I right that we don't want to support migration between
different mdev types, even in the future?

Thanks
Yan




Re: device compatibility interface for live migration with assigned devices

2020-09-08 Thread Yan Zhao
hi All,
Per our previous discussion, there are two main concerns about the
previous proposal:
(1) it's currently hard for openstack to match mdev types.
(2) it's complicated.

so, we further propose the changes below:
(1) require two compatible mdevs to have the same mdev type for now.
(though the kernel still exposes compatible_type attributes for future use)
(2) require a 1:1 match for the other attributes under the sysfs type node
for now (those attributes are specified via compatible_<attribute>, but
with only 1 value in it.)
(3) do not match attributes under the device instance node.
rather, they are regarded as part of the resource claiming process,
so src and dest values are ensured to be 1:1.
A dynamic_resources attribute under the sysfs <type> node is added to
list the attributes under the device instance that mgt tools need to
keep 1:1 between src and dest.
the "aggregator" attribute under the device instance node is one such
attribute that needs to be listed.
Those listed attributes can actually be treated as device state set by the
vendor driver during live migration, but we still want to ask for them to
be set by mgt tools before live migration starts, in order to reduce the
chance of live migration failure.

do you like those changes?

after the changes, the sysfs interface would look like below:

  |- [parent physical device]
  |--- Vendor-specific-attributes [optional]
  |--- [mdev_supported_types]
  | |--- [<type-id>]
  | |   |--- create
  | |   |--- name
  | |   |--- available_instances
  | |   |--- device_api
  | |   |--- software_version
  | |   |--- compatible_type
  | |   |--- compatible_<device_api_attribute>
  | |   |--- compatible_<mdev_type_attribute>
  | |   |--- dynamic_resources
  | |   |--- description
  | |   |--- [devices]

- device_api: an exact match between src and dest is required.
   its value can be one of
   "vfio-pci", "vfio-platform", "vfio-amba", "vfio-ccw", "vfio-ap"
- software_version: version of the vendor driver,
   in a major.minor.bugfix scheme.
   dest major should be equal to src major,
   dest minor should be no less than src minor
   (see the sketch after this list).
   once migration-stream-related code changes, vendor
   drivers need to bump the version.
- compatible_type: not used by mgt tools currently.
   vendor drivers can provide this attribute, but need to
   know that mgt apps would ignore it.
   when in future mgt tools support this attribute, it
   would allow migration across different mdev types,
   so that devices of an older generation may be able to
   migrate to newer generations.

- compatible_<device_api_attribute>: for device-api-specific attributes,
  e.g. compatible_subchannel_type,
  dest values should be a superset of src values.
  vendor drivers can specify only one value in this attribute,
  in order to do an exact match between src and dest.
  It's ok for mgt tools to only read one value in the
  attribute so that src:dest values are 1:1.

- compatible_<mdev_type_attribute>: for mdev-type-specific attributes,
  e.g. compatible_pci_ids, compatible_chpid_type,
  dest values should be a superset of src values.
  vendor drivers can specify only one value in the attribute,
  in order to do an exact match between src and dest.
  It's ok for mgt tools to only read one value in the
  attribute so that src:dest values are 1:1.

- dynamic_resources: though defined statically under <type>,
  this attribute lists attributes under the device instance that
  need to be set as part of claiming dest resources,
  e.g. $ cat dynamic_resources: aggregator, fps, ...
  then after the dest device is created, the values of its device
  attributes need to be set to those of the src device attributes.
  Failure in syncing src device values to dest device
  values is treated the same as failing to claim
  dest resources.
  attributes under the device instance that are not listed
  in this attribute would not be part of resource checking in
  mgt tools.


Thanks
Yan



Re: device compatibility interface for live migration with assigned devices

2020-09-08 Thread Yan Zhao
> > still, I'd like to put it more explicitly to make sure it's not missed:
> > the reason we want to specify compatible_type as a trait and check
> > whether the target compatible_type is a superset of the source
> > compatible_type is for the consideration of backward compatibility.
> > e.g.
> > an old-generation device may have mdev type xxx-v4-yyy, while a newer-
> > generation device may be of mdev type xxx-v5-yyy.
> > with the compatible_type traits, the old-generation device can still
> > be regarded as compatible with the newer-generation device even if their
> > mdev types are not equal.
> 
> If you want to support migration from v4 to v5, can't the (presumably
> newer) driver that supports v5 simply register the v4 type as well, so
> that the mdev can be created as v4? (Just like QEMU versioned machine
> types work.)
yes, it should work in some conditions.
but it may not be that good in some cases, when v5 and v4 in the name
string of the mdev type identify the hardware generation (e.g. v4 for
gen8, and v5 for gen9).

e.g.
(1). when the src mdev type is v4 and the target mdev type is v5, as
software does not support it initially, and v4 and v5 identify hardware
differences.
then after a software upgrade, v5 is now compatible with v4; should the
software now downgrade the mdev type from v5 to v4?
not sure if moving hardware generation info into a separate attribute
from the mdev type name is better, e.g. remove v4, v5 from the mdev type,
and use compatible_pci_ids to identify compatibility.

(2) the name string of an mdev type is composed of "driver_name +
type_name".
in some devices, e.g. qat, different generations of devices are bound to
drivers of different names, e.g. "qat-v4", "qat-v5".
then though type_name is equal, the mdev type is not equal, e.g.
"qat-v4-type1", "qat-v5-type1".

Thanks
Yan




Re: device compatibility interface for live migration with assigned devices

2020-08-30 Thread Yan Zhao
On Fri, Aug 28, 2020 at 03:04:12PM +0100, Sean Mooney wrote:
> On Fri, 2020-08-28 at 15:47 +0200, Cornelia Huck wrote:
> > On Wed, 26 Aug 2020 14:41:17 +0800
> > Yan Zhao  wrote:
> > 
> > > previously, we wanted to regard the two mdevs created with dsa-1dwq x 30
> > > and dsa-2dwq x 15 as compatible, because the two mdevs consist of equal
> > > resources.
> > > 
> > > But, as it's a burden to the upper layer, we agree that if this condition
> > > happens, we still treat the two as incompatible.
> > > 
> > > To fix it, either the driver should expose dsa-1dwq only, or the target
> > > dsa-2dwq needs to be destroyed and reallocated via dsa-1dwq x 30.
> > 
> > AFAIU, these are mdev types, aren't they? So, basically, any management
> > software needs to take care to use the matching mdev type on the target
> > system for device creation?
> 
> or just do the simple thing of using the same mdev type on the source and
> dest. matching mdev types is not necessarily trivial. we could do that,
> but we would have to do it in python rather than sql, so it would be
> slower, at least today.
> 
> we don't currently have the ability to say "the resource provider must
> have 1 of these traits", just that we must have a specific trait. this is
> a feature we have discussed a couple of times and delayed until we really
> really need it, but it's not out of the question that we could add it for
> this use case. i suspect however we would do exact match first and explore
> this later after the initial mdev migration works.

Yes, I think it's good.

still, I'd like to put it more explicitly to make sure it's not missed:
the reason we want to specify compatible_type as a trait and check
whether the target compatible_type is a superset of the source
compatible_type is backward compatibility.
e.g.
an old generation device may have an mdev type xxx-v4-yyy, while a newer
generation device may be of mdev type xxx-v5-yyy.
with the compatible_type traits, the old generation device can still
be regarded as compatible with the newer generation device even though
their mdev types are not equal.

Thanks
Yan
> by the way, i was looking at some vdpa related material today and noticed
> vdpa devices no longer use mdevs and now use a vhost chardev, so i guess
> we will need a completely separate mechanism for vdpa vs mdev migration
> as a result. that is rather unfortunate, but i guess that is life.
> > 
> 



Re: device compatibility interface for live migration with assigned devices

2020-08-30 Thread Yan Zhao
On Fri, Aug 28, 2020 at 03:47:41PM +0200, Cornelia Huck wrote:
> On Wed, 26 Aug 2020 14:41:17 +0800
> Yan Zhao  wrote:
> 
> > previously, we want to regard the two mdevs created with dsa-1dwq x 30 and
> > dsa-2dwq x 15 as compatible, because the two mdevs consist equal resources.
> > 
> > But, as it's a burden to upper layer, we agree that if this condition
> > happens, we still treat the two as incompatible.
> > 
> > To fix it, either the driver should expose dsa-1dwq only, or the target
> > dsa-2dwq needs to be destroyed and reallocated via dsa-1dwq x 30.
> 
> AFAIU, these are mdev types, aren't they? So, basically, any management
> software needs to take care to use the matching mdev type on the target
> system for device creation?
dsa-1dwq is the mdev type.
there's no dsa-2dwq yet, and I think no dsa-2dwq should be provided in
the future, according to our discussion.

GVT currently does not support the aggregator either.
how to add the aggregator attribute is currently under discussion,
and up to now it is recommended to be a vendor specific attribute.

https://lists.freedesktop.org/archives/intel-gvt-dev/2020-July/006854.html.

Thanks
Yan



Re: device compatibility interface for live migration with assigned devices

2020-08-26 Thread Yan Zhao
On Thu, Aug 20, 2020 at 02:24:26PM +0100, Sean Mooney wrote:
> On Thu, 2020-08-20 at 14:27 +0800, Yan Zhao wrote:
> > On Thu, Aug 20, 2020 at 06:16:28AM +0100, Sean Mooney wrote:
> > > On Thu, 2020-08-20 at 12:01 +0800, Yan Zhao wrote:
> > > > On Thu, Aug 20, 2020 at 02:29:07AM +0100, Sean Mooney wrote:
> > > > > On Thu, 2020-08-20 at 08:39 +0800, Yan Zhao wrote:
> > > > > > On Tue, Aug 18, 2020 at 11:36:52AM +0200, Cornelia Huck wrote:
> > > > > > > On Tue, 18 Aug 2020 10:16:28 +0100
> > > > > > > Daniel P. Berrangé  wrote:
> > > > > > > 
> > > > > > > > On Tue, Aug 18, 2020 at 05:01:51PM +0800, Jason Wang wrote:
> > > > > > > > >On 2020/8/18 下午4:55, Daniel P. Berrangé wrote:
> > > > > > > > > 
> > > > > > > > >  On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote:
> > > > > > > > > 
> > > > > > > > >  On 2020/8/14 下午1:16, Yan Zhao wrote:
> > > > > > > > > 
> > > > > > > > >  On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote:
> > > > > > > > > 
> > > > > > > > >  On 2020/8/10 下午3:46, Yan Zhao wrote:  
> > > > > > > > >  we actually can also retrieve the same information through 
> > > > > > > > > sysfs, .e.g
> > > > > > > > > 
> > > > > > > > > |- [path to device]
> > > > > > > > >   |--- migration
> > > > > > > > >   | |--- self
> > > > > > > > >   | |   |---device_api
> > > > > > > > >   | |   |---mdev_type
> > > > > > > > >   | |   |---software_version
> > > > > > > > >   | |   |---device_id
> > > > > > > > >   | |   |---aggregator
> > > > > > > > >   | |--- compatible
> > > > > > > > >   | |   |---device_api
> > > > > > > > >   | |   |---mdev_type
> > > > > > > > >   | |   |---software_version
> > > > > > > > >   | |   |---device_id
> > > > > > > > >   | |   |---aggregator
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > >  Yes but:
> > > > > > > > > 
> > > > > > > > >  - You need one file per attribute (one syscall for one 
> > > > > > > > > attribute)
> > > > > > > > >  - Attribute is coupled with kobject
> > > > > > > 
> > > > > > > Is that really that bad? You have the device with an embedded 
> > > > > > > kobject
> > > > > > > anyway, and you can just put things into an attribute group?
> > > > > > > 
> > > > > > > [Also, I think that self/compatible split in the example makes 
> > > > > > > things
> > > > > > > needlessly complex. Shouldn't semantic versioning and matching 
> > > > > > > already
> > > > > > > cover nearly everything? I would expect very few cases that are 
> > > > > > > more
> > > > > > > complex than that. Maybe the aggregation stuff, but I don't think 
> > > > > > > we
> > > > > > > need that self/compatible split for that, either.]
> > > > > > 
> > > > > > Hi Cornelia,
> > > > > > 
> > > > > > The reason I want to declare compatible list of attributes is that
> > > > > > sometimes it's not a simple 1:1 matching of source attributes and 
> > > > > > target attributes
> > > > > > as I demonstrated below,
> > > > > > source mdev of (mdev_type i915-GVTg_V5_2 + aggregator 1) is 
> > > > > > compatible to
> > > > > > target mdev of (mdev_type i915-GVTg_V5_4 + aggregator 2),
> > > > > >(mdev_type i915-GVTg_V5_8 + aggregator 4)
> > > > > 
> > > > > the way you are doing the naming is still really confusing, by the
> > > > > way. if this has not already been merged in the kernel, can you
> > > > > change the mdev

Re: device compatibility interface for live migration with assigned devices

2020-08-26 Thread Yan Zhao
On Tue, Aug 25, 2020 at 04:39:25PM +0200, Cornelia Huck wrote:
<...>
> > do you think the bin_attribute I proposed yesterday good?
> > Then we can have a single compatible with a variable in the mdev_type and
> > aggregator.
> > 
> >    mdev_type=i915-GVTg_V5_{val1:int:2,4,8}
> >    aggregator={val1}/2
> 
> I'm not really a fan of binary attributes other than in cases where we
> have some kind of binary format to begin with.
> 
> IIUC, we basically have:
> - different partitioning (expressed in the mdev_type)
> - different number of partitions (expressed via the aggregator)
> - devices being compatible if the partitioning:aggregator ratio is the
>   same
> 
> (The multiple mdev_type variants seem to come from avoiding extra
> creation parameters, IIRC?)
> 
> Would it be enough to export
> base_type=i915-GVTg_V5
> aggregation_ratio=
> 
> to express the various combinations that are compatible without the
> need for multiple sets of attributes?

yes. I agree we need to decouple the mdev type name and aggregator for
compatibility detection purpose.

please allow me to say a few words about the history and motivation of
introducing the aggregator.

initially, we had fixed mdev_types
i915-GVTg_V5_1,
i915-GVTg_V5_2,
i915-GVTg_V5_4,
i915-GVTg_V5_8,
where the digits after i915-GVTg_V5 represent the max number of instances
allowed to be created for this type. They also identify how many
resources are to be allocated for each type.

So far they work well for current intel vgpus, i.e., cutting the
physical GPU into several virtual pieces and sharing them among several
VMs in a pure mediation way.
fixed types are provided in advance as we thought they could meet the
needs of most users, and users can know the hardware capability they
acquired from the type name: the bigger the number, the smaller the
piece of physical hardware.

Then, when it comes to scalable IOV in the near future, one piece of
physical hardware can be cut into a large number of units at the
hardware layer. The single unit to be assigned to a guest can be very
small, while one to several units are grouped into an mdev.

The fixed type scheme is then cumbersome.
Therefore, a new attribute, aggregator, is introduced to specify the
number of resources to be assigned based on the base resource specified
in the type name. e.g.
if the type name is dsa-1dwq, and the aggregator is 30, then the
assignable resources to the guest are 30 wqs in a single created mdev.
if the type name is dsa-2dwq, and the aggregator is 15, then the
assignable resources to the guest are also 30 wqs in a single created
mdev.
(in this example, the rule to define the type name is different from the
GVT case: here 1dwq means the wq number is 1. yes, both are the current
reality. :) )
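
as a toy sketch of the resource math above (assuming the type name
encodes the per-mdev base resources and the aggregator multiplies them;
the parsing rule below is only for this dsa example):

def assignable_wqs(type_name, aggregator):
    # "dsa-1dwq" -> base of 1 wq, "dsa-2dwq" -> base of 2 wqs
    base = int(type_name.split("-")[1].rstrip("dwq"))
    return base * aggregator

print(assignable_wqs("dsa-1dwq", 30))  # 30 wqs in a single mdev
print(assignable_wqs("dsa-2dwq", 15))  # also 30 wqs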


previously, we wanted to regard the two mdevs created with dsa-1dwq x 30
and dsa-2dwq x 15 as compatible, because the two mdevs consist of equal
resources.

But, as it's a burden to the upper layer, we agree that if this condition
happens, we still treat the two as incompatible.

To fix it, either the driver should expose dsa-1dwq only, or the target
dsa-2dwq needs to be destroyed and reallocated as dsa-1dwq x 30.

Does it make sense?

Thanks
Yan








Re: device compatibility interface for live migration with assigned devices

2020-08-20 Thread Yan Zhao
On Thu, Aug 20, 2020 at 06:16:28AM +0100, Sean Mooney wrote:
> On Thu, 2020-08-20 at 12:01 +0800, Yan Zhao wrote:
> > On Thu, Aug 20, 2020 at 02:29:07AM +0100, Sean Mooney wrote:
> > > On Thu, 2020-08-20 at 08:39 +0800, Yan Zhao wrote:
> > > > On Tue, Aug 18, 2020 at 11:36:52AM +0200, Cornelia Huck wrote:
> > > > > On Tue, 18 Aug 2020 10:16:28 +0100
> > > > > Daniel P. Berrangé  wrote:
> > > > > 
> > > > > > On Tue, Aug 18, 2020 at 05:01:51PM +0800, Jason Wang wrote:
> > > > > > >On 2020/8/18 下午4:55, Daniel P. Berrangé wrote:
> > > > > > > 
> > > > > > >  On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote:
> > > > > > > 
> > > > > > >  On 2020/8/14 下午1:16, Yan Zhao wrote:
> > > > > > > 
> > > > > > >  On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote:
> > > > > > > 
> > > > > > >  On 2020/8/10 下午3:46, Yan Zhao wrote:  
> > > > > > >  we actually can also retrieve the same information through 
> > > > > > > sysfs, .e.g
> > > > > > > 
> > > > > > > |- [path to device]
> > > > > > >   |--- migration
> > > > > > >   | |--- self
> > > > > > >   | |   |---device_api
> > > > > > >   | |   |---mdev_type
> > > > > > >   | |   |---software_version
> > > > > > >   | |   |---device_id
> > > > > > >   | |   |---aggregator
> > > > > > >   | |--- compatible
> > > > > > >   | |   |---device_api
> > > > > > >   | |   |---mdev_type
> > > > > > >   | |   |---software_version
> > > > > > >   | |   |---device_id
> > > > > > >   | |   |---aggregator
> > > > > > > 
> > > > > > > 
> > > > > > >  Yes but:
> > > > > > > 
> > > > > > >  - You need one file per attribute (one syscall for one attribute)
> > > > > > >  - Attribute is coupled with kobject
> > > > > 
> > > > > Is that really that bad? You have the device with an embedded kobject
> > > > > anyway, and you can just put things into an attribute group?
> > > > > 
> > > > > [Also, I think that self/compatible split in the example makes things
> > > > > needlessly complex. Shouldn't semantic versioning and matching already
> > > > > cover nearly everything? I would expect very few cases that are more
> > > > > complex than that. Maybe the aggregation stuff, but I don't think we
> > > > > need that self/compatible split for that, either.]
> > > > 
> > > > Hi Cornelia,
> > > > 
> > > > The reason I want to declare compatible list of attributes is that
> > > > sometimes it's not a simple 1:1 matching of source attributes and 
> > > > target attributes
> > > > as I demonstrated below,
> > > > source mdev of (mdev_type i915-GVTg_V5_2 + aggregator 1) is compatible 
> > > > to
> > > > target mdev of (mdev_type i915-GVTg_V5_4 + aggregator 2),
> > > >(mdev_type i915-GVTg_V5_8 + aggregator 4)
> > > 
> > > the way you are doing the naming is still really confusing, by the way.
> > > if this has not already been merged in the kernel, can you change the
> > > mdev so that mdev_type i915-GVTg_V5_2 is 2 of mdev_type i915-GVTg_V5_1
> > > instead of half the device?
> > > 
> > > currently you need to divide the aggregator by the number at the end
> > > of the mdev type to figure out how much of the physical device is
> > > being used, which is a very unfriendly api convention.
> > > 
> > > the way aggregators are being proposed in general is not really
> > > something i like, but i think this at least is something that we
> > > should be able to correct.
> > > 
> > > with the complexity in the mdev type name + aggregator, i suspect that
> > > this will never be supported in openstack nova directly, requiring
> > > integration via cyborg, unless we can pre-partition the device into
> > > mdevs statically and just ignore this.
> > > 
> > > this is way too vendor specific to integrate in
Re: device compatibility interface for live migration with assigned devices

2020-08-19 Thread Yan Zhao
On Thu, Aug 20, 2020 at 02:29:07AM +0100, Sean Mooney wrote:
> On Thu, 2020-08-20 at 08:39 +0800, Yan Zhao wrote:
> > On Tue, Aug 18, 2020 at 11:36:52AM +0200, Cornelia Huck wrote:
> > > On Tue, 18 Aug 2020 10:16:28 +0100
> > > Daniel P. Berrangé  wrote:
> > > 
> > > > On Tue, Aug 18, 2020 at 05:01:51PM +0800, Jason Wang wrote:
> > > > >On 2020/8/18 下午4:55, Daniel P. Berrangé wrote:
> > > > > 
> > > > >  On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote:
> > > > > 
> > > > >  On 2020/8/14 下午1:16, Yan Zhao wrote:
> > > > > 
> > > > >  On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote:
> > > > > 
> > > > >  On 2020/8/10 下午3:46, Yan Zhao wrote:  
> > > > >  we actually can also retrieve the same information through sysfs, 
> > > > > .e.g
> > > > > 
> > > > > |- [path to device]
> > > > >   |--- migration
> > > > >   | |--- self
> > > > >   | |   |---device_api
> > > > >   | |   |---mdev_type
> > > > >   | |   |---software_version
> > > > >   | |   |---device_id
> > > > >   | |   |---aggregator
> > > > >   | |--- compatible
> > > > >   | |   |---device_api
> > > > >   | |   |---mdev_type
> > > > >   | |   |---software_version
> > > > >   | |   |---device_id
> > > > >   | |   |---aggregator
> > > > > 
> > > > > 
> > > > >  Yes but:
> > > > > 
> > > > >  - You need one file per attribute (one syscall for one attribute)
> > > > >  - Attribute is coupled with kobject
> > > 
> > > Is that really that bad? You have the device with an embedded kobject
> > > anyway, and you can just put things into an attribute group?
> > > 
> > > [Also, I think that self/compatible split in the example makes things
> > > needlessly complex. Shouldn't semantic versioning and matching already
> > > cover nearly everything? I would expect very few cases that are more
> > > complex than that. Maybe the aggregation stuff, but I don't think we
> > > need that self/compatible split for that, either.]
> > 
> > Hi Cornelia,
> > 
> > The reason I want to declare compatible list of attributes is that
> > sometimes it's not a simple 1:1 matching of source attributes and target 
> > attributes
> > as I demonstrated below,
> > source mdev of (mdev_type i915-GVTg_V5_2 + aggregator 1) is compatible to
> > target mdev of (mdev_type i915-GVTg_V5_4 + aggregator 2),
> >(mdev_type i915-GVTg_V5_8 + aggregator 4)
> the way you are doing the naming is still really confusing, by the way.
> if this has not already been merged in the kernel, can you change the mdev
> so that mdev_type i915-GVTg_V5_2 is 2 of mdev_type i915-GVTg_V5_1 instead
> of half the device?
> 
> currently you need to divide the aggregator by the number at the end of
> the mdev type to figure out how much of the physical device is being
> used, which is a very unfriendly api convention.
> 
> the way aggregators are being proposed in general is not really something
> i like, but i think this at least is something that we should be able to
> correct.
> 
> with the complexity in the mdev type name + aggregator, i suspect that
> this will never be supported in openstack nova directly, requiring
> integration via cyborg, unless we can pre-partition the device into mdevs
> statically and just ignore this.
> 
> this is way too vendor specific to integrate into something like
> openstack nova unless we can guarantee that how aggregators work will be
> portable across vendors generically.
> 
> > 
> > and aggragator may be just one of such examples that 1:1 matching does not
> > fit.
> for openstack nova, i don't see us supporting anything beyond the 1:1
> case where the mdev type does not change.
>
hi Sean,
I understand it's hard for openstack, but 1:N matching is always meaningful.
e.g.
if source device 1 has cap A, it is compatible with
device 2: cap A,
device 3: cap A+B,
device 4: cap A+B+C.

to allow openstack to detect it correctly, in the compatible list of
device 2, we would say the compatible cap is A;
device 3, the compatible cap is A or A+B;
device 4, the compatible cap is A, or A+B, or A+B+C;

then if openstack finds that device 1's self cap A is contained in the
compatible cap of device 2/3/4, it can migrate device 1 to device 2, 3, or 4.
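
as a toy sketch of that containment check, with in-memory Python sets
standing in for the sysfs attributes (the device and cap names are
illustrative):

src_self = {"A"}  # device 1's self cap

dest_compatible = {
    "device2": [{"A"}],
    "device3": [{"A"}, {"A", "B"}],
    "device4": [{"A"}, {"A", "B"}, {"A", "B", "C"}],
}

# device 1 can migrate to any device whose compatible list contains
# its self caps
targets = [d for d, caps in dest_compatible.items() if src_self in caps]
print(targets)  # ['device2', 'device3', 'device4']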

conversely,  device 1's compatible cap is only 

Re: device compatibility interface for live migration with assigned devices

2020-08-19 Thread Yan Zhao
On Wed, Aug 19, 2020 at 09:22:34PM -0600, Alex Williamson wrote:
> On Thu, 20 Aug 2020 08:39:22 +0800
> Yan Zhao  wrote:
> 
> > On Tue, Aug 18, 2020 at 11:36:52AM +0200, Cornelia Huck wrote:
> > > On Tue, 18 Aug 2020 10:16:28 +0100
> > > Daniel P. Berrangé  wrote:
> > >   
> > > > On Tue, Aug 18, 2020 at 05:01:51PM +0800, Jason Wang wrote:  
> > > > >On 2020/8/18 下午4:55, Daniel P. Berrangé wrote:
> > > > > 
> > > > >  On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote:
> > > > > 
> > > > >  On 2020/8/14 下午1:16, Yan Zhao wrote:
> > > > > 
> > > > >  On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote:
> > > > > 
> > > > >  On 2020/8/10 下午3:46, Yan Zhao wrote:
> > > >   
> > > > >  we actually can also retrieve the same information through sysfs, 
> > > > > .e.g
> > > > > 
> > > > > |- [path to device]
> > > > >   |--- migration
> > > > >   | |--- self
> > > > >   | |   |---device_api
> > > > >   | |   |---mdev_type
> > > > >   | |   |---software_version
> > > > >   | |   |---device_id
> > > > >   | |   |---aggregator
> > > > >   | |--- compatible
> > > > >   | |   |---device_api
> > > > >   | |   |---mdev_type
> > > > >   | |   |---software_version
> > > > >   | |   |---device_id
> > > > >   | |   |---aggregator
> > > > > 
> > > > > 
> > > > >  Yes but:
> > > > > 
> > > > >  - You need one file per attribute (one syscall for one attribute)
> > > > >  - Attribute is coupled with kobject  
> > > 
> > > Is that really that bad? You have the device with an embedded kobject
> > > anyway, and you can just put things into an attribute group?
> > > 
> > > [Also, I think that self/compatible split in the example makes things
> > > needlessly complex. Shouldn't semantic versioning and matching already
> > > cover nearly everything? I would expect very few cases that are more
> > > complex than that. Maybe the aggregation stuff, but I don't think we
> > > need that self/compatible split for that, either.]  
> > Hi Cornelia,
> > 
> > The reason I want to declare compatible list of attributes is that
> > sometimes it's not a simple 1:1 matching of source attributes and target 
> > attributes
> > as I demonstrated below,
> > source mdev of (mdev_type i915-GVTg_V5_2 + aggregator 1) is compatible to
> > target mdev of (mdev_type i915-GVTg_V5_4 + aggregator 2),
> >(mdev_type i915-GVTg_V5_8 + aggregator 4)
> > 
> > and aggragator may be just one of such examples that 1:1 matching does not
> > fit.
> 
> If you're suggesting that we need a new 'compatible' set for every
> aggregation, haven't we lost the purpose of aggregation?  For example,
> rather than having N mdev types to represent all the possible
> aggregation values, we have a single mdev type with N compatible
> migration entries, one for each possible aggregation value.  BTW, how do
> we have multiple compatible directories?  compatible0001,
> compatible0002? Thanks,
> 
do you think the bin_attribute I proposed yesterday is good?
Then we can have a single compatible entry with a variable in the
mdev_type and aggregator.

   mdev_type=i915-GVTg_V5_{val1:int:2,4,8}
   aggregator={val1}/2

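for illustration, a toy sketch of how mgmt might expand such a template
into the concrete (mdev_type, aggregator) pairs it describes; this is
just my reading of the proposed {val...} syntax, not a settled format:

import re

def expand(mdev_tmpl, agg_tmpl):
    m = re.search(r"\{(\w+):int:([\d,]+)\}", mdev_tmpl)
    var, values = m.group(1), [int(v) for v in m.group(2).split(",")]
    pairs = []
    for v in values:
        mdev_type = re.sub(r"\{.*\}", str(v), mdev_tmpl)
        # evaluate e.g. "{val1}/2" -> v / 2 (trusted input only)
        agg = int(eval(agg_tmpl.replace("{%s}" % var, str(v))))
        pairs.append((mdev_type, agg))
    return pairs

print(expand("i915-GVTg_V5_{val1:int:2,4,8}", "{val1}/2"))
# [('i915-GVTg_V5_2', 1), ('i915-GVTg_V5_4', 2), ('i915-GVTg_V5_8', 4)]
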
Thanks
Yan



Re: device compatibility interface for live migration with assigned devices

2020-08-19 Thread Yan Zhao
On Wed, Aug 19, 2020 at 09:13:45PM -0600, Alex Williamson wrote:
> On Thu, 20 Aug 2020 08:18:10 +0800
> Yan Zhao  wrote:
> 
> > On Wed, Aug 19, 2020 at 11:50:21AM -0600, Alex Williamson wrote:
> > <...>
> > > > > > > What I care about is that we have a *standard* userspace API for
> > > > > > > performing device compatibility checking / state migration, for 
> > > > > > > use by
> > > > > > > QEMU/libvirt/ OpenStack, such that we can write code without 
> > > > > > > countless
> > > > > > > vendor specific code paths.
> > > > > > >
> > > > > > > If there is vendor specific stuff on the side, that's fine as we 
> > > > > > > can
> > > > > > > ignore that, but the core functionality for device compat / 
> > > > > > > migration
> > > > > > > needs to be standardized.
> > > > > > 
> > > > > > To summarize:
> > > > > > - choose one of sysfs or devlink
> > > > > > - have a common interface, with a standardized way to add
> > > > > >   vendor-specific attributes
> > > > > > ?
> > > > > 
> > > > > Please refer to my previous email which has more example and details. 
> > > > >
> > > > hi Parav,
> > > > the example is based on a new vdpa tool running over netlink, not based
> > > > on devlink, right?
> > > > For vfio migration compatibility, we have to deal with both mdev and 
> > > > physical
> > > > pci devices, I don't think it's a good idea to write a new tool for it, 
> > > > given
> > > > we are able to retrieve the same info from sysfs and there's already an
> > > > mdevctl from Alex (https://github.com/mdevctl/mdevctl).
> > > > 
> > > > hi All,
> > > > could we decide that sysfs is the interface that every VFIO vendor 
> > > > driver
> > > > needs to provide in order to support vfio live migration, otherwise the
> > > > userspace management tool would not list the device into the compatible
> > > > list?
> > > > 
> > > > if that's true, let's move to the standardizing of the sysfs interface.
> > > > (1) content
> > > > common part: (must)
> > > >- software_version: (in major.minor.bugfix scheme)
> > > >- device_api: vfio-pci or vfio-ccw ...
> > > >- type: mdev type for mdev device or
> > > >a signature for physical device which is a counterpart for
> > > >mdev type.
> > > > 
> > > > device api specific part: (must)
> > > >   - pci id: pci id of mdev parent device or pci id of physical pci
> > > > device (device_api is vfio-pci)  
> > > 
> > > As noted previously, the parent PCI ID should not matter for an mdev
> > > device, if a vendor has a dependency on matching the parent device PCI
> > > ID, that's a vendor specific restriction.  An mdev device can also
> > > expose a vfio-pci device API without the parent device being PCI.  For
> > > a physical PCI device, shouldn't the PCI ID be encompassed in the
> > > signature?  Thanks,
> > >   
> > you are right. I need to put the PCI ID as a vendor specific field.
> > I didn't do that because I wanted all fields in vendor specific to be
> > configurable by management tools, so they can configure the target device
> > according to the value of a vendor specific field even they don't know
> > the meaning of the field.
> > But maybe they can just ignore the field when they can't find a matching
> > writable field to configure the target.
> 
> 
> If fields can be ignored, what's the point of reporting them?  Seems
> it's no longer a requirement.  Thanks,
> 
sorry about the confusion. I mean this condition:
when about to migrate, openstack searches whether there are existing
matching MDEVs;
if yes, i.e. all common/vendor specific fields match, then it just creates
a VM with the matching target MDEV (in this condition, the PCI ID field
is not ignored);
if not, openstack tries to create an MDEV according to the mdev_type, and
configures the MDEV according to the vendor specific attributes.
as PCI ID is not a configurable field, it just ignores that field.
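
a hypothetical sketch of that two-step flow, with dicts standing in for
sysfs state; the field names and the create_mdev/writable helpers are
illustrative assumptions:

def find_or_create_target(src_fields, existing, create_mdev, writable):
    # step 1: prefer an existing MDEV where *all* fields match,
    # the PCI ID field included
    for fields in existing:
        if fields == src_fields:
            return fields
    # step 2: otherwise create one from mdev_type, then configure only
    # the fields that have a writable counterpart; non-configurable
    # fields such as pci_id are simply ignored
    target = create_mdev(src_fields["mdev_type"])
    for name, value in src_fields.items():
        if name in writable:
            target[name] = value
    return target

src = {"mdev_type": "i915-GVTg_V5_4", "aggregator": "2",
       "pci_id": "80865963"}
print(find_or_create_target(src, [], lambda t: {"mdev_type": t},
                            writable={"aggregator"}))
# {'mdev_type': 'i915-GVTg_V5_4', 'aggregator': '2'}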

Thanks
Yan

 
 



Re: device compatibility interface for live migration with assigned devices

2020-08-19 Thread Yan Zhao
On Tue, Aug 18, 2020 at 11:36:52AM +0200, Cornelia Huck wrote:
> On Tue, 18 Aug 2020 10:16:28 +0100
> Daniel P. Berrangé  wrote:
> 
> > On Tue, Aug 18, 2020 at 05:01:51PM +0800, Jason Wang wrote:
> > >On 2020/8/18 下午4:55, Daniel P. Berrangé wrote:
> > > 
> > >  On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote:
> > > 
> > >  On 2020/8/14 下午1:16, Yan Zhao wrote:
> > > 
> > >  On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote:
> > > 
> > >  On 2020/8/10 下午3:46, Yan Zhao wrote:  
> > 
> > >  we actually can also retrieve the same information through sysfs, .e.g
> > > 
> > > |- [path to device]
> > >   |--- migration
> > >   | |--- self
> > >   | |   |---device_api
> > >   | |   |---mdev_type
> > >   | |   |---software_version
> > >   | |   |---device_id
> > >   | |   |---aggregator
> > >   | |--- compatible
> > >   | |   |---device_api
> > >   | |   |---mdev_type
> > >   | |   |---software_version
> > >   | |   |---device_id
> > >   | |   |---aggregator
> > > 
> > > 
> > >  Yes but:
> > > 
> > >  - You need one file per attribute (one syscall for one attribute)
> > >  - Attribute is coupled with kobject
> 
> Is that really that bad? You have the device with an embedded kobject
> anyway, and you can just put things into an attribute group?
> 
> [Also, I think that self/compatible split in the example makes things
> needlessly complex. Shouldn't semantic versioning and matching already
> cover nearly everything? I would expect very few cases that are more
> complex than that. Maybe the aggregation stuff, but I don't think we
> need that self/compatible split for that, either.]
Hi Cornelia,

The reason I want to declare a compatible list of attributes is that
sometimes it's not a simple 1:1 matching of source attributes and target
attributes, as I demonstrated below:
a source mdev of (mdev_type i915-GVTg_V5_2 + aggregator 1) is compatible
with a target mdev of (mdev_type i915-GVTg_V5_4 + aggregator 2), or
   (mdev_type i915-GVTg_V5_8 + aggregator 4)

and the aggregator may be just one example where 1:1 matching does not
fit.

So, we explicitly list out self/compatible attributes, and management
tools only need to check whether the self attributes are contained in the
compatible attributes.

or do you mean that only the compatible list is enough, and the management
tools need to find out the self list by themselves?
But I think providing a self list is easier for management tools.

Thanks
Yan



Re: device compatibility interface for live migration with assigned devices

2020-08-19 Thread Yan Zhao
On Wed, Aug 19, 2020 at 11:50:21AM -0600, Alex Williamson wrote:
<...>
> > > > > What I care about is that we have a *standard* userspace API for
> > > > > performing device compatibility checking / state migration, for use by
> > > > > QEMU/libvirt/ OpenStack, such that we can write code without countless
> > > > > vendor specific code paths.
> > > > >
> > > > > If there is vendor specific stuff on the side, that's fine as we can
> > > > > ignore that, but the core functionality for device compat / migration
> > > > > needs to be standardized.  
> > > > 
> > > > To summarize:
> > > > - choose one of sysfs or devlink
> > > > - have a common interface, with a standardized way to add
> > > >   vendor-specific attributes
> > > > ?  
> > > 
> > > Please refer to my previous email which has more example and details.  
> > hi Parav,
> > the example is based on a new vdpa tool running over netlink, not based
> > on devlink, right?
> > For vfio migration compatibility, we have to deal with both mdev and 
> > physical
> > pci devices, I don't think it's a good idea to write a new tool for it, 
> > given
> > we are able to retrieve the same info from sysfs and there's already an
> > mdevctl from Alex (https://github.com/mdevctl/mdevctl).
> > 
> > hi All,
> > could we decide that sysfs is the interface that every VFIO vendor driver
> > needs to provide in order to support vfio live migration, otherwise the
> > userspace management tool would not list the device into the compatible
> > list?
> > 
> > if that's true, let's move to the standardizing of the sysfs interface.
> > (1) content
> > common part: (must)
> >- software_version: (in major.minor.bugfix scheme)
> >- device_api: vfio-pci or vfio-ccw ...
> >- type: mdev type for mdev device or
> >a signature for physical device which is a counterpart for
> >mdev type.
> > 
> > device api specific part: (must)
> >   - pci id: pci id of mdev parent device or pci id of physical pci
> > device (device_api is vfio-pci)
> 
> As noted previously, the parent PCI ID should not matter for an mdev
> device, if a vendor has a dependency on matching the parent device PCI
> ID, that's a vendor specific restriction.  An mdev device can also
> expose a vfio-pci device API without the parent device being PCI.  For
> a physical PCI device, shouldn't the PCI ID be encompassed in the
> signature?  Thanks,
> 
you are right. I need to put the PCI ID as a vendor specific field.
I didn't do that because I wanted all vendor specific fields to be
configurable by management tools, so they can configure the target device
according to the value of a vendor specific field even if they don't know
the meaning of the field.
But maybe they can just ignore the field when they can't find a matching
writable field to configure on the target.

Thanks
Yan


> >   - subchannel_type (device_api is vfio-ccw) 
> >  
> > vendor driver specific part: (optional)
> >   - aggregator
> >   - chpid_type
> >   - remote_url
> > 
> > NOTE: vendors are free to add attributes in this part with a
> > restriction that this attribute is able to be configured with the same
> > name in sysfs too. e.g.
> > for aggregator, there must be a sysfs attribute in device node
> > /sys/devices/pci:00/:00:02.0/882cc4da-dede-11e7-9180-078a62063ab1/intel_vgpu/aggregator,
> > so that the userspace tool is able to configure the target device
> > according to source device's aggregator attribute.
> > 
> > 
> > (2) where and structure
> > proposal 1:
> > |- [path to device]
> >   |--- migration
> >   | |--- self
> >   | |   |-software_version
> >   | |   |-device_api
> >   | |   |-type
> >   | |   |-[pci_id or subchannel_type]
> >   | |   |-...
> >   | |--- compatible
> >   | |   |-software_version
> >   | |   |-device_api
> >   | |   |-type
> >   | |   |-[pci_id or subchannel_type]
> >   | |   |-...
> > multiple compatible is allowed.
> > attributes should be ASCII text files, preferably with only one value
> > per file.
> > 
> > 
> > proposal 2: use bin_attribute.
> > |- [path to device]
> >   |--- migration
> >   | |--- self
> >   | |--- compatible
> > 
> > so we can continue use multiline format. e.g.
> > cat compatible
> >   software_version=0.1.0
> >   device_api=vfio_pci
> >   type=i915-GVTg_V5_{val1:int:1,2,4,8}
> >   pci_id=80865963
> >   aggregator={val1}/2
> > 
> > Thanks
> > Yan
> > 
> 



Re: [ovirt-devel] Re: device compatibility interface for live migration with assigned devices

2020-08-19 Thread Yan Zhao
On Wed, Aug 19, 2020 at 03:39:50PM +0800, Jason Wang wrote:
> 
> On 2020/8/19 下午2:59, Yan Zhao wrote:
> > On Wed, Aug 19, 2020 at 02:57:34PM +0800, Jason Wang wrote:
> > > On 2020/8/19 上午11:30, Yan Zhao wrote:
> > > > hi All,
> > > > could we decide that sysfs is the interface that every VFIO vendor 
> > > > driver
> > > > needs to provide in order to support vfio live migration, otherwise the
> > > > userspace management tool would not list the device into the compatible
> > > > list?
> > > > 
> > > > if that's true, let's move to the standardizing of the sysfs interface.
> > > > (1) content
> > > > common part: (must)
> > > >  - software_version: (in major.minor.bugfix scheme)
> > > 
> > > This can not work for devices whose features can be negotiated/advertised
> > > independently. (E.g virtio devices)
> > > 
> > sorry, I don't understand here, why virtio devices need to use vfio 
> > interface?
> 
> 
> I don't see any reason that virtio devices can't be used by VFIO. Do you?
> 
> Actually, virtio devices have been used by VFIO for many years:
> 
> - passthrough a hardware virtio devices to userspace(VM) drivers
> - using virtio PMD inside guest
>
So, what's different about it vs passing through physical hardware via VFIO?
even though the features are negotiated dynamically, could you explain
why that would cause software_version not to work?


> 
> > I think this thread is discussing about vfio related devices.
> > 
> > > >  - device_api: vfio-pci or vfio-ccw ...
> > > >  - type: mdev type for mdev device or
> > > >  a signature for physical device which is a counterpart for
> > > >mdev type.
> > > > 
> > > > device api specific part: (must)
> > > > - pci id: pci id of mdev parent device or pci id of physical pci
> > > >   device (device_api is vfio-pci)
> > > 
> > > So this assumes a PCI device which is probably not true.
> > > 
> > for device_api of vfio-pci, why it's not true?
> > 
> > for vfio-ccw, it's subchannel_type.
> 
> 
> Ok but having two different attributes for the same file is not good idea.
> How mgmt know there will be a 3rd type?
that's why some attributes need to be common. e.g.
device_api: it's common because mgmt needs to know whether it's a pci
device or a ccw device, and the api type is already defined in vfio.h.
(The field was agreed on, and actually suggested, by Alex in a previous
mail.)
type: mdev_type for mdev. if mgmt does not understand it, it would not
  be able to create a compatible mdev device.
software_version: mgmt can compare the major and minor versions if it
  understands this field.
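
as an illustration, a minimal sketch of that comparison under the
major.minor[.bugfix] scheme discussed in this thread (same major
required, target minor >= source minor, bugfix ignored); this is my
reading, not a settled rule:

def version_compatible(src, dst):
    s_major, s_minor = (int(x) for x in src.split(".")[:2])
    d_major, d_minor = (int(x) for x in dst.split(".")[:2])
    return s_major == d_major and d_minor >= s_minor

assert version_compatible("0.1.0", "0.2.3")      # minor upgrade: ok
assert not version_compatible("0.2.0", "0.1.9")  # minor downgrade: no
assert not version_compatible("1.0.0", "2.0.0")  # major change: no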
> 
> 
> > 
> > > > - subchannel_type (device_api is vfio-ccw)
> > > > vendor driver specific part: (optional)
> > > > - aggregator
> > > > - chpid_type
> > > > - remote_url
> > > 
> > > For "remote_url", just wonder if it's better to integrate or reuse the
> > > existing NVME management interface instead of duplicating it here. 
> > > Otherwise
> > > it could be a burden for mgmt to learn. E.g vendor A may use "remote_url"
> > > but vendor B may use a different attribute.
> > > 
> > it's vendor driver specific.
> > vendor specific attributes are inevitable, and that's why we are
> > discussing here of a way to standardizing of it.
> 
> 
> Well, then you will end up with a very long list to discuss. E.g for
> networking devices, you will have "mac", "v(x)lan" and a lot of other.
> 
> Note that "remote_url" is not vendor specific but NVME (class/subsystem)
> specific.
> 
yes, it's just NVMe specific. I added it as an example to show what is
vendor specific.
if one attribute applies across all vendors, then it's not vendor
specific; it's already a common attribute, right?

> The point is that if vendor/class specific part is unavoidable, why not
> making all of the attributes vendor specific?
>
some parts need to be common, as I listed above.

> 
> > our goal is that mgmt can use it without understanding the meaning of vendor
> > specific attributes.
> 
> 
> I'm not sure this is the correct design of uAPI. Is there something similar
> in the existing uAPIs?
> 
> And it might be hard to work for virtio devices.
> 
> 
> > 
> > > > NOTE: vendors are free to add attributes in this part with a
> > > > restriction that th

Re: [ovirt-devel] Re: device compatibility interface for live migration with assigned devices

2020-08-19 Thread Yan Zhao
On Wed, Aug 19, 2020 at 02:57:34PM +0800, Jason Wang wrote:
> 
> On 2020/8/19 上午11:30, Yan Zhao wrote:
> > hi All,
> > could we decide that sysfs is the interface that every VFIO vendor driver
> > needs to provide in order to support vfio live migration, otherwise the
> > userspace management tool would not list the device into the compatible
> > list?
> > 
> > if that's true, let's move to the standardizing of the sysfs interface.
> > (1) content
> > common part: (must)
> > - software_version: (in major.minor.bugfix scheme)
> 
> 
> This can not work for devices whose features can be negotiated/advertised
> independently. (E.g virtio devices)
>
sorry, I don't understand here: why do virtio devices need to use the vfio interface?
I think this thread is discussing about vfio related devices.

> 
> > - device_api: vfio-pci or vfio-ccw ...
> > - type: mdev type for mdev device or
> > a signature for physical device which is a counterpart for
> >mdev type.
> > 
> > device api specific part: (must)
> >- pci id: pci id of mdev parent device or pci id of physical pci
> >  device (device_api is vfio-pci)
> 
> 
> So this assumes a PCI device which is probably not true.
> 
for a device_api of vfio-pci, why is it not true?

for vfio-ccw, it's subchannel_type.

> 
> >- subchannel_type (device_api is vfio-ccw)
> > vendor driver specific part: (optional)
> >- aggregator
> >- chpid_type
> >- remote_url
> 
> 
> For "remote_url", just wonder if it's better to integrate or reuse the
> existing NVME management interface instead of duplicating it here. Otherwise
> it could be a burden for mgmt to learn. E.g vendor A may use "remote_url"
> but vendor B may use a different attribute.
> 
it's vendor driver specific.
vendor specific attributes are inevitable, and that's why we are
discussing a way of standardizing them here.
our goal is that mgmt can use them without understanding the meaning of
the vendor specific attributes.

> 
> > 
> > NOTE: vendors are free to add attributes in this part with a
> > restriction that this attribute is able to be configured with the same
> > name in sysfs too. e.g.
> 
> 
> Sysfs works well for common attributes belongs to a class, but I'm not sure
> it can work well for device/vendor specific attributes. Does this mean mgmt
> need to iterate all the attributes in both src and dst?
>
no, just the attributes under the migration directory.

> 
> > for aggregator, there must be a sysfs attribute in device node
> > /sys/devices/pci:00/:00:02.0/882cc4da-dede-11e7-9180-078a62063ab1/intel_vgpu/aggregator,
> > so that the userspace tool is able to configure the target device
> > according to source device's aggregator attribute.
> > 
> > 
> > (2) where and structure
> > proposal 1:
> > |- [path to device]
> >   |--- migration
> >   | |--- self
> >   | |   |-software_version
> >   | |   |-device_api
> >   | |   |-type
> >   | |   |-[pci_id or subchannel_type]
> >   | |   |-...
> >   | |--- compatible
> >   | |   |-software_version
> >   | |   |-device_api
> >   | |   |-type
> >   | |   |-[pci_id or subchannel_type]
> >   | |   |-...
> > multiple compatible is allowed.
> > attributes should be ASCII text files, preferably with only one value
> > per file.
> > 
> > 
> > proposal 2: use bin_attribute.
> > |- [path to device]
> >|--- migration
> >| |--- self
> >| |--- compatible
> > 
> > so we can continue use multiline format. e.g.
> > cat compatible
> >software_version=0.1.0
> >device_api=vfio_pci
> >type=i915-GVTg_V5_{val1:int:1,2,4,8}
> >pci_id=80865963
> >aggregator={val1}/2
> 
> 
> So basically two questions:
> 
> - how hard to standardize sysfs API for dealing with compatibility check (to
> make it work for most types of devices)
sorry, I just know we are in the process of standardizing it :)

> - how hard for the mgmt to learn with a vendor specific attributes (vs
> existing management API)
what is the existing management API?

Thanks



Re: device compatibility interface for live migration with assigned devices

2020-08-18 Thread Yan Zhao
On Tue, Aug 18, 2020 at 09:39:24AM +, Parav Pandit wrote:
> Hi Cornelia,
> 
> > From: Cornelia Huck 
> > Sent: Tuesday, August 18, 2020 3:07 PM
> > To: Daniel P. Berrangé 
> > Cc: Jason Wang ; Yan Zhao
> > ; k...@vger.kernel.org; libvir-l...@redhat.com;
> > qemu-devel@nongnu.org; Kirti Wankhede ;
> > eau...@redhat.com; xin-ran.w...@intel.com; cor...@lwn.net; openstack-
> > disc...@lists.openstack.org; shaohe.f...@intel.com; kevin.t...@intel.com;
> > Parav Pandit ; jian-feng.d...@intel.com;
> > dgilb...@redhat.com; zhen...@linux.intel.com; hejie...@intel.com;
> > bao.yum...@zte.com.cn; Alex Williamson ;
> > eskul...@redhat.com; smoo...@redhat.com; intel-gvt-
> > d...@lists.freedesktop.org; Jiri Pirko ;
> > dinec...@redhat.com; de...@ovirt.org
> > Subject: Re: device compatibility interface for live migration with assigned
> > devices
> > 
> > On Tue, 18 Aug 2020 10:16:28 +0100
> > Daniel P. Berrangé  wrote:
> > 
> > > On Tue, Aug 18, 2020 at 05:01:51PM +0800, Jason Wang wrote:
> > > >On 2020/8/18 下午4:55, Daniel P. Berrangé wrote:
> > > >
> > > >  On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote:
> > > >
> > > >  On 2020/8/14 下午1:16, Yan Zhao wrote:
> > > >
> > > >  On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote:
> > > >
> > > >  On 2020/8/10 下午3:46, Yan Zhao wrote:
> > >
> > > >  we actually can also retrieve the same information through sysfs,
> > > > .e.g
> > > >
> > > > |- [path to device]
> > > >   |--- migration
> > > >   | |--- self
> > > >   | |   |---device_api
> > > >   | |   |---mdev_type
> > > >   | |   |---software_version
> > > >   | |   |---device_id
> > > >   | |   |---aggregator
> > > >   | |--- compatible
> > > >   | |   |---device_api
> > > >   | |   |---mdev_type
> > > >   | |   |---software_version
> > > >   | |   |---device_id
> > > >   | |   |---aggregator
> > > >
> > > >
> > > >  Yes but:
> > > >
> > > >  - You need one file per attribute (one syscall for one attribute)
> > > >  - Attribute is coupled with kobject
> > 
> > Is that really that bad? You have the device with an embedded kobject
> > anyway, and you can just put things into an attribute group?
> > 
> > [Also, I think that self/compatible split in the example makes things
> > needlessly complex. Shouldn't semantic versioning and matching already
> > cover nearly everything? I would expect very few cases that are more
> > complex than that. Maybe the aggregation stuff, but I don't think we need
> > that self/compatible split for that, either.]
> > 
> > > >
> > > >  All of above seems unnecessary.
> > > >
> > > >  Another point, as we discussed in another thread, it's really hard
> > > > to make  sure the above API work for all types of devices and
> > > > frameworks. So having a  vendor specific API looks much better.
> > > >
> > > >  From the POV of userspace mgmt apps doing device compat checking /
> > > > migration,  we certainly do NOT want to use different vendor
> > > > specific APIs. We want to  have an API that can be used / controlled in 
> > > > a
> > standard manner across vendors.
> > > >
> > > >Yes, but it could be hard. E.g vDPA will chose to use devlink 
> > > > (there's a
> > > >long debate on sysfs vs devlink). So if we go with sysfs, at least 
> > > > two
> > > >APIs needs to be supported ...
> > >
> > > NB, I was not questioning devlink vs sysfs directly. If devlink is
> > > related to netlink, I can't say I'm enthusiastic, as IMHO sysfs is
> > > easier to deal with. I don't know enough about devlink to have much of an
> > opinion though.
> > > The key point was that I don't want the userspace APIs we need to deal
> > > with to be vendor specific.
> > 
> > From what I've seen of devlink, it seems quite nice; but I understand why
> > sysfs might be easier to deal with (especially as there's likely already a 
> > lot of
> > code using it.)
> > 
> > I understand that some users would like devlink because it is already widely
> > used for network drivers (and some others), but I don't think the majority 
> >

Re: device compatibility interface for live migration with assigned devices

2020-08-16 Thread Yan Zhao
On Fri, Aug 14, 2020 at 01:30:00PM +0100, Sean Mooney wrote:
> On Fri, 2020-08-14 at 13:16 +0800, Yan Zhao wrote:
> > On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote:
> > > 
> > > On 2020/8/10 下午3:46, Yan Zhao wrote:
> > > > > driver is it handled by?
> > > > 
> > > > It looks that the devlink is for network device specific, and in
> > > > devlink.h, it says
> > > > include/uapi/linux/devlink.h - Network physical device Netlink
> > > > interface,
> > > 
> > > 
> > > Actually not, I think there used to have some discussion last year and the
> > > conclusion is to remove this comment.
> > > 
> > > It supports IB and probably vDPA in the future.
> > > 
> > 
> > hmm... sorry, I didn't find the referred discussion. only below discussion
> > regarding to why to add devlink.
> > 
> > https://www.mail-archive.com/netdev@vger.kernel.org/msg95801.html
> > >This doesn't seem to be too much related to networking? Why can't 
> > something
> > >like this be in sysfs?
> > 
> > It is related to networking quite bit. There has been couple of
> > iteration of this, including sysfs and configfs implementations. There
> > has been a consensus reached that this should be done by netlink. I
> > believe netlink is really the best for this purpose. Sysfs is not a good
> > idea
> > 
> > https://www.mail-archive.com/netdev@vger.kernel.org/msg96102.html
> > >there is already a way to change eth/ib via
> > >echo 'eth' > /sys/bus/pci/drivers/mlx4_core/:02:00.0/mlx4_port1
> > >
> > >sounds like this is another way to achieve the same?
> > 
> > It is. However the current way is driver-specific, not correct.
> > For mlx5, we need the same, it cannot be done in this way. Do devlink is
> > the correct way to go.
> i'm not sure i agree with that.
> standardising a filesystem based api that is used across all vendors is
> also a valid option. that said, if devlink is the right choice from a
> kernel perspective, by all means use it, but i have not heard a
> convincing argument for why it is actually better.
> with that said, we have been using tools like ethtool to manage aspects
> of nics for decades, so it's not that strange an idea to use a tool and a
> binary protocol rather than a text based interface for this, but there
> are advantages to both approaches.
> >
> >
Yes, I agree with you.

> > https://lwn.net/Articles/674867/
> > There a is need for some userspace API that would allow to expose things
> > that are not directly related to any device class like net_device of
> > ib_device, but rather chip-wide/switch-ASIC-wide stuff.
> > 
> > Use cases:
> > 1) get/set of port type (Ethernet/InfiniBand)
> > 2) monitoring of hardware messages to and from chip
> > 3) setting up port splitters - split port into multiple ones and squash 
> > again,
> >enables usage of splitter cable
> > 4) setting up shared buffers - shared among multiple ports within one 
> > chip
> > 
> > 
> > 
> > we actually can also retrieve the same information through sysfs, .e.g
> > 
> > |- [path to device]
> >   |--- migration
> >   | |--- self
> >   | |   |---device_api
> >   | |   |---mdev_type
> >   | |   |---software_version
> >   | |   |---device_id
> >   | |   |---aggregator
> >   | |--- compatible
> >   | |   |---device_api
> >   | |   |---mdev_type
> >   | |   |---software_version
> >   | |   |---device_id
> >   | |   |---aggregator
> > 
> > 
> > 
> > > 
> > > >   I feel like it's not very appropriate for a GPU driver to use
> > > > this interface. Is that right?
> > > 
> > > 
> > > I think not though most of the users are switch or ethernet devices. It
> > > doesn't prevent you from inventing new abstractions.
> > 
> > so need to patch devlink core and the userspace devlink tool?
> > e.g. devlink migration
> and devlink python libs, if openstack were to use it directly.
> we do have cases where we just fork a process and execute a command in a
> shell, with or without elevated privilege, but we really don't like doing
> that due to the performance impact and security implications, so where we
> can use python bindings over c apis, we do. pyroute2 is the only python
> lib i know off the top of my head that supports devlink, so we would 

Re: device compatibility interface for live migration with assigned devices

2020-08-13 Thread Yan Zhao
On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote:
> 
> On 2020/8/10 下午3:46, Yan Zhao wrote:
> > > driver is it handled by?
> > It looks that the devlink is for network device specific, and in
> > devlink.h, it says
> > include/uapi/linux/devlink.h - Network physical device Netlink
> > interface,
> 
> 
> Actually not, I think there used to have some discussion last year and the
> conclusion is to remove this comment.
> 
> It supports IB and probably vDPA in the future.
>
hmm... sorry, I didn't find the referenced discussion, only the below
discussion regarding why devlink was added.

https://www.mail-archive.com/netdev@vger.kernel.org/msg95801.html
>This doesn't seem to be too much related to networking? Why can't 
something
>like this be in sysfs?

It is related to networking quite a bit. There have been a couple of
iterations of this, including sysfs and configfs implementations. There
has been a consensus reached that this should be done by netlink. I
believe netlink is really the best for this purpose. Sysfs is not a good
idea

https://www.mail-archive.com/netdev@vger.kernel.org/msg96102.html
>there is already a way to change eth/ib via
>echo 'eth' > /sys/bus/pci/drivers/mlx4_core/:02:00.0/mlx4_port1
>
>sounds like this is another way to achieve the same?

It is. However the current way is driver-specific, not correct.
For mlx5, we need the same; it cannot be done in this way. So devlink is
the correct way to go.

https://lwn.net/Articles/674867/
There is a need for some userspace API that would allow exposing things
that are not directly related to any device class, like the net_device of
an ib_device, but rather chip-wide/switch-ASIC-wide stuff.

Use cases:
1) get/set of port type (Ethernet/InfiniBand)
2) monitoring of hardware messages to and from chip
3) setting up port splitters - split port into multiple ones and squash
   again, enables usage of splitter cable
4) setting up shared buffers - shared among multiple ports within one
   chip



we actually can also retrieve the same information through sysfs, .e.g

|- [path to device]
  |--- migration
  | |--- self
  | |   |---device_api
  | |   |---mdev_type
  | |   |---software_version
  | |   |---device_id
  | |   |---aggregator
  | |--- compatible
  | |   |---device_api
  | |   |---mdev_type
  | |   |---software_version
  | |   |---device_id
  | |   |---aggregator
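
for illustration, a rough Python sketch of how a mgmt tool could consume
such a tree, assuming ASCII attributes with one value per file and
simplifying compatible entries to single values; a real tool would also
handle multiple compatible sets:

from pathlib import Path

def read_attrs(d):
    # one ASCII value per file, as proposed for sysfs attributes
    return {f.name: f.read_text().strip() for f in Path(d).iterdir()}

def can_migrate(src_dev, dst_dev):
    src_self = read_attrs(Path(src_dev, "migration", "self"))
    dst_compat = read_attrs(Path(dst_dev, "migration", "compatible"))
    # every self attribute of the source must be matched by the
    # target's compatible attributes
    return all(dst_compat.get(k) == v for k, v in src_self.items())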



> 
> >   I feel like it's not very appropriate for a GPU driver to use
> > this interface. Is that right?
> 
> 
> I think not though most of the users are switch or ethernet devices. It
> doesn't prevent you from inventing new abstractions.
so we'd need to patch the devlink core and the userspace devlink tool?
e.g. devlink migration

> Note that devlink is based on netlink, netlink has been widely used by
> various subsystems other than networking.

the advantage of netlink I see is that it can monitor device status and
notify the upper layer that the migration database needs to be updated.
But I'm not sure whether openstack would like to use this capability.
As Sean said, it's heavy for openstack. it's heavy for the vendor driver
as well :)

And devlink monitor now listens for the notifications and dumps the state
changes. If we want to use it, we need to let it forward the notifications
and dumped info to openstack, right?

Thanks
Yan



Re: device compatibility interface for live migration with assigned devices

2020-08-10 Thread Yan Zhao
On Wed, Aug 05, 2020 at 12:53:19PM +0200, Jiri Pirko wrote:
> Wed, Aug 05, 2020 at 11:33:38AM CEST, yan.y.z...@intel.com wrote:
> >On Wed, Aug 05, 2020 at 04:02:48PM +0800, Jason Wang wrote:
> >> 
> >> On 2020/8/5 下午3:56, Jiri Pirko wrote:
> >> > Wed, Aug 05, 2020 at 04:41:54AM CEST, jasow...@redhat.com wrote:
> >> > > On 2020/8/5 上午10:16, Yan Zhao wrote:
> >> > > > On Wed, Aug 05, 2020 at 10:22:15AM +0800, Jason Wang wrote:
> >> > > > > On 2020/8/5 上午12:35, Cornelia Huck wrote:
> >> > > > > > [sorry about not chiming in earlier]
> >> > > > > > 
> >> > > > > > On Wed, 29 Jul 2020 16:05:03 +0800
> >> > > > > > Yan Zhao  wrote:
> >> > > > > > 
> >> > > > > > > On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson 
> >> > > > > > > wrote:
> >> > > > > > (...)
> >> > > > > > 
> >> > > > > > > > Based on the feedback we've received, the previously 
> >> > > > > > > > proposed interface
> >> > > > > > > > is not viable.  I think there's agreement that the user 
> >> > > > > > > > needs to be
> >> > > > > > > > able to parse and interpret the version information.  Using 
> >> > > > > > > > json seems
> >> > > > > > > > viable, but I don't know if it's the best option.  Is there 
> >> > > > > > > > any
> >> > > > > > > > precedent of markup strings returned via sysfs we could 
> >> > > > > > > > follow?
> >> > > > > > I don't think encoding complex information in a sysfs file is a 
> >> > > > > > viable
> >> > > > > > approach. Quoting Documentation/filesystems/sysfs.rst:
> >> > > > > > 
> >> > > > > > "Attributes should be ASCII text files, preferably with only one 
> >> > > > > > value
> >> > > > > > per file. It is noted that it may not be efficient to contain 
> >> > > > > > only one
> >> > > > > > value per file, so it is socially acceptable to express an array 
> >> > > > > > of
> >> > > > > > values of the same type.
> >> > > > > > Mixing types, expressing multiple lines of data, and doing fancy
> >> > > > > > formatting of data is heavily frowned upon."
> >> > > > > > 
> >> > > > > > Even though this is an older file, I think these restrictions 
> >> > > > > > still
> >> > > > > > apply.
> >> > > > > +1, that's another reason why devlink(netlink) is better.
> >> > > > > 
> >> > > > hi Jason,
> >> > > > do you have any materials or sample code about devlink, so we can 
> >> > > > have a good
> >> > > > study of it?
> >> > > > I found some kernel docs about it but my preliminary study didn't 
> >> > > > show me the
> >> > > > advantage of devlink.
> >> > > 
> >> > > CC Jiri and Parav for a better answer for this.
> >> > > 
> >> > > My understanding is that the following advantages are obvious (as I 
> >> > > replied
> >> > > in another thread):
> >> > > 
> >> > > - existing users (NIC, crypto, SCSI, ib), mature and stable
> >> > > - much better error reporting (ext_ack other than string or errno)
> >> > > - namespace aware
> >> > > - do not couple with kobject
> >> > Jason, what is your use case?
> >> 
> >> 
> >> I think the use case is to report device compatibility for live migration.
> >> Yan proposed a simple sysfs based migration version first, but it looks not
> >> sufficient and something based on JSON is discussed.
> >> 
> >> Yan, can you help to summarize the discussion so far for Jiri as a
> >> reference?
> >> 
> >yes.
> >we are currently defining an device live migration compatibility
> >interface in order to let user space like openstack and libvirt knows
> >which two devices are live migration compatible.
> >currently the

Re: device compatibility interface for live migration with assigned devices

2020-08-05 Thread Yan Zhao
On Wed, Aug 05, 2020 at 04:02:48PM +0800, Jason Wang wrote:
> 
> On 2020/8/5 下午3:56, Jiri Pirko wrote:
> > Wed, Aug 05, 2020 at 04:41:54AM CEST, jasow...@redhat.com wrote:
> > > On 2020/8/5 上午10:16, Yan Zhao wrote:
> > > > On Wed, Aug 05, 2020 at 10:22:15AM +0800, Jason Wang wrote:
> > > > > On 2020/8/5 上午12:35, Cornelia Huck wrote:
> > > > > > [sorry about not chiming in earlier]
> > > > > > 
> > > > > > On Wed, 29 Jul 2020 16:05:03 +0800
> > > > > > Yan Zhao  wrote:
> > > > > > 
> > > > > > > On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote:
> > > > > > (...)
> > > > > > 
> > > > > > > > Based on the feedback we've received, the previously proposed 
> > > > > > > > interface
> > > > > > > > is not viable.  I think there's agreement that the user needs 
> > > > > > > > to be
> > > > > > > > able to parse and interpret the version information.  Using 
> > > > > > > > json seems
> > > > > > > > viable, but I don't know if it's the best option.  Is there any
> > > > > > > > precedent of markup strings returned via sysfs we could follow?
> > > > > > I don't think encoding complex information in a sysfs file is a 
> > > > > > viable
> > > > > > approach. Quoting Documentation/filesystems/sysfs.rst:
> > > > > > 
> > > > > > "Attributes should be ASCII text files, preferably with only one 
> > > > > > value
> > > > > > per file. It is noted that it may not be efficient to contain only 
> > > > > > one
> > > > > > value per file, so it is socially acceptable to express an array of
> > > > > > values of the same type.
> > > > > > Mixing types, expressing multiple lines of data, and doing fancy
> > > > > > formatting of data is heavily frowned upon."
> > > > > > 
> > > > > > Even though this is an older file, I think these restrictions still
> > > > > > apply.
> > > > > +1, that's another reason why devlink(netlink) is better.
> > > > > 
> > > > hi Jason,
> > > > do you have any materials or sample code about devlink, so we can have 
> > > > a good
> > > > study of it?
> > > > I found some kernel docs about it but my preliminary study didn't show 
> > > > me the
> > > > advantage of devlink.
> > > 
> > > CC Jiri and Parav for a better answer for this.
> > > 
> > > My understanding is that the following advantages are obvious (as I 
> > > replied
> > > in another thread):
> > > 
> > > - existing users (NIC, crypto, SCSI, ib), mature and stable
> > > - much better error reporting (ext_ack other than string or errno)
> > > - namespace aware
> > > - do not couple with kobject
> > Jason, what is your use case?
> 
> 
> I think the use case is to report device compatibility for live migration.
> Yan proposed a simple sysfs based migration version first, but it looks not
> sufficient and something based on JSON is discussed.
> 
> Yan, can you help to summarize the discussion so far for Jiri as a
> reference?
> 
yes.
we are currently defining a device live migration compatibility
interface in order to let user space like openstack and libvirt know
which two devices are live migration compatible.
currently the devices include mdev (a kernel emulated virtual device)
and physical devices (e.g. a VF of a PCI SRIOV device).

the attributes we want user space to compare include
common attributes:
device_api: vfio-pci, vfio-ccw...
mdev_type: mdev type of mdev or similar signature for physical device.
   It specifies a device's hardware capability, e.g.
   i915-GVTg_V5_4 means it's 1/4 of a gen9 Intel graphics
   device.
software_version: device driver's version,
   in <major>.<minor>[.bugfix] scheme, where there is no
   compatibility across major versions, minor versions have
   forward compatibility (ex. 1 -> 2 is ok, 2 -> 1 is not) and
   the bugfix version number indicates some degree of internal
   improvement that is not visible to the user in terms of
   features or compatibility.

vendor specific attributes: each vendor may define different attributes,
   device id : device id of a physical device or
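
A minimal sketch, in C, of how userspace might evaluate two
software_version strings under the <major>.<minor>[.bugfix] scheme above.
The helper name and the treat-unparseable-as-incompatible policy are
assumptions for illustration, not part of the proposal.

#include <stdio.h>

/* returns 1 if a device at version @src can migrate to one at @dst */
static int sw_version_compatible(const char *src, const char *dst)
{
    int smaj, smin, sfix = 0;
    int dmaj, dmin, dfix = 0;

    /* the bugfix number is optional, so two matches are enough */
    if (sscanf(src, "%d.%d.%d", &smaj, &smin, &sfix) < 2 ||
        sscanf(dst, "%d.%d.%d", &dmaj, &dmin, &dfix) < 2)
        return 0;   /* unparseable -> treat as incompatible */

    if (smaj != dmaj)
        return 0;   /* no compatibility across major versions */

    /* minor versions are forward compatible only (1 -> 2 ok, 2 -> 1 not) */
    return dmin >= smin;
}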

Re: device compatibility interface for live migration with assigned devices

2020-08-04 Thread Yan Zhao
On Wed, Aug 05, 2020 at 10:22:15AM +0800, Jason Wang wrote:
> 
> On 2020/8/5 12:35 AM, Cornelia Huck wrote:
> > [sorry about not chiming in earlier]
> > 
> > On Wed, 29 Jul 2020 16:05:03 +0800
> > Yan Zhao  wrote:
> > 
> > > On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote:
> > (...)
> > 
> > > > Based on the feedback we've received, the previously proposed interface
> > > > is not viable.  I think there's agreement that the user needs to be
> > > > able to parse and interpret the version information.  Using json seems
> > > > viable, but I don't know if it's the best option.  Is there any
> > > > precedent of markup strings returned via sysfs we could follow?
> > I don't think encoding complex information in a sysfs file is a viable
> > approach. Quoting Documentation/filesystems/sysfs.rst:
> > 
> > "Attributes should be ASCII text files, preferably with only one value
> > per file. It is noted that it may not be efficient to contain only one
> > value per file, so it is socially acceptable to express an array of
> > values of the same type.
> > Mixing types, expressing multiple lines of data, and doing fancy
> > formatting of data is heavily frowned upon."
> > 
> > Even though this is an older file, I think these restrictions still
> > apply.
> 
> 
> +1, that's another reason why devlink(netlink) is better.
>
hi Jason,
do you have any materials or sample code about devlink, so we can have a good
study of it?
I found some kernel docs about it but my preliminary study didn't show me the
advantage of devlink.

Thanks
Yan



Re: device compatibility interface for live migration with assigned devices

2020-08-04 Thread Yan Zhao
> > yes, including a device_api field is better.
> > for mdev, "device_type=vfio-mdev", is it right?
> 
> No, vfio-mdev is not a device API, it's the driver that attaches to the
> mdev bus device to expose it through vfio.  The device_api exposes the
> actual interface of the vfio device, it's also vfio-pci for typical
> mdev devices found on x86, but may be vfio-ccw, vfio-ap, etc...  See
> VFIO_DEVICE_API_PCI_STRING and friends.
> 
ok. got it.

> > > > >   device_id=8086591d  
> > > 
> > > Is device_id interpreted relative to device_type?  How does this
> > > relate to mdev_type?  If we have an mdev_type, doesn't that fully
> > > define the software API?
> > >   
> > its parent pci id for mdev actually.
>
> If we need to specify the parent PCI ID then something is fundamentally
> wrong with the mdev_type.  The mdev_type should define a unique,
> software compatible interface, regardless of the parent device IDs.  If
> a i915-GVTg_V5_2 means different things based on the parent device IDs,
> then different mdev_types should be reported for those parent
> devices.
>
hmm, then do we allow vendor specific fields?
or is it a must that a vendor specific field should have a corresponding
vendor attribute?

another thing is that the definition of mdev_type in GVT only corresponds
to vGPU computing ability currently,
e.g. i915-GVTg_V5_2 is 1/2 of a gen9 IGD, i915-GVTg_V4_2 is 1/2 of a
gen8 IGD.
It is too coarse-grained for live migration compatibility.

Do you think we need to update GVT's definition of mdev_type?
And is there any guide for mdev_type definition?

> > > > >   mdev_type=i915-GVTg_V5_2  
> > > 
> > > And how are non-mdev devices represented?
> > >   
> > non-mdev can opt to not include this field, or as you said below, a
> > vendor signature. 
> > 
> > > > >   aggregator=1
> > > > >   pv_mode="none+ppgtt+context"  
> > > 
> > > These are meaningless vendor specific matches afaict.
> > >   
> > yes, pv_mode and aggregator are vendor specific fields.
> > but they are important to decide whether two devices are compatible.
> > pv_mode means whether a vGPU supports guest paravirtualized api.
> > "none+ppgtt+context" means guest can not use pv, or use ppgtt mode pv or
> > use context mode pv.
> > 
> > > > >   interface_version=3  
> > > 
> > > Not much granularity here, I prefer Sean's previous
> > > <major>.<minor>[.bugfix] scheme.
> > >   
> > yes, the <major>.<minor>[.bugfix] scheme may be better, but I'm not sure if
> > it works for a complicated scenario.
> > e.g for pv_mode,
> > (1) initially,  pv_mode is not supported, so it's pv_mode=none, it's 0.0.0,
> > (2) then, pv_mode=ppgtt is supported, pv_mode="none+ppgtt", it's 0.1.0,
> > indicating pv_mode=none can migrate to pv_mode="none+ppgtt", but not vice 
> > versa.
> > (3) later, pv_mode=context is also supported,
> > pv_mode="none+ppgtt+context", so it's 0.2.0.
> > 
> > But if later, pv_mode=ppgtt is removed. pv_mode="none+context", how to
> > name its version? "none+ppgtt" (0.1.0) is not compatible to
> > "none+context", but "none+ppgtt+context" (0.2.0) is compatible to
> > "none+context".
> 
> If pv_mode=ppgtt is removed, then the compatible versions would be
> 0.0.0 or 1.0.0, ie. the major version would be incremented due to
> feature removal.
>  
> > Maintaining such a scheme is painful for the vendor driver.
> 
> Migration compatibility is painful, there's no way around that.  I
> think the version scheme is an attempt to push some of that low level
> burden on the vendor driver, otherwise the management tools need to
> work on an ever growing matrix of vendor specific features which is
> going to become unwieldy and is largely meaningless outside of the
> vendor driver.  Instead, the vendor driver can make strategic decisions
> about where to continue to maintain a support burden and make explicit
> decisions to maintain or break compatibility.  The version scheme is a
> simplification and abstraction of vendor driver features in order to
> create a small, logical compatibility matrix.  Compromises necessarily
> need to be made for that to occur.
>
ok. got it.

> > > > > COMPATIBLE:
> > > > >   device_type=pci
> > > > >   device_id=8086591d
> > > > >   mdev_type=i915-GVTg_V5_{val1:int:1,2,4,8}
> > > > this mixed notation will be hard to parse so i would avoid that.  
> > > 
> > > Some background, Intel has been proposing aggregation as a solution to
> > > how we scale mdev devices when hardware exposes large numbers of
> > > assignable objects that can be composed in essentially arbitrary ways.
> > > So for instance, if we have a workqueue (wq), we might have an mdev
> > > type for 1wq, 2wq, 3wq,... Nwq.  It's not really practical to expose a
> > > discrete mdev type for each of those, so they want to define a base
> > > type which is composable to other types via this aggregation.  This is
> > > what this substitution and tagging is attempting to accomplish.  So
> > > imagine this set of values for cases where it's not practical to unroll
> > 

Re: device compatibility interface for live migration with assigned devices

2020-07-29 Thread Yan Zhao
On Wed, Jul 29, 2020 at 01:12:55PM -0600, Alex Williamson wrote:
> On Wed, 29 Jul 2020 12:28:46 +0100
> Sean Mooney  wrote:
> 
> > On Wed, 2020-07-29 at 16:05 +0800, Yan Zhao wrote:
> > > On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote:  
> > > > On Mon, 27 Jul 2020 15:24:40 +0800
> > > > Yan Zhao  wrote:
> > > >   
> > > > > > > As you indicate, the vendor driver is responsible for checking 
> > > > > > > version
> > > > > > > information embedded within the migration stream.  Therefore a
> > > > > > > migration should fail early if the devices are incompatible.  Is 
> > > > > > > it
> > > > > > 
> > > > > > but as I know, currently in VFIO migration protocol, we have no way 
> > > > > > to
> > > > > > get vendor specific compatibility checking string in migration 
> > > > > > setup stage
> > > > > > (i.e. .save_setup stage) before the device is set to _SAVING state.
> > > > > > In this way, for devices who does not save device data in precopy 
> > > > > > stage,
> > > > > > the migration compatibility checking is as late as in stop-and-copy
> > > > > > stage, which is too late.
> > > > > > do you think we need to add the getting/checking of vendor specific
> > > > > > compatibility string early in save_setup stage?
> > > > > >
> > > > > 
> > > > > hi Alex,
> > > > > after an offline discussion with Kevin, I realized that it may not be 
> > > > > a
> > > > > problem if migration compatibility check in vendor driver occurs late 
> > > > > in
> > > > > stop-and-copy phase for some devices, because if we report device
> > > > > compatibility attributes clearly in an interface, the chances for
> > > > > libvirt/openstack to make a wrong decision is little.  
> > > > 
> > > > I think it would be wise for a vendor driver to implement a pre-copy
> > > > phase, even if only to send version information and verify it at the
> > > > target.  Deciding you have no device state to send during pre-copy does
> > > > not mean your vendor driver needs to opt-out of the pre-copy phase
> > > > entirely.  Please also note that pre-copy is at the user's discretion,
> > > > we've defined that we can enter stop-and-copy at any point, including
> > > > without a pre-copy phase, so I would recommend that vendor drivers
> > > > validate compatibility at the start of both the pre-copy and the
> > > > stop-and-copy phases.
> > > >   
> > > 
> > > ok. got it!
> > >   
> > > > > so, do you think we are now arriving at an agreement that we'll give 
> > > > > up
> > > > > the read-and-test scheme and start to defining one interface (perhaps 
> > > > > in
> > > > > json format), from which libvirt/openstack is able to parse and find 
> > > > > out
> > > > > compatibility list of a source mdev/physical device?  
> > > > 
> > > > Based on the feedback we've received, the previously proposed interface
> > > > is not viable.  I think there's agreement that the user needs to be
> > > > able to parse and interpret the version information.  Using json seems
> > > > viable, but I don't know if it's the best option.  Is there any
> > > > precedent of markup strings returned via sysfs we could follow?  
> > > 
> > > I found some examples of using formatted string under /sys, mostly under
> > > tracing. maybe we can do a similar implementation.
> > > 
> > > #cat /sys/kernel/debug/tracing/events/kvm/kvm_mmio/format
> > > 
> > > name: kvm_mmio
> > > ID: 32
> > > format:
> > > field:unsigned short common_type;   offset:0;   size:2; signed:0;
> > > field:unsigned char common_flags;   offset:2;   size:1; signed:0;
> > > field:unsigned char common_preempt_count;   offset:3;   size:1; signed:0;
> > > field:int common_pid;   offset:4;   size:4; signed:1;
> > > 
> > > field:u32 type; offset:8;   size:4; signed:0;
> > > field:u32 len;  offset:12;  size:4; signed:0;
> > > field:u64 gpa;  off

Re: device compatibility interface for live migration with assigned devices

2020-07-29 Thread Yan Zhao
On Wed, Jul 29, 2020 at 12:28:46PM +0100, Sean Mooney wrote:
> On Wed, 2020-07-29 at 16:05 +0800, Yan Zhao wrote:
> > On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote:
> > > On Mon, 27 Jul 2020 15:24:40 +0800
> > > Yan Zhao  wrote:
> > > 
> > > > > > As you indicate, the vendor driver is responsible for checking 
> > > > > > version
> > > > > > information embedded within the migration stream.  Therefore a
> > > > > > migration should fail early if the devices are incompatible.  Is it 
> > > > > >  
> > > > > 
> > > > > but as I know, currently in VFIO migration protocol, we have no way to
> > > > > get vendor specific compatibility checking string in migration setup 
> > > > > stage
> > > > > (i.e. .save_setup stage) before the device is set to _SAVING state.
> > > > > In this way, for devices who does not save device data in precopy 
> > > > > stage,
> > > > > the migration compatibility checking is as late as in stop-and-copy
> > > > > stage, which is too late.
> > > > > do you think we need to add the getting/checking of vendor specific
> > > > > compatibility string early in save_setup stage?
> > > > >  
> > > > 
> > > > hi Alex,
> > > > after an offline discussion with Kevin, I realized that it may not be a
> > > > problem if migration compatibility check in vendor driver occurs late in
> > > > stop-and-copy phase for some devices, because if we report device
> > > > compatibility attributes clearly in an interface, the chances for
> > > > libvirt/openstack to make a wrong decision is little.
> > > 
> > > I think it would be wise for a vendor driver to implement a pre-copy
> > > phase, even if only to send version information and verify it at the
> > > target.  Deciding you have no device state to send during pre-copy does
> > > not mean your vendor driver needs to opt-out of the pre-copy phase
> > > entirely.  Please also note that pre-copy is at the user's discretion,
> > > we've defined that we can enter stop-and-copy at any point, including
> > > without a pre-copy phase, so I would recommend that vendor drivers
> > > validate compatibility at the start of both the pre-copy and the
> > > stop-and-copy phases.
> > > 
> > 
> > ok. got it!
> > 
> > > > so, do you think we are now arriving at an agreement that we'll give up
> > > > the read-and-test scheme and start to defining one interface (perhaps in
> > > > json format), from which libvirt/openstack is able to parse and find out
> > > > compatibility list of a source mdev/physical device?
> > > 
> > > Based on the feedback we've received, the previously proposed interface
> > > is not viable.  I think there's agreement that the user needs to be
> > > able to parse and interpret the version information.  Using json seems
> > > viable, but I don't know if it's the best option.  Is there any
> > > precedent of markup strings returned via sysfs we could follow?
> > 
> > I found some examples of using formatted string under /sys, mostly under
> > tracing. maybe we can do a similar implementation.
> > 
> > #cat /sys/kernel/debug/tracing/events/kvm/kvm_mmio/format
> > 
> > name: kvm_mmio
> > ID: 32
> > format:
> > field:unsigned short common_type;   offset:0;   size:2; signed:0;
> > field:unsigned char common_flags;   offset:2;   size:1; signed:0;
> > field:unsigned char common_preempt_count;   offset:3;   size:1; signed:0;
> > field:int common_pid;   offset:4;   size:4; signed:1;
> > 
> > field:u32 type; offset:8;   size:4; signed:0;
> > field:u32 len;  offset:12;  size:4; signed:0;
> > field:u64 gpa;  offset:16;  size:8; signed:0;
> > field:u64 val;  offset:24;  size:8; signed:0;
> > 
> > print fmt: "mmio %s len %u gpa 0x%llx val 0x%llx", 
> > __print_symbolic(REC->type, { 0, "unsatisfied-read" }, { 1, "read"
> > }, { 2, "write" }), REC->len, REC->gpa, REC->val
> > 
> this is not json format and it's not super friendly to parse.
yes, it's just an example. It's exported to be used by userspace perf &
trace_cmd.

> > 
> > #cat /sys/devices/pci0000:00/0000:00:02.0/uevent
> > DRIVER=v

Re: device compatibility interface for live migration with assigned devices

2020-07-29 Thread Yan Zhao
On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote:
> On Mon, 27 Jul 2020 15:24:40 +0800
> Yan Zhao  wrote:
> 
> > > > As you indicate, the vendor driver is responsible for checking version
> > > > information embedded within the migration stream.  Therefore a
> > > > migration should fail early if the devices are incompatible.  Is it  
> > > but as I know, currently in VFIO migration protocol, we have no way to
> > > get vendor specific compatibility checking string in migration setup stage
> > > (i.e. .save_setup stage) before the device is set to _SAVING state.
> > > In this way, for devices who does not save device data in precopy stage,
> > > the migration compatibility checking is as late as in stop-and-copy
> > > stage, which is too late.
> > > do you think we need to add the getting/checking of vendor specific
> > > compatibility string early in save_setup stage?
> > >  
> > hi Alex,
> > after an offline discussion with Kevin, I realized that it may not be a
> > problem if migration compatibility check in vendor driver occurs late in
> > stop-and-copy phase for some devices, because if we report device
> > compatibility attributes clearly in an interface, the chances for
> > libvirt/openstack to make a wrong decision is little.
> 
> I think it would be wise for a vendor driver to implement a pre-copy
> phase, even if only to send version information and verify it at the
> target.  Deciding you have no device state to send during pre-copy does
> not mean your vendor driver needs to opt-out of the pre-copy phase
> entirely.  Please also note that pre-copy is at the user's discretion,
> we've defined that we can enter stop-and-copy at any point, including
> without a pre-copy phase, so I would recommend that vendor drivers
> validate compatibility at the start of both the pre-copy and the
> stop-and-copy phases.
>
ok. got it!

> > so, do you think we are now arriving at an agreement that we'll give up
> > the read-and-test scheme and start to defining one interface (perhaps in
> > json format), from which libvirt/openstack is able to parse and find out
> > compatibility list of a source mdev/physical device?
> 
> Based on the feedback we've received, the previously proposed interface
> is not viable.  I think there's agreement that the user needs to be
> able to parse and interpret the version information.  Using json seems
> viable, but I don't know if it's the best option.  Is there any
> precedent of markup strings returned via sysfs we could follow?
I found some examples of using formatted strings under /sys, mostly under
tracing. Maybe we can do a similar implementation.

#cat /sys/kernel/debug/tracing/events/kvm/kvm_mmio/format

name: kvm_mmio
ID: 32
format:
field:unsigned short common_type;   offset:0;   size:2; signed:0;
field:unsigned char common_flags;   offset:2;   size:1; signed:0;
field:unsigned char common_preempt_count;   offset:3;   size:1; signed:0;
field:int common_pid;   offset:4;   size:4; signed:1;

field:u32 type; offset:8;   size:4; signed:0;
field:u32 len;  offset:12;  size:4; signed:0;
field:u64 gpa;  offset:16;  size:8; signed:0;
field:u64 val;  offset:24;  size:8; signed:0;

print fmt: "mmio %s len %u gpa 0x%llx val 0x%llx", __print_symbolic(REC->type, 
{ 0, "unsatisfied-read" }, { 1, "read" }, { 2, "write" }), REC->len, REC->gpa, 
REC->val


#cat /sys/devices/pci0000:00/0000:00:02.0/uevent
DRIVER=vfio-pci
PCI_CLASS=30000
PCI_ID=8086:591D
PCI_SUBSYS_ID=8086:2212
PCI_SLOT_NAME=0000:00:02.0
MODALIAS=pci:v00008086d0000591Dsv00008086sd00002212bc03sc00i00

> 
> Your idea of having both a "self" object and an array of "compatible"
> objects is perhaps something we can build on, but we must not assume
> PCI devices at the root level of the object.  Providing both the
> mdev-type and the driver is a bit redundant, since the former includes
> the latter.  We can't have vendor specific versioning schemes though,
> ie. gvt-version. We need to agree on a common scheme and decide which
> fields the version is relative to, ex. just the mdev type?
what about making all compared fields vendor specific?
userspace like openstack then only needs to parse and check whether the
target device is within the source's compatible list, without understanding
the meaning of each field.
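
A minimal sketch of that opaque comparison, assuming the "self"/"compatible"
JSON from earlier in the thread has already been parsed into flat arrays of
"key=value" strings, serialized in the same field order (the parsing step and
all names here are hypothetical):

#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* one device description: NULL-terminated array of "key=value" strings */
static bool attrs_equal(const char **a, const char **b)
{
    size_t i;

    for (i = 0; a[i] && b[i]; i++) {
        if (strcmp(a[i], b[i]) != 0)
            return false;   /* purely opaque compare, no field semantics */
    }
    return !a[i] && !b[i];  /* must also have the same number of fields */
}

/* src_compatible: the source's "compatible" array, NULL-terminated */
static bool target_is_compatible(const char ***src_compatible,
                                 const char **target_self)
{
    size_t i;

    for (i = 0; src_compatible[i]; i++) {
        if (attrs_equal(src_compatible[i], target_self))
            return true;
    }
    return false;
}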

> I had also proposed fields that provide information to create a
> compatible type, for example to create a type_x2 device from a type_x1
> mdev type, they need to know to apply an aggregation attribute.  If we
> need to explicitly list every aggregation value and the res

Re: device compatibility interface for live migration with assigned devices

2020-07-27 Thread Yan Zhao
> > As you indicate, the vendor driver is responsible for checking version
> > information embedded within the migration stream.  Therefore a
> > migration should fail early if the devices are incompatible.  Is it
> but as I know, currently in VFIO migration protocol, we have no way to
> get vendor specific compatibility checking string in migration setup stage
> (i.e. .save_setup stage) before the device is set to _SAVING state.
> In this way, for devices who does not save device data in precopy stage,
> the migration compatibility checking is as late as in stop-and-copy
> stage, which is too late.
> do you think we need to add the getting/checking of vendor specific
> compatibility string early in save_setup stage?
>
hi Alex,
after an offline discussion with Kevin, I realized that it may not be a
problem if migration compatibility check in vendor driver occurs late in
stop-and-copy phase for some devices, because if we report device
compatibility attributes clearly in an interface, the chances for
libvirt/openstack to make a wrong decision is little.
so, do you think we are now arriving at an agreement that we'll give up
the read-and-test scheme and start to defining one interface (perhaps in
json format), from which libvirt/openstack is able to parse and find out
compatibility list of a source mdev/physical device?

Thanks
Yan



Re: device compatibility interface for live migration with assigned devices

2020-07-20 Thread Yan Zhao
On Fri, Jul 17, 2020 at 10:12:58AM -0600, Alex Williamson wrote:
<...>
> > yes, in another reply, Alex proposed to use an interface in json format.
> > I guess we can define something like
> > 
> > { "self" :
> >   [
> > { "pciid" : "8086591d",
> >   "driver" : "i915",
> >   "gvt-version" : "v1",
> >   "mdev_type"   : "i915-GVTg_V5_2",
> >   "aggregator"  : "1",
> >   "pv-mode" : "none",
> > }
> >   ],
> >   "compatible" :
> >   [
> > { "pciid" : "8086591d",
> >   "driver" : "i915",
> >   "gvt-version" : "v1",
> >   "mdev_type"   : "i915-GVTg_V5_2",
> >   "aggregator"  : "1"
> >   "pv-mode" : "none",
> > },
> > { "pciid" : "8086591d",
> >   "driver" : "i915",
> >   "gvt-version" : "v1",
> >   "mdev_type"   : "i915-GVTg_V5_4",
> >   "aggregator"  : "2"
> >   "pv-mode" : "none",
> > },
> > { "pciid" : "8086591d",
> >   "driver" : "i915",
> >   "gvt-version" : "v2",
> >   "mdev_type"   : "i915-GVTg_V5_4",
> >   "aggregator"  : "2"
> >   "pv-mode" : "none, ppgtt, context",
> > }
> > ...
> >   ]
> > }
> > 
> > But as those fields are mostly vendor specific, the userspace can
> > only do simple string comparing, I guess the list would be very long as
> > it needs to enumerate all possible targets.
> 
> 
> This ignores so much of what I tried to achieve in my example :(
> 
sorry, I just was eager to show and confirm the way to list all compatible
combination of mdev_type and mdev attributes.

> 
> > also, in some fields like "gvt-version", is there a simple way to express
> > things like v2+?
> 
> 
> That's not a reasonable thing to express anyway, how can you be certain
> that v3 won't break compatibility with v2?  Sean proposed a versioning
> scheme that accounts for this, using an x.y.z version expressing the
> major, minor, and bugfix versions, where there is no compatibility
> across major versions, minor versions have forward compatibility (ex. 1
> -> 2 is ok, 2 -> 1 is not) and bugfix version number indicates some
> degree of internal improvement that is not visible to the user in terms
> of features or compatibility, but provides a basis for preferring
> equally compatible candidates.
>
right. if the self version is v1, it can't know that its compatible version
will be v2. it can only be done in reverse, i.e.
when the self version is v2, it can list its compatible versions as v1 and
v2.
and maybe later, when the self version is v3, there's no v1 in its compatible
list.

In this way, do you think we still need the complex x.y.z versioning scheme?

>  
> > If the userspace can read this interface both in src and target and
> > check whether both src and target are in corresponding compatible list, I
> > think it will work for us.
> > 
> > But still, kernel should not rely on userspace's choice, the opaque
> > compatibility string is still required in kernel. No matter whether
> > it would be exposed to userspace as an compatibility checking interface,
> > vendor driver would keep this part of code and embed the string into the
> > migration stream. so exposing it as an interface to be used by libvirt to
> > do a safety check before a real live migration is only about enabling
> > the kernel part of check to happen ahead.
> 
> As you indicate, the vendor driver is responsible for checking version
> information embedded within the migration stream.  Therefore a
> migration should fail early if the devices are incompatible.  Is it
but as I know, currently in the VFIO migration protocol, we have no way to
get the vendor specific compatibility checking string in the migration setup
stage (i.e. the .save_setup stage) before the device is set to _SAVING state.
In this way, for devices which do not save device data in the precopy stage,
the migration compatibility checking happens as late as the stop-and-copy
stage, which is too late.
do you think we need to add the getting/checking of the vendor specific
compatibility string early in the save_setup stage?

> really libvirt's place to second guess what it has been directed to do?
if libvirt uses the scheme of reading the compatibility string at the source
and writing it for checking at the target, it cannot be called "a second
guess". It's not a guess, but a confirmation.

> Why would we even proceed to design a user parse-able version interface
> if we still have a dependency on an opaque interface?  Thanks,
one reason is that libvirt can't trust the parsing result from
openstack.
Another reason is that libvirt can use this opaque interface more easily than
doing another parsing by itself, given that it would not introduce more
burden to the kernel, which would write this part of code anyway, no matter
whether libvirt uses it or not.
 
Thanks
Yan



Re: device compatibility interface for live migration with assigned devices

2020-07-16 Thread Yan Zhao
On Thu, Jul 16, 2020 at 12:16:26PM +0800, Jason Wang wrote:
> 
> On 2020/7/14 7:29 AM, Yan Zhao wrote:
> > hi folks,
> > we are defining a device migration compatibility interface that helps upper
> > layer stack like openstack/ovirt/libvirt to check if two devices are
> > live migration compatible.
> > The "devices" here could be MDEVs, physical devices, or hybrid of the two.
> > e.g. we could use it to check whether
> > - a src MDEV can migrate to a target MDEV,
> > - a src VF in SRIOV can migrate to a target VF in SRIOV,
> > - a src MDEV can migration to a target VF in SRIOV.
> >(e.g. SIOV/SRIOV backward compatibility case)
> > 
> > The upper layer stack could use this interface as the last step to check
> > if one device is able to migrate to another device before triggering a real
> > live migration procedure.
> > we are not sure if this interface is of value or help to you. please don't
> > hesitate to drop your valuable comments.
> > 
> > 
> > (1) interface definition
> > The interface is defined in below way:
> > 
> >           __          userspace
> >            /\               \
> >           /                  \ write
> >          / read               \
> >     ____/_____________    _____\|/___________
> >    | migration_version |  | migration_version | --> check migration
> >     -------------------    -------------------     compatibility
> >         device A               device B
> > 
> > 
> > a device attribute named migration_version is defined under each device's
> > sysfs node. e.g. 
> > (/sys/bus/pci/devices/0000\:00\:02.0/$mdev_UUID/migration_version).
> 
> 
> Are you aware of the devlink based device management interface that is
> proposed upstream? I think it has many advantages over sysfs, do you
> consider to switch to that?
not familiar with the devlink. will do some research of it.
> 
> 
> > userspace tools read the migration_version as a string from the source 
> > device,
> > and write it to the migration_version sysfs attribute in the target device.
> > 
> > The userspace should treat ANY of below conditions as two devices not 
> > compatible:
> > - any one of the two devices does not have a migration_version attribute
> > - error when reading from migration_version attribute of one device
> > - error when writing migration_version string of one device to
> >migration_version attribute of the other device
> > 
> > The string read from migration_version attribute is defined by device vendor
> > driver and is completely opaque to the userspace.
> 
> 
> My understanding is that something opaque to userspace is not the philosophy

but the VFIO live migration in itself is essentially a big opaque stream to 
userspace.

> of Linux. Instead of having a generic API but opaque value, why not do in a
> vendor specific way like:
> 
> 1) exposing the device capability in a vendor specific way via sysfs/devlink
> or other API
> 2) management read capability in both src and dst and determine whether we
> can do the migration
> 
> This is the way we plan to do with vDPA.
>
yes, in another reply, Alex proposed to use an interface in json format.
I guess we can define something like

{ "self" :
  [
{ "pciid" : "8086591d",
  "driver" : "i915",
  "gvt-version" : "v1",
  "mdev_type"   : "i915-GVTg_V5_2",
  "aggregator"  : "1",
  "pv-mode" : "none",
}
  ],
  "compatible" :
  [
{ "pciid" : "8086591d",
  "driver" : "i915",
  "gvt-version" : "v1",
  "mdev_type"   : "i915-GVTg_V5_2",
  "aggregator"  : "1"
  "pv-mode" : "none",
},
{ "pciid" : "8086591d",
  "driver" : "i915",
  "gvt-version" : "v1",
  "mdev_type"   : "i915-GVTg_V5_4",
  "aggregator"  : "2"
  "pv-mode" : "none",
},
{ "pciid" : "8086591d",
  "driver" : "i915",
  "gvt-version" : "v2",
  "mdev_type"   : "i915-GVTg_V5_4",
  "aggregator"  : "2"
  "pv-mode" : "none, ppgtt, context",
}
...
  ]
}

But as those fields are mostly vendor specific, the userspace can
only do simple string comparison, and I guess the list would be very long as
it needs to 

Re: device compatibility interface for live migration with assigned devices

2020-07-15 Thread Yan Zhao
On Tue, Jul 14, 2020 at 02:59:48PM -0600, Alex Williamson wrote:
> On Tue, 14 Jul 2020 18:19:46 +0100
> "Dr. David Alan Gilbert"  wrote:
> 
> > * Alex Williamson (alex.william...@redhat.com) wrote:
> > > On Tue, 14 Jul 2020 11:21:29 +0100
> > > Daniel P. Berrangé  wrote:
> > >   
> > > > On Tue, Jul 14, 2020 at 07:29:57AM +0800, Yan Zhao wrote:  
> > > > > hi folks,
> > > > > we are defining a device migration compatibility interface that helps 
> > > > > upper
> > > > > layer stack like openstack/ovirt/libvirt to check if two devices are
> > > > > live migration compatible.
> > > > > The "devices" here could be MDEVs, physical devices, or hybrid of the 
> > > > > two.
> > > > > e.g. we could use it to check whether
> > > > > - a src MDEV can migrate to a target MDEV,
> > > > > - a src VF in SRIOV can migrate to a target VF in SRIOV,
> > > > > - a src MDEV can migration to a target VF in SRIOV.
> > > > >   (e.g. SIOV/SRIOV backward compatibility case)
> > > > > 
> > > > > The upper layer stack could use this interface as the last step to 
> > > > > check
> > > > > if one device is able to migrate to another device before triggering 
> > > > > a real
> > > > > live migration procedure.
> > > > > we are not sure if this interface is of value or help to you. please 
> > > > > don't
> > > > > hesitate to drop your valuable comments.
> > > > > 
> > > > > 
> > > > > (1) interface definition
> > > > > The interface is defined in below way:
> > > > > 
> > > > >           __          userspace
> > > > >            /\               \
> > > > >           /                  \ write
> > > > >          / read               \
> > > > >     ____/_____________    _____\|/___________
> > > > >    | migration_version |  | migration_version | --> check migration
> > > > >     -------------------    -------------------     compatibility
> > > > >         device A               device B
> > > > > 
> > > > > 
> > > > > a device attribute named migration_version is defined under each 
> > > > > device's
> > > > > sysfs node. e.g. 
> > > > > (/sys/bus/pci/devices/0000\:00\:02.0/$mdev_UUID/migration_version).
> > > > > userspace tools read the migration_version as a string from the 
> > > > > source device,
> > > > > and write it to the migration_version sysfs attribute in the target 
> > > > > device.
> > > > > 
> > > > > The userspace should treat ANY of below conditions as two devices not 
> > > > > compatible:
> > > > > - any one of the two devices does not have a migration_version 
> > > > > attribute
> > > > > - error when reading from migration_version attribute of one device
> > > > > - error when writing migration_version string of one device to
> > > > >   migration_version attribute of the other device
> > > > > 
> > > > > The string read from migration_version attribute is defined by device 
> > > > > vendor
> > > > > driver and is completely opaque to the userspace.
> > > > > for a Intel vGPU, string format can be defined like
> > > > > "parent device PCI ID" + "version of gvt driver" + "mdev type" + 
> > > > > "aggregator count".
> > > > > 
> > > > > for an NVMe VF connecting to a remote storage. it could be
> > > > > "PCI ID" + "driver version" + "configured remote storage URL"
> > > > > 
> > > > > for a QAT VF, it may be
> > > > > "PCI ID" + "driver version" + "supported encryption set".
> > > > > 
> > > > > (to avoid namespace confliction from each vendor, we may prefix a 
> > > > > driver name to
> > > > > each migration_version string. e.g. 
> > > > > i915-v1-8086-591d-i915-GVTg_V5_8-1)  
> > > 
> > > It's very strange to define it as opaque and then proceed to describe
> > > the contents of that opaque string.  The point is that its contents
> > > are defined by

device compatibility interface for live migration with assigned devices

2020-07-13 Thread Yan Zhao
hi folks,
we are defining a device migration compatibility interface that helps upper
layer stack like openstack/ovirt/libvirt to check if two devices are
live migration compatible.
The "devices" here could be MDEVs, physical devices, or hybrid of the two.
e.g. we could use it to check whether
- a src MDEV can migrate to a target MDEV,
- a src VF in SRIOV can migrate to a target VF in SRIOV,
- a src MDEV can migrate to a target VF in SRIOV.
  (e.g. SIOV/SRIOV backward compatibility case)

The upper layer stack could use this interface as the last step to check
if one device is able to migrate to another device before triggering a real
live migration procedure.
we are not sure if this interface is of value or help to you. please don't
hesitate to drop your valuable comments.


(1) interface definition
The interface is defined in the following way:

          __          userspace
           /\               \
          /                  \ write
         / read               \
    ____/_____________    _____\|/___________
   | migration_version |  | migration_version | --> check migration
    -------------------    -------------------     compatibility
        device A               device B


a device attribute named migration_version is defined under each device's
sysfs node. e.g. 
(/sys/bus/pci/devices/0000\:00\:02.0/$mdev_UUID/migration_version).
userspace tools read the migration_version as a string from the source device,
and write it to the migration_version sysfs attribute in the target device.

The userspace should treat ANY of the below conditions as meaning the two
devices are not compatible:
- any one of the two devices does not have a migration_version attribute
- error when reading from migration_version attribute of one device
- error when writing migration_version string of one device to
  migration_version attribute of the other device

The string read from migration_version attribute is defined by device vendor
driver and is completely opaque to the userspace.
for an Intel vGPU, the string format can be defined like
"parent device PCI ID" + "version of gvt driver" + "mdev type" + "aggregator 
count".

for an NVMe VF connecting to a remote storage. it could be
"PCI ID" + "driver version" + "configured remote storage URL"

for a QAT VF, it may be
"PCI ID" + "driver version" + "supported encryption set".

(to avoid namespace conflicts between vendors, we may prefix a driver name to
each migration_version string. e.g. i915-v1-8086-591d-i915-GVTg_V5_8-1)
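
As an illustration of the read-and-write protocol above, a minimal userspace
sketch; the function name and buffer size are assumptions, and the error
handling simply follows the "any failure means not compatible" rule:

#include <fcntl.h>
#include <unistd.h>

/* @src_attr/@dst_attr: migration_version sysfs paths of the two devices */
static int devices_compatible(const char *src_attr, const char *dst_attr)
{
    char version[256];
    ssize_t n;
    int fd, ret = 0;

    fd = open(src_attr, O_RDONLY);
    if (fd < 0)
        return 0;       /* missing attribute -> not compatible */
    n = read(fd, version, sizeof(version));
    close(fd);
    if (n <= 0)
        return 0;       /* read error -> not compatible */

    fd = open(dst_attr, O_WRONLY);
    if (fd < 0)
        return 0;       /* missing attribute -> not compatible */
    if (write(fd, version, n) == n)
        ret = 1;        /* target's vendor driver accepted the string */
    close(fd);
    return ret;
}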


(2) backgrounds

The reason we hope the migration_version string is opaque to the userspace
is that it is hard to generalize standard comparison fields and comparison
methods for different devices from different vendors.
Though userspace could still do a simple string compare to check if
two devices are compatible, and the result should also be right, it's still
too limited as it excludes possible candidates whose migration_version
strings fail to be equal.
e.g. an MDEV with mdev_type_1, aggregator count 3 is probably compatible
with another MDEV with mdev_type_3, aggregator count 1, even though their
migration_version strings are not equal
(assuming mdev_type_3 has 3 times the resources of mdev_type_1).

besides that, the driver version and configured resources are all elements
that need to be taken into account.

So, we hope to leave the freedom to the vendor driver and let it make the
final decision, via a simple read from the source side and a write-for-test
on the target side.


we then think the device compatibility issues for live migration with assigned
devices can be divided into two steps:
a. management tools filter out possible migration target devices.
   Tags could be created according to info from product specification.
   we think openstack/ovirt may have vendor proprietary components to create
   those customized tags for each product from each vendor.
   e.g.
   for Intel vGPU, with a vGPU(a MDEV device) in source side, the tags to
   search target vGPU are like:
   a tag for compatible parent PCI IDs,
   a tag for a range of gvt driver versions,
   a tag for a range of mdev type + aggregator count

   for NVMe VF, the tags to search target VF may be like:
   a tag for compatible PCI IDs,
   a tag for a range of driver versions,
   a tag for URL of configured remote storage.

b. with the output from step a, openstack/ovirt/libvirt could use our proposed
   device migration compatibility interface to make sure the two devices are
   indeed live migration compatible before launching the real live migration
   process to start stream copying, src device stopping and target device
   resuming.
   It is supposed that this step would not bring any performance penalty as
   - in kernel it's just a simple string decoding and comparing
   - in openstack/ovirt, it could be done by extending the current function
     check_can_live_migrate_destination, alongside claiming target resources.[1]


[1] 

Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-06-27 Thread Yan Zhao
On Sat, Jun 27, 2020 at 08:57:14AM -0400, Peter Xu wrote:
> On Sat, Jun 27, 2020 at 03:26:45AM -0400, Yan Zhao wrote:
> > > -assert(entry->iova >= notifier->start && entry_end <= notifier->end);
> > > +if (notifier->notifier_flags & IOMMU_NOTIFIER_ARBITRARY_MASK) {
> > > +tmp.iova = MAX(tmp.iova, notifier->start);
> > > +tmp.addr_mask = MIN(tmp.addr_mask, notifier->end);
> > NIT:
> >tmp.addr_mask = MIN(entry_end, notifier->end) - tmp.iova;
> 
> Right.  Thanks. :)
> 
> > > +assert(tmp.iova <= tmp.addr_mask);
> > no this assertion then.
> 
> Or change it into:
> 
>   assert(MIN(entry_end, notifier->end) >= tmp.iova);
> 
> To double confirm no overflow.
>
what about assert in this way, so that it's also useful to check overflow
in the other condition.

hwaddr entry_end = entry->iova + entry->addr_mask;
+
+ assert(notifier->end >= notifier->start && entry_end >= entry->iova);


then as there's a following filter
if (notifier->start > entry_end || notifier->end < entry->iova) {
return;
}

we can conclude that

entry_end >= entry->iova (tmp.iova)
entry_end >= notifier->start,
--> entry_end >= MAX(tmp.iova, notifier->start)
--> entry_end >= tmp.iova


notifier->end >= entry->iova (tmp.iova),
notifier->end >= notifier->start,
--> notifier->end >= MAX(tmp.iova, notifier->start)
--> notifier->end >= tmp.iova

==> MIN(entry_end, notifier->end) >= tmp.iova

Thanks
Yan
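
Putting the pieces of this thread together (Peter's ARBITRARY_MASK branch,
the corrected addr_mask from the NIT, and the overflow assert), the clamped
path in memory_region_notify_one() would look roughly like the sketch below;
this is an illustration, not the merged patch:

    hwaddr entry_end = entry->iova + entry->addr_mask;
    IOMMUTLBEntry tmp = *entry;

    /* both ranges must be well formed before reasoning about overlap */
    assert(notifier->end >= notifier->start && entry_end >= entry->iova);

    /* skip notifications that do not overlap the notifier's range */
    if (notifier->start > entry_end || notifier->end < entry->iova) {
        return;
    }

    if (notifier->notifier_flags & IOMMU_NOTIFIER_ARBITRARY_MASK) {
        /* clamp to the notifier's range; addr_mask becomes a plain length */
        tmp.iova = MAX(tmp.iova, notifier->start);
        tmp.addr_mask = MIN(entry_end, notifier->end) - tmp.iova;
    } else {
        assert(entry->iova >= notifier->start && entry_end <= notifier->end);
    }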



Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-06-27 Thread Yan Zhao
On Fri, Jun 26, 2020 at 05:29:17PM -0400, Peter Xu wrote:
> Hi, Eugenio,
> 
> (CCing Eric, Yan and Michael too)

> On Fri, Jun 26, 2020 at 08:41:22AM +0200, Eugenio Pérez wrote:
> > diff --git a/memory.c b/memory.c
> > index 2f15a4b250..7f789710d2 100644
> > --- a/memory.c
> > +++ b/memory.c
> > @@ -1915,8 +1915,6 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
> >  return;
> >  }
> >  
> > -assert(entry->iova >= notifier->start && entry_end <= notifier->end);
> 
> I can understand removing the assertion should solve the issue, however imho
> the major issue is not about this single assertion but the whole addr_mask
> issue behind with virtio...
Yes, the background for this assertion is
https://lists.gnu.org/archive/html/qemu-devel/2019-06/msg04218.html


> 
> For normal IOTLB invalidations, we were trying our best to always make
> IOMMUTLBEntry contain a valid addr_mask to be 2**N-1.  E.g., that's what we're
> doing with the loop in vtd_address_space_unmap().
> 
> But this is not the first time that we may want to break this assumption for
> virtio so that we make the IOTLB a tuple of (start, len), then that len can be
> not a address mask any more.  That seems to be more efficient for things like
> vhost because iotlbs there are not page based, so it'll be inefficient if we
> always guarantee the addr_mask because it'll be quite a lot more roundtrips of
> the same range of invalidation.  Here we've encountered another issue of
> triggering the assertion with virtio-net, but only with the old RHEL7 guest.
> 
> I'm thinking whether we can make the IOTLB invalidation configurable by
> specifying whether the backend of the notifier can handle arbitary address
> range in some way.  So we still have the guaranteed addr_masks by default
> (since I still don't think totally break the addr_mask restriction is 
> wise...),
> however we can allow the special backends to take adavantage of using arbitary
> (start, len) ranges for reasons like performance.
> 
> To do that, a quick idea is to introduce a flag IOMMU_NOTIFIER_ARBITRARY_MASK
> to IOMMUNotifierFlag, to declare that the iommu notifier (and its backend) can
> take arbitrary address mask, then it can be any value and finally becomes a
> length rather than an addr_mask.  Then for every iommu notify() we can 
> directly
> deliver whatever we've got from the upper layer to this notifier.  With the 
> new
> flag, vhost can do iommu_notifier_init() with UNMAP|ARBITRARY_MASK so it
> declares this capability.  Then no matter for device iotlb or normal iotlb, we
> skip the complicated procedure to split a big range into small ranges that are
> with strict addr_mask, but directly deliver the message to the iommu notifier.
> E.g., we can skip the loop in vtd_address_space_unmap() if the notifier is 
> with
> ARBITRARY flag set.
> 
> Then, the assert() is not accurate either, and may become something like:
> 
> diff --git a/memory.c b/memory.c
> index 2f15a4b250..99d0492509 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1906,6 +1906,7 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
>  {
>  IOMMUNotifierFlag request_flags;
>  hwaddr entry_end = entry->iova + entry->addr_mask;
> +IOMMUTLBEntry tmp = *entry;
> 
>  /*
>   * Skip the notification if the notification does not overlap
> @@ -1915,7 +1916,13 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
>  return;
>  }
> 
> -assert(entry->iova >= notifier->start && entry_end <= notifier->end);
> +if (notifier->notifier_flags & IOMMU_NOTIFIER_ARBITRARY_MASK) {
> +tmp.iova = MAX(tmp.iova, notifier->start);
> +tmp.addr_mask = MIN(tmp.addr_mask, notifier->end);
NIT:
   tmp.addr_mask = MIN(entry_end, notifier->end) - tmp.iova;
> +assert(tmp.iova <= tmp.addr_mask);
no this assertion then.

Thanks
Yan
   
> +} else {
> +assert(entry->iova >= notifier->start && entry_end <= notifier->end);
> +}
> 
>  if (entry->perm & IOMMU_RW) {
>  request_flags = IOMMU_NOTIFIER_MAP;
> @@ -1924,7 +1931,7 @@ void memory_region_notify_one(IOMMUNotifier *notifier,
>  }
> 
>  if (notifier->notifier_flags & request_flags) {
> -notifier->notify(notifier, entry);
> +notifier->notify(notifier, );
>  }
>  }
> 
> Then we can keep the assert() for e.g. vfio, however vhost can skip it and 
> even
> get some further performance boosts..  Does that make sense?
> 
> Thanks,
> 
> -- 
> Peter Xu
> 
> 



Re: [PATCH v5 0/4] introduction of migration_version attribute for VFIO live migration

2020-06-21 Thread Yan Zhao
On Fri, Jun 19, 2020 at 04:40:46PM -0600, Alex Williamson wrote:
> On Tue, 9 Jun 2020 20:37:31 -0400
> Yan Zhao  wrote:
> 
> > On Fri, Jun 05, 2020 at 03:39:50PM +0100, Dr. David Alan Gilbert wrote:
> > > > > > I tried to simplify the problem a bit, but we keep going backwards. 
> > > > > >  If
> > > > > > the requirement is that potentially any source device can migrate 
> > > > > > to any
> > > > > > target device and we cannot provide any means other than writing an
> > > > > > opaque source string into a version attribute on the target and
> > > > > > evaluating the result to determine compatibility, then we're 
> > > > > > requiring
> > > > > > userspace to do an exhaustive search to find a potential match.  
> > > > > > That
> > > > > > sucks. 
> > > > >  
> > hi Alex and Dave,
> > do you think it's good for us to put aside physical devices and mdev 
> > aggregation
> > for the moment, and use Alex's original idea that
> > 
> > +  Userspace should regard two mdev devices compatible when ALL of below
> > +  conditions are met:
> > +  (0) The mdev devices are of the same type
> > +  (1) success when reading from migration_version attribute of one mdev 
> > device.
> > +  (2) success when writing migration_version string of one mdev device to
> > +  migration_version attribute of the other mdev device.
> 
> I think Pandora's box is already opened, if we can't articulate how
> this solution would evolve to support features that we know are coming,
> why should we proceed with this approach?  We've already seen interest
> in breaking rule (0) in this thread, so we can't focus the solution on
> mdev devices.
> 
> Maybe the best we can do is to compare one instance of a device to
> another instance of a device, without any capability to predict
> compatibility prior to creating devices, in the case on mdev.  The
> string would need to include not only the device and vendor driver
> compatibility, but also anything that has modified the state of the
> device, such as creation time or post-creation time configuration.  The
> user is left on their own for creating a compatible device, or
> filtering devices to determine which might be, or which might generate,
> compatible devices.  It's not much of a solution, I wonder if anyone
> would even use it.
> 
> > and what about adding another sysfs attribute for vendors to put
> > recommended migration compatible device type. e.g.
> > #cat 
> > /sys/bus/pci/devices/0000:00:02.0/mdev_supported_types/i915-GVTg_V5_8/migration_compatible_devices
> > parent id: 8086 591d
> > mdev_type: i915-GVTg_V5_8
> > 
> > vendors are free to define the format and conent of this 
> > migration_compatible_devices
> > and it's even not to be a full list.
> > 
> > before libvirt or user to do live migration, they have to read and test
> > migration_version attributes of src/target devices to check migration 
> > compatibility.
> 
> AFAICT, free-form, vendor defined attributes are useless to libvirt.
> Vendors could already put this information in the description attribute
> and have it ignored by userspace tools due to the lack of defined
> format.  It's also not clear what value this provides when it's
> necessarily incomplete, a driver written today cannot know what future
> drivers might be compatible with its migration data.  Thanks,
>
hi Alex
maybe the problem can be divided into two pieces:
(1) how to create/locate two migration compatible devices. For normal
users, the most common and safest way to do it is to find an exact duplicate
of the source device. so for mdev, it's probably to create a target mdev
of the same parent pci id, mdev type and creation parameters as the
source mdev; and for physical devices, it's to locate a target device of the
same pci id as the source device, plus some extra constraints (e.g. the
target NVMe device is configured to the same remote device as the source
NVMe device; or the target QAT device supports an encryption algorithm set
equal to that of the source QAT device...).
I think a possible solution for this piece is to let vendor drivers provide a
creating/locating script to find such an exact duplicate of the source device.
Then before libvirt is about to do live migration, it can use this script to
create a target vm with an exactly duplicated configuration of the source vm.

(2) how to identify that two devices are migration compatible after they are
created, even when they are not exactly identical (e.g. their parent
devices have minor differences in hardware SKUs). This identification is
necessary even 

Re: [PATCH v5 0/4] introduction of migration_version attribute for VFIO live migration

2020-06-09 Thread Yan Zhao
On Fri, Jun 05, 2020 at 03:39:50PM +0100, Dr. David Alan Gilbert wrote:
> > > > I tried to simplify the problem a bit, but we keep going backwards.  If
> > > > the requirement is that potentially any source device can migrate to any
> > > > target device and we cannot provide any means other than writing an
> > > > opaque source string into a version attribute on the target and
> > > > evaluating the result to determine compatibility, then we're requiring
> > > > userspace to do an exhaustive search to find a potential match.  That
> > > > sucks.   
> > >
hi Alex and Dave,
do you think it's good for us to put aside physical devices and mdev aggregation
for the moment, and use Alex's original idea that

+  Userspace should regard two mdev devices compatible when ALL of below
+  conditions are met:
+  (0) The mdev devices are of the same type
+  (1) success when reading from migration_version attribute of one mdev device.
+  (2) success when writing migration_version string of one mdev device to
+  migration_version attribute of the other mdev device.

and what about adding another sysfs attribute for vendors to put a
recommended migration compatible device type. e.g.
#cat 
/sys/bus/pci/devices/0000:00:02.0/mdev_supported_types/i915-GVTg_V5_8/migration_compatible_devices
parent id: 8086 591d
mdev_type: i915-GVTg_V5_8

vendors are free to define the format and content of this
migration_compatible_devices attribute,
and it doesn't even have to be a full list.

before libvirt or a user does live migration, they have to read and test the
migration_version attributes of the src/target devices to check migration
compatibility.

Thanks
Yan


> > > Why is the mechanism a 'write and test' why isn't it a 'write and ask'?
> > > i.e. the destination tells the driver what type it's received from the
> > > source, and the driver replies with a set of compatible configurations
> > > (in some preferred order).
> > 
> > A 'write and ask' interface would imply some sort of session in order
> > to not be racy with concurrent users.  More likely this would imply an
> > ioctl interface, which I don't think we have in sysfs.  Where do we
> > host this ioctl?
> 
> Or one fd?
>   f=open()
>   write(f, "The ID I want")
>   do {
>  read(f, ...)  -> The IDs we're offering that are compatible
>   } while (!eof)
> 
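
A rough userspace rendering of the "write and ask" sketch above; the
attribute path, the O_RDWR semantics and the one-ID-per-read behaviour are
purely assumptions for illustration, as no such interface exists:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void ask_compatible_ids(const char *attr, const char *wanted_id)
{
    char id[256];
    ssize_t n;
    int fd;

    fd = open(attr, O_RDWR);
    if (fd < 0)
        return;

    /* tell the driver which source ID the destination has received */
    if (write(fd, wanted_id, strlen(wanted_id)) < 0) {
        close(fd);
        return;
    }

    /* the driver answers with compatible IDs, in preferred order */
    while ((n = read(fd, id, sizeof(id) - 1)) > 0) {
        id[n] = '\0';
        printf("compatible: %s\n", id);
    }
    close(fd);
}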
> > > It's also not clear to me why the name has to be that opaque;
> > > I agree it's only got to be understood by the driver but that doesn't
> > > seem to be a reason for the driver to make it purposely obfuscated.
> > > I wouldn't expect a user to be able to parse it necessarily; but would
> > > expect something that would be useful for an error message.
> > 
> > If the name is not opaque, then we're going to rat hole on the format
> > and the fields and evolving that format for every feature a vendor
> > decides they want the user to be able to parse out of the version
> > string.  Then we require a full specification of the string in order
> > that it be parsed according to a standard such that we don't break
> > users inferring features in subtly different ways.
> > 
> > This is a lot like the problems with mdev description attributes,
> > libvirt complains they can't use description because there's no
> > standard formatting, but even with two vendors describing the same class
> > of device we don't have an agreed set of things to expose in the
> > description attribute.  Thanks,
> 
> I'm not suggesting anything in anyway machine parsable; just something
> human readable that you can present in a menu/choice/configuration/error
> message.  The text would be down to the vendor, and I'd suggest it start
> with the vendor name just as a disambiguator and to make it obvious when
> we get it grossly wrong.
> 
> Dave
> 
> > Alex
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
> 
> ___
> intel-gvt-dev mailing list
> intel-gvt-...@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gvt-dev



Re: [PATCH v5 0/4] introduction of migration_version attribute for VFIO live migration

2020-06-02 Thread Yan Zhao
On Tue, Jun 02, 2020 at 09:55:28PM -0600, Alex Williamson wrote:
> On Tue, 2 Jun 2020 23:19:48 -0400
> Yan Zhao  wrote:
> 
> > On Tue, Jun 02, 2020 at 04:55:27PM -0600, Alex Williamson wrote:
> > > On Wed, 29 Apr 2020 20:39:50 -0400
> > > Yan Zhao  wrote:
> > >   
> > > > On Wed, Apr 29, 2020 at 05:48:44PM +0800, Dr. David Alan Gilbert wrote:
> > > >   
> > > > > > > > > > > > > > > > > An mdev type is meant to define a software compatible interface, so in
> > > > > > > > > > > > > > > > > the case of mdev->mdev migration, doesn't migrating to a different type
> > > > > > > > > > > > > > > > > fail the most basic of compatibility tests that we expect userspace to
> > > > > > > > > > > > > > > > > perform?  IOW, if two mdev types are migration compatible, it seems a
> > > > > > > > > > > > > > > > > prerequisite to that is that they provide the same software interface,
> > > > > > > > > > > > > > > > > which means they should be the same mdev type.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > In the hybrid cases of mdev->phys or phys->mdev, how does a management
> > > > > > > > > > > > > > > > > tool begin to even guess what might be compatible?  Are we expecting
> > > > > > > > > > > > > > > > > libvirt to probe every device with this attribute in the system?  Is
> > > > > > > > > > > > > > > > > there going to be a new class hierarchy created to enumerate all
> > > > > > > > > > > > > > > > > possible migrate-able devices?
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > yes, management tool needs to guess and test migration compatibility
> > > > > > > > > > > > > > > > between two devices. But I think it's not a problem only for
> > > > > > > > > > > > > > > > mdev->phys or phys->mdev. Even for mdev->mdev, management tool needs to
> > > > > > > > > > > > > > > > first assume that the two mdevs have the same type of parent devices
> > > > > > > > > > > > > > > > (e.g. their pciids are equal). Otherwise, it's still enumerating
> > > > > > > > > > > > > > > > possibilities.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > on the other hand, for two mdevs,
> > > > > > > > > > > > > > > > mdev1 from pdev1, its mdev_type is 1/2 of pdev1;
> > > > > > > > > > > > > > > > mdev2 from pdev2, its mdev_type is 1/4 of pdev2;
> > > > > > > > > > > > > > > > if pdev2 is exactly 2

Re: [PATCH v5 0/4] introduction of migration_version attribute for VFIO live migration

2020-06-02 Thread Yan Zhao
On Tue, Jun 02, 2020 at 04:55:27PM -0600, Alex Williamson wrote:
> On Wed, 29 Apr 2020 20:39:50 -0400
> Yan Zhao  wrote:
> 
> > On Wed, Apr 29, 2020 at 05:48:44PM +0800, Dr. David Alan Gilbert wrote:
> > 
> > > > > > > > > > > > > > > An mdev type is meant to define a software compatible interface, so in
> > > > > > > > > > > > > > > the case of mdev->mdev migration, doesn't migrating to a different type
> > > > > > > > > > > > > > > fail the most basic of compatibility tests that we expect userspace to
> > > > > > > > > > > > > > > perform?  IOW, if two mdev types are migration compatible, it seems a
> > > > > > > > > > > > > > > prerequisite to that is that they provide the same software interface,
> > > > > > > > > > > > > > > which means they should be the same mdev type.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > In the hybrid cases of mdev->phys or phys->mdev, how does a management
> > > > > > > > > > > > > > > tool begin to even guess what might be compatible?  Are we expecting
> > > > > > > > > > > > > > > libvirt to probe every device with this attribute in the system?  Is
> > > > > > > > > > > > > > > there going to be a new class hierarchy created to enumerate all
> > > > > > > > > > > > > > > possible migrate-able devices?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > yes, management tool needs to guess and test migration compatibility
> > > > > > > > > > > > > > between two devices. But I think it's not a problem only for
> > > > > > > > > > > > > > mdev->phys or phys->mdev. Even for mdev->mdev, management tool needs to
> > > > > > > > > > > > > > first assume that the two mdevs have the same type of parent devices
> > > > > > > > > > > > > > (e.g. their pciids are equal). Otherwise, it's still enumerating
> > > > > > > > > > > > > > possibilities.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > on the other hand, for two mdevs,
> > > > > > > > > > > > > > mdev1 from pdev1, its mdev_type is 1/2 of pdev1;
> > > > > > > > > > > > > > mdev2 from pdev2, its mdev_type is 1/4 of pdev2;
> > > > > > > > > > > > > > if pdev2 is exactly 2 times of pdev1, why not allow migration between
> > > > > > > > > > > > > > mdev1 <-> mdev2.
> > > > > > > > > > > > >
> > > > > > > > > > > > > How could the management tool figure out that 1/2 of pdev1 is equivalent
> > > > > > > > > > > > > to 1/4 of pdev2? If we really want to allow su

Re: [PATCH Kernel v22 0/8] Add UAPIs to support migration for VFIO devices

2020-05-28 Thread Yan Zhao
On Thu, May 28, 2020 at 04:59:06PM -0600, Alex Williamson wrote:
> On Wed, 27 May 2020 09:48:22 +0100
> "Dr. David Alan Gilbert"  wrote:
> > * Yan Zhao (yan.y.z...@intel.com) wrote:
> > > BTW, for viommu, the downtime data is as below, under the same network
> > > condition and guest memory size, and with no running dirty data/memory
> > > produced by the device.
> > > (1) viommu off
> > > single-round dirty query: downtime ~100ms   
> > 
> > Fine.
> > 
> > > (2) viommu on
> > > single-round dirty query: downtime 58s   
> > 
> > Youch.
> 
> Double Youch!  But we believe this is because we're getting the dirty
> bitmap one IOMMU leaf page at a time, right?  We've enabled the kernel
> to get a dirty bitmap across multiple mappings, but QEMU isn't yet
> taking advantage of it.  Do I have this correct?  Thanks,
>
Yes, I think so, but I haven't looked into it yet.

Thanks
Yan
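
To illustrate the difference being described, a rough sketch (not QEMU
code; query_dirty() is a stand-in for the VFIO_IOMMU_DIRTY_PAGES
GET_BITMAP ioctl): with a vIOMMU every guest IO page is its own mapping,
so a naive sync pays one ioctl per 4K page, while the kernel already
accepts one query spanning a large range.

    typedef unsigned long long hwaddr;   /* stand-in for QEMU's hwaddr */

    static void query_dirty(int container, hwaddr iova, hwaddr size)
    {
        /* stand-in for ioctl(container, VFIO_IOMMU_DIRTY_PAGES, ...) */
    }

    /* one call per 4K leaf mapping: the ~58s pattern */
    static void sync_dirty_per_page(int container, hwaddr start, hwaddr end)
    {
        for (hwaddr iova = start; iova < end; iova += 4096)
            query_dirty(container, iova, 4096);
    }

    /* one call per contiguous range: what the kernel already permits */
    static void sync_dirty_batched(int container, hwaddr start, hwaddr end)
    {
        query_dirty(container, start, end - start);
    }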



Re: [PATCH Kernel v22 0/8] Add UAPIs to support migration for VFIO devices

2020-05-28 Thread Yan Zhao
> > > This is my understanding of the protocol as well, when the device is
> > > running, pending_bytes might drop to zero if no internal state has
> > > changed and may be non-zero on the next iteration due to device
> > > activity.  When the device is not running, pending_bytes reporting zero
> > > indicates the device is done, there is no further state to transmit.
> > > Does that meet your need/expectation?
> > >
> > (1) on one side, as in vfio_save_pending(),
> > vfio_save_pending()
> > {
> > ...
> > ret = vfio_update_pending(vbasedev);
> > ...
> > *res_precopy_only += migration->pending_bytes;
> > ...
> > }
> > the pending_bytes tells the migration thread how much data is still held
> > on the device side.
> > the device data includes
> > device internal data + running device dirty data + device state.
> > 
> > so the pending_bytes should include device state as well, right?
> > if so, the pending_bytes should never reach 0 if there's any device
> > state to be sent after device is stopped.
> 
> I hadn't expected the pending-bytes to include a fixed offset for device
> state (If you mean a few registers etc) - I'd expect pending to drop
> possibly to zero; the heuristic as to when to switch from iteration to
> stop is based on the total pending across all iterated devices; so it's
> got to be allowed to drop, otherwise you'll never transition to stop.
> 
ok. got it.

> > (2) on the other side,
> > after we updated the pending_bytes in vfio_save_pending() and
> > entered vfio_save_iterate(), if we repeatedly update
> > pending_bytes in vfio_save_iterate(), it would run into a scenario
> > like
> > 
> > initially pending_bytes=500M.
> > vfio_save_iterate() -->
> >   round 1: transmitted 500M.
> >   round 2: update pending bytes, pending_bytes=50M (50M dirty data).
> >   round 3: update pending bytes, pending_bytes=50M.
> >   ...
> >   round N: update pending bytes, pending_bytes=50M.
> > 
> > If there're two vfio devices, the vfio_save_iterate() for the second device
> > may never get a chance to be called because there's always pending_bytes
> > produced by the first device, even if the size is small.
> 
> And between RAM and the vfio devices?

yes, is that right?

Thanks
Yan
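
For context, the starvation Yan describes comes from the migration core's
iterate loop, which stops at the first handler that reports it has not
finished. The sketch below is a paraphrase of qemu_savevm_state_iterate(),
not the verbatim QEMU source:

    /* Paraphrase: a handler returning 0 means "not finished yet" and
     * breaks the loop, so handlers registered after it (RAM, a second
     * VFIO device) get no time in this iteration.  If pending_bytes is
     * refreshed every round and never reaches 0, everything behind the
     * device starves. */
    int savevm_state_iterate_sketch(QEMUFile *f)
    {
        SaveStateEntry *se;
        int ret = 1;

        QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
            if (!se->ops || !se->ops->save_live_iterate) {
                continue;
            }
            ret = se->ops->save_live_iterate(f, se->opaque);
            if (ret <= 0) {
                /* do not proceed to the next vmstate before this one
                 * reports completion of its current stage */
                break;
            }
        }
        return ret;
    }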



Re: [PATCH v6 1/3] memory: drop guest writes to read-only ram device regions

2020-05-28 Thread Yan Zhao
On Thu, May 28, 2020 at 07:10:46AM +0200, Paolo Bonzini wrote:
> On 28/05/20 06:35, Yan Zhao wrote:
> > On Tue, May 26, 2020 at 10:26:35AM +0100, Peter Maydell wrote:
> >> On Mon, 25 May 2020 at 11:20, Paolo Bonzini  wrote:
> >>> Not all of them, only those that need to return MEMTX_ERROR.  I would
> >>> like some guidance from Peter as to whether (or when) reads from ROMs
> >>> should return MEMTX_ERROR.  This way, we can use that information to
> > > decide what the read-only ram-device regions should do.
> >>
> >> In general I think writes to ROMs (and indeed reads from ROMs) should
> >> not return MEMTX_ERROR. I think that in real hardware you could have
> >> a ROM that behaved either way; so our default behaviour should probably
> >> be to do what we've always done and not report a MEMTX_ERROR. (If we
> >> needed to I suppose we should implement a MEMTX_ERROR-reporting ROM,
> >> but to be honest there aren't really many real ROMs in systems these
> >> days: it's more often flash, whose response to writes is defined
> >> by the spec and is I think to ignore writes which aren't the
> >> magic "shift to program-the-flash-mode" sequence.)
> >>
> > then should I just drop the writes to read-only ram-device regions and
> > vfio regions without returning MEMTX_ERROR?
> > do you think it's good?
> 
> I am not really sure, I have to think more about it.  I think read-only
> RAMD regions are slightly different because the guest can expect "magic"
> behavior from RAMD regions (e.g. registers that trigger I/O on writes)
> that are simply not there for ROM.  So I'm still inclined to queue your
> v6 patch series.
> 
ok. thank you Paolo. :) 



Re: [PATCH Kernel v23 0/8] Add UAPIs to support migration for VFIO devices

2020-05-27 Thread Yan Zhao


The whole series works for us in general:
Reviewed-by: Yan Zhao 

On Wed, May 20, 2020 at 11:38:00PM +0530, Kirti Wankhede wrote:
> Hi,
> 
> This patch set adds:
> * IOCTL VFIO_IOMMU_DIRTY_PAGES to get dirty pages bitmap with
>   respect to IOMMU container rather than per device. All pages pinned by
>   vendor driver through vfio_pin_pages external API have to be marked as
>   dirty during migration. When IOMMU capable device is present in the
>   container and all pages are pinned and mapped, then all pages are marked
>   dirty.
>   When there are CPU writes, CPU dirty page tracking can identify dirtied
>   pages, but any page pinned by vendor driver can also be written by
>   device. As of now there is no device which has hardware support for
>   dirty page tracking. So all pages which are pinned should be considered
>   as dirty.
>   This ioctl is also used to start/stop dirty pages tracking for pinned and
>   unpinned pages while migration is active.
> 
> * Updated IOCTL VFIO_IOMMU_UNMAP_DMA to get dirty pages bitmap before
>   unmapping IO virtual address range.
>   With vIOMMU, during pre-copy phase of migration, while CPUs are still
>   running, IO virtual address unmap can happen while device still keeping
>   reference of guest pfns. Those pages should be reported as dirty before
>   unmap, so that VFIO user space application can copy content of those
>   pages from source to destination.
> 
> * Patch 8 detects if IOMMU capable device driver is smart enough to report pages
>   to be marked dirty by pinning pages using vfio_pin_pages() API.
> 
> 
> Yet TODO:
> Since there is no device which has hardware support for system memory
> dirty bitmap tracking, right now there is no other API from vendor driver
> to VFIO IOMMU module to report dirty pages. In future, when such hardware
> support will be implemented, an API will be required such that vendor
> driver could report dirty pages to VFIO module during migration phases.
> 
> v22 -> v23
> - Fixed issue reported by Yan
> https://lore.kernel.org/kvm/97977ede-3c5b-c5a5-7858-7eecd7dd5...@nvidia.com/
> - Fixed nit picks suggested by Cornelia
> 
> v21 -> v22
> - Fixed issue raised by Alex :
> https://lore.kernel.org/kvm/20200515163307.72951...@w520.home/
> 
> v20 -> v21
> - Added check for GET_BITMAP ioctl for vfio_dma boundaries.
> - Updated unmap ioctl function - as suggested by Alex.
> - Updated comments in DIRTY_TRACKING ioctl definition - as suggested by
>   Cornelia.
> 
> v19 -> v20
> - Fixed ioctl to get dirty bitmap to get bitmap of multiple vfio_dmas
> - Fixed unmap ioctl to get dirty bitmap of multiple vfio_dmas.
> - Removed flag definition from migration capability.
> 
> v18 -> v19
> - Updated migration capability with supported page sizes bitmap for dirty
>   page tracking and  maximum bitmap size supported by kernel module.
> - Added patch to calculate and cache pgsize_bitmap when iommu->domain_list
>   is updated.
> - Removed extra buffers added in previous version for bitmap manipulation
>   and optimised the code.
> 
> v17 -> v18
> - Add migration capability to the capability chain for VFIO_IOMMU_GET_INFO
>   ioctl
> - Updated UMAP_DMA ioctl to return bitmap of multiple vfio_dma
> 
> v16 -> v17
> - Fixed errors reported by kbuild test robot  on i386
> 
> v15 -> v16
> - Minor edits and nit picks (Auger Eric)
> - On copying bitmap to user, re-populated bitmap only for pinned pages,
>   excluding unmapped pages and CPU dirtied pages.
> - Patches are on tag: next-20200318 and 1-3 patches from Yan's series
>   https://lkml.org/lkml/2020/3/12/1255
> 
> v14 -> v15
> - Minor edits and nit picks.
> - In the verification of user allocated bitmap memory, added check of
>maximum size.
> - Patches are on tag: next-20200318 and 1-3 patches from Yan's series
>   https://lkml.org/lkml/2020/3/12/1255
> 
> v13 -> v14
> - Added struct vfio_bitmap to kabi. updated structure
>   vfio_iommu_type1_dirty_bitmap_get and vfio_iommu_type1_dma_unmap.
> - All small changes suggested by Alex.
> - Patches are on tag: next-20200318 and 1-3 patches from Yan's series
>   https://lkml.org/lkml/2020/3/12/1255
> 
> v12 -> v13
> - Changed bitmap allocation in vfio_iommu_type1 to per vfio_dma
> - Changed VFIO_IOMMU_DIRTY_PAGES ioctl behaviour to be per vfio_dma range.
> - Changed vfio_iommu_type1_dirty_bitmap structure to have separate data
>   field.
> 
> v11 -> v12
> - Changed bitmap allocation in vfio_iommu_type1.
> - Remove atomicity of ref_count.
> - Updated comments for migration device state structure about error
>   reporting.
> - Nit picks from v11 reviews
> 
> v10 -> v11
> - Fix pin pages AP

Re: [PATCH v6 1/3] memory: drop guest writes to read-only ram device regions

2020-05-27 Thread Yan Zhao
On Tue, May 26, 2020 at 10:26:35AM +0100, Peter Maydell wrote:
> On Mon, 25 May 2020 at 11:20, Paolo Bonzini  wrote:
> > Not all of them, only those that need to return MEMTX_ERROR.  I would
> > like some guidance from Peter as to whether (or when) reads from ROMs
> > should return MEMTX_ERROR.  This way, we can use that information to
> > decide what the read-only ram-device regions should do.
> 
> In general I think writes to ROMs (and indeed reads from ROMs) should
> not return MEMTX_ERROR. I think that in real hardware you could have
> a ROM that behaved either way; so our default behaviour should probably
> be to do what we've always done and not report a MEMTX_ERROR. (If we
> needed to I suppose we should implement a MEMTX_ERROR-reporting ROM,
> but to be honest there aren't really many real ROMs in systems these
> days: it's more often flash, whose response to writes is defined
> by the spec and is I think to ignore writes which aren't the
> magic "shift to program-the-flash-mode" sequence.)
>
then should I just drop the writes to read-only ram-device regions and
vfio regions without returning MEMTX_ERROR?
do you think it's good?

Thanks
Yan



Re: [PATCH Kernel v22 0/8] Add UAPIs to support migration for VFIO devices

2020-05-27 Thread Yan Zhao
On Tue, May 26, 2020 at 02:19:39PM -0600, Alex Williamson wrote:
> On Mon, 25 May 2020 18:50:54 +0530
> Kirti Wankhede  wrote:
> 
> > On 5/25/2020 12:29 PM, Yan Zhao wrote:
> > > On Tue, May 19, 2020 at 10:58:04AM -0600, Alex Williamson wrote:  
> > >> Hi folks,
> > >>
> > >> My impression is that we're getting pretty close to a workable
> > >> implementation here with v22 plus respins of patches 5, 6, and 8.  We
> > >> also have a matching QEMU series and a proposal for a new i40e
> > >> consumer, as well as I assume GVT-g updates happening internally at
> > >> Intel.  I expect all of the latter needs further review and discussion,
> > >> but we should be at the point where we can validate these proposed
> > >> kernel interfaces.  Therefore I'd like to make a call for reviews so
> > >> that we can get this wrapped up for the v5.8 merge window.  I know
> > >> Connie has some outstanding documentation comments and I'd like to make
> > >> sure everyone has an opportunity to check that their comments have been
> > >> addressed and we don't discover any new blocking issues.  Please send
> > >> your Acked-by/Reviewed-by/Tested-by tags if you're satisfied with this
> > >> interface and implementation.  Thanks!
> > >>  
> > > hi Alex
> > > after porting gvt/i40e vf migration code to kernel/qemu v23, we spotted
> > > two bugs.
> > > 1. "Failed to get dirty bitmap for iova: 0xfe011000 size: 0x3fb0 err: 22"
> > > This is a qemu bug that the dirty bitmap query range is not the same
> > > as the dma map range. It can be fixed in qemu, and I just have a little
> > > concern for kernel to have this restriction.
> > >   
> > 
> > I never saw this unaligned size in my testing. In this case if you can 
> > provide vfio_* event traces, that will be helpful.
> 
> Yeah, I'm curious why we're hitting such a call path, I think we were
> designing this under the assumption we wouldn't see these.  I also
that's because the algorithm for getting the dirty bitmap query range still
does not exactly match the dma map range in vfio_dma_map().


> wonder if we really need to enforce the dma mapping range for getting
> the dirty bitmap with the current implementation (unmap+dirty obviously
> still has the restriction).  We do shift the bitmap in place for
> alignment, but I'm not sure why we couldn't shift it back and only
> clear the range that was reported.  Kirti, do you see other issues?  I
> think a patch to lift that restriction is something we could plan to
> include after the initial series is included and before we've committed
> to the uapi at the v5.8 release.
>  
> > > 2. migration abortion, reporting
> > > "qemu-system-x86_64-lm: vfio_load_state: Error allocating buffer
> > > qemu-system-x86_64-lm: error while loading state section id 49(vfio)
> > > qemu-system-x86_64-lm: load of migration failed: Cannot allocate memory"
> > > 
> > > It's still a qemu bug and we can fixed it by
> > > "
> > > if (migration->pending_bytes == 0) {
> > > +qemu_put_be64(f, 0);
> > > +qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> > > "  
> > 
> > In which function in QEMU do you have to add this?
> 
> I think this is relative to QEMU path 09/ where Yan had the questions
> below on v16 and again tried to get answers to them on v22:
> 
> https://lore.kernel.org/qemu-devel/20200520031323.GB10369@joy-OptiPlex-7040/
> 
> Kirti, please address these questions.
> 
> > > and actually there are some extra concerns about this part, as reported in
> > > [1][2].
> > > 
> > > [1] data_size should be read ahead of data_offset
> > > https://lists.gnu.org/archive/html/qemu-devel/2020-05/msg02795.html.
> > > [2] should not repeatedly update pending_bytes in vfio_save_iterate()
> > > https://lists.gnu.org/archive/html/qemu-devel/2020-05/msg02796.html.
> > > 
> > > but as those errors are all in qemu, and we have finished basic tests in
> > > both gvt & i40e, we're fine with the kernel part interface in general now.
> > > (except for my concern [1], which needs to update kernel patch 1)
> > >   
> > 
> >  >> what if pending_bytes is not 0, but vendor driver just does not want to
> >  >> send data in this iteration? isn't it right to get data_size first
> >  >> before getting data_offset?
> > 
> > If vendor driver 

Re: [PATCH v6 1/3] memory: drop guest writes to read-only ram device regions

2020-05-25 Thread Yan Zhao
On Mon, May 25, 2020 at 01:04:56PM +0200, Paolo Bonzini wrote:
> On 25/05/20 12:54, Philippe Mathieu-Daudé wrote:
> >> Not all of them, only those that need to return MEMTX_ERROR.  I would
> >> like some guidance from Peter as to whether (or when) reads from ROMs
> >> should return MEMTX_ERROR.  This way, we can use that information to
> >> decide what the read-only ram-device regions should do.
> > Is it only device-specific or might it be partly arch/machine-specific
> > (depending on the bus it is mapped)?
> 
> Good point, I think that could be handled by propagating the error up in
> the memory region hierarchy (i.e. the cached AddressSpaceDispatch radix
> tree is used in the common case, but when you have a failure you
> percolate it up through the whole hierarchy since that's not a fast path).
> 
>
but if we decide to propagate the error up by providing
ops->write_with_attrs, then we have to remove ops->write correspondingly,
as in

memory_region_dispatch_write()
{
...
if (mr->ops->write) {
return access_with_adjusted_size(addr, &data, size,
 mr->ops->impl.min_access_size,
 mr->ops->impl.max_access_size,
 memory_region_write_accessor, mr,
 attrs);
} else {
return access_with_adjusted_size(addr, &data, size,
 mr->ops->impl.min_access_size,
 mr->ops->impl.max_access_size,
 memory_region_write_with_attrs_accessor,
 mr, attrs);
}
...
}

so which regions should keep ops->write and which regions should not?

Thanks
Yan
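
For what it's worth, the read-only ram-device direction under discussion
can be sketched as a single .write_with_attrs hook. The helper name
memory_region_ram_device_write() matches the existing accessor in
memory.c; the rest is illustrative, not a posted patch:

    static MemTxResult ram_device_write_with_attrs(void *opaque, hwaddr addr,
                                                   uint64_t data, unsigned size,
                                                   MemTxAttrs attrs)
    {
        MemoryRegion *mr = opaque;

        if (mr->readonly) {
            /* drop the guest write; return MEMTX_ERROR here instead if
             * the platform should observe a bus error */
            return MEMTX_OK;
        }
        memory_region_ram_device_write(opaque, addr, data, size);
        return MEMTX_OK;
    }

With only .write_with_attrs populated, memory_region_dispatch_write()
takes the second branch quoted above, and the error (when returned) can
percolate up the hierarchy as Paolo describes.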




Re: [PATCH Kernel v22 0/8] Add UAPIs to support migration for VFIO devices

2020-05-25 Thread Yan Zhao
On Tue, May 19, 2020 at 10:58:04AM -0600, Alex Williamson wrote:
> Hi folks,
> 
> My impression is that we're getting pretty close to a workable
> implementation here with v22 plus respins of patches 5, 6, and 8.  We
> also have a matching QEMU series and a proposal for a new i40e
> consumer, as well as I assume GVT-g updates happening internally at
> Intel.  I expect all of the latter needs further review and discussion,
> but we should be at the point where we can validate these proposed
> kernel interfaces.  Therefore I'd like to make a call for reviews so
> that we can get this wrapped up for the v5.8 merge window.  I know
> Connie has some outstanding documentation comments and I'd like to make
> sure everyone has an opportunity to check that their comments have been
> addressed and we don't discover any new blocking issues.  Please send
> your Acked-by/Reviewed-by/Tested-by tags if you're satisfied with this
> interface and implementation.  Thanks!
>
hi Alex
after porting gvt/i40e vf migration code to kernel/qemu v23, we spotted
two bugs.
1. "Failed to get dirty bitmap for iova: 0xfe011000 size: 0x3fb0 err: 22"
   This is a qemu bug that the dirty bitmap query range is not the same
   as the dma map range. It can be fixed in qemu, and I just have a little
   concern about the kernel having this restriction.

2. migration abortion, reporting
"qemu-system-x86_64-lm: vfio_load_state: Error allocating buffer
qemu-system-x86_64-lm: error while loading state section id 49(vfio)
qemu-system-x86_64-lm: load of migration failed: Cannot allocate memory"

It's still a qemu bug and we can fixed it by
"
if (migration->pending_bytes == 0) {
+qemu_put_be64(f, 0);
+qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
"
and actually there are some extra concerns about this part, as reported in
[1][2].

[1] data_size should be read ahead of data_offset
https://lists.gnu.org/archive/html/qemu-devel/2020-05/msg02795.html.
[2] should not repeatedly update pending_bytes in vfio_save_iterate()
https://lists.gnu.org/archive/html/qemu-devel/2020-05/msg02796.html.

but as those errors are all in qemu, and we have finished basic tests in
both gvt & i40e, we're fine with the kernel part interface in general now.
(except for my concern [1], which needs to update kernel patch 1)

so I wonder which way is better in your mind: to give our reviewed-by to
the kernel part now, or to hold it until the next qemu fixes?
and as performance data from gvt was requested in your previous mail, is
that still required before the code is accepted?

BTW, we have also conducted some basic tests when viommu is on, and found out
errors like 
"qemu-system-x86_64-dt: vtd_iova_to_slpte: detected slpte permission error 
(iova=0x0, level=0x3, slpte=0x0, write=1)
qemu-system-x86_64-dt: vtd_iommu_translate: detected translation failure 
(dev=00:03:00, iova=0x0)
qemu-system-x86_64-dt: New fault is not recorded due to compression of faults".

Thanks
Yan
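
To spell out why the missing writes in bug 2 manifest as an allocation
failure: the save side emits the VFIO_MIG_FLAG_DEV_DATA_STATE header
before checking pending_bytes, so if it then returns without writing a
data size, the load side consumes the next 8 bytes in the stream --
apparently a flag constant rather than a size -- as data_size. A
paraphrase of the load-side framing assumption (field and flag names
follow the quoted series, control flow simplified):

    data = qemu_get_be64(f);                   /* expect a VFIO_MIG_FLAG_* */
    while (data != VFIO_MIG_FLAG_END_OF_STATE) {
        if (data == VFIO_MIG_FLAG_DEV_DATA_STATE) {
            uint64_t data_size = qemu_get_be64(f);  /* must be a size, or 0 */
            if (data_size) {
                void *buf = g_try_malloc0(data_size);
                if (!buf) {
                    /* a misread flag is a huge bogus size: the
                     * "Error allocating buffer" failure above */
                    return -ENOMEM;
                }
            }
        }
        data = qemu_get_be64(f);
    }

Writing the explicit 0 terminator (and END_OF_STATE) keeps the loader's
framing intact.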







Re: [PATCH v6 1/3] memory: drop guest writes to read-only ram device regions

2020-05-24 Thread Yan Zhao
On Thu, May 21, 2020 at 04:38:47PM +0200, Paolo Bonzini wrote:
> On 30/04/20 11:40, Peter Maydell wrote:
> >> This does not "drop" a write to a r/o region -- it causes it to generate
> >> whatever the guest architecture's equivalent of a bus error is (eg data
> >> abort on Arm).
> 
> 
> > More generally, this change seems a bit odd: currently we do not
> > check the mr->readonly flag here, but in general guests don't get
> > to write to ROM areas. Where is that check currently done?
> 
> Writes to ROM are directed to mr->ops unassigned_mem_ops.  Because _all_
> ram-device reads and writes go through the ops, for ram-device we have
> to stick the check for mr->readonly in the ops.
> 
> On one hand, I was quite surprised to see that unassigned_mem_write does
> not return MEMTX_ERROR now that I looked at it.
> 
> On the other hand, we should use MEMTX_ERROR in patch 2 as well, if we
> decide it's the way to go.
> 
> (Sorry Yan for the late response).
> 
hi Paolo,
thanks for your reply and never mind :)

But there's one thing whose reason I just can't figure out, and I eagerly
need your guidance.

why do we have to convert all .write operations to .write_with_attrs and
return MEMTX_ERROR? Is it because of the handling of writes to read-only
regions?

however, it seems that all regions would have to handle this case, so
ultimately we would have to convert all .write to .write_with_attrs and
there would be no .write operations any more?

Thanks
Yan





Re: [PATCH Kernel v22 0/8] Add UAPIs to support migration for VFIO devices

2020-05-21 Thread Yan Zhao
On Thu, May 21, 2020 at 12:39:48PM +0530, Kirti Wankhede wrote:
> 
> 
> On 5/21/2020 10:38 AM, Yan Zhao wrote:
> > On Wed, May 20, 2020 at 10:46:12AM -0600, Alex Williamson wrote:
> > > On Wed, 20 May 2020 19:10:07 +0530
> > > Kirti Wankhede  wrote:
> > > 
> > > > On 5/20/2020 8:25 AM, Yan Zhao wrote:
> > > > > On Tue, May 19, 2020 at 10:58:04AM -0600, Alex Williamson wrote:
> > > > > > Hi folks,
> > > > > > 
> > > > > > My impression is that we're getting pretty close to a workable
> > > > > > implementation here with v22 plus respins of patches 5, 6, and 8.  
> > > > > > We
> > > > > > also have a matching QEMU series and a proposal for a new i40e
> > > > > > consumer, as well as I assume GVT-g updates happening internally at
> > > > > > Intel.  I expect all of the latter needs further review and 
> > > > > > discussion,
> > > > > > but we should be at the point where we can validate these proposed
> > > > > > kernel interfaces.  Therefore I'd like to make a call for reviews so
> > > > > > that we can get this wrapped up for the v5.8 merge window.  I know
> > > > > > Connie has some outstanding documentation comments and I'd like to 
> > > > > > make
> > > > > > sure everyone has an opportunity to check that their comments have 
> > > > > > been
> > > > > > addressed and we don't discover any new blocking issues.  Please 
> > > > > > send
> > > > > > your Acked-by/Reviewed-by/Tested-by tags if you're satisfied with 
> > > > > > this
> > > > > > interface and implementation.  Thanks!
> > > > > hi Alex and Kirti,
> > > > > after porting to qemu v22 and kernel v22, we found that
> > > > > it cannot even pass a basic live migration test, with an error like
> > > > > 
> > > > > "Failed to get dirty bitmap for iova: 0xca000 size: 0x3000 err: 22"
> > > > 
> > > > Thanks for testing Yan.
> > > > I think last moment change in below cause this failure
> > > > 
> > > > https://lore.kernel.org/kvm/1589871178-8282-1-git-send-email-kwankh...@nvidia.com/
> > > > 
> > > >   > if (dma->iova > iova + size)
> > > >   > break;
> > > > 
> > > > Surprisingly with my basic testing with 2G sys mem QEMU didn't raise
> > > > abort on g_free, but I do hit this with large sys mem.
> > > > With above change, that function iterated through next vfio_dma as well.
> > > > Check should be as below:
> > > > 
> > > > -   if (dma->iova > iova + size)
> > > > +   if (dma->iova > iova + size -1)
> > > 
> > > 
> > > Or just:
> > > 
> > >   if (dma->iova >= iova + size)
> > > 
> > > Thanks,
> > > Alex
> > > 
> > > 
> > > >   break;
> > > > 
> > > > Another fix is in QEMU.
> > > > https://lists.gnu.org/archive/html/qemu-devel/2020-05/msg04751.html
> > > > 
> > > >   > > +range->bitmap.size = ROUND_UP(pages, 64) / 8;
> > > >   >
> > > >   > ROUND_UP(npages/8, sizeof(u64))?
> > > >   >
> > > > 
> > > > If npages < 8, npages/8 is 0 and ROUND_UP(0, 8) returns 0.
> > > > 
> > > > Changing it as below
> > > > 
> > > > -range->bitmap.size = ROUND_UP(pages / 8, sizeof(uint64_t));
> > > > +range->bitmap.size = ROUND_UP(pages, sizeof(__u64) *
> > > > BITS_PER_BYTE) /
> > > > + BITS_PER_BYTE;
> > > > 
> > > > I'm updating patches with these fixes and Cornelia's suggestion soon.
> > > > 
> > > > Due to shortage of time I may not be able to address all the concerns
> > > > raised on previous versions of QEMU; I'm trying to make the QEMU side code
> > > > available for testing for others with latest kernel changes. Don't
> > > > worry, I will revisit comments on QEMU patches. Right now first priority
> > > > is to test kernel UAPI and prepare kernel patches for 5.8
> > > > 
> > > 
> > hi Kirti
> > by updating kernel/qemu to v23, we still met the two types of errors below
> > in just a basic migration test.
> > (the guest VM size is 2G for all reported bugs).
> > 
> > "Failed to get dirty bitmap for iova: 0xfe011000 size: 0x3fb0 err: 22"
> > 
> 
> size doesn't look correct here, the check below should be failing:
>  range.size & (iommu_pgsize - 1)
> 
> > or
> > 
> > "qemu-system-x86_64-lm: vfio_load_state: Error allocating buffer
> > qemu-system-x86_64-lm: error while loading state section id 49(vfio)
> > qemu-system-x86_64-lm: load of migration failed: Cannot allocate memory"
> > 
> > 
> 
> Above error is from:
> buf = g_try_malloc0(data_size);
> if (!buf) {
> error_report("%s: Error allocating buffer ", __func__);
> return -ENOMEM;
> }
> 
> Seems you are running out of memory?
>
no, my host memory is about 60G.
I just migrate with the command "migrate -d xxx" without a speed limit.
FYI.

Yan
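
For reference, the err: 22 (EINVAL) above comes from the kernel-side
range validation; roughly (paraphrasing the vfio_iommu_type1 GET_BITMAP
path, exact code differs):

    /* iommu_pgsize is the smallest supported page size; an iova/size
     * pair that is not page aligned, like size 0x3fb0 above, is
     * rejected */
    if (range.iova & (iommu_pgsize - 1))
            return -EINVAL;
    if (!range.size || (range.size & (iommu_pgsize - 1)))
            return -EINVAL;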



Re: [PATCH Kernel v22 0/8] Add UAPIs to support migration for VFIO devices

2020-05-20 Thread Yan Zhao
On Wed, May 20, 2020 at 10:46:12AM -0600, Alex Williamson wrote:
> On Wed, 20 May 2020 19:10:07 +0530
> Kirti Wankhede  wrote:
> 
> > On 5/20/2020 8:25 AM, Yan Zhao wrote:
> > > On Tue, May 19, 2020 at 10:58:04AM -0600, Alex Williamson wrote:  
> > >> Hi folks,
> > >>
> > >> My impression is that we're getting pretty close to a workable
> > >> implementation here with v22 plus respins of patches 5, 6, and 8.  We
> > >> also have a matching QEMU series and a proposal for a new i40e
> > >> consumer, as well as I assume GVT-g updates happening internally at
> > >> Intel.  I expect all of the latter needs further review and discussion,
> > >> but we should be at the point where we can validate these proposed
> > >> kernel interfaces.  Therefore I'd like to make a call for reviews so
> > >> that we can get this wrapped up for the v5.8 merge window.  I know
> > >> Connie has some outstanding documentation comments and I'd like to make
> > >> sure everyone has an opportunity to check that their comments have been
> > >> addressed and we don't discover any new blocking issues.  Please send
> > >> your Acked-by/Reviewed-by/Tested-by tags if you're satisfied with this
> > >> interface and implementation.  Thanks!
> > >>  
> > > hi Alex and Kirti,
> > > after porting to qemu v22 and kernel v22, we found that
> > > it cannot even pass a basic live migration test, with an error like
> > > 
> > > "Failed to get dirty bitmap for iova: 0xca000 size: 0x3000 err: 22"
> > >   
> > 
> > Thanks for testing Yan.
> > I think last moment change in below cause this failure
> > 
> > https://lore.kernel.org/kvm/1589871178-8282-1-git-send-email-kwankh...@nvidia.com/
> > 
> >  >  if (dma->iova > iova + size)
> >  >  break;  
> > 
> > Surprisingly with my basic testing with 2G sys mem QEMU didn't raise 
> > abort on g_free, but I do hit this with large sys mem.
> > With above change, that function iterated through next vfio_dma as well. 
> > Check should be as below:
> > 
> > -   if (dma->iova > iova + size)
> > +   if (dma->iova > iova + size -1)
> 
> 
> Or just:
> 
>   if (dma->iova >= iova + size)
> 
> Thanks,
> Alex
> 
> 
> >  break;
> > 
> > Another fix is in QEMU.
> > https://lists.gnu.org/archive/html/qemu-devel/2020-05/msg04751.html
> > 
> >  > > +range->bitmap.size = ROUND_UP(pages, 64) / 8;  
> >  >
> >  > ROUND_UP(npages/8, sizeof(u64))?
> >  >  
> > 
> > If npages < 8, npages/8 is 0 and ROUND_UP(0, 8) returns 0.
> > 
> > Changing it as below
> > 
> > -range->bitmap.size = ROUND_UP(pages / 8, sizeof(uint64_t));
> > +range->bitmap.size = ROUND_UP(pages, sizeof(__u64) * 
> > BITS_PER_BYTE) /
> > + BITS_PER_BYTE;
> > 
> > I'm updating patches with these fixes and Cornelia's suggestion soon.
> > 
> > Due to shortage of time I may not be able to address all the concerns
> > raised on previous versions of QEMU; I'm trying to make the QEMU side code
> > available for testing for others with latest kernel changes. Don't 
> > worry, I will revisit comments on QEMU patches. Right now first priority 
> > is to test kernel UAPI and prepare kernel patches for 5.8
> > 
>
hi Kirti
by updating kernel/qemu to v23, we still met the two types of errors below
in just a basic migration test.
(the guest VM size is 2G for all reported bugs).

"Failed to get dirty bitmap for iova: 0xfe011000 size: 0x3fb0 err: 22"

or 

"qemu-system-x86_64-lm: vfio_load_state: Error allocating buffer
qemu-system-x86_64-lm: error while loading state section id 49(vfio)
qemu-system-x86_64-lm: load of migration failed: Cannot allocate memory"


Thanks
Yan
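
To make the bitmap-size arithmetic above concrete: with pages = 4 (a 16K
range at 4K page size), the original expression under-allocates to zero
bytes while the fixed one yields one full u64. A self-contained check,
assuming the usual arithmetic ROUND_UP definition:

    #include <stdint.h>
    #include <stdio.h>

    #define BITS_PER_BYTE 8
    #define ROUND_UP(x, y) ((((x) + (y) - 1) / (y)) * (y))

    int main(void)
    {
        uint64_t pages = 4;   /* fewer than 8 pages to track */

        /* original: 4 / 8 == 0, and rounding 0 up to a multiple of 8
         * is still 0 bytes */
        uint64_t broken = ROUND_UP(pages / BITS_PER_BYTE, sizeof(uint64_t));

        /* fixed: round the bit count up to a whole u64 first, then
         * convert bits to bytes */
        uint64_t fixed = ROUND_UP(pages, sizeof(uint64_t) * BITS_PER_BYTE) /
                         BITS_PER_BYTE;

        printf("broken=%llu fixed=%llu\n",    /* prints broken=0 fixed=8 */
               (unsigned long long)broken, (unsigned long long)fixed);
        return 0;
    }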



Re: [PATCH QEMU v22 09/18] vfio: Add save state functions to SaveVMHandlers

2020-05-19 Thread Yan Zhao
On Mon, May 18, 2020 at 11:43:09AM +0530, Kirti Wankhede wrote:

<...>
> +
> +static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev)
> +{
> +VFIOMigration *migration = vbasedev->migration;
> +VFIORegion *region = &migration->region;
> +uint64_t data_offset = 0, data_size = 0;
> +int ret;
> +
> +ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
> +region->fd_offset + offsetof(struct vfio_device_migration_info,
> + data_offset));
> +if (ret != sizeof(data_offset)) {
> +error_report("%s: Failed to get migration buffer data offset %d",
> + vbasedev->name, ret);
> +return -EINVAL;
> +}
> +
> +ret = pread(vbasedev->fd, &data_size, sizeof(data_size),
> +region->fd_offset + offsetof(struct vfio_device_migration_info,
> + data_size));
> +if (ret != sizeof(data_size)) {
> +error_report("%s: Failed to get migration buffer data size %d",
> + vbasedev->name, ret);
> +return -EINVAL;
> +}
> +
> +if (data_size > 0) {
> +void *buf = NULL;
> +bool buffer_mmaped;
> +
> +if (region->mmaps) {
> +buf = find_data_region(region, data_offset, data_size);
> +}
> +
> +buffer_mmaped = (buf != NULL);
> +
> +if (!buffer_mmaped) {
> +buf = g_try_malloc(data_size);
> +if (!buf) {
> +error_report("%s: Error allocating buffer ", __func__);
> +return -ENOMEM;
> +}
> +
> +ret = pread(vbasedev->fd, buf, data_size,
> +region->fd_offset + data_offset);
> +if (ret != data_size) {
> +error_report("%s: Failed to get migration data %d",
> + vbasedev->name, ret);
> +g_free(buf);
> +return -EINVAL;
> +}
> +}
> +
> +qemu_put_be64(f, data_size);
> +qemu_put_buffer(f, buf, data_size);
> +
> +if (!buffer_mmaped) {
> +g_free(buf);
> +}
> +} else {
> +qemu_put_be64(f, data_size);
> +}
> +
> +trace_vfio_save_buffer(vbasedev->name, data_offset, data_size,
> +   migration->pending_bytes);
> +
> +ret = qemu_file_get_error(f);
> +if (ret) {
> +return ret;
> +}
> +
> +return data_size;
> +}
> +
> +static int vfio_update_pending(VFIODevice *vbasedev)
> +{
> +VFIOMigration *migration = vbasedev->migration;
> +VFIORegion *region = &migration->region;
> +uint64_t pending_bytes = 0;
> +int ret;
> +
> +ret = pread(vbasedev->fd, &pending_bytes, sizeof(pending_bytes),
> +region->fd_offset + offsetof(struct vfio_device_migration_info,
> + pending_bytes));
> +if ((ret < 0) || (ret != sizeof(pending_bytes))) {
> +error_report("%s: Failed to get pending bytes %d",
> + vbasedev->name, ret);
> +migration->pending_bytes = 0;
> +return (ret < 0) ? ret : -EINVAL;
> +}
> +
> +migration->pending_bytes = pending_bytes;
> +trace_vfio_update_pending(vbasedev->name, pending_bytes);
> +return 0;
> +}
> +
<...>
>  
> +static void vfio_save_pending(QEMUFile *f, void *opaque,
> +  uint64_t threshold_size,
> +  uint64_t *res_precopy_only,
> +  uint64_t *res_compatible,
> +  uint64_t *res_postcopy_only)
> +{
> +VFIODevice *vbasedev = opaque;
> +VFIOMigration *migration = vbasedev->migration;
> +int ret;
> +
> +ret = vfio_update_pending(vbasedev);
> +if (ret) {
> +return;
> +}
> +
> +*res_precopy_only += migration->pending_bytes;
> +
> +trace_vfio_save_pending(vbasedev->name, *res_precopy_only,
> +*res_postcopy_only, *res_compatible);
> +}
> +
> +static int vfio_save_iterate(QEMUFile *f, void *opaque)
> +{
> +VFIODevice *vbasedev = opaque;
> +VFIOMigration *migration = vbasedev->migration;
> +int ret, data_size;
> +
> +qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
> +
hi Kirti
seems you also didn't address my previous comments.
https://lists.gnu.org/archive/html/qemu-devel/2020-05/msg02795.html.
https://lists.gnu.org/archive/html/qemu-devel/2020-05/msg02796.html


> +if (migration->pending_bytes == 0) {
> +ret = vfio_update_pending(vbasedev);
repeatedly getting pending_bytes here would cause the vmstates following
vfio-pci to have no chance to get called.

Thanks
Yan

> +if (ret) {
> +return ret;
> +}
> +
> +if (migration->pending_bytes == 0) {
> +/* indicates data finished, goto complete phase */
> +return 1;
> +}
> +}
> +
> +data_size = vfio_save_buffer(f, vbasedev);
> +
> +if 

Re: [PATCH Kernel v22 0/8] Add UAPIs to support migration for VFIO devices

2020-05-19 Thread Yan Zhao
On Tue, May 19, 2020 at 10:58:04AM -0600, Alex Williamson wrote:
> Hi folks,
> 
> My impression is that we're getting pretty close to a workable
> implementation here with v22 plus respins of patches 5, 6, and 8.  We
> also have a matching QEMU series and a proposal for a new i40e
> consumer, as well as I assume GVT-g updates happening internally at
> Intel.  I expect all of the latter needs further review and discussion,
> but we should be at the point where we can validate these proposed
> kernel interfaces.  Therefore I'd like to make a call for reviews so
> that we can get this wrapped up for the v5.8 merge window.  I know
> Connie has some outstanding documentation comments and I'd like to make
> sure everyone has an opportunity to check that their comments have been
> addressed and we don't discover any new blocking issues.  Please send
> your Acked-by/Reviewed-by/Tested-by tags if you're satisfied with this
> interface and implementation.  Thanks!
> 
hi Alex and Kirti,
after porting to qemu v22 and kernel v22, we found that
it cannot even pass a basic live migration test, with an error like

"Failed to get dirty bitmap for iova: 0xca000 size: 0x3000 err: 22"

Thanks
Yan

> 
> On Mon, 18 May 2020 11:26:29 +0530
> Kirti Wankhede  wrote:
> 
> > Hi,
> > 
> > This patch set adds:
> > * IOCTL VFIO_IOMMU_DIRTY_PAGES to get dirty pages bitmap with
> >   respect to IOMMU container rather than per device. All pages pinned by
> >   vendor driver through vfio_pin_pages external API have to be marked as
> >   dirty during migration. When IOMMU capable device is present in the
> >   container and all pages are pinned and mapped, then all pages are marked
> >   dirty.
> >   When there are CPU writes, CPU dirty page tracking can identify dirtied
> >   pages, but any page pinned by vendor driver can also be written by
> >   device. As of now there is no device which has hardware support for
> >   dirty page tracking. So all pages which are pinned should be considered
> >   as dirty.
> >   This ioctl is also used to start/stop dirty pages tracking for pinned and
> >   unpinned pages while migration is active.
> > 
> > * Updated IOCTL VFIO_IOMMU_UNMAP_DMA to get dirty pages bitmap before
> >   unmapping IO virtual address range.
> >   With vIOMMU, during pre-copy phase of migration, while CPUs are still
> >   running, IO virtual address unmap can happen while device still keeping
> >   reference of guest pfns. Those pages should be reported as dirty before
> >   unmap, so that VFIO user space application can copy content of those
> >   pages from source to destination.
> > 
> > * Patch 8 detects if IOMMU capable device driver is smart enough to report pages
> >   to be marked dirty by pinning pages using vfio_pin_pages() API.
> > 
> > 
> > Yet TODO:
> > Since there is no device which has hardware support for system memory
> > dirty bitmap tracking, right now there is no other API from vendor driver
> > to VFIO IOMMU module to report dirty pages. In future, when such hardware
> > support will be implemented, an API will be required such that vendor
> > driver could report dirty pages to VFIO module during migration phases.
> > 
> > Adding revision history from previous QEMU patch set to understand KABI
> > changes done till now
> > 
> > v21 -> v22
> > - Fixed issue raised by Alex :
> > https://lore.kernel.org/kvm/20200515163307.72951...@w520.home/
> > 
> > v20 -> v21
> > - Added check for GET_BITMAP ioctl for vfio_dma boundaries.
> > - Updated unmap ioctl function - as suggested by Alex.
> > - Updated comments in DIRTY_TRACKING ioctl definition - as suggested by
> >   Cornelia.
> > 
> > v19 -> v20
> > - Fixed ioctl to get dirty bitmap to get bitmap of multiple vfio_dmas
> > - Fixed unmap ioctl to get dirty bitmap of multiple vfio_dmas.
> > - Removed flag definition from migration capability.
> > 
> > v18 -> v19
> > - Updated migration capability with supported page sizes bitmap for dirty
> >   page tracking and  maximum bitmap size supported by kernel module.
> > - Added patch to calculate and cache pgsize_bitmap when iommu->domain_list
> >   is updated.
> > - Removed extra buffers added in previous version for bitmap manipulation
> >   and optimised the code.
> > 
> > v17 -> v18
> > - Add migration capability to the capability chain for VFIO_IOMMU_GET_INFO
> >   ioctl
> > - Updated UMAP_DMA ioctl to return bitmap of multiple vfio_dma
> > 
> > v16 -> v17
> > - Fixed errors reported by kbuild test robot  on i386
> > 
> > v15 -> v16
> > - Minor edits and nit picks (Auger Eric)
> > - On copying bitmap to user, re-populated bitmap only for pinned pages,
> >   excluding unmapped pages and CPU dirtied pages.
> > - Patches are on tag: next-20200318 and 1-3 patches from Yan's series
> >   https://lkml.org/lkml/2020/3/12/1255
> > 
> > v14 -> v15
> > - Minor edits and nit picks.
> > - In the verification of user allocated bitmap memory, added check of
> >maximum size.
> > - Patches are on tag: next-20200318 and 1-3 patches 

Re: [PATCH Kernel v21 0/8] Add UAPIs to support migration for VFIO devices

2020-05-17 Thread Yan Zhao
On Mon, May 18, 2020 at 10:39:52AM +0800, Xiang Zheng wrote:
> Hi Kirti and Yan,
> 
> How can I test this patch series on my SR-IOV devices?
> I have looked through Yan's patches for i40e VF live migration support:
> https://patchwork.kernel.org/patch/11375177/
> 
I just updated the patches to v4.
https://patchwork.kernel.org/cover/11554617/.

It's based on v17 kernel + v16 qemu with some minor changes in qemu.

> However, I cannot find the detailed implementation about device state
> saving/restoring and dirty page logging. Does i40e hardware already support
> these two features?
>
In v4, vendor driver for i40e vf reports dirty pages to vfio container.
The detailed implementation of identifying dirty pages and device state
is not sent yet for process reasons.
We use a software way to get dirty pages, i.e. dynamic trapping of BAR 0.

Thanks
Yan
> And once a device supports both features, how should live migration be
> implemented for it via this patch series?
> 
> On 2020/5/16 5:13, Kirti Wankhede wrote:
> > Hi,
> > 
> > This patch set adds:
> > * IOCTL VFIO_IOMMU_DIRTY_PAGES to get dirty pages bitmap with
> >   respect to IOMMU container rather than per device. All pages pinned by
> >   vendor driver through vfio_pin_pages external API have to be marked as
> >   dirty during migration. When IOMMU capable device is present in the
> >   container and all pages are pinned and mapped, then all pages are marked
> >   dirty.
> >   When there are CPU writes, CPU dirty page tracking can identify dirtied
> >   pages, but any page pinned by vendor driver can also be written by
> >   device. As of now there is no device which has hardware support for
> >   dirty page tracking. So all pages which are pinned should be considered
> >   as dirty.
> >   This ioctl is also used to start/stop dirty pages tracking for pinned and
> >   unpinned pages while migration is active.
> > 
> > * Updated IOCTL VFIO_IOMMU_UNMAP_DMA to get dirty pages bitmap before
> >   unmapping IO virtual address range.
> >   With vIOMMU, during pre-copy phase of migration, while CPUs are still
> >   running, IO virtual address unmap can happen while device still keeping
> >   reference of guest pfns. Those pages should be reported as dirty before
> >   unmap, so that VFIO user space application can copy content of those
> >   pages from source to destination.
> > 
> > * Patch 8 detects if IOMMU capable device driver is smart enough to report pages
> >   to be marked dirty by pinning pages using vfio_pin_pages() API.
> > 
> > 
> > Yet TODO:
> > Since there is no device which has hardware support for system memory
> > dirty bitmap tracking, right now there is no other API from vendor driver
> > to VFIO IOMMU module to report dirty pages. In future, when such hardware
> > support will be implemented, an API will be required such that vendor
> > driver could report dirty pages to VFIO module during migration phases.
> > 
> > Adding revision history from previous QEMU patch set to understand KABI
> > changes done till now
> > 
> > v20 -> v21
> > - Added check for GET_BITMAP ioctl for vfio_dma boundaries.
> > - Updated unmap ioctl function - as suggested by Alex.
> > - Updated comments in DIRTY_TRACKING ioctl definition - as suggested by
> >   Cornelia.
> > 
> > v19 -> v20
> > - Fixed ioctl to get dirty bitmap to get bitmap of multiple vfio_dmas
> > - Fixed unmap ioctl to get dirty bitmap of multiple vfio_dmas.
> > - Removed flag definition from migration capability.
> > 
> > v18 -> v19
> > - Updated migration capability with supported page sizes bitmap for dirty
> >   page tracking and  maximum bitmap size supported by kernel module.
> > - Added patch to calculate and cache pgsize_bitmap when iommu->domain_list
> >   is updated.
> > - Removed extra buffers added in previous version for bitmap manipulation
> >   and optimised the code.
> > 
> > v17 -> v18
> > - Add migration capability to the capability chain for VFIO_IOMMU_GET_INFO
> >   ioctl
> > - Updated UMAP_DMA ioctl to return bitmap of multiple vfio_dma
> > 
> > v16 -> v17
> > - Fixed errors reported by kbuild test robot  on i386
> > 
> > v15 -> v16
> > - Minor edits and nit picks (Auger Eric)
> > - On copying bitmap to user, re-populated bitmap only for pinned pages,
> >   excluding unmapped pages and CPU dirtied pages.
> > - Patches are on tag: next-20200318 and 1-3 patches from Yan's series
> >   https://lkml.org/lkml/2020/3/12/1255
> > 
> > v14 -> v15
> > - Minor edits and nit picks.
> > - In the verification of user allocated bitmap memory, added check of
> >maximum size.
> > - Patches are on tag: next-20200318 and 1-3 patches from Yan's series
> >   https://lkml.org/lkml/2020/3/12/1255
> > 
> > v13 -> v14
> > - Added struct vfio_bitmap to kabi. updated structure
> >   vfio_iommu_type1_dirty_bitmap_get and vfio_iommu_type1_dma_unmap.
> > - All small changes suggested by Alex.
> > - Patches are on tag: next-20200318 and 1-3 patches from Yan's series
> >   

Re: [PATCH Kernel v20 0/8] Add UAPIs to support migration for VFIO devices

2020-05-15 Thread Yan Zhao
On Thu, May 14, 2020 at 09:32:06PM -0600, Alex Williamson wrote:
> Hi Yan & Intel folks,
> 
> I'm starting to run out of comments on this series, where are you with
> porting GVT-g migration to this API?  Are there remaining blocking
> issues?  Are we satisfied that the API is sufficient to support vIOMMU
> now?  Thanks,
> 
hi Alex
currently, we have ported to v17 kernel + v16 qemu with some fixes. gvt
is working but we didn't try viommu yet.
for this v20 kernel series, are there any qemu patches matching it?

Thanks
Yan


> On Fri, 15 May 2020 02:07:39 +0530
> Kirti Wankhede  wrote:
> 
> > Hi,
> > 
> > This patch set adds:
> > * IOCTL VFIO_IOMMU_DIRTY_PAGES to get dirty pages bitmap with
> >   respect to IOMMU container rather than per device. All pages pinned by
> >   vendor driver through vfio_pin_pages external API have to be marked as
> >   dirty during migration. When IOMMU capable device is present in the
> >   container and all pages are pinned and mapped, then all pages are marked
> >   dirty.
> >   When there are CPU writes, CPU dirty page tracking can identify dirtied
> >   pages, but any page pinned by vendor driver can also be written by
> >   device. As of now there is no device which has hardware support for
> >   dirty page tracking. So all pages which are pinned should be considered
> >   as dirty.
> >   This ioctl is also used to start/stop dirty pages tracking for pinned and
> >   unpinned pages while migration is active.
> > 
> > * Updated IOCTL VFIO_IOMMU_UNMAP_DMA to get dirty pages bitmap before
> >   unmapping IO virtual address range.
> >   With vIOMMU, during pre-copy phase of migration, while CPUs are still
> >   running, IO virtual address unmap can happen while device still keeping
> >   reference of guest pfns. Those pages should be reported as dirty before
> >   unmap, so that VFIO user space application can copy content of those
> >   pages from source to destination.
> > 
> > * Patch 8 detects if IOMMU capable device driver is smart enough to report pages
> >   to be marked dirty by pinning pages using vfio_pin_pages() API.
> > 
> > 
> > Yet TODO:
> > Since there is no device which has hardware support for system memory
> > dirty bitmap tracking, right now there is no other API from vendor driver
> > to VFIO IOMMU module to report dirty pages. In future, when such hardware
> > support will be implemented, an API will be required such that vendor
> > driver could report dirty pages to VFIO module during migration phases.
> > 
> > Adding revision history from previous QEMU patch set to understand KABI
> > changes done till now
> > 
> > v19 -> v20
> > - Fixed ioctl to get dirty bitmap to get bitmap of multiple vfio_dmas
> > - Fixed unmap ioctl to get dirty bitmap of multiple vfio_dmas.
> > - Removed flag definition from migration capability.
> > 
> > v18 -> v19
> > - Updated migration capability with supported page sizes bitmap for dirty
> >   page tracking and  maximum bitmap size supported by kernel module.
> > - Added patch to calculate and cache pgsize_bitmap when iommu->domain_list
> >   is updated.
> > - Removed extra buffers added in previous version for bitmap manipulation
> >   and optimised the code.
> > 
> > v17 -> v18
> > - Add migration capability to the capability chain for VFIO_IOMMU_GET_INFO
> >   ioctl
> > - Updated UMAP_DMA ioctl to return bitmap of multiple vfio_dma
> > 
> > v16 -> v17
> > - Fixed errors reported by kbuild test robot  on i386
> > 
> > v15 -> v16
> > - Minor edits and nit picks (Auger Eric)
> > - On copying bitmap to user, re-populated bitmap only for pinned pages,
> >   excluding unmapped pages and CPU dirtied pages.
> > - Patches are on tag: next-20200318 and 1-3 patches from Yan's series
> >   https://lkml.org/lkml/2020/3/12/1255
> > 
> > v14 -> v15
> > - Minor edits and nit picks.
> > - In the verification of user allocated bitmap memory, added check of
> >maximum size.
> > - Patches are on tag: next-20200318 and 1-3 patches from Yan's series
> >   https://lkml.org/lkml/2020/3/12/1255
> > 
> > v13 -> v14
> > - Added struct vfio_bitmap to kabi. updated structure
> >   vfio_iommu_type1_dirty_bitmap_get and vfio_iommu_type1_dma_unmap.
> > - All small changes suggested by Alex.
> > - Patches are on tag: next-20200318 and 1-3 patches from Yan's series
> >   https://lkml.org/lkml/2020/3/12/1255
> > 
> > v12 -> v13
> > - Changed bitmap allocation in vfio_iommu_type1 to per vfio_dma
> > - Changed VFIO_IOMMU_DIRTY_PAGES ioctl behaviour to be per vfio_dma range.
> > - Changed vfio_iommu_type1_dirty_bitmap structure to have separate data
> >   field.
> > 
> > v11 -> v12
> > - Changed bitmap allocation in vfio_iommu_type1.
> > - Remove atomicity of ref_count.
> > - Updated comments for migration device state structure about error
> >   reporting.
> > - Nit picks from v11 reviews
> > 
> > v10 -> v11
> > - Fix pin pages API to free vpfn if it is marked as unpinned tracking page.
> > - Added proposal to detect if IOMMU capable device calls external pin 

Re: [PATCH Kernel v20 5/8] vfio iommu: Implementation of ioctl for dirty pages tracking

2020-05-15 Thread Yan Zhao
On Fri, May 15, 2020 at 02:07:44AM +0530, Kirti Wankhede wrote:
> VFIO_IOMMU_DIRTY_PAGES ioctl performs three operations:
> - Start dirty pages tracking while migration is active
> - Stop dirty pages tracking.
> - Get dirty pages bitmap. It is the user space application's responsibility to
>   copy content of dirty pages from source to destination during migration.
> 
> To prevent DoS attack, memory for bitmap is allocated per vfio_dma
> structure. Bitmap size is calculated considering smallest supported page
> size. Bitmap is allocated for all vfio_dmas when dirty logging is enabled
> 
> Bitmap is populated for already pinned pages when bitmap is allocated for
> a vfio_dma with the smallest supported page size. Update bitmap from
> pinning functions when tracking is enabled. When user application queries
> bitmap, check if requested page size is same as page size used to
> populated bitmap. If it is equal, copy bitmap, but if not equal, return
> error.
> 
> Signed-off-by: Kirti Wankhede 
> Reviewed-by: Neo Jia 
> 
> Fixed error reported by build bot by changing pgsize type from uint64_t
> to size_t.
> Reported-by: kbuild test robot 
> ---
>  drivers/vfio/vfio_iommu_type1.c | 294 +++-
>  1 file changed, 288 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index de17787ffece..b76d3b14abfd 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -72,6 +72,7 @@ struct vfio_iommu {
>   uint64_t        pgsize_bitmap;
>   bool            v2;
>   bool            nesting;
> + bool            dirty_page_tracking;
>  };
>  
>  struct vfio_domain {
> @@ -92,6 +93,7 @@ struct vfio_dma {
>   bool            lock_cap;   /* capable(CAP_IPC_LOCK) */
>   struct task_struct  *task;
>   struct rb_root  pfn_list;   /* Ex-user pinned pfn list */
> + unsigned long   *bitmap;
>  };
>  
>  struct vfio_group {
> @@ -126,6 +128,19 @@ struct vfio_regions {
>  #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)  \
>   (!list_empty(>domain_list))
>  
> +#define DIRTY_BITMAP_BYTES(n)   (ALIGN(n, BITS_PER_TYPE(u64)) / BITS_PER_BYTE)
> +
> +/*
> + * Input argument of number of bits to bitmap_set() is unsigned integer, which
> + * further casts to signed integer for unaligned multi-bit operation,
> + * __bitmap_set().
> + * Then maximum bitmap size supported is 2^31 bits divided by 2^3 bits/byte,
> + * that is 2^28 (256 MB) which maps to 2^31 * 2^12 = 2^43 (8TB) on 4K page
> + * system.
> + */
> +#define DIRTY_BITMAP_PAGES_MAX   ((u64)INT_MAX)
> +#define DIRTY_BITMAP_SIZE_MAX    DIRTY_BITMAP_BYTES(DIRTY_BITMAP_PAGES_MAX)
> +
>  static int put_pfn(unsigned long pfn, int prot);
>  
>  /*
> @@ -176,6 +191,74 @@ static void vfio_unlink_dma(struct vfio_iommu *iommu, struct vfio_dma *old)
>   rb_erase(>node, >dma_list);
>  }
>  
> +
> +static int vfio_dma_bitmap_alloc(struct vfio_dma *dma, size_t pgsize)
> +{
> + uint64_t npages = dma->size / pgsize;
> +
> + if (npages > DIRTY_BITMAP_PAGES_MAX)
> + return -EINVAL;
> +
> + dma->bitmap = kvzalloc(DIRTY_BITMAP_BYTES(npages), GFP_KERNEL);
> + if (!dma->bitmap)
> + return -ENOMEM;
> +
> + return 0;
> +}
> +
> +static void vfio_dma_bitmap_free(struct vfio_dma *dma)
> +{
> + kfree(dma->bitmap);
> + dma->bitmap = NULL;
> +}
> +
> +static void vfio_dma_populate_bitmap(struct vfio_dma *dma, size_t pgsize)
> +{
> + struct rb_node *p;
> +
> + for (p = rb_first(>pfn_list); p; p = rb_next(p)) {
> + struct vfio_pfn *vpfn = rb_entry(p, struct vfio_pfn, node);
> +
> + bitmap_set(dma->bitmap, (vpfn->iova - dma->iova) / pgsize, 1);
> + }
> +}
> +
> +static int vfio_dma_bitmap_alloc_all(struct vfio_iommu *iommu, size_t pgsize)
> +{
> + struct rb_node *n = rb_first(&iommu->dma_list);
> +
> + for (; n; n = rb_next(n)) {
> + struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
> + int ret;
> +
> + ret = vfio_dma_bitmap_alloc(dma, pgsize);
> + if (ret) {
> + struct rb_node *p = rb_prev(n);
> +
> + for (; p; p = rb_prev(p)) {
> + struct vfio_dma *dma = rb_entry(n,
> + struct vfio_dma, node);
> +
> + vfio_dma_bitmap_free(dma);
> + }
> + return ret;
> + }
> + vfio_dma_populate_bitmap(dma, pgsize);
> + }
> + return 0;
> +}
> +
> +static void vfio_dma_bitmap_free_all(struct vfio_iommu *iommu)
> +{
> + struct rb_node *n = rb_first(&iommu->dma_list);
> +
> + for (; n; n = rb_next(n)) {
> + struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
> +
> + 

Re: [PATCH v16 QEMU 09/16] vfio: Add save state functions to SaveVMHandlers

2020-05-11 Thread Yan Zhao
On Mon, May 11, 2020 at 05:53:37PM +0800, Kirti Wankhede wrote:
> 
> 
> On 5/5/2020 10:07 AM, Alex Williamson wrote:
> > On Tue, 5 May 2020 04:48:14 +0530
> > Kirti Wankhede  wrote:
> > 
> >> On 3/26/2020 3:33 AM, Alex Williamson wrote:
> >>> On Wed, 25 Mar 2020 02:39:07 +0530
> >>> Kirti Wankhede  wrote:
> >>>

<...>

>  +static int vfio_save_iterate(QEMUFile *f, void *opaque)
>  +{
>  +VFIODevice *vbasedev = opaque;
>  +int ret, data_size;
>  +
>  +qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
>  +
>  +data_size = vfio_save_buffer(f, vbasedev);
>  +
>  +if (data_size < 0) {
>  +error_report("%s: vfio_save_buffer failed %s", vbasedev->name,
>  + strerror(errno));
>  +return data_size;
>  +}
>  +
>  +qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>  +
>  +ret = qemu_file_get_error(f);
>  +if (ret) {
>  +return ret;
>  +}
>  +
>  +trace_vfio_save_iterate(vbasedev->name, data_size);
>  +if (data_size == 0) {
>  +/* indicates data finished, goto complete phase */
>  +return 1;
> >>>
> >>> But it's pending_bytes not data_size that indicates we're done.  How do
> >>> we get away with ignoring pending_bytes for the save_live_iterate phase?
> >>>
> >>
> >> This is requirement mentioned above qemu_savevm_state_iterate() which
> >> calls .save_live_iterate.
> >>
> >> /* 
> >>* this function has three return values:
> >>*   negative: there was one error, and we have -errno.
> >>*   0 : We haven't finished, caller have to go again
> >>*   1 : We have finished, we can go to complete phase
> >>*/
> >> int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy)
> >>
> >> This is to serialize savevm_state.handlers (or in other words devices).
> > 
> > I've lost all context on this question in the interim, but I think this
> > highlights my question.  We use pending_bytes to know how close we are
> > to the end of the stream and data_size to iterate each transaction
> > within that stream.  So how does data_size == 0 indicate we've
> > completed the current phase?  It seems like pending_bytes should
> > indicate that.  Thanks,
> > 
> 
> Fixing this by adding a read on pending_bytes if it's 0 and returning
> accordingly.
>  if (migration->pending_bytes == 0) {
>  ret = vfio_update_pending(vbasedev);
>  if (ret) {
>  return ret;
>  }
> 
>  if (migration->pending_bytes == 0) {
>  /* indicates data finished, goto complete phase */
>  return 1;
>  }
>  }
> 

just a question: if 1 is only returned when migration->pending_bytes is 0,
does that mean .save_live_iterate of the vmstates after "vfio-pci"
would never be called until migration->pending_bytes is 0?

as in qemu_savevm_state_iterate(),

qemu_savevm_state_iterate {
...
  QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
...
ret = se->ops->save_live_iterate(f, se->opaque);
...
if (ret <= 0) {
/* Do not proceed to the next vmstate before this one reported
   completion of the current stage. This serializes the migration
   and reduces the probability that a faster changing state is
   synchronized over and over again. */
break;
}
  }
  return ret;
}

in ram's migration code, its pending_bytes (remaining_size) is only updated in
ram_save_pending() when it drops below the threshold, which means the
pending_bytes seen in ram_save_iterate() can be 0, so the other vmstates
get their chance to be called.
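Put together, the return-value contract would look roughly like this (a sketch
combining Kirti's fix with the quoted handler; not the final patch):

    static int vfio_save_iterate(QEMUFile *f, void *opaque)   /* sketch */
    {
        VFIODevice *vbasedev = opaque;
        VFIOMigration *migration = vbasedev->migration;
        int ret, data_size;

        qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
        data_size = vfio_save_buffer(f, vbasedev);
        if (data_size < 0) {
            return data_size;                /* negative: error */
        }
        qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);

        ret = qemu_file_get_error(f);
        if (ret) {
            return ret;
        }

        if (migration->pending_bytes == 0) {
            /* re-read: the cached value may be stale after the transfer */
            ret = vfio_update_pending(vbasedev);
            if (ret) {
                return ret;
            }
            if (migration->pending_bytes == 0) {
                return 1;   /* finished: later vmstates run in this iteration */
            }
        }
        return 0;           /* not finished: qemu_savevm_state_iterate() breaks */
    }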

Thanks
Yan




Re: [PATCH v16 QEMU 09/16] vfio: Add save state functions to SaveVMHandlers

2020-05-11 Thread Yan Zhao
On Mon, May 11, 2020 at 06:22:47PM +0800, Kirti Wankhede wrote:
> 
> 
> On 5/9/2020 11:01 AM, Yan Zhao wrote:
> > On Wed, Mar 25, 2020 at 05:09:07AM +0800, Kirti Wankhede wrote:
> >> Added .save_live_pending, .save_live_iterate and 
> >> .save_live_complete_precopy
> >> functions. These functions handles pre-copy and stop-and-copy phase.
> >>
> >> In _SAVING|_RUNNING device state or pre-copy phase:
> >> - read pending_bytes. If pending_bytes > 0, go through below steps.
> >> - read data_offset - indicates kernel driver to write data to staging
> >>buffer.
> >> - read data_size - amount of data in bytes written by vendor driver in
> >>migration region.
> > I think we should change the sequence of reading data_size and
> > data_offset. see the next comment below.
> > 
> >> - read data_size bytes of data from data_offset in the migration region.
> >> - Write data packet to file stream as below:
> >> {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data,
> >> VFIO_MIG_FLAG_END_OF_STATE }
> >>
> >> In _SAVING device state or stop-and-copy phase
> >> a. read config space of device and save to migration file stream. This
> >> doesn't need to be from vendor driver. Any other special config state
> >> from driver can be saved as data in following iteration.
> >> b. read pending_bytes. If pending_bytes > 0, go through below steps.
> >> c. read data_offset - indicates kernel driver to write data to staging
> >> buffer.
> >> d. read data_size - amount of data in bytes written by vendor driver in
> >> migration region.
> >> e. read data_size bytes of data from data_offset in the migration region.
> >> f. Write data packet as below:
> >> {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data}
> >> g. iterate through steps b to f while (pending_bytes > 0)
> >> h. Write {VFIO_MIG_FLAG_END_OF_STATE}
> >>
> >> When data region is mapped, it's the user's responsibility to read data of
> >> data_size from data_offset before moving to next steps.
> >>
> >> Signed-off-by: Kirti Wankhede 
> >> Reviewed-by: Neo Jia 
> >> ---
> >>   hw/vfio/migration.c   | 245 +-
> >>   hw/vfio/trace-events  |   6 ++
> >>   include/hw/vfio/vfio-common.h |   1 +
> >>   3 files changed, 251 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> >> index 033f76526e49..ecbeed5182c2 100644
> >> --- a/hw/vfio/migration.c
> >> +++ b/hw/vfio/migration.c
> >> @@ -138,6 +138,137 @@ static int vfio_migration_set_state(VFIODevice 
> >> *vbasedev, uint32_t mask,
> >>   return 0;
> >>   }
> >>   
> >> +static void *find_data_region(VFIORegion *region,
> >> +  uint64_t data_offset,
> >> +  uint64_t data_size)
> >> +{
> >> +void *ptr = NULL;
> >> +int i;
> >> +
> >> +for (i = 0; i < region->nr_mmaps; i++) {
> >> +if ((data_offset >= region->mmaps[i].offset) &&
> >> +(data_offset < region->mmaps[i].offset + 
> >> region->mmaps[i].size) &&
> >> +(data_size <= region->mmaps[i].size)) {
> >> +ptr = region->mmaps[i].mmap + (data_offset -
> >> +   region->mmaps[i].offset);
> >> +break;
> >> +}
> >> +}
> >> +return ptr;
> >> +}
> >> +
> >> +static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev)
> >> +{
> >> +VFIOMigration *migration = vbasedev->migration;
> >> +VFIORegion *region = >region;
> >> +uint64_t data_offset = 0, data_size = 0;
> >> +int ret;
> >> +
> >> +ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
> >> +region->fd_offset + offsetof(struct vfio_device_migration_info,
> >> + data_offset));
> >> +if (ret != sizeof(data_offset)) {
> >> +error_report("%s: Failed to get migration buffer data offset %d",
> >> + vbasedev->name, ret);
> >> +return -EINVAL;
> >> +}
> >> +
> >> +ret = pread(vbasedev->fd, _siz

Re: [PATCH v16 QEMU 09/16] vfio: Add save state functions to SaveVMHandlers

2020-05-08 Thread Yan Zhao
On Wed, Mar 25, 2020 at 05:09:07AM +0800, Kirti Wankhede wrote:
> Added .save_live_pending, .save_live_iterate and .save_live_complete_precopy
> functions. These functions handles pre-copy and stop-and-copy phase.
> 
> In _SAVING|_RUNNING device state or pre-copy phase:
> - read pending_bytes. If pending_bytes > 0, go through below steps.
> - read data_offset - indicates kernel driver to write data to staging
>   buffer.
> - read data_size - amount of data in bytes written by vendor driver in
>   migration region.
I think we should change the sequence of reading data_size and
data_offset. see the next comment below.

> - read data_size bytes of data from data_offset in the migration region.
> - Write data packet to file stream as below:
> {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data,
> VFIO_MIG_FLAG_END_OF_STATE }
> 
> In _SAVING device state or stop-and-copy phase
> a. read config space of device and save to migration file stream. This
>doesn't need to be from vendor driver. Any other special config state
>from driver can be saved as data in following iteration.
> b. read pending_bytes. If pending_bytes > 0, go through below steps.
> c. read data_offset - indicates kernel driver to write data to staging
>buffer.
> d. read data_size - amount of data in bytes written by vendor driver in
>migration region.
> e. read data_size bytes of data from data_offset in the migration region.
> f. Write data packet as below:
>{VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data}
> g. iterate through steps b to f while (pending_bytes > 0)
> h. Write {VFIO_MIG_FLAG_END_OF_STATE}
> 
> When data region is mapped, it's the user's responsibility to read data of
> data_size from data_offset before moving to next steps.
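To make the framing concrete, one pre-copy transaction on the save side
reduces to roughly the following sketch (flags and QEMUFile helpers as in
this patch; buffer handling elided):

    /* Sketch of one data transaction in the pre-copy stream. */
    qemu_put_be64(f, VFIO_MIG_FLAG_DEV_DATA_STATE);
    qemu_put_be64(f, data_size);          /* size written by the vendor driver */
    qemu_put_buffer(f, buf, data_size);   /* data read at data_offset */
    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);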
> 
> Signed-off-by: Kirti Wankhede 
> Reviewed-by: Neo Jia 
> ---
>  hw/vfio/migration.c   | 245 +-
>  hw/vfio/trace-events  |   6 ++
>  include/hw/vfio/vfio-common.h |   1 +
>  3 files changed, 251 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 033f76526e49..ecbeed5182c2 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -138,6 +138,137 @@ static int vfio_migration_set_state(VFIODevice 
> *vbasedev, uint32_t mask,
>  return 0;
>  }
>  
> +static void *find_data_region(VFIORegion *region,
> +  uint64_t data_offset,
> +  uint64_t data_size)
> +{
> +void *ptr = NULL;
> +int i;
> +
> +for (i = 0; i < region->nr_mmaps; i++) {
> +if ((data_offset >= region->mmaps[i].offset) &&
> +(data_offset < region->mmaps[i].offset + region->mmaps[i].size) 
> &&
> +(data_size <= region->mmaps[i].size)) {
> +ptr = region->mmaps[i].mmap + (data_offset -
> +   region->mmaps[i].offset);
> +break;
> +}
> +}
> +return ptr;
> +}
> +
> +static int vfio_save_buffer(QEMUFile *f, VFIODevice *vbasedev)
> +{
> +VFIOMigration *migration = vbasedev->migration;
> +VFIORegion *region = >region;
> +uint64_t data_offset = 0, data_size = 0;
> +int ret;
> +
> +ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
> +region->fd_offset + offsetof(struct vfio_device_migration_info,
> + data_offset));
> +if (ret != sizeof(data_offset)) {
> +error_report("%s: Failed to get migration buffer data offset %d",
> + vbasedev->name, ret);
> +return -EINVAL;
> +}
> +
> +ret = pread(vbasedev->fd, &data_size, sizeof(data_size),
> +region->fd_offset + offsetof(struct vfio_device_migration_info,
> + data_size));
> +if (ret != sizeof(data_size)) {
> +error_report("%s: Failed to get migration buffer data size %d",
> + vbasedev->name, ret);
> +return -EINVAL;
> +}
data_size should be read first, and if it's 0, data_offset should not be
read at all.

the reasons are below:
1. if there's no data region provided by the vendor driver, there's no
reason to get a valid data_offset, so reading/writing of data_offset
should fail. And this should not be treated as a migration error.

2. even if pending_bytes is 0, vfio_save_iterate() may still be called,
and therefore vfio_save_buffer() is called.
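A minimal sketch of the reordered reads (same pread-based access as in the
patch; error reporting trimmed):

    /* Sketch only: read data_size before data_offset. */
    ret = pread(vbasedev->fd, &data_size, sizeof(data_size),
                region->fd_offset + offsetof(struct vfio_device_migration_info,
                                             data_size));
    if (ret != sizeof(data_size)) {
        return -EINVAL;
    }

    if (data_size == 0) {
        return 0;   /* nothing to transfer; data_offset is never touched */
    }

    ret = pread(vbasedev->fd, &data_offset, sizeof(data_offset),
                region->fd_offset + offsetof(struct vfio_device_migration_info,
                                             data_offset));
    if (ret != sizeof(data_offset)) {
        return -EINVAL;
    }
    /* ... then read data_size bytes starting at data_offset ... */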

Thanks
Yan
> +
> +if (data_size > 0) {
> +void *buf = NULL;
> +bool buffer_mmaped;
> +
> +if (region->mmaps) {
> +buf = find_data_region(region, data_offset, data_size);
> +}
> +
> +buffer_mmaped = (buf != NULL) ? true : false;
> +
> +if (!buffer_mmaped) {
> +buf = g_try_malloc0(data_size);
> +if (!buf) {
> +error_report("%s: Error allocating buffer ", __func__);
> +return -ENOMEM;
> +}
> +

Re: [PATCH v1 2/2] Sample mtty: Add migration capability to mtty module

2020-05-07 Thread Yan Zhao
On Thu, May 07, 2020 at 11:19:40AM +0530, Kirti Wankhede wrote:
> 
> 
> On 5/7/2020 6:31 AM, Yan Zhao wrote:
> > On Tue, May 05, 2020 at 01:54:20AM +0800, Kirti Wankhede wrote:
> > > This patch makes mtty device migration capable. Purpose of this code is
> > > to test migration interface. Only stop-and-copy phase is implemented.
> > > Postcopy migration is not supported.
> > > 
> > > Actual data for mtty device migration is very small. Dummy data is
> > > appended to the migration data stream, 100 Mbytes by default. Added sysfs
> > > file 'dummy_data_size_MB' to get dummy data size from user, which can be
> > > used to check performance based on data size. During resuming, dummy data
> > > is read and discarded.
> > > 
> > > Signed-off-by: Kirti Wankhede 
> > > ---
> > >   samples/vfio-mdev/mtty.c | 602 ---
> > >   1 file changed, 574 insertions(+), 28 deletions(-)
> > > 
> > > diff --git a/samples/vfio-mdev/mtty.c b/samples/vfio-mdev/mtty.c
> > > index bf666cce5bb7..f9194234fc6a 100644
> > > --- a/samples/vfio-mdev/mtty.c
> > > +++ b/samples/vfio-mdev/mtty.c
> > > @@ -44,9 +44,23 @@
> > >   #define MTTY_STRING_LEN 16
> > > -#define MTTY_CONFIG_SPACE_SIZE  0xff
> > > -#define MTTY_IO_BAR_SIZE        0x8
> > > -#define MTTY_MMIO_BAR_SIZE      0x10
> > > +#define MTTY_CONFIG_SPACE_SIZE       0xff
> > > +#define MTTY_IO_BAR_SIZE             0x8
> > > +#define MTTY_MMIO_BAR_SIZE           0x10
> > > +#define MTTY_MIGRATION_REGION_SIZE   0x1000000   // 16M
> > > +
> > > +#define MTTY_MIGRATION_REGION_INDEX  VFIO_PCI_NUM_REGIONS
> > > +#define MTTY_REGIONS_MAX (MTTY_MIGRATION_REGION_INDEX + 1)
> > > +
> > > +/* Data section start from page aligned offset */
> > > +#define MTTY_MIGRATION_REGION_DATA_OFFSET    (0x1000)
> > > +
> > > +/* First page is used for struct vfio_device_migration_info */
> > > +#define MTTY_MIGRATION_REGION_SIZE_MMAP \
> > > + (MTTY_MIGRATION_REGION_SIZE - MTTY_MIGRATION_REGION_DATA_OFFSET)
> > > +
> > > +#define MIGRATION_INFO_OFFSET(MEMBER)\
> > > + offsetof(struct vfio_device_migration_info, MEMBER)
> > >   #define STORE_LE16(addr, val)   (*(u16 *)addr = val)
> > >   #define STORE_LE32(addr, val)   (*(u32 *)addr = val)
> > > @@ -129,6 +143,28 @@ struct serial_port {
> > >   u8 intr_trigger_level;  /* interrupt trigger level */
> > >   };
> > > +/* Migration packet */
> > > +#define PACKET_ID    (u16)(0xfeedbaba)
> > > +
> > > +#define PACKET_FLAGS_ACTUAL_DATA (1 << 0)
> > > +#define PACKET_FLAGS_DUMMY_DATA  (1 << 1)
> > > +
> > > +#define PACKET_DATA_SIZE_MAX (8 * 1024 * 1024)
> > > +
> > > +struct packet {
> > > + u16 id;
> > > + u16 flags;
> > > + u32 data_size;
> > > + u8 data[];
> > > +};
> > > +
> > > +enum {
> > > + PACKET_STATE_NONE = 0,
> > > + PACKET_STATE_PREPARED,
> > > + PACKET_STATE_COPIED,
> > > + PACKET_STATE_LAST,
> > > +};
> > > +
> > >   /* State of each mdev device */
> > >   struct mdev_state {
> > >   int irq_fd;
> > > @@ -138,22 +174,37 @@ struct mdev_state {
> > >   u8 *vconfig;
> > >   struct mutex ops_lock;
> > >   struct mdev_device *mdev;
> > > - struct mdev_region_info region_info[VFIO_PCI_NUM_REGIONS];
> > > - u32 bar_mask[VFIO_PCI_NUM_REGIONS];
> > > + struct mdev_region_info region_info[MTTY_REGIONS_MAX];
> > > + u32 bar_mask[MTTY_REGIONS_MAX];
> > >   struct list_head next;
> > >   struct serial_port s[2];
> > >   struct mutex rxtx_lock;
> > >   struct vfio_device_info dev_info;
> > > - int nr_ports;
> > > + u32 nr_ports;
> > >   /* List of pinned gpfns, gpfn as index and content is 
> > > translated hpfn */
> > >   unsigned long *gpfn_to_hpfn;
> > >   struct notifier_block nb;
> > > +
> > > + u32 device_state;
> > > + u64 saved_size;
> > > + void *mig_region_base;
> > > + bool is_actual_data_sent;
> > > + struct packet *pkt;
> > > + u32 packet_state;
> > > + u64 dummy_data_size;
> > >   };
> > >   static struct mu

Re: [PATCH v1 2/2] Sample mtty: Add migration capability to mtty module

2020-05-06 Thread Yan Zhao
On Tue, May 05, 2020 at 01:54:20AM +0800, Kirti Wankhede wrote:
> This patch makes mtty device migration capable. Purpose of this code is
> to test migration interface. Only stop-and-copy phase is implemented.
> Postcopy migration is not supported.
> 
> Actual data for mtty device migration is very small. Dummy data is
> appended to the migration data stream, 100 Mbytes by default. Added sysfs
> file 'dummy_data_size_MB' to get dummy data size from user, which can be
> used to check performance based on data size. During resuming, dummy data
> is read and discarded.
> 
> Signed-off-by: Kirti Wankhede 
> ---
>  samples/vfio-mdev/mtty.c | 602 ---
>  1 file changed, 574 insertions(+), 28 deletions(-)
> 
> diff --git a/samples/vfio-mdev/mtty.c b/samples/vfio-mdev/mtty.c
> index bf666cce5bb7..f9194234fc6a 100644
> --- a/samples/vfio-mdev/mtty.c
> +++ b/samples/vfio-mdev/mtty.c
> @@ -44,9 +44,23 @@
>  
>  #define MTTY_STRING_LEN  16
>  
> -#define MTTY_CONFIG_SPACE_SIZE  0xff
> -#define MTTY_IO_BAR_SIZE        0x8
> -#define MTTY_MMIO_BAR_SIZE      0x10
> +#define MTTY_CONFIG_SPACE_SIZE       0xff
> +#define MTTY_IO_BAR_SIZE             0x8
> +#define MTTY_MMIO_BAR_SIZE           0x10
> +#define MTTY_MIGRATION_REGION_SIZE   0x1000000   // 16M
> +
> +#define MTTY_MIGRATION_REGION_INDEX  VFIO_PCI_NUM_REGIONS
> +#define MTTY_REGIONS_MAX (MTTY_MIGRATION_REGION_INDEX + 1)
> +
> +/* Data section start from page aligned offset */
> +#define MTTY_MIGRATION_REGION_DATA_OFFSET    (0x1000)
> +
> +/* First page is used for struct vfio_device_migration_info */
> +#define MTTY_MIGRATION_REGION_SIZE_MMAP \
> + (MTTY_MIGRATION_REGION_SIZE - MTTY_MIGRATION_REGION_DATA_OFFSET)
> +
> +#define MIGRATION_INFO_OFFSET(MEMBER)\
> + offsetof(struct vfio_device_migration_info, MEMBER)
>  
>  #define STORE_LE16(addr, val)   (*(u16 *)addr = val)
>  #define STORE_LE32(addr, val)   (*(u32 *)addr = val)
> @@ -129,6 +143,28 @@ struct serial_port {
>   u8 intr_trigger_level;  /* interrupt trigger level */
>  };
>  
> +/* Migration packet */
> +#define PACKET_ID    (u16)(0xfeedbaba)
> +
> +#define PACKET_FLAGS_ACTUAL_DATA (1 << 0)
> +#define PACKET_FLAGS_DUMMY_DATA  (1 << 1)
> +
> +#define PACKET_DATA_SIZE_MAX (8 * 1024 * 1024)
> +
> +struct packet {
> + u16 id;
> + u16 flags;
> + u32 data_size;
> + u8 data[];
> +};
> +
> +enum {
> + PACKET_STATE_NONE = 0,
> + PACKET_STATE_PREPARED,
> + PACKET_STATE_COPIED,
> + PACKET_STATE_LAST,
> +};
> +
>  /* State of each mdev device */
>  struct mdev_state {
>   int irq_fd;
> @@ -138,22 +174,37 @@ struct mdev_state {
>   u8 *vconfig;
>   struct mutex ops_lock;
>   struct mdev_device *mdev;
> - struct mdev_region_info region_info[VFIO_PCI_NUM_REGIONS];
> - u32 bar_mask[VFIO_PCI_NUM_REGIONS];
> + struct mdev_region_info region_info[MTTY_REGIONS_MAX];
> + u32 bar_mask[MTTY_REGIONS_MAX];
>   struct list_head next;
>   struct serial_port s[2];
>   struct mutex rxtx_lock;
>   struct vfio_device_info dev_info;
> - int nr_ports;
> + u32 nr_ports;
>  
>   /* List of pinned gpfns, gpfn as index and content is translated hpfn */
>   unsigned long *gpfn_to_hpfn;
>   struct notifier_block nb;
> +
> + u32 device_state;
> + u64 saved_size;
> + void *mig_region_base;
> + bool is_actual_data_sent;
> + struct packet *pkt;
> + u32 packet_state;
> + u64 dummy_data_size;
>  };
>  
>  static struct mutex mdev_list_lock;
>  static struct list_head mdev_devices_list;
>  
> +/*
> + * Default dummy data size set to 100 MB. To change the value of dummy data
> + * size at runtime (but before migration starts), write the size in MB to the
> + * sysfs file dummy_data_size_MB.
> + */
> +static unsigned long user_dummy_data_size = (100 * 1024 * 1024);
> +
>  static const struct file_operations vd_fops = {
>   .owner  = THIS_MODULE,
>  };
> @@ -639,6 +690,288 @@ static void mdev_read_base(struct mdev_state 
> *mdev_state)
>   }
>  }
>  
> +static int save_setup(struct mdev_state *mdev_state)
> +{
> + mdev_state->is_actual_data_sent = false;
> +
> + memset(mdev_state->pkt, 0, sizeof(struct packet) +
> +PACKET_DATA_SIZE_MAX);
> +
> + return 0;
> +}
> +
> +static int set_device_state(struct mdev_state *mdev_state, u32 device_state)
> +{
> + int ret = 0;
> +
> + if (mdev_state->device_state == device_state)
> + return 0;
> +
> + if (device_state & VFIO_DEVICE_STATE_RUNNING) {
> +#if defined(DEBUG)
> + if (device_state & VFIO_DEVICE_STATE_SAVING) {
> + pr_info("%s: %s Pre-copy\n", __func__,
> + dev_name(mdev_dev(mdev_state->mdev)));
> + } else
> + pr_info("%s: %s Running\n", __func__,
> + 

Re: [PATCH Kernel v18 4/7] vfio iommu: Implementation of ioctl for dirty pages tracking.

2020-05-06 Thread Yan Zhao
On Mon, May 04, 2020 at 11:58:56PM +0800, Kirti Wankhede wrote:
> VFIO_IOMMU_DIRTY_PAGES ioctl performs three operations:
> - Start dirty pages tracking while migration is active
> - Stop dirty pages tracking.
> - Get dirty pages bitmap. It's the user space application's responsibility to
>   copy content of dirty pages from source to destination during migration.
> 
> To prevent DoS attack, memory for bitmap is allocated per vfio_dma
> structure. Bitmap size is calculated considering smallest supported page
> size. Bitmap is allocated for all vfio_dmas when dirty logging is enabled
> 
> Bitmap is populated for already pinned pages when bitmap is allocated for
> a vfio_dma with the smallest supported page size. Update bitmap from
> pinning functions when tracking is enabled. When user application queries
> bitmap, check if requested page size is same as page size used to
> populate the bitmap. If it is equal, copy bitmap, but if not equal, return
> error.
> 
> Fixed below error by changing pgsize type from uint64_t to size_t.
> Reported-by: kbuild test robot 
> 
> All errors:
> drivers/vfio/vfio_iommu_type1.c:197: undefined reference to `__udivdi3'
> 
> drivers/vfio/vfio_iommu_type1.c:225: undefined reference to `__udivdi3'
> 
> Signed-off-by: Kirti Wankhede 
> Reviewed-by: Neo Jia 
> ---
>  drivers/vfio/vfio_iommu_type1.c | 266 +++-
>  1 file changed, 260 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index fa735047b04d..01dcb417836f 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -71,6 +71,7 @@ struct vfio_iommu {
>   unsigned int        dma_avail;
>   bool                v2;
>   bool                nesting;
> + bool                dirty_page_tracking;
>  };
>  
>  struct vfio_domain {
> @@ -91,6 +92,7 @@ struct vfio_dma {
>   bool                lock_cap;   /* capable(CAP_IPC_LOCK) */
>   struct task_struct  *task;
>   struct rb_root  pfn_list;   /* Ex-user pinned pfn list */
> + unsigned long   *bitmap;
>  };
>  
>  struct vfio_group {
> @@ -125,7 +127,21 @@ struct vfio_regions {
>  #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)  \
>   (!list_empty(&iommu->domain_list))
>  
> +#define DIRTY_BITMAP_BYTES(n)    (ALIGN(n, BITS_PER_TYPE(u64)) / BITS_PER_BYTE)
> +
> +/*
> + * Input argument of number of bits to bitmap_set() is unsigned integer, which
> + * further casts to signed integer for unaligned multi-bit operation,
> + * __bitmap_set().
> + * Then maximum bitmap size supported is 2^31 bits divided by 2^3 bits/byte,
> + * that is 2^28 (256 MB) which maps to 2^31 * 2^12 = 2^43 (8TB) on 4K page
> + * system.
> + */
> +#define DIRTY_BITMAP_PAGES_MAX   ((u64)INT_MAX)
> +#define DIRTY_BITMAP_SIZE_MAX    DIRTY_BITMAP_BYTES(DIRTY_BITMAP_PAGES_MAX)
> +
>  static int put_pfn(unsigned long pfn, int prot);
> +static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu);
>  
>  /*
>   * This code handles mapping and unmapping of user data buffers
> @@ -175,6 +191,77 @@ static void vfio_unlink_dma(struct vfio_iommu *iommu, struct vfio_dma *old)
>   rb_erase(&old->node, &iommu->dma_list);
>  }
>  
> +
> +static int vfio_dma_bitmap_alloc(struct vfio_dma *dma, size_t pgsize)
> +{
> + uint64_t npages = dma->size / pgsize;
> +
> + if (npages > DIRTY_BITMAP_PAGES_MAX)
> + return -EINVAL;
> +
> + dma->bitmap = kvzalloc(DIRTY_BITMAP_BYTES(npages), GFP_KERNEL);
> + if (!dma->bitmap)
> + return -ENOMEM;
> +
> + return 0;
> +}
> +
> +static void vfio_dma_bitmap_free(struct vfio_dma *dma)
> +{
> + kfree(dma->bitmap);
> + dma->bitmap = NULL;
> +}
> +
> +static void vfio_dma_populate_bitmap(struct vfio_dma *dma, size_t pgsize)
> +{
> + struct rb_node *p;
> +
> + if (RB_EMPTY_ROOT(&dma->pfn_list))
> + return;
> +
> + for (p = rb_first(&dma->pfn_list); p; p = rb_next(p)) {
> + struct vfio_pfn *vpfn = rb_entry(p, struct vfio_pfn, node);
> +
> + bitmap_set(dma->bitmap, (vpfn->iova - dma->iova) / pgsize, 1);
> + }
> +}
> +
> +static int vfio_dma_bitmap_alloc_all(struct vfio_iommu *iommu, size_t pgsize)
> +{
> + struct rb_node *n = rb_first(&iommu->dma_list);
> +
> + for (; n; n = rb_next(n)) {
> + struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
> + int ret;
> +
> + ret = vfio_dma_bitmap_alloc(dma, pgsize);
> + if (ret) {
> + struct rb_node *p = rb_prev(n);
> +
> + for (; p; p = rb_prev(p)) {
> + struct vfio_dma *dma = rb_entry(n,
> + struct vfio_dma, node);
> +
> + vfio_dma_bitmap_free(dma);
> + }
> + return ret;
> + 

Re: [PATCH v16 QEMU 08/16] vfio: Register SaveVMHandlers for VFIO device

2020-05-06 Thread Yan Zhao
On Tue, May 05, 2020 at 12:37:26PM +0800, Alex Williamson wrote:
> On Tue, 5 May 2020 04:49:10 +0530
> Kirti Wankhede  wrote:
> 
> > On 3/26/2020 2:32 AM, Alex Williamson wrote:
> > > On Wed, 25 Mar 2020 02:39:06 +0530
> > > Kirti Wankhede  wrote:
> > >   
> > >> Define flags to be used as delimiter in migration file stream.
> > >> Added .save_setup and .save_cleanup functions. Mapped & unmapped 
> > >> migration
> > >> region from these functions at source during saving or pre-copy phase.
> > >> Set VFIO device state depending on VM's state. During live migration, VM 
> > >> is
> > >> running when .save_setup is called, _SAVING | _RUNNING state is set for 
> > >> VFIO
> > >> device. During save-restore, VM is paused, _SAVING state is set for VFIO 
> > >> device.
> > >>
> > >> Signed-off-by: Kirti Wankhede 
> > >> Reviewed-by: Neo Jia 
> > >> ---
> > >>   hw/vfio/migration.c  | 76
> > >>   hw/vfio/trace-events |  2 ++
> > >>   2 files changed, 78 insertions(+)
> > >>
> > >> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> > >> index 22ded9d28cf3..033f76526e49 100644
> > >> --- a/hw/vfio/migration.c
> > >> +++ b/hw/vfio/migration.c
> > >> @@ -8,6 +8,7 @@
> > >>*/
> > >>   
> > >>   #include "qemu/osdep.h"
> > >> +#include "qemu/main-loop.h"
> > >> +#include <linux/vfio.h>
> > >>   
> > >>   #include "sysemu/runstate.h"
> > >> @@ -24,6 +25,17 @@
> > >>   #include "pci.h"
> > >>   #include "trace.h"
> > >>   
> > >> +/*
> > >> + * Flags used as delimiter:
> > >> + * 0xffffffff => MSB 32-bit all 1s
> > >> + * 0xef10     => emulated (virtual) function IO
> > >> + * 0x0000     => 16-bits reserved for flags
> > >> + */
> > >> +#define VFIO_MIG_FLAG_END_OF_STATE      (0xffffffffef100001ULL)
> > >> +#define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
> > >> +#define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
> > >> +#define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
> > >> +
> > >>   static void vfio_migration_region_exit(VFIODevice *vbasedev)
> > >>   {
> > >>   VFIOMigration *migration = vbasedev->migration;
> > >> @@ -126,6 +138,69 @@ static int vfio_migration_set_state(VFIODevice 
> > >> *vbasedev, uint32_t mask,
> > >>   return 0;
> > >>   }
> > >>   
> > >> +/* ---------------------------------------------------------------------- */
> > >> +
> > >> +static int vfio_save_setup(QEMUFile *f, void *opaque)
> > >> +{
> > >> +VFIODevice *vbasedev = opaque;
> > >> +VFIOMigration *migration = vbasedev->migration;
> > >> +int ret;
> > >> +
> > >> +qemu_put_be64(f, VFIO_MIG_FLAG_DEV_SETUP_STATE);
> > >> +
> > >> +if (migration->region.mmaps) {
> > >> +qemu_mutex_lock_iothread();
> > >> +ret = vfio_region_mmap(>region);
> > >> +qemu_mutex_unlock_iothread();
> > >> +if (ret) {
> > >> +error_report("%s: Failed to mmap VFIO migration region %d: 
> > >> %s",
> > >> + vbasedev->name, migration->region.index,
> > >> + strerror(-ret));
> > >> +return ret;
> > >> +}
> > >> +}
> > >> +
> > >> +ret = vfio_migration_set_state(vbasedev, ~0, 
> > >> VFIO_DEVICE_STATE_SAVING);
> > >> +if (ret) {
> > >> +error_report("%s: Failed to set state SAVING", vbasedev->name);
> > >> +return ret;
> > >> +}
> > >> +
> > >> +/*
> > >> + * Save migration region size. This is used to verify migration 
> > >> region size
> > >> + * is greater than or equal to migration region size at destination
> > >> + */
> > >> +qemu_put_be64(f, migration->region.size);  
> > > 
> > > Is this requirement supported by the uapi?
> > 
> > Yes, on UAPI thread we discussed this:
> > 
> >   * For the user application, data is opaque. The user application 
> > should write
> >   * data in the same order as the data is received and the data should be of
> >   * same transaction size at the source.
> > 
> > data should be same transaction size, so migration region size should be 
> > greater than or equal to the size at source when verifying at destination.
> 
> We are that user application for which the data is opaque, therefore we
> should make no assumptions about how the vendor driver makes use of
> their region.  If we get a transaction that exceeds the end of the
> region, I agree, that would be an error.  But we have no business
> predicting that such a transaction might occur if the vendor driver
> indicates it can support the migration.
> 
> > > The vendor driver operates
> > > within the migration region, but it has no requirement to use the full
> > > extent of the region.  Shouldn't we instead insert the version string
> > > from versioning API Yan proposed?  Is this were we might choose to use
> > > an interface via the vfio API rather than sysfs if we had one?
> > >  
> > 
> > VFIO API cannot be used by libvirt or management tool stack. We need 
> > sysfs as 

Re: [PATCH v16 QEMU 04/16] vfio: Add save and load functions for VFIO PCI devices

2020-05-06 Thread Yan Zhao
On Tue, May 05, 2020 at 12:37:11PM +0800, Alex Williamson wrote:
> On Tue, 5 May 2020 04:48:37 +0530
> Kirti Wankhede  wrote:
> 
> > On 3/26/2020 1:26 AM, Alex Williamson wrote:
> > > On Wed, 25 Mar 2020 02:39:02 +0530
> > > Kirti Wankhede  wrote:
> > >   
> > >> These functions save and restore PCI device specific data - config
> > >> space of PCI device.
> > >> Tested save and restore with MSI and MSIX type.
> > >>
> > >> Signed-off-by: Kirti Wankhede 
> > >> Reviewed-by: Neo Jia 
> > >> ---
> > >>   hw/vfio/pci.c | 163 ++
> > >>   include/hw/vfio/vfio-common.h |   2 +
> > >>   2 files changed, 165 insertions(+)
> > >>
> > >> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > >> index 6c77c12e44b9..8deb11e87ef7 100644
> > >> --- a/hw/vfio/pci.c
> > >> +++ b/hw/vfio/pci.c
> > >> @@ -41,6 +41,7 @@
> > >>   #include "trace.h"
> > >>   #include "qapi/error.h"
> > >>   #include "migration/blocker.h"
> > >> +#include "migration/qemu-file.h"
> > >>   
> > >>   #define TYPE_VFIO_PCI "vfio-pci"
> > >>   #define PCI_VFIO(obj)   OBJECT_CHECK(VFIOPCIDevice, obj, TYPE_VFIO_PCI)
> > >> @@ -1632,6 +1633,50 @@ static void vfio_bars_prepare(VFIOPCIDevice *vdev)
> > >>   }
> > >>   }
> > >>   
> > >> +static int vfio_bar_validate(VFIOPCIDevice *vdev, int nr)
> > >> +{
> > >> +PCIDevice *pdev = >pdev;
> > >> +VFIOBAR *bar = >bars[nr];
> > >> +uint64_t addr;
> > >> +uint32_t addr_lo, addr_hi = 0;
> > >> +
> > >> +/* Skip unimplemented BARs and the upper half of 64bit BARS. */
> > >> +if (!bar->size) {
> > >> +return 0;
> > >> +}
> > >> +
> > >> +addr_lo = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + nr * 
> > >> 4, 4);
> > >> +
> > >> +addr_lo = addr_lo & (bar->ioport ? PCI_BASE_ADDRESS_IO_MASK :
> > >> +   PCI_BASE_ADDRESS_MEM_MASK);  
> > > 
> > > Nit, &= or combine with previous set.
> > >   
> > >> +if (bar->type == PCI_BASE_ADDRESS_MEM_TYPE_64) {
> > >> +addr_hi = pci_default_read_config(pdev,
> > >> + PCI_BASE_ADDRESS_0 + (nr + 1) 
> > >> * 4, 4);
> > >> +}
> > >> +
> > >> +addr = ((uint64_t)addr_hi << 32) | addr_lo;  
> > > 
> > > Could we use a union?
> > >   
> > >> +
> > >> +if (!QEMU_IS_ALIGNED(addr, bar->size)) {
> > >> +return -EINVAL;
> > >> +}  
> > > 
> > > What specifically are we validating here?  This should be true no
> > > matter what we wrote to the BAR or else BAR emulation is broken.  The
> > > bits that could make this unaligned are not implemented in the BAR.
> > >   
> > >> +
> > >> +return 0;
> > >> +}
> > >> +
> > >> +static int vfio_bars_validate(VFIOPCIDevice *vdev)
> > >> +{
> > >> +int i, ret;
> > >> +
> > >> +for (i = 0; i < PCI_ROM_SLOT; i++) {
> > >> +ret = vfio_bar_validate(vdev, i);
> > >> +if (ret) {
> > >> +error_report("vfio: BAR address %d validation failed", i);
> > >> +return ret;
> > >> +}
> > >> +}
> > >> +return 0;
> > >> +}
> > >> +
> > >>   static void vfio_bar_register(VFIOPCIDevice *vdev, int nr)
> > >>   {
> > >>   VFIOBAR *bar = >bars[nr];
> > >> @@ -2414,11 +2459,129 @@ static Object *vfio_pci_get_object(VFIODevice 
> > >> *vbasedev)
> > >>   return OBJECT(vdev);
> > >>   }
> > >>   
> > >> +static void vfio_pci_save_config(VFIODevice *vbasedev, QEMUFile *f)
> > >> +{
> > >> +VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, 
> > >> vbasedev);
> > >> +PCIDevice *pdev = >pdev;
> > >> +uint16_t pci_cmd;
> > >> +int i;
> > >> +
> > >> +for (i = 0; i < PCI_ROM_SLOT; i++) {
> > >> +uint32_t bar;
> > >> +
> > >> +bar = pci_default_read_config(pdev, PCI_BASE_ADDRESS_0 + i * 4, 
> > >> 4);
> > >> +qemu_put_be32(f, bar);
> > >> +}
> > >> +
> > >> +qemu_put_be32(f, vdev->interrupt);
> > >> +if (vdev->interrupt == VFIO_INT_MSI) {
> > >> +uint32_t msi_flags, msi_addr_lo, msi_addr_hi = 0, msi_data;
> > >> +bool msi_64bit;
> > >> +
> > >> +msi_flags = pci_default_read_config(pdev, pdev->msi_cap + 
> > >> PCI_MSI_FLAGS,
> > >> +2);
> > >> +msi_64bit = (msi_flags & PCI_MSI_FLAGS_64BIT);
> > >> +
> > >> +msi_addr_lo = pci_default_read_config(pdev,
> > >> + pdev->msi_cap + 
> > >> PCI_MSI_ADDRESS_LO, 4);
> > >> +qemu_put_be32(f, msi_addr_lo);
> > >> +
> > >> +if (msi_64bit) {
> > >> +msi_addr_hi = pci_default_read_config(pdev,
> > >> + pdev->msi_cap + 
> > >> PCI_MSI_ADDRESS_HI,
> > >> + 4);
> > >> +}
> > >> +qemu_put_be32(f, msi_addr_hi);
> > >> +
> > >> +msi_data = pci_default_read_config(pdev,
> > >> +pdev->msi_cap + (msi_64bit ? 

Re: [PATCH v6 1/3] memory: drop guest writes to read-only ram device regions

2020-04-30 Thread Yan Zhao
On Thu, Apr 30, 2020 at 05:40:25PM +0800, Peter Maydell wrote:
> On Thu, 30 Apr 2020 at 09:20, Yan Zhao  wrote:
> >
> > for ram device regions, drop guest writes if the region is read-only.
> >
> > Cc: Philippe Mathieu-Daudé 
> > Reviewed-by: Philippe Mathieu-Daudé 
> > Signed-off-by: Yan Zhao 
> > Signed-off-by: Xin Zeng 
> > ---
> >  memory.c | 15 ---
> >  1 file changed, 12 insertions(+), 3 deletions(-)
> >
> > diff --git a/memory.c b/memory.c
> > index 601b749906..a1bba985b9 100644
> > --- a/memory.c
> > +++ b/memory.c
> > @@ -34,6 +34,7 @@
> >  #include "sysemu/accel.h"
> >  #include "hw/boards.h"
> >  #include "migration/vmstate.h"
> > +#include "qemu/log.h"
> >
> >  //#define DEBUG_UNASSIGNED
> >
> > @@ -1307,12 +1308,19 @@ static uint64_t memory_region_ram_device_read(void 
> > *opaque,
> >  return data;
> >  }
> >
> > -static void memory_region_ram_device_write(void *opaque, hwaddr addr,
> > -   uint64_t data, unsigned size)
> > +static MemTxResult memory_region_ram_device_write(void *opaque, hwaddr 
> > addr,
> > +  uint64_t data, unsigned 
> > size,
> > +  MemTxAttrs attrs)
> >  {
> >  MemoryRegion *mr = opaque;
> >
> >  trace_memory_region_ram_device_write(get_cpu_index(), mr, addr, data, 
> > size);
> > +if (mr->readonly) {
> > +qemu_log_mask(LOG_GUEST_ERROR,
> > +  "Invalid write to read-only ram device region addr 
> > 0x%"
> > +  HWADDR_PRIx" size %u\n", addr, size);
> > +return MEMTX_ERROR;
> > +}
> 
> This does not "drop" a write to a r/o region -- it causes it to generate
> whatever the guest architecture's equivalent of a bus error is (eg data
> abort on Arm).
>
hmm, I'm not sure. so your expectation is to silently drop guest writes
without any bus error, right?

> More generally, this change seems a bit odd: currently we do not
> check the mr->readonly flag here, but in general guests don't get
> to write to ROM areas. Where is that check currently done, and
it's not a ROM, but a ram region backed by a device. we wish this region
to be read-only sometimes, in order to implement some useful features.
It can be a virtual BAR region in a virtual mdev device.

> should the vfio case you're trying to fix do its check in whatever
> the equivalent of that place is? Alternatively, if we want to make
> memory_region_ram_device_write() do the check, does that mean we
> now have unnecessary checks elsewhere.
currently, vfio implements the BAR regions in two types:
1. non-mmap'd,  meaning this region will not be added into kvm memory
slots, and whenever guest accesses it, it will be trapped into a host
handler. we do the read-only check in patch 2 of this series.
2. mmap'd, meaning this region will be added into kvm memory slots, and
guest could access it without any hypervisor intervening.
so without patch 3 in the series, there's no write protection to guest
writes.
after setting this mmap'd region to read-only in patch 3, the
corresponding memory slot in kvm is set to read-only, so only guest
writes would be trapped into host, i.e. into the
memory_region_ram_device_write(). guest reads still stay within the guest
without hypervisor intervening.
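For background, the trap on the mmap'd path comes from the region being
registered as a read-only KVM memslot. A sketch of the kernel uAPI involved
(slot, gpa and hva are placeholder parameters):

    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* A read-only slot makes guest writes exit to userspace, where QEMU
     * dispatches them to the MemoryRegion's write handler. */
    static int set_readonly_slot(int vm_fd, __u32 slot, __u64 gpa,
                                 __u64 size, void *hva)
    {
        struct kvm_userspace_memory_region mem = {
            .slot            = slot,
            .flags           = KVM_MEM_READONLY,
            .guest_phys_addr = gpa,
            .memory_size     = size,
            .userspace_addr  = (unsigned long)hva,
        };
        return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &mem);
    }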


> 
> My guess is that memory_region_ram_device_write() isn't the
> right place to check for read-only-ness, because it only applies
> to RAM-backed MRs, not to any other kind of MR which might equally
> be readonly.
>
there might be other MRs that require checking of read-only-ness.
but their handlers have the right to be called to know it has happened,
and they might want to do some special handling of it. That's why I did
not put the check in the general dispatcher.

Thanks
Yan




[PATCH v6 3/3] hw/vfio: let read-only flag take effect for mmap'd regions

2020-04-30 Thread Yan Zhao
along side setting host page table to be read-only, the memory regions
are also required to be read-only, so that when guest writes to the
read-only & mmap'd regions, vmexits would happen and region write handlers
are called.

Reviewed-by: Philippe Mathieu-Daudé 
Signed-off-by: Yan Zhao 
Signed-off-by: Xin Zeng 
---
 hw/vfio/common.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 2a4fedfeaa..bf510e66c0 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -980,6 +980,10 @@ int vfio_region_mmap(VFIORegion *region)
   name, region->mmaps[i].size,
   region->mmaps[i].mmap);
 g_free(name);
+
+if (!(region->flags & VFIO_REGION_INFO_FLAG_WRITE)) {
+memory_region_set_readonly(&region->mmaps[i].mem, true);
+}
 memory_region_add_subregion(region->mem, region->mmaps[i].offset,
 &region->mmaps[i].mem);
 
-- 
2.17.1




[PATCH v6 2/3] hw/vfio: drop guest writes to ro regions

2020-04-30 Thread Yan Zhao
for vfio regions that are without write permission,
drop guest writes to those regions.

Cc: Philippe Mathieu-Daudé 
Reviewed-by: Philippe Mathieu-Daudé 
Signed-off-by: Yan Zhao 
Signed-off-by: Xin Zeng 
---
 hw/vfio/common.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 0b3593b3c0..2a4fedfeaa 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -38,6 +38,7 @@
 #include "sysemu/reset.h"
 #include "trace.h"
 #include "qapi/error.h"
+#include "qemu/log.h"
 
 VFIOGroupList vfio_group_list =
 QLIST_HEAD_INITIALIZER(vfio_group_list);
@@ -190,6 +191,16 @@ void vfio_region_write(void *opaque, hwaddr addr,
 uint64_t qword;
 } buf;
 
+trace_vfio_region_write(vbasedev->name, region->nr, addr, data, size);
+if (!(region->flags & VFIO_REGION_INFO_FLAG_WRITE)) {
+qemu_log_mask(LOG_GUEST_ERROR,
+  "Invalid write to read only vfio region (%s:region%d"
+  "+0x%"HWADDR_PRIx" size %d)\n", vbasedev->name,
+  region->nr, addr, size);
+
+return;
+}
+
 switch (size) {
 case 1:
 buf.byte = data;
@@ -215,8 +226,6 @@ void vfio_region_write(void *opaque, hwaddr addr,
  addr, data, size);
 }
 
-trace_vfio_region_write(vbasedev->name, region->nr, addr, data, size);
-
 /*
  * A read or write to a BAR always signals an INTx EOI.  This will
  * do nothing if not pending (including not in INTx mode).  We assume
-- 
2.17.1




[PATCH v6 1/3] memory: drop guest writes to read-only ram device regions

2020-04-30 Thread Yan Zhao
for ram device regions, drop guest writes if the region is read-only.

Cc: Philippe Mathieu-Daudé 
Reviewed-by: Philippe Mathieu-Daudé 
Signed-off-by: Yan Zhao 
Signed-off-by: Xin Zeng 
---
 memory.c | 15 ---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/memory.c b/memory.c
index 601b749906..a1bba985b9 100644
--- a/memory.c
+++ b/memory.c
@@ -34,6 +34,7 @@
 #include "sysemu/accel.h"
 #include "hw/boards.h"
 #include "migration/vmstate.h"
+#include "qemu/log.h"
 
 //#define DEBUG_UNASSIGNED
 
@@ -1307,12 +1308,19 @@ static uint64_t memory_region_ram_device_read(void 
*opaque,
 return data;
 }
 
-static void memory_region_ram_device_write(void *opaque, hwaddr addr,
-   uint64_t data, unsigned size)
+static MemTxResult memory_region_ram_device_write(void *opaque, hwaddr addr,
+  uint64_t data, unsigned size,
+  MemTxAttrs attrs)
 {
 MemoryRegion *mr = opaque;
 
 trace_memory_region_ram_device_write(get_cpu_index(), mr, addr, data, 
size);
+if (mr->readonly) {
+qemu_log_mask(LOG_GUEST_ERROR,
+  "Invalid write to read-only ram device region addr 0x%"
+  HWADDR_PRIx" size %u\n", addr, size);
+return MEMTX_ERROR;
+}
 
 switch (size) {
 case 1:
@@ -1328,11 +1336,12 @@ static void memory_region_ram_device_write(void 
*opaque, hwaddr addr,
 *(uint64_t *)(mr->ram_block->host + addr) = data;
 break;
 }
+return MEMTX_OK;
 }
 
 static const MemoryRegionOps ram_device_mem_ops = {
 .read = memory_region_ram_device_read,
-.write = memory_region_ram_device_write,
+.write_with_attrs = memory_region_ram_device_write,
 .endianness = DEVICE_HOST_ENDIAN,
 .valid = {
 .min_access_size = 1,
-- 
2.17.1




[PATCH v6 0/3] drop writes to read-only ram device & vfio regions

2020-04-30 Thread Yan Zhao
guest writes to read-only memory regions need to be dropped.

patch 1 modifies handler of ram device memory regions to drop guest writes
to read-only ram device memory regions

patch 2 modifies handler of non-mmap'd read-only vfio regions to drop guest
writes to those regions 

patch 3 set read-only flag to mmap'd read-only vfio regions, so that guest
writes to those regions would be trapped.
without patch 1, host qemu would then crash on guest write to those
read-only regions.
with patch 1, host qemu would drop the writes.

Changelog:
v6:
-fixed two style alignment problems in patch 1. (Philippe)

v5:
-changed write handler of ram device memory region from .write to
.write_with_attrs in patch 1 (Paolo)
(for vfio region in patch 2, I still keep the operations as .read & .write.
the reasons are:
1. vfio_region_ops are for mmio/pio regions. the top level read/write
dispatcher in kvm just ignores their return values. (the return value of
address_space_rw() is just ignored)
2. there are a lot of callers to vfio_region_read() and
vfio_region_write(), who actually do not care about the return values
)
-minor changes on text format in error logs.

v4:
-instead of modifying tracing log, added qemu_log_mask(LOG_GUEST_ERROR...)
to log guest writes to read-only regions (Philippe)

v3:
-refreshed and Cc Stefan for reviewing of tracing part

v2:
-split one big patches into smaller ones (Philippe)
-modify existing trace to record guest writes to read-only memory (Alex)
-modify vfio_region_write() to drop guest writes to non-mmap'd read-only
 region (Alex)



Yan Zhao (3):
  memory: drop guest writes to read-only ram device regions
  hw/vfio: drop guest writes to ro regions
  hw/vfio: let read-only flag take effect for mmap'd regions

 hw/vfio/common.c | 17 +++--
 memory.c | 15 ---
 2 files changed, 27 insertions(+), 5 deletions(-)

-- 
2.17.1




Re: [PATCH v5 1/3] memory: drop guest writes to read-only ram device regions

2020-04-30 Thread Yan Zhao
On Thu, Apr 30, 2020 at 03:07:21PM +0800, Philippe Mathieu-Daudé wrote:
> On 4/30/20 7:19 AM, Yan Zhao wrote:
> > for ram device regions, drop guest writes if the region is read-only.
> > 
> > Cc: Philippe Mathieu-Daudé 
> > Signed-off-by: Yan Zhao 
> > Signed-off-by: Xin Zeng 
> > ---
> >   memory.c | 15 ---
> >   1 file changed, 12 insertions(+), 3 deletions(-)
> > 
> > diff --git a/memory.c b/memory.c
> > index 601b749906..90a748912f 100644
> > --- a/memory.c
> > +++ b/memory.c
> > @@ -34,6 +34,7 @@
> >   #include "sysemu/accel.h"
> >   #include "hw/boards.h"
> >   #include "migration/vmstate.h"
> > +#include "qemu/log.h"
> >   
> >   //#define DEBUG_UNASSIGNED
> >   
> > @@ -1307,12 +1308,19 @@ static uint64_t memory_region_ram_device_read(void 
> > *opaque,
> >   return data;
> >   }
> >   
> > -static void memory_region_ram_device_write(void *opaque, hwaddr addr,
> > -   uint64_t data, unsigned size)
> > +static MemTxResult memory_region_ram_device_write(void *opaque, hwaddr 
> > addr,
> > +   uint64_t data, unsigned size,
> > +   MemTxAttrs attrs)
> 
> Style alignment is now off and can be adjusted easily.
> 
> >   {
> >   MemoryRegion *mr = opaque;
> >   
> >   trace_memory_region_ram_device_write(get_cpu_index(), mr, addr, data, 
> > size);
> > +if (mr->readonly) {
> > +qemu_log_mask(LOG_GUEST_ERROR,
> > +  "Invalid write to read only ram device region addr 
> > 0x%"
> > +   HWADDR_PRIx" size %u\n", addr, size);
> 
> Style alignment of here too.
> 
> Otherwise:
> Reviewed-by: Philippe Mathieu-Daudé 

Thanks! I'll update it right now!

> 
> > +return MEMTX_ERROR;
> > +}
> >   
> >   switch (size) {
> >   case 1:
> > @@ -1328,11 +1336,12 @@ static void memory_region_ram_device_write(void 
> > *opaque, hwaddr addr,
> >   *(uint64_t *)(mr->ram_block->host + addr) = data;
> >   break;
> >   }
> > +return MEMTX_OK;
> >   }
> >   
> >   static const MemoryRegionOps ram_device_mem_ops = {
> >   .read = memory_region_ram_device_read,
> > -.write = memory_region_ram_device_write,
> > +.write_with_attrs = memory_region_ram_device_write,
> >   .endianness = DEVICE_HOST_ENDIAN,
> >   .valid = {
> >   .min_access_size = 1,
> > 
> 



Re: [PATCH v5 2/3] hw/vfio: drop guest writes to ro regions

2020-04-30 Thread Yan Zhao
On Thu, Apr 30, 2020 at 03:02:36PM +0800, Philippe Mathieu-Daudé wrote:
> On 4/30/20 7:23 AM, Yan Zhao wrote:
> > for vfio regions that are without write permission,
> > drop guest writes to those regions.
> > 
> > Cc: Philippe Mathieu-Daudé 
> > Reviewed-by: Philippe Mathieu-Daudé 
> 
> The full domain name is redhat.com.
>
oops. really sorry

> > Signed-off-by: Yan Zhao 
> > Signed-off-by: Xin Zeng 
> > ---
> >   hw/vfio/common.c | 13 +++--
> >   1 file changed, 11 insertions(+), 2 deletions(-)
> > 
> > diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> > index 0b3593b3c0..2a4fedfeaa 100644
> > --- a/hw/vfio/common.c
> > +++ b/hw/vfio/common.c
> > @@ -38,6 +38,7 @@
> >   #include "sysemu/reset.h"
> >   #include "trace.h"
> >   #include "qapi/error.h"
> > +#include "qemu/log.h"
> >   
> >   VFIOGroupList vfio_group_list =
> >   QLIST_HEAD_INITIALIZER(vfio_group_list);
> > @@ -190,6 +191,16 @@ void vfio_region_write(void *opaque, hwaddr addr,
> >   uint64_t qword;
> >   } buf;
> >   
> > +trace_vfio_region_write(vbasedev->name, region->nr, addr, data, size);
> > +if (!(region->flags & VFIO_REGION_INFO_FLAG_WRITE)) {
> > +qemu_log_mask(LOG_GUEST_ERROR,
> > +  "Invalid write to read only vfio region (%s:region%d"
> > +  "+0x%"HWADDR_PRIx" size %d)\n", vbasedev->name,
> > +  region->nr, addr, size);
> > +
> > +return;
> > +}
> > +
> >   switch (size) {
> >   case 1:
> >   buf.byte = data;
> > @@ -215,8 +226,6 @@ void vfio_region_write(void *opaque, hwaddr addr,
> >addr, data, size);
> >   }
> >   
> > -trace_vfio_region_write(vbasedev->name, region->nr, addr, data, size);
> > -
> >   /*
> >* A read or write to a BAR always signals an INTx EOI.  This will
> >* do nothing if not pending (including not in INTx mode).  We assume
> > 
> 



[PATCH v5 3/3] hw/vfio: let read-only flag take effect for mmap'd regions

2020-04-29 Thread Yan Zhao
along side setting host page table to be read-only, the memory regions
are also required to be read-only, so that when guest writes to the
read-only & mmap'd regions, vmexits would happen and region write handlers
are called.

Reviewed-by: Philippe Mathieu-Daudé 
Signed-off-by: Yan Zhao 
Signed-off-by: Xin Zeng 
---
 hw/vfio/common.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 2a4fedfeaa..bf510e66c0 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -980,6 +980,10 @@ int vfio_region_mmap(VFIORegion *region)
   name, region->mmaps[i].size,
   region->mmaps[i].mmap);
 g_free(name);
+
+if (!(region->flags & VFIO_REGION_INFO_FLAG_WRITE)) {
+memory_region_set_readonly(&region->mmaps[i].mem, true);
+}
 memory_region_add_subregion(region->mem, region->mmaps[i].offset,
 &region->mmaps[i].mem);
 
-- 
2.17.1




[PATCH v5 2/3] hw/vfio: drop guest writes to ro regions

2020-04-29 Thread Yan Zhao
for vfio regions that are without write permission,
drop guest writes to those regions.

Cc: Philippe Mathieu-Daudé 
Reviewed-by: Philippe Mathieu-Daudé 
Signed-off-by: Yan Zhao 
Signed-off-by: Xin Zeng 
---
 hw/vfio/common.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 0b3593b3c0..2a4fedfeaa 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -38,6 +38,7 @@
 #include "sysemu/reset.h"
 #include "trace.h"
 #include "qapi/error.h"
+#include "qemu/log.h"
 
 VFIOGroupList vfio_group_list =
 QLIST_HEAD_INITIALIZER(vfio_group_list);
@@ -190,6 +191,16 @@ void vfio_region_write(void *opaque, hwaddr addr,
 uint64_t qword;
 } buf;
 
+trace_vfio_region_write(vbasedev->name, region->nr, addr, data, size);
+if (!(region->flags & VFIO_REGION_INFO_FLAG_WRITE)) {
+qemu_log_mask(LOG_GUEST_ERROR,
+  "Invalid write to read only vfio region (%s:region%d"
+  "+0x%"HWADDR_PRIx" size %d)\n", vbasedev->name,
+  region->nr, addr, size);
+
+return;
+}
+
 switch (size) {
 case 1:
 buf.byte = data;
@@ -215,8 +226,6 @@ void vfio_region_write(void *opaque, hwaddr addr,
  addr, data, size);
 }
 
-trace_vfio_region_write(vbasedev->name, region->nr, addr, data, size);
-
 /*
  * A read or write to a BAR always signals an INTx EOI.  This will
  * do nothing if not pending (including not in INTx mode).  We assume
-- 
2.17.1




[PATCH v5 1/3] memory: drop guest writes to read-only ram device regions

2020-04-29 Thread Yan Zhao
for ram device regions, drop guest writes if the region is read-only.

Cc: Philippe Mathieu-Daudé 
Signed-off-by: Yan Zhao 
Signed-off-by: Xin Zeng 
---
 memory.c | 15 ---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/memory.c b/memory.c
index 601b749906..90a748912f 100644
--- a/memory.c
+++ b/memory.c
@@ -34,6 +34,7 @@
 #include "sysemu/accel.h"
 #include "hw/boards.h"
 #include "migration/vmstate.h"
+#include "qemu/log.h"
 
 //#define DEBUG_UNASSIGNED
 
@@ -1307,12 +1308,19 @@ static uint64_t memory_region_ram_device_read(void 
*opaque,
 return data;
 }
 
-static void memory_region_ram_device_write(void *opaque, hwaddr addr,
-   uint64_t data, unsigned size)
+static MemTxResult memory_region_ram_device_write(void *opaque, hwaddr addr,
+   uint64_t data, unsigned size,
+   MemTxAttrs attrs)
 {
 MemoryRegion *mr = opaque;
 
 trace_memory_region_ram_device_write(get_cpu_index(), mr, addr, data, 
size);
+if (mr->readonly) {
+qemu_log_mask(LOG_GUEST_ERROR,
+  "Invalid write to read only ram device region addr 0x%"
+   HWADDR_PRIx" size %u\n", addr, size);
+return MEMTX_ERROR;
+}
 
 switch (size) {
 case 1:
@@ -1328,11 +1336,12 @@ static void memory_region_ram_device_write(void 
*opaque, hwaddr addr,
 *(uint64_t *)(mr->ram_block->host + addr) = data;
 break;
 }
+return MEMTX_OK;
 }
 
 static const MemoryRegionOps ram_device_mem_ops = {
 .read = memory_region_ram_device_read,
-.write = memory_region_ram_device_write,
+.write_with_attrs = memory_region_ram_device_write,
 .endianness = DEVICE_HOST_ENDIAN,
 .valid = {
 .min_access_size = 1,
-- 
2.17.1




[PATCH v5 0/3] drop writes to read-only ram device & vfio regions

2020-04-29 Thread Yan Zhao
guest writes to read-only memory regions need to be dropped.

patch 1 modifies handler of ram device memory regions to drop guest writes
to read-only ram device memory regions

patch 2 modifies handler of non-mmap'd read-only vfio regions to drop guest
writes to those regions 

patch 3 set read-only flag to mmap'd read-only vfio regions, so that guest
writes to those regions would be trapped.
without patch 1, host qemu would then crash on guest write to those
read-only regions.
with patch 1, host qemu would drop the writes.

Changelog:
v5:
-changed write handler of ram device memory region from .write to
.write_with_attrs in patch 1 (Paolo)
(for vfio region in patch 2, I still keep the operations as .read & .write.
the reasons are:
1. vfio_region_ops are for mmio/pio regions. the top level read/write
dispatcher in kvm just ignores their return values. (the return value of
address_space_rw() is just ignored)
2. there are a lot of callers to vfio_region_read() and
vfio_region_write(), who actually do not care about the return values
)
-minor changes on text format in error logs.

v4:
-instead of modifying tracing log, added qemu_log_mask(LOG_GUEST_ERROR...)
to log guest writes to read-only regions (Philippe)

v3:
-refreshed and Cc Stefan for reviewing of tracing part

v2:
-split one big patches into smaller ones (Philippe)
-modify existing trace to record guest writes to read-only memory (Alex)
-modify vfio_region_write() to drop guest writes to non-mmap'd read-only
 region (Alex)



Yan Zhao (3):
  memory: drop guest writes to read-only ram device regions
  hw/vfio: drop guest writes to ro regions
  hw/vfio: let read-only flag take effect for mmap'd regions

 hw/vfio/common.c | 17 +++--
 memory.c | 15 ---
 2 files changed, 27 insertions(+), 5 deletions(-)

-- 
2.17.1




Re: [PATCH v5 0/4] introduction of migration_version attribute for VFIO live migration

2020-04-29 Thread Yan Zhao
On Wed, Apr 29, 2020 at 10:13:01PM +0800, Eric Blake wrote:
> [meta-comment]
> 
> On 4/29/20 4:35 AM, Yan Zhao wrote:
> > On Wed, Apr 29, 2020 at 04:22:01PM +0800, Dr. David Alan Gilbert wrote:
> [...]
> >>>>>>>>>>>>>>>>> This patchset introduces a migration_version attribute 
> >>>>>>>>>>>>>>>>> under sysfs
> >>>>>>>>>>> of VFIO
> >>>>>>>>>>>>>>>>> Mediated devices.
> 
> Hmm, several pages with up to 16 levels of quoting, with editors making 
> the lines ragged, all before I get to the real meat of the email. 
> Remember, it's okay to trim content,...
> 
> >> So why don't we split the difference; lets say that it should start with
> >> the hex PCI Vendor ID.
> >>
> > The problem is for mdev devices, if the parent devices are not PCI devices,
> > they don't have PCI vendor IDs.
> 
> ...to just what you are replying to.
>
sorry for that. next time I'll try to make a better balance between
keeping conversation background and leaving the real meat of the email.

Thanks for reminding.
Yan



Re: [PATCH v5 0/4] introduction of migration_version attribute for VFIO live migration

2020-04-29 Thread Yan Zhao
On Wed, Apr 29, 2020 at 05:48:44PM +0800, Dr. David Alan Gilbert wrote:

> > > > > > > > > > > > > An mdev type is meant to define a software compatible 
> > > > > > > > > > > > > interface, so in
> > > > > > > > > > > > > the case of mdev->mdev migration, doesn't migrating 
> > > > > > > > > > > > > to a different type
> > > > > > > > > > > > > fail the most basic of compatibility tests that we 
> > > > > > > > > > > > > expect userspace to
> > > > > > > > > > > > > perform?  IOW, if two mdev types are migration 
> > > > > > > > > > > > > compatible, it seems a
> > > > > > > > > > > > > prerequisite to that is that they provide the same 
> > > > > > > > > > > > > software interface,
> > > > > > > > > > > > > which means they should be the same mdev type.
> > > > > > > > > > > > >
> > > > > > > > > > > > > In the hybrid cases of mdev->phys or phys->mdev, how 
> > > > > > > > > > > > > does a
> > > > > > > > > > > > management
> > > > > > > > > > > > > tool begin to even guess what might be compatible?  
> > > > > > > > > > > > > Are we expecting
> > > > > > > > > > > > > libvirt to probe ever device with this attribute in 
> > > > > > > > > > > > > the system?  Is
> > > > > > > > > > > > > there going to be a new class hierarchy created to 
> > > > > > > > > > > > > enumerate all
> > > > > > > > > > > > > possible migrate-able devices?
> > > > > > > > > > > > >
> > > > > > > > > > > > yes, management tool needs to guess and test migration 
> > > > > > > > > > > > compatible
> > > > > > > > > > > > between two devices. But I think it's not the problem 
> > > > > > > > > > > > only for
> > > > > > > > > > > > mdev->phys or phys->mdev. even for mdev->mdev, 
> > > > > > > > > > > > management tool needs
> > > > > > > > > > > > to
> > > > > > > > > > > > first assume that the two mdevs have the same type of 
> > > > > > > > > > > > parent devices
> > > > > > > > > > > > (e.g.their pciids are equal). otherwise, it's still 
> > > > > > > > > > > > enumerating
> > > > > > > > > > > > possibilities.
> > > > > > > > > > > > 
> > > > > > > > > > > > on the other hand, for two mdevs,
> > > > > > > > > > > > mdev1 from pdev1, its mdev_type is 1/2 of pdev1;
> > > > > > > > > > > > mdev2 from pdev2, its mdev_type is 1/4 of pdev2;
> > > > > > > > > > > > if pdev2 is exactly 2 times of pdev1, why not allow 
> > > > > > > > > > > > migration between
> > > > > > > > > > > > mdev1 <-> mdev2.
> > > > > > > > > > > 
> > > > > > > > > > > How could the manage tool figure out that 1/2 of pdev1 is 
> > > > > > > > > > > equivalent 
> > > > > > > > > > > to 1/4 of pdev2? If we really want to allow such thing 
> > > > > > > > > > > happen, the best
> > > > > > > > > > > choice is to report the same mdev type on both pdev1 and 
> > > > > > > > > > > pdev2.
> > > > > > > > > > I think that's exactly the value of this migration_version 
> > > > > > > > > > interface.
> > > > > > > > > > the management tool can take advantage of this interface to 
> > > > > > > > > > know if two
> > > > > > > > > > devices are migration compatible, no matter they are mdevs, 
> > > > > > > > > > non-mdevs,
> > > > > > > > > > or mix.
> > > > > > > > > > 
> > > > > > > > > > as I know, (please correct me if not right), current 
> > > > > > > > > > libvirt still
> > > > > > > > > > requires manually generating mdev devices, and it just 
> > > > > > > > > > duplicates src vm
> > > > > > > > > > configuration to the target vm.
> > > > > > > > > > for libvirt, currently it's always phys->phys and 
> > > > > > > > > > mdev->mdev (and of the
> > > > > > > > > > same mdev type).
> > > > > > > > > > But it does not justify that hybrid cases should not be 
> > > > > > > > > > allowed. otherwise,
> > > > > > > > > > why do we need to introduce this migration_version 
> > > > > > > > > > interface and leave
> > > > > > > > > > the judgement of migration compatibility to vendor driver? 
> > > > > > > > > > why not simply
> > > > > > > > > > set the criteria to something like "pciids of parent 
> > > > > > > > > > devices are equal,
> > > > > > > > > > and mdev types are equal" ?
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > > btw mdev<->phys just brings trouble to upper stack as 
> > > > > > > > > > > Alex pointed out. 
> > > > > > > > > > could you help me understand why it will bring trouble to 
> > > > > > > > > > upper stack?
> > > > > > > > > > 
> > > > > > > > > > I think it just needs to read src migration_version under 
> > > > > > > > > > src dev node,
> > > > > > > > > > and test it in target migration version under target dev 
> > > > > > > > > > node. 
> > > > > > > > > > 
> > > > > > > > > > after all, through this interface we just help the upper 
> > > > > > > > > > layer
> > > > > > > > > > knowing available options through reading and testing, and 
> > > > > > > > > > they decide
> > > > > > > > > > to use it or not.
> > > > > > > > > > 
> > > > > > > > > > > Can we simplify the requirement by allowing only 
> > > > > > > > > > > 

Re: [PATCH v5 0/4] introduction of migration_version attribute for VFIO live migration

2020-04-29 Thread Yan Zhao
On Wed, Apr 29, 2020 at 04:22:01PM +0800, Dr. David Alan Gilbert wrote:
> * Yan Zhao (yan.y.z...@intel.com) wrote:
> > On Tue, Apr 28, 2020 at 10:14:37PM +0800, Dr. David Alan Gilbert wrote:
> > > * Yan Zhao (yan.y.z...@intel.com) wrote:
> > > > On Mon, Apr 27, 2020 at 11:37:43PM +0800, Dr. David Alan Gilbert wrote:
> > > > > * Yan Zhao (yan.y.z...@intel.com) wrote:
> > > > > > On Sat, Apr 25, 2020 at 03:10:49AM +0800, Dr. David Alan Gilbert 
> > > > > > wrote:
> > > > > > > * Yan Zhao (yan.y.z...@intel.com) wrote:
> > > > > > > > On Tue, Apr 21, 2020 at 08:08:49PM +0800, Tian, Kevin wrote:
> > > > > > > > > > From: Yan Zhao
> > > > > > > > > > Sent: Tuesday, April 21, 2020 10:37 AM
> > > > > > > > > > 
> > > > > > > > > > On Tue, Apr 21, 2020 at 06:56:00AM +0800, Alex Williamson 
> > > > > > > > > > wrote:
> > > > > > > > > > > On Sun, 19 Apr 2020 21:24:57 -0400
> > > > > > > > > > > Yan Zhao  wrote:
> > > > > > > > > > >
> > > > > > > > > > > > On Fri, Apr 17, 2020 at 07:24:57PM +0800, Cornelia Huck 
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > On Fri, 17 Apr 2020 05:52:02 -0400
> > > > > > > > > > > > > Yan Zhao  wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > On Fri, Apr 17, 2020 at 04:44:50PM +0800, Cornelia 
> > > > > > > > > > > > > > Huck wrote:
> > > > > > > > > > > > > > > On Mon, 13 Apr 2020 01:52:01 -0400
> > > > > > > > > > > > > > > Yan Zhao  wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > This patchset introduces a migration_version 
> > > > > > > > > > > > > > > > attribute under sysfs
> > > > > > > > > > of VFIO
> > > > > > > > > > > > > > > > Mediated devices.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > This migration_version attribute is used to 
> > > > > > > > > > > > > > > > check migration
> > > > > > > > > > compatibility
> > > > > > > > > > > > > > > > between two mdev devices.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Currently, it has two locations:
> > > > > > > > > > > > > > > > (1) under mdev_type node,
> > > > > > > > > > > > > > > > which can be used even before device 
> > > > > > > > > > > > > > > > creation, but only for
> > > > > > > > > > mdev
> > > > > > > > > > > > > > > > devices of the same mdev type.
> > > > > > > > > > > > > > > > (2) under mdev device node,
> > > > > > > > > > > > > > > > which can only be used after the mdev 
> > > > > > > > > > > > > > > > devices are created, but
> > > > > > > > > > the src
> > > > > > > > > > > > > > > > and target mdev devices are not necessarily 
> > > > > > > > > > > > > > > > be of the same
> > > > > > > > > > mdev type
> > > > > > > > > > > > > > > > (The second location is newly added in v5, in 
> > > > > > > > > > > > > > > > order to keep
> > > > > > > > > > consistent
> > > > > > > > > > > > > > > > with the migration_version node for migratable 
> > > > > > > > > > > > > > > > pass-though
> > > >

Re: [PATCH v5 0/4] introduction of migration_version attribute for VFIO live migration

2020-04-29 Thread Yan Zhao
On Tue, Apr 28, 2020 at 10:14:37PM +0800, Dr. David Alan Gilbert wrote:
> * Yan Zhao (yan.y.z...@intel.com) wrote:
> > On Mon, Apr 27, 2020 at 11:37:43PM +0800, Dr. David Alan Gilbert wrote:
> > > * Yan Zhao (yan.y.z...@intel.com) wrote:
> > > > On Sat, Apr 25, 2020 at 03:10:49AM +0800, Dr. David Alan Gilbert wrote:
> > > > > * Yan Zhao (yan.y.z...@intel.com) wrote:
> > > > > > On Tue, Apr 21, 2020 at 08:08:49PM +0800, Tian, Kevin wrote:
> > > > > > > > From: Yan Zhao
> > > > > > > > Sent: Tuesday, April 21, 2020 10:37 AM
> > > > > > > > 
> > > > > > > > On Tue, Apr 21, 2020 at 06:56:00AM +0800, Alex Williamson wrote:
> > > > > > > > > On Sun, 19 Apr 2020 21:24:57 -0400
> > > > > > > > > Yan Zhao  wrote:
> > > > > > > > >
> > > > > > > > > > On Fri, Apr 17, 2020 at 07:24:57PM +0800, Cornelia Huck 
> > > > > > > > > > wrote:
> > > > > > > > > > > On Fri, 17 Apr 2020 05:52:02 -0400
> > > > > > > > > > > Yan Zhao  wrote:
> > > > > > > > > > >
> > > > > > > > > > > > On Fri, Apr 17, 2020 at 04:44:50PM +0800, Cornelia Huck 
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > On Mon, 13 Apr 2020 01:52:01 -0400
> > > > > > > > > > > > > Yan Zhao  wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > This patchset introduces a migration_version 
> > > > > > > > > > > > > > attribute under sysfs
> > > > > > > > of VFIO
> > > > > > > > > > > > > > Mediated devices.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > This migration_version attribute is used to check 
> > > > > > > > > > > > > > migration
> > > > > > > > compatibility
> > > > > > > > > > > > > > between two mdev devices.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Currently, it has two locations:
> > > > > > > > > > > > > > (1) under mdev_type node,
> > > > > > > > > > > > > > which can be used even before device creation, 
> > > > > > > > > > > > > > but only for
> > > > > > > > mdev
> > > > > > > > > > > > > > devices of the same mdev type.
> > > > > > > > > > > > > > (2) under mdev device node,
> > > > > > > > > > > > > > which can only be used after the mdev devices 
> > > > > > > > > > > > > > are created, but
> > > > > > > > the src
> > > > > > > > > > > > > > and target mdev devices are not necessarily be 
> > > > > > > > > > > > > > of the same
> > > > > > > > mdev type
> > > > > > > > > > > > > > (The second location is newly added in v5, in order 
> > > > > > > > > > > > > > to keep
> > > > > > > > consistent
> > > > > > > > > > > > > > with the migration_version node for migratable 
> > > > > > > > > > > > > > pass-though
> > > > > > > > devices)
> > > > > > > > > > > > >
> > > > > > > > > > > > > What is the relationship between those two attributes?
> > > > > > > > > > > > >
> > > > > > > > > > > > (1) is for mdev devices specifically, and (2) is 
> > > > > > > > > > > > provided to keep the
> > > > > > > > same
> > > > > > > > > > > > sysfs interface as with non-mdev cases. so (2) is for 
> > > > > > > > > > > > both mdev
> > > > > > > > dev

Re: [PATCH v5 0/4] introduction of migration_version attribute for VFIO live migration

2020-04-27 Thread Yan Zhao
On Mon, Apr 27, 2020 at 11:37:43PM +0800, Dr. David Alan Gilbert wrote:
> * Yan Zhao (yan.y.z...@intel.com) wrote:
> > On Sat, Apr 25, 2020 at 03:10:49AM +0800, Dr. David Alan Gilbert wrote:
> > > * Yan Zhao (yan.y.z...@intel.com) wrote:
> > > > On Tue, Apr 21, 2020 at 08:08:49PM +0800, Tian, Kevin wrote:
> > > > > > From: Yan Zhao
> > > > > > Sent: Tuesday, April 21, 2020 10:37 AM
> > > > > > 
> > > > > > On Tue, Apr 21, 2020 at 06:56:00AM +0800, Alex Williamson wrote:
> > > > > > > On Sun, 19 Apr 2020 21:24:57 -0400
> > > > > > > Yan Zhao  wrote:
> > > > > > >
> > > > > > > > On Fri, Apr 17, 2020 at 07:24:57PM +0800, Cornelia Huck wrote:
> > > > > > > > > On Fri, 17 Apr 2020 05:52:02 -0400
> > > > > > > > > Yan Zhao  wrote:
> > > > > > > > >
> > > > > > > > > > On Fri, Apr 17, 2020 at 04:44:50PM +0800, Cornelia Huck 
> > > > > > > > > > wrote:
> > > > > > > > > > > On Mon, 13 Apr 2020 01:52:01 -0400
> > > > > > > > > > > Yan Zhao  wrote:
> > > > > > > > > > >
> > > > > > > > > > > > This patchset introduces a migration_version attribute 
> > > > > > > > > > > > under sysfs
> > > > > > of VFIO
> > > > > > > > > > > > Mediated devices.
> > > > > > > > > > > >
> > > > > > > > > > > > This migration_version attribute is used to check 
> > > > > > > > > > > > migration
> > > > > > compatibility
> > > > > > > > > > > > between two mdev devices.
> > > > > > > > > > > >
> > > > > > > > > > > > Currently, it has two locations:
> > > > > > > > > > > > (1) under mdev_type node,
> > > > > > > > > > > > which can be used even before device creation, but 
> > > > > > > > > > > > only for
> > > > > > mdev
> > > > > > > > > > > > devices of the same mdev type.
> > > > > > > > > > > > (2) under mdev device node,
> > > > > > > > > > > > which can only be used after the mdev devices are 
> > > > > > > > > > > > created, but
> > > > > > the src
> > > > > > > > > > > > and target mdev devices are not necessarily be of 
> > > > > > > > > > > > the same
> > > > > > mdev type
> > > > > > > > > > > > (The second location is newly added in v5, in order to 
> > > > > > > > > > > > keep
> > > > > > consistent
> > > > > > > > > > > > with the migration_version node for migratable 
> > > > > > > > > > > > pass-though
> > > > > > devices)
> > > > > > > > > > >
> > > > > > > > > > > What is the relationship between those two attributes?
> > > > > > > > > > >
> > > > > > > > > > (1) is for mdev devices specifically, and (2) is provided 
> > > > > > > > > > to keep the
> > > > > > same
> > > > > > > > > > sysfs interface as with non-mdev cases. so (2) is for both 
> > > > > > > > > > mdev
> > > > > > devices and
> > > > > > > > > > non-mdev devices.
> > > > > > > > > >
> > > > > > > > > > in future, if we enable vfio-pci vendor ops, (i.e. a 
> > > > > > > > > > non-mdev device
> > > > > > > > > > is binding to vfio-pci, but is able to register migration 
> > > > > > > > > > region and do
> > > > > > > > > > migration transactions from a vendor provided affiliate 
> > > > > > > > > > driver),
> > > > > > > > > > the vendor driver would export (2) directly, under device 
> > >

Re: [PATCH v4 1/3] memory: drop guest writes to read-only ram device regions

2020-04-27 Thread Yan Zhao
On Mon, Apr 27, 2020 at 05:31:48PM +0800, Philippe Mathieu-Daudé wrote:
> On 4/27/20 11:15 AM, Yan Zhao wrote:
> > On Sun, Apr 26, 2020 at 09:04:31AM +0800, Yan Zhao wrote:
> >> On Sat, Apr 25, 2020 at 06:55:33PM +0800, Paolo Bonzini wrote:
> >>> On 17/04/20 09:44, Yan Zhao wrote:
> >>>> for ram device regions, drop guest writes if the regions is read-only.
> >>>>
> >>>> Cc: Philippe Mathieu-Daudé 
> >>>> Signed-off-by: Yan Zhao 
> >>>> Signed-off-by: Xin Zeng 
> >>>> ---
> >>>>   memory.c | 7 +++
> >>>>   1 file changed, 7 insertions(+)
> >>>>
> >>>> diff --git a/memory.c b/memory.c
> >>>> index 601b749906..9576dd6807 100644
> >>>> --- a/memory.c
> >>>> +++ b/memory.c
> >>>> @@ -34,6 +34,7 @@
> >>>>   #include "sysemu/accel.h"
> >>>>   #include "hw/boards.h"
> >>>>   #include "migration/vmstate.h"
> >>>> +#include "qemu/log.h"
> >>>>   
> >>>>   //#define DEBUG_UNASSIGNED
> >>>>   
> >>>> @@ -1313,6 +1314,12 @@ static void memory_region_ram_device_write(void 
> >>>> *opaque, hwaddr addr,
> >>>>   MemoryRegion *mr = opaque;
> >>>>   
> >>>>   trace_memory_region_ram_device_write(get_cpu_index(), mr, addr, 
> >>>> data, size);
> >>>> +if (mr->readonly) {
> >>>> +qemu_log_mask(LOG_GUEST_ERROR,
> >>>> +  "Invalid write to read only ram device region 0x%"
> >>>> +   HWADDR_PRIx" size %u\n", addr, size);
> >>>> +return;
> >>>> +}
> >>>
> >>> As mentioned in the review of v1, memory_region_ram_device_write should
> >>> be changed to a .write_with_attrs operation, so that it can return
> >>> MEMTX_ERROR.
> >>>
> > hi Paolo and Alex,
> > need I also change vfio_region_write() in patch 2 to a .write_with_attrs
> > operation?
> > vfio_region_read() is also possible to fail, so should I change it to a
> > .read_with_attrs, too?
> 
> Yes.
> 
> Please submit your series as a thread, with a cover letter:
> https://wiki.qemu.org/Contribute/SubmitAPatch#Include_a_meaningful_cover_letter
>
hi Philippe
thanks for pointing out this issue.
I just realized this version of the patches was sent separately, though I did
send a cover letter. Not sure what happened; maybe I just forgot to add an
--in-reply-to before sending them out.
I will pay attention to it next time.

Thanks
Yan

> > 
> > Thanks
> > Yan
> > 
> >>> Otherwise this looks good.
> >>>
> >> hi Paolo,
> >> thanks for pointing it out again!
> >> I didn't get your meaning in v1. will update the patch!
> >>
> >> Thanks
> >> Yan
> >>>
> >>
> > 
> 



Re: [PATCH v4 1/3] memory: drop guest writes to read-only ram device regions

2020-04-27 Thread Yan Zhao
On Sun, Apr 26, 2020 at 09:04:31AM +0800, Yan Zhao wrote:
> On Sat, Apr 25, 2020 at 06:55:33PM +0800, Paolo Bonzini wrote:
> > On 17/04/20 09:44, Yan Zhao wrote:
> > > for ram device regions, drop guest writes if the regions is read-only.
> > > 
> > > Cc: Philippe Mathieu-Daudé 
> > > Signed-off-by: Yan Zhao 
> > > Signed-off-by: Xin Zeng 
> > > ---
> > >  memory.c | 7 +++
> > >  1 file changed, 7 insertions(+)
> > > 
> > > diff --git a/memory.c b/memory.c
> > > index 601b749906..9576dd6807 100644
> > > --- a/memory.c
> > > +++ b/memory.c
> > > @@ -34,6 +34,7 @@
> > >  #include "sysemu/accel.h"
> > >  #include "hw/boards.h"
> > >  #include "migration/vmstate.h"
> > > +#include "qemu/log.h"
> > >  
> > >  //#define DEBUG_UNASSIGNED
> > >  
> > > @@ -1313,6 +1314,12 @@ static void memory_region_ram_device_write(void 
> > > *opaque, hwaddr addr,
> > >  MemoryRegion *mr = opaque;
> > >  
> > >  trace_memory_region_ram_device_write(get_cpu_index(), mr, addr, 
> > > data, size);
> > > +if (mr->readonly) {
> > > +qemu_log_mask(LOG_GUEST_ERROR,
> > > +  "Invalid write to read only ram device region 0x%"
> > > +   HWADDR_PRIx" size %u\n", addr, size);
> > > +return;
> > > +}
> > 
> > As mentioned in the review of v1, memory_region_ram_device_write should
> > be changed to a .write_with_attrs operation, so that it can return
> > MEMTX_ERROR.
> >
hi Paolo and Alex,
do I also need to change vfio_region_write() in patch 2 to a .write_with_attrs
operation?
vfio_region_read() can also fail, so should I change it to a
.read_with_attrs, too?

Thanks
Yan

> > Otherwise this looks good.
> > 
> hi Paolo,
> thanks for pointing it out again!
> I didn't get your meaning in v1. will update the patch!
> 
> Thanks
> Yan
> > 
> 



Re: [PATCH v5 0/4] introduction of migration_version attribute for VFIO live migration

2020-04-25 Thread Yan Zhao
On Sat, Apr 25, 2020 at 03:10:49AM +0800, Dr. David Alan Gilbert wrote:
> * Yan Zhao (yan.y.z...@intel.com) wrote:
> > On Tue, Apr 21, 2020 at 08:08:49PM +0800, Tian, Kevin wrote:
> > > > From: Yan Zhao
> > > > Sent: Tuesday, April 21, 2020 10:37 AM
> > > > 
> > > > On Tue, Apr 21, 2020 at 06:56:00AM +0800, Alex Williamson wrote:
> > > > > On Sun, 19 Apr 2020 21:24:57 -0400
> > > > > Yan Zhao  wrote:
> > > > >
> > > > > > On Fri, Apr 17, 2020 at 07:24:57PM +0800, Cornelia Huck wrote:
> > > > > > > On Fri, 17 Apr 2020 05:52:02 -0400
> > > > > > > Yan Zhao  wrote:
> > > > > > >
> > > > > > > > On Fri, Apr 17, 2020 at 04:44:50PM +0800, Cornelia Huck wrote:
> > > > > > > > > On Mon, 13 Apr 2020 01:52:01 -0400
> > > > > > > > > Yan Zhao  wrote:
> > > > > > > > >
> > > > > > > > > > This patchset introduces a migration_version attribute 
> > > > > > > > > > under sysfs
> > > > of VFIO
> > > > > > > > > > Mediated devices.
> > > > > > > > > >
> > > > > > > > > > This migration_version attribute is used to check migration
> > > > compatibility
> > > > > > > > > > between two mdev devices.
> > > > > > > > > >
> > > > > > > > > > Currently, it has two locations:
> > > > > > > > > > (1) under mdev_type node,
> > > > > > > > > > which can be used even before device creation, but only 
> > > > > > > > > > for
> > > > mdev
> > > > > > > > > > devices of the same mdev type.
> > > > > > > > > > (2) under mdev device node,
> > > > > > > > > > which can only be used after the mdev devices are 
> > > > > > > > > > created, but
> > > > the src
> > > > > > > > > > and target mdev devices are not necessarily be of the 
> > > > > > > > > > same
> > > > mdev type
> > > > > > > > > > (The second location is newly added in v5, in order to keep
> > > > consistent
> > > > > > > > > > with the migration_version node for migratable pass-though
> > > > devices)
> > > > > > > > >
> > > > > > > > > What is the relationship between those two attributes?
> > > > > > > > >
> > > > > > > > (1) is for mdev devices specifically, and (2) is provided to 
> > > > > > > > keep the
> > > > same
> > > > > > > > sysfs interface as with non-mdev cases. so (2) is for both mdev
> > > > devices and
> > > > > > > > non-mdev devices.
> > > > > > > >
> > > > > > > > in future, if we enable vfio-pci vendor ops, (i.e. a non-mdev 
> > > > > > > > device
> > > > > > > > is binding to vfio-pci, but is able to register migration 
> > > > > > > > region and do
> > > > > > > > migration transactions from a vendor provided affiliate driver),
> > > > > > > > the vendor driver would export (2) directly, under device node.
> > > > > > > > It is not able to provide (1) as there're no mdev devices 
> > > > > > > > involved.
> > > > > > >
> > > > > > > Ok, creating an alternate attribute for non-mdev devices makes 
> > > > > > > sense.
> > > > > > > However, wouldn't that rather be a case (3)? The change here only
> > > > > > > refers to mdev devices.
> > > > > > >
> > > > > > as you pointed below, (3) and (2) serve the same purpose.
> > > > > > and I think a possible usage is to migrate between a non-mdev 
> > > > > > device and
> > > > > > an mdev device. so I think it's better for them both to use (2) 
> > > > > > rather
> > > > > > than creating (3).
> > > > >
> > > > > An mdev type is meant to define a software compatible interface, so in
> > > >

Re: [PATCH v4 1/3] memory: drop guest writes to read-only ram device regions

2020-04-25 Thread Yan Zhao
On Sat, Apr 25, 2020 at 06:55:33PM +0800, Paolo Bonzini wrote:
> On 17/04/20 09:44, Yan Zhao wrote:
> > for ram device regions, drop guest writes if the regions is read-only.
> > 
> > Cc: Philippe Mathieu-Daudé 
> > Signed-off-by: Yan Zhao 
> > Signed-off-by: Xin Zeng 
> > ---
> >  memory.c | 7 +++
> >  1 file changed, 7 insertions(+)
> > 
> > diff --git a/memory.c b/memory.c
> > index 601b749906..9576dd6807 100644
> > --- a/memory.c
> > +++ b/memory.c
> > @@ -34,6 +34,7 @@
> >  #include "sysemu/accel.h"
> >  #include "hw/boards.h"
> >  #include "migration/vmstate.h"
> > +#include "qemu/log.h"
> >  
> >  //#define DEBUG_UNASSIGNED
> >  
> > @@ -1313,6 +1314,12 @@ static void memory_region_ram_device_write(void 
> > *opaque, hwaddr addr,
> >  MemoryRegion *mr = opaque;
> >  
> >  trace_memory_region_ram_device_write(get_cpu_index(), mr, addr, data, 
> > size);
> > +if (mr->readonly) {
> > +qemu_log_mask(LOG_GUEST_ERROR,
> > +  "Invalid write to read only ram device region 0x%"
> > +   HWADDR_PRIx" size %u\n", addr, size);
> > +return;
> > +}
> 
> As mentioned in the review of v1, memory_region_ram_device_write should
> be changed to a .write_with_attrs operation, so that it can return
> MEMTX_ERROR.
> 
> Otherwise this looks good.
> 
hi Paolo,
thanks for pointing it out again!
I didn't catch your meaning in v1. I will update the patch!

Thanks
Yan
> 



Re: [PATCH v5 0/4] introduction of migration_version attribute for VFIO live migration

2020-04-22 Thread Yan Zhao
On Tue, Apr 21, 2020 at 08:08:49PM +0800, Tian, Kevin wrote:
> > From: Yan Zhao
> > Sent: Tuesday, April 21, 2020 10:37 AM
> > 
> > On Tue, Apr 21, 2020 at 06:56:00AM +0800, Alex Williamson wrote:
> > > On Sun, 19 Apr 2020 21:24:57 -0400
> > > Yan Zhao  wrote:
> > >
> > > > On Fri, Apr 17, 2020 at 07:24:57PM +0800, Cornelia Huck wrote:
> > > > > On Fri, 17 Apr 2020 05:52:02 -0400
> > > > > Yan Zhao  wrote:
> > > > >
> > > > > > On Fri, Apr 17, 2020 at 04:44:50PM +0800, Cornelia Huck wrote:
> > > > > > > On Mon, 13 Apr 2020 01:52:01 -0400
> > > > > > > Yan Zhao  wrote:
> > > > > > >
> > > > > > > > This patchset introduces a migration_version attribute under 
> > > > > > > > sysfs
> > of VFIO
> > > > > > > > Mediated devices.
> > > > > > > >
> > > > > > > > This migration_version attribute is used to check migration
> > compatibility
> > > > > > > > between two mdev devices.
> > > > > > > >
> > > > > > > > Currently, it has two locations:
> > > > > > > > (1) under mdev_type node,
> > > > > > > > which can be used even before device creation, but only for
> > mdev
> > > > > > > > devices of the same mdev type.
> > > > > > > > (2) under mdev device node,
> > > > > > > > which can only be used after the mdev devices are created, 
> > > > > > > > but
> > the src
> > > > > > > > and target mdev devices are not necessarily be of the same
> > mdev type
> > > > > > > > (The second location is newly added in v5, in order to keep
> > consistent
> > > > > > > > with the migration_version node for migratable pass-though
> > devices)
> > > > > > >
> > > > > > > What is the relationship between those two attributes?
> > > > > > >
> > > > > > (1) is for mdev devices specifically, and (2) is provided to keep 
> > > > > > the
> > same
> > > > > > sysfs interface as with non-mdev cases. so (2) is for both mdev
> > devices and
> > > > > > non-mdev devices.
> > > > > >
> > > > > > in future, if we enable vfio-pci vendor ops, (i.e. a non-mdev device
> > > > > > is binding to vfio-pci, but is able to register migration region 
> > > > > > and do
> > > > > > migration transactions from a vendor provided affiliate driver),
> > > > > > the vendor driver would export (2) directly, under device node.
> > > > > > It is not able to provide (1) as there're no mdev devices involved.
> > > > >
> > > > > Ok, creating an alternate attribute for non-mdev devices makes sense.
> > > > > However, wouldn't that rather be a case (3)? The change here only
> > > > > refers to mdev devices.
> > > > >
> > > > as you pointed below, (3) and (2) serve the same purpose.
> > > > and I think a possible usage is to migrate between a non-mdev device and
> > > > an mdev device. so I think it's better for them both to use (2) rather
> > > > than creating (3).
> > >
> > > An mdev type is meant to define a software compatible interface, so in
> > > the case of mdev->mdev migration, doesn't migrating to a different type
> > > fail the most basic of compatibility tests that we expect userspace to
> > > perform?  IOW, if two mdev types are migration compatible, it seems a
> > > prerequisite to that is that they provide the same software interface,
> > > which means they should be the same mdev type.
> > >
> > > In the hybrid cases of mdev->phys or phys->mdev, how does a
> > management
> > > tool begin to even guess what might be compatible?  Are we expecting
> > > libvirt to probe ever device with this attribute in the system?  Is
> > > there going to be a new class hierarchy created to enumerate all
> > > possible migrate-able devices?
> > >
> > yes, management tool needs to guess and test migration compatible
> > between two devices. But I think it's not the problem only for
> > mdev->phys or phys->mdev. even for mdev->mdev, management tool needs
> > 

Re: [PATCH v5 0/4] introduction of migration_version attribute for VFIO live migration

2020-04-20 Thread Yan Zhao
On Tue, Apr 21, 2020 at 06:56:00AM +0800, Alex Williamson wrote:
> On Sun, 19 Apr 2020 21:24:57 -0400
> Yan Zhao  wrote:
> 
> > On Fri, Apr 17, 2020 at 07:24:57PM +0800, Cornelia Huck wrote:
> > > On Fri, 17 Apr 2020 05:52:02 -0400
> > > Yan Zhao  wrote:
> > >   
> > > > On Fri, Apr 17, 2020 at 04:44:50PM +0800, Cornelia Huck wrote:  
> > > > > On Mon, 13 Apr 2020 01:52:01 -0400
> > > > > Yan Zhao  wrote:
> > > > > 
> > > > > > This patchset introduces a migration_version attribute under sysfs 
> > > > > > of VFIO
> > > > > > Mediated devices.
> > > > > > 
> > > > > > This migration_version attribute is used to check migration 
> > > > > > compatibility
> > > > > > between two mdev devices.
> > > > > > 
> > > > > > Currently, it has two locations:
> > > > > > (1) under mdev_type node,
> > > > > > which can be used even before device creation, but only for mdev
> > > > > > devices of the same mdev type.
> > > > > > (2) under mdev device node,
> > > > > > which can only be used after the mdev devices are created, but 
> > > > > > the src
> > > > > > and target mdev devices are not necessarily be of the same mdev 
> > > > > > type
> > > > > > (The second location is newly added in v5, in order to keep 
> > > > > > consistent
> > > > > > with the migration_version node for migratable pass-though devices) 
> > > > > >
> > > > > 
> > > > > What is the relationship between those two attributes?
> > > > > 
> > > > (1) is for mdev devices specifically, and (2) is provided to keep the 
> > > > same
> > > > sysfs interface as with non-mdev cases. so (2) is for both mdev devices 
> > > > and
> > > > non-mdev devices.
> > > > 
> > > > in future, if we enable vfio-pci vendor ops, (i.e. a non-mdev device
> > > > is binding to vfio-pci, but is able to register migration region and do
> > > > migration transactions from a vendor provided affiliate driver),
> > > > the vendor driver would export (2) directly, under device node.
> > > > It is not able to provide (1) as there're no mdev devices involved.  
> > > 
> > > Ok, creating an alternate attribute for non-mdev devices makes sense.
> > > However, wouldn't that rather be a case (3)? The change here only
> > > refers to mdev devices.
> > >  
> > as you pointed below, (3) and (2) serve the same purpose. 
> > and I think a possible usage is to migrate between a non-mdev device and
> > an mdev device. so I think it's better for them both to use (2) rather
> > than creating (3).
> 
> An mdev type is meant to define a software compatible interface, so in
> the case of mdev->mdev migration, doesn't migrating to a different type
> fail the most basic of compatibility tests that we expect userspace to
> perform?  IOW, if two mdev types are migration compatible, it seems a
> prerequisite to that is that they provide the same software interface,
> which means they should be the same mdev type.
> 
> In the hybrid cases of mdev->phys or phys->mdev, how does a management
> tool begin to even guess what might be compatible?  Are we expecting
> libvirt to probe ever device with this attribute in the system?  Is
> there going to be a new class hierarchy created to enumerate all
> possible migrate-able devices?
>
yes, the management tool needs to guess and test migration compatibility
between two devices. But I think that problem is not specific to
mdev->phys or phys->mdev. Even for mdev->mdev, the management tool needs to
first assume that the two mdevs have the same type of parent device
(e.g. their pciids are equal); otherwise, it's still enumerating
possibilities.

on the other hand, consider two mdevs:
mdev1 is from pdev1, and its mdev_type is 1/2 of pdev1;
mdev2 is from pdev2, and its mdev_type is 1/4 of pdev2.
if pdev2 has exactly 2 times the resources of pdev1 (e.g. 1/2 of a pdev1
with 4GB of device memory and 1/4 of a pdev2 with 8GB both yield a 2GB
instance), why not allow migration between mdev1 <-> mdev2?


> I agree that there was a gap in the previous proposal for non-mdev
> devices, but I think this bring a lot of questions that we need to
> puzzle through and libvirt will need to re-evaluate how they might
> decide to pick a migration target device.  For example, I'm sure
> libvirt would reject any policy decisions regarding picking a physical
> device versus an mdev device.  Had we previous

Re: [PATCH v5 0/4] introduction of migration_version attribute for VFIO live migration

2020-04-19 Thread Yan Zhao
On Fri, Apr 17, 2020 at 07:24:57PM +0800, Cornelia Huck wrote:
> On Fri, 17 Apr 2020 05:52:02 -0400
> Yan Zhao  wrote:
> 
> > On Fri, Apr 17, 2020 at 04:44:50PM +0800, Cornelia Huck wrote:
> > > On Mon, 13 Apr 2020 01:52:01 -0400
> > > Yan Zhao  wrote:
> > >   
> > > > This patchset introduces a migration_version attribute under sysfs of 
> > > > VFIO
> > > > Mediated devices.
> > > > 
> > > > This migration_version attribute is used to check migration 
> > > > compatibility
> > > > between two mdev devices.
> > > > 
> > > > Currently, it has two locations:
> > > > (1) under mdev_type node,
> > > > which can be used even before device creation, but only for mdev
> > > > devices of the same mdev type.
> > > > (2) under mdev device node,
> > > > which can only be used after the mdev devices are created, but the 
> > > > src
> > > > and target mdev devices are not necessarily be of the same mdev type
> > > > (The second location is newly added in v5, in order to keep consistent
> > > > with the migration_version node for migratable pass-though devices)  
> > > 
> > > What is the relationship between those two attributes?
> > >   
> > (1) is for mdev devices specifically, and (2) is provided to keep the same
> > sysfs interface as with non-mdev cases. so (2) is for both mdev devices and
> > non-mdev devices.
> > 
> > in future, if we enable vfio-pci vendor ops, (i.e. a non-mdev device
> > is binding to vfio-pci, but is able to register migration region and do
> > migration transactions from a vendor provided affiliate driver),
> > the vendor driver would export (2) directly, under device node.
> > It is not able to provide (1) as there're no mdev devices involved.
> 
> Ok, creating an alternate attribute for non-mdev devices makes sense.
> However, wouldn't that rather be a case (3)? The change here only
> refers to mdev devices.
>
as you pointed out below, (3) and (2) serve the same purpose,
and I think a possible usage is to migrate between a non-mdev device and
an mdev device. so I think it's better for them both to use (2) rather
than creating (3).
> > 
> > > Is existence (and compatibility) of (1) a pre-req for possible
> > > existence (and compatibility) of (2)?
> > >  
> > no. (2) does not reply on (1).
> 
> Hm. Non-existence of (1) seems to imply "this type does not support
> migration". If an mdev created for such a type suddenly does support
> migration, it feels a bit odd.
> 
yes. But I think if that condition happens, it should be reported as a bug
to the vendor driver.
Should I add a line in the doc like "the vendor driver should ensure that
the migration compatibility reported by migration_version under the
mdev_type node is consistent with that reported by migration_version under
the device node"? (a kernel-side sketch follows below)
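
A minimal kernel-side sketch of what that consistency could look like
(illustrative only, not from the GVT patches): back the mdev_type attribute
and the device-node attribute with the same vendor helpers, so the two
checks cannot diverge. my_version_string() and my_version_compatible() are
hypothetical vendor helpers.

#include <linux/kernel.h>
#include <linux/device.h>
#include <linux/mdev.h>

/* hypothetical vendor helpers, implemented elsewhere in the driver */
const char *my_version_string(struct device *dev);
bool my_version_compatible(struct device *dev, const char *buf, size_t count);

static ssize_t migration_version_show(struct kobject *kobj,
				      struct device *dev, char *buf)
{
	return sprintf(buf, "%s\n", my_version_string(dev));
}

static ssize_t migration_version_store(struct kobject *kobj,
				       struct device *dev,
				       const char *buf, size_t count)
{
	/* a failed write tells userspace "not compatible" */
	return my_version_compatible(dev, buf, count) ? count : -EINVAL;
}

/* (1) the mdev_type-node attribute; (2) the device-node attribute would be
 * a plain device attribute built on the same two helpers. */
static MDEV_TYPE_ATTR(migration_version, 0644,
		      migration_version_show, migration_version_store);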

> (It obviously cannot be a prereq for what I called (3) above.)
> 
> > 
> > > Does userspace need to check (1) or can it completely rely on (2), if
> > > it so chooses?
> > >  
> > I think it can completely reply on (2) if compatibility check before
> > mdev creation is not required.
> > 
> > > If devices with a different mdev type are indeed compatible, it seems
> > > userspace can only find out after the devices have actually been
> > > created, as (1) does not apply?  
> > yes, I think so. 
> 
> How useful would it be for userspace to even look at (1) in that case?
> It only knows if things have a chance of working if it actually goes
> ahead and creates devices.
>
hmm, isn't it useful for userspace to test the migration_version under the
mdev type before it knows which mdev device to generate? For example, when
userspace wants to migrate an mdev device in the src VM but has not yet
created the target VM and the target mdev device.

> > 
> > > One of my worries is that the existence of an attribute with the same
> > > name in two similar locations might lead to confusion. But maybe it
> > > isn't a problem.
> > >  
> > Yes, I have the same feeling. but as (2) is for sysfs interface
> > consistency, to make it transparent to userspace tools like libvirt,
> > I guess the same name is necessary?
> 
> What do we actually need here, I wonder? (1) and (2) seem to serve
> slightly different purposes, while (2) and what I called (3) have the
> same purpose. Is it important to userspace that (1) and (2) have the
> same name?
So change (1) to migration_type_version and (2) to
migration_instance_version?
But as they are under different locations, couldn't the location itself
already imply that information?


Thanks
Yan





Re: [PATCH v5 0/4] introduction of migration_version attribute for VFIO live migration

2020-04-17 Thread Yan Zhao
On Fri, Apr 17, 2020 at 04:44:50PM +0800, Cornelia Huck wrote:
> On Mon, 13 Apr 2020 01:52:01 -0400
> Yan Zhao  wrote:
> 
> > This patchset introduces a migration_version attribute under sysfs of VFIO
> > Mediated devices.
> > 
> > This migration_version attribute is used to check migration compatibility
> > between two mdev devices.
> > 
> > Currently, it has two locations:
> > (1) under mdev_type node,
> > which can be used even before device creation, but only for mdev
> > devices of the same mdev type.
> > (2) under mdev device node,
> > which can only be used after the mdev devices are created, but the src
> > and target mdev devices are not necessarily be of the same mdev type
> > (The second location is newly added in v5, in order to keep consistent
> > with the migration_version node for migratable pass-though devices)
> 
> What is the relationship between those two attributes?
> 
(1) is for mdev devices specifically, and (2) is provided to keep the same
sysfs interface as with non-mdev cases. So (2) is for both mdev devices and
non-mdev devices.

In future, if we enable vfio-pci vendor ops (i.e. a non-mdev device is bound
to vfio-pci, but is able to register a migration region and do migration
transactions through a vendor-provided affiliate driver), the vendor driver
would export (2) directly, under the device node. It cannot provide (1), as
there are no mdev devices involved.
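
To make the intended userspace flow concrete, here is a minimal sketch of
the compatibility test discussed in this thread (the read-then-write
protocol and the sysfs paths are assumptions for illustration, not code
from this patchset): read migration_version from the source device node and
write the string into the target device node; the vendor driver accepts or
rejects the write.

#include <fcntl.h>
#include <unistd.h>

/*
 * Illustrative only.  src_attr/dst_attr would be paths such as
 * /sys/bus/mdev/devices/<uuid>/migration_version (location (2)).
 * Returns 0 if the vendor driver judged the devices compatible.
 */
static int migration_compatible(const char *src_attr, const char *dst_attr)
{
    char version[256];
    ssize_t len;
    int fd, ret = 0;

    fd = open(src_attr, O_RDONLY);
    if (fd < 0)
        return -1;
    len = read(fd, version, sizeof(version));
    close(fd);
    if (len <= 0)
        return -1;

    fd = open(dst_attr, O_WRONLY);
    if (fd < 0)
        return -1;
    /* an incompatible pair shows up as a failed write with a vendor errno */
    if (write(fd, version, len) != len)
        ret = -1;
    close(fd);
    return ret;
}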

> Is existence (and compatibility) of (1) a pre-req for possible
> existence (and compatibility) of (2)?
>
no. (2) does not rely on (1).

> Does userspace need to check (1) or can it completely rely on (2), if
> it so chooses?
>
I think it can completely rely on (2) if a compatibility check before
mdev creation is not required.

> If devices with a different mdev type are indeed compatible, it seems
> userspace can only find out after the devices have actually been
> created, as (1) does not apply?
yes, I think so. 

> One of my worries is that the existence of an attribute with the same
> name in two similar locations might lead to confusion. But maybe it
> isn't a problem.
>
Yes, I have the same feeling. But as (2) is for sysfs interface
consistency, to make it transparent to userspace tools like libvirt,
I guess the same name is necessary?

Thanks
Yan
> > 
> > Patch 1 defines migration_version attribute for the first location in
> > Documentation/vfio-mediated-device.txt
> > 
> > Patch 2 uses GVT as an example for patch 1 to show how to expose
> > migration_version attribute and check migration compatibility in vendor
> > driver.
> > 
> > Patch 3 defines migration_version attribute for the second location in
> > Documentation/vfio-mediated-device.txt
> > 
> > Patch 4 uses GVT as an example for patch 3 to show how to expose
> > migration_version attribute and check migration compatibility in vendor
> > driver.
> > 
> > (The previous "Reviewed-by" and "Acked-by" for patch 1 and patch 2 are
> > kept in v5, as there are only small changes to commit messages of the two
> > patches.)
> > 
> > v5:
> > added patch 2 and 4 for mdev device part of migration_version attribute.
> > 
> > v4:
> > 1. fixed indentation/spell errors, reworded several error messages
> > 2. added a missing memory free for error handling in patch 2
> > 
> > v3:
> > 1. renamed version to migration_version
> > 2. let errno to be freely defined by vendor driver
> > 3. let checking mdev_type be prerequisite of migration compatibility check
> > 4. reworded most part of patch 1
> > 5. print detailed error log in patch 2 and generate migration_version
> > string at init time
> > 
> > v2:
> > 1. renamed patched 1
> > 2. made definition of device version string completely private to vendor
> > driver
> > 3. reverted changes to sample mdev drivers
> > 4. described intent and usage of version attribute more clearly.
> > 
> > 
> > Yan Zhao (4):
> >   vfio/mdev: add migration_version attribute for mdev (under mdev_type
> > node)
> >   drm/i915/gvt: export migration_version to mdev sysfs (under mdev_type
> > node)
> >   vfio/mdev: add migration_version attribute for mdev (under mdev device
> > node)
> >   drm/i915/gvt: export migration_version to mdev sysfs (under mdev
> > device node)
> > 
> >  .../driver-api/vfio-mediated-device.rst   | 183 ++
> >  drivers/gpu/drm/i915/gvt/Makefile |   2 +-
> >  drivers/gpu/drm/i915/gvt/gvt.c|  39 
> >  drivers/gpu/drm/i915/gvt/gvt.h|   7 +
> >  drivers/gpu/drm/i915/gvt/kvmgt.c  |  55 ++
> >  drivers/gpu/drm/i915/gvt/migration_version.c  | 170 
> >  drivers/gpu/drm/i915/gvt/vgpu.c   |  13 +-
> >  7 files changed, 466 insertions(+), 3 deletions(-)
> >  create mode 100644 drivers/gpu/drm/i915/gvt/migration_version.c
> > 
> 



[PATCH v4 3/3] hw/vfio: let read-only flag take effect for mmap'd regions

2020-04-16 Thread Yan Zhao
alongside setting the host page table to read-only, the memory regions
are also required to be read-only, so that when the guest writes to the
read-only & mmap'd regions, vmexits happen and the region write handlers
are called.

Reviewed-by: Philippe Mathieu-Daudé 
Signed-off-by: Yan Zhao 
Signed-off-by: Xin Zeng 
---
 hw/vfio/common.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index b6956a8098..0049e97c34 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -979,6 +979,10 @@ int vfio_region_mmap(VFIORegion *region)
                                           name, region->mmaps[i].size,
                                           region->mmaps[i].mmap);
         g_free(name);
+
+        if (!(region->flags & VFIO_REGION_INFO_FLAG_WRITE)) {
+            memory_region_set_readonly(&region->mmaps[i].mem, true);
+        }
         memory_region_add_subregion(region->mem, region->mmaps[i].offset,
                                     &region->mmaps[i].mem);
 
-- 
2.17.1




[PATCH v4 2/3] hw/vfio: drop guest writes to ro regions

2020-04-16 Thread Yan Zhao
for vfio regions without write permission,
drop guest writes to those regions.

Cc: Philippe Mathieu-Daudé 
Signed-off-by: Yan Zhao 
Signed-off-by: Xin Zeng 
---
 hw/vfio/common.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 0b3593b3c0..b6956a8098 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -38,6 +38,7 @@
 #include "sysemu/reset.h"
 #include "trace.h"
 #include "qapi/error.h"
+#include "qemu/log.h"
 
 VFIOGroupList vfio_group_list =
 QLIST_HEAD_INITIALIZER(vfio_group_list);
@@ -190,6 +191,15 @@ void vfio_region_write(void *opaque, hwaddr addr,
         uint64_t qword;
     } buf;
 
+    trace_vfio_region_write(vbasedev->name, region->nr, addr, data, size);
+    if (!(region->flags & VFIO_REGION_INFO_FLAG_WRITE)) {
+        qemu_log_mask(LOG_GUEST_ERROR,
+                      "Invalid write to read only vfio region 0x%"
+                      HWADDR_PRIx" size %u\n", addr, size);
+
+        return;
+    }
+
     switch (size) {
     case 1:
         buf.byte = data;
@@ -215,8 +225,6 @@ void vfio_region_write(void *opaque, hwaddr addr,
                      addr, data, size);
     }
 
-    trace_vfio_region_write(vbasedev->name, region->nr, addr, data, size);
-
     /*
      * A read or write to a BAR always signals an INTx EOI.  This will
      * do nothing if not pending (including not in INTx mode).  We assume
-- 
2.17.1




[PATCH v4 0/3] drop writes to read-only ram device & vfio regions

2020-04-16 Thread Yan Zhao

patch 1 modifies the handler of ram device memory regions to drop guest
writes to read-only ram device memory regions.

patch 2 modifies the handler of non-mmap'd read-only vfio regions to drop
guest writes to those regions.

patch 3 sets the read-only flag on mmap'd read-only vfio regions, so that
guest writes to those regions are trapped.
without patch 1, host qemu would then crash on a guest write to those
read-only regions.
with patch 1, host qemu drops the writes.

Changelog:
v4:
-instead of modifying the tracing log, added qemu_log_mask(LOG_GUEST_ERROR...)
to log guest writes to read-only regions (Philippe)

v3:
-refreshed and Cc Stefan for reviewing of tracing part

v2:
-split one big patches into smaller ones (Philippe)
-modify existing trace to record guest writes to read-only memory (Alex)
-modify vfio_region_write() to drop guest writes to non-mmap'd read-only
 region (Alex)


Yan Zhao (3):
  memory: drop guest writes to read-only ram device regions
  hw/vfio: drop guest writes to ro regions
  hw/vfio: let read-only flag take effect for mmap'd regions

 hw/vfio/common.c | 16 ++--
 memory.c |  7 +++
 2 files changed, 21 insertions(+), 2 deletions(-)

-- 
2.17.1




[PATCH v4 1/3] memory: drop guest writes to read-only ram device regions

2020-04-16 Thread Yan Zhao
for ram device regions, drop guest writes if the region is read-only.

Cc: Philippe Mathieu-Daudé 
Signed-off-by: Yan Zhao 
Signed-off-by: Xin Zeng 
---
 memory.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/memory.c b/memory.c
index 601b749906..9576dd6807 100644
--- a/memory.c
+++ b/memory.c
@@ -34,6 +34,7 @@
 #include "sysemu/accel.h"
 #include "hw/boards.h"
 #include "migration/vmstate.h"
+#include "qemu/log.h"
 
 //#define DEBUG_UNASSIGNED
 
@@ -1313,6 +1314,12 @@ static void memory_region_ram_device_write(void *opaque, hwaddr addr,
     MemoryRegion *mr = opaque;
 
     trace_memory_region_ram_device_write(get_cpu_index(), mr, addr, data, size);
+    if (mr->readonly) {
+        qemu_log_mask(LOG_GUEST_ERROR,
+                      "Invalid write to read only ram device region 0x%"
+                      HWADDR_PRIx" size %u\n", addr, size);
+        return;
+    }
 
     switch (size) {
     case 1:
-- 
2.17.1



