Re: [PATCH v3 00/11] KVM: x86: track guest page access

2016-02-23 Thread Paolo Bonzini


On 23/02/2016 06:44, Tian, Kevin wrote:
>> From: Song, Jike
>> Sent: Tuesday, February 23, 2016 11:02 AM
>>
>> +Kevin
>>
>> On 02/22/2016 06:05 PM, Xiao Guangrong wrote:
>>>
>>> On 02/19/2016 08:00 PM, Paolo Bonzini wrote:

>>>> I still have a doubt: how are you going to handle invalidation of GPU
>>>> shadow page tables if a device (emulated in QEMU or even vhost) does DMA
>>>> to the PPGTT?
>>>
>>> I think Jike is the better one to answer this question, Jike, could you
>>> please clarify it? :)
>>>
>>
>> Sure :)
>>
>> Actually in guest PPGTT is manipulated by CPU rather than GPU. The
>> PPGTT page table itself are plain memory, composed & modified by the
>> GPU driver, i.e. by CPU in Non-Root mode.
>>
>> Given that, we write-protected guest PPGTT, when VM writes PPGTT, EPT
>> violation rather than DMA fault happens.
> 
> 'DMA to PPGTT' is NOT SUPPORTED on our vGPU device model. Today
> Intel gfx driver doesn't use this method, and we explicitly list it as a
> guest driver requirement to support a vGPU. If a malicious driver does 
> program DMA to modify PPGTT, it can only modify guest PPGTT instead
> of shadow PPGTT (being guest invisible). So there is no security issue 
> either.

Ok, thanks for confirming.

Paolo


Re: [PATCH v3 00/11] KVM: x86: track guest page access

2016-02-23 Thread Jike Song
On 02/23/2016 06:01 PM, Paolo Bonzini wrote:
> - Original Message -
>> From: "Jike Song" 
>> To: "Xiao Guangrong" 
>> Cc: "Paolo Bonzini" , g...@kernel.org, 
>> mtosa...@redhat.com, k...@vger.kernel.org,
>> linux-kernel@vger.kernel.org, "kai huang" , 
>> "Andrea Arcangeli" ,
>> "Kevin Tian" 
>> Sent: Tuesday, February 23, 2016 4:02:25 AM
>> Subject: Re: [PATCH v3 00/11] KVM: x86: track guest page access
>>
>> +Kevin
>>
>> On 02/22/2016 06:05 PM, Xiao Guangrong wrote:
>>>
>>> On 02/19/2016 08:00 PM, Paolo Bonzini wrote:
>>>>
>>>> I still have a doubt: how are you going to handle invalidation of GPU
>>>> shadow page tables if a device (emulated in QEMU or even vhost) does DMA
>>>> to the PPGTT?
>>>
>>> I think Jike is the better one to answer this question, Jike, could you
>>> please clarify it? :)
>>>
>>
>> Sure :)
>>
>> Actually in guest PPGTT is manipulated by CPU rather than GPU. The
>> PPGTT page table itself are plain memory, composed & modified by the
>> GPU driver, i.e. by CPU in Non-Root mode.
>>
>> Given that, we write-protected guest PPGTT, when VM writes PPGTT, EPT
>> violation rather than DMA fault happens.
> 

I may still be misunderstanding you, so I apologize in advance.

> I am not talking of DMA faults; I am talking of a guest that reads
> from disk into the PPGTT.

Into the PPGTT page table itself? As Kevin said in another mail,
this is NOT SUPPORTED.

> This is emulated DMA, and your approach of
> tracking guest page access from KVM means that you are not handling
> this.  Is this right?

Right, our tracking mechanism only cares about CPU writes, not device writes.

However, there is *NO* DMA emulation; it works much like passthrough.
The device (IGD) is only capable of reading and writing memory according to
the shadowed PPGTT, which is managed by the vGPU device model; this guarantees
that only memory owned by this vGPU can be mapped.

All we need is to track CPU writes from the guest.

> If so, what happens if the guest does this
> kind of operation (for example because it is not using the PPGTT
> anymore)?  KVMGT should not be confused the next time it works on
> that PPGTT page.

As explained above, the device-model won't allow such things to happen.
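
To make the two write paths discussed above concrete, here is a minimal user-space sketch. It is not KVMGT code: ToyVgpu and its methods are hypothetical names, and the shadow is modelled as a plain copy of the guest entry, whereas a real shadow PPGTT holds translated entries. It only illustrates that a tracked CPU write reaches the shadow, while a write by an emulated device changes the guest PPGTT page and leaves the shadow untouched.

```python
class ToyVgpu:
    """Hypothetical user-space model of one vGPU; not the real device model."""

    def __init__(self):
        self.guest_memory = {}        # gfn -> value, visible to the guest
        self.shadow_ppgtt = {}        # gfn -> value, invisible to the guest
        self.write_protected = set()  # gfns whose CPU writes are trapped

    def start_shadowing(self, gfn):
        # Copy the guest entry into the shadow and write-protect the page.
        self.shadow_ppgtt[gfn] = self.guest_memory.get(gfn)
        self.write_protected.add(gfn)

    def cpu_write(self, gfn, value):
        # Guest CPU write. On a write-protected page this stands in for the
        # EPT violation: the write is emulated and the shadow is updated.
        self.guest_memory[gfn] = value
        if gfn in self.write_protected:
            self.shadow_ppgtt[gfn] = value

    def device_dma_write(self, gfn, value):
        # Write by an emulated device (for example a disk read landing in
        # guest memory). It bypasses CPU write tracking entirely.
        self.guest_memory[gfn] = value


vgpu = ToyVgpu()
vgpu.guest_memory[0x100] = "pte-A"
vgpu.start_shadowing(0x100)

vgpu.cpu_write(0x100, "pte-B")            # tracked: shadow follows
vgpu.device_dma_write(0x100, "disk-data")  # untracked: shadow unchanged

print(vgpu.guest_memory[0x100])  # disk-data
print(vgpu.shadow_ppgtt[0x100])  # pte-B
```

In this toy model the shadow never picks up the device write, which matches both points made in the thread: the hardware-visible table cannot be altered this way, and the device model never learns that the guest page changed.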

> 
> Paolo
>

--
Thanks,
Jike



Re: [PATCH v3 00/11] KVM: x86: track guest page access

2016-02-23 Thread Paolo Bonzini


- Original Message -
> From: "Jike Song" 
> To: "Xiao Guangrong" 
> Cc: "Paolo Bonzini" , g...@kernel.org, 
> mtosa...@redhat.com, k...@vger.kernel.org,
> linux-kernel@vger.kernel.org, "kai huang" , 
> "Andrea Arcangeli" ,
> "Kevin Tian" 
> Sent: Tuesday, February 23, 2016 4:02:25 AM
> Subject: Re: [PATCH v3 00/11] KVM: x86: track guest page access
> 
> +Kevin
> 
> On 02/22/2016 06:05 PM, Xiao Guangrong wrote:
> > 
> > On 02/19/2016 08:00 PM, Paolo Bonzini wrote:
> >>
> >> I still have a doubt: how are you going to handle invalidation of GPU
> >> shadow page tables if a device (emulated in QEMU or even vhost) does DMA
> >> to the PPGTT?
> > 
> > I think Jike is the better one to answer this question, Jike, could you
> > please clarify it? :)
> > 
> 
> Sure :)
> 
> Actually in guest PPGTT is manipulated by CPU rather than GPU. The
> PPGTT page table itself are plain memory, composed & modified by the
> GPU driver, i.e. by CPU in Non-Root mode.
> 
> Given that, we write-protected guest PPGTT, when VM writes PPGTT, EPT
> violation rather than DMA fault happens.

I am not talking of DMA faults; I am talking of a guest that reads
from disk into the PPGTT.  This is emulated DMA, and your approach of
tracking guest page access from KVM means that you are not handling
this.  Is this right?  If so, what happens if the guest does this
kind of operation (for example because it is not using the PPGTT
anymore)?  KVMGT should not be confused the next time it works on
that PPGTT page.

Paolo


RE: [PATCH v3 00/11] KVM: x86: track guest page access

2016-02-22 Thread Tian, Kevin
> From: Song, Jike
> Sent: Tuesday, February 23, 2016 11:02 AM
> 
> +Kevin
> 
> On 02/22/2016 06:05 PM, Xiao Guangrong wrote:
> >
> > On 02/19/2016 08:00 PM, Paolo Bonzini wrote:
> >>
> >> I still have a doubt: how are you going to handle invalidation of GPU
> >> shadow page tables if a device (emulated in QEMU or even vhost) does DMA
> >> to the PPGTT?
> >
> > I think Jike is the better one to answer this question, Jike, could you
> > please clarify it? :)
> >
> 
> Sure :)
> 
> Actually in guest PPGTT is manipulated by CPU rather than GPU. The
> PPGTT page table itself are plain memory, composed & modified by the
> GPU driver, i.e. by CPU in Non-Root mode.
> 
> Given that, we write-protected guest PPGTT, when VM writes PPGTT, EPT
> violation rather than DMA fault happens.

'DMA to PPGTT' is NOT SUPPORTED by our vGPU device model. Today the
Intel gfx driver doesn't use this method, and we explicitly list not using
it as a guest driver requirement for vGPU support. If a malicious driver does
program DMA to modify the PPGTT, it can only modify the guest PPGTT, not
the shadow PPGTT (which is invisible to the guest). So there is no security
issue either.

> 
> >> Generally, this was the reason to keep stuff out of KVM
> >> and instead hook into the kernel mm subsystem (as with userfaultfd).
> >
> > We considered it carefully but this way can not satisfy KVMGT's 
> > requirements.
> > The reasons i explained in the old thread 
> > (https://lkml.org/lkml/2015/12/1/516)
> > are:
> >
> > "For the performance, shadow GPU is performance critical and requires
> > frequently being switched, it is not good to handle it in userspace. And
> > windows guest has many GPU tables and updates it frequently, that means,
> > we need to write protect huge number of pages which are single page based,
> > I am afraid userfaultfd can not handle this case efficiently.

Yes, performance is the main concern. 

Paolo, we explained the reasons for in-kernel emulation to you earlier,
and you acknowledged them:

> > It's definitely a fast path, e.g. command submission, shadow GPU page
> > table, etc. which are all in performance critical path. Another reason is
> > the I/O access frequency, which could be up to 100k/s for some gfx workload.
> > It's important to shorten the emulation path which can help performance
> > a lot. That's the major reason why we keep vGPU device model in the
> > kernel (will merged into i915 driver)
> 
> Ok, thanks---writing numbers down always helps.  MMIO to userspace costs
> 5000 clock cycles on the latest QEMU and processor (and does not need
> the "big QEMU lock" anymore), but still 100k/s is a ~50 clock cycle
> difference and approximately 15% host CPU usage.

(I believe the ~50 above should be ~500M clock cycles per second,
i.e. 5000 cycles x 100k/s.)
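
The arithmetic behind these figures can be checked with a short sketch. The 5000-cycle cost and the 100k/s rate are the numbers quoted above; the 3.3 GHz clock is an assumption chosen only because it is consistent with the quoted 15% figure, as the thread does not state the host clock speed.

```python
CYCLES_PER_MMIO_EXIT = 5000      # userspace MMIO cost quoted above
MMIO_EXITS_PER_SECOND = 100_000  # "up to 100k/s" from the quoted mail
ASSUMED_CPU_HZ = 3.3e9           # assumption: not stated in the thread


def mmio_cycles_per_second(cycles_per_exit, exits_per_second):
    """Total cycles spent per second on userspace MMIO exits."""
    return cycles_per_exit * exits_per_second


def cpu_share(cycles_per_second, cpu_hz):
    """Fraction of one core consumed, given its clock frequency."""
    return cycles_per_second / cpu_hz


total = mmio_cycles_per_second(CYCLES_PER_MMIO_EXIT, MMIO_EXITS_PER_SECOND)
share = cpu_share(total, ASSUMED_CPU_HZ)
print(f"{total:,} cycles/s, about {share:.0%} of one core")
# prints: 500,000,000 cycles/s, about 15% of one core
```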

> >
> > For the functionality, userfaultfd can not fill the need of shadow page
> > because:
> > - the page is keeping readonly, userfaultfd can not fix the fault and let
> > the vcpu progress (write access causes writeable gup).
> >
> > - the access need to be emulated, however, userfaultfd/kernel does not have
> > the ability to emulate the access as the access is trigged by guest, the
> > instruction info is stored in VMCS so that only KVM can emulate it.
> >
> > - shadow page needs to be notified after the emulation is finished as it
> > should know the new data written to the page to update its page 
> > hierarchy.
> > (some hardwares lack the 'retry' ability so the shadow page table need 
> > to
> >  reflect the table in guest at any time). "
> >
> > Any idea?
> >
> 

Thanks, Guangrong, for investigating the possibility.

Based on the earlier explanation, we hope the KVM community can reconsider
the necessity of supporting in-kernel emulation for KVMGT. The same framework
might be extended in the future to other types of I/O devices that use a
similar mediated pass-through concept, in which the device model is tightly
integrated with the native device driver for efficiency and simplicity.

Actually, there is a related open issue from the KVMGT/VFIO integration
discussion. There are 7 services in total required to support in-kernel
emulation, which can be categorized into two groups:

a) services to connect vGPU with VM, which are essentially what a device
driver is doing (so VFIO can fit here), including:
1) Selectively pass-through a region to a VM
2) Trap-and-emulate a region
3) Inject a virtual interrupt
4) Pin/unpin guest memory
5) GPA->IOVA/HVA translation (as a side-effect)

b) services to support device emulation, which will be hypervisor
specific, including:
6) Map/unmap guest memory
7) Write-protect a guest memory page

We're working with the VFIO community to add support for category a),
but there is still a gap in category b). This patch series can address
requirement 7). Requirement 6) is straightforward for KVM. We may
introduce a new file in KVM to wrap them together for in-kernel
emulation, but we first need agreement from the community on this
direction. :-)

Thanks
Kevin

Re: [PATCH v3 00/11] KVM: x86: track guest page access

2016-02-22 Thread Jike Song
+Kevin

On 02/22/2016 06:05 PM, Xiao Guangrong wrote:
> 
> On 02/19/2016 08:00 PM, Paolo Bonzini wrote:
>>
>> I still have a doubt: how are you going to handle invalidation of GPU
>> shadow page tables if a device (emulated in QEMU or even vhost) does DMA
>> to the PPGTT?
> 
> I think Jike is the better one to answer this question, Jike, could you
> please clarify it? :)
> 

Sure :)

Actually, in the guest the PPGTT is manipulated by the CPU rather than the
GPU. The PPGTT page tables themselves are plain memory, composed and modified
by the GPU driver, i.e. by the CPU in non-root mode.

Given that, we write-protect the guest PPGTT; when the VM writes to the
PPGTT, an EPT violation rather than a DMA fault happens.

>> Generally, this was the reason to keep stuff out of KVM
>> and instead hook into the kernel mm subsystem (as with userfaultfd).
> 
> We considered it carefully but this way can not satisfy KVMGT's requirements.
> The reasons i explained in the old thread 
> (https://lkml.org/lkml/2015/12/1/516)
> are:
> 
> "For the performance, shadow GPU is performance critical and requires
> frequently being switched, it is not good to handle it in userspace. And
> windows guest has many GPU tables and updates it frequently, that means,
> we need to write protect huge number of pages which are single page based,
> I am afraid userfaultfd can not handle this case efficiently.
> 
> For the functionality, userfaultfd can not fill the need of shadow page
> because:
> - the page is keeping readonly, userfaultfd can not fix the fault and let
> the vcpu progress (write access causes writeable gup).
> 
> - the access need to be emulated, however, userfaultfd/kernel does not have
> the ability to emulate the access as the access is trigged by guest, the
> instruction info is stored in VMCS so that only KVM can emulate it.
> 
> - shadow page needs to be notified after the emulation is finished as it
> should know the new data written to the page to update its page hierarchy.
> (some hardwares lack the 'retry' ability so the shadow page table need to
>  reflect the table in guest at any time). "
> 
> Any idea?
> 

--
Thanks,
Jike


Re: [PATCH v3 00/11] KVM: x86: track guest page access

2016-02-22 Thread Xiao Guangrong



On 02/19/2016 08:00 PM, Paolo Bonzini wrote:
>
> On 14/02/2016 12:31, Xiao Guangrong wrote:
>> Changelog in v3:
>> - refine the code of mmu_need_write_protect() based on Huang Kai's suggestion
>> - rebase the patchset against current code
>>
>> Changelog in v2:
>> - fix an issue where the track memory of a memslot is freed if we only move
>>   the memslot or change its flags
>> - do not track a gfn which is not mapped in memslots
>> - introduce the nolock APIs at the beginning of the patchset
>> - use 'unsigned short' as the track counter to reduce memory usage, which
>>   should be enough for shadow page tables and KVMGT
>>
>> This patchset introduces the feature which allows us to track page
>> access in guest. Currently, only write access tracking is implemented
>> in this version.
>>
>> Four APIs are introduced:
>> - kvm_page_track_add_page(kvm, gfn, mode): the single guest page @gfn is
>>   added to the track pool of the guest instance represented by @kvm;
>>   @mode specifies which kind of access to @gfn is tracked
>>
>> - kvm_page_track_remove_page(kvm, gfn, mode): the opposite operation
>>   of kvm_page_track_add_page(), which removes @gfn from the tracking pool.
>>   @gfn is no longer tracked after its last user is gone
>>
>> - kvm_page_track_register_notifier(kvm, n): registers a notifier so that
>>   events triggered by page tracking are received; when such an event
>>   occurs, the n->track_write() callback is called
>>
>> - kvm_page_track_unregister_notifier(kvm, n): the opposite operation
>>   of kvm_page_track_register_notifier(), which unlinks the notifier and
>>   stops it from receiving tracked events
>>
>> The first user of page tracking is non-leaf shadow page tables, as they are
>> always write protected. This also improves performance because page
>> tracking speeds up the page fault handler for the tracked pages. The
>> results for a kernel build are as follows:
>>
>>           before      after
>> real      461.63     455.48
>> user     4529.55    4557.88
>> sys      1995.39    1922.57
>>
>> Furthermore, it is the infrastructure for other kinds of shadow page tables,
>> such as the GPU shadow page table introduced in KVMGT (1) and native nested
>> IOMMU.
>>
>> This patchset can be divided into two parts:
>> - patch 1 ~ patch 7 implement page tracking
>> - the other patches apply page tracking to non-leaf shadow page tables
>
> Xiao,
>
> the patches are very readable and very good.  My comments are only minor.


Thank you, Paolo!



> I still have a doubt: how are you going to handle invalidation of GPU
> shadow page tables if a device (emulated in QEMU or even vhost) does DMA
> to the PPGTT?


I think Jike is the better one to answer this question, Jike, could you
please clarify it? :)


> Generally, this was the reason to keep stuff out of KVM
> and instead hook into the kernel mm subsystem (as with userfaultfd).


We considered it carefully, but that approach cannot satisfy KVMGT's
requirements. The reasons, which I explained in the old thread
(https://lkml.org/lkml/2015/12/1/516), are:

"For the performance, shadow GPU is performance critical and requires
frequently being switched, it is not good to handle it in userspace. And
windows guest has many GPU tables and updates it frequently, that means,
we need to write protect huge number of pages which are single page based,
I am afraid userfaultfd can not handle this case efficiently.

For the functionality, userfaultfd can not fill the need of shadow page
because:
- the page is keeping readonly, userfaultfd can not fix the fault and let
   the vcpu progress (write access causes writeable gup).

- the access need to be emulated, however, userfaultfd/kernel does not have
   the ability to emulate the access as the access is trigged by guest, the
   instruction info is stored in VMCS so that only KVM can emulate it.

- shadow page needs to be notified after the emulation is finished as it
   should know the new data written to the page to update its page hierarchy.
   (some hardwares lack the 'retry' ability so the shadow page table need to
   reflect the table in guest at any time). "

Any idea?


Re: [PATCH v3 00/11] KVM: x86: track guest page access

2016-02-19 Thread Paolo Bonzini


On 14/02/2016 12:31, Xiao Guangrong wrote:
> Changelog in v3:
> - refine the code of mmu_need_write_protect() based on Huang Kai's suggestion
> - rebase the patchset against current code
> 
> Changelog in v2:
> - fix an issue where the track memory of a memslot is freed if we only move
>   the memslot or change its flags
> - do not track a gfn which is not mapped in memslots
> - introduce the nolock APIs at the beginning of the patchset
> - use 'unsigned short' as the track counter to reduce memory usage, which
>   should be enough for shadow page tables and KVMGT
> 
> This patchset introduces the feature which allows us to track page
> access in guest. Currently, only write access tracking is implemented
> in this version.
> 
> Four APIs are introduced:
> - kvm_page_track_add_page(kvm, gfn, mode): the single guest page @gfn is
>   added to the track pool of the guest instance represented by @kvm;
>   @mode specifies which kind of access to @gfn is tracked
>
> - kvm_page_track_remove_page(kvm, gfn, mode): the opposite operation
>   of kvm_page_track_add_page(), which removes @gfn from the tracking pool.
>   @gfn is no longer tracked after its last user is gone
>
> - kvm_page_track_register_notifier(kvm, n): registers a notifier so that
>   events triggered by page tracking are received; when such an event
>   occurs, the n->track_write() callback is called
>
> - kvm_page_track_unregister_notifier(kvm, n): the opposite operation
>   of kvm_page_track_register_notifier(), which unlinks the notifier and
>   stops it from receiving tracked events
> 
> The first user of page tracking is non-leaf shadow page tables, as they are
> always write protected. This also improves performance because page
> tracking speeds up the page fault handler for the tracked pages. The
> results for a kernel build are as follows:
>
>           before      after
> real      461.63     455.48
> user     4529.55    4557.88
> sys      1995.39    1922.57
> 
> Furthermore, it is the infrastructure for other kinds of shadow page tables,
> such as the GPU shadow page table introduced in KVMGT (1) and native nested
> IOMMU.
>
> This patchset can be divided into two parts:
> - patch 1 ~ patch 7 implement page tracking
> - the other patches apply page tracking to non-leaf shadow page tables
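
The four tracking APIs described in the cover letter above can be modelled in user space to show how the per-page user count and the notifier interact. This is only a sketch under stated assumptions, not the patchset's kernel implementation: ToyKvm, RecordingNotifier, is_tracked() and guest_write() are hypothetical helpers, and the 0xFFFF limit simply mirrors the 'unsigned short' counter mentioned in the v2 changelog.

```python
from collections import defaultdict

TRACK_WRITE = "write"  # the only mode the cover letter says is implemented
COUNTER_MAX = 0xFFFF   # models the 'unsigned short' track counter


class ToyKvm:
    """Hypothetical stand-in for a guest instance; not the kernel's struct kvm."""

    def __init__(self):
        self.track_count = defaultdict(int)  # (gfn, mode) -> number of users
        self.notifiers = []


class RecordingNotifier:
    """Hypothetical notifier that just records every track_write() call."""

    def __init__(self):
        self.events = []

    def track_write(self, gfn, data):
        self.events.append((gfn, data))


def kvm_page_track_add_page(kvm, gfn, mode):
    if kvm.track_count[(gfn, mode)] >= COUNTER_MAX:
        raise OverflowError("track counter would overflow")
    kvm.track_count[(gfn, mode)] += 1


def kvm_page_track_remove_page(kvm, gfn, mode):
    if kvm.track_count[(gfn, mode)] == 0:
        raise ValueError("gfn is not tracked in this mode")
    kvm.track_count[(gfn, mode)] -= 1


def kvm_page_track_register_notifier(kvm, n):
    kvm.notifiers.append(n)


def kvm_page_track_unregister_notifier(kvm, n):
    kvm.notifiers.remove(n)


def is_tracked(kvm, gfn, mode):
    # Hypothetical helper: a page stays tracked until its last user is gone.
    return kvm.track_count.get((gfn, mode), 0) > 0


def guest_write(kvm, gfn, data):
    # Hypothetical helper modelling a guest CPU write to a page: tracked
    # pages notify every registered listener, untracked pages do not.
    if is_tracked(kvm, gfn, TRACK_WRITE):
        for n in kvm.notifiers:
            n.track_write(gfn, data)
```

With two users tracking the same gfn, a write is still reported after the first kvm_page_track_remove_page() call and stops being reported only after the second, which is the "last user is gone" behaviour the cover letter describes.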

Xiao,

the patches are very readable and very good.  My comments are only minor.

I still have a doubt: how are you going to handle invalidation of GPU
shadow page tables if a device (emulated in QEMU or even vhost) does DMA
to the PPGTT?  Generally, this was the reason to keep stuff out of KVM
and instead hook into the kernel mm subsystem (as with userfaultfd).

Paolo