from:"Yang, Sheng"

Re: Mask bit support's API

2010-12-02 Thread Yang, Sheng

 should be acceptable. Also we can hook at disable point as well.
 
> So handling msix enable/disable is more straight-forward.

I don't agree. Then you get duplication between kernel and userspace. Also,
the semantics of MSI-X MMIO have no relationship with MSI-X enable/disable.
> 
> > >> >>  Michael?  I think that should work for virtio and vfio assigned
> > >> >>  devices?  Not sure about pending bits.
> > >> >
> > >> >Pending bits must be tracked in kernel, but I don't see
> > >> >how we can support polling mode if we don't exit to userspace
> > >> >on pending bit reads.
> > >> >
> > >> >This does mean that some reads will be fast and some will be
> > >> >slow, and it's a bit sad that we seem to be optimizing
> > >> >for specific guests, but I just can't come up with
> > >> >anything better.
> > >> 
> > >> If the pending bits live in userspace memory, the device model can
> > >> update them directly?
> > > 
> > > Note that these are updated on an interrupt, so updating them
> > > in userspace would need get_user_page etc trickery,
> > > and add the overhead of atomics.
> > > 
> > > Further I think it's important to avoid the overhead of updating them
> > > all the time, and only do this when an interrupt is
> > > masked or on pending bits read. Since userspace does not know
> > > when interrupts are masked, this means do update on each read.
> > 
> > In fact qemu's accessing to MMIO should be quite rare after moving all
> > the things to the kernel. Using IOCTL is also fine with me.
> > 
> > And how to "do update on each read"?
> 
> When read of pending bits is detected, we could forward it up to qemu.
> Qemu could check internal device state and clear bits that are no longer
> relevant.

I don't understand why we need to turn to qemu when everything is ready in
the kernel. And the pending bit is still within the scope of the PCI spec.
Also, it can only be written by the kernel, which emulates the table. Of
course, the device model can read from it.
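
To make the ownership argument concrete, here is a minimal sketch of
pending-bit bookkeeping in which only the side that emulates the table
writes the bits, and everyone else just reads them. All names
(demo_msix_state, demo_msix_raise, demo_msix_set_mask,
demo_msix_read_pending) and the 64-vector limit are assumptions made for
illustration; this is not code from the patches under discussion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_MSIX_VECTORS 64    /* hypothetical table size for this sketch */

struct demo_msix_state {
    uint64_t mask;      /* bit n set: vector n is masked by the guest */
    uint64_t pending;   /* bit n set: an interrupt arrived while masked */
};

/* An interrupt source fires. Returns true if it can be delivered now. */
static bool demo_msix_raise(struct demo_msix_state *s, unsigned int vec)
{
    assert(vec < DEMO_MSIX_VECTORS);
    if (s->mask & (UINT64_C(1) << vec)) {
        s->pending |= UINT64_C(1) << vec;   /* latch instead of delivering */
        return false;
    }
    return true;
}

/*
 * The guest writes the mask bit. Returns true if a latched interrupt
 * has to be delivered because the vector was just unmasked.
 */
static bool demo_msix_set_mask(struct demo_msix_state *s, unsigned int vec,
                               bool masked)
{
    uint64_t bit;

    assert(vec < DEMO_MSIX_VECTORS);
    bit = UINT64_C(1) << vec;
    if (masked) {
        s->mask |= bit;
        return false;
    }
    s->mask &= ~bit;
    if (s->pending & bit) {
        s->pending &= ~bit;     /* pending bit clears on delivery */
        return true;
    }
    return false;
}

/* Read-only view, which is all a device model would need. */
static uint64_t demo_msix_read_pending(const struct demo_msix_state *s)
{
    return s->pending;
}
```

In this sketch the reader never has to ask anyone to "update on read": the
bits are already current, because they change only at raise and unmask time.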

--
regards
Yang, Sheng

> 
> > >> >So maybe just add an ioctl to get and to clear pending bits.
> > >> >Maybe set for symmetry.
> > >> 
> > >> For live migration too.  But if they live in memory, no need for
> > >> get/set, just specify the address.
> > >> 
> > >> --
> > >> error compiling committee.c: too many arguments to function
> > > 
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe kvm" in
> > > the body of a message to majord...@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Performance test result between virtio_pci MSI-X disable and enable

2010-12-01 Thread Yang, Sheng

On Wednesday 01 December 2010 22:03:58 Michael S. Tsirkin wrote:
> On Wed, Dec 01, 2010 at 04:41:38PM +0800, lidong chen wrote:
> > I used sr-iov, give each vm 2 vf.
> > after apply the patch, and i found performence is the same.
> > 
> > the reason is in function msix_mmio_write, mostly addr is not in mmio
> > range.
> > 
> > static int msix_mmio_write(struct kvm_io_device *this, gpa_t addr, int len,
> >                            const void *val)
> > {
> >         struct kvm_assigned_dev_kernel *adev =
> >                 container_of(this, struct kvm_assigned_dev_kernel,
> >                              msix_mmio_dev);
> >         int idx, r = 0;
> >         unsigned long new_val = *(unsigned long *)val;
> > 
> >         mutex_lock(&adev->kvm->lock);
> >         if (!msix_mmio_in_range(adev, addr, len)) {
> >                 // return here.
> >                 r = -EOPNOTSUPP;
> >                 goto out;
> >         }
> > 
> > i printk the value:
> > addr       start      end        len
> > F004C00C   F0044000   F0044030   4
> > 
> > 00:06.0 Ethernet controller: Intel Corporation Unknown device 10ed (rev 01)
> >   Subsystem: Intel Corporation Unknown device 000c
> >   Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
> >   Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR-
> >   Latency: 0
> >   Region 0: Memory at f004 (32-bit, non-prefetchable) [size=16K]
> >   Region 3: Memory at f0044000 (32-bit, non-prefetchable) [size=16K]
> >   Capabilities: [40] MSI-X: Enable+ Mask- TabSize=3
> >     Vector table: BAR=3 offset=
> >     PBA: BAR=3 offset=2000
> > 
> > 00:07.0 Ethernet controller: Intel Corporation Unknown device 10ed (rev 01)
> >   Subsystem: Intel Corporation Unknown device 000c
> >   Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
> >   Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR-
> >   Latency: 0
> >   Region 0: Memory at f0048000 (32-bit, non-prefetchable) [size=16K]
> >   Region 3: Memory at f004c000 (32-bit, non-prefetchable) [size=16K]
> >   Capabilities: [40] MSI-X: Enable+ Mask- TabSize=3
> >     Vector table: BAR=3 offset=
> >     PBA: BAR=3 offset=2000
> > 
> > +static bool msix_mmio_in_range(struct kvm_assigned_dev_kernel *adev,
> > +                               gpa_t addr, int len)
> > +{
> > +       gpa_t start, end;
> > +
> > +       BUG_ON(adev->msix_mmio_base == 0);
> > +       start = adev->msix_mmio_base;
> > +       end = adev->msix_mmio_base + PCI_MSIX_ENTRY_SIZE *
> > +               adev->msix_max_entries_nr;
> > +       if (addr >= start && addr + len <= end)
> > +               return true;
> > +
> > +       return false;
> > +}
> 
> Hmm, this check looks wrong to me: there's no guarantee
> that guest uses the first N entries in the table.
> E.g. it could use a single entry, but only the last one.

Please check the PCI spec.
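
The printed values can also be checked by hand. end - start is 0x30, which
is 3 entries of 16 bytes and matches TabSize=3. The rejected address
F004C00C is offset 0xC into Region 3 of 00:07.0 (f004c000), not of 00:06.0
(f0044000), so the write was compared against the other device's range. The
sketch below repeats the quoted comparison on plain integers; the demo_
names are hypothetical and the 16-byte entry size is the MSI-X table entry
size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_MSIX_ENTRY_SIZE 16u    /* bytes per MSI-X table entry */

/* Same comparison as the quoted msix_mmio_in_range(), on plain integers. */
static bool demo_in_range(uint64_t base, unsigned int nr_entries,
                          uint64_t addr, unsigned int len)
{
    uint64_t end;

    assert(base != 0);  /* stands in for the BUG_ON() in the patch */
    end = base + (uint64_t)DEMO_MSIX_ENTRY_SIZE * nr_entries;
    return addr >= base && addr + len <= end;
}
```

With base 0xF0044000 and 3 entries, a 4-byte access at 0xF004C00C is
rejected; with base 0xF004C000 the same access is accepted. That is
consistent with checking first whether both devices' ranges were registered.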

--
regards
Yang, Sheng

 
> > 2010/11/30 Yang, Sheng :
> > > On Tuesday 30 November 2010 17:10:11 lidong chen wrote:
> > >> sr-iov also meet this problem, MSIX mask waste a lot of cpu resource.
> > >> 
> > >> I test kvm with sriov, which the vf driver could not disable msix.
> > >> so the host os waste a lot of cpu.  cpu rate of host os is 90%.
> > >> 
> > >> then I test xen with sriov, there ara also a lot of vm exits caused by
> > >> MSIX mask.
> > >> but the cpu rate of xen and domain0 is less than kvm. cpu rate of xen
> > >> and domain0 is 60%.
> > >> 
> > >> without sr-iov, the cpu rate of xen and domain0 is higher than kvm.
> > >> 
> > >> so i think the problem is kvm waste more cpu resource to deal with
> > >> MSIX mask. and we can see how xen deal with MSIX mask.
> > >> 
> > >> if this problem sloved, maybe with MSIX enabled, the performace is
> > >> better.
> > > 
> > > Please refer to my posted patches for this issue.
> > > 
> > > http://www.spinics.net/lists/kvm/msg44992.html
> > > 
> > > --
> > > regards
> > > Yang, Sheng
> > > 
> > >> 2010/11/23 Avi Kivity :
> > >> > On 11/23/2010 09:27 AM, lidong chen wrote:
> > >> >> can you tell me something about this problem.
> > >> >> thanks.
> > >> > 
> > >> > Which problem?
> > >> > 
> > >> > --
> > >> > I have a truly marvellous patch that fixes the bug which this
> > >> > signature is too narrow to contain.

Re: Performance test result between virtio_pci MSI-X disable and enable

2010-12-01 Thread Yang, Sheng

On Wednesday 01 December 2010 17:29:44 lidong chen wrote:
> maybe because i modified the code in assigned_dev_iomem_map().
> 
> i used RHEL6, and calc_assigned_dev_id is below:
> 
> static uint32_t calc_assigned_dev_id(uint8_t bus, uint8_t devfn)
> {
>         return (uint32_t)bus << 8 | (uint32_t)devfn;
> }
> 
> and in the patch there are three params.
> +msix_mmio.id = calc_assigned_dev_id(r_dev->h_segnr,
> +               r_dev->h_busnr, r_dev->h_devfn);

This one should be fine because h_segnr should be 0 here.
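
A minimal sketch of why a zero segment makes the two forms interchangeable.
The two-parameter function is the one quoted above; the three-parameter
layout (segment shifted above bit 16) is an assumption, since the thread
does not show its body, and the demo_ names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Two-parameter form, as quoted from the RHEL6 tree. */
static uint32_t demo_dev_id2(uint8_t bus, uint8_t devfn)
{
    return (uint32_t)bus << 8 | (uint32_t)devfn;
}

/* Three-parameter form; the position of the segment is an assumption. */
static uint32_t demo_dev_id3(uint16_t seg, uint8_t bus, uint8_t devfn)
{
    return (uint32_t)seg << 16 | (uint32_t)bus << 8 | (uint32_t)devfn;
}

/* With segment 0 both forms must produce the same id. */
static void demo_check_same_id(uint8_t bus, uint8_t devfn)
{
    assert(demo_dev_id3(0, bus, devfn) == demo_dev_id2(bus, devfn));
}
```

For 00:06.0 (bus 0, devfn 0x30) both forms give 0x30, so under this
assumption the id mismatch only appears on hosts with a non-zero segment.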

But I strongly recommend that you use the latest KVM and the latest QEmu; we
won't know what would happen during the rebase... (maybe my patch is a little
old for the latest one, so my kvm base is
365bb670a44b217870c2ee1065f57bb43b57e166, qemu base is
420fe74769cc67baec6f3d962dc054e2972ca3ae).

Things to be checked:
1. Whether both devices' MMIO ranges have been registered successfully.
2. Whether you can see mask bit accesses in the kernel from both devices.

--
regards
Yang, Sheng

> 
> 
> #ifdef KVM_CAP_MSIX_MASK
>     if (cap_mask) {
>         memset(&msix_mmio, 0, sizeof msix_mmio);
>         msix_mmio.id = calc_assigned_dev_id(r_dev->h_busnr,
>                                             r_dev->h_devfn);
>         msix_mmio.type = KVM_MSIX_TYPE_ASSIGNED_DEV;
>         msix_mmio.base_addr = e_phys + offset;
>         msix_mmio.max_entries_nr = r_dev->max_msix_entries_nr;
>         msix_mmio.flags = KVM_MSIX_MMIO_FLAG_REGISTER;
>         ret = kvm_update_msix_mmio(kvm_context, &msix_mmio);
>         if (ret)
>             fprintf(stderr, "fail to register in-kernel msix_mmio!\n");
>     }
> #endif
> 
> 2010/12/1 Yang, Sheng :
> > On Wednesday 01 December 2010 16:54:16 lidong chen wrote:
> >> yes, i patch qemu as well.
> >> 
> >> and i found the address of second vf is not in mmio range. the first
> >> one is fine.
> > 
> > So looks like something wrong with MMIO register part. Could you check
> > the registeration in assigned_dev_iomem_map() of the 4th patch for QEmu?
> > I suppose something wrong with it. I would try to reproduce it here.
> > 
> > And if you only use one vf, how about the gain?
> > 
> > --
> > regards
> > Yang, Sheng
> > 
> >> 2010/12/1 Yang, Sheng :
> >> > On Wednesday 01 December 2010 16:41:38 lidong chen wrote:
> >> >> I used sr-iov, give each vm 2 vf.
> >> >> after apply the patch, and i found performence is the same.
> >> >> 
> >> >> the reason is in function msix_mmio_write, mostly addr is not in mmio
> >> >> range.
> >> > 
> >> > Did you patch qemu as well? You can see it's impossible for kernel
> >> > part to work alone...
> >> > 
> >> > http://www.mail-archive.com/kvm@vger.kernel.org/msg44368.html
> >> > 
> >> > --
> >> > regards
> >> > Yang, Sheng
> >> > 
> >> >> static int msix_mmio_write(struct kvm_io_device *this, gpa_t addr,
> >> >>                            int len, const void *val)
> >> >> {
> >> >>         struct kvm_assigned_dev_kernel *adev =
> >> >>                 container_of(this, struct kvm_assigned_dev_kernel,
> >> >>                              msix_mmio_dev);
> >> >>         int idx, r = 0;
> >> >>         unsigned long new_val = *(unsigned long *)val;
> >> >> 
> >> >>         mutex_lock(&adev->kvm->lock);
> >> >>         if (!msix_mmio_in_range(adev, addr, len)) {
> >> >>                 // return here.
> >> >>                 r = -EOPNOTSUPP;
> >> >>                 goto out;
> >> >>         }
> >> >> 
> >> >> i printk the value:
> >> >> addr       start      end        len
> >> >> F004C00C   F0044000   F0044030   4
> >> >> 
> >> >> 00:06.0 Ethernet controller: Intel Corporation Unknown device 10ed
> >> >> (rev 01) Subsystem: Intel Corporation Unknown device 000c
> >> >> 
> >> >>   Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
> >> >> 
> >> >> ParErr

Re: Performance test result between virtio_pci MSI-X disable and enable

2010-12-01 Thread Yang, Sheng

On Wednesday 01 December 2010 17:02:57 Yang, Sheng wrote:
> On Wednesday 01 December 2010 16:54:16 lidong chen wrote:
> > yes, i patch qemu as well.
> > 
> > and i found the address of second vf is not in mmio range. the first
> > one is fine.
> 
> So looks like something wrong with MMIO register part. Could you check the
> registeration in assigned_dev_iomem_map() of the 4th patch for QEmu? I
> suppose something wrong with it. I would try to reproduce it here.
> 
> And if you only use one vf, how about the gain?

It's weird... I've tried assigning two VFs to the guest, and both devices'
MSI-X mask bit accesses are intercepted by the kernel as expected...

So for msix_mmio_write/read, you can't see any mask bit access from the
second device?

Also, the benefit of this patch would show only when mask bit operation
intensity is high. So how about your interrupt intensity?

--
regards
Yang, Sheng

> 
> --
> regards
> Yang, Sheng
> 
> > 2010/12/1 Yang, Sheng :
> > > On Wednesday 01 December 2010 16:41:38 lidong chen wrote:
> > >> I used sr-iov, give each vm 2 vf.
> > >> after apply the patch, and i found performence is the same.
> > >> 
> > >> the reason is in function msix_mmio_write, mostly addr is not in mmio
> > >> range.
> > > 
> > > Did you patch qemu as well? You can see it's impossible for kernel part
> > > to work alone...
> > > 
> > > http://www.mail-archive.com/kvm@vger.kernel.org/msg44368.html
> > > 
> > > --
> > > regards
> > > Yang, Sheng
> > > 
> > >> static int msix_mmio_write(struct kvm_io_device *this, gpa_t addr, int
> > >>                            len, const void *val)
> > >> {
> > >>         struct kvm_assigned_dev_kernel *adev =
> > >>                 container_of(this, struct kvm_assigned_dev_kernel,
> > >>                              msix_mmio_dev);
> > >>         int idx, r = 0;
> > >>         unsigned long new_val = *(unsigned long *)val;
> > >> 
> > >>         mutex_lock(&adev->kvm->lock);
> > >>         if (!msix_mmio_in_range(adev, addr, len)) {
> > >>                 // return here.
> > >>                 r = -EOPNOTSUPP;
> > >>                 goto out;
> > >>         }
> > >> 
> > >> i printk the value:
> > >> addr       start      end        len
> > >> F004C00C   F0044000   F0044030   4
> > >> 
> > >> 00:06.0 Ethernet controller: Intel Corporation Unknown device 10ed (rev 01)
> > >>   Subsystem: Intel Corporation Unknown device 000c
> > >>   Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
> > >>   Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR-
> > >>   Latency: 0
> > >>   Region 0: Memory at f004 (32-bit, non-prefetchable) [size=16K]
> > >>   Region 3: Memory at f0044000 (32-bit, non-prefetchable) [size=16K]
> > >>   Capabilities: [40] MSI-X: Enable+ Mask- TabSize=3
> > >>     Vector table: BAR=3 offset=
> > >>     PBA: BAR=3 offset=2000
> > >> 
> > >> 00:07.0 Ethernet controller: Intel Corporation Unknown device 10ed (rev 01)
> > >>   Subsystem: Intel Corporation Unknown device 000c
> > >>   Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
> > >>   Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR-
> > >>   Latency: 0
> > >>   Region 0: Memory at f0048000 (32-bit, non-prefetchable) [size=16K]
> > >>   Region 3: Memory at f004c000 (32-bit, non-prefetchable) [size=16K]
> > >>   Capabilities: [40] MSI-X: Enable+ Mask- TabSize=3
> > >>     Vector table: BAR=3 offset=
> > >>

Re: Performance test result between virtio_pci MSI-X disable and enable

2010-12-01 Thread Yang, Sheng

On Wednesday 01 December 2010 16:54:16 lidong chen wrote:
> yes, i patch qemu as well.
> 
> and i found the address of second vf is not in mmio range. the first
> one is fine.

So it looks like something is wrong with the MMIO registration part. Could
you check the registration in assigned_dev_iomem_map() of the 4th patch for
QEmu? I suppose something is wrong with it. I will try to reproduce it here.

And if you only use one VF, how about the gain?

--
regards
Yang, Sheng

> 
> 2010/12/1 Yang, Sheng :
> > On Wednesday 01 December 2010 16:41:38 lidong chen wrote:
> >> I used sr-iov, give each vm 2 vf.
> >> after apply the patch, and i found performence is the same.
> >> 
> >> the reason is in function msix_mmio_write, mostly addr is not in mmio
> >> range.
> > 
> > Did you patch qemu as well? You can see it's impossible for kernel part
> > to work alone...
> > 
> > http://www.mail-archive.com/kvm@vger.kernel.org/msg44368.html
> > 
> > --
> > regards
> > Yang, Sheng
> > 
> >> static int msix_mmio_write(struct kvm_io_device *this, gpa_t addr, int
> >> len, const void *val)
> >> {
> >>   struct kvm_assigned_dev_kernel *adev =
> >>   container_of(this, struct kvm_assigned_dev_kernel,
> >>msix_mmio_dev);
> >>   int idx, r = 0;
> >>   unsigned long new_val = *(unsigned long *)val;
> >> 
> >>   mutex_lock(&adev->kvm->lock);
> >>   if (!msix_mmio_in_range(adev, addr, len)) {
> >>   // return here.
> >>  r = -EOPNOTSUPP;
> >>   goto out;
> >>   }
> >> 
> >> i printk the value:
> >> addr start   end   len
> >> F004C00C   F0044000  F0044030 4
> >> 
> >> 00:06.0 Ethernet controller: Intel Corporation Unknown device 10ed (rev 01)
> >>   Subsystem: Intel Corporation Unknown device 000c
> >>   Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
> >>   Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR-
> >>   Latency: 0
> >>   Region 0: Memory at f004 (32-bit, non-prefetchable) [size=16K]
> >>   Region 3: Memory at f0044000 (32-bit, non-prefetchable) [size=16K]
> >>   Capabilities: [40] MSI-X: Enable+ Mask- TabSize=3
> >>   Vector table: BAR=3 offset=
> >>   PBA: BAR=3 offset=2000
> >> 
> >> 00:07.0 Ethernet controller: Intel Corporation Unknown device 10ed (rev 01)
> >>   Subsystem: Intel Corporation Unknown device 000c
> >>   Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
> >>   Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR-
> >>   Latency: 0
> >>   Region 0: Memory at f0048000 (32-bit, non-prefetchable) [size=16K]
> >>   Region 3: Memory at f004c000 (32-bit, non-prefetchable) [size=16K]
> >>   Capabilities: [40] MSI-X: Enable+ Mask- TabSize=3
> >>   Vector table: BAR=3 offset=
> >>   PBA: BAR=3 offset=00002000
> >> 
> >> 
> >> 
> >> +static bool msix_mmio_in_range(struct kvm_assigned_dev_kernel *adev,
> >> +   gpa_t addr, int len)
> >> +{
> >> + gpa_t start, end;
> >> +
> >> + BUG_ON(adev->msix_mmio_base == 0);
> >> + start = adev->msix_mmio_base;
> >> + end = adev->msix_mmio_base + PCI_MSIX_ENTRY_SIZE *
> >> + adev->msix_max_entries_nr;
> >> + if (addr >= start && addr + len <= end)
> >> + return true;
> >> +
> >> + return false;
> >> +}
> >> 
> >> 2010/11/30 Yang, Sheng :
> >> > On Tuesday 30 November 2010 17:10:11 lidong chen wrote:
> >> >> sr-iov also meet this problem, MSIX mask waste a lot of cpu resource.
> >> >> 
> >> >> I test kvm with sriov, which the vf driver could not disable msix.
> >> >> so the host os waste a lot of cpu.  cpu rate of host os is 90%.
> >> >> 
> >> >> then I test xen with sriov, there ara also a lot of vm exits caused
> >> >> by MSIX mask.
> >> >> but the cpu rate of xen and domain0 is less than kvm. cpu rate of xen
> >> >> and domain0 is 60%.
>

Re: Performance test result between virtio_pci MSI-X disable and enable

2010-12-01 Thread Yang, Sheng

On Wednesday 01 December 2010 16:41:38 lidong chen wrote:
> I used sr-iov, give each vm 2 vf.
> after apply the patch, and i found performence is the same.
> 
> the reason is in function msix_mmio_write, mostly addr is not in mmio
> range.

This URL may be more convenient.

http://www.spinics.net/lists/kvm/msg44795.html

--
regards
Yang, Sheng

> 
> static int msix_mmio_write(struct kvm_io_device *this, gpa_t addr, int len,
>  const void *val)
> {
>   struct kvm_assigned_dev_kernel *adev =
>   container_of(this, struct kvm_assigned_dev_kernel,
>msix_mmio_dev);
>   int idx, r = 0;
>   unsigned long new_val = *(unsigned long *)val;
> 
>   mutex_lock(&adev->kvm->lock);
>   if (!msix_mmio_in_range(adev, addr, len)) {
>   // return here.
>  r = -EOPNOTSUPP;
>   goto out;
>   }
> 
> i printk the value:
> addr start   end   len
> F004C00C   F0044000  F0044030 4
> 
> 00:06.0 Ethernet controller: Intel Corporation Unknown device 10ed (rev 01)
>   Subsystem: Intel Corporation Unknown device 000c
>   Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
> Stepping- SERR- FastB2B-
>   Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR-
>   Latency: 0
>   Region 0: Memory at f004 (32-bit, non-prefetchable) [size=16K]
>   Region 3: Memory at f0044000 (32-bit, non-prefetchable) [size=16K]
>   Capabilities: [40] MSI-X: Enable+ Mask- TabSize=3
>   Vector table: BAR=3 offset=
>   PBA: BAR=3 offset=2000
> 
> 00:07.0 Ethernet controller: Intel Corporation Unknown device 10ed (rev 01)
>   Subsystem: Intel Corporation Unknown device 000c
>   Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
> Stepping- SERR- FastB2B-
>   Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR-
>   Latency: 0
>   Region 0: Memory at f0048000 (32-bit, non-prefetchable) [size=16K]
>   Region 3: Memory at f004c000 (32-bit, non-prefetchable) [size=16K]
>   Capabilities: [40] MSI-X: Enable+ Mask- TabSize=3
>   Vector table: BAR=3 offset=
>   PBA: BAR=3 offset=2000
> 
> 
> 
> +static bool msix_mmio_in_range(struct kvm_assigned_dev_kernel *adev,
> +   gpa_t addr, int len)
> +{
> + gpa_t start, end;
> +
> + BUG_ON(adev->msix_mmio_base == 0);
> + start = adev->msix_mmio_base;
> + end = adev->msix_mmio_base + PCI_MSIX_ENTRY_SIZE *
> + adev->msix_max_entries_nr;
> + if (addr >= start && addr + len <= end)
> + return true;
> +
> + return false;
> +}
> 
> 2010/11/30 Yang, Sheng :
> > On Tuesday 30 November 2010 17:10:11 lidong chen wrote:
> >> sr-iov also meet this problem, MSIX mask waste a lot of cpu resource.
> >> 
> >> I test kvm with sriov, which the vf driver could not disable msix.
> >> so the host os waste a lot of cpu.  cpu rate of host os is 90%.
> >> 
> >> then I test xen with sriov, there ara also a lot of vm exits caused by
> >> MSIX mask.
> >> but the cpu rate of xen and domain0 is less than kvm. cpu rate of xen
> >> and domain0 is 60%.
> >> 
> >> without sr-iov, the cpu rate of xen and domain0 is higher than kvm.
> >> 
> >> so i think the problem is kvm waste more cpu resource to deal with MSIX
> >> mask. and we can see how xen deal with MSIX mask.
> >> 
> >> if this problem sloved, maybe with MSIX enabled, the performace is
> >> better.
> > 
> > Please refer to my posted patches for this issue.
> > 
> > http://www.spinics.net/lists/kvm/msg44992.html
> > 
> > --
> > regards
> > Yang, Sheng
> > 
> >> 2010/11/23 Avi Kivity :
> >> > On 11/23/2010 09:27 AM, lidong chen wrote:
> >> >> can you tell me something about this problem.
> >> >> thanks.
> >> > 
> >> > Which problem?
> >> > 
> >> > --
> >> > I have a truly marvellous patch that fixes the bug which this
> >> > signature is too narrow to contain.

Re: Performance test result between virtio_pci MSI-X disable and enable

2010-12-01 Thread Yang, Sheng

On Wednesday 01 December 2010 16:41:38 lidong chen wrote:
> I used sr-iov, give each vm 2 vf.
> after apply the patch, and i found performence is the same.
> 
> the reason is in function msix_mmio_write, mostly addr is not in mmio
> range.

Did you patch qemu as well? You can see it's impossible for the kernel part
to work alone...

http://www.mail-archive.com/kvm@vger.kernel.org/msg44368.html

--
regards
Yang, Sheng


> 
> static int msix_mmio_write(struct kvm_io_device *this, gpa_t addr, int len,
>  const void *val)
> {
>   struct kvm_assigned_dev_kernel *adev =
>   container_of(this, struct kvm_assigned_dev_kernel,
>msix_mmio_dev);
>   int idx, r = 0;
>   unsigned long new_val = *(unsigned long *)val;
> 
>   mutex_lock(&adev->kvm->lock);
>   if (!msix_mmio_in_range(adev, addr, len)) {
>   // return here.
>  r = -EOPNOTSUPP;
>   goto out;
>   }
> 
> i printk the value:
> addr start   end   len
> F004C00C   F0044000  F0044030 4
> 
> 00:06.0 Ethernet controller: Intel Corporation Unknown device 10ed (rev 01)
>   Subsystem: Intel Corporation Unknown device 000c
>   Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
> Stepping- SERR- FastB2B-
>   Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR-
>   Latency: 0
>   Region 0: Memory at f004 (32-bit, non-prefetchable) [size=16K]
>   Region 3: Memory at f0044000 (32-bit, non-prefetchable) [size=16K]
>   Capabilities: [40] MSI-X: Enable+ Mask- TabSize=3
>   Vector table: BAR=3 offset=
>   PBA: BAR=3 offset=2000
> 
> 00:07.0 Ethernet controller: Intel Corporation Unknown device 10ed (rev 01)
>   Subsystem: Intel Corporation Unknown device 000c
>   Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
> Stepping- SERR- FastB2B-
>   Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR-
>   Latency: 0
>   Region 0: Memory at f0048000 (32-bit, non-prefetchable) [size=16K]
>   Region 3: Memory at f004c000 (32-bit, non-prefetchable) [size=16K]
>   Capabilities: [40] MSI-X: Enable+ Mask- TabSize=3
>   Vector table: BAR=3 offset=
>   PBA: BAR=3 offset=2000
> 
> 
> 
> +static bool msix_mmio_in_range(struct kvm_assigned_dev_kernel *adev,
> +   gpa_t addr, int len)
> +{
> + gpa_t start, end;
> +
> + BUG_ON(adev->msix_mmio_base == 0);
> + start = adev->msix_mmio_base;
> + end = adev->msix_mmio_base + PCI_MSIX_ENTRY_SIZE *
> + adev->msix_max_entries_nr;
> + if (addr >= start && addr + len <= end)
> + return true;
> +
> + return false;
> +}
> 
> 2010/11/30 Yang, Sheng :
> > On Tuesday 30 November 2010 17:10:11 lidong chen wrote:
> >> sr-iov also meet this problem, MSIX mask waste a lot of cpu resource.
> >> 
> >> I test kvm with sriov, which the vf driver could not disable msix.
> >> so the host os waste a lot of cpu.  cpu rate of host os is 90%.
> >> 
> >> then I test xen with sriov, there ara also a lot of vm exits caused by
> >> MSIX mask.
> >> but the cpu rate of xen and domain0 is less than kvm. cpu rate of xen
> >> and domain0 is 60%.
> >> 
> >> without sr-iov, the cpu rate of xen and domain0 is higher than kvm.
> >> 
> >> so i think the problem is kvm waste more cpu resource to deal with MSIX
> >> mask. and we can see how xen deal with MSIX mask.
> >> 
> >> if this problem sloved, maybe with MSIX enabled, the performace is
> >> better.
> > 
> > Please refer to my posted patches for this issue.
> > 
> > http://www.spinics.net/lists/kvm/msg44992.html
> > 
> > --
> > regards
> > Yang, Sheng
> > 
> >> 2010/11/23 Avi Kivity :
> >> > On 11/23/2010 09:27 AM, lidong chen wrote:
> >> >> can you tell me something about this problem.
> >> >> thanks.
> >> > 
> >> > Which problem?
> >> > 
> >> > --
> >> > I have a truly marvellous patch that fixes the bug which this
> >> > signature is too narrow to contain.

Re: Mask bit support's API

2010-11-30 Thread Yang, Sheng

On Tuesday 30 November 2010 22:15:29 Avi Kivity wrote:
> On 11/26/2010 04:35 AM, Yang, Sheng wrote:
> > >  >  Shouldn't kvm also service reads from the pending bitmask?
> > >  
> > >  Of course KVM should service reading from pending bitmask. For
> > >  assigned device, it's kernel who would set the pending bit; but I am
> > >  not sure for virtio. This interface is GET_ENTRY, so reading is fine
> > >  with it.
> 
> The kernel should manage it in the same way.  Virtio raises irq (via
> KVM_IRQ_LINE or vhost-net's irqfd), kernel sets pending bit.
> 
> Note we need to be able to read and write the pending bitmask for live
> migration.

Then it seems we still need a write interface for it. And I think we can
work on it later, since it has no user now.
> 
> > >  >  We could have the kernel handle addr/data writes by setting up an
> > >  >  internal interrupt routing.  A disadvantage is that more work is
> > >  >  needed if we emulator interrupt remapping in qemu.
> > >  
> > >  In fact modifying irq routing in the kernel is also the thing I want
> > >  to avoid.
> > >  
> > >  So, the flow would be:
> > >  
> > >  kernel get MMIO write, record it in it's own MSI table
> > >  KVM exit to QEmu, by one specific exit reason
> > >  QEmu know it have to sync the MSI table, then reading the entries from
> > >  kernel QEmu found it's an write, so it need to reprogram irq routing
> > >  table using the entries above
> > >  done
> > >  
> > >  But wait, why should qemu read entries from kernel? By default exit we
> > >  already have the information about what's the entry to modify and what
> > >  to write, so we can use them directly. By this way, we also don't
> > >  need an specific exit reason - just exit to qemu in normal way is
> > >  fine.
> 
> Because we have an interface where you get an exit if (addr % 4) < 3 and
> don't get an exit if (addr % 4) == 3.  There is a gpa range which is
> partially maintained by the kernel and partially in userspace.  It's a
> confusing interface.  Things like 64-bit reads or writes need to be
> broken up and serviced in two different places.
> 
> We already need to support this (for unaligned writes which hit two
> regions), but let's at least make a contiguous region behave sanely.

Oh, I didn't mean to handle this kind of unaligned write in userspace.
They're illegal according to the PCI spec (the result is undefined according
to the spec). I will cover all contiguous writes (32-bit and 64-bit) in the
next version, and discard all illegal writes. And 64-bit accesses would be
broken up in qemu as you said, as they are currently.

In fact, I think we can hand all the data of a 64-bit write over to qemu,
because qemu still has to sync the mask bit with the kernel, which makes the
mask bit write in userspace useless, so it can be ignored.
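
A minimal sketch of the "cover aligned 32-bit and 64-bit accesses, discard
the rest" rule, assuming the standard MSI-X entry layout (16 bytes per
entry, Vector Control dword at offset 12). The demo_ names are hypothetical
and this is not the code from the next version of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_ENTRY_SIZE   16u   /* one MSI-X table entry */
#define DEMO_VECTOR_CTRL  12u   /* offset of the Vector Control dword */

struct demo_access {
    unsigned int entry;     /* table entry index */
    bool touches_ctrl;      /* access covers Vector Control (mask bit) */
};

/*
 * Decode an access to the table. Returns false for accesses this sketch
 * discards: anything that is not an aligned 4-byte or 8-byte access.
 */
static bool demo_decode(uint64_t table_base, uint64_t addr, unsigned int len,
                        struct demo_access *out)
{
    uint64_t off;

    assert(addr >= table_base);
    off = addr - table_base;
    if ((len != 4 && len != 8) || (off % len) != 0)
        return false;

    out->entry = (unsigned int)(off / DEMO_ENTRY_SIZE);
    off %= DEMO_ENTRY_SIZE;
    /* True for a dword at offset 12 and for a qword at offset 8. */
    out->touches_ctrl = off + len > DEMO_VECTOR_CTRL;
    return true;
}
```

A 64-bit write at offset 8 of an entry covers both the data word and the
mask bit, which is the case where the access has to be split or handed over
as a whole.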
>
> > >  Then it would be:
> > >  
> > >  kernel get MMIO write, record it in it's own MSI table
> > >  KVM exit to QEmu, indicate MMIO exit
> > >  QEmu found it's an write, it would update it's own MSI table(may need
> > >  to query mask bit from kernel), and reprogram irq routing table using
> > >  the entries above done
> > >  
> > >  Then why should kernel kept it's own MSI table? I think the only
> > >  reason is we can speed up reading in that way - but the reading we
> > >  want to speed up is mostly on enabled entry(the first entry), which
> > >  is already in the IRQ routing table...
> 
> The reason is to keep a sane interface.  Like we emulate instructions
> and msrs in the kernel and don't do half a job.  I don't think there's a
> real need to accelerate the first three words of an msi-x entry.

Here is the case we've observed with Xen. It can only be reproduced by
large-scale testing. When the interrupt intensity is very high, even new
kernels would try to lower it, and then they would access the mask bit very
frequently. And in the kernel, msi_set_mask_bit() is like this:

static void msi_set_mask_bit(struct irq_data *data, u32 flag)
{
        struct msi_desc *desc = irq_data_get_msi(data);

        if (desc->msi_attrib.is_msix) {
                msix_mask_irq(desc, flag);
                readl(desc->mask_base);         /* Flush write to device *

Re: Performance test result between virtio_pci MSI-X disable and enable

2010-11-30 Thread Yang, Sheng

On Tuesday 30 November 2010 17:10:11 lidong chen wrote:
> sr-iov also meet this problem, MSIX mask waste a lot of cpu resource.
> 
> I test kvm with sriov, which the vf driver could not disable msix.
> so the host os waste a lot of cpu.  cpu rate of host os is 90%.
> 
> then I test xen with sriov, there ara also a lot of vm exits caused by
> MSIX mask.
> but the cpu rate of xen and domain0 is less than kvm. cpu rate of xen
> and domain0 is 60%.
> 
> without sr-iov, the cpu rate of xen and domain0 is higher than kvm.
> 
> so i think the problem is kvm waste more cpu resource to deal with MSIX
> mask. and we can see how xen deal with MSIX mask.
> 
> if this problem sloved, maybe with MSIX enabled, the performace is better.

Please refer to my posted patches for this issue. 

http://www.spinics.net/lists/kvm/msg44992.html

--
regards
Yang, Sheng

> 
> 2010/11/23 Avi Kivity :
> > On 11/23/2010 09:27 AM, lidong chen wrote:
> >> can you tell me something about this problem.
> >> thanks.
> > 
> > Which problem?
> > 
> > --
> > I have a truly marvellous patch that fixes the bug which this
> > signature is too narrow to contain.

Re: Mask bit support's API

2010-11-25 Thread Yang, Sheng

On Wednesday 24 November 2010 09:59:23 Yang, Sheng wrote:
> On Tuesday 23 November 2010 22:06:20 Avi Kivity wrote:
> > On 11/23/2010 03:57 PM, Yang, Sheng wrote:
> > > >  >  Yeah, but won't be included in this patchset.
> > > >  
> > > >  What API changes are needed?  I'd like to see the complete API.
> > > 
> > > I am not sure about it. But I suppose the structure should be the same?
> > > In fact it's pretty hard for me to image what's needed for virtio in
> > > the future, especially there is no such code now. I really prefer to
> > > deal with assigned device and virtio separately, which would make the
> > > work much easier. But seems you won't agree on that.
> > 
> > First, I don't really see why the two cases are different (but I don't
> > do a lot in this space).  Surely between you and Michael, you have all
> > the information?
> > 
> > Second, my worry is a huge number of ABI variants that come from
> > incrementally adding features.  I want to implement bigger chunks of
> > functionality.  So I'd like to see all potential users addressed, at
> > least from the ABI point of view if not the implementation.
> > 
> > > >  The API needs to be compatible with the pending bit, even if we
> > > >  don't implement it now.  I want to reduce the rate of API changes.
> > > 
> > > This can be implemented by this API, just adding a flag for it. And I
> > > would still take this into consideration in the next API purposal.
> > 
> > Shouldn't kvm also service reads from the pending bitmask?
> 
> Of course KVM should service reading from pending bitmask. For assigned
> device, it's kernel who would set the pending bit; but I am not sure for
> virtio. This interface is GET_ENTRY, so reading is fine with it.
> 
> > > >  So instead of
> > > >  
> > > >  - guest reads/writes msix
> > > >  - kvm filters mmio, implements some, passes others to userspace
> > > >  
> > > >  we have
> > > >  
> > > >  - guest reads/writes msix
> > > >  - kvm implements all
> > > >  - some writes generate an additional notification to userspace
> > > 
> > > I suppose we don't need to generate notification to userspace? Because
> > > every read/write is handled by kernel, and userspace just need
> > > interface to kernel to get/set the entry - and well, does userspace
> > > need to do it when kernel can handle all of them? Maybe not...
> > 
> > We could have the kernel handle addr/data writes by setting up an
> > internal interrupt routing.  A disadvantage is that more work is needed
> > if we emulator interrupt remapping in qemu.
> 
> In fact modifying irq routing in the kernel is also the thing I want to
> avoid.
> 
> So, the flow would be:
> 
> kernel get MMIO write, record it in it's own MSI table
> KVM exit to QEmu, by one specific exit reason
> QEmu know it have to sync the MSI table, then reading the entries from
> kernel QEmu found it's an write, so it need to reprogram irq routing table
> using the entries above
> done
> 
> But wait, why should qemu read entries from kernel? By default exit we
> already have the information about what's the entry to modify and what to
> write, so we can use them directly. By this way, we also don't need an
> specific exit reason - just exit to qemu in normal way is fine.
> 
> Then it would be:
> 
> kernel get MMIO write, record it in it's own MSI table
> KVM exit to QEmu, indicate MMIO exit
> QEmu found it's an write, it would update it's own MSI table(may need to
> query mask bit from kernel), and reprogram irq routing table using the
> entries above done
> 
> Then why should kernel kept it's own MSI table? I think the only reason is
> we can speed up reading in that way - but the reading we want to speed up
> is mostly on enabled entry(the first entry), which is already in the IRQ
> routing table...
> 
> And for enabled/disabled entry, you can see it like this: for the entries
> inside routing table, we think it's enabled; otherwise it's disabled. Then
> you don't need to bothered by pci_enable_msix().
> 
> So our strategy for reading accelerating can be:
> 
> If the entry contained in irq routing table, then use it; otherwise let
> qemu deal with it. Because it's the QEmu who owned irq routing table, the
> synchronization is guaranteed. We don't need the MSI table in the kernel
> then.
> 
> And for writing, we just want t

Re: Mask bit support's API

2010-11-23 Thread Yang, Sheng

On Tuesday 23 November 2010 22:06:20 Avi Kivity wrote:
> On 11/23/2010 03:57 PM, Yang, Sheng wrote:
> > >  >  Yeah, but won't be included in this patchset.
> > >  
> > >  What API changes are needed?  I'd like to see the complete API.
> > 
> > I am not sure about it. But I suppose the structure should be the same?
> > In fact it's pretty hard for me to image what's needed for virtio in the
> > future, especially there is no such code now. I really prefer to deal
> > with assigned device and virtio separately, which would make the work
> > much easier. But seems you won't agree on that.
> 
> First, I don't really see why the two cases are different (but I don't
> do a lot in this space).  Surely between you and Michael, you have all
> the information?
> 
> Second, my worry is a huge number of ABI variants that come from
> incrementally adding features.  I want to implement bigger chunks of
> functionality.  So I'd like to see all potential users addressed, at
> least from the ABI point of view if not the implementation.
> 
> > >  The API needs to be compatible with the pending bit, even if we don't
> > >  implement it now.  I want to reduce the rate of API changes.
> > 
> > This can be implemented by this API, just adding a flag for it. And I
> > would still take this into consideration in the next API purposal.
> 
> Shouldn't kvm also service reads from the pending bitmask?

Of course KVM should service reads from the pending bitmask. For an assigned
device, it is the kernel that sets the pending bit, but I am not sure about
virtio. This interface is GET_ENTRY, so reading is fine with it.

> > >  So instead of
> > >  
> > >  - guest reads/writes msix
> > >  - kvm filters mmio, implements some, passes others to userspace
> > >  
> > >  we have
> > >  
> > >  - guest reads/writes msix
> > >  - kvm implements all
> > >  - some writes generate an additional notification to userspace
> > 
> > I suppose we don't need to generate notification to userspace? Because
> > every read/write is handled by kernel, and userspace just need interface
> > to kernel to get/set the entry - and well, does userspace need to do it
> > when kernel can handle all of them? Maybe not...
> 
> We could have the kernel handle addr/data writes by setting up an
> internal interrupt routing.  A disadvantage is that more work is needed
> if we emulator interrupt remapping in qemu.

In fact, modifying irq routing in the kernel is also something I want to avoid.

So, the flow would be:

- The kernel gets the MMIO write and records it in its own MSI table.
- KVM exits to QEmu with one specific exit reason.
- QEmu knows it has to sync the MSI table, so it reads the entries from the
  kernel.
- QEmu finds it is a write, so it needs to reprogram the irq routing table
  using the entries above.
- Done.

But wait, why should QEmu read the entries from the kernel? With the default
exit we already have the information about which entry to modify and what to
write, so we can use it directly. This way we also don't need a specific exit
reason; exiting to QEmu in the normal way is fine.

Then it would be:

- The kernel gets the MMIO write and records it in its own MSI table.
- KVM exits to QEmu, indicating an MMIO exit.
- QEmu finds it is a write, updates its own MSI table (it may need to query
  the mask bit from the kernel), and reprograms the irq routing table using
  the entries above.
- Done.

Then why should the kernel keep its own MSI table? I think the only reason is
that we can speed up reads that way, but the reads we want to speed up are
mostly on enabled entries (the first entry), which are already in the IRQ
routing table...

As for enabled/disabled entries, you can see it like this: entries inside the
routing table are considered enabled; all others are disabled. Then you don't
need to be bothered with pci_enable_msix().

So our strategy for accelerating reads can be:

If the entry is contained in the irq routing table, then use it; otherwise let
QEmu deal with it. Because it is QEmu that owns the irq routing table,
synchronization is guaranteed. We then don't need the MSI table in the kernel.

And for writes, we just want to cover all of the mask bits, but nothing else.
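
The read and write split described above can be sketched roughly as follows.
This is only a user-space model in plain C, not kernel code: struct msi_route,
msix_mmio_read(), msix_mmio_write() and the MMIO_* result codes are names made
up for illustration. The only thing taken from the PCI spec is the 16-byte
MSI-X table entry layout (address low, address high, data, vector control).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MSIX_ENTRY_SIZE   16u   /* addr_lo, addr_hi, data, vector control */
#define MSIX_WORD_CTRL    3u    /* index of the vector control word */
#define MSIX_CTRL_MASKBIT 0x1u  /* bit 0 of vector control is the mask bit */

/* Hypothetical view of one MSI route in the irq routing table. */
struct msi_route {
    bool     present;   /* entry is in the routing table, i.e. "enabled" */
    bool     masked;    /* mask bit, tracked for every entry */
    uint32_t word[3];   /* addr_lo, addr_hi, data */
};

enum mmio_result { MMIO_HANDLED, MMIO_EXIT_TO_USERSPACE };

/* Reads: mask bit always served here; addr/data only for routed entries. */
enum mmio_result msix_mmio_read(const struct msi_route *tbl, size_t nr,
                                uint64_t offset, uint32_t *val)
{
    size_t entry = offset / MSIX_ENTRY_SIZE;
    unsigned int word = (offset % MSIX_ENTRY_SIZE) / 4;

    assert(val != NULL);
    if (entry >= nr)
        return MMIO_EXIT_TO_USERSPACE;
    if (word == MSIX_WORD_CTRL) {
        *val = tbl[entry].masked ? MSIX_CTRL_MASKBIT : 0;
        return MMIO_HANDLED;
    }
    if (!tbl[entry].present)
        return MMIO_EXIT_TO_USERSPACE;  /* QEmu owns disabled entries */
    *val = tbl[entry].word[word];
    return MMIO_HANDLED;
}

/* Writes: only the mask bit is handled here; everything else exits. */
enum mmio_result msix_mmio_write(struct msi_route *tbl, size_t nr,
                                 uint64_t offset, uint32_t val)
{
    size_t entry = offset / MSIX_ENTRY_SIZE;
    unsigned int word = (offset % MSIX_ENTRY_SIZE) / 4;

    if (entry >= nr || word != MSIX_WORD_CTRL)
        return MMIO_EXIT_TO_USERSPACE;
    tbl[entry].masked = (val & MSIX_CTRL_MASKBIT) != 0;
    return MMIO_HANDLED;
}
```

In this model an address/data write always ends up in QEmu, which is what
keeps the routing table and QEmu's own record in sync.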

I think this concept is more acceptable?

The issue here is that the MSI table and the irq routing table hold duplicate
information for some entries. My initial proposal was to use the irq routing
table in the kernel, so that we don't need to duplicate the information.

--
regards
Yang, Sheng

Re: Mask bit support's API

2010-11-23 Thread Yang, Sheng

On Tuesday 23 November 2010 20:04:16 Michael S. Tsirkin wrote:
> On Tue, Nov 23, 2010 at 02:09:52PM +0800, Yang, Sheng wrote:
> > Hi Avi,
> > 
> > I've purposed the following API for mask bit support.
> > 
> > The main point is, QEmu can know which entries are enabled(by
> > pci_enable_msix()).
> 
> Unfortunately, it can't I think, unless all your guests are linux.
> "enabled entries" is a linux kernel concept.
> The MSIX spec only tells you which entries are masked and which are
> unmasked.

I can't understand what you are talking about, or how it relates to the guest
OS. I was talking about pci_enable_msix() in the host Linux.

--
regards
Yang, Sheng

Re: Mask bit support's API

2010-11-23 Thread Yang, Sheng

On Tuesday 23 November 2010 20:47:33 Avi Kivity wrote:
> On 11/23/2010 10:30 AM, Yang, Sheng wrote:
> > On Tuesday 23 November 2010 15:54:40 Avi Kivity wrote:
> > >  On 11/23/2010 08:35 AM, Yang, Sheng wrote:
> > >  >  On Tuesday 23 November 2010 14:17:28 Avi Kivity wrote:
> > >  >  >   On 11/23/2010 08:09 AM, Yang, Sheng wrote:
> > >  >  >   >   Hi Avi,
> > >  >  >   >   
> > >  >  >   >   I've purposed the following API for mask bit support.
> > >  >  >   >   
> > >  >  >   >   The main point is, QEmu can know which entries are
> > >  >  >   >   enabled(by pci_enable_msix()). And for enabled entries,
> > >  >  >   >   kernel own it, including MSI data/address and mask
> > >  >  >   >   bit(routing table and mask bitmap). QEmu should use
> > >  >  >   >   KVM_GET_MSIX_ENTRY ioctl to get them(and it can sync with
> > >  >  >   >   them if it want to do so).
> > >  >  >   >   
> > >  >  >   >   Before entries are enabled, QEmu can still use it's own MSI
> > >  >  >   >   table(because we didn't contain these kind of information
> > >  >  >   >   in kernel, and it's unnecessary for kernel).
> > >  >  >   >   
> > >  >  >   >   The KVM_MSIX_FLAG_ENTRY flag would be clear if QEmu want to
> > >  >  >   >   query one entry didn't exist in kernel - or we can simply
> > >  >  >   >   return -EINVAL for it.
> > >  >  >   >   
> > >  >  >   >   I suppose it would be rare for QEmu to use this interface
> > >  >  >   >   to get the context of entry(the only case I think is when
> > >  >  >   >   MSI-X disable and QEmu need to sync the context), so
> > >  >  >   >   performance should not be an issue.
> > >  >  >   >   
> > >  >  >   >   What's your opinion?
> > >  >  >   >   
> > >  >  >   >   >#define KVM_GET_MSIX_ENTRY  _IOWR(KVMIO,  0x7d,
> > >  >  >   >   >struct kvm_msix_entry)
> > >  >  >   
> > >  >  >   Need SET_MSIX_ENTRY for live migration as well.
> > >  >  
> > >  >  Current we don't support LM with VT-d...
> > >  
> > >  Isn't this work useful for virtio as well?
> > 
> > Yeah, but won't be included in this patchset.
> 
> What API changes are needed?  I'd like to see the complete API.

I am not sure about it, but I suppose the structure should be the same? In
fact it's pretty hard for me to imagine what virtio will need in the future,
especially since there is no such code now. I would really prefer to deal with
assigned devices and virtio separately, which would make the work much easier.
But it seems you won't agree to that.

> 
> > >  >  >   What about the pending bits?
> > >  >  
> > >  >  We didn't cover it here - and it's in another MMIO space(PBA). Of
> > >  >  course we can add more flags for it later.
> > >  
> > >  When an entry is masked, we need to set the pending bit for it
> > >  somewhere.  I guess this is broken in the existing code (without your
> > >  patches)?
> > 
> > Even with my patch, we didn't support the pending bit. It would always
> > return 0 now. What we supposed to do(after my patch checked in) is to
> > check IRQ_PENDING flag of irq_desc->status(if the entry is masked), and
> > return the result to userspace.
> > 
> > That would involve some core change, like to export irq_to_desc(). I
> > don't think it would be accepted soon, so would push mask bit first.
> 
> The API needs to be compatible with the pending bit, even if we don't
> implement it now.  I want to reduce the rate of API changes.

This can be implemented with this API, just by adding a flag for it. I will
still take this into consideration in the next API proposal.
 
> > >  >  >   Also need a new exit reason to tell userspace that an msix
> > >  >  >   entry has changed, so userspace can update mappings.
> > >  >  
> > >  >  I think we don't need it. Whenever userspace want to get one
> > >  >  mapping which is an enabled MSI-X entry, it can check it with the
> > >  >  API above(which is quite rare, because kernel would handle all of
> > >  >  them when guest is accessing them). If it's a disabled entry, the
> > >  >  cont

Re: Mask bit support's API

2010-11-23 Thread Yang, Sheng

On Tuesday 23 November 2010 15:54:40 Avi Kivity wrote:
> On 11/23/2010 08:35 AM, Yang, Sheng wrote:
> > On Tuesday 23 November 2010 14:17:28 Avi Kivity wrote:
> > >  On 11/23/2010 08:09 AM, Yang, Sheng wrote:
> > >  >  Hi Avi,
> > >  >  
> > >  >  I've purposed the following API for mask bit support.
> > >  >  
> > >  >  The main point is, QEmu can know which entries are enabled(by
> > >  >  pci_enable_msix()). And for enabled entries, kernel own it,
> > >  >  including MSI data/address and mask bit(routing table and mask
> > >  >  bitmap). QEmu should use KVM_GET_MSIX_ENTRY ioctl to get them(and
> > >  >  it can sync with them if it want to do so).
> > >  >  
> > >  >  Before entries are enabled, QEmu can still use it's own MSI
> > >  >  table(because we didn't contain these kind of information in
> > >  >  kernel, and it's unnecessary for kernel).
> > >  >  
> > >  >  The KVM_MSIX_FLAG_ENTRY flag would be clear if QEmu want to query
> > >  >  one entry didn't exist in kernel - or we can simply return -EINVAL
> > >  >  for it.
> > >  >  
> > >  >  I suppose it would be rare for QEmu to use this interface to get
> > >  >  the context of entry(the only case I think is when MSI-X disable
> > >  >  and QEmu need to sync the context), so performance should not be
> > >  >  an issue.
> > >  >  
> > >  >  What's your opinion?
> > >  >  
> > >  >  >   #define KVM_GET_MSIX_ENTRY  _IOWR(KVMIO,  0x7d, struct
> > >  >  >   kvm_msix_entry)
> > >  
> > >  Need SET_MSIX_ENTRY for live migration as well.
> > 
> > Current we don't support LM with VT-d...
> 
> Isn't this work useful for virtio as well?

Yeah, but won't be included in this patchset.
> 
> > >  >  >   #define KVM_UPDATE_MSIX_MMIO  _IOW(KVMIO,  0x7e, struct
> > >  >  >   kvm_msix_mmio)
> > >  >  >   
> > >  >  >   #define KVM_MSIX_TYPE_ASSIGNED_DEV  1
> > >  >  >   
> > >  >  >   #define KVM_MSIX_FLAG_MASKBIT   (1<<   0)
> > >  >  >   #define KVM_MSIX_FLAG_QUERY_MASKBIT (1<<   0)
> > >  >  >   #define KVM_MSIX_FLAG_ENTRY (1<<   1)
> > >  >  >   #define KVM_MSIX_FLAG_QUERY_ENTRY   (1<<   1)
> > >  
> > >  Why is there a need for the flag?  If we simply get/set entire
> > >  entries, that includes the mask bits?
> > 
> > We still want QEmu to cover a part of entries which hasn't been enabled
> > yet(which won't existed in routing table), but kernel would cover all
> > mask bit regardless of if it's enabled. So QEmu can query any entry to
> > check the maskbit, but not address/data.
> 
> Don't understand.  If we support reading/writing entire entries, that
> works for both enabled and disabled entries?
> 
> > >  What about the pending bits?
> > 
> > We didn't cover it here - and it's in another MMIO space(PBA). Of course
> > we can add more flags for it later.
> 
> When an entry is masked, we need to set the pending bit for it
> somewhere.  I guess this is broken in the existing code (without your
> patches)?

Even with my patch, we don't support the pending bit. It always returns 0
now. What we are supposed to do (after my patch is checked in) is to check the
IRQ_PENDING flag of irq_desc->status (if the entry is masked), and return the
result to userspace.

That would involve some core changes, like exporting irq_to_desc(). I don't
think it would be accepted soon, so I would push the mask bit first.

> 
> > >  Also need a new exit reason to tell userspace that an msix entry has
> > >  changed, so userspace can update mappings.
> > 
> > I think we don't need it. Whenever userspace want to get one mapping
> > which is an enabled MSI-X entry, it can check it with the API
> > above(which is quite rare, because kernel would handle all of them when
> > guest is accessing them). If it's a disabled entry, the context inside
> > userspace MMIO record is the correct one(and only one). The only place I
> > think QEmu need to sync is when MSI-X is about to disabled, QEmu need to
> > update it's own MMIO record.
> 
> So in-kernel handling of mmio would be decided per entry?  I'm trying to
> simplify this, and simplest thing is - all or nothing.

So you would like to handle all MSI-X MMIO in kernel?

--
regards
Yang, Sheng

Re: Mask bit support's API

2010-11-22 Thread Yang, Sheng

On Tuesday 23 November 2010 14:17:28 Avi Kivity wrote:
> On 11/23/2010 08:09 AM, Yang, Sheng wrote:
> > Hi Avi,
> > 
> > I've purposed the following API for mask bit support.
> > 
> > The main point is, QEmu can know which entries are enabled(by
> > pci_enable_msix()). And for enabled entries, kernel own it, including
> > MSI data/address and mask bit(routing table and mask bitmap). QEmu
> > should use KVM_GET_MSIX_ENTRY ioctl to get them(and it can sync with
> > them if it want to do so).
> > 
> > Before entries are enabled, QEmu can still use it's own MSI table(because
> > we didn't contain these kind of information in kernel, and it's
> > unnecessary for kernel).
> > 
> > The KVM_MSIX_FLAG_ENTRY flag would be clear if QEmu want to query one
> > entry didn't exist in kernel - or we can simply return -EINVAL for it.
> > 
> > I suppose it would be rare for QEmu to use this interface to get the
> > context of entry(the only case I think is when MSI-X disable and QEmu
> > need to sync the context), so performance should not be an issue.
> > 
> > What's your opinion?
> > 
> > >  #define KVM_GET_MSIX_ENTRY  _IOWR(KVMIO,  0x7d, struct
> > >  kvm_msix_entry)
> 
> Need SET_MSIX_ENTRY for live migration as well.

Currently we don't support live migration (LM) with VT-d...
> 
> > >  #define KVM_UPDATE_MSIX_MMIO  _IOW(KVMIO,  0x7e, struct
> > >  kvm_msix_mmio)
> > >  
> > >  #define KVM_MSIX_TYPE_ASSIGNED_DEV  1
> > >  
> > >  #define KVM_MSIX_FLAG_MASKBIT   (1<<  0)
> > >  #define KVM_MSIX_FLAG_QUERY_MASKBIT (1<<  0)
> > >  #define KVM_MSIX_FLAG_ENTRY (1<<  1)
> > >  #define KVM_MSIX_FLAG_QUERY_ENTRY   (1<<  1)
> 
> Why is there a need for the flag?  If we simply get/set entire entries,
> that includes the mask bits?

We still want QEmu to cover the entries which haven't been enabled yet (and so
don't exist in the routing table), but the kernel would cover all mask bits
regardless of whether the entry is enabled. So QEmu can query any entry to
check the mask bit, but not the address/data.
 
> What about the pending bits?

We didn't cover them here, and they are in another MMIO space (the PBA). Of
course we can add more flags for them later.
> 
> > >  struct kvm_msix_entry {
> > >  
> > >  __u32 id;
> > >  __u32 type;
> > >  __u32 entry; /* The index of entry in the MSI-X table */
> > >  __u32 flags;
> > >  __u32 query_flags;
> > >  union {
> > >  
> > >  struct {
> > >  
> > >  __u32 addr_lo;
> > >  __u32 addr_hi;
> > >  __u32 data;
> 
> Isn't the mask bit in the last word?  Or maybe I'm confused about the
> format.

I separated the entry and mask bit as I said above.
> 
> > >  } msi_entry;
> > >  __u32 reserved[12];
> > >  
> > >  };
> > >  
> > >  };
> > >  
> > >  #define KVM_MSIX_MMIO_FLAG_REGISTER (1<<  0)
> > >  #define KVM_MSIX_MMIO_FLAG_UNREGISTER   (1<<  1)
> > >  #define KVM_MSIX_MMIO_FLAG_MASK 0x3
> > >  
> > >  struct kvm_msix_mmio {
> > >  
> > >  __u32 id;
> > >  __u32 type;
> > >  __u64 base_addr;
> > >  __u32 max_entries_nr;
> > >  __u32 flags;
> > >  __u32 reserved[6];
> > >  
> > >  };
> 
> Also need a new exit reason to tell userspace that an msix entry has
> changed, so userspace can update mappings.

I think we don't need it. Whenever userspace wants to get a mapping for an
enabled MSI-X entry, it can check it with the API above (which is quite rare,
because the kernel handles all of them when the guest accesses them). If it's
a disabled entry, the context inside the userspace MMIO record is the correct
one (and the only one). The only place I think QEmu needs to sync is when
MSI-X is about to be disabled: QEmu then needs to update its own MMIO record.

--
regards
Yang, Sheng

Mask bit support's API

2010-11-22 Thread Yang, Sheng

Hi Avi,

I've proposed the following API for mask bit support.

The main point is that QEmu can know which entries are enabled (by
pci_enable_msix()). For enabled entries, the kernel owns them, including the
MSI data/address and mask bit (routing table and mask bitmap). QEmu should use
the KVM_GET_MSIX_ENTRY ioctl to get them (and it can sync with them if it
wants to do so).

Before entries are enabled, QEmu can still use its own MSI table (because we
don't keep this kind of information in the kernel, and the kernel doesn't need
it).

The KVM_MSIX_FLAG_ENTRY flag would be clear if QEmu queries an entry that
doesn't exist in the kernel; or we can simply return -EINVAL for it.

I suppose it would be rare for QEmu to use this interface to get the context
of an entry (the only case I can think of is when MSI-X is disabled and QEmu
needs to sync the context), so performance should not be an issue.

What's your opinion?

> #define KVM_GET_MSIX_ENTRY  _IOWR(KVMIO,  0x7d, struct kvm_msix_entry)
> #define KVM_UPDATE_MSIX_MMIO  _IOW(KVMIO,  0x7e, struct kvm_msix_mmio)   
>
> #define KVM_MSIX_TYPE_ASSIGNED_DEV  1   
> 
> #define KVM_MSIX_FLAG_MASKBIT   (1 << 0)
> #define KVM_MSIX_FLAG_QUERY_MASKBIT (1 << 0)
> #define KVM_MSIX_FLAG_ENTRY (1 << 1)
> #define KVM_MSIX_FLAG_QUERY_ENTRY   (1 << 1)
> 
> struct kvm_msix_entry {
> 	__u32 id;
> 	__u32 type;
> 	__u32 entry; /* The index of entry in the MSI-X table */
> 	__u32 flags;
> 	__u32 query_flags;
> 	union {
> 		struct {
> 			__u32 addr_lo;
> 			__u32 addr_hi;
> 			__u32 data;
> 		} msi_entry;
> 		__u32 reserved[12];
> 	};
> };
> 
> #define KVM_MSIX_MMIO_FLAG_REGISTER (1 << 0)
> #define KVM_MSIX_MMIO_FLAG_UNREGISTER   (1 << 1)
> #define KVM_MSIX_MMIO_FLAG_MASK 0x3 
> 
> struct kvm_msix_mmio {
> 	__u32 id;
> 	__u32 type;
> 	__u64 base_addr;
> 	__u32 max_entries_nr;
> 	__u32 flags;
> 	__u32 reserved[6];
> };
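
To make the intended use of flags and query_flags more concrete, here is a
small user-space mock. It is a sketch under assumptions, not the patch itself:
the structure is copied from the proposal above with uint32_t in place of
__u32, while struct mock_kernel_entry and mock_get_msix_entry() are
hypothetical stand-ins for the kernel state and for the KVM_GET_MSIX_ENTRY
handler. It also assumes that KVM_MSIX_FLAG_MASKBIT reports the current mask
state, and it uses -EINVAL only for an out-of-range index.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define KVM_MSIX_FLAG_MASKBIT        (1 << 0)
#define KVM_MSIX_FLAG_QUERY_MASKBIT  (1 << 0)
#define KVM_MSIX_FLAG_ENTRY          (1 << 1)
#define KVM_MSIX_FLAG_QUERY_ENTRY    (1 << 1)

struct kvm_msix_entry {
    uint32_t id;
    uint32_t type;
    uint32_t entry;        /* The index of entry in the MSI-X table */
    uint32_t flags;
    uint32_t query_flags;
    union {
        struct {
            uint32_t addr_lo;
            uint32_t addr_hi;
            uint32_t data;
        } msi_entry;
        uint32_t reserved[12];
    };
};

/* Hypothetical kernel-side state for one table entry. */
struct mock_kernel_entry {
    bool     enabled;      /* present in the routing table */
    bool     masked;       /* mask bit, kept for every entry */
    uint32_t addr_lo, addr_hi, data;
};

/* Mock handler: answers only what query_flags asks for. */
int mock_get_msix_entry(const struct mock_kernel_entry *tbl, uint32_t nr,
                        struct kvm_msix_entry *e)
{
    assert(tbl != NULL && e != NULL);
    if (e->entry >= nr)
        return -EINVAL;

    const struct mock_kernel_entry *k = &tbl[e->entry];

    e->flags = 0;
    memset(e->reserved, 0, sizeof(e->reserved));

    /* The mask bit can be queried for any entry. */
    if ((e->query_flags & KVM_MSIX_FLAG_QUERY_MASKBIT) && k->masked)
        e->flags |= KVM_MSIX_FLAG_MASKBIT;

    /* Address/data are returned only for entries the kernel owns. */
    if ((e->query_flags & KVM_MSIX_FLAG_QUERY_ENTRY) && k->enabled) {
        e->flags |= KVM_MSIX_FLAG_ENTRY;
        e->msi_entry.addr_lo = k->addr_lo;
        e->msi_entry.addr_hi = k->addr_hi;
        e->msi_entry.data = k->data;
    }
    return 0;
}
```

A caller that sees KVM_MSIX_FLAG_ENTRY clear in the result would fall back to
its own MSI table for the address and data of that entry.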

--
regards
Yang, Sheng 

Re: [PATCH 1/3] Introduce a workqueue to deliver PIT timer interrupts.

2010-06-11 Thread Yang, Sheng

On Saturday 12 June 2010 01:35:23 Marcelo Tosatti wrote:
> On Thu, Jun 10, 2010 at 04:44:05PM -0400, Chris Lalancette wrote:
> > We really want to "kvm_set_irq" during the hrtimer callback,
> > but that is risky because that is during interrupt context.
> > Instead, offload the work to a workqueue, which is a bit safer
> > and should provide most of the same functionality.
> > 
> > Signed-off-by: Chris Lalancette 
> > ---
> > 
> >  arch/x86/kvm/i8254.c |  117
> >  arch/x86/kvm/i8254.h |    4 +-
> >  arch/x86/kvm/irq.c   |    1 -
> >  3 files changed, 69 insertions(+), 53 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
> > index 188d827..99c7472 100644
> > --- a/arch/x86/kvm/i8254.c
> > +++ b/arch/x86/kvm/i8254.c
> > @@ -34,6 +34,7 @@
> > 
> >  #include 
> >  #include 
> > 
> > +#include 
> > 
> >  #include "irq.h"
> >  #include "i8254.h"
> > 
> > @@ -244,11 +245,11 @@ static void kvm_pit_ack_irq(struct kvm_irq_ack_notifier *kian)
> > 
> >  {
> >  
> > struct kvm_kpit_state *ps = container_of(kian, struct kvm_kpit_state,
> > 
> >  irq_ack_notifier);
> > 
> > -   raw_spin_lock(&ps->inject_lock);
> > +   spin_lock(&ps->inject_lock);
> > 
> > if (atomic_dec_return(&ps->pit_timer.pending) < 0)
> > 
> > atomic_inc(&ps->pit_timer.pending);
> > 
> > ps->irq_ack = 1;
> > 
> > -   raw_spin_unlock(&ps->inject_lock);
> > +   spin_unlock(&ps->inject_lock);
> > 
> >  }
> >  
> >  void __kvm_migrate_pit_timer(struct kvm_vcpu *vcpu)
> > 
> > @@ -281,6 +282,58 @@ static struct kvm_timer_ops kpit_ops = {
> > 
> > .is_periodic = kpit_is_periodic,
> >  
> >  };
> > 
> > +static void pit_do_work(struct work_struct *work)
> > +{
> > +   struct kvm_pit *pit = container_of(work, struct kvm_pit, expired);
> > +   struct kvm *kvm = pit->kvm;
> > +   struct kvm_vcpu *vcpu;
> > +   int i;
> > +   struct kvm_kpit_state *ps = &pit->pit_state;
> > +   int inject = 0;
> > +
> > +   /* Try to inject pending interrupts when
> > +* last one has been acked.
> > +*/
> > +   spin_lock(&ps->inject_lock);
> > +   if (ps->irq_ack) {
> > +   ps->irq_ack = 0;
> > +   inject = 1;
> > +   }
> > +   spin_unlock(&ps->inject_lock);
> > +   if (inject) {
> > +   kvm_set_irq(kvm, kvm->arch.vpit->irq_source_id, 0, 1);
> > +   kvm_set_irq(kvm, kvm->arch.vpit->irq_source_id, 0, 0);
> > +
> > +   /*
> > +* Provides NMI watchdog support via Virtual Wire mode.
> > +* The route is: PIT -> PIC -> LVT0 in NMI mode.
> > +*
> > +* Note: Our Virtual Wire implementation is simplified, only
> > +* propagating PIT interrupts to all VCPUs when they have set
> > +* LVT0 to NMI delivery. Other PIC interrupts are just sent to
> > +* VCPU0, and only if its LVT0 is in EXTINT mode.
> > +*/
> > +   if (kvm->arch.vapics_in_nmi_mode > 0)
> > +   kvm_for_each_vcpu(i, vcpu, kvm)
> > +   kvm_apic_nmi_wd_deliver(vcpu);
> > +   }
> > +}
> > +
> > +static enum hrtimer_restart pit_timer_fn(struct hrtimer *data)
> > +{
> > +   struct kvm_timer *ktimer = container_of(data, struct kvm_timer, timer);
> > +   struct kvm_pit *pt = ktimer->kvm->arch.vpit;
> > +
> > +   queue_work(pt->wq, &pt->expired);
> 
> So this disables interrupt reinjection. Older RHEL3 guests do not
> compensate for lost ticks, and as such are likely to drift without
> it (but RHEL3 is EOL, should one care?).
> 
> Are there other guests which rely on PIT reinjection, or is it OK
> to remove it completly?

IIRC, the old kernel *does* compensate for lost ticks, so we need to disable
reinjection for it. And the latest kernel doesn't do this, so we have to do
reinjection.

So we can't disable reinjection anyway.
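
To illustrate what reinjection means here, below is a very simplified model in
standalone C. It is not the i8254.c code: struct pit_model and the
pit_model_*() helpers are hypothetical names, and the model only mirrors the
idea of a pending counter plus an irq_ack flag as seen in the quoted patch.
Ticks that fire while the previous interrupt is still unacked are counted and
delivered later, one per ack.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct pit_model {
    int  pending;    /* ticks that fired and have not been acked yet */
    bool irq_ack;    /* the last injected interrupt has been acked */
    int  injected;   /* interrupts delivered to the guest so far */
};

void pit_model_init(struct pit_model *p)
{
    assert(p != NULL);
    p->pending = 0;
    p->irq_ack = true;
    p->injected = 0;
}

/* Inject one interrupt if the previous one was acked and ticks are owed. */
static void pit_model_try_inject(struct pit_model *p)
{
    if (p->irq_ack && p->pending > 0) {
        p->irq_ack = false;
        p->injected++;
    }
}

/* Timer expiry: remember the tick even if it cannot be delivered now. */
void pit_model_tick(struct pit_model *p)
{
    p->pending++;
    pit_model_try_inject(p);
}

/* Guest ack: retire one tick, then reinject if more ticks are owed. */
void pit_model_ack(struct pit_model *p)
{
    if (p->pending > 0)
        p->pending--;
    p->irq_ack = true;
    pit_model_try_inject(p);
}
```

With three ticks arriving before the first ack, this model still delivers
three interrupts in total; without the pending counter the guest would see
only one and would have to compensate for the lost ticks itself.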

BTW: The patch has some coding style issues; I suggest using
scripts/checkpatch.pl to check it.

--
regards
Yang, Sheng

> 
> > +
> > +   if (ktimer->t_ops->is_periodic(ktimer)) {
> > +   hrtimer_add_expires_ns(&ktimer->timer, ktimer->period);
> > +   return HRTIMER_RESTART;
> > +   }
> > +   else
> > +   return HRTIMER_NORESTART;
> > +}
> 
> Also need to cancel the pending work whenever the current code cancels
> the hrtimer (destroy_pit_timer, etc).

Re: [PATCH] KVM: x86: Propagate fpu_alloc errors

2010-05-25 Thread Yang, Sheng

On Tuesday 25 May 2010 22:01:50 Jan Kiszka wrote:
> Memory allocation may fail. Propagate such errors.
> 
> Signed-off-by: Jan Kiszka 
> ---

Reviewed-by: Sheng Yang 

--
regards
Yang, Sheng

>  arch/x86/include/asm/kvm_host.h |2 +-
>  arch/x86/kvm/svm.c  |7 ++-
>  arch/x86/kvm/vmx.c  |4 +++-
>  arch/x86/kvm/x86.c  |   11 +--
>  4 files changed, 19 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index d08bb4a..0cd0f29 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -624,7 +624,7 @@ int kvm_pic_set_irq(void *opaque, int irq, int level);
> 
>  void kvm_inject_nmi(struct kvm_vcpu *vcpu);
> 
> -void fx_init(struct kvm_vcpu *vcpu);
> +int fx_init(struct kvm_vcpu *vcpu);
> 
>  void kvm_mmu_flush_tlb(struct kvm_vcpu *vcpu);
>  void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index 4af2c12..5f25e59 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -903,13 +903,18 @@ static struct kvm_vcpu *svm_create_vcpu(struct kvm *kvm, unsigned int id)
>   svm->asid_generation = 0;
>   init_vmcb(svm);
> 
> - fx_init(&svm->vcpu);
> + err = fx_init(&svm->vcpu);
> + if (err)
> + goto free_page4;
> +
>   svm->vcpu.arch.apic_base = 0xfee00000 | MSR_IA32_APICBASE_ENABLE;
>   if (kvm_vcpu_is_bsp(&svm->vcpu))
>   svm->vcpu.arch.apic_base |= MSR_IA32_APICBASE_BSP;
> 
>   return &svm->vcpu;
> 
> +free_page4:
> + __free_page(hsave_page);
>  free_page3:
>   __free_pages(nested_msrpm_pages, MSRPM_ALLOC_ORDER);
>  free_page2:
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 99ae513..61bdae3 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -2661,7 +2661,9 @@ static int vmx_vcpu_reset(struct kvm_vcpu *vcpu)
>   msr |= MSR_IA32_APICBASE_BSP;
>   kvm_set_apic_base(&vmx->vcpu, msr);
> 
> - fx_init(&vmx->vcpu);
> + ret = fx_init(&vmx->vcpu);
> + if (ret != 0)
> + goto out;
> 
>   seg_setup(VCPU_SREG_CS);
>   /*
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 7be1d36..e773d93 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5113,12 +5113,19 @@ int kvm_arch_vcpu_ioctl_set_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
>   return 0;
>  }
> 
> -void fx_init(struct kvm_vcpu *vcpu)
> +int fx_init(struct kvm_vcpu *vcpu)
>  {
> - fpu_alloc(&vcpu->arch.guest_fpu);
> + int err;
> +
> + err = fpu_alloc(&vcpu->arch.guest_fpu);
> + if (err)
> + return err;
> +
>   fpu_finit(&vcpu->arch.guest_fpu);
> 
>   vcpu->arch.cr0 |= X86_CR0_ET;
> +
> + return 0;
>  }
>  EXPORT_SYMBOL_GPL(fx_init);

Re: [UNTESTED] KVM: do not call kvm_set_irq from irq disabled section

2010-04-21 Thread Yang, Sheng

On Wednesday 21 April 2010 05:49:11 Bonenkamp, Ralf wrote:
> Hi Marcelo,
> 
> Thanks for the patch.
> I put it into my kernel source tree and tested the freshly built kernel in
>  my testing environment. The problem is now - almost - gone. The only
>  suspicious message (ONE occurrence immediately after starting the Server
>  2008 R2 VM) in syslog is now:
> 
> BUG: scheduling while atomic: qemu/3674/0x0002
> Modules linked in: tun bridge stp llc ext3 jbd uhci_hcd
>  snd_hda_codec_realtek snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq
>  snd_seq_device snd_pcm_oss snd_mixer_oss snd_hda_intel snd_hda_codec
>  snd_hwdep snd_pcm snd_timer snd soundcore snd_page_alloc i915
>  drm_kms_helper drm i2c_algo_bit video output i2c_i801 i2c_core
>  ide_pci_generic ide_core button intel_agp ehci_hcd thermal e1000e iTCO_wdt
>  iTCO_vendor_support processor usbcore kvm_intel evdev sg psmouse pcspkr
>  serio_raw kvm rtc_cmos rtc_core rtc_lib ext4 mbcache jbd2 crc16 dm_mod
>  sr_mod cdrom sd_mod ata_generic pata_acpi ahci libata scsi_mod
> Pid: 3674, comm: qemu Not tainted 2.6.33.2-KVM_patch #4
> Call Trace:
>  [] ? schedule+0x544/0xa60
>  [] ? mmu_zap_unsync_children+0x17d/0x200 [kvm]
>  [] ? __mutex_lock_slowpath+0x162/0x350
>  [] ? mutex_lock+0x9/0x20
>  [] ? kvm_ioapic_set_irq+0x47/0x160 [kvm]
>  [] ? kvm_set_irq+0xf4/0x190 [kvm]
>  [] ? kvm_set_ioapic_irq+0x0/0x50 [kvm]
>  [] ? kvm_set_pic_irq+0x0/0x50 [kvm]
>  [] ? paging64_walk_addr+0x25f/0x750 [kvm]
>  [] ? __mutex_lock_slowpath+0x25d/0x350
>  [] ? kvm_assigned_dev_ack_irq+0x35/0x90 [kvm]
>  [] ? kvm_notify_acked_irq+0x71/0x120 [kvm]

It seems that this time the cause is kvm_notify_acked_irq(), which runs the ack
notifiers under RCU.

-- 
regards
Yang, Sheng

>  [] ? kvm_ioapic_update_eoi+0x73/0xd0 [kvm]
>  [] ? apic_reg_write+0x569/0x700 [kvm]
>  [] ? apic_mmio_write+0x69/0x70 [kvm]
>  [] ? emulator_write_emulated_onepage+0xac/0x1b0 [kvm]
>  [] ? x86_emulate_insn+0x1d25/0x4d20 [kvm]
>  [] ? x86_decode_insn+0x96e/0xba0 [kvm]
>  [] ? kvm_mmu_unprotect_page_virt+0xec/0x100 [kvm]
>  [] ? emulate_instruction+0xd3/0x380 [kvm]
>  [] ? __down_read+0xce/0xd0
>  [] ? apic_update_ppr+0x29/0x70 [kvm]
>  [] ? handle_apic_access+0x18/0x40 [kvm_intel]
>  [] ? kvm_arch_vcpu_ioctl_run+0x3ad/0xcf0 [kvm]
>  [] ? wake_futex+0x37/0x70
>  [] ? kvm_vcpu_ioctl+0x53c/0x910 [kvm]
>  [] ? __switch_to_xtra+0x163/0x1a0
>  [] ? __switch_to+0x271/0x340
>  [] ? vfs_ioctl+0x35/0xd0
>  [] ? do_vfs_ioctl+0x88/0x570
>  [] ? schedule+0x2f9/0xa60
>  [] ? sys_ioctl+0x80/0xa0
>  [] ? fire_user_return_notifiers+0x3a/0x70
>  [] ? system_call_fastpath+0x16/0x1b
> kvm: emulating exchange as write
> 
> Honestly my knowledge of the kvm internals is not sufficient to decide if
>  this bug report still belongs to my problem or is something different. So
>  if you need more information or additional testing please let me know..
> 
> Best regards
> Ralf Bonenkamp
> 
> -Original Message-
> From: Marcelo Tosatti [mailto:mtosa...@redhat.com]
> Sent: Tuesday, 20 April 2010 17:54
> To: kvm
> Cc: Ralf Bonenkamp; Chris Wright; Yang, Sheng
> Subject: *** GMX spam suspicion *** [UNTESTED] KVM: do not call kvm_set_irq
>  from irq disabled section
> 
> 
> The assigned device interrupt work handler calls kvm_set_irq, which
> can sleep, for example, waiting for the ioapic mutex, from irq disabled
> section.
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=15725
> 
> Fix by dropping assigned_dev_lock (and re-enabling interrupts)
> before invoking kvm_set_irq for the KVM_DEV_IRQ_HOST_MSIX case. Other
> cases do not require the lock or interrupts disabled (a new work
> instance will be queued in case of concurrent interrupt).
> 
> KVM-Stable-Tag.
> Signed-off-by: Marcelo Tosatti 
> 

Re: [UNTESTED] KVM: do not call kvm_set_irq from irq disabled section

2010-04-21 Thread Yang, Sheng

On Tuesday 20 April 2010 23:54:01 Marcelo Tosatti wrote:
> The assigned device interrupt work handler calls kvm_set_irq, which
> can sleep, for example, waiting for the ioapic mutex, from irq disabled
> section.
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=15725
> 
> Fix by dropping assigned_dev_lock (and re-enabling interrupts)
> before invoking kvm_set_irq for the KVM_DEV_IRQ_HOST_MSIX case. Other
> cases do not require the lock or interrupts disabled (a new work
> instance will be queued in case of concurrent interrupt).

Looks fine, but relying on a new work instance being queued sounds a little 
tricky...

How about a local_irq_disable() at the beginning? I think that would also 
ensure that no concurrent interrupt can happen.
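
To make the concern concrete, here is a minimal userspace sketch of the
pattern the patch uses: claim an entry under the lock, drop the lock around
the call that may sleep, then re-take it before looking at the next entry.
Everything here is a stand-in and an assumption for illustration: toy_lock
replaces assigned_dev_lock, deliver_entry() replaces kvm_set_irq(), and
NR_ENTRIES is arbitrary.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_ENTRIES 4   /* arbitrary size for the sketch */

/* Toy spinlock standing in for assigned_dev_lock (hypothetical). */
struct toy_lock {
    atomic_flag held;
};

static void toy_lock_acquire(struct toy_lock *l)
{
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ; /* spin until the flag was previously clear */
}

static void toy_lock_release(struct toy_lock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

struct pending_table {
    struct toy_lock lock;
    bool pending[NR_ENTRIES];   /* set by the "interrupt" side */
    int delivered[NR_ENTRIES];  /* how often each entry was delivered */
};

static void pending_table_init(struct pending_table *t)
{
    atomic_flag_clear(&t->lock.held);
    for (size_t i = 0; i < NR_ENTRIES; i++) {
        t->pending[i] = false;
        t->delivered[i] = 0;
    }
}

/* Stand-in for kvm_set_irq(): may sleep, so it must run unlocked. */
static void deliver_entry(struct pending_table *t, size_t i)
{
    t->delivered[i]++;
}

/* Deliver every pending entry once; returns the number delivered. */
static int drain_pending(struct pending_table *t)
{
    int count = 0;

    toy_lock_acquire(&t->lock);
    for (size_t i = 0; i < NR_ENTRIES; i++) {
        if (!t->pending[i])
            continue;
        t->pending[i] = false;        /* claim the entry under the lock */
        toy_lock_release(&t->lock);   /* drop the lock around the call */
        deliver_entry(t, i);
        toy_lock_acquire(&t->lock);   /* re-take before reading flags again */
        count++;
    }
    toy_lock_release(&t->lock);
    return count;
}
```

Note that an entry with an index below i that becomes pending while the lock
is dropped is not seen by this pass. The patch comment covers that case by
relying on a new work instance being queued, which is the part that looks
tricky.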

> 
> KVM-Stable-Tag.
> Signed-off-by: Marcelo Tosatti 
> 
> diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
> index 47ca447..7ac7bbe 100644
> --- a/virt/kvm/assigned-dev.c
> +++ b/virt/kvm/assigned-dev.c
> @@ -64,24 +64,33 @@ static void
>  kvm_assigned_dev_interrupt_work_handler(struct work_struct *work)
>  interrupt_work);
>   kvm = assigned_dev->kvm;
> 
> - spin_lock_irq(&assigned_dev->assigned_dev_lock);
>   if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX) {
>   struct kvm_guest_msix_entry *guest_entries =
>   assigned_dev->guest_msix_entries;

irq_requested_type and guest_msix_entries should also be protected by the 
lock. So how about wrapping the second kvm_set_irq() in another 
spin_lock()/unlock() pair?

> +
> + spin_lock_irq(&assigned_dev->assigned_dev_lock);
>   for (i = 0; i < assigned_dev->entries_nr; i++) {
>   if (!(guest_entries[i].flags &
>   KVM_ASSIGNED_MSIX_PENDING))
>   continue;
>   guest_entries[i].flags &= ~KVM_ASSIGNED_MSIX_PENDING;
> + /*
> +  * If kvm_assigned_dev_intr sets pending for an
> +  * entry smaller than this work instance is
> +  * currently processing, a new work instance
> +  * will be queued.
> +  */
> + spin_unlock_irq(&assigned_dev->assigned_dev_lock);
>   kvm_set_irq(assigned_dev->kvm,
>   assigned_dev->irq_source_id,
>   guest_entries[i].vector, 1);
> + spin_lock_irq(&assigned_dev->assigned_dev_lock);
>   }
> + spin_unlock_irq(&assigned_dev->assigned_dev_lock);
>   } else
>   kvm_set_irq(assigned_dev->kvm, assigned_dev->irq_source_id,
>   assigned_dev->guest_irq, 1);

Or could we make kvm_set_irq() atomic? Though the code path is a little long 
to run under a spinlock.

> 
> - spin_unlock_irq(&assigned_dev->assigned_dev_lock);
>  }
> 
>  static irqreturn_t kvm_assigned_dev_intr(int irq, void *dev_id)

-- 
regards
Yang, Sheng

Re: KVM: VMX: move CR3/PDPTR update to vmx_set_cr3

2009-10-26 Thread Yang, Sheng

On Tuesday 27 October 2009 02:48:33 Marcelo Tosatti wrote:
> GUEST_CR3 is updated via kvm_set_cr3 whenever CR3 is modified from
> outside guest context. Similarly pdptrs are updated via load_pdptrs.
>
> Let kvm_set_cr3 perform the update, removing it from the vcpu_run
> fast path.

Looks fine to me.

Acked-by: Sheng Yang 

-- 
regards
Yang, Sheng
>
> Signed-off-by: Marcelo Tosatti 
>
> Index: b/arch/x86/kvm/vmx.c
> ===
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -1748,6 +1748,7 @@ static void vmx_set_cr3(struct kvm_vcpu
>   vmcs_write64(EPT_POINTER, eptp);
>   guest_cr3 = is_paging(vcpu) ? vcpu->arch.cr3 :
>   vcpu->kvm->arch.ept_identity_map_addr;
> + ept_load_pdptrs(vcpu);
>   }
>
>   vmx_flush_tlb(vcpu);
> @@ -3638,10 +3639,6 @@ static void vmx_vcpu_run(struct kvm_vcpu
>  {
>   struct vcpu_vmx *vmx = to_vmx(vcpu);
>
> - if (enable_ept && is_paging(vcpu)) {
> - vmcs_writel(GUEST_CR3, vcpu->arch.cr3);
> - ept_load_pdptrs(vcpu);
> - }
>   /* Record the guest's net vcpu time for enforced NMI injections. */
>   if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked))
>   vmx->entry_time = ktime_get();
> Index: b/arch/x86/kvm/x86.c
> ===
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -4517,8 +4517,10 @@ int kvm_arch_vcpu_ioctl_set_sregs(struct
>
>   mmu_reset_needed |= vcpu->arch.cr4 != sregs->cr4;
>   kvm_x86_ops->set_cr4(vcpu, sregs->cr4);
> - if (!is_long_mode(vcpu) && is_pae(vcpu))
> + if (!is_long_mode(vcpu) && is_pae(vcpu)) {
>   load_pdptrs(vcpu, vcpu->arch.cr3);
> + mmu_reset_needed = 1;
> + }
>
>   if (mmu_reset_needed)
>   kvm_mmu_reset_context(vcpu);


Re: cpuinfo and HVM features (was: Host latency peaks due to kvm-intel)

2009-07-27 Thread Yang, Sheng

On Monday 27 July 2009 17:08:42 Jan Kiszka wrote:
> [ carrying this to LKML ]
>
> Yang, Sheng wrote:
> > On Monday 27 July 2009 03:16:27 H. Peter Anvin wrote:
> >> Jan Kiszka wrote:
> >>> Avi Kivity wrote:
> >>>> On 07/24/2009 12:41 PM, Jan Kiszka wrote:
> >>>>> I vaguely recall that someone promised to add a feature reporting
> >>>>> facility for all those nice things, modern VM-extensions may or may
> >>>>> not support (something like or even an extension of /proc/cpuinfo).
> >>>>> What is the state of this plan? Would be specifically interesting for
> >>>>> Intel CPUs as there seem to be many of them out there with
> >>>>> restrictions for special use cases - like real-time.
> >>>>
> >>>> Newer kernels do report some vmx features (like flexpriority) in
> >>>> /proc/cpuinfo but not all.
> >>>
> >>> Ah, nice. Then we just need this?
> >>
> >> Fine with me.
> >>
> >> Acked-by: H. Peter Anvin 
> >>
> >> However, I guess the real question if we shouldn't export ALL VMX
> >> features in a consistent way instead?
> >
> > When I add feature reporting to cpuinfo, I just put highlight features
> > there, otherwise the VMX feature list would at least as long as CPU one.
>
> That could become true. But the question is always what the highlights
> are. Often this depends on the hypervisor as it may implement
> workarounds for missing features differently (or not at all). So I'm
> also for exposing feature information consistently.

(CC Andi and Ingo)

By highlight I mean the features that gain us a lot, like FlexPriority, EPT, 
and VPID. They can be vendor specific. And I am talking about hardware 
capability here, so whatever workaround a hypervisor implements is out of scope.
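
As an illustration of how a userspace tool would consume such highlight
flags, here is a small sketch that checks one flags line from /proc/cpuinfo
for a whole-word flag. The helper name cpuinfo_has_flag() is made up for this
example, and the flag spellings used with it ("ept", "vpid", "flexpriority")
are assumptions about what the kernel prints.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Return true if `name` appears as a whole whitespace-separated token in
 * `flags_line` (one "flags : ..." line as read from /proc/cpuinfo).
 * Hypothetical helper, not an existing API.
 */
static bool cpuinfo_has_flag(const char *flags_line, const char *name)
{
    size_t len = strlen(name);
    const char *p = flags_line;

    if (len == 0)
        return false;

    while ((p = strstr(p, name)) != NULL) {
        bool starts_token = (p == flags_line) || p[-1] == ' ' || p[-1] == '\t';
        bool ends_token = p[len] == '\0' || p[len] == ' ' || p[len] == '\n';

        if (starts_token && ends_token)
            return true;
        p++;   /* substring of a longer flag: keep searching */
    }
    return false;
}
```

The token check matters because a plain substring search would report "ept"
for any longer flag name that happens to contain it.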
>
> > I have also suggested another field for virtualization feature for it,
> > but some concern again userspace tools raised.
> >
> > For we got indeed quite a lot features, and would get more, would it
> > better to export the part of struct vmcs_config entries(that's
> > pin_based_exec_ctrl, cpu_based_exec_ctrl, and cpu_based_2nd_exec_ctrl)
> > through
> > sys/module/kvm_intel/? Put every feature to cpuinfo seems not that
> > necessary for such a big list.
>
> I don't think this information should only come from KVM. Consider you
> didn't build it into some kernel but still want to find out what your
> system is able to provide.

Yes, agree.
>
> What about adding some dedicated /proc entry for CPU virtualization
> features, say /proc/hvminfo?

Well, compared to this, I would still prefer a new item in /proc/cpuinfo, since 
these are still CPU features, like Andi did for power management (IIRC).

Is there any more preferred location?

-- 
regards
Yang, Sheng

Re: Host latency peaks due to kvm-intel

2009-07-26 Thread Yang, Sheng

On Monday 27 July 2009 03:16:27 H. Peter Anvin wrote:
> Jan Kiszka wrote:
> > Avi Kivity wrote:
> >> On 07/24/2009 12:41 PM, Jan Kiszka wrote:
> >>> I vaguely recall that someone promised to add a feature reporting
> >>> facility for all those nice things, modern VM-extensions may or may not
> >>> support (something like or even an extension of /proc/cpuinfo). What is
> >>> the state of this plan? Would be specifically interesting for Intel
> >>> CPUs as there seem to be many of them out there with restrictions for
> >>> special use cases - like real-time.
> >>
> >> Newer kernels do report some vmx features (like flexpriority) in
> >> /proc/cpuinfo but not all.
> >
> > Ah, nice. Then we just need this?
>
> Fine with me.
>
> Acked-by: H. Peter Anvin 
>
> However, I guess the real question if we shouldn't export ALL VMX
> features in a consistent way instead?
>
When I added feature reporting to cpuinfo, I only put the highlight features 
there; otherwise the VMX feature list would be at least as long as the CPU one.

I have also suggested adding another field for virtualization features, but 
some concerns about userspace tools were raised.

Since we indeed have quite a lot of features, and will get more, would it be 
better to export part of the struct vmcs_config entries (that is, 
pin_based_exec_ctrl, cpu_based_exec_ctrl, and cpu_based_2nd_exec_ctrl) through 
/sys/module/kvm_intel/? Putting every feature into cpuinfo does not seem 
necessary for such a big list.

-- 
regards
Yang, Sheng

ioctl number overlapped?

2009-07-20 Thread Yang, Sheng

I happened to see this:

include/linux/kvm.h

#define KVM_IRQ_LINE_STATUS   _IOWR(KVMIO, 0x67, struct kvm_irq_level)
#define KVM_REGISTER_COALESCED_MMIO \
        _IOW(KVMIO,  0x67, struct kvm_coalesced_mmio_zone)

Both ioctls use 0x67, and the code has already been released in v2.6.30...
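
Whether this is a real collision depends on how the full request value is
built. As a rough userspace sketch, assuming the usual Linux _IOC encoding
(direction, argument size, type and number packed into one word), the two
requests still come out different even though both use 0x67. The two structs
below are stand-ins with the sizes I would expect, not the definitions from
the KVM header, and KVMIO_SKETCH is assumed to be 0xAE.

```c
#include <assert.h>
#include <stdint.h>
#include <linux/ioctl.h>   /* _IOW, _IOWR, _IOC_NR, _IOC_DIR */

#define KVMIO_SKETCH 0xAE  /* assumed value of KVMIO */

/* Stand-in argument types (hypothetical layouts, used only for their size). */
struct irq_level_sketch {
    uint32_t irq_or_status;
    uint32_t level;
};

struct coalesced_zone_sketch {
    uint64_t addr;
    uint32_t size;
    uint32_t pad;
};

#define IRQ_LINE_STATUS_SKETCH \
    _IOWR(KVMIO_SKETCH, 0x67, struct irq_level_sketch)
#define REGISTER_COALESCED_SKETCH \
    _IOW(KVMIO_SKETCH, 0x67, struct coalesced_zone_sketch)

/*
 * Return 1 when two requests share a command number but are still
 * distinguishable as full request values, 0 otherwise.
 */
static int same_nr_but_distinct(unsigned long a, unsigned long b)
{
    return _IOC_NR(a) == _IOC_NR(b) && a != b;
}
```

If that assumption about the encoding holds, a switch on the full request
value can still tell the two apart, so the shared 0x67 is untidy rather than
necessarily ambiguous.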

-- 
regards
Yang, Sheng


Re: [PATCH 3/3] Update VMX_EPT_IDENTITY_PAGETABLE_ADDR to synchronize with kernel code.

2009-07-16 Thread Yang, Sheng

On Friday 17 July 2009 03:14:54 Marcelo Tosatti wrote:
> On Thu, Jul 16, 2009 at 11:48:46AM -0700, Jordan Justen wrote:
> > On Thu, 2009-07-16 at 11:18 -0700, Marcelo Tosatti wrote:
> > > On Thu, Jul 16, 2009 at 11:02:22AM -0700, Jordan Justen wrote:
> > > > Although VMX_EPT_IDENTITY_PAGETABLE_ADDR does not appear to be used
> > > > within qemu-kvm, this change mirrors a similar change in the kernel
> > > > kvm code.
> > > >
> > > > The purpose is to move the KVM 'EPT Identity Pages' from:
> > > >   0xfffbc000-0xfffbcfff
> > > > to:
> > > >   0xfeffc000-0xfeffcfff
> > > >
> > > > > The step is required to free up the 0xff000000-0xffffffff (16MB)
> > > > > range for use with bios.bin.
> > > >
> > > > The KVM kernel change depends upon a change to kvm/bios/rombios.c so
> > > > the bios INT15-E820 function will properly reserve the new location.
> > > >
> > > > Signed-off-by: Jordan Justen 
> > > > ---
> > > >  kvm/include/x86/asm/vmx.h |2 +-
> > > >  1 files changed, 1 insertions(+), 1 deletions(-)
> > > >
> > > > diff --git a/kvm/include/x86/asm/vmx.h b/kvm/include/x86/asm/vmx.h
> > > > index df8d4f9..99e2bb9 100644
> > > > --- a/kvm/include/x86/asm/vmx.h
> > > > +++ b/kvm/include/x86/asm/vmx.h
> > > > @@ -403,7 +403,7 @@ enum vmcs_field {
> > > > >  #define VMX_EPT_EXECUTABLE_MASK 0x4ull
> > > > >  #define VMX_EPT_IGMT_BIT        (1ull << 6)
> > > > >
> > > > > -#define VMX_EPT_IDENTITY_PAGETABLE_ADDR 0xfffbc000ul
> > > > > +#define VMX_EPT_IDENTITY_PAGETABLE_ADDR 0xfeffc000ul
> > >
> > > Won't this conflict with an older BIOS? (the e820 reserved entry on
> > > older qemu-kvm+bios will not cover the EPT identity table on kernels
> > > with this patch).
> > >
> > > Perhaps add a new ioctl (similar to the tss one) to so userspace can
> > > set the address?
> >
> > I am not very familiar with the reasons why the EPT identity
> > page-table setup is happening within the kernel.
> >
> > As it stands, there is the shared knowledge that the qemu-kvm bios
> > just happens to know that the kvm kernel code has reserved a
> > particular page of the address space.
> >
> > It would be much easier to coordinate all the pieces if it were
> > all setup on the qemu-kvm side.
> >
> > Is this possible?
>
> It is possible but all of the EPT implementation is in the kernel, so it
> does not make much sense to have the details of the identity table in
> qemu-kvm.
>
> The address of it though can be controlled by qemu-kvm.
>
> Sheng?

We put the identity map into kernel space because we want older versions of 
QEmu to work with EPT as well.

Yes, we need a new CAP for the address. I will do it soon.
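
For reference, the effect of the address move can be checked with simple
arithmetic. The sketch below assumes the 16MB range wanted for bios.bin is
the top 16MB below 4GB and that the identity page table occupies one 4KB
page; the macro and function names are made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE_SKETCH   0x1000ull                      /* one 4KB page */
#define FOUR_GIB           0x100000000ull
#define BIOS_WINDOW_SIZE   (16ull << 20)                  /* 16MB */
#define BIOS_WINDOW_START  (FOUR_GIB - BIOS_WINDOW_SIZE)  /* 0xff000000 */

/* Return true if the page starting at page_addr overlaps the BIOS window. */
static bool page_overlaps_bios_window(uint64_t page_addr)
{
    uint64_t page_end = page_addr + PAGE_SIZE_SKETCH;     /* exclusive end */

    return page_end > BIOS_WINDOW_START && page_addr < FOUR_GIB;
}
```

Under these assumptions the old address 0xfffbc000 lies inside the window and
the new address 0xfeffc000 lies just below it.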

-- 
regards
Yang, Sheng

Re: [PATCH] kvm: device-assignment: Add PCI option ROM support

2009-06-22 Thread Yang, Sheng

On Tuesday 23 June 2009 00:09:28 Alex Williamson wrote:
> On Mon, 2009-06-22 at 13:32 +0800, Yang, Sheng wrote:
> > On Friday 19 June 2009 21:44:40 Alex Williamson wrote:
> > > On Fri, 2009-06-19 at 15:27 +0800, Yang, Sheng wrote:
> > > > On Friday 19 June 2009 00:28:41 Alex Williamson wrote:
> > > > > The one oddity I noticed is that even when the enable bit is clear,
> > > > > the guest can read the ROM.  I don't know that this is actually
> > > > > illegal, vs returning zeros or ones though.  It seems like maybe
> > > > > the generic PCI code isn't tracking the enable bit.  I think that's
> > > > > an independent problem from this patch though.  Thanks,
> > > >
> > > > That should be fine. I've taken a look at code, seems Linux kernel
> > > > set enable_bit when someone begin to read rom, and copy rom to
> > > > buffer, then unmap the rom. So the rom can be read when enable bit
> > > > clear.
> > >
> > > For this testing, I used an mmap of the ROM address though, so the
> > > kernel caching shouldn't have been involved.  It looks to me like the
> > > problem is that the map function provided via pci_register_io_region()
> > > only knows how to create mappings, not tear them down.  I think maybe
> > > pci_update_mappings() should still call the map_func when new_addr is
> > > -1 to let the io space drive shutdown the mapping.  As it is, once we
> > > setup the mapping, it lives until something else happens to overlap it,
> > > regardless of the state of the PCI BAR.  Thanks,
> >
> > I think it may not necessary to tear them down, for the bar mapping won't
> > change IIUR.
>
> We can't guarantee that, the OS can move them if it understands the
> resources available to the PCI bus.  It typically doesn't move them
> though.
>
> > And you are accessing the sysfs file, right? In the Linux kernel, IIRC,
> > pci_create_sysfs_dev_files() create sysfs file, and hook the read to
> > pci_read_rom(), which called pci_map_rom(), which would call
> > pci_enable_rom(), and write the enable_rom bit to the rom_base_reg. So
> > that the rom can be read regardless of enable_rom bit state - and .
> >
> > But I also found something interested. The write hook of file,
> > pci_write_rom() seems won't cause NMI(and seems you need write a char
> > rather than 0 to enable the accessing?). So why NMI happen in host?...
>
> As I mentioned, I'm not using the /sys files to write to the ROM
> precisely because the rom write function is only to enable/disable the
> ROM BAR.  I'm using setpci to manually enable the ROM, then I use the
> test program below to mmap the ROM address from /dev/mem, read part of
> it, try to write the first few bytes, then read it back.  You'll need to
> change the hard coded address if you want to test yourself.  Obviously
> don't do it on a system in use by others since it will likely take it
> down.  Thanks,

Oh, yes. Sorry, I completely missed the method you used... Yeah, with this 
method the ROM shouldn't be presented to the guest. And you are right, the PCI 
mapping only works in one direction. I think we can fix it in QEmu upstream.

-- 
regards
Yang, Sheng

>
> Alex
>
> #include <stdio.h>
> #include <string.h>
> #include <errno.h>
> #include <fcntl.h>
> #include <unistd.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <sys/mman.h>
>
> #define DEV_MEM   "/dev/mem"
> #define ROM_ADDR  0xe630
>
> int main(void)
> {
>   unsigned char *map;
>   int i, fd = open(DEV_MEM, O_RDWR);
>
>   if (fd == -1) {
>   printf("Failed to open /dev/mem: %s\n", strerror(errno));
>   return -1;
>   }
>
>   map = mmap(NULL, getpagesize(), PROT_READ|PROT_WRITE,
>  MAP_SHARED, fd, ROM_ADDR);
>
>   if (map == MAP_FAILED) {
>   printf("Failed to mmap /dev/mem: %s\n", strerror(errno));
>   close(fd);
>   return -1;
>   }
>
>   for (i = 0; i < 64;) {
>   printf("%02x", map[i++]);
>   if (i % 16 == 0)
>   printf("\n");
>   else if (i % 4 == 0)
>   printf("  ");
>   else
>   printf(" ");
>   }
>
>   printf("Writing...");
>   map[0] = 0xba;
>   map[1] = 0xdb;
>   map[2] = 0xad;
>   map[3] = 0xc0;
>   map[4] = 0xff;
>   map[5] = 0xee;
>   printf("done\n");
>
>   for (i = 0; i < 64;) {
>   printf("%02x", map[i++]);
>   if (i % 16 == 0)
>   printf("\n");
>   else if (i % 4 == 0)
>   printf("  ");
>   else
>   printf(" ");
>   }
>
>   munmap(map, getpagesize());
>   close(fd);
>   return 0;
> }



Re: [PATCH] kvm: device-assignment: Add PCI option ROM support

2009-06-21 Thread Yang, Sheng

On Friday 19 June 2009 21:44:40 Alex Williamson wrote:
> On Fri, 2009-06-19 at 15:27 +0800, Yang, Sheng wrote:
> > On Friday 19 June 2009 00:28:41 Alex Williamson wrote:
> > > The one oddity I noticed is that even when the enable bit is clear, the
> > > guest can read the ROM.  I don't know that this is actually illegal, vs
> > > returning zeros or ones though.  It seems like maybe the generic PCI
> > > code isn't tracking the enable bit.  I think that's an independent
> > > problem from this patch though.  Thanks,
> >
> > That should be fine. I've taken a look at code, seems Linux kernel set
> > enable_bit when someone begin to read rom, and copy rom to buffer, then
> > unmap the rom. So the rom can be read when enable bit clear.
>
> For this testing, I used an mmap of the ROM address though, so the
> kernel caching shouldn't have been involved.  It looks to me like the
> problem is that the map function provided via pci_register_io_region()
> only knows how to create mappings, not tear them down.  I think maybe
> pci_update_mappings() should still call the map_func when new_addr is -1
> to let the io space drive shutdown the mapping.  As it is, once we setup
> the mapping, it lives until something else happens to overlap it,
> regardless of the state of the PCI BAR.  Thanks,

I think it may not be necessary to tear them down, because the BAR mapping 
won't change, IIUR.

And you are accessing the sysfs file, right? In the Linux kernel, IIRC, 
pci_create_sysfs_dev_files() creates the sysfs file and hooks its read to 
pci_read_rom(), which calls pci_map_rom(), which in turn calls 
pci_enable_rom() and writes the enable_rom bit to the rom_base_reg. So the ROM 
can be read regardless of the enable_rom bit state.

But I also found something interesting. The write hook of the file, 
pci_write_rom(), doesn't seem to cause an NMI (and it seems you need to write 
a character other than 0 to enable access?). So why does the NMI happen on the 
host?...

-- 
regards
Yang, Sheng

Re: [PATCH] kvm: device-assignment: Add PCI option ROM support

2009-06-19 Thread Yang, Sheng

On Friday 19 June 2009 00:28:41 Alex Williamson wrote:
> On Thu, 2009-06-18 at 13:30 +0800, Yang, Sheng wrote:
> > On Tuesday 16 June 2009 00:29:17 Alex Williamson wrote:
> > > The PCI code already knows about option ROMs, so we just need to
> > > mmap some space for it, load it with a copy of the contents, and
> > > complete the plumbing to the generic code.
> > >
> > > With this a Linux guest can get access to the ROM contents via
> > > /sys/bus/pci/devices//rom.  This might also enable the BIOS
> > > to execute ROMs by loading them dynamically from the device
> > > rather than hoping they all fit into RAM.
> >
> > The patch looks fine. One question: if guest write to the ROM, I think
> > the guest would be killed for QEmu would receive a SIGSEGV? I am not sure
> > if it's too severe...
>
> Hi Sheng,
>
> Good thought.  I tested this with a simple program that mmaps the ROM
> address from /dev/mem and tries to write to it (using setpci to enable
> the ROM).  The results are a little surprising.  On the host, writing to
> the ROM causes an NMI and the system dies.  On the KVM guest, the write
> is happily discarded, neither segfaulting from the mprotect nor
> affecting the contents of the ROM.  So it seems that something above my
> mprotect is discarding the write, and if we did hit it, a qemu segfault
> isn't that far from what happens on bare metal.
>
Oh, that's good. :)

> The one oddity I noticed is that even when the enable bit is clear, the
> guest can read the ROM.  I don't know that this is actually illegal, vs
> returning zeros or ones though.  It seems like maybe the generic PCI
> code isn't tracking the enable bit.  I think that's an independent
> problem from this patch though.  Thanks,

That should be fine. I've taken a look at the code; it seems the Linux kernel 
sets the enable bit when someone begins to read the ROM, copies the ROM to a 
buffer, and then unmaps the ROM. So the ROM can be read while the enable bit 
is clear.
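
For readers following the enable-bit discussion, here is a small sketch of
how the Expansion ROM Base Address register (config offset 0x30, the offset
the patch adds to the emulated set) is laid out: bit 0 is the enable bit and
bits 31:11 hold the base address. The helper names are made up, and returning
0xff for a disabled ROM is only one possible policy, chosen here for
illustration; as noted above, it is not clear what a read must return in that
case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ROM_BAR_OFFSET     0x30u         /* Expansion ROM Base Address reg */
#define ROM_BAR_ENABLE     0x00000001u   /* bit 0: address decode enable */
#define ROM_BAR_ADDR_MASK  0xfffff800u   /* bits 31:11: base address */

static bool rom_bar_enabled(uint32_t bar)
{
    return (bar & ROM_BAR_ENABLE) != 0;
}

static uint32_t rom_bar_base(uint32_t bar)
{
    return bar & ROM_BAR_ADDR_MASK;
}

/*
 * What a device model that honours the enable bit might return for a byte
 * read. Hypothetical policy: all ones while decode is disabled.
 */
static uint8_t rom_read_byte(uint32_t bar, const uint8_t *rom, uint32_t off)
{
    return rom_bar_enabled(bar) ? rom[off] : 0xff;
}
```

A device model that tracks the bit this way would stop exposing the ROM
contents as soon as the guest clears bit 0.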

-- 
regards
Yang, Sheng

>
> Alex



Re: [PATCH] kvm: device-assignment: Add PCI option ROM support

2009-06-17 Thread Yang, Sheng

On Tuesday 16 June 2009 00:29:17 Alex Williamson wrote:
> The PCI code already knows about option ROMs, so we just need to
> mmap some space for it, load it with a copy of the contents, and
> complete the plumbing to the generic code.
>
> With this a Linux guest can get access to the ROM contents via
> /sys/bus/pci/devices//rom.  This might also enable the BIOS
> to execute ROMs by loading them dynamically from the device
> rather than hoping they all fit into RAM.
>
> Signed-off-by: Alex Williamson 

Hi Alex

The patch looks fine. One question: if the guest writes to the ROM, I think 
the guest would be killed, because QEmu would receive a SIGSEGV? I am not sure 
whether that is too severe...

-- 
regards
Yang, Sheng

> ---
>
>  hw/device-assignment.c |   60
>  hw/device-assignment.h |    5 +---
>
> diff --git a/hw/device-assignment.c b/hw/device-assignment.c
> index 65920d0..dfcd670 100644
> --- a/hw/device-assignment.c
> +++ b/hw/device-assignment.c
> @@ -286,8 +286,8 @@ static void assigned_dev_pci_write_config(PCIDevice *d,
> uint32_t address, /* Continue to program the card */
>  }
>
> -if ((address >= 0x10 && address <= 0x24) || address == 0x34 ||
> -address == 0x3c || address == 0x3d ||
> +if ((address >= 0x10 && address <= 0x24) || address == 0x30 ||
> +address == 0x34 || address == 0x3c || address == 0x3d ||
>  pci_access_cap_config(d, address, len)) {
>  /* used for update-mappings (BAR emulation) */
>  pci_default_write_config(d, address, val, len);
> @@ -322,8 +322,8 @@ static uint32_t assigned_dev_pci_read_config(PCIDevice
> *d, uint32_t address, AssignedDevice *pci_dev = container_of(d,
> AssignedDevice, dev);
>
>  if (address < 0x4 || (pci_dev->need_emulate_cmd && address == 0x4) ||
> - (address >= 0x10 && address <= 0x24) || address == 0x34 ||
> -address == 0x3c || address == 0x3d ||
> + (address >= 0x10 && address <= 0x24) || address == 0x30 ||
> +address == 0x34 || address == 0x3c || address == 0x3d ||
>  pci_access_cap_config(d, address, len)) {
>  val = pci_default_read_config(d, address, len);
>  DEBUG("(%x.%x): address=%04x val=0x%08x len=%d\n",
> @@ -384,11 +384,20 @@ static int assigned_dev_register_regions(PCIRegion
> *io_regions,
>
>  /* map physical memory */
>  pci_dev->v_addrs[i].e_physbase = cur_region->base_addr;
> -pci_dev->v_addrs[i].u.r_virtbase =
> -mmap(NULL,
> - (cur_region->size + 0xFFF) & 0xF000,
> - PROT_WRITE | PROT_READ, MAP_SHARED,
> - cur_region->resource_fd, (off_t) 0);
> +if (i == PCI_ROM_SLOT) {
> +pci_dev->v_addrs[i].u.r_virtbase =
> +mmap(NULL,
> + (cur_region->size + 0xFFF) & 0xF000,
> + PROT_WRITE | PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE,
> + 0, (off_t) 0);
> +
> +} else {
> +pci_dev->v_addrs[i].u.r_virtbase =
> +mmap(NULL,
> + (cur_region->size + 0xFFF) & 0xF000,
> + PROT_WRITE | PROT_READ, MAP_SHARED,
> + cur_region->resource_fd, (off_t) 0);
> +}
>
>  if (pci_dev->v_addrs[i].u.r_virtbase == MAP_FAILED) {
>  pci_dev->v_addrs[i].u.r_virtbase = NULL;
> @@ -397,6 +406,14 @@ static int assigned_dev_register_regions(PCIRegion
> *io_regions, (uint32_t) (cur_region->base_addr));
>  return -1;
>  }
> +
> +if (i == PCI_ROM_SLOT) {
> +memset(pci_dev->v_addrs[i].u.r_virtbase, 0,
> +   (cur_region->size + 0xFFF) & 0xF000);
> +mprotect(pci_dev->v_addrs[PCI_ROM_SLOT].u.r_virtbase,
> + (cur_region->size + 0xFFF) & 0xF000,
> + PROT_READ);
> +}
> +
>  pci_dev->v_addrs[i].r_size = cur_region->size;
>  pci_dev->v_addrs[i].e_size = 0;
>
> @@ -468,7 +485,7 @@ again:
>  return 1;
>  }
>
> -for (r = 0; r < MAX_IO_REGIONS; r++) {
> +for (r = 0; r < PCI_NUM_REGIONS; r++) {
>   if (fscanf(f, "%lli %lli %lli\n", &start, &end, &flags) != 3)
>   break;
>
> @@ -480,11 +497,13 @@ again:
>  continue;
>

Re: KVM: x86: verify MTRR/PAT validity

2009-06-17 Thread Yang, Sheng

On Tuesday 16 June 2009 20:05:29 Marcelo Tosatti wrote:
> Do not allow invalid MTRR/PAT values in set_msr_mtrr.
>
> Please review carefully.
>
> Signed-off-by: Marcelo Tosatti 
>
Looks fine to me.

Is it necessary to check the reserved bits of MSR_MTRRdefType and the variable 
MTRRs as well? Maybe like this:

if (msr == MSR_MTRRdefType) {
return valid_mtrr_type(data & ~0xc00ull);
}

And variable ones can be:

#define MTRR_VALID_MASK(v, msr) \
	(~(rsvd_bits(cpuid_max_physaddr(v)) | ((msr % 2) << 11)))

return valid_mtrr_type(data & MTRR_VALID_MASK(vcpu, msr));


(rsvd_bits() is in mmu.c; both snippets are untested.)

Maybe we can store cpuid_max_physaddr as a field in the vcpu struct?
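
The MSR_MTRRdefType suggestion can be tried out in userspace with a sketch
like the one below. It reuses the memory type list from the patch (0, 1, 4,
5, 6) and masks off bits 10 and 11 before the type check, so any other bit
that is set makes the check fail. The names mtrr_deftype_valid() and
DEFTYPE_ENABLE_BITS are made up for the example, and this is only a sketch of
the idea, not tested kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Memory types accepted for MTRRs, same list as mtrr_types in the patch. */
static const unsigned mtrr_types[] = { 0, 1, 4, 5, 6 };

static bool valid_mtrr_type(uint64_t type)
{
    for (size_t i = 0; i < sizeof(mtrr_types) / sizeof(mtrr_types[0]); i++)
        if (type == mtrr_types[i])
            return true;
    return false;
}

/* Bits 10 and 11 of MTRRdefType are the enable bits (mask 0xc00). */
#define DEFTYPE_ENABLE_BITS 0xc00ull

static bool mtrr_deftype_valid(uint64_t data)
{
    /*
     * With the enable bits masked off, only the type field may remain.
     * Any reserved bit left set makes the value differ from every valid
     * type, so the check rejects it.
     */
    return valid_mtrr_type(data & ~DEFTYPE_ENABLE_BITS);
}
```

One detail worth noting: the helper here takes a 64-bit value, so reserved
bits above bit 31 are also caught. valid_mt() in the patch takes an unsigned
type argument, so the same expression there would presumably be truncated
first.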

-- 
regards
Yang, Sheng

>
> Index: kvm/arch/x86/kvm/x86.c
> ===
> --- kvm.orig/arch/x86/kvm/x86.c
> +++ kvm/arch/x86/kvm/x86.c
> @@ -722,11 +722,53 @@ static bool msr_mtrr_valid(unsigned msr)
>   return false;
>  }
>
> +static unsigned mtrr_types[] = {0, 1, 4, 5, 6};
> +static unsigned pat_types[] = {0, 1, 4, 5, 6, 7};
> +
> +static bool valid_mt(unsigned type, int len, unsigned array[len])
> +{
> + int i;
> +
> + for (i = 0; i < len; i++)
> + if (type == array[i])
> + return true;
> +
> + return false;
> +}
> +
> +#define valid_pat_type(a) valid_mt(a, ARRAY_SIZE(pat_types), pat_types)
> +#define valid_mtrr_type(a) valid_mt(a, ARRAY_SIZE(mtrr_types), mtrr_types)
> +
> +static bool mtrr_valid(u32 msr, u64 data)
> +{
> + int i;
> +
> + if (!msr_mtrr_valid(msr))
> + return false;
> +
> + if (msr == MSR_IA32_CR_PAT) {
> + for (i = 0; i < 8; i++)
> + if (!valid_pat_type((data >> (i * 8)) & 0xff))
> + return false;
> + return true;
> + } else if (msr == MSR_MTRRdefType) {
> + return valid_mtrr_type(data & 0xff);
> + } else if (msr >= MSR_MTRRfix64K_0 && msr <= MSR_MTRRfix4K_F8000) {
> + for (i = 0; i < 8 ; i++)
> + if (!valid_mtrr_type((data >> (i * 8)) & 0xff))
> + return false;
> + return true;
> + }
> +
> + /* variable MTRRs, physmaskn have bits 0-10 reserved */
> + return valid_mtrr_type(data & 0xff);
> +}
> +
>  static int set_msr_mtrr(struct kvm_vcpu *vcpu, u32 msr, u64 data)
>  {
>   u64 *p = (u64 *)&vcpu->arch.mtrr_state.fixed_ranges;
>
> - if (!msr_mtrr_valid(msr))
> + if (!mtrr_valid(msr, data))
>   return 1;
>
>   if (msr == MSR_MTRRdefType) {



Re: [patch 0/5] VMX EPT misconfigurtion handler

2009-06-10 Thread Yang, Sheng

On Wednesday 10 June 2009 05:30:09 Marcelo Tosatti wrote:
> From the Intel docs:
>
> An EPT misconfiguration occurs when, in the course of translating
> a guest-physical address, the logical processor encounters an EPT
> paging-structure entry that contains an unsupported value.
>
> Handle this event and print useful information for diagnostics.

Looks fine to me, thanks!

-- 
regards
Yang, Sheng



Re: [PATCH v4] kvm: Use a bitmap for tracking used GSIs

2009-05-12 Thread Yang, Sheng

On Wednesday 13 May 2009 12:41:29 Alex Williamson wrote:
> We're currently using a counter to track the most recent GSI we've
> handed out.  This quickly hits KVM_MAX_IRQ_ROUTES when using device
> assignment with a driver that regularly toggles the MSI enable bit.
> This can mean only a few minutes of usable run time.  Instead, track
> used GSIs in a bitmap.
>
> Signed-off-by: Alex Williamson 
> ---

Acked.

-- 
regards
Yang, Sheng

>
>  v2: Added mutex to protect gsi bitmap
>  v3: Updated for comments from Michael Tsirkin
>  No longer depends on "[PATCH] kvm: device-assignment: Catch GSI
>  overflow"
>  v4: Fix gsi_bytes calculation noted by Sheng Yang
>
>  hw/device-assignment.c  |4 ++
>  kvm/libkvm/kvm-common.h |4 ++
>  kvm/libkvm/libkvm.c     |   81 +--
>  kvm/libkvm/libkvm.h     |   10 ++
>  4 files changed, 86 insertions(+), 13 deletions(-)
>
> diff --git a/hw/device-assignment.c b/hw/device-assignment.c
> index a7365c8..a6cc9b9 100644
> --- a/hw/device-assignment.c
> +++ b/hw/device-assignment.c
> @@ -561,8 +561,10 @@ static void free_dev_irq_entries(AssignedDevice *dev)
>  {
>  int i;
>
> -for (i = 0; i < dev->irq_entries_nr; i++)
> +for (i = 0; i < dev->irq_entries_nr; i++) {
>  kvm_del_routing_entry(kvm_context, &dev->entry[i]);
> +kvm_free_irq_route_gsi(kvm_context, dev->entry[i].gsi);
> +}
>  free(dev->entry);
>  dev->entry = NULL;
>  dev->irq_entries_nr = 0;
> diff --git a/kvm/libkvm/kvm-common.h b/kvm/libkvm/kvm-common.h
> index 591fb53..4b3cb51 100644
> --- a/kvm/libkvm/kvm-common.h
> +++ b/kvm/libkvm/kvm-common.h
> @@ -66,8 +66,10 @@ struct kvm_context {
>  #ifdef KVM_CAP_IRQ_ROUTING
>   struct kvm_irq_routing *irq_routes;
>   int nr_allocated_irq_routes;
> + void *used_gsi_bitmap;
> + int max_gsi;
> + pthread_mutex_t gsi_mutex;
>  #endif
> - int max_used_gsi;
>  };
>
>  int kvm_alloc_kernel_memory(kvm_context_t kvm, unsigned long memory,
> diff --git a/kvm/libkvm/libkvm.c b/kvm/libkvm/libkvm.c
> index ba0a5d1..3eaa120 100644
> --- a/kvm/libkvm/libkvm.c
> +++ b/kvm/libkvm/libkvm.c
> @@ -35,6 +35,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include "libkvm.h"
>
>  #if defined(__x86_64__) || defined(__i386__)
> @@ -65,6 +66,8 @@
>  int kvm_abi = EXPECTED_KVM_API_VERSION;
>  int kvm_page_size;
>
> +static inline void set_bit(uint32_t *buf, unsigned int bit);
> +
>  struct slot_info {
>   unsigned long phys_addr;
>   unsigned long len;
> @@ -286,6 +289,9 @@ kvm_context_t kvm_init(struct kvm_callbacks *callbacks,
>   int fd;
>   kvm_context_t kvm;
>   int r;
> +#ifdef KVM_CAP_IRQ_ROUTING
> + int gsi_count, gsi_bytes, i;
> +#endif
>
>   fd = open("/dev/kvm", O_RDWR);
>   if (fd == -1) {
> @@ -322,6 +328,25 @@ kvm_context_t kvm_init(struct kvm_callbacks *callbacks,
>   kvm->dirty_pages_log_all = 0;
>   kvm->no_irqchip_creation = 0;
>   kvm->no_pit_creation = 0;
> +#ifdef KVM_CAP_IRQ_ROUTING
> + pthread_mutex_init(&kvm->gsi_mutex, NULL);
> +
> + gsi_count = kvm_get_gsi_count(kvm);
> + /* Round up so we can search ints using ffs */
> + gsi_bytes = ((gsi_count + 31) / 32) * 4;
> + kvm->used_gsi_bitmap = malloc(gsi_bytes);
> + if (!kvm->used_gsi_bitmap)
> + goto out_close;
> + memset(kvm->used_gsi_bitmap, 0, gsi_bytes);
> + kvm->max_gsi = gsi_bytes * 8;
> +
> + /* Mark all the IOAPIC pin GSIs and any over-allocated
> +  * GSIs as already in use. */
> + for (i = 0; i < KVM_IOAPIC_NUM_PINS; i++)
> + set_bit(kvm->used_gsi_bitmap, i);
> + for (i = gsi_count; i < kvm->max_gsi; i++)
> + set_bit(kvm->used_gsi_bitmap, i);
> +#endif
>
>   return kvm;
>   out_close:
> @@ -1298,8 +1323,6 @@ int kvm_add_routing_entry(kvm_context_t kvm,
>   new->flags = entry->flags;
>   new->u = entry->u;
>
> - if (entry->gsi > kvm->max_used_gsi)
> - kvm->max_used_gsi = entry->gsi;
>   return 0;
>  #else
>   return -ENOSYS;
> @@ -1404,18 +1427,54 @@ int kvm_commit_irq_routes(kvm_context_t kvm)
>  #endif
>  }
>
> +#ifdef KVM_CAP_IRQ_ROUTING
> +static inline void set_bit(uint32_t *buf, unsigned int bit)
> +{
> + buf[bit / 32] |= 1U << (bit % 32);
> +}
> +
> +static inline void clear_bit(uint32_t *buf, unsigned int bit)
> +{
> + buf[bit / 32] &= ~(1U << (bit % 32));
> +}
> +
> +

Re: [PATCH v2] kvm: device-assignment: Fix kvm_get_irq_route_gsi() return check

2009-05-12 Thread Yang, Sheng

On Wednesday 13 May 2009 12:45:11 Alex Williamson wrote:
> Use 'r' for the return value since gsi is unsigned.
>
> Signed-off-by: Alex Williamson 
> ---

Acked.

-- 
regards
Yang, Sheng

>
>  v2: Use 'r' instead of a cast, per Sheng Yang
>
>  hw/device-assignment.c |5 +++--
>  1 files changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/hw/device-assignment.c b/hw/device-assignment.c
> index a6cc9b9..4107915 100644
> --- a/hw/device-assignment.c
> +++ b/hw/device-assignment.c
> @@ -797,11 +797,12 @@ static void assigned_dev_update_msi(PCIDevice *pci_dev, unsigned int ctrl_pos)
>  assigned_dev->entry->u.msi.data = *(uint16_t *)(pci_dev->config +
>  pci_dev->cap.start + PCI_MSI_DATA_32);
>  assigned_dev->entry->type = KVM_IRQ_ROUTING_MSI;
> -assigned_dev->entry->gsi = kvm_get_irq_route_gsi(kvm_context);
> -if (assigned_dev->entry->gsi < 0) {
> +r = kvm_get_irq_route_gsi(kvm_context);
> +if (r < 0) {
>  perror("assigned_dev_update_msi: kvm_get_irq_route_gsi");
>  return;
>  }
> +assigned_dev->entry->gsi = r;
>
>  kvm_add_routing_entry(kvm_context, assigned_dev->entry);
>  if (kvm_commit_irq_routes(kvm_context) < 0) {



Re: [PATCH v3] kvm: Use a bitmap for tracking used GSIs

2009-05-12 Thread Yang, Sheng

On Wednesday 13 May 2009 12:10:34 Alex Williamson wrote:
> On Tue, 2009-05-12 at 21:42 -0600, Alex Williamson wrote:
> > On Wed, 2009-05-13 at 11:30 +0800, Yang, Sheng wrote:
> > > > +   kvm->used_gsi_bitmap = malloc(gsi_bytes);
> > > > +   if (!kvm->used_gsi_bitmap) {
> > > > +   pthread_mutex_unlock(&kvm->gsi_mutex);
> > > > +   goto out_close;
> > > > +   }
> > > > +   memset(kvm->used_gsi_bitmap, 0, gsi_bytes);
> > > > +   kvm->max_gsi = gsi_bytes * 8;
> > >
> > > So max_gsi = gsi_count / 4?
>
> kvm->max_gsi actually becomes the number of GSIs available in the
> bitmap, which may be more than gsi_count if we rounded up.  We
> preallocate GSIs between gsi_count and max_gsi to avoid using them.
> This just lets us not need to special case testing whether a bit in the
> last index is < gsi_count.  Am I overlooking anything here?  Thanks,
>
Oh, I understand that; I was just following the logic of my earlier comment
here (the one about "gsi_bytes = (gsi_count + 31) / 32")... Sorry for the
confusion...

-- 
regards
Yang, Sheng
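
The point made in the exchange above, that padding bits past gsi_count are
pre-marked as used so the search never needs a special case for the last
word, can be sketched as follows. This is only an illustration under stated
assumptions: struct gsi_map and the gsi_* helpers are hypothetical names,
not the libkvm code, and calloc() replaces the malloc()/memset() pair.

```c
#include <stdint.h>
#include <stdlib.h>
#include <strings.h>   /* ffs() */

/* Hypothetical stand-in for the GSI bookkeeping in struct kvm_context. */
struct gsi_map {
    uint32_t *words;   /* one bit per GSI, 1 = in use */
    int max_gsi;       /* number of bits in the bitmap (multiple of 32) */
};

static void gsi_set(struct gsi_map *m, unsigned int bit)
{
    m->words[bit / 32] |= 1U << (bit % 32);
}

static void gsi_clear(struct gsi_map *m, unsigned int bit)
{
    m->words[bit / 32] &= ~(1U << (bit % 32));
}

/* reserved: low GSIs that must never be handed out (the IOAPIC pins in
 * the patch); gsi_count: number of GSIs that really exist. */
static int gsi_map_init(struct gsi_map *m, int gsi_count, int reserved)
{
    int nwords = (gsi_count + 31) / 32;
    int i;

    m->words = calloc(nwords, sizeof(uint32_t));
    if (!m->words)
        return -1;
    m->max_gsi = nwords * 32;

    for (i = 0; i < reserved; i++)
        gsi_set(m, i);
    /* Padding bits past gsi_count are marked used, so the search below
     * never has to compare its result against gsi_count. */
    for (i = gsi_count; i < m->max_gsi; i++)
        gsi_set(m, i);
    return 0;
}

/* Returns a free GSI and marks it used, or -1 if none is left. */
static int gsi_alloc(struct gsi_map *m)
{
    int i;

    for (i = 0; i < m->max_gsi / 32; i++) {
        int bit = ffs((int)~m->words[i]);   /* 1-based, 0 if no zero bit */
        if (bit) {
            int gsi = i * 32 + (bit - 1);
            gsi_set(m, gsi);
            return gsi;
        }
    }
    return -1;
}

/* Makes a previously allocated GSI available again. */
static void gsi_free(struct gsi_map *m, int gsi)
{
    gsi_clear(m, gsi);
}
```

A freed GSI becomes the next candidate again, which is what keeps a driver
that toggles MSI enable from exhausting the range.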

Re: [PATCH] kvm: device-assignment: Fix kvm_get_irq_route_gsi() return check

2009-05-12 Thread Yang, Sheng

On Wednesday 13 May 2009 11:55:50 Alex Williamson wrote:
> On Wed, 2009-05-13 at 11:36 +0800, Yang, Sheng wrote:
> > On Wednesday 13 May 2009 06:14:01 Alex Williamson wrote:
> > > --- a/hw/device-assignment.c
> > > +++ b/hw/device-assignment.c
> > > @@ -798,7 +798,7 @@ static void assigned_dev_update_msi(PCIDevice
> > > *pci_dev, unsigned int ctrl_pos) pci_dev->cap.start + PCI_MSI_DATA_32);
> > >  assigned_dev->entry->type = KVM_IRQ_ROUTING_MSI;
> > >  assigned_dev->entry->gsi = kvm_get_irq_route_gsi(kvm_context);
> > > -if (assigned_dev->entry->gsi < 0) {
> > > +if ((int)(assigned_dev->entry->gsi) < 0) {
> > >  perror("assigned_dev_update_msi: kvm_get_irq_route_gsi");
> > >  return;
> > >  }
> >
> > Use a return value(r) seems better.
>
> Hi Sheng,
>
> Do you mean:
>
> r = kvm_get_irq_route_gsi(kvm_context);
> if (r < 0) {
> ...
> }
> assigned_dev->entry->gsi = r;

Yes.

> > And I realized there is memory leak here. Entry seems haven't been freed
> > for error... So does MSI-X...
>
> I hadn't noticed that one, but now that you mention it, yep.  Thanks,

Thanks. :)

-- 
regards
Yang, Sheng


Re: [PATCH] kvm: device-assignment: Fix kvm_get_irq_route_gsi() return check

2009-05-12 Thread Yang, Sheng

On Wednesday 13 May 2009 06:14:01 Alex Williamson wrote:
> Cast to a signed int to test for < 0.
>
> Signed-off-by: Alex Williamson 
> ---
>
>  This supersedes "[PATCH] kvm: device-assignment: Catch GSI overflow"
>
>  hw/device-assignment.c |2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/hw/device-assignment.c b/hw/device-assignment.c
> index a6cc9b9..5bdae24 100644
> --- a/hw/device-assignment.c
> +++ b/hw/device-assignment.c
> @@ -798,7 +798,7 @@ static void assigned_dev_update_msi(PCIDevice *pci_dev, unsigned int ctrl_pos)
>  pci_dev->cap.start + PCI_MSI_DATA_32);
>  assigned_dev->entry->type = KVM_IRQ_ROUTING_MSI;
>  assigned_dev->entry->gsi = kvm_get_irq_route_gsi(kvm_context);
> -if (assigned_dev->entry->gsi < 0) {
> +if ((int)(assigned_dev->entry->gsi) < 0) {
>  perror("assigned_dev_update_msi: kvm_get_irq_route_gsi");
>  return;
>  }

Using a return value (r) seems better.

Also, I realized there is a memory leak here: the entry does not seem to be
freed on the error path. The same goes for MSI-X.
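
To illustrate the suggestion: an unsigned field can never compare below
zero, so the error check has to be made on a signed temporary before the
value is stored. A minimal sketch, assuming a hypothetical allocator
fake_get_gsi() and a cut-down struct route_entry (neither is the real
libkvm code):

```c
#include <stdint.h>

/* Hypothetical allocator: returns a GSI >= 0, or a negative errno-style
 * value on failure. */
static int fake_get_gsi(int fail)
{
    return fail ? -28 /* stand-in for -ENOSPC */ : 24;
}

struct route_entry {
    uint32_t gsi;   /* unsigned, like the gsi field under discussion */
};

/* Returns 0 on success, or the negative error from the allocator. */
static int assign_gsi(struct route_entry *e, int fail)
{
    int r = fake_get_gsi(fail);   /* keep the result signed for the check */

    if (r < 0)
        return r;                 /* e->gsi is left untouched on error */
    e->gsi = (uint32_t)r;
    return 0;
}
```

Compared with casting the unsigned field back to int, this also avoids
storing a bogus GSI in the entry when the allocation fails.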

-- 
regards
Yang, Sheng

Re: [PATCH v3] kvm: Use a bitmap for tracking used GSIs

2009-05-12 Thread Yang, Sheng

On Wednesday 13 May 2009 06:07:15 Alex Williamson wrote:
> We're currently using a counter to track the most recent GSI we've
> handed out.  This quickly hits KVM_MAX_IRQ_ROUTES when using device
> assignment with a driver that regularly toggles the MSI enable bit.
> This can mean only a few minutes of usable run time.  Instead, track
> used GSIs in a bitmap.
>
> Signed-off-by: Alex Williamson 
> ---
>
>  v2: Added mutex to protect gsi bitmap
>  v3: Updated for comments from Michael Tsirkin
>  No longer depends on "[PATCH] kvm: device-assignment: Catch GSI
> overflow"
>
>  hw/device-assignment.c  |4 ++
>  kvm/libkvm/kvm-common.h |4 ++
>  kvm/libkvm/libkvm.c     |   83 +--
>  kvm/libkvm/libkvm.h     |   10 ++
>  4 files changed, 88 insertions(+), 13 deletions(-)
>
> diff --git a/hw/device-assignment.c b/hw/device-assignment.c
> index a7365c8..a6cc9b9 100644
> --- a/hw/device-assignment.c
> +++ b/hw/device-assignment.c
> @@ -561,8 +561,10 @@ static void free_dev_irq_entries(AssignedDevice *dev)
>  {
>  int i;
>
> -for (i = 0; i < dev->irq_entries_nr; i++)
> +for (i = 0; i < dev->irq_entries_nr; i++) {
>  kvm_del_routing_entry(kvm_context, &dev->entry[i]);
> +kvm_free_irq_route_gsi(kvm_context, dev->entry[i].gsi);
> +}
>  free(dev->entry);
>  dev->entry = NULL;
>  dev->irq_entries_nr = 0;
> diff --git a/kvm/libkvm/kvm-common.h b/kvm/libkvm/kvm-common.h
> index 591fb53..4b3cb51 100644
> --- a/kvm/libkvm/kvm-common.h
> +++ b/kvm/libkvm/kvm-common.h
> @@ -66,8 +66,10 @@ struct kvm_context {
>  #ifdef KVM_CAP_IRQ_ROUTING
>   struct kvm_irq_routing *irq_routes;
>   int nr_allocated_irq_routes;
> + void *used_gsi_bitmap;
> + int max_gsi;
> + pthread_mutex_t gsi_mutex;
>  #endif
> - int max_used_gsi;
>  };
>
>  int kvm_alloc_kernel_memory(kvm_context_t kvm, unsigned long memory,
> diff --git a/kvm/libkvm/libkvm.c b/kvm/libkvm/libkvm.c
> index ba0a5d1..3d7ab75 100644
> --- a/kvm/libkvm/libkvm.c
> +++ b/kvm/libkvm/libkvm.c
> @@ -35,6 +35,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include "libkvm.h"
>
>  #if defined(__x86_64__) || defined(__i386__)
> @@ -65,6 +66,8 @@
>  int kvm_abi = EXPECTED_KVM_API_VERSION;
>  int kvm_page_size;
>
> +static inline void set_bit(uint32_t *buf, unsigned int bit);
> +
>  struct slot_info {
>   unsigned long phys_addr;
>   unsigned long len;
> @@ -286,6 +289,9 @@ kvm_context_t kvm_init(struct kvm_callbacks *callbacks,
>   int fd;
>   kvm_context_t kvm;
>   int r;
> +#ifdef KVM_CAP_IRQ_ROUTING
> + int gsi_count, gsi_bytes, i;
> +#endif
>
>   fd = open("/dev/kvm", O_RDWR);
>   if (fd == -1) {
> @@ -322,6 +328,27 @@ kvm_context_t kvm_init(struct kvm_callbacks *callbacks,
>   kvm->dirty_pages_log_all = 0;
>   kvm->no_irqchip_creation = 0;
>   kvm->no_pit_creation = 0;
> +#ifdef KVM_CAP_IRQ_ROUTING
> + pthread_mutex_init(&kvm->gsi_mutex, NULL);
> +
> + gsi_count = kvm_get_gsi_count(kvm);
> + /* Round up so we can search ints using ffs */
> + gsi_bytes = (gsi_count + 31) / 32;

CMIIW, should it be gsi_bytes = (gsi_count + 7) / 8? This looks like a
bits-to-int conversion.

> + kvm->used_gsi_bitmap = malloc(gsi_bytes);
> + if (!kvm->used_gsi_bitmap) {
> + pthread_mutex_unlock(&kvm->gsi_mutex);
> + goto out_close;
> + }
> + memset(kvm->used_gsi_bitmap, 0, gsi_bytes);
> + kvm->max_gsi = gsi_bytes * 8;

So max_gsi = gsi_count / 4?
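
The arithmetic behind the two questions above can be checked with a small
sketch. The helper names are hypothetical, and gsi_count = 1024 is used
below purely as an example value:

```c
/* Value the v3 expression assigns to gsi_bytes. */
static int v3_bytes(int gsi_count)
{
    return (gsi_count + 31) / 32;   /* really a count of 32-bit words */
}

/* Bytes needed when rounding up to whole 32-bit words. */
static int word_rounded_bytes(int gsi_count)
{
    return ((gsi_count + 31) / 32) * 4;
}

/* Bits available in an allocation of the given size. */
static int bits_in(int bytes)
{
    return bytes * 8;
}
```

With gsi_count = 1024, v3_bytes() gives 32, so max_gsi = 32 * 8 = 256,
which is gsi_count / 4. Multiplying the word count by 4 gives 128 bytes
and 1024 bits, one per GSI.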

-- 
regards
Yang, Sheng

> +
> + /* Mark all the IOAPIC pin GSIs and any over-allocated
> +  * GSIs as already in use. */
> + for (i = 0; i < KVM_IOAPIC_NUM_PINS; i++)
> + set_bit(kvm->used_gsi_bitmap, i);
> + for (i = gsi_count; i < kvm->max_gsi; i++)
> + set_bit(kvm->used_gsi_bitmap, i);
> +#endif
>
>   return kvm;
>   out_close:
> @@ -1298,8 +1325,6 @@ int kvm_add_routing_entry(kvm_context_t kvm,
>   new->flags = entry->flags;
>   new->u = entry->u;
>
> - if (entry->gsi > kvm->max_used_gsi)
> - kvm->max_used_gsi = entry->gsi;
>   return 0;
>  #else
>   return -ENOSYS;
> @@ -1404,18 +1429,54 @@ int kvm_commit_irq_routes(kvm_context_t kvm)
>  #endif
>  }
>
> +#ifdef KVM_CAP_IRQ_ROUTING
> +static inline void set_bit(uint32_t *buf, unsigned int bit)
> +{
> + buf[bit / 32] |= 1U << (bit % 32);
> +}
> +
> +static inline void

Re: [PATCH 1/1] KVM: Fix potentially recursively get kvm lock

2009-05-12 Thread Yang, Sheng

On Wednesday 13 May 2009 06:09:08 Marcelo Tosatti wrote:
> On Tue, May 12, 2009 at 03:36:27PM -0600, Alex Williamson wrote:
> > On Tue, 2009-05-12 at 16:44 -0300, Marcelo Tosatti wrote:
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index 4d00942..ba067db 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -250,7 +250,15 @@ static void deassign_host_irq(struct kvm *kvm,
> > >   disable_irq_nosync(assigned_dev->
> > >  host_msix_entries[i].vector);
> > >
> > > + /*
> > > +  * FIXME: kvm_assigned_dev_interrupt_work_handler can deadlock
> > > +  * with cancel_work_sync, since it requires kvm->lock for irq
> > > +  * injection. This is a hack, the irq code must use
> > > +  * a separate lock.
> > > +  */
> > > + mutex_unlock(&kvm->lock);
> > >   cancel_work_sync(&assigned_dev->interrupt_work);
> > > + mutex_lock(&kvm->lock);
> >
> > Seems to work, I assume you've got a similar unlock/lock for the
> > MSI/INTx block.  Thanks,
>
> KVM: workaround workqueue / deassign_host_irq deadlock
>
> I think I'm running into the following deadlock in the kvm kernel module
> when trying to use device assignment:
>
> CPU A   CPU B
> kvm_vm_ioctl_deassign_dev_irq()
>   mutex_lock(&kvm->lock);   worker_thread()
>   -> kvm_deassign_irq()   ->
> kvm_assigned_dev_interrupt_work_handler()
> -> deassign_host_irq()  mutex_lock(&kvm->lock);
>   -> cancel_work_sync() [blocked]
>
> Workaround the issue by dropping kvm->lock for cancel_work_sync().
>
> Reported-by: Alex Williamson 
> From: Sheng Yang 
> Signed-off-by: Marcelo Tosatti 

Another calling path (kvm_free_all_assigned_devices()) doesn't hold
kvm->lock... Doesn't it need the lock to traverse the assigned device list?

-- 
regards
Yang, Sheng

> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 4d00942..d4af719 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -250,7 +250,15 @@ static void deassign_host_irq(struct kvm *kvm,
>   disable_irq_nosync(assigned_dev->
>  host_msix_entries[i].vector);
>
> + /*
> +  * FIXME: kvm_assigned_dev_interrupt_work_handler can deadlock
> +  * with cancel_work_sync, since it requires kvm->lock for irq
> +  * injection. This is a hack, the irq code must use
> +  * a separate lock. Same below for MSI.
> +  */
> + mutex_unlock(&kvm->lock);
>   cancel_work_sync(&assigned_dev->interrupt_work);
> + mutex_lock(&kvm->lock);
>
>   for (i = 0; i < assigned_dev->entries_nr; i++)
>   free_irq(assigned_dev->host_msix_entries[i].vector,
> @@ -263,7 +271,9 @@ static void deassign_host_irq(struct kvm *kvm,
>   } else {
>   /* Deal with MSI and INTx */
>   disable_irq_nosync(assigned_dev->host_irq);
> + mutex_unlock(&kvm->lock);
>   cancel_work_sync(&assigned_dev->interrupt_work);
> + mutex_lock(&kvm->lock);
>
>   free_irq(assigned_dev->host_irq, (void *)assigned_dev);



Re: [PATCH 1/1] KVM: Fix potentially recursively get kvm lock

2009-05-12 Thread Yang, Sheng

On Tuesday 12 May 2009 19:55:24 Marcelo Tosatti wrote:
> On Tue, May 12, 2009 at 05:32:09PM +0800, Sheng Yang wrote:
> > kvm_vm_ioctl_deassign_dev_irq() would potentially recursively get
> > kvm->lock, because it called kvm_deassigned_irq() which implicit hold
> > kvm->lock by calling deassign_host_irq().
> >
> > Fix it by move kvm_deassign_irq() out of critial region. And add the
> > missing lock for deassign_guest_irq().
> >
> > Reported-by: Alex Williamson 
> > Signed-off-by: Sheng Yang 
> > ---
> >  virt/kvm/kvm_main.c |   14 +++---
> >  1 files changed, 7 insertions(+), 7 deletions(-)
> >
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 4d00942..3c69655 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -215,6 +215,8 @@ static void kvm_assigned_dev_ack_irq(struct kvm_irq_ack_notifier *kian)
> >  static void deassign_guest_irq(struct kvm *kvm,
> >struct kvm_assigned_dev_kernel *assigned_dev)
> >  {
> > +   mutex_lock(&kvm->lock);
> > +
> > kvm_unregister_irq_ack_notifier(&assigned_dev->ack_notifier);
> > assigned_dev->ack_notifier.gsi = -1;
> >
> > @@ -222,6 +224,8 @@ static void deassign_guest_irq(struct kvm *kvm,
> > kvm_free_irq_source_id(kvm, assigned_dev->irq_source_id);
> > assigned_dev->irq_source_id = -1;
> > assigned_dev->irq_requested_type &= ~(KVM_DEV_IRQ_GUEST_MASK);
> > +
> > +   mutex_unlock(&kvm->lock);
> >  }
> >
> >  /* The function implicit hold kvm->lock mutex due to cancel_work_sync() */
> > @@ -558,20 +562,16 @@ static int kvm_vm_ioctl_deassign_dev_irq(struct kvm *kvm,
> >  struct kvm_assigned_irq
> >  *assigned_irq)
> >  {
> > -   int r = -ENODEV;
> > struct kvm_assigned_dev_kernel *match;
> >
> > mutex_lock(&kvm->lock);
> > -
> > match = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
> >   assigned_irq->assigned_dev_id);
> > +   mutex_unlock(&kvm->lock);
>
> assigned_dev list is protected by kvm->lock. So you could have another
> ioctl adding to it at the same time you're searching.

Oh, yes... My fault... 

> Could either have a separate kvm->assigned_devs_lock, to protect
> kvm->arch.assigned_dev_head (users are ioctls that manipulate it), or
> change the IRQ injection to use a separate spinlock, kill the workqueue
> and call kvm_set_irq from the assigned device interrupt handler.

I prefer the latter, though it needs more work. But is the only reason for
putting a workqueue here that kvm->lock is a mutex? I can't believe it... If
so, I think we made a big mistake: we have had to fix all kinds of race
problems caused by this, only to find in the end that it was unnecessary...

Maybe another reason was kvm_kick_vcpu(), but you have already fixed that.

I will continue checking the code...

-- 
regards
Yang, Sheng

Re: device-assignment deadlock

2009-05-12 Thread Yang, Sheng

[I was dropped from the mailing list last Friday for some unknown reason,
which happens periodically... Sorry]

> Hi Sheng,
>
> I think I'm running into the following deadlock in the kvm kernel module
> when trying to use device assignment:
>
> CPU A   CPU B
> kvm_vm_ioctl_deassign_dev_irq()
>  mutex_lock(&kvm->lock);   worker_thread()
>  -> kvm_deassign_irq()   -> 
>kvm_assigned_dev_interrupt_work_handler()
>-> deassign_host_irq()  mutex_lock(&kvm->lock);
>  -> cancel_work_sync() [blocked]

> I wonder if we need finer granularity locking to avoid this.
> Suggestions?  Thanks,

This part again...

I think simply moving kvm_deassign_irq() out of the critical region is OK,
and I have also added the lock that seems to be missing in
deassign_guest_irq(). I will post a patch soon.
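
The idea of not waiting for the worker while holding the lock it needs can
be sketched in userspace. This is only an analogy under stated assumptions:
a pthread mutex plays kvm->lock, pthread_join() plays cancel_work_sync(),
and all names are hypothetical.

```c
#include <pthread.h>

static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;  /* plays kvm->lock */
static int work_done;

/* Plays the interrupt work handler: it needs the big lock to run. */
static void *work_handler(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&big_lock);
    work_done = 1;
    pthread_mutex_unlock(&big_lock);
    return NULL;
}

/* Plays the deassign ioctl. pthread_join() stands in for
 * cancel_work_sync(): both wait for the worker to finish. Waiting while
 * still holding big_lock could hang forever if the worker is blocked on
 * that lock, so the lock is released before waiting. */
static int deassign(void)
{
    pthread_t worker;

    pthread_mutex_lock(&big_lock);
    /* ... look up the device under the lock ... */
    if (pthread_create(&worker, NULL, work_handler, NULL) != 0) {
        pthread_mutex_unlock(&big_lock);
        return -1;
    }
    pthread_mutex_unlock(&big_lock);   /* drop the lock before waiting */

    pthread_join(worker, NULL);        /* safe: the worker can take the lock */
    return work_done;
}
```

The sketch does not address what protects the device list once the lock
has been dropped; that still has to be answered for the real code.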

-- 
regards
Yang, Sheng


Re: [PATCH] kvm: Use a bitmap for tracking used GSIs

2009-05-11 Thread Yang, Sheng

On Friday 08 May 2009 06:22:20 Alex Williamson wrote:
> We're currently using a counter to track the most recent GSI we've
> handed out.  This quickly hits KVM_MAX_IRQ_ROUTES when using device
> assignment with a driver that regularly toggles the MSI enable bit.
> This can mean only a few minutes of usable run time.  Instead, track
> used GSIs in a bitmap.
>
> Signed-off-by: Alex Williamson 
> ---
>
>  Applies on top of "kvm: device-assignment: Catch GSI overflow"
>
>  hw/device-assignment.c  |4 ++-
>  kvm/libkvm/kvm-common.h |3 +-
>  kvm/libkvm/libkvm.c     |   68 +--
>  kvm/libkvm/libkvm.h     |   10 +++
>  4 files changed, 74 insertions(+), 11 deletions(-)
>
> diff --git a/hw/device-assignment.c b/hw/device-assignment.c
> index e06dd08..5bdae24 100644
> --- a/hw/device-assignment.c
> +++ b/hw/device-assignment.c
> @@ -561,8 +561,10 @@ static void free_dev_irq_entries(AssignedDevice *dev)
>  {
>  int i;
>
> -for (i = 0; i < dev->irq_entries_nr; i++)
> +for (i = 0; i < dev->irq_entries_nr; i++) {
>  kvm_del_routing_entry(kvm_context, &dev->entry[i]);
> +kvm_free_irq_route_gsi(kvm_context, dev->entry[i].gsi);
> +}
>  free(dev->entry);
>  dev->entry = NULL;
>  dev->irq_entries_nr = 0;
> diff --git a/kvm/libkvm/kvm-common.h b/kvm/libkvm/kvm-common.h
> index 591fb53..94f86e5 100644
> --- a/kvm/libkvm/kvm-common.h
> +++ b/kvm/libkvm/kvm-common.h
> @@ -66,8 +66,9 @@ struct kvm_context {
>  #ifdef KVM_CAP_IRQ_ROUTING
>   struct kvm_irq_routing *irq_routes;
>   int nr_allocated_irq_routes;
> + void *used_gsi_bitmap;
> + int max_gsi;
>  #endif
> - int max_used_gsi;
>  };
>
>  int kvm_alloc_kernel_memory(kvm_context_t kvm, unsigned long memory,
> diff --git a/kvm/libkvm/libkvm.c b/kvm/libkvm/libkvm.c
> index 2a4165a..43abc7d 100644
> --- a/kvm/libkvm/libkvm.c
> +++ b/kvm/libkvm/libkvm.c
> @@ -1298,8 +1298,6 @@ int kvm_add_routing_entry(kvm_context_t kvm,
>   new->flags = entry->flags;
>   new->u = entry->u;
>
> - if (entry->gsi > kvm->max_used_gsi)
> - kvm->max_used_gsi = entry->gsi;
>   return 0;
>  #else
>   return -ENOSYS;
> @@ -1404,20 +1402,72 @@ int kvm_commit_irq_routes(kvm_context_t kvm)
>  #endif
>  }
>
> +#ifdef KVM_CAP_IRQ_ROUTING
> +static inline void set_bit(unsigned int *buf, int bit)
> +{
> + buf[bit >> 5] |= (1U << (bit & 0x1f));
> +}
> +
> +static inline void clear_bit(unsigned int *buf, int bit)
> +{
> + buf[bit >> 5] &= ~(1U << (bit & 0x1f));
> +}
> +
> +static int kvm_find_free_gsi(kvm_context_t kvm)
> +{
> + int i, bit, gsi;
> + unsigned int *buf = kvm->used_gsi_bitmap;
> +
> + for (i = 0; i < (kvm->max_gsi >> 5); i++) {
> + if (buf[i] != ~0U)
> + break;
> + }
> +
> + if (i == kvm->max_gsi >> 5)
> + return -ENOSPC;
> +
> + bit = ffs(~buf[i]);
> + if (!bit)
> + return -EAGAIN;
> +
> + gsi = (bit - 1) | (i << 5);
> + set_bit(buf, gsi);
> + return gsi;
> +}
> +#endif
> +
>  int kvm_get_irq_route_gsi(kvm_context_t kvm)
>  {
>  #ifdef KVM_CAP_IRQ_ROUTING
> - if (kvm->max_used_gsi >= KVM_IOAPIC_NUM_PINS)  {
> - if (kvm->max_used_gsi + 1 < kvm_get_gsi_count(kvm))
> -return kvm->max_used_gsi + 1;
> -else
> -return -ENOSPC;
> -} else
> -return KVM_IOAPIC_NUM_PINS;
> + if (!kvm->max_gsi) {
> + int i;
> +
> + /* Round the number of GSIs supported to a 4 byte
> +  * value so we can search it using ints and ffs */
> + i = kvm_get_gsi_count(kvm) & ~0x1f;
> + kvm->used_gsi_bitmap = malloc(i >> 3);

3 or 5?

I am a little confused by these magic numbers, including 0x1f...

I think there is something in QEmu that indicates the length of unsigned
long (sorry, I can't find it now...), so how about using ffsl() and deriving
the other constants from it?
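
One way to read this suggestion is to derive every shift and mask from the
word size instead of writing 3, 5 and 0x1f by hand. A minimal sketch; the
macro and function names are hypothetical, and __builtin_ffsl (a GCC/Clang
builtin) is used in place of ffsl() only to keep the example free of
feature-test macros:

```c
#include <limits.h>

#define BITS_PER_WORD   ((int)(sizeof(unsigned long) * CHAR_BIT))
#define WORDS_FOR(bits) (((bits) + BITS_PER_WORD - 1) / BITS_PER_WORD)

static void bm_set(unsigned long *buf, int bit)
{
    buf[bit / BITS_PER_WORD] |= 1UL << (bit % BITS_PER_WORD);
}

static void bm_clear(unsigned long *buf, int bit)
{
    buf[bit / BITS_PER_WORD] &= ~(1UL << (bit % BITS_PER_WORD));
}

/* Index of the first zero bit below nbits, or -1 if every bit is set.
 * __builtin_ffsl is the GCC/Clang builtin, used here in place of ffsl(). */
static int bm_first_zero(const unsigned long *buf, int nbits)
{
    int i;

    for (i = 0; i < WORDS_FOR(nbits); i++) {
        int pos = __builtin_ffsl((long)~buf[i]);   /* 1-based, 0 if none */
        if (pos) {
            int bit = i * BITS_PER_WORD + (pos - 1);
            return bit < nbits ? bit : -1;
        }
    }
    return -1;
}
```

The same source then works whether unsigned long is 32 or 64 bits wide,
and the division and modulo compile down to the same shifts and masks.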

-- 
regards
Yang, Sheng

> + if (!kvm->used_gsi_bitmap)
> + return -ENOMEM;
> + memset(kvm->used_gsi_bitmap, 0, i >> 3);
> + kvm->max_gsi = i;
> +
> + /* Mark all the IOAPIC pin GSIs as already used */
> + for (i = 0; i <= KVM_IOAPIC_NUM_PINS; i++)
> + set_bit(kvm->used_gsi_bitmap, i);
> + }
> +
> + r

Re: [PATCH v10 0/7] PCI: Linux kernel SR-IOV support

2009-03-08 Thread Yang, Sheng

On Monday 09 March 2009 11:42:05 Yang, Sheng wrote:
> On Sunday 08 March 2009 22:30:16 Avi Kivity wrote:
> > Matthew Wilcox wrote:
> > > On Tue, Feb 24, 2009 at 12:47:38PM +0200, Avi Kivity wrote:
> > >> Do those patches allow using a VF on the host (in other words, does
> > >> the kernel emulate config space accesses)?
> > >
> > > SR-IOV hardware handles config space accesses to virtual functions.  No
> > > kernel changes needed for that aspect of it.
> >
> > Patches 2 and 3 of the patchset that enables SR/IOV in kvm [1] suggest
> > that at the config space is only partially implemented.
> >
> > [1] http://thread.gmane.org/gmane.comp.emulators.kvm.devel/29034
>
> Hi Avi
>
> For kernel side, patch 2 is not necessary. Because kernel would read
> VID/DID directly from pci_dev rather than configuration space, which have
> been set properly already.
>
> And very sorry, for the patch 3. We haven't known exactly what's happened.
> I think the problem is caused by guest driver, but didn't confirm(and I
> have some misunderstandings with ZhaoYu for I thought we are agree on the
> reason, but after confirm with him, he didn't agree). I am doing more
> investigations to find the real cause.

Found the reason for patch 3.

After the guest driver module (the VF driver) is inserted, the driver does a
read-modify-write on the command register to enable the Bus Master bit
(bit 2). Before that, the MMIO bit has already been set in the register. But
without patch 3, the guest driver won't see the MMIO bit (bit 1), so it just
writes 0x4 to the command register, with the side effect of unmapping MMIO
in QEmu. So patch 3 is needed (and what I thought before was right).

Clearing the bit only affects QEmu, which unmaps the MMIO mapping. The
kernel side doesn't need this, so it's OK.
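
The read-modify-write described above can be shown with plain bit
arithmetic. This is a sketch only: the helper names are hypothetical, and
the two masks simply follow the bit positions named in the message (bit 1
for memory space, bit 2 for bus master).

```c
#include <stdint.h>

#define CMD_MEMORY  0x2   /* bit 1: memory (MMIO) space enable */
#define CMD_MASTER  0x4   /* bit 2: bus master enable */

/* What the driver does: read the command register, set Bus Master,
 * write the result back. 'visible' is the value the guest reads. */
static uint16_t driver_enable_bus_master(uint16_t visible)
{
    return visible | CMD_MASTER;
}

/* Nonzero if a write of 'written' would turn memory decoding off. */
static int write_disables_mmio(uint16_t written)
{
    return (written & CMD_MEMORY) == 0;
}
```

If the guest reads 0 it writes back 0x4 and the memory bit is lost; if it
reads 0x2 it writes back 0x6 and the mapping survives.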

-- 
regards
Yang, Sheng
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v10 0/7] PCI: Linux kernel SR-IOV support

2009-03-08 Thread Yang, Sheng

On Sunday 08 March 2009 22:30:16 Avi Kivity wrote:
> Matthew Wilcox wrote:
> > On Tue, Feb 24, 2009 at 12:47:38PM +0200, Avi Kivity wrote:
> >> Do those patches allow using a VF on the host (in other words, does the
> >> kernel emulate config space accesses)?
> >
> > SR-IOV hardware handles config space accesses to virtual functions.  No
> > kernel changes needed for that aspect of it.
>
> Patches 2 and 3 of the patchset that enables SR/IOV in kvm [1] suggest
> that at the config space is only partially implemented.
>
> [1] http://thread.gmane.org/gmane.comp.emulators.kvm.devel/29034

Hi Avi

For the kernel side, patch 2 is not necessary, because the kernel reads
VID/DID directly from pci_dev rather than from configuration space, and
those have already been set properly.

And very sorry about patch 3. We don't know exactly what happened yet. I
think the problem is caused by the guest driver, but I haven't confirmed it
(and I had a misunderstanding with ZhaoYu: I thought we agreed on the
reason, but after checking with him, he didn't agree). I am investigating
further to find the real cause.

-- 
regards
Yang, Sheng
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: KVM: protect assigned dev workqueue, int handler and irq acker

2009-03-01 Thread Yang, Sheng

On Saturday 28 February 2009 01:54:31 Marcelo Tosatti wrote:
> On Fri, Feb 27, 2009 at 12:17:01PM +0800, Yang, Sheng wrote:
> > On Friday 27 February 2009 07:50:54 Marcelo Tosatti wrote:
> > > Can someone with HW test this please?
> >
> > Good catch! The patch works fine on my side.
> >
> > Can it be a per-device lock? One big lock for all assigned device seems
> > restrict scalability.
>
> Since all state is per-device, yes.
>
> Can you please review, test and ack the patch below?

I just checked dmesg and got this...

[ 1105.343824] [ cut here ]
[ 1105.347814] WARNING: at kernel/smp.c:226 smp_call_function_single+0x41/0x10b()
[ 1105.347814] Hardware name: To Be Filled By O.E.M.
[ 1105.347814] Modules linked in: kvm_intel kvm bridge stp llc i2c_dev i2c_core e1000 e1000e ehci_hcd ohci_hcd uhci_hcd
[ 1105.347814] Pid: 9, comm: events/0 Tainted: GW  2.6.29-rc4-1-gd5b5623 #20
[ 1105.347814] Call Trace:
[ 1105.347814]  [] warn_slowpath+0xd3/0xf2
[ 1105.347814]  [] ? __enqueue_entity+0x74/0x76
[ 1105.347814]  [] ? enqueue_entity+0xad/0xb6
[ 1105.347814]  [] ? try_to_wake_up+0x1ff/0x211
[ 1105.347814]  [] smp_call_function_single+0x41/0x10b
[ 1105.347814]  [] kvm_vcpu_kick+0x74/0x7c [kvm]
[ 1105.347814]  [] kvm_apic_set_irq+0x70/0x77 [kvm]
[ 1105.347814]  [] kvm_set_msi+0xe8/0x10d [kvm]
[ 1105.347814]  [] ? kvm_assigned_dev_interrupt_work_handler+0x30/0xfd [kvm]
[ 1105.347814]  [] kvm_set_irq+0x6f/0xb3 [kvm]
[ 1105.347814]  [] kvm_assigned_dev_interrupt_work_handler+0x7c/0xfd [kvm]
[ 1105.347814]  [] ? kvm_assigned_dev_interrupt_work_handler+0x0/0xfd [kvm]
[ 1105.347814]  [] run_workqueue+0xf5/0x1fd
[ 1105.347814]  [] ? run_workqueue+0x9f/0x1fd
[ 1105.347814]  [] worker_thread+0xdb/0xe8
[ 1105.347814]  [] ? autoremove_wake_function+0x0/0x38
[ 1105.347814]  [] ? worker_thread+0x0/0xe8
[ 1105.347814]  [] kthread+0x49/0x78
[ 1105.347814]  [] child_rip+0xa/0x20
[ 1105.347814]  [] ? restore_args+0x0/0x30
[ 1105.347814]  [] ? finish_task_switch+0x0/0xf3
[ 1105.347814]  [] ? kthread+0x0/0x78
[ 1105.347814]  [] ? child_rip+0x0/0x20
[ 1105.347814] ---[ end trace 3b3fe301343db608 ]---

-- 
regards
Yang, Sheng


> Thanks.
>
> > > -
> > >
> > > kvm_assigned_dev_ack_irq is vulnerable to a race condition with

Re: assigned dev msi int handling

2009-02-26 Thread Yang, Sheng

On Friday 27 February 2009 07:59:27 Marcelo Tosatti wrote:
> Hi Sheng,
>
> So for guest INTX interrupts the host interrupt is reenabled on ack from
> the guest, which is nice. Now for guest MSI interrupts it keeps reenabling
> the interrupt as fast as the work handler can run.
>
> Can you explain why it works this way? Why not disable interrupts
> on the host in all cases and only reenable on ack?
>

Sorry, I hadn't thought it through... The direct reason is that ack_irq is
tied to kvm_set_irq(), which is unnecessary for MSI/MSI-X. But enabling the
interrupt (MSI) after EOI seems more proper here, though more changes are
needed for MSI-X (it seems we need one ack notifier per vector in MSI-X).

-- 
regards
Yang, Sheng

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: KVM: protect assigned dev workqueue, int handler and irq acker

2009-02-26 Thread Yang, Sheng

On Friday 27 February 2009 07:50:54 Marcelo Tosatti wrote:
> Can someone with HW test this please?

Good catch! The patch works fine on my side.

Can it be a per-device lock? One big lock for all assigned devices seems to
restrict scalability.

> -
>
> kvm_assigned_dev_ack_irq is vulnerable to a race condition with the
> interrupt handler function. It does:
>
> if (dev->host_irq_disabled) {
> enable_irq(dev->host_irq);
> dev->host_irq_disabled = false;
>   }
>
> If an interrupt triggers before the dev->host_irq_disabled assignment,
> it will disable the interrupt and set dev->host_irq_disabled to true.
>
> On return to kvm_assigned_dev_ack_irq, dev->host_irq_disabled is set to
> false, and the next kvm_assigned_dev_ack_irq call will fail to reenable
> it.
>
> Other than that, having the interrupt handler and work handlers run in
> parallel sounds like asking for trouble (could not spot any obvious
> problem, but better not have to, it's fragile).

Well, my original intention was a FIFO between the interrupt handler and the work (for
MSI-X), but that seems too complex... And I also don't see any problem for now...
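
The race quoted above, and the per-device lock asked about earlier in this reply, can be modelled in userspace C. This is only a sketch under stated assumptions: `struct assigned_dev`, `dev_intr()` and `dev_ack()` are hypothetical names, a C11 `atomic_flag` spin lock stands in for the kernel spinlock, and a counter stands in for the host IRQ disable depth. The quoted patch takes one lock per VM instead.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model of one assigned device; not the KVM struct. */
struct assigned_dev {
	atomic_flag lock;         /* per-device lock (assumption of this sketch) */
	bool host_irq_disabled;   /* mirrors dev->host_irq_disabled */
	int disable_depth;        /* stands in for the host IRQ disable depth */
};

static void dev_init(struct assigned_dev *dev)
{
	atomic_flag_clear(&dev->lock);
	dev->host_irq_disabled = false;
	dev->disable_depth = 0;
}

static void dev_lock(struct assigned_dev *dev)
{
	/* Spin until the flag was previously clear, i.e. we own the lock. */
	while (atomic_flag_test_and_set_explicit(&dev->lock, memory_order_acquire))
		;
}

static void dev_unlock(struct assigned_dev *dev)
{
	atomic_flag_clear_explicit(&dev->lock, memory_order_release);
}

/* Interrupt handler side: mask the host IRQ and record that we did so. */
static void dev_intr(struct assigned_dev *dev)
{
	dev_lock(dev);
	dev->disable_depth++;              /* disable_irq_nosync() in the patch */
	dev->host_irq_disabled = true;
	dev_unlock(dev);
}

/* Guest ack side: test and clear the flag inside one critical section,
 * so an interrupt cannot slip in between the enable and the assignment. */
static void dev_ack(struct assigned_dev *dev)
{
	dev_lock(dev);
	if (dev->host_irq_disabled) {
		dev->disable_depth--;          /* enable_irq() in the patch */
		dev->host_irq_disabled = false;
	}
	dev_unlock(dev);
}
```

With the lock held across both statements in `dev_ack()`, the flag and the disable depth always change together, which is the property the quoted race description says is missing.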

-- 
regards
Yang, Sheng

> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 3832243..faaf386 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -152,6 +152,7 @@ struct kvm {
>   unsigned long mmu_notifier_seq;
>   long mmu_notifier_count;
>  #endif
> + spinlock_t assigned_dev_lock;
>  };
>
>  /* The guest did something we don't support. */
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 4d2be16..2bbf074 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -41,6 +41,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  #include 
>  #include 
> @@ -132,6 +133,7 @@ static void
> kvm_assigned_dev_interrupt_work_handler(struct work_struct *work) *
> finer-grained lock, update this
>*/
>   mutex_lock(&kvm->lock);
> + spin_lock_irq(&kvm->assigned_dev_lock);
>   if (assigned_dev->irq_requested_type & KVM_ASSIGNED_DEV_MSIX) {
>   struct kvm_guest_msix_entry *guest_entries =
>   assigned_dev->guest_msix_entries;
> @@ -158,18 +160,21 @@ static void
> kvm_assigned_dev_interrupt_work_handler(struct work_struct *work) }
>   }
>
> + spin_unlock_irq(&kvm->assigned_dev_lock);
>   mutex_unlock(&assigned_dev->kvm->lock);
>  }
>
>  static irqreturn_t kvm_assigned_dev_intr(int irq, void *dev_id)
>  {
> + unsigned long flags;
>   struct kvm_assigned_dev_kernel *assigned_dev =
>   (struct kvm_assigned_dev_kernel *) dev_id;
>
> + spin_lock_irqsave(&assigned_dev->kvm->assigned_dev_lock, flags);
>   if (assigned_dev->irq_requested_type == KVM_ASSIGNED_DEV_MSIX) {
>   int index = find_index_from_host_irq(assigned_dev, irq);
>   if (index < 0)
> - return IRQ_HANDLED;
> + goto out;
>   assigned_dev->guest_msix_entries[index].flags |=
>   KVM_ASSIGNED_MSIX_PENDING;
>   }
> @@ -179,6 +184,8 @@ static irqreturn_t kvm_assigned_dev_intr(int irq, void
> *dev_id) disable_irq_nosync(irq);
>   assigned_dev->host_irq_disabled = true;
>
> +out:
> + spin_unlock_irqrestore(&assigned_dev->kvm->assigned_dev_lock, flags);
>   return IRQ_HANDLED;
>  }
>
> @@ -186,6 +193,7 @@ static irqreturn_t kvm_assigned_dev_intr(int irq, void
> *dev_id) static void kvm_assigned_dev_ack_irq(struct kvm_irq_ack_notifier
> *kian) {
>   struct kvm_assigned_dev_kernel *dev;
> + unsigned long flags;
>
>   if (kian->gsi == -1)
>   return;
> @@ -198,10 +206,12 @@ static void kvm_assigned_dev_ack_irq(struct
> kvm_irq_ack_notifier *kian) /* The guest irq may be shared so this ack may
> be
>* from another device.
>*/
> + spin_lock_irqsave(&dev->kvm->assigned_dev_lock, flags);
>   if (dev->host_irq_disabled) {
>   enable_irq(dev->host_irq);
>   dev->host_irq_disabled = false;
>   }
> + spin_unlock_irqrestore(&dev->kvm->assigned_dev_lock, flags);
>  }
>
>  /* The function implicit hold kvm->lock mutex due to cancel_work_sync() */
> @@ -955,6 +965,7 @@ static struct kvm *kvm_create_vm(void)
>   kvm->mm = current->mm;
>   atomic_inc(&kvm->mm->mm_count);
>   spin_lock_init(&kvm->mmu_lock);
> + spin_lock_init(&kvm->assigned_dev_lock);
>   kvm_io_bus_init(&kvm->pio_bus);
>   mutex_init(&kvm->lock);
>   kvm_io_bus_init(&kvm->mmio_bus);
>
>
> - End forwarded message -



Re: gettimeofday "slow" in RHEL4 guests

2008-12-29 Thread Yang, Sheng

On Monday 29 December 2008 02:38:07 Marcelo Tosatti wrote:
> On Tue, Nov 25, 2008 at 01:52:59PM +0100, Andi Kleen wrote:
> > > But yeah - the remapping of HPET timers to virtual HPET timers sounds
> > > pretty tough. I wonder if one could overcome that with a little
> > > hardware support though ...
> >
> > For gettimeofday better make TSC work. Even in the best case (no
> > virtualization) it is much faster than HPET because it sits in the CPU,
> > while HPET is far away on the external south bridge.
>
> The tsc clock on older Linux 2.6 kernels compensates for lost ticks.
> The algorithm uses the PIT count (latched) to measure the delay between
> interrupt generation and handling, and sums that value, on the next
> interrupt, to the TSC delta.
>
> Sheng investigated this problem in the discussions before in-kernel PIT
> was merged:
>
> http://www.mail-archive.com/kvm-de...@lists.sourceforge.net/msg13873.html
>
> The algorithm overcompensates for lost ticks and the guest time runs
> faster than the host's.
>
> There are two issues:
>
> 1) A bug in the in-kernel PIT which miscalculates the count value.
>
> 2) For the case where more than one interrupt is lost, and later
> reinjected, the value read from PIT count is meaningless for the purpose
> of the tsc algorithm. The count is interpreted as the delay until the
> next interrupt, which is not the case with reinjection.
>
> As Sheng mentioned in the thread above, Xen pulls back the TSC value
> when reinjecting interrupts. VMWare ESX has a notion of "virtual TSC",
> which I believe is similar in this context.
>
> For KVM I believe the best immediate solution (for now) is to provide an
> option to disable reinjection, behaving similarly to real hardware. The
> advantage is simplicity compared to virtualizing the time sources.
>
> The QEMU PIT emulation has a limit on the rate of interrupt reinjection,
> perhaps something similar should be investigated in the future.
>
> The following patch (which contains the bugfix for 1) and disabled
> reinjection) fixes the severe time drift on RHEL4 with "clock=tsc".
> What I'm proposing is to condition reinjection with an option
> (-kvm-pit-no-reinject or something).

I agree that it should go with a user space option to disable reinjection, as
it's hard to overcome the problem of delayed interrupt injection...
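
A minimal userspace sketch of what such an option could look like, assuming a hypothetical `reinject` flag on a reduced timer struct (`struct pit_timer`, `pit_tick()` and `pit_inject()` are illustration names, not the i8254.c code). It mirrors the clamp in the quoted patch: with reinjection off, at most one tick is ever pending.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of the PIT timer state; only the pending counter. */
struct pit_timer {
	atomic_int pending;   /* ticks raised but not yet injected */
	bool reinject;        /* hypothetical knob: false = drop missed ticks */
};

/* Called once per timer period. */
static void pit_tick(struct pit_timer *pt)
{
	int pending = atomic_fetch_add(&pt->pending, 1) + 1;

	/* Without reinjection, missed ticks are simply lost. The read and
	 * the clamp are two steps here, as in the quoted patch. */
	if (!pt->reinject && pending > 1)
		atomic_store(&pt->pending, 1);
}

/* Called when the guest accepts one timer interrupt.
 * Returns false if nothing was pending. */
static bool pit_inject(struct pit_timer *pt)
{
	if (atomic_load(&pt->pending) == 0)
		return false;
	atomic_fetch_sub(&pt->pending, 1);
	return true;
}
```

With `reinject` set, three missed periods leave three interrupts queued for the guest; with it cleared, the guest sees one.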

-- 
regards
Yang, Sheng

> Comments or better ideas?
>
>
> diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
> index e665d1c..608af7b 100644
> --- a/arch/x86/kvm/i8254.c
> +++ b/arch/x86/kvm/i8254.c
> @@ -201,13 +201,16 @@ static int __pit_timer_fn(struct kvm_kpit_state *ps)
>   if (!atomic_inc_and_test(&pt->pending))
>   set_bit(KVM_REQ_PENDING_TIMER, &vcpu0->requests);
>
> + if (atomic_read(&pt->pending) > 1)
> + atomic_set(&pt->pending, 1);
> +
>   if (vcpu0 && waitqueue_active(&vcpu0->wq))
>   wake_up_interruptible(&vcpu0->wq);
>
>   hrtimer_add_expires_ns(&pt->timer, pt->period);
>   pt->scheduled = hrtimer_get_expires_ns(&pt->timer);
>   if (pt->period)
> - ps->channels[0].count_load_time = hrtimer_get_expires(&pt->timer);
> + ps->channels[0].count_load_time = ktime_get();
>
>   return (pt->period == 0 ? 0 : 1);
>  }


Re: [PATCH] KVM: x86: Don't deliver PIC interrupts to disabled APICs - v2

2008-10-22 Thread Yang, Sheng

On Thursday 23 October 2008 04:44:48 Jan Kiszka wrote:
> Jan Kiszka wrote:
> > Jan Kiszka wrote:
> >> Jan Kiszka wrote:
> >>> Avi Kivity wrote:
> >>>> Jan Kiszka wrote:
> >>>>> [ taking Sheng's comments into account ]
> >>>>>
> >>>>> The logic of kvm_apic_accept_pic_intr has a minor, practically hardly
> >>>>> relevant incorrectness: PIC interrupts are still delivered even if
> >>>>> the APIC of VCPU0 (BSP) is disabled. This does not comply with the
> >>>>> Virtual Wire mode according to the Intel MP spec.
> >>>>
> >>>> This breaks Windows XP with the Standard PC HAL, so I am unapplying
> >>>> this patch.
> >>>
> >>> Hmm, this points to either an APIC setup or BIOS bug. To my
> >>> understanding, the Standard PC HAL should not fiddle with the APIC, so
> >>> what the BIOS leaves behind should count. But I think I found no
> >>> traces of APIC manipulation in rombios32.c.
> >>
> >> Manipulation on UP systems. There is fiddling for SMP. But I will check
> >> again.
> >
> > I take everything back: For yet unknown reasons Windows' standard HAL
> > actually decides to disable the APIC actively. Either there is a
> > short-path around a disabled APIC for Virtual Wire mode in Real Life
> > (though I fail to read this out of the spec), or Windows simply has a
> > bug here (MS insists on NOT supporting the Standard HAL on APIC systems
> > [1] - precisely the setup KVM is providing). Sheng, any comments on
> > this? Guess we have to live with the previous version, maybe with some
> > refactoring + commenting.
>
> I was curious and played with my corresponding qemu patch [1]: It works
> with the same Windows image that hangs under KVM. Then I looked at a
> prominent guest-visible difference: the reported CPU type. QEMU claims
> to provide a CPU called "qemu64" by default. Changing this to Pentium2
> or newer makes Windows issue the lethal APIC disable command. On the
> other hand, calling kvm with "-cpu pentium" makes Windows boot again.
>
> However, I still can't tell from this if we see a Windows bug or if the
> change is incorrect (but my feeling tends to the former).

Confirmed that "-cpu pentium" solves the problem. But...

Sorry, I've found that I neglected some information in the spec: SDM 3B
5.3.1 "External Interrupts".

"When the local APIC is global/hardware disabled, these pins are configured as 
INTR and NMI pins, respectively."

So it's correct to inject PIC interrupts when the LAPIC is hardware disabled. I
think we can drop this patch...

(I will be more careful with this kind of issue next time...)
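
The rule quoted from the SDM can be written down as a small predicate. This is a hedged sketch, not the KVM function: `struct lapic_model` and `accept_pic_intr()` are hypothetical names, and the only facts assumed are the LVT layout (mask at bit 16, delivery mode in bits 10:8, ExtINT encoded as 111b) and that only VCPU0 (the BSP) takes PIC interrupts, as in the quoted patch description.

```c
#include <stdbool.h>
#include <stdint.h>

#define LVT_MASKED            (1u << 16)            /* LVT mask bit */
#define LVT_DELIVERY_MODE(v)  (((v) >> 8) & 0x7u)   /* bits 10:8 */
#define DELIVERY_MODE_EXTINT  0x7u

/* Hypothetical, reduced view of a virtual local APIC. */
struct lapic_model {
	bool hw_enabled;   /* global (hardware) enable state */
	uint32_t lvt0;     /* LVT LINT0 register */
};

/* Should a PIC interrupt be delivered to this VCPU? */
static bool accept_pic_intr(const struct lapic_model *apic, int vcpu_id)
{
	if (vcpu_id != 0)
		return false;            /* only the BSP is wired to the PIC */
	if (!apic->hw_enabled)
		return true;             /* LINT pins act as plain INTR/NMI */
	/* Virtual Wire mode: LINT0 unmasked and programmed as ExtINT. */
	return !(apic->lvt0 & LVT_MASKED) &&
	       LVT_DELIVERY_MODE(apic->lvt0) == DELIVERY_MODE_EXTINT;
}
```

The hardware-disabled branch is the case the dropped patch would have rejected, which matches the Windows XP Standard HAL hang reported earlier in the thread.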
--
regards
Yang, Sheng

>
> Jan
>
> [1] http://permalink.gmane.org/gmane.comp.emulators.qemu/31429

Re: LAPIC soft-disable vs. LVT masking

2008-10-20 Thread Yang, Sheng

On Monday 20 October 2008 16:49:11 Jan Kiszka wrote:
> Hi Sheng,
>
> obviously, I meditated too long over the APIC specs and VAPIC code of
> KVM: When the guest resets the soft-enable bit in SVR, the in-kernel
> APIC implementation also sets the LVT masked bits - so far, so fine
> (according to specs). But I failed to read out of that doc if those mask
> bits are permanently set (until the guest clears them again) or only
> until the soft-disabling ends (ie. they are restored to their previous
> state - QEMU goes this way). Can you clarify?
>
> Thanks,
> Jan
>
Hi Jan

I can't find related information in the spec either. But I think that when the
software enable bit is cleared, the spec says the mask bits are set, which means
the content of the registers is changed. It says nothing about what happens when
the software enable bit is set again, so I think the mask state is probably
retained after software enable (that seems a little more likely).

I will give an update if I get more information.
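
The reading leaned toward above can be sketched as follows. It models only that interpretation (mask bits set on soft-disable, nothing restored on re-enable) and is not a statement of what the spec mandates; `struct lapic_regs`, `lapic_write_svr()` and `NUM_LVT` are hypothetical. The bit positions (SVR software enable at bit 8, LVT mask at bit 16) are the architectural ones.

```c
#include <stdbool.h>
#include <stdint.h>

#define SVR_SW_ENABLE  (1u << 8)    /* APIC software enable bit in SVR */
#define LVT_MASKED     (1u << 16)   /* mask bit in every LVT entry */
#define NUM_LVT        6            /* sketch value, not tied to a CPU model */

/* Hypothetical register file: just SVR and the LVT entries. */
struct lapic_regs {
	uint32_t svr;
	uint32_t lvt[NUM_LVT];
};

static void lapic_write_svr(struct lapic_regs *apic, uint32_t val)
{
	bool now_enabled = (val & SVR_SW_ENABLE) != 0;

	apic->svr = val;
	if (!now_enabled) {
		/* Soft-disable: the register contents really change. */
		for (int i = 0; i < NUM_LVT; i++)
			apic->lvt[i] |= LVT_MASKED;
	}
	/* On re-enable nothing is restored; the guest has to clear
	 * the mask bits itself. */
}
```

The alternative described in the question (restoring the previous mask state on re-enable, as QEMU did at the time) would need a saved copy of each entry's mask bit taken before the loop.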
--
regards
Yang, Sheng



Re: [PATCH 0/4] Enable MSI support for KVM VT-d

2008-10-19 Thread Yang, Sheng

On Monday 20 October 2008 12:09:20 Zhang, Xiantao wrote:
> Avi Kivity wrote:
> > Sheng Yang wrote:
> >> Hi, Avi
> >>
> >> This patchset enables MSI support for KVM VT-d.
> >>
> >> These are only the kernel space ones. The third patch would also
> >> go to x86 upstream.
> >>
> >> The userspace code would look like this:
> >>
> >> assigned_irq_data.guest_msi_addr = *(uint32_t *)(d->msi_cap + 4);
> >> assigned_irq_data.guest_msi_data = *(uint16_t *)(d->msi_cap + 8);
> >> assigned_irq_data.flags |= KVM_DEV_IRQ_ASSIGN_ENABLE_MSI;
> >> r = kvm_assign_irq(kvm_context, &assigned_irq_data);
> >>
> >> I've tested the patchset with some userspace hack, it works well.
> >
> > Can you resend this patch with all the updates, as well as the
> > userspace changes?
>
> Maybe Sheng needs to make it work on kvm/ia64, and at least the changes
> can't break ia64 side. Xiantao
>
Yes, I will make sure it won't break ia64. I will repost the patchset soon.

But Avi, for the userspace, Amit's patch is still not checked in, so I
haven't written a complete version because the code base is missing. I only have
an experimental patch made by hand, which exposes the MSI capability to the guest
and enables MSI when the guest writes the MSI enable bit. Well, I prefer to give
you a complete version after Amit's patch is there. (I will summarize the
userspace changes in the first mail).

--
regards
Yang, Sheng

Re: [PATCH] KVM: VMX: Move private memory slot position

2008-10-13 Thread Yang, Sheng

On Saturday 13 September 2008 16:55:27 Avi Kivity wrote:
> Avi Kivity wrote:
> > Yang, Sheng wrote:
> >> On Thursday 04 September 2008 11:30:20 Yang, Sheng wrote:
> >>> From ebe4ea311305d2910dcdcff2510662da0dc2c742 Mon Sep 17 00:00:00 2001
> >>> From: Sheng Yang <[EMAIL PROTECTED]>
> >>> Date: Thu, 4 Sep 2008 03:11:48 +0800
> >>> Subject: [PATCH] KVM: VMX: Move private memory slot position
> >>>
> >>> PCI device assignment would map guest MMIO spaces as separate slot, so
> >>> it is possible that the device has more than 2 MMIO spaces and
> >>> overwrite current private memslot.
> >>>
> >>> The patch move private memory slot to the top of userspace visible
> >>> memory slots.
> >>
> >> Avi, these two?
> >
> > Thanks, applied both.
> >
> > Note that kvm now exports the number of slots using KVM_CAP_NR_MEMSLOTS,
> > so userspace could be made dynamic.
>
> Well, the kernel change causes the host to oops while booting Windows on
> an i386 pae host.  No idea why.

I've found the reason... It's because kvm_mmu_page->slot_bitmap is an
unsigned long, and with KVM_MEMORY_SLOTS + xxx the slot index goes beyond 32 on
PAE, so memory gets corrupted.

But should we reduce the number of supported memory slots to 28, extend
slot_bitmap, or use some other method? slot_bitmap is accessed with bitops, so
keeping it an unsigned long would be better... For now, reducing the number of
supported memory slots seems reasonable to me.

(I also want to have this fix in 2.6.28, since some devices would easily
overlap with the current private memory slot.)
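
The arithmetic behind the oops can be checked with a few lines of userspace C. This is a sketch with assumed numbers: 4 private slots is an assumption chosen because it makes the proposed 28 fit exactly (28 + 4 = 32), and `slots_fit_bitmap()` is a hypothetical helper, not a kernel function.

```c
#include <limits.h>
#include <stdbool.h>

/* Number of bits in a bitmap held in one long of the given size. */
static unsigned int bitmap_bits(unsigned int bytes_per_long)
{
	return bytes_per_long * CHAR_BIT;
}

/* Every slot index, user-visible and private, must be a valid bit
 * position in the single-long bitmap. */
static bool slots_fit_bitmap(unsigned int user_slots,
			     unsigned int private_slots,
			     unsigned int bytes_per_long)
{
	return user_slots + private_slots <= bitmap_bits(bytes_per_long);
}
```

On a 32-bit host `slots_fit_bitmap(32, 4, 4)` is false, which is the corruption case described above, while `slots_fit_bitmap(28, 4, 4)` and the 64-bit case `slots_fit_bitmap(32, 4, 8)` are both true. In the kernel the same condition could be enforced at compile time with a `BUILD_BUG_ON()` against `BITS_PER_LONG`.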

--
regards
Yang, Sheng

Re: [PATCH 4/4] KVM: x86: Enable MSI for assigned device

2008-10-05 Thread Yang, Sheng

On Sunday 05 October 2008 18:27:20 Avi Kivity wrote:
> Sheng Yang wrote:
> > As well as export ioapic_get_delivery_bitmask().
> >
> > @@ -132,8 +177,12 @@ static void
> > kvm_assigned_dev_interrupt_work_handler(struct work_struct *work) *
> > finer-grained lock, update this
> >*/
> >   mutex_lock(&assigned_dev->kvm->lock);
> > - kvm_set_irq(assigned_dev->kvm,
> > - assigned_dev->guest_irq, 1);
> > + if (assigned_dev->guest_intr_type == KVM_ASSIGNED_DEV_INTR)
> > + kvm_set_irq(assigned_dev->kvm, assigned_dev->guest_irq, 1);
> > + else if (assigned_dev->guest_intr_type == KVM_ASSIGNED_DEV_MSI) {
> > + assigned_device_msi_dispatch(assigned_dev);
> > + enable_irq(assigned_dev->host_irq);
> > + }
>
> What happens if the host interrupt is level triggered pci and the guest
> interrupt is msi?  Or do we not support this combination?
>
> If not, how do we prevent it?

I think we don't need to support this combination. Currently, a failure to enable
MSI falls back to enabling the IRQ. And when MSI is disabled, the MSI capability
should not be exposed to the guest. Also, if the guest fails to enable MSI, the
MSI enable bit in PCI configuration space should be set to 0.

So I would like to add another return value to tell userspace that enabling MSI
failed. And before trying to enable it, we may also provide an interface for
userspace to find out whether MSI can be enabled.

> > diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> > index 4269be1..a9b408b 100644
> > --- a/include/linux/kvm.h
> > +++ b/include/linux/kvm.h
> > @@ -493,9 +493,13 @@ struct kvm_assigned_irq {
> >   __u32 assigned_dev_id;
> >   __u32 host_irq;
> >   __u32 guest_irq;
> > + __u16 guest_msi_data;
>
> Need padding here, just to be safe.
>
> > + __u32 guest_msi_addr;
>
> Is u32 enough for the msi address?  Including ia64?
>
> >   __u32 flags;
> >  };
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index e24280b..dc6a046 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -300,8 +300,11 @@ struct kvm_assigned_dev_kernel {
> >   int host_busnr;
> >   int host_devfn;
> >   int host_irq;
> > + u16 guest_msi_addr;
>
> u32?  or even u64?

Oops...

Well, this is enough for MSI (I mean u16 for msi_data), since the PCI spec defines
the size. But I'd better extend msi_data to u32; later I will extend msi_addr
to u64, or to msi_addr_lo and msi_addr_hi, to support MSI-X.
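
For reference, the field widths under discussion follow from the standard PCI MSI capability layout: a 16-bit Message Control word at offset 2, a 32-bit Message Address at offset 4, an optional upper 32 address bits at offset 8 when the 64-bit capable bit is set, and a 16-bit Message Data word after the address. The sketch below reads those fields from a config-space byte array; `struct msi_msg_model` and `msi_read_cap()` are hypothetical names, not part of the patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define MSI_CTRL_ENABLE  0x0001u   /* Message Control bit 0 */
#define MSI_CTRL_64BIT   0x0080u   /* Message Control bit 7 */

/* Hypothetical holder wide enough for both capability variants. */
struct msi_msg_model {
	uint64_t addr;
	uint16_t data;
	bool enabled;
};

/* Config space is little-endian; read byte-wise to stay portable. */
static uint16_t rd16(const uint8_t *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t rd32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* `cap` points at the first byte of the MSI capability. */
static struct msi_msg_model msi_read_cap(const uint8_t *cap)
{
	struct msi_msg_model msg;
	uint16_t ctrl = rd16(cap + 2);

	msg.enabled = (ctrl & MSI_CTRL_ENABLE) != 0;
	msg.addr = rd32(cap + 4);
	if (ctrl & MSI_CTRL_64BIT) {
		msg.addr |= (uint64_t)rd32(cap + 8) << 32;
		msg.data = rd16(cap + 12);
	} else {
		msg.data = rd16(cap + 8);
	}
	return msg;
}
```

A 64-bit `addr` covers both variants, which is one way to answer the "u32? or even u64?" question quoted above; splitting it into low and high halves, as proposed in the reply, carries the same information.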

--
regards
Yang, Sheng
>
> --
> error compiling committee.c: too many arguments to function



[PATCH 2/2] KVM: Implement OR logic on guest shared IRQ line

2008-10-02 Thread Yang, Sheng

[Oops, Outlook didn't even notice that the last mail had no subject and sent
it out anyway...]

From eab008da232cd9cc09dd8071bd15796c8e46f6bd Mon Sep 17 00:00:00 2001
From: Sheng Yang <[EMAIL PROTECTED]>
Date: Thu, 2 Oct 2008 14:21:06 +0800
Subject: [PATCH 2/2] KVM: Implement OR logic on guest shared IRQ line

Now that IOAPIC and PIC treat every kvm_set_irq() as coming from one separate
interrupt source, implement OR logic based on this.

Note that every caller should ensure that it calls kvm_set_irq()
only when the interrupt state of the source is changing (which also means calling
kvm_set_irq() in pairs).

Signed-off-by: Sheng Yang <[EMAIL PROTECTED]>
---
 arch/x86/kvm/i8259.c |   12 +++-
 arch/x86/kvm/irq.c   |    6 +-
 arch/x86/kvm/irq.h   |    1 +
 arch/x86/kvm/x86.c   |    3 ---
 virt/kvm/ioapic.c    |    9 +
 virt/kvm/ioapic.h    |    1 +
 6 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c
index 17e41e1..d2b05be 100644
--- a/arch/x86/kvm/i8259.c
+++ b/arch/x86/kvm/i8259.c
@@ -142,9 +142,19 @@ void kvm_pic_update_irq(struct kvm_pic *s)
 void kvm_pic_set_irq(void *opaque, int irq, int level)
 {
struct kvm_pic *s = opaque;
+   struct kvm_kpic_state *entry;

+   entry = &s->pics[irq >> 3];
if (irq >= 0 && irq < PIC_NUM_PINS) {
-   pic_set_irq1(&s->pics[irq >> 3], irq & 7, level);
+   /* OR logic on level trig for sharing interrupt */
+   if (entry->elcr) {
+   s->irq_counts[irq] += (level == 1 ? 1 : -1);
+   ASSERT(s->irq_counts[irq] >= 0);
+   if (s->irq_counts[irq] != 0)
+   level = 1;
+   }
+
+   pic_set_irq1(entry, irq & 7, level);
pic_update_irq(s);
}
 }
diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
index 8c1b9c5..8999d9d 100644
--- a/arch/x86/kvm/irq.c
+++ b/arch/x86/kvm/irq.c
@@ -100,7 +100,11 @@ void __kvm_migrate_timers(struct kvm_vcpu *vcpu)
__kvm_migrate_pit_timer(vcpu);
 }

-/* This should be called with the kvm->lock mutex held */
+/*
+ * The caller of kvm_set_irq() should hold kvm->lock mutex, and ensure
+ * that kvm_set_irq() was called in pair when asserting and deasserting with
+ * level trig interrupt source for the same irq.
+ */
 void kvm_set_irq(struct kvm *kvm, int irq, int level)
 {
/* Not possible to detect if the guest uses the PIC or the
diff --git a/arch/x86/kvm/irq.h b/arch/x86/kvm/irq.h
index 9f157c9..ef9e828 100644
--- a/arch/x86/kvm/irq.h
+++ b/arch/x86/kvm/irq.h
@@ -60,6 +60,7 @@ struct kvm_kpic_state {

 struct kvm_pic {
struct kvm_kpic_state pics[2]; /* 0 is master pic, 1 is slave pic */
+   int irq_counts[PIC_NUM_PINS];
irq_request_func *irq_request;
void *irq_request_opaque;
int output; /* intr from master PIC */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e685d48..71a0f81 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -144,9 +144,6 @@ static void kvm_assigned_dev_interrupt_work_handler(struct 
work_struct *work)
kvm_put_kvm(assigned_dev->kvm);
 }

-/* FIXME: Implement the OR logic needed to make shared interrupts on
- * this line behave properly
- */
 static irqreturn_t kvm_assigned_dev_intr(int irq, void *dev_id)
 {
struct kvm_assigned_dev_kernel *assigned_dev =
diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
index c8f939c..d9526e0 100644
--- a/virt/kvm/ioapic.c
+++ b/virt/kvm/ioapic.c
@@ -275,6 +275,15 @@ void kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int 
irq, int level)

if (irq >= 0 && irq < IOAPIC_NUM_PINS) {
entry = ioapic->redirtbl[irq];
+
+   /* OR logic on level trig for sharing interrupt */
+   if (entry.fields.trig_mode == 1) {
+   ioapic->irq_counts[irq] += (level == 1 ? 1 : -1);
+   ASSERT(ioapic->irq_counts[irq] >= 0);
+   if (ioapic->irq_counts[irq] != 0)
+   level = 1;
+   }
+
level ^= entry.fields.polarity;
if (!level)
ioapic->irr &= ~mask;
diff --git a/virt/kvm/ioapic.h b/virt/kvm/ioapic.h
index b52732f..6ed7dc7 100644
--- a/virt/kvm/ioapic.h
+++ b/virt/kvm/ioapic.h
@@ -56,6 +56,7 @@ struct kvm_ioapic {
u8 dest_id;
} fields;
} redirtbl[IOAPIC_NUM_PINS];
+   int irq_counts[IOAPIC_NUM_PINS];
struct kvm_io_device dev;
struct kvm *kvm;
void (*ack_notifier)(void *opaque, int irq);
--
1.5.3


0002-KVM-Implement-OR-logic-on-guest-shared-IRQ-line.patch
Description: 0002-KVM-Implement-OR-logic-on-guest-shared-IRQ-line.patch

[PATCH 1/2] KVM: Separate interrupt sources

2008-10-02 Thread Yang, Sheng

From e6b784985c14afe9805bfc8706858884b0259ab5 Mon Sep 17 00:00:00 2001
From: Sheng Yang <[EMAIL PROTECTED]>
Date: Thu, 2 Oct 2008 14:20:22 +0800
Subject: [PATCH 1/2] KVM: Separate interrupt sources

Keep a record of the current interrupt state before updating, and don't
assert/deassert repeatedly, so that every caller of kvm_set_irq() can be identified
as a separate interrupt source for IOAPIC/PIC to implement logical OR of level
triggered interrupts on one IRQ line.

Note that userspace devices are treated as one device for each IRQ line. The
correctness of interrupt sharing on each IRQ line should be ensured by the
userspace program (QEmu).

Signed-off-by: Sheng Yang <[EMAIL PROTECTED]>
---
 arch/x86/kvm/x86.c       |   25 +
 include/linux/kvm_host.h |    3 +++
 2 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ad7a227..e685d48 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -135,8 +135,11 @@ static void kvm_assigned_dev_interrupt_work_handler(struct 
work_struct *work)
 * finer-grained lock, update this
 */
mutex_lock(&assigned_dev->kvm->lock);
-   kvm_set_irq(assigned_dev->kvm,
-   assigned_dev->guest_irq, 1);
+   if (assigned_dev->irq_state == 0) {
+   kvm_set_irq(assigned_dev->kvm,
+   assigned_dev->guest_irq, 1);
+   assigned_dev->irq_state = 1;
+   }
mutex_unlock(&assigned_dev->kvm->lock);
kvm_put_kvm(assigned_dev->kvm);
 }
@@ -165,7 +168,10 @@ static void kvm_assigned_dev_ack_irq(struct 
kvm_irq_ack_notifier *kian)

dev = container_of(kian, struct kvm_assigned_dev_kernel,
   ack_notifier);
-   kvm_set_irq(dev->kvm, dev->guest_irq, 0);
+   if (dev->irq_state == 1) {
+   kvm_set_irq(dev->kvm, dev->guest_irq, 0);
+   dev->irq_state = 0;
+   }
enable_irq(dev->host_irq);
 }

@@ -1993,7 +1999,18 @@ long kvm_arch_vm_ioctl(struct file *filp,
goto out;
if (irqchip_in_kernel(kvm)) {
mutex_lock(&kvm->lock);
-   kvm_set_irq(kvm, irq_event.irq, irq_event.level);
+   /*
+* Take one IRQ line as from one device, shared IRQ
+* line should also be handled in the userspace before
+* use KVM_IRQ_LINE ioctl to change IRQ line state.
+*/
+   if (kvm->userspace_intrsource_states[irq_event.irq]
+   != irq_event.level) {
+   kvm_set_irq(kvm, irq_event.irq,
+   irq_event.level);
+   kvm->userspace_intrsource_states[irq_event.irq]
+   = irq_event.level;
+   }
mutex_unlock(&kvm->lock);
r = 0;
}
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 73b7c52..8c2a504 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -129,6 +129,8 @@ struct kvm {
unsigned long mmu_notifier_seq;
long mmu_notifier_count;
 #endif
+
+   int userspace_intrsource_states[KVM_IOAPIC_NUM_PINS];
 };

 /* The guest did something we don't support. */
@@ -303,6 +305,7 @@ struct kvm_assigned_dev_kernel {
int host_irq;
int guest_irq;
int irq_requested;
+   int irq_state;
struct pci_dev *dev;
struct kvm *kvm;
 };
--
1.5.3


0001-KVM-Separate-interrupt-sources.patch
Description: 0001-KVM-Separate-interrupt-sources.patch

[RFC][PATCH 0/2] Fix guest shared interrupt with in-kernel irqchip

2008-10-02 Thread Yang, Sheng

To deal with the guest shared interrupt bug in the in-kernel irqchip, we should:

1. Identify each level triggered interrupt source.
2. Implement logical OR on the same IRQ line for each interrupt source.

Here I chose a simple method: the caller of kvm_set_irq() is responsible
for identifying interrupt sources, and IOAPIC/PIC ensure logical OR on the IRQ line.

The alternative method of identification would be a process to
request/allocate/free a device identity, so that kvm_set_irq() becomes responsible
for identifying interrupt sources. But I think that is too complicated and
unnecessary, since the caller of set_irq() should be aware of the IRQ state.

The patch treats all userspace devices as one source. This is already ensured
by QEmu, which implements logical OR on the IRQ line if the IRQ line is shared in
userspace.
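
The two steps above can be sketched in userspace C. This is an illustration under assumptions, not the patch itself: `struct irqchip_model`, `struct irq_source`, `chip_set_irq()` and `source_set_level()` are hypothetical names, `NUM_PINS` is a sketch value, and every pin is treated as level triggered. Each source remembers its own state and forwards only changes (step 1); the chip counts asserting sources and keeps the line high while the count is non-zero (step 2).

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_PINS 24   /* sketch value */

struct irq_line {
	int assert_count;   /* number of sources currently asserting */
	bool level;         /* resulting (ORed) line level */
};

/* Hypothetical chip model: one counter per pin. */
struct irqchip_model {
	struct irq_line pins[NUM_PINS];
};

/* One interrupt source, e.g. one assigned device or "all of userspace". */
struct irq_source {
	int pin;
	bool asserted;
};

/* Chip side: logical OR implemented as a count of asserting sources. */
static void chip_set_irq(struct irqchip_model *chip, int pin, bool level)
{
	struct irq_line *line;

	if (pin < 0 || pin >= NUM_PINS)
		return;                  /* check bounds before indexing */
	line = &chip->pins[pin];
	line->assert_count += level ? 1 : -1;
	assert(line->assert_count >= 0); /* calls must come in pairs */
	line->level = line->assert_count != 0;
}

/* Source side: forward only state changes so asserts and deasserts pair up. */
static void source_set_level(struct irqchip_model *chip,
			     struct irq_source *src, bool level)
{
	if (src->asserted == level)
		return;
	src->asserted = level;
	chip_set_irq(chip, src->pin, level);
}
```

With two sources on the same pin, the line stays high until both have deasserted, and a source that asserts twice in a row is counted once.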

Comments are welcome! The patches are untested, because our boxes are down
during the holiday (and my Linux desktop at the company is also down, so I have
to post the patches with Outlook)...

Thanks!
--
regards
Yang, Sheng

Re: Remaining passthrough/VT-d tasks list

2008-09-27 Thread Yang, Sheng

On Wednesday 24 September 2008 17:51:24 Avi Kivity wrote:
> Yang, Sheng wrote:
> >> 2. shared guest pci interrupts
> >>
> >> That's a correctness issue.  No matter how many interrupts we have, we
> >> may have sharing issues.  Of course with only three the issue is very
> >> pressing since we will get sharing with just a few devices.  Currently
> >> if two assigned devices share a guest interrupt, or if an emulated
> >> device shares an interrupt with an assigned device, things will break.
> >>
> >> They need to be fixed independently.
> >
> > About the second issue, I don't understand how it would break... Would
> > you please give more details on this? It's a QEmu bug or IOAPIC bug?
>
> It's a kernel bug.
>
> Both the device assignment code and KVM_SET_IRQ ioctl() call
> kvm_set_irq(), so the last one wins.  We need logical-OR mixing between
> the various sources.  Just like pci_set_irq() in qemu, only for the kernel.
>
> Userspace is one source, each assigned device irq is a separate source.
>
I am working on this now.

--
regards
Yang, Sheng

Re: Remaining passthrough/VT-d tasks list

2008-09-27 Thread Yang, Sheng

On Sunday 28 September 2008 13:04:06 Avi Kivity wrote:
> Tian, Kevin wrote:
> >> No. Maybe the Neocleus polarity trick (which also reduces performance).
> >
> > To my knowledge, Neocleus polarity trick can't solve this isolation
> > issue, which just provides one effecient way to track
> > assertion/deassertion transition on the irq line. For example, reverse
> > polarity when receiving an instance, and then a new irq instance would
> > occur when all devices de- assert on shared irq line, and then recover
> > the polarity. In your concerned case where guest driver misbehaves, this
> > polarity trick can't work neither as one device always asserts the line.
>
> You're right, I didn't think it through.
>
> If there was a standard way to mask pci irqs, it might have worked, but
> there isn't, unfortunately.
>
One proposal:

If we suffer from an IRQ storm on a level-triggered irq line, there are two
possibilities: a host issue or a guest issue.

If it's a host issue, the host should try to stop it. If it can't, the IRQ line
would be disabled, and the guest device isn't functional either.

If it's a guest issue, the guest should try to stop it and prevent it from
causing trouble in the host. KVM should try its best to do this, including
disabling the guest device. So the guest device won't be functional either.

Based on the above, we can assume that the IRQ storm is caused by the assigned
guest device, and try to stop the device from doing this. (Yeah, either way,
the guest device won't survive.)

I think we can bring a little QoS concept in here (stolen from Eddie :) ). The
assumption is that the normal rate at which a device delivers interrupts is
much lower than that of a continuously asserted level-triggered line when the
EOI is written immediately. So we can do something with the gap.

Measure the calling rate of our irq handler; if it exceeds some reasonable
threshold, KVM would try to stop the guest device for a while (even if it
doesn't know whether the guest device caused this).

First, try setting the interrupt disable bit in the Command register, wait for
a period of time, then check again.

If the irq storm can't be stopped, KVM tries a more aggressive way: do a
Function Level Reset. That should be the end of the device's life...

Oh, of course, if even FLR doesn't solve the IRQ storm, that's the host's
issue. Let's wait for the host to disable the IRQ line - of course, the guest
device can't be recovered then either.

It's just an initial proposal, but I think it may work. The problem is whether
the gap is easy to catch... But at least, I think a physically continuous
storm should look much different from any working device...
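
A minimal sketch of the rate check and escalation described above, assuming a
fixed time window and a per-window interrupt threshold. Every name here
(storm_detector, storm_on_irq, the STORM_* actions) is hypothetical, and the
window and threshold values are left to the caller because the message does
not propose any.

```c
#include <stdint.h>

enum storm_action {
    STORM_NONE = 0,      /* rate looks normal, do nothing */
    STORM_DISABLE_INTX,  /* step 1: set the interrupt disable bit */
    STORM_FLR,           /* step 2: Function Level Reset */
    STORM_GIVE_UP        /* leave it to the host to disable the line */
};

struct storm_detector {
    uint64_t window_ns;        /* length of one measurement window */
    uint32_t threshold;        /* interrupts tolerated per window */
    uint64_t window_start_ns;  /* start of the current window */
    uint32_t count;            /* interrupts seen in the current window */
    enum storm_action level;   /* last escalation step taken */
};

/*
 * Call once per invocation of the irq handler. Returns the action to
 * take now. In this sketch the escalation only moves forward; a real
 * implementation would probably step back after a quiet period.
 */
static enum storm_action storm_on_irq(struct storm_detector *d,
                                      uint64_t now_ns)
{
    if (now_ns - d->window_start_ns >= d->window_ns) {
        /* The previous window ended without a storm: start a new one. */
        d->window_start_ns = now_ns;
        d->count = 0;
    }

    if (++d->count <= d->threshold)
        return STORM_NONE;

    /* Threshold exceeded: restart the window and escalate one step. */
    d->window_start_ns = now_ns;
    d->count = 0;
    if (d->level != STORM_GIVE_UP)
        d->level = (enum storm_action)(d->level + 1);
    return d->level;
}
```

The "wait for a period of time, then check again" step maps onto restarting
the window after each escalation: the device gets one more window to calm
down before the next, more aggressive action is returned.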

--
regards
Yang, Sheng

Re: Remaining passthrough/VT-d tasks list

2008-09-27 Thread Yang, Sheng

On Sunday 28 September 2008 13:04:06 Avi Kivity wrote:
> Tian, Kevin wrote:
> >> No. Maybe the Neocleus polarity trick (which also reduces performance).
> >
> > To my knowledge, Neocleus polarity trick can't solve this isolation
> > issue, which just provides one effecient way to track
> > assertion/deassertion transition on the irq line. For example, reverse
> > polarity when receiving an instance, and then a new irq instance would
> > occur when all devices de- assert on shared irq line, and then recover
> > the polarity. In your concerned case where guest driver misbehaves, this
> > polarity trick can't work neither as one device always asserts the line.
>
> You're right, I didn't think it through.
>
> If there was a standard way to mask pci irqs, it might have worked, but
> there isn't, unfortunately.
>
What if we had a way to mask pci irqs? We would also have to unmask the pci irq
when the guest writes EOI to the vlapic (or at some other time). I think this
still causes a problem: we don't know whether the guest will deassert the line.
Maybe adding some time-based detection here would work?

And about masking the pci irq, how about disabling the PCI device interrupt
using the Interrupt Disable bit (bit 10 of the Command register)? Not sure if
it would affect pending transactions, and also not sure all devices support
this (though they should).
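
For reference, a small sketch of the bit manipulation involved, assuming the
bit meant here is Interrupt Disable, bit 10 of the Command register at config
offset 0x04. The constant names mirror the ones in the Linux pci_regs.h
header; the two helper functions are hypothetical and only compute register
values, they do not touch any hardware.

```c
#include <stdint.h>

#define PCI_COMMAND               0x04    /* config offset of the Command register */
#define PCI_COMMAND_INTX_DISABLE  0x0400  /* bit 10: Interrupt Disable */
#define PCI_STATUS                0x06    /* config offset of the Status register */
#define PCI_STATUS_INTERRUPT      0x0008  /* bit 3: Interrupt Status */

/* Return the Command register value with INTx masked or unmasked. */
static uint16_t pci_command_mask_intx(uint16_t cmd, int mask)
{
    if (mask)
        return (uint16_t)(cmd | PCI_COMMAND_INTX_DISABLE);
    return (uint16_t)(cmd & ~PCI_COMMAND_INTX_DISABLE);
}

/* Non-zero if the Status register reports the device has INTx pending. */
static int pci_status_intx_asserted(uint16_t status)
{
    return (status & PCI_STATUS_INTERRUPT) != 0;
}
```

The caller would read the 16-bit Command register, pass it through
pci_command_mask_intx() and write the result back. Whether a given device
actually implements the bit is the open question raised above.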

--
regards
Yang, Sheng

Re: Remaining passthrough/VT-d tasks list

2008-09-27 Thread Yang, Sheng

On Wednesday 24 September 2008 16:38:35 Avi Kivity wrote:
> Yang, Sheng wrote:
> >> - Shared Interrupt support
> >
> > I still don't know who would do this. It's very important for VT-d real
> > usable. If nobody interested in it, I would pick it up, but after Oct. 6
> > (after National Holiday in China).
>
> Shared host interrupts?  What's your plan here?  The polarity trick?

Hi, Avi

After checking the host shared interrupt situation, I have a question:

If I understand correctly, the current solution doesn't block host shared
irqs; it just comes with a performance penalty. The penalty comes from the
host disabling the irq line for a period, since we have to wait for the guest
to write EOI. But I fail to see the correctness problem here (except a lot of
spurious interrupts in the guest).

I've checked the mails but can't find a clue about that. Can you explain the
situation?

Thanks!
--
regards
Yang, Sheng

Re: [PATCH 4/4] kvm: bios: switch MTRRs to cover only the PCI range and default to WB

2008-09-26 Thread Yang, Sheng

On Friday 26 September 2008 01:52:29 Alex Williamson wrote:
> kvm: bios: switch MTRRs to cover only the PCI range and default to WB
>
> This matches how some bare metal machines report MTRRs and avoids
> the problem of running out of MTRRs to cover all of RAM.
>
> Signed-off-by: Alex Williamson <[EMAIL PROTECTED]>
> ---
>
>  bios/rombios32.c |   24 
>  1 files changed, 4 insertions(+), 20 deletions(-)
>
> diff --git a/bios/rombios32.c b/bios/rombios32.c
> index f8edf18..592abf9 100755
> --- a/bios/rombios32.c
> +++ b/bios/rombios32.c
> @@ -494,7 +494,6 @@ void setup_mtrr(void)
>  uint8_t valb[8];
>  uint64_t val;
>  } u;
> -uint64_t vbase, vmask;
>
>  mtrr_cap = rdmsr(MSR_MTRRcap);
>  vcnt = mtrr_cap & 0xff;
> @@ -521,25 +520,10 @@ void setup_mtrr(void)
>  wrmsr_smp(MSR_MTRRfix4K_E8000, 0);
>  wrmsr_smp(MSR_MTRRfix4K_F, 0);
>  wrmsr_smp(MSR_MTRRfix4K_F8000, 0);
> -vbase = 0;
> ---vcnt; /* leave one mtrr for VRAM */
> -for (i = 0; i < vcnt && vbase < ram_size; ++i) {
> -vmask = (1ull << 40) - 1;
> -while (vbase + vmask + 1 > ram_size)
> -vmask >>= 1;
> -wrmsr_smp(MTRRphysBase_MSR(i), vbase | 6);
> -wrmsr_smp(MTRRphysMask_MSR(i), (~vmask & 0xfff000ull) | 0x800);
> -vbase += vmask + 1;
> -}
> -for (vbase = 1ull << 32; i < vcnt && vbase < ram_end; ++i) {
> -vmask = (1ull << 40) - 1;
> -while (vbase + vmask + 1 > ram_end)
> -vmask >>= 1;
> -wrmsr_smp(MTRRphysBase_MSR(i), vbase | 6);
> -wrmsr_smp(MTRRphysMask_MSR(i), (~vmask & 0xfff000ull) | 0x800);
> -vbase += vmask + 1;
> -}
> -wrmsr_smp(MSR_MTRRdefType, 0xc00);
> +/* Mark 3.5-4GB as UC, anything not specified defaults to WB */
> +wrmsr_smp(MTRRphysBase_MSR(0), 0xe000ull | 0);
> +wrmsr_smp(MTRRphysMask_MSR(0), ~(0x2000ull - 1) | 0x800);
> +wrmsr_smp(MSR_MTRRdefType, 0xc06);
>  }
>

I think we should do a little more than just write the MSRs to update the MTRRs.

Intel SDM 10.11.8 "MTRR Considerations in MP Systems" defines the procedure to
modify MTRR MSRs on MP systems. In particular, step 4 enters no-fill cache mode
(set the CR0.CD bit and clear the NW bit), and step 12 re-enables caching
(clear these two bits).

We rely on this behavior to detect MTRR updates.
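
To make the two steps concrete, here is a sketch of the CR0 values involved
and of the kind of check a hypervisor could apply to guest CR0 writes. CD is
bit 30 and NW is bit 29 of CR0; the helper names are hypothetical, and how
KVM actually hooks this is not shown in this message.

```c
#include <stdint.h>

#define X86_CR0_NW ((uint64_t)1 << 29)  /* Not Write-through */
#define X86_CR0_CD ((uint64_t)1 << 30)  /* Cache Disable */

/* Step 4: enter no-fill cache mode (CD = 1, NW = 0). */
static uint64_t cr0_enter_no_fill(uint64_t cr0)
{
    return (cr0 | X86_CR0_CD) & ~X86_CR0_NW;
}

/* Step 12: re-enable caching (CD = 0, NW = 0). */
static uint64_t cr0_restore_caching(uint64_t cr0)
{
    return cr0 & ~(X86_CR0_CD | X86_CR0_NW);
}

/*
 * Non-zero if a CR0 write takes the CPU out of no-fill mode, i.e. the
 * point at which a guest following the SDM procedure has finished
 * rewriting its MTRRs.
 */
static int cr0_leaves_no_fill(uint64_t old_cr0, uint64_t new_cr0)
{
    return (old_cr0 & X86_CR0_CD) && !(new_cr0 & X86_CR0_CD);
}
```

A BIOS that writes the MTRR MSRs without going through the CD toggle never
produces the transition that cr0_leaves_no_fill() looks for, which is the
concern with the patch above.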

(Forgot to raise the bug to Avi, recalled it now...)
--
regards
Yang, Sheng

Re: [PATCH 5/7] KVM/userspace: Device Assignment: Support for assigning PCI devices to guests

2008-09-25 Thread Yang, Sheng

On Tuesday 23 September 2008 22:54:53 Amit Shah wrote:
> +static uint32_t assigned_dev_pci_read_config(PCIDevice *d, uint32_t address,
> +int len)
> +{
> +   uint32_t val = 0;
> +   int fd, r;
> +
> +   if ((address >= 0x10 && address <= 0x24) || address == 0x34 ||
> +   address == 0x3c || address == 0x3d) {
> +   val = pci_default_read_config(d, address, len);
> +   DEBUG("(%x.%x): address=%04x val=0x%08x len=%d\n",
> + (d->devfn >> 3) & 0x1F, (d->devfn & 0x7), address, val,
> + len);
> +   return val;
> +   }
> +
> +   /* vga specific, remove later */
> +   if (address == 0xFC)
> +   goto do_log;
> +
> +   fd = ((AssignedDevice *)d)->real_device.config_fd;
> +   r = lseek(fd, address, SEEK_SET);
> +   if (r < 0) {
> +   fprintf(stderr, "%s: bad seek, errno = %d\n",
> +   __func__, errno);
> +   return val;
> +   }

This method of reading from configuration space has a small problem: the vendor
id and device id are read from configuration space directly rather than from
the "vendor" and "device" files in sysfs. That causes trouble with devices
whose configuration space is inconsistent with the "vendor" and "device" files,
e.g. because of a fixup by the host PCI subsystem in the kernel.

Maybe this can be deferred to a follow-up patch, but we should address
this issue... Maybe we can use libpci? More fields than vendor and device have
this problem, like "irq".
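
A minimal sketch of reading those values from sysfs instead of raw
configuration space. It uses plain stdio rather than libpci, purely to keep
the example self-contained; the helper names are hypothetical. The sysfs
"vendor" and "device" files contain hex values such as 0x8086, while "irq" is
decimal, so the parser lets strtoul() pick the base.

```c
#include <stdio.h>
#include <stdlib.h>

/* Build "/sys/bus/pci/devices/<bdf>/<attr>"; bdf is e.g. "0000:04:08.0". */
static int pci_sysfs_path(char *buf, size_t len,
                          const char *bdf, const char *attr)
{
    int n = snprintf(buf, len, "/sys/bus/pci/devices/%s/%s", bdf, attr);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Read one number from a sysfs attribute file. Returns 0 on success. */
static int read_sysfs_ulong(const char *path, unsigned long *out)
{
    char buf[32];
    char *end;
    FILE *f = fopen(path, "r");

    if (!f)
        return -1;
    if (!fgets(buf, sizeof(buf), f)) {
        fclose(f);
        return -1;
    }
    fclose(f);

    /* Base 0 accepts both "0x8086" (vendor, device) and "11" (irq). */
    *out = strtoul(buf, &end, 0);
    return end == buf ? -1 : 0;
}
```

The read_config hook could then return these values for the vendor, device
and irq offsets and fall through to the config_fd read for everything else.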

--
regards
Yang, Sheng

Re: KVM Migration fails

2008-09-25 Thread Yang, Sheng

On Thursday 25 September 2008 15:23:22 jd wrote:
> The error code and the messages seem bit different here.
>

Um... At least I believe it's a regression, and our migration test has never
succeeded since then. So I think it's worth looking into. If you are lucky
enough, you will have found two bugs. :)

--
regards
Yang, Sheng

> /Jd
>
> --- On Wed, 9/24/08, Yang, Sheng <[EMAIL PROTECTED]> wrote:
> > From: Yang, Sheng <[EMAIL PROTECTED]>
> > Subject: Re: KVM Migration fails
> > To: kvm@vger.kernel.org, "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
> > Date: Wednesday, September 24, 2008, 11:23 PM
> >
> > On Thursday 25 September 2008 12:22:42 jd wrote:
> > > Hi
> > >   I have a setup using shared nfs disks. When
> >
> > migration is attempted, it
> >
> > > fails... any ideas on how to debug this..?
> >
> > It's a regression bug recently.
> >
> > Please refer to
> >
> > https://sourceforge.net/tracker/index.php?func=detail&aid=2106661&group_i
> >d=180599&atid=893831
> >
> > I think a git bisect can also help.
> > --
> > regards
> > Yang, Sheng
> >
> > > /Jd
> > >
> > > Details
> > > ===
> > >
> > > migration: write failed (Connection reset by peer)^M
> > > Migration failed! ret=0 error=9
> > >
> > > Source : KVM-73, Cent OS 5.2, 64 bit.
> > >
> > > qemu-system-x86_64 -net
> >
> > nic,vlan=0,macaddr=00:16:3e:16:f4:f0 -net
> >
> > > user,vlan=0 -hda /mnt/nfs/vmdisks/XPSP2-KVM.disk.xm
> >
> > -boot c -m 1024
> >
> > > -no-acpi  -vnc :22 -name XPSP2-KVM -smp 2 -monitor
> > > unix:/var/run/kvm/monitors/XPSP2-KVM,server,nowait
> >
> > -pidfile
> >
> > > /var/run/kvm/pids/XPSP2-KVM -daemonize
> > >
> > >
> > >
> > > Dest   : KVM-70, Fedora 8, 64bit
> > >
> > > qemu-system-x86_64 -net
> >
> > nic,vlan=0,macaddr=00:16:3e:16:f4:f0 -net
> >
> > > user,vlan=0 -hda /mnt/nfs/vmdisks/XPSP2-KVM.disk.xm
> >
> > -boot c -m 1024
> >
> > > -no-acpi  -vnc :23 -incoming tcp://0:8002 -name
> >
> > XPSP2-KVM -smp 2 -monitor
> >
> > > unix:/var/run/kvm/monitors/XPSP2-KVM,server,nowait
> >
> > -pidfile
> >
> > > /var/run/kvm/pids/XPSP2-KVM -daemonize

Re: KVM Migration fails

2008-09-24 Thread Yang, Sheng

On Thursday 25 September 2008 12:22:42 jd wrote:
> Hi
>   I have a setup using shared nfs disks. When migration is attempted, it
> fails... any ideas on how to debug this..?

It's a recent regression.

Please refer to 

https://sourceforge.net/tracker/index.php?func=detail&aid=2106661&group_id=180599&atid=893831

I think a git bisect can also help. 
--
regards
Yang, Sheng
>
> /Jd
>
> Details
> ===
>
> migration: write failed (Connection reset by peer)^M
> Migration failed! ret=0 error=9
>
> Source : KVM-73, Cent OS 5.2, 64 bit.
>
> qemu-system-x86_64 -net nic,vlan=0,macaddr=00:16:3e:16:f4:f0 -net
> user,vlan=0 -hda /mnt/nfs/vmdisks/XPSP2-KVM.disk.xm -boot c -m 1024
> -no-acpi  -vnc :22 -name XPSP2-KVM -smp 2 -monitor
> unix:/var/run/kvm/monitors/XPSP2-KVM,server,nowait -pidfile 
> /var/run/kvm/pids/XPSP2-KVM -daemonize
>
>
>
> Dest   : KVM-70, Fedora 8, 64bit
>
> qemu-system-x86_64 -net nic,vlan=0,macaddr=00:16:3e:16:f4:f0 -net
> user,vlan=0 -hda /mnt/nfs/vmdisks/XPSP2-KVM.disk.xm -boot c -m 1024
> -no-acpi  -vnc :23 -incoming tcp://0:8002 -name XPSP2-KVM -smp 2 -monitor
> unix:/var/run/kvm/monitors/XPSP2-KVM,server,nowait -pidfile 
> /var/run/kvm/pids/XPSP2-KVM -daemonize

Re: [PATCH 5/7] KVM/userspace: Device Assignment: Support for assigning PCI devices to guests

2008-09-24 Thread Yang, Sheng

On Thursday 25 September 2008 12:54:46 Yang, Sheng wrote:
> On Tuesday 23 September 2008 22:54:53 Amit Shah wrote:
> > From: Or Sagi <[EMAIL PROTECTED]>
> > From: Nir Peleg <[EMAIL PROTECTED]>
> > From: Amit Shah <[EMAIL PROTECTED]>
> > From: Ben-Ami Yassour <[EMAIL PROTECTED]>
> > From: Weidong Han <[EMAIL PROTECTED]>
> > From: Glauber de Oliveira Costa <[EMAIL PROTECTED]>
> >
> > With this patch, we can assign a device on the host machine to a
> > guest.
> >
> > A new command-line option, -pcidevice is added.
> > For example, to invoke it for a device sitting at PCI bus:dev.fn
> > 04:08.0, use this:
> >
> > -pcidevice host=04:08.0
> >
> > * The host driver for the device, if any, is to be removed before
> > assigning the device (else device assignment will fail).
> >
> > * A device that shares IRQ with another host device cannot currently
> > be assigned.
> >
> > This works only with the in-kernel irqchip method; to use the
> > userspace irqchip, a kernel module (irqhook) and some extra changes
> > are needed.
>
> Hi Amit
>
> I am afraid I got this when try to enable VT-d.
>
> create_userspace_phys_mem: Invalid argument
> assigned_dev_iomem_map: Error: create new mapping failed
>
> Can you have a look at it? (and the patch you sent to Weidong don't got
> this problem.)

Oh, Weidong's patch "[PATCH] VT-d: Fix iommu map page for mmio pages" fixes it.
--
regards
Yang, Sheng
>
> Thanks.
> --
> regards
> Yang, Sheng
>
> > Signed-off-by: Amit Shah <[EMAIL PROTECTED]>
> > ---
> >  qemu/Makefile.target|1 +
> >  qemu/hw/device-assignment.c |  665 +++
> >  qemu/hw/device-assignment.h |   93 ++
> >  qemu/hw/pc.c|9 +
> >  qemu/hw/pci.c   |7 +
> >  qemu/vl.c   |   18 ++
> >  6 files changed, 793 insertions(+), 0 deletions(-)
> >  create mode 100644 qemu/hw/device-assignment.c
> >  create mode 100644 qemu/hw/device-assignment.h
> >
> > diff --git a/qemu/Makefile.target b/qemu/Makefile.target
> > index 72f3db8..40eb273 100644
> > --- a/qemu/Makefile.target
> > +++ b/qemu/Makefile.target
> > @@ -616,6 +616,7 @@ OBJS+= ide.o pckbd.o ps2.o vga.o $(SOUND_HW) dma.o
> >  OBJS+= fdc.o mc146818rtc.o serial.o i8259.o i8254.o pcspk.o pc.o
> >  OBJS+= cirrus_vga.o apic.o parallel.o acpi.o piix_pci.o
> >  OBJS+= usb-uhci.o vmmouse.o vmport.o vmware_vga.o extboot.o
> > +OBJS+= device-assignment.o
> >  ifeq ($(USE_KVM_PIT), 1)
> >  OBJS+= i8254-kvm.o
> >  endif
> > diff --git a/qemu/hw/device-assignment.c b/qemu/hw/device-assignment.c
> > new file mode 100644
> > index 000..e70daf2
> > --- /dev/null
> > +++ b/qemu/hw/device-assignment.c
> > @@ -0,0 +1,665 @@
> > +/*
> > + * Copyright (c) 2007, Neocleus Corporation.
> > + *
> > + * This program is free software; you can redistribute it and/or modify it
> > + * under the terms and conditions of the GNU General Public License,
> > + * version 2, as published by the Free Software Foundation.
> > + *
> > + * This program is distributed in the hope it will be useful, but WITHOUT
> > + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> > + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> > + * more details.
> > + *
> > + * You should have received a copy of the GNU General Public License along with
> > + * this program; if not, write to the Free Software Foundation, Inc., 59 Temple
> > + * Place - Suite 330, Boston, MA 02111-1307 USA.
> > + *
> > + *
> > + *  Assign a PCI device from the host to a guest VM.
> > + *
> > + *  Adapted for KVM by Qumranet.
> > + *
> > + *  Copyright (c) 2007, Neocleus, Alex Novik ([EMAIL PROTECTED])
> > + *  Copyright (c) 2007, Neocleus, Guy Zana ([EMAIL PROTECTED])
> > + *  Copyright (C) 2008, Qumranet, Amit Shah ([EMAIL PROTECTED])
> > + *  Copyright (C) 2008, Red Hat, Amit Shah ([EMAIL PROTECTED])
> > + */
> > +#include 
> > +#include 
> > +#include "qemu-kvm.h"
> > +#include 
> > +#include "device-assignment.h"
> > +
> > +/* From linux/ioport.h */
> > +#define IORESOURCE_IO  0x0100  /* Resource type */
> > +#define IORESOURCE_MEM 0x0200
> > +#define IORESOURCE_IRQ 0x0400
> > +#define IORESOURCE_DMA 0x0800
> > +#define IORESOURCE_PREFETC

Re: [PATCH 5/7] KVM/userspace: Device Assignment: Support for assigning PCI devices to guests

2008-09-24 Thread Yang, Sheng

On Tuesday 23 September 2008 22:54:53 Amit Shah wrote:
> From: Or Sagi <[EMAIL PROTECTED]>
> From: Nir Peleg <[EMAIL PROTECTED]>
> From: Amit Shah <[EMAIL PROTECTED]>
> From: Ben-Ami Yassour <[EMAIL PROTECTED]>
> From: Weidong Han <[EMAIL PROTECTED]>
> From: Glauber de Oliveira Costa <[EMAIL PROTECTED]>
>
> With this patch, we can assign a device on the host machine to a
> guest.
>
> A new command-line option, -pcidevice is added.
> For example, to invoke it for a device sitting at PCI bus:dev.fn
> 04:08.0, use this:
>
> -pcidevice host=04:08.0
>
> * The host driver for the device, if any, is to be removed before
> assigning the device (else device assignment will fail).
>
> * A device that shares IRQ with another host device cannot currently
> be assigned.
>
> This works only with the in-kernel irqchip method; to use the
> userspace irqchip, a kernel module (irqhook) and some extra changes
> are needed.
>

Hi Amit

I am afraid I got this when trying to enable VT-d:

create_userspace_phys_mem: Invalid argument
assigned_dev_iomem_map: Error: create new mapping failed

Can you have a look at it? (The patch you sent to Weidong doesn't have this
problem.)

Thanks.
--
regards
Yang, Sheng

> Signed-off-by: Amit Shah <[EMAIL PROTECTED]>
> ---
>  qemu/Makefile.target|1 +
>  qemu/hw/device-assignment.c |  665 +++
>  qemu/hw/device-assignment.h |   93 ++
>  qemu/hw/pc.c|9 +
>  qemu/hw/pci.c   |7 +
>  qemu/vl.c   |   18 ++
>  6 files changed, 793 insertions(+), 0 deletions(-)
>  create mode 100644 qemu/hw/device-assignment.c
>  create mode 100644 qemu/hw/device-assignment.h
>
> diff --git a/qemu/Makefile.target b/qemu/Makefile.target
> index 72f3db8..40eb273 100644
> --- a/qemu/Makefile.target
> +++ b/qemu/Makefile.target
> @@ -616,6 +616,7 @@ OBJS+= ide.o pckbd.o ps2.o vga.o $(SOUND_HW) dma.o
>  OBJS+= fdc.o mc146818rtc.o serial.o i8259.o i8254.o pcspk.o pc.o
>  OBJS+= cirrus_vga.o apic.o parallel.o acpi.o piix_pci.o
>  OBJS+= usb-uhci.o vmmouse.o vmport.o vmware_vga.o extboot.o
> +OBJS+= device-assignment.o
>  ifeq ($(USE_KVM_PIT), 1)
>  OBJS+= i8254-kvm.o
>  endif
> diff --git a/qemu/hw/device-assignment.c b/qemu/hw/device-assignment.c
> new file mode 100644
> index 000..e70daf2
> --- /dev/null
> +++ b/qemu/hw/device-assignment.c
> @@ -0,0 +1,665 @@
> +/*
> + * Copyright (c) 2007, Neocleus Corporation.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + *
> + * You should have received a copy of the GNU General Public License along with
> + * this program; if not, write to the Free Software Foundation, Inc., 59 Temple
> + * Place - Suite 330, Boston, MA 02111-1307 USA.
> + *
> + *
> + *  Assign a PCI device from the host to a guest VM.
> + *
> + *  Adapted for KVM by Qumranet.
> + *
> + *  Copyright (c) 2007, Neocleus, Alex Novik ([EMAIL PROTECTED])
> + *  Copyright (c) 2007, Neocleus, Guy Zana ([EMAIL PROTECTED])
> + *  Copyright (C) 2008, Qumranet, Amit Shah ([EMAIL PROTECTED])
> + *  Copyright (C) 2008, Red Hat, Amit Shah ([EMAIL PROTECTED])
> + */
> +#include 
> +#include 
> +#include "qemu-kvm.h"
> +#include 
> +#include "device-assignment.h"
> +
> +/* From linux/ioport.h */
> +#define IORESOURCE_IO  0x0100  /* Resource type */
> +#define IORESOURCE_MEM 0x0200
> +#define IORESOURCE_IRQ 0x0400
> +#define IORESOURCE_DMA 0x0800
> +#define IORESOURCE_PREFETCH0x1000  /* No side effects */
> +
> +/* #define DEVICE_ASSIGNMENT_DEBUG */
> +
> +#ifdef DEVICE_ASSIGNMENT_DEBUG
> +#define DEBUG(fmt, args...) fprintf(stderr, "%s: " fmt, __func__ , ## args)
> +#else
> +#define DEBUG(fmt, args...)
> +#endif
> +
> +static void assigned_dev_ioport_writeb(void *opaque, uint32_t addr,
> +  uint32_t value)
> +{
> +   AssignedDevRegion *r_access = (AssignedDevRegion *)opaque;
> +   uint32_t r_pio = (unsigned long)r_access->r_virtbase
> +   + (addr - r_access->e_physbase);
> +
> +   if (r_access->debug & DEVICE_ASSIG

Re: [PATCH 5/9] kvm-x86: Enable NMI Watchdog via in-kernel PIT source

2008-09-24 Thread Yang, Sheng

On Tuesday 23 September 2008 23:04:48 Jan Kiszka wrote:
> Yang, Sheng wrote:
> > On Friday 19 September 2008 20:03:02 Jan Kiszka wrote:
> >> LINT0 of the LAPIC can be used to route PIT events as NMI watchdog
> >> ticks into the guest. This patch aligns the in-kernel irqchip emulation
> >> with the user space irqchip with already supports this feature. The
> >> trick is to route PIT interrupts to all LAPIC's LVT0 lines.
> >>
> >> Rebased patch and slightly polished patch originally posted by Sheng
> >> Yang.
> >
> > Signed-off-by: Sheng Yang <[EMAIL PROTECTED]>
> >
> > Thanks for pick up this patch again!
> >
> > Have you test some Windows guest with this watchdog? Last time I dropped
> > it because it cause BSOD on some version of
> > Windows(IRQ_NOT_EQUAL_OR_LESS). I don't remember the exactly situation
> > there, but you may have a try.
>
> Not yet. I always tell my colleagues that I don't need Windows on my
> desktop, I just need a few VM images - for testing... :)
>
> I will try to dig out / generate some image and reproduce the issue you
> and Gleb see. Hope it will trigger here as well. Anything special
> required to make Windows use the NMI as watchdog?
>
I don't know if Windows uses the NMI watchdog. In fact, my original patch just
caused Windows to BSOD, and I think Windows doesn't use it (the Linux NMI
watchdog mechanism is a rather tricky one)...

--
regards
Yang, Sheng

Re: Remaining passthrough/VT-d tasks list

2008-09-24 Thread Yang, Sheng

On Wednesday 24 September 2008 17:22:53 Avi Kivity wrote:
> Yang, Sheng wrote:
> >> We only have three pci interrupts at this point (though this could be
> >> easily extended); if you start the guest with a non-trivial number of
> >> devices, you will have shared guest interrupts.
> >>
> >> (of course, when I pointed this out during review, people said it could
> >> be done later, then forgot all about it)
> >
> > .
> >
> > I think it's a performance issue, not break it? How about do it like Xen
> > side? Try best to avoid the share, extended the pci interrupts, improve
> > hash algorithm. Is there anything else we can do?
>
> Two separate issues:
>
> 1. only three guest pci interrupts
>
> That's a performance issue, not correctness.  can be fixed by using gsi
> 16-23 in APIC mode, and by adding another IOAPIC (so we can use gsi
> 16-47).  Anthony Xu posted some patches for this, not sure where this
> stands, but it was the right approach.
>
> 2. shared guest pci interrupts
>
> That's a correctness issue.  No matter how many interrupts we have, we
> may have sharing issues.  Of course with only three the issue is very
> pressing since we will get sharing with just a few devices.  Currently
> if two assigned devices share a guest interrupts, or if an emulated
> device shares an interrupt with an assigned device, things will break.
>
> They need to be fixed independently.

About the second issue, I don't understand how it would break... Would you
please give more details on this? Is it a QEmu bug or an IOAPIC bug?

-- 
regards
Yang, Sheng


Re: Remaining passthrough/VT-d tasks list

2008-09-24 Thread Yang, Sheng

On Wednesday 24 September 2008 16:53:15 Avi Kivity wrote:
> Yang, Sheng wrote:
> >> Shared guest interrupts is a prerequisite for merging into mainline.
> >> Without this, device assignment is useless in anything but a benchmark
> >> scenario.  I won't push device assignment for 2.6.28 without it.
> >>
> >> Shared host interrupts are a different matter; which one did you mean?
> >
> > Got confused...
> >
> > I think we are talking about share host interrupts, that is pre-assigned
> > device shared IRQ with other devices.
> >
> > Why share guest interrupts is a prerequisite...
>
> We only have three pci interrupts at this point (though this could be
> easily extended); if you start the guest with a non-trivial number of
> devices, you will have shared guest interrupts.
>
> (of course, when I pointed this out during review, people said it could
> be done later, then forgot all about it)
>
.. 

I think it's a performance issue, not breakage? How about doing it like the Xen
side: try our best to avoid sharing, extend the pci interrupts, improve the
hash algorithm. Is there anything else we can do?

-- 
regards
Yang, Sheng

Re: Remaining passthrough/VT-d tasks list

2008-09-24 Thread Yang, Sheng

On Wednesday 24 September 2008 16:38:35 Avi Kivity wrote:
> Yang, Sheng wrote:
> >> - MSI support (WIP)
> >> - MTRR/PAT support of EPT (WIP)
> >> - MTRR/PAT support of shadow (WIP)
> >> - Basic FLR support (WIP)
> >
> > Above four are my works. All of them work now. But more job should be
> > done to polish the patches. And the main part of Function Level Reset
> > would be picked by linux-pci.
> >
> > Another thing is we would send out/update above patches before Sept. 28,
> > and hope they can picked by 2.6.28 merge window.
> >
> > Avi, what's your opinion? Of course we would work hard. :) But what's the
> > deadline of merge window?
>
> No one knows, but it's very unlikely these features will make it for
> 2.6.28.  To be merged, it is not sufficient for the patches to be
> ready.  They have to undergo some testing in the field.

..

> >> - Shared Interrupt support
> >
> > I still don't know who would do this. It's very important for VT-d real
> > usable. If nobody interested in it, I would pick it up, but after Oct. 6
> > (after National Holiday in China).
>
> Shared host interrupts?  What's your plan here?  The polarity trick?
>
Yeah, shared host interrupts. But I haven't got a very clear idea yet.

-- 
regards
Yang, Sheng



Re: Remaining passthrough/VT-d tasks list

2008-09-24 Thread Yang, Sheng

On Wednesday 24 September 2008 16:34:22 Avi Kivity wrote:
> Han, Weidong wrote:
> > Hi all,
> >
> > The initial passthrough/VT-d patches have been in kvm, it's time to
> > enhance it, and push them into 2.6.28.
> >
> >   - Shared Interrupt support
>
> Shared guest interrupts is a prerequisite for merging into mainline.
> Without this, device assignment is useless in anything but a benchmark
> scenario.  I won't push device assignment for 2.6.28 without it.
>
> Shared host interrupts are a different matter; which one did you mean?
>

Got confused...

I think we are talking about shared host interrupts, that is, the pre-assigned
device sharing an IRQ with other devices.

Why are shared guest interrupts a prerequisite...

--
regards
Yang, Sheng


Re: Remaining passthrough/VT-d tasks list

2008-09-23 Thread Yang, Sheng

On Wednesday 24 September 2008 14:15:15 Han, Weidong wrote:
> Hi all,
>
> The initial passthrough/VT-d patches have been in kvm, it's time to enhance
> it, and push them into 2.6.28.
>

Some supplements:

> Following is the remaining passthrough/VT-d tasks list:
>
> - Multiple devices assignment (WIP)

Weidong is working on this.

> - MSI support (WIP)
> - MTRR/PAT support of EPT (WIP)
> - MTRR/PAT support of shadow (WIP)
> - Basic FLR support (WIP)

The above four are my work. All of them work now, but more needs to be done to
polish the patches. And the main part of Function Level Reset would be picked
up by linux-pci.

Another thing is that we will send out/update the above patches before Sept.
28, and hope they can be picked up in the 2.6.28 merge window.

Avi, what's your opinion? Of course we would work hard. :) But what's the
deadline of the merge window?

> (Above tasks are working in process, some patches have been sent out,
> others will be sent out in near future) - architecture independent (such as
> x86, IPF)
> - Shared Interrupt support

I still don't know who will do this. It's very important for making VT-d
really usable. If nobody is interested in it, I will pick it up, but only
after Oct. 6 (after the National Holiday in China).

--
regards
Yang, Sheng

> - Add dummy driver to hide/unbind passthrough device from host
> kernel
>
> If I omit some good features or you have some good proposals, please feel
> free to add them to this list. If you are interest in any tasks, please
> reply the mail directly and let other guys to know your progress.
> Appreciate any effort from you!
>
>
> Randy (Weidong)


Re: [PATCH 10/11] VMX: work around lacking VNMI support

2008-09-23 Thread Yang, Sheng

On Tuesday 23 September 2008 17:45:44 Gleb Natapov wrote:
> On Tue, Sep 23, 2008 at 05:42:02PM +0800, Yang, Sheng wrote:
> > > > That is exactly what I am using. Run it with SMP hal and do
> > > > hibernate.
> > >
> > > Oh... Finally found how to enable that hibernate option
> > >
> > > And this hibernate works on my virtual_nmi supported box, with smp hal
> > > and 2 cpus.
> >
> > However, for this hibernate won't success if there is no NMI support,
> > maybe we can say it's not a "regression"...
>
> I am not saying it's a regression, but it would be nice to have it
> working :)
>
Yeah, of course. :)

-- 
regards
Yang, Sheng




Re: [PATCH 10/11] VMX: work around lacking VNMI support

2008-09-23 Thread Yang, Sheng

On Tuesday 23 September 2008 17:37:00 Yang, Sheng wrote:
> On Tuesday 23 September 2008 17:26:55 Gleb Natapov wrote:
> > On Tue, Sep 23, 2008 at 05:24:50PM +0800, Yang, Sheng wrote:
> > > On Tuesday 23 September 2008 17:15:09 Gleb Natapov wrote:
> > > > On Tue, Sep 23, 2008 at 05:08:09PM +0800, Yang, Sheng wrote:
> > > > > > > >>> We still get here with vmx->soft_vnmi_blocked = 1. Trying
> > > > > > > >>> to find out how.
> > > > > > > >>
> > > > > > > >> We should only come along here with vnmi blocked on
> > > > > > > >> reinjection (after a fault on calling the handler).
> > > > > > > >
> > > > > > > > I see that nmi_injected is never cleared and it is check
> > > > > > > > before calling vmx_inject_nmi();
> > > > > > >
> > > > > > > That should happen in vmx_complete_interrupts, but only if the
> > > > > > > exit takes place after the NMI has been successfully delivered
> > > > > > > to the guest (which is not the case if invoking the handler
> > > > > > > raises an exception). So far for the theory...
> > > > > >
> > > > > > Okey, I have this one in dmesg:
> > > > > > kvm_handle_exit: unexpected, valid vectoring info and exit reason
> > > > > > is 0x9
> > > > >
> > > > > Oh... Another task switch issue...
> > > > >
> > > > > I think it's may not be a issue import by this patchset? Seems need
> > > > > more debug...
> > > > >
> > > > > The patchset is OK for me, except I don't know when we would need
> > > > > that timeout one (buggy guest?...), and we may also root cause this
> > > > > issue or ensure that it's not a regression.
> > > >
> > > > Without the patch series kvm doesn't inject NMIs on this machine, so
> > > > guest hangs. It's hard to tell if this message is caused by these
> > > > patches or not.
> > >
> > > Maybe try to reproduce it on virtual_nmi support machine is OK. But I
> > > only got Windows 2003 server edition by the hand. Does other Windows
> > > behaviour the same?
> >
> > That is exactly what I am using. Run it with SMP hal and do hibernate.
>
> Oh... Finally found how to enable that hibernate option
>
> And this hibernate works on my virtual_nmi supported box, with smp hal and
> 2 cpus.

However, since this hibernate won't succeed if there is no NMI support, maybe
we can say it's not a "regression"...

-- 
regards
Yang, Sheng

Re: [PATCH 10/11] VMX: work around lacking VNMI support

2008-09-23 Thread Yang, Sheng

On Tuesday 23 September 2008 17:26:55 Gleb Natapov wrote:
> On Tue, Sep 23, 2008 at 05:24:50PM +0800, Yang, Sheng wrote:
> > On Tuesday 23 September 2008 17:15:09 Gleb Natapov wrote:
> > > On Tue, Sep 23, 2008 at 05:08:09PM +0800, Yang, Sheng wrote:
> > > > > > >>> We still get here with vmx->soft_vnmi_blocked = 1. Trying to
> > > > > > >>> find out how.
> > > > > > >>
> > > > > > >> We should only come along here with vnmi blocked on
> > > > > > >> reinjection (after a fault on calling the handler).
> > > > > > >
> > > > > > > I see that nmi_injected is never cleared and it is check before
> > > > > > > calling vmx_inject_nmi();
> > > > > >
> > > > > > That should happen in vmx_complete_interrupts, but only if the
> > > > > > exit takes place after the NMI has been successfully delivered to
> > > > > > the guest (which is not the case if invoking the handler raises
> > > > > > an exception). So far for the theory...
> > > > >
> > > > > Okey, I have this one in dmesg:
> > > > > kvm_handle_exit: unexpected, valid vectoring info and exit reason
> > > > > is 0x9
> > > >
> > > > Oh... Another task switch issue...
> > > >
> > > > I think it's may not be a issue import by this patchset? Seems need
> > > > more debug...
> > > >
> > > > The patchset is OK for me, except I don't know when we would need
> > > > that timeout one (buggy guest?...), and we may also root cause this
> > > > issue or ensure that it's not a regression.
> > >
> > > Without the patch series kvm doesn't inject NMIs on this machine, so
> > > guest hangs. It's hard to tell if this message is caused by these
> > > patches or not.
> >
> > Maybe try to reproduce it on virtual_nmi support machine is OK. But I
> > only got Windows 2003 server edition by the hand. Does other Windows
> > behaviour the same?
>
> That is exactly what I am using. Run it with SMP hal and do hibernate.
>
Oh... Finally found how to enable that hibernate option.

And hibernate works on my box with virtual_nmi support, with the SMP HAL and
2 cpus.

-- 
regards
Yang, Sheng





Re: [PATCH 10/11] VMX: work around lacking VNMI support

2008-09-23 Thread Yang, Sheng

On Tuesday 23 September 2008 17:15:09 Gleb Natapov wrote:
> On Tue, Sep 23, 2008 at 05:08:09PM +0800, Yang, Sheng wrote:
> > > > >>> We still get here with vmx->soft_vnmi_blocked = 1. Trying to find
> > > > >>> out how.
> > > > >>
> > > > >> We should only come along here with vnmi blocked on reinjection
> > > > >> (after a fault on calling the handler).
> > > > >
> > > > > I see that nmi_injected is never cleared and it is check before
> > > > > calling vmx_inject_nmi();
> > > >
> > > > That should happen in vmx_complete_interrupts, but only if the exit
> > > > takes place after the NMI has been successfully delivered to the
> > > > guest (which is not the case if invoking the handler raises an
> > > > exception). So far for the theory...
> > >
> > > Okey, I have this one in dmesg:
> > > kvm_handle_exit: unexpected, valid vectoring info and exit reason is
> > > 0x9
> >
> > Oh... Another task switch issue...
> >
> > I think it's may not be a issue import by this patchset? Seems need more
> > debug...
> >
> > The patchset is OK for me, except I don't know when we would need that
> > timeout one (buggy guest?...), and we may also root cause this issue or
> > ensure that it's not a regression.
>
> Without the patch series kvm doesn't inject NMIs on this machine, so guest
> hangs. It's hard to tell if this message is caused by these patches or not.
>
> --
Just tried: Windows XP SP2 ia32pae with 2 cpus hibernates fine with
virtual_nmi...

-- 
regards
Yang, Sheng




Re: [PATCH 10/11] VMX: work around lacking VNMI support

2008-09-23 Thread Yang, Sheng

On Tuesday 23 September 2008 17:15:09 Gleb Natapov wrote:
> On Tue, Sep 23, 2008 at 05:08:09PM +0800, Yang, Sheng wrote:
> > > > >>> We still get here with vmx->soft_vnmi_blocked = 1. Trying to find
> > > > >>> out how.
> > > > >>
> > > > >> We should only come along here with vnmi blocked on reinjection
> > > > >> (after a fault on calling the handler).
> > > > >
> > > > > I see that nmi_injected is never cleared and it is check before
> > > > > calling vmx_inject_nmi();
> > > >
> > > > That should happen in vmx_complete_interrupts, but only if the exit
> > > > takes place after the NMI has been successfully delivered to the
> > > > guest (which is not the case if invoking the handler raises an
> > > > exception). So far for the theory...
> > >
> > > Okey, I have this one in dmesg:
> > > kvm_handle_exit: unexpected, valid vectoring info and exit reason is
> > > 0x9
> >
> > Oh... Another task switch issue...
> >
> > I think it's may not be a issue import by this patchset? Seems need more
> > debug...
> >
> > The patchset is OK for me, except I don't know when we would need that
> > timeout one (buggy guest?...), and we may also root cause this issue or
> > ensure that it's not a regression.
>
> Without the patch series kvm doesn't inject NMIs on this machine, so guest
> hangs. It's hard to tell if this message is caused by these patches or not.
>
Maybe trying to reproduce it on a machine with virtual_nmi support would
help. But I only have Windows 2003 Server Edition at hand. Do other Windows
versions behave the same?

-- 
regards
Yang, Sheng

Re: [PATCH 10/11] VMX: work around lacking VNMI support

2008-09-23 Thread Yang, Sheng

On Tuesday 23 September 2008 17:00:21 Gleb Natapov wrote:
> On Tue, Sep 23, 2008 at 10:57:40AM +0200, Jan Kiszka wrote:
> > Gleb Natapov wrote:
> > > On Tue, Sep 23, 2008 at 10:46:38AM +0200, Jan Kiszka wrote:
> > >> Gleb Natapov wrote:
> > >>> On Mon, Sep 22, 2008 at 09:59:07AM +0200, Jan Kiszka wrote:
> > >>>> @@ -2356,6 +2384,19 @@ static void vmx_inject_nmi(struct kvm_vc
> > >>>>  {
> > >>>>  struct vcpu_vmx *vmx = to_vmx(vcpu);
> > >>>>
> > >>>> +if (!cpu_has_virtual_nmis()) {
> > >>>> +/*
> > >>>> + * Tracking the NMI-blocked state in software is built upon
> > >>>> + * finding the next open IRQ window. This, in turn, depends on
> > >>>> + * well-behaving guests: They have to keep IRQs disabled at
> > >>>> + * least as long as the NMI handler runs. Otherwise we may
> > >>>> + * cause NMI nesting, maybe breaking the guest. But as this is
> > >>>> + * highly unlikely, we can live with the residual risk.
> > >>>> + */
> > >>>> +vmx->soft_vnmi_blocked = 1;
> > >>>> +vmx->vnmi_blocked_time = 0;
> > >>>> +}
> > >>>> +
> > >>>
> > >>> We still get here with vmx->soft_vnmi_blocked = 1. Trying to find out
> > >>> how.
> > >>
> > >> We should only come along here with vnmi blocked on reinjection (after
> > >> a fault on calling the handler).
> > >
> > > I see that nmi_injected is never cleared and it is check before calling
> > > vmx_inject_nmi();
> >
> > That should happen in vmx_complete_interrupts, but only if the exit
> > takes place after the NMI has been successfully delivered to the guest
> > (which is not the case if invoking the handler raises an exception). So
> > far for the theory...
>
> Okey, I have this one in dmesg:
> kvm_handle_exit: unexpected, valid vectoring info and exit reason is 0x9
>
Oh... Another task switch issue...

I think it may not be an issue introduced by this patchset? Seems to need
more debugging...

The patchset is OK for me, except that I don't know when we would need the
timeout one (buggy guest?...), and we should also root cause this issue or
ensure that it's not a regression.
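
For reference, the check behind that dmesg line can be sketched roughly as
below. This is a stand-alone illustration, not the kvm code: the helper name
is hypothetical, the constants follow the SDM encodings (bit 31 of the
IDT-vectoring information field is the valid bit; basic exit reason 0 is
exception/NMI, 9 is task switch, 48 is EPT violation), and the set of exit
reasons treated as "expected" is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Encodings from the SDM; the names mirror the kvm headers. */
#define VECTORING_INFO_VALID_MASK   0x80000000u
#define EXIT_REASON_EXCEPTION_NMI   0
#define EXIT_REASON_TASK_SWITCH     9
#define EXIT_REASON_EPT_VIOLATION   48

/*
 * Hypothetical helper: returns true when a VM exit interrupted event
 * delivery (valid vectoring info) with an exit reason that is not expected
 * in that situation. Which reasons count as expected is an assumption.
 */
static inline bool unexpected_vectoring_exit(uint32_t vectoring_info,
                                             uint32_t exit_reason)
{
    if (!(vectoring_info & VECTORING_INFO_VALID_MASK))
        return false;
    return exit_reason != EXIT_REASON_EXCEPTION_NMI &&
           exit_reason != EXIT_REASON_EPT_VIOLATION;
}
```

With exit reason 0x9 (task switch) and valid vectoring info, this helper
reports the exit as unexpected, which matches the message Gleb saw.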

-- 
regards
Yang, Sheng

Re: VMX: Host NMI triggering on NMI vmexit

2008-09-23 Thread Yang, Sheng

On Tuesday 23 September 2008 16:47:54 Jan Kiszka wrote:
> Yang, Sheng wrote:
> > On Monday 22 September 2008 19:00:38 Avi Kivity wrote:
> >> Jan Kiszka wrote:
> >>>> Maybe the answer is to generate the local nmi via an IPI-to-self
> >>>> command to the local apic.
> >>>
> >>> Going this way leaves me with a few questions: Will it be OK for the
> >>> related mainainers to export the required service?
> >>
> >> If we can make a case for it (I think we can), then I don't see why not.
> >>
> >> Sheng, can you confirm that 'int 2' is problematic, and that
> >> nmi-via-lapic is the best workaround?
> >
> > Just back from vacation... :)
> >
> > Jan said is true, "int 2" itself won't block subsequent NMIs. But I think
> > it's too obviously as a hardware issue when using with NMI exiting=1 in
> > vmx nonroot mode, so I have checked it with my colleague, finally found
> > these in SDM 3B 23-2:
> >
> > The following bullets detail when architectural state is and is not
> > updated in response to VM exits:
> > •   If an event causes a VM exit *directly*, it does not update
> > architectural state as it would have if it had it not caused the VM exit:
> > [...]
> > — *An NMI causes subsequent NMIs to be blocked*, but only after the VM
> > exit completes.
> >
> > So we needn't worry about that, and this shouldn't cause any trouble
> > AFAIK...
>
> Fine, problems--. :)
>
> > Jan, seems we need to do more investigating on the issues you met...
>
> Sorry, which one do you mean now?

You said 
"Only true until you have multiple unsynchronized NMI sources, e.g.
inter-CPU NMIs of kgdb + a watchdog. I just stumbled over several bugs
in kvm's and my own NMI code that were triggered by such a scenario
(sigh...)." ?

If that's not related to this one, that's fine. :)

-- 
regards
Yang, Sheng


>
> Jan
>
> --
> Siemens AG, Corporate Technology, CT SE 2
> Corporate Competence Center Embedded Linux



Re: [PATCH 9/11] VMX: Provide support for user space injected NMIs

2008-09-22 Thread Yang, Sheng

On Monday 22 September 2008 15:59:02 Jan Kiszka wrote:
> This patch adds the required bits to the VMX side for user space
> injected NMIs. As with the preexisting in-kernel irqchip support, the
> CPU must provide the "virtual NMI" feature for proper tracking of the
> NMI blocking state.
>
> Based on the original patch by Sheng Yang.
>
> Signed-off-by: Jan Kiszka <[EMAIL PROTECTED]>
> ---
>  arch/x86/kvm/vmx.c |   33 +
>  1 file changed, 33 insertions(+)
>
> Index: b/arch/x86/kvm/vmx.c
> ===
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -2356,6 +2356,7 @@ static void vmx_inject_nmi(struct kvm_vc
>  {
> struct vcpu_vmx *vmx = to_vmx(vcpu);
>
> +   ++vcpu->stat.nmi_injections;
> if (vcpu->arch.rmode.active) {
> vmx->rmode.irq.pending = true;
> vmx->rmode.irq.vector = NMI_VECTOR;
> @@ -2424,6 +2425,30 @@ static void do_interrupt_requests(struct
>  {
> vmx_update_window_states(vcpu);
>
> +   if (cpu_has_virtual_nmis()) {
> +   if (vcpu->arch.nmi_pending && !vcpu->arch.nmi_injected) {
> +   if (vcpu->arch.nmi_window_open) {
> +   vcpu->arch.nmi_pending = false;
> +   vcpu->arch.nmi_injected = true;
> +   } else {
> +   enable_nmi_window(vcpu);
> +   return;
> +   }
> +   }
> +   if (vcpu->arch.nmi_injected) {
> +   vmx_inject_nmi(vcpu);
> +   if (vcpu->arch.nmi_pending
> +   || kvm_run->request_nmi_window)
> +   enable_nmi_window(vcpu);
> +   else if (vcpu->arch.irq_summary
> +|| kvm_run->request_interrupt_window)
> +   enable_irq_window(vcpu);
> +   return;
> +   }
> +   if (!vcpu->arch.nmi_window_open || kvm_run->request_nmi_window)
> +   enable_nmi_window(vcpu);
> +   }
> +
> if (vcpu->arch.interrupt_window_open) {
> if (vcpu->arch.irq_summary && !vcpu->arch.interrupt.pending)
> kvm_do_inject_irq(vcpu);
> @@ -2936,6 +2961,14 @@ static int handle_nmi_window(struct kvm_
> vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, cpu_based_vm_exec_control);
> ++vcpu->stat.nmi_window_exits;
>
> +   /*
> +* If the user space waits to inject a NNI, exit as soon as possible
> +*/

Oh... found a typo (NNI should be NMI). :)

Also, please add my Signed-off-by to patches 8 and 9.

Signed-off-by: Sheng Yang <[EMAIL PROTECTED]>
--
regards
Yang, Sheng

> +   if (kvm_run->request_nmi_window && !vcpu->arch.nmi_pending) {
> +   kvm_run->exit_reason = KVM_EXIT_NMI_WINDOW_OPEN;
> +   return 0;
> +   }
> +
> return 1;
>  }



Re: [PATCH 5/9] kvm-x86: Enable NMI Watchdog via in-kernel PIT source

2008-09-22 Thread Yang, Sheng

On Friday 19 September 2008 20:03:02 Jan Kiszka wrote:
> LINT0 of the LAPIC can be used to route PIT events as NMI watchdog
> ticks into the guest. This patch aligns the in-kernel irqchip emulation
> with the user space irqchip with already supports this feature. The
> trick is to route PIT interrupts to all LAPIC's LVT0 lines.
>
> Rebased patch and slightly polished patch originally posted by Sheng
> Yang.

Signed-off-by: Sheng Yang <[EMAIL PROTECTED]>

Thanks for picking up this patch again!

Have you tested any Windows guests with this watchdog? Last time I dropped it
because it caused a BSOD on some versions of Windows
(IRQL_NOT_LESS_OR_EQUAL). I don't remember the exact situation, but you may
want to give it a try.

-- 
regards
Yang, Sheng
>
> Signed-off-by: Jan Kiszka <[EMAIL PROTECTED]>
> ---
>  arch/x86/kvm/i8254.c |   15 +++
>  arch/x86/kvm/irq.h   |1 +
>  arch/x86/kvm/lapic.c |   32 
>  3 files changed, 44 insertions(+), 4 deletions(-)
>
> Index: b/arch/x86/kvm/i8254.c
> ===
> --- a/arch/x86/kvm/i8254.c
> +++ b/arch/x86/kvm/i8254.c
> @@ -594,10 +594,25 @@ void kvm_free_pit(struct kvm *kvm)
>
>  static void __inject_pit_timer_intr(struct kvm *kvm)
>  {
> +   struct kvm_vcpu *vcpu;
> +   int i;
> +
> mutex_lock(&kvm->lock);
> kvm_set_irq(kvm, 0, 1);
> kvm_set_irq(kvm, 0, 0);
> mutex_unlock(&kvm->lock);
> +
> +   /*
> +* Provideds NMI watchdog support in IOAPIC mode.
> +* The route is: PIT -> PIC -> LVT0 in NMI mode,
> +* timer IRQs will continue to flow through the IOAPIC.
> +*/
> +   for (i = 0; i < KVM_MAX_VCPUS; ++i) {
> +   vcpu = kvm->vcpus[i];
> +   if (!vcpu)
> +   continue;
> +   kvm_apic_local_deliver(vcpu, APIC_LVT0);
> +   }
>  }
>
>  void kvm_inject_pit_timer_irqs(struct kvm_vcpu *vcpu)
> Index: b/arch/x86/kvm/irq.h
> ===
> --- a/arch/x86/kvm/irq.h
> +++ b/arch/x86/kvm/irq.h
> @@ -93,6 +93,7 @@ void kvm_unregister_irq_ack_notifier(str
>  void kvm_timer_intr_post(struct kvm_vcpu *vcpu, int vec);
>  void kvm_inject_pending_timer_irqs(struct kvm_vcpu *vcpu);
>  void kvm_inject_apic_timer_irqs(struct kvm_vcpu *vcpu);
> +int kvm_apic_local_deliver(struct kvm_vcpu *vcpu, int lvt_type);
>  void __kvm_migrate_apic_timer(struct kvm_vcpu *vcpu);
>  void __kvm_migrate_pit_timer(struct kvm_vcpu *vcpu);
>  void __kvm_migrate_timers(struct kvm_vcpu *vcpu);
> Index: b/arch/x86/kvm/lapic.c
> ===
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -382,6 +382,14 @@ static int __apic_accept_irq(struct kvm_
> }
> break;
>
> +   case APIC_DM_EXTINT:
> +   /*
> +* Should only be called by kvm_apic_local_deliver() with LVT0,
> +* before NMI watchdog was enabled. Already handled by
> +* kvm_apic_accept_pic_intr().
> +*/
> +   break;
> +
> default:
> printk(KERN_ERR "TODO: unsupported delivery mode %x\n",
>delivery_mode);
> @@ -749,6 +757,9 @@ static void apic_mmio_write(struct kvm_i
> case APIC_LVTTHMR:
> case APIC_LVTPC:
> case APIC_LVT0:
> +   if (val == APIC_DM_NMI)
> +   apic_debug("Receive NMI setting on APIC_LVT0 "
> +   "for cpu %d\n", apic->vcpu->vcpu_id);
> case APIC_LVT1:
> case APIC_LVTERR:
> /* TODO: Check vector */
> @@ -965,12 +976,25 @@ int apic_has_pending_timer(struct kvm_vc
> return 0;
>  }
>
> -static int __inject_apic_timer_irq(struct kvm_lapic *apic)
> +int kvm_apic_local_deliver(struct kvm_vcpu *vcpu, int lvt_type)
>  {
> -   int vector;
> +   struct kvm_lapic *apic = vcpu->arch.apic;
> +   int vector, mode, trig_mode;
> +   u32 reg;
> +
> +   if (apic && apic_enabled(apic)) {
> +   reg = apic_get_reg(apic, lvt_type);
> +   vector = reg & APIC_VECTOR_MASK;
> +   mode = reg & APIC_MODE_MASK;
> +   trig_mode = reg & APIC_LVT_LEVEL_TRIGGER;
> +   return __apic_accept_irq(apic, mode, vector, 1, trig_mode);
> +   }
> +   return 0;
> +}
>
> -   vector = apic_lvt_vector(apic, APIC_LVTT);
> -   return

Re: VMX: Host NMI triggering on NMI vmexit

2008-09-22 Thread Yang, Sheng

On Monday 22 September 2008 19:00:38 Avi Kivity wrote:
> Jan Kiszka wrote:
> >> Maybe the answer is to generate the local nmi via an IPI-to-self command
> >> to the local apic.
> >
> > Going this way leaves me with a few questions: Will it be OK for the
> > related mainainers to export the required service?
>
> If we can make a case for it (I think we can), then I don't see why not.
>
> Sheng, can you confirm that 'int 2' is problematic, and that
> nmi-via-lapic is the best workaround?

Just back from vacation... :)

What Jan said is true: "int 2" itself won't block subsequent NMIs. But that
looked too obvious a hardware issue when used with NMI exiting=1 in VMX
non-root mode, so I checked with my colleague and finally found this in
SDM 3B 23-2:

The following bullets detail when architectural state is and is not updated in 
response to VM exits:
•   If an event causes a VM exit *directly*, it does not update architectural 
state as it would have if it had it not caused the VM exit:
[...]
— *An NMI causes subsequent NMIs to be blocked*, but only after the VM exit
  completes.

So we needn't worry about that, and this shouldn't cause any trouble AFAIK...

Jan, it seems we need to investigate the issues you hit further...

-- 
regards
Yang, Sheng

> > And is it safe to
> > assume VMX == LAPIC available and usable?
>
> Yes.
>
> > However, this is how it would look like.
>
> I'd define a send_nmi_self() instead, to allow the implementation to
> change (x2apic/etc).
>
> > Yet untested, /me has to
> > replace his host kernel first...
>
> You could test it in a VM, if someone implements nested vmx :)
>
> btw, looks like svm is not affected by this.
>
> --
> error compiling committee.c: too many arguments to function


Re: VMX: NMI injection without virtual NMI support

2008-09-11 Thread Yang, Sheng

On Thursday 11 September 2008 23:11:14 Jan Kiszka wrote:
> Hi Sheng,
>
> we had parts of this discussion privately a while back, but now I need
> to dig deeper. I finally have to complete and push out user space NMI
> injection that bitrots over and over again in my queue.
>
> Intel CPUs with virtual NMI support can easily be programmed to inject
> NMIs only when the guest CPU is ready or tell the guest to exit as soon
> as it changes its NMI readiness. But we are also facing a lot of CPUs
> that were produced without this feature, and we need to find some way to
> achieve virtual NMI support for them, even if the interruptibility of
> the guest is not as good as with true virtual NMIs.
>
> Now my questions to you or anyone else with VMX expert knowledge:
>
> The System Programming Guide, Volume 3B, 27.2 says that "Without
> NMI-window exiting support, the VMM will need to poll and check the
> interruptibility state of the guest to deliver virtual NMIs."
> Interruptibility involves, besides "blocked by MOV SS" and "blocked by
> STI", the NMI nesting. So, how can the VMM track if the guest is still
> inside its NMI handler after event injection?

I think we have no direct way to know that...

> According to 22.6.1, NMI blocking is active as long as the guest runs if
> the "NMI-exiting" control bit is 1 (and setting it appears to be
> required to avoid that the guest can block NMIs for the host, right?).

Yes.

> But what does this mean for bit 3 of the interruptibility state? Can I
> still use it after some guest exit for whatever reason (like a hard IRQ)
> to poll if the guest can now accept pending virtual NMIs? 

No. Though the public spec is a little ambiguous, I just checked some
documents and found that, if "virtual NMI" is not enabled, "Blocking by NMI"
only indicates that *host physical* NMIs are blocked, not virtual NMIs.

> In that case, virtual NMI injections would only be delayed a bit on older
> CPUs, but could still work reliably, right? Or is there some other way to
> track the NMI interruptibility on such CPUs?

I hope so, but sadly it seems we can't know whether we can inject an NMI into
the guest without enabling "virtual NMI".

After some brainstorming with my colleague Haitao, we found one way that
*may* work on hardware without virtual NMI support, but it's very tricky and
untested (just a theoretical idea); it also sacrifices host physical NMIs and
depends on the guest's NMI implementation.

The basic idea is:
1. Disable the "NMI Exiting" feature, so that the guest would handle any host
physical NMI.
2. Set "Blocking by NMI" to true. When the guest executes "iret", this bit
should be cleared. This also blocks physical NMIs.
3. Enable "IRQ Window" and ensure IDT vector 2 is an interrupt gate (so that
the IF bit is cleared when executing the handler). So we use the IRQ window
to replace the NMI window here; otherwise, for the period between IRET and
VMEXIT, host physical NMIs would be handled by the guest handler.
4. Check on "IRQ Window" exit: if "Blocking by NMI" is cleared (and also STI
and SS blocking), we can inject the NMI.

It's just an untested idea, and I think it's quite tricky. If you are still
interested, you may try it, but I can't guarantee the result...
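
The test in step 4 boils down to a check on the guest interruptibility state
at each IRQ-window exit. A minimal sketch, assuming the SDM bit layout (bit 0
blocking by STI, bit 1 blocking by MOV SS, bit 3 blocking by NMI); the helper
name is made up for illustration and, like the idea itself, this has not been
tried on real hardware:

```c
#include <stdbool.h>
#include <stdint.h>

/* Guest interruptibility-state bits, per the SDM layout. */
#define GUEST_INTR_STATE_STI     0x00000001u  /* blocking by STI */
#define GUEST_INTR_STATE_MOV_SS  0x00000002u  /* blocking by MOV SS */
#define GUEST_INTR_STATE_NMI     0x00000008u  /* blocking by NMI */

/*
 * Hypothetical helper for step 4: the emulated NMI window is open only when
 * none of the three blocking conditions is set in the interruptibility
 * state read at the IRQ-window exit.
 */
static inline bool soft_nmi_window_open(uint32_t intr_state)
{
    return !(intr_state & (GUEST_INTR_STATE_STI |
                           GUEST_INTR_STATE_MOV_SS |
                           GUEST_INTR_STATE_NMI));
}
```

If the helper returns false, the NMI would stay pending and the IRQ window
would be requested again.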

Thanks.

-- 
regards
Yang, Sheng

>
> TiA,
> Jan

Re: [PATCH] KVM: VMX: Move private memory slot position

2008-09-11 Thread Yang, Sheng

On Thursday 04 September 2008 11:30:20 Yang, Sheng wrote:
> From ebe4ea311305d2910dcdcff2510662da0dc2c742 Mon Sep 17 00:00:00 2001
> From: Sheng Yang <[EMAIL PROTECTED]>
> Date: Thu, 4 Sep 2008 03:11:48 +0800
> Subject: [PATCH] KVM: VMX: Move private memory slot position
>
> PCI device assignment would map guest MMIO spaces as separate slot, so it
> is possible that the device has more than 2 MMIO spaces and overwrite
> current private memslot.
>
> The patch move private memory slot to the top of userspace visible memory
> slots.
>

Avi, these two?

(Oh, it's a little old, next time I will use git-send-email :) )
-- 
regards
Yang, Sheng

> Signed-off-by: Sheng Yang <[EMAIL PROTECTED]>
> ---
>  arch/x86/kvm/vmx.c |2 +-
>  arch/x86/kvm/vmx.h |5 +++--
>  2 files changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 004d24a..27c3bb7 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -2448,7 +2448,7 @@ static int vmx_set_tss_addr(struct kvm *kvm, unsigned
> int addr)
>  {
>   int ret;
>   struct kvm_userspace_memory_region tss_mem = {
> - .slot = 8,
> + .slot = TSS_PRIVATE_MEMSLOT,
>   .guest_phys_addr = addr,
>   .memory_size = PAGE_SIZE * 3,
>   .flags = 0,
> diff --git a/arch/x86/kvm/vmx.h b/arch/x86/kvm/vmx.h
> index 16b3cfb..dd0eea9 100644
> --- a/arch/x86/kvm/vmx.h
> +++ b/arch/x86/kvm/vmx.h
> @@ -356,8 +356,9 @@ enum vmcs_field {
>  #define IA32_FEATURE_CONTROL_LOCKED_BIT  0x1
>  #define IA32_FEATURE_CONTROL_VMXON_ENABLED_BIT   0x4
>
> -#define APIC_ACCESS_PAGE_PRIVATE_MEMSLOT 9
> -#define IDENTITY_PAGETABLE_PRIVATE_MEMSLOT   10
> +#define TSS_PRIVATE_MEMSLOT  (KVM_MEMORY_SLOTS + 0)
> +#define APIC_ACCESS_PAGE_PRIVATE_MEMSLOT (KVM_MEMORY_SLOTS + 1)
> +#define IDENTITY_PAGETABLE_PRIVATE_MEMSLOT   (KVM_MEMORY_SLOTS + 2)
>
>  #define VMX_NR_VPIDS (1 << 16)
>  #define VMX_VPID_EXTENT_SINGLE_CONTEXT   1
> --
> 1.5.4.5
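
The effect of the patch on slot numbering can be illustrated in isolation. In
this sketch the value 32 for KVM_MEMORY_SLOTS is only an assumed example and
the helper is hypothetical; the point is that slot ids available to userspace
stay below KVM_MEMORY_SLOTS, so private slots placed at and above it can no
longer be overwritten by the MMIO slots of an assigned device:

```c
#include <stdbool.h>

#define KVM_MEMORY_SLOTS 32  /* assumed example value, not taken from the patch */

/* Private slots as defined by the patch: stacked on top of the user slots. */
#define TSS_PRIVATE_MEMSLOT                 (KVM_MEMORY_SLOTS + 0)
#define APIC_ACCESS_PAGE_PRIVATE_MEMSLOT    (KVM_MEMORY_SLOTS + 1)
#define IDENTITY_PAGETABLE_PRIVATE_MEMSLOT  (KVM_MEMORY_SLOTS + 2)

/* Hypothetical helper: is this slot id in the userspace-visible range? */
static inline bool is_user_memslot(unsigned int slot)
{
    return slot < KVM_MEMORY_SLOTS;
}
```

The old hard-coded ids 8, 9 and 10 all fall inside the user range, which is
the overlap the patch removes.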



Re: [PATCH] KVM: Fix QEmu interrupted HLT emulation

2008-09-11 Thread Yang, Sheng

On Thursday 11 September 2008 16:50:37 Avi Kivity wrote:
> Yang, Sheng wrote:
> > From: Sheng Yang <[EMAIL PROTECTED]>
> > Date: Thu, 31 Jul 2008 13:43:58 +0800
> > Subject: [PATCH] KVM: Fix QEmu interrupted HLT emulation
> >
> > QEmu can interrupt VCPU from HLT emulation without setting mp_state to
> > MP_STATE_RUNNABLE, when it kick vcpus which are doing HLT emulation to
> > do something like "stop" or "info cpus". Here are two issues of this
> > behaviour:
> >
> > First, if vcpu exit to QEmu with MP_STATE_HALTED, it would keep in
> > this state later for vcpu_run(), which is eerie...
> >
> > Second, a practical problem: bios load AP boot up code to 0x1
> > (now), and AP is running HLT there. But later grub load it's stage2
> > code to the same address. Then if the halting vcpu was forced exit to
> > QEmu in grub, and come back for vcpu_run later, it can't execute HLT
> > instruction anymore, just because the bios code is not there,
> > and it would follow a piece of code of grub, which would cause
> > completely chaos...
> >
> > The second issue directly lead to guest crash or SMP linux can't boot
> > up AP later if we "stop" or "info cpus" in grub. Though I also sent a
> > patch for BIOS, it's necessary to get correct behavior here.
>
> Going over my backlog it looks like I missed this.  But I think
> Marcelo's rework obsoletes this patch?

Yeah, long ago... So I'll drop this patch too.

-- 
regards
Yang, Sheng


Re: [PATCH] KVM: VMX: Add PAT support for EPT

2008-09-10 Thread Yang, Sheng

On Tuesday 09 September 2008 22:36:22 Avi Kivity wrote:
> Avi Kivity wrote:
> > This appears to be a new feature?  My documentation (a bit old)
> > doesn't show it.  If so, we need a check to see that it is available.
>
> The check is actually there.
>
> If the feature is present, we need to expose it via
> KVM_GET_SUPPORTED_CPUID, and add save/restore support for the msr via
> msrs_to_save.

Yeah, it's a feature that comes with EPT. Thanks for the reminder! I will
update the patch soon.
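
Roughly, the two pieces Avi asks for look like the sketch below. It is only
an illustration under assumptions: the one-entry list stands in for the real
msrs_to_save array, both helpers are hypothetical, and the numbers are the
architectural ones (IA32_PAT is MSR 0x277, PAT is CPUID.01H:EDX bit 16).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MSR_IA32_CR_PAT  0x00000277u  /* IA32_PAT, per the SDM */
#define CPUID_1_EDX_PAT  (1u << 16)   /* CPUID.01H:EDX.PAT */

/* Illustrative list only; the real msrs_to_save holds many other entries. */
static const uint32_t msrs_to_save_example[] = {
    MSR_IA32_CR_PAT,
};

/* Hypothetical helper: is this MSR part of the save/restore list? */
static inline bool msr_is_saved(uint32_t index)
{
    size_t i;
    size_t n = sizeof(msrs_to_save_example) / sizeof(msrs_to_save_example[0]);

    for (i = 0; i < n; i++)
        if (msrs_to_save_example[i] == index)
            return true;
    return false;
}

/* Hypothetical helper: report PAT to the guest only if the host has it. */
static inline bool guest_may_see_pat(uint32_t host_cpuid_1_edx)
{
    return (host_cpuid_1_edx & CPUID_1_EDX_PAT) != 0;
}
```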

PS: The latest spec is available at
http://www.intel.com/products/processor/manuals/

It covers EPT, VPID and other new things in Nehalem. I will work on cleaning
up the code according to the latest spec soon (yeah, we also only got it a
few days ago...)

--
regards
Yang, Sheng

Re: [PATCH][REPOST] KVM: VMX: Always return 0 for clear_flush_young() when using EPT

2008-09-08 Thread Yang, Sheng

On Sunday 07 September 2008 21:31:54 Avi Kivity wrote:
> Why not to a
>
>   if (!shadow_access_mask)
>return 0;
>
> in the beginning?

Oops...
>
> I guess returning 'old' is safer than returning 'young'.

Yeah, I think so too, though it may cause thrashing.

How about this one?

--
From 250f978cf178fce89b9e5c68007307ccddbb2868 Mon Sep 17 00:00:00 2001
From: Sheng Yang <[EMAIL PROTECTED]>
Date: Mon, 8 Sep 2008 15:12:30 +0800
Subject: [PATCH] KVM: VMX: Always return old for clear_flush_young() when using EPT

Also discard the fake accessed and dirty bits of EPT.

Signed-off-by: Sheng Yang <[EMAIL PROTECTED]>
---
 arch/x86/kvm/mmu.c |    4 ++++
 arch/x86/kvm/vmx.c |    3 +--
 arch/x86/kvm/vmx.h |    2 --
 3 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index a87a11e..bce3e25 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -711,6 +711,10 @@ static int kvm_age_rmapp(struct kvm *kvm, unsigned long *rmapp)
 	u64 *spte;
 	int young = 0;
 
+	/* always return old for EPT */
+	if (!shadow_accessed_mask)
+		return 0;
+
 	spte = rmap_next(kvm, rmapp, NULL);
 	while (spte) {
 		int _young;
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 14671f4..2d6c770 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3558,8 +3558,7 @@ static int __init vmx_init(void)
 		kvm_mmu_set_base_ptes(VMX_EPT_READABLE_MASK |
 			VMX_EPT_WRITABLE_MASK |
 			VMX_EPT_DEFAULT_MT << VMX_EPT_MT_EPTE_SHIFT);
-		kvm_mmu_set_mask_ptes(0ull, VMX_EPT_FAKE_ACCESSED_MASK,
-				VMX_EPT_FAKE_DIRTY_MASK, 0ull,
+		kvm_mmu_set_mask_ptes(0ull, 0ull, 0ull, 0ull,
 				VMX_EPT_EXECUTABLE_MASK);
 		kvm_enable_tdp();
 	} else
diff --git a/arch/x86/kvm/vmx.h b/arch/x86/kvm/vmx.h
index 0c22e5f..41e8c10 100644
--- a/arch/x86/kvm/vmx.h
+++ b/arch/x86/kvm/vmx.h
@@ -370,8 +370,6 @@ enum vmcs_field {
 #define VMX_EPT_READABLE_MASK			0x1ull
 #define VMX_EPT_WRITABLE_MASK			0x2ull
 #define VMX_EPT_EXECUTABLE_MASK			0x4ull
-#define VMX_EPT_FAKE_ACCESSED_MASK		(1ull << 62)
-#define VMX_EPT_FAKE_DIRTY_MASK			(1ull << 63)
 
 #define VMX_EPT_IDENTITY_PAGETABLE_ADDR		0xfffbc000ul
 
-- 
1.5.6.5
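
As a reading aid for the patch above, here is a minimal user-space sketch of
the early-return idea: when no accessed mask is configured (which is how the
patch models EPT), aging reports every entry as old and leaves it untouched.
The names accessed_mask and age_entries are hypothetical stand-ins and a plain
array replaces the rmap chain, so this is not the kernel's code.

```c
#include <stdint.h>

/* Hypothetical stand-in for KVM's shadow_accessed_mask; 0 models EPT here. */
static uint64_t accessed_mask;

/*
 * Test and clear the accessed bit of every entry in a small array.
 * Returns 1 if any entry was young, 0 otherwise.
 */
static int age_entries(uint64_t *sptes, int n)
{
	int young = 0;

	/* No accessed bit available: always report "old". */
	if (!accessed_mask)
		return 0;

	for (int i = 0; i < n; i++) {
		if (sptes[i] & accessed_mask) {
			young = 1;
			sptes[i] &= ~accessed_mask;
		}
	}
	return young;
}
```

With accessed_mask left at 0 the entries are never modified; with bit 5 set
(the x86 accessed bit) the first call reports young and the second reports old.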

Re: Test with VT-d patches

2008-09-04 Thread Yang, Sheng

On Thursday 04 September 2008 17:12:28 Han, Weidong wrote:
> [EMAIL PROTECTED] wrote:
> > Hello,
> >
> > I try to explain the current state on my pc.
> >
> > 1) The patch KVM 1/2 was applied with the following change.
> >The files "drivers/pci/iova.h" and "drivers/pci/intel-iommu.h"
> >already exists. And I get both files rejected.
> >So I take the headers from the KVM 1/2 Patchfile and use these a
> >"iova.h" and "intel-iommu.h". Kernel compiles without errors.
> >
> > 2) What known bug is in the latest userspace patchfile? (Told by
> > Yang, Sheng)
>

I think Ben mentioned that the userspace patch refers to a kvm fd before it
is initialized correctly (the patch from 2008-08-26; I hope I haven't missed
any update on this...)

> in add_assigned_device(), you can simply comment out some lines between
> following #ifdef and #endif as follows.
>
> +#ifdef KVM_CAP_IOMMU
> //+   r = kvm_check_extension(kvm_context, KVM_CAP_IOMMU);
> //+   if (r)
> + assigned_devices[nr_assigned_devices].dma |=
> + KVM_DEV_ASSIGN_ENABLE_IOMMU;
> +
> //+   r = get_param_value(dma, sizeof dma, "dma", arg);
> //+   if (r && !strncmp(dma, "none", 4)) {
> //+   assigned_devices[nr_assigned_devices].dma &=
> //+   ~KVM_DEV_ASSIGN_ENABLE_IOMMU;
> + }
> +#endif
>

Yeah, use this workaround.

> Randy (Weidong)
>
> >> Hello,
> >>
> >> * On Wednesday 03 September 2008 14:07:36 [EMAIL PROTECTED] wrote:
> >>> Hi,
> >>>
> >>> i make some more tests with
> >>> 1) an old APCI1500/PCI card no linux driver support -> so i don't
> >>> need to unload the module. 2) my second network-card Realtek 10/100
> >>> MBit.
> >>>
> >>> Both don't work at all. Only my first gigabit-onboard-network-card
> >>> starts.
> >>>
> >>> Here the output from userspace/dmesg:
> >>>
> >>>
> >>> 1) // Applied Micro Circuits Corp. APCI1500 Signal processing
> >>> controller
> >>>
> >>> Warning: No DNS servers found
> >>> Registered host PCI device 03:00.0 ("03:00.0") as guest device
> >>> 00:03.0 assigned_dev_update_irq: Input/output error
> >>> assigned_dev_update_irq: Input/output error
> >>
> >> This means the devices shares the irq with some other device in the
> >> system. See the "lspci -v" output for details.
> >>
> >>> 2) // 03:02.0 Ethernet controller: Realtek Semiconductor Co., Ltd.
> >>> RTL-8139/8139C/8139C+ (rev 10) Registered host PCI device 03:02.0
> >>> ("03:02.0") as guest device 00:03.0 assigned_dev_update_irq:
> >>> Input/output error
> >>>
> >>> [  369.195971] pci :03:02.0: PCI INT A -> GSI 18 (level, low)
> >>> -> IRQ 18 [  369.326026] kvm_vm_ioctl_assign_irq: couldn't allocate
> >>> irq for pv device
> >>
> >> Same error.
> >>
> >>> Should I update the BIOS if possible/available?
> >>
> >> That's not necessary. We don't support assigning devices that share
> >> the irq on
> >>
> >> the host with some other device. You can try inserting one device at
> >> a time and try different PCI slots.
> >>
> >>> Could it be a problem with the already assigned patches? (See the
> >>> first E-Mail :The VTD [PATCH1/2] seems already be applied. )
> >>
> >> No patches have already been applied to any tree. You'll definitely
> >> have to apply the 1/2 patch as well.
> >
> > Ok. On both PCI-Cards (1) APCI1500 Signal processing controller and
> > (2) Realtek NIC
> > the IRQ is shared with my USB-UHCI-Controller.
> > On this PC there is no PS2-Connector (USB-Mouse/USB-Keyboard).
> > When I unload the usb module on the host, I can not handle the system
> > anymore.

Well, another option is to use VNC from another machine.

> >
> > I can start qemu/kvm remote over a serial-console. But then I cannot
> > use the guest system without keyboard/mouse. (login and so)
> >
> > "Han, Weidong" told me to use IRQF_SHARED in request_irq(). In which
> > code?

kvm_vm_ioctl_assign_irq() in x86.c. But since you are using a USB keyboard
and mouse, I think it would cause chaos? Maybe you can tell the USB 2.0
controller apart from the USB 1.1 one, and just unload the 2.0 driver. The
USB 1.1 controller is UHCI while 2.0 is EHCI, and the drivers can be built
as modules.
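
Since the thread keeps coming back to whether the assigned device shares its
host IRQ, here is a small sketch of one way to check that from a line of
/proc/interrupts. It assumes the usual layout, where the handler names come
last on the line and are separated by ", "; the helper name
irq_line_is_shared is hypothetical.

```c
#include <stdio.h>
#include <string.h>

/*
 * Return 1 if the /proc/interrupts line for `irq` lists more than one
 * handler, 0 if it lists at most one, and -1 if the line belongs to a
 * different interrupt (or to a non-numeric row such as "NMI:").
 */
static int irq_line_is_shared(const char *line, int irq)
{
	int n;

	if (sscanf(line, " %d:", &n) != 1 || n != irq)
		return -1;
	return strstr(line, ", ") != NULL;
}
```

A caller would read /proc/interrupts line by line with fgets() and pass each
line together with the IRQ number reported by "lspci -v" for the device.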

> >
> > Both tests are only made to get an idea how it works with other
> > devices.
> >
> > My primary efforts are to use my onboard Gigabit with VT-d direct on
> > the Windows guest-system. This device don't share the IRQ with an
> > other device. But I have the problems with: 1) slow down the complete
> > system and generate this Unbalanced IRQ 21 messages. 2) Packet
> > statistic under windows shows on this device no package traffic in
> > live-network. (no ping etc. possible)

No clue, assuming you have made sure you've got the latest version of the
patch... I think it was posted by Amit on 2008-08-26.

BTW: Can you try with a new Linux guest if possible?

-- 
regards
Yang, Sheng

> >
> > Thanks for your hints.
> >
> > Gregor

[PATCH] KVM: VMX: Move private memory slot position

2008-09-03 Thread Yang, Sheng

From ebe4ea311305d2910dcdcff2510662da0dc2c742 Mon Sep 17 00:00:00 2001
From: Sheng Yang <[EMAIL PROTECTED]>
Date: Thu, 4 Sep 2008 03:11:48 +0800
Subject: [PATCH] KVM: VMX: Move private memory slot position

PCI device assignment maps each guest MMIO space as a separate slot, so a
device with more than 2 MMIO spaces may overwrite the current private
memslots.

This patch moves the private memory slots above the userspace-visible memory
slots.

Signed-off-by: Sheng Yang <[EMAIL PROTECTED]>
---
 arch/x86/kvm/vmx.c |    2 +-
 arch/x86/kvm/vmx.h |    5 +++--
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 004d24a..27c3bb7 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2448,7 +2448,7 @@ static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr)
 {
 	int ret;
 	struct kvm_userspace_memory_region tss_mem = {
-		.slot = 8,
+		.slot = TSS_PRIVATE_MEMSLOT,
 		.guest_phys_addr = addr,
 		.memory_size = PAGE_SIZE * 3,
 		.flags = 0,
diff --git a/arch/x86/kvm/vmx.h b/arch/x86/kvm/vmx.h
index 16b3cfb..dd0eea9 100644
--- a/arch/x86/kvm/vmx.h
+++ b/arch/x86/kvm/vmx.h
@@ -356,8 +356,9 @@ enum vmcs_field {
 #define IA32_FEATURE_CONTROL_LOCKED_BIT		0x1
 #define IA32_FEATURE_CONTROL_VMXON_ENABLED_BIT	0x4
 
-#define APIC_ACCESS_PAGE_PRIVATE_MEMSLOT	9
-#define IDENTITY_PAGETABLE_PRIVATE_MEMSLOT	10
+#define TSS_PRIVATE_MEMSLOT			(KVM_MEMORY_SLOTS + 0)
+#define APIC_ACCESS_PAGE_PRIVATE_MEMSLOT	(KVM_MEMORY_SLOTS + 1)
+#define IDENTITY_PAGETABLE_PRIVATE_MEMSLOT	(KVM_MEMORY_SLOTS + 2)
 
 #define VMX_NR_VPIDS				(1 << 16)
 #define VMX_VPID_EXTENT_SINGLE_CONTEXT		1
-- 
1.5.4.5
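
To illustrate why the new numbering cannot collide with slots used for
assigned-device MMIO, here is a small sketch. It assumes KVM_MEMORY_SLOTS is
32, matching the libkvm limit discussed in this thread; the helper
slot_is_user_visible is hypothetical and only serves the illustration.

```c
/* Assumed value: 32 user-visible slots, matching the libkvm limit in this thread. */
#define KVM_MEMORY_SLOTS 32

#define TSS_PRIVATE_MEMSLOT			(KVM_MEMORY_SLOTS + 0)
#define APIC_ACCESS_PAGE_PRIVATE_MEMSLOT	(KVM_MEMORY_SLOTS + 1)
#define IDENTITY_PAGETABLE_PRIVATE_MEMSLOT	(KVM_MEMORY_SLOTS + 2)

/* Hypothetical helper: a slot id is user-visible only below KVM_MEMORY_SLOTS. */
static int slot_is_user_visible(unsigned int slot)
{
	return slot < KVM_MEMORY_SLOTS;
}
```

The old hard-coded ids 8, 9 and 10 fall inside the user-visible range, which
is where the overwrite could happen; the new ids start right after it.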

[PATCH] kvm: libkvm: Modify userspace memory slot limit to 32

2008-09-03 Thread Yang, Sheng

From f95d06a16a820d4fff59ccc88b422f7d051e7330 Mon Sep 17 00:00:00 2001
From: Sheng Yang <[EMAIL PROTECTED]>
Date: Thu, 4 Sep 2008 03:21:38 +0800
Subject: [PATCH] kvm: libkvm: Modify userspace memory slot limit to 32

Keep it consistent with kernel space.

Signed-off-by: Sheng Yang <[EMAIL PROTECTED]>
---
 libkvm/kvm-common.h |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/libkvm/kvm-common.h b/libkvm/kvm-common.h
index 7092085..233abfe 100644
--- a/libkvm/kvm-common.h
+++ b/libkvm/kvm-common.h
@@ -19,7 +19,7 @@
 /* FIXME: share this number with kvm */
 /* FIXME: or dynamically alloc/realloc regions */
 #ifndef __s390__
-#define KVM_MAX_NUM_MEM_REGIONS 8u
+#define KVM_MAX_NUM_MEM_REGIONS 32u
 #define MAX_VCPUS 16
 #else
 #define KVM_MAX_NUM_MEM_REGIONS 1u
--
1.5.4.5


[PATCH][REPOST] KVM: VMX: Always return 0 for clear_flush_young() when using EPT

2008-09-03 Thread Yang, Sheng

Hi Avi

It seems something is wrong with my git-send-email, and I couldn't get my
post back from kvm@vger.kernel.org, so I'm resending it. Sorry for the
inconvenience.

Thanks!
-- 
From 23229946e717294091bf54cee704fb3b1cd4167d Mon Sep 17 00:00:00 2001
From: Sheng Yang <[EMAIL PROTECTED]>
Date: Mon, 1 Sep 2008 13:22:09 +0800
Subject: [PATCH] KVM: VMX: Always return 0 for clear_flush_young() when using EPT

Also discard the fake accessed and dirty bits of EPT.

Signed-off-by: Sheng Yang <[EMAIL PROTECTED]>
---
 arch/x86/kvm/mmu.c |   15 +++++++++++----
 arch/x86/kvm/vmx.c |    3 +--
 arch/x86/kvm/vmx.h |    2 --
 3 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index f33c594..e437985 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -716,10 +716,17 @@ static int kvm_age_rmapp(struct kvm *kvm, unsigned long *rmapp)
 		int _young;
 		u64 _spte = *spte;
 		BUG_ON(!(_spte & PT_PRESENT_MASK));
-		_young = _spte & PT_ACCESSED_MASK;
-		if (_young) {
-			young = 1;
-			clear_bit(PT_ACCESSED_SHIFT, (unsigned long *)spte);
+
+		/* always return old for EPT */
+		if (!shadow_accessed_mask)
+			_young = 0;
+		else {
+			_young = _spte & PT_ACCESSED_MASK;
+			if (_young) {
+				young = 1;
+				clear_bit(PT_ACCESSED_SHIFT,
+					  (unsigned long *)spte);
+			}
 		}
 		spte = rmap_next(kvm, rmapp, spte);
 	}
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 81c121c..d637897 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3557,8 +3557,7 @@ static int __init vmx_init(void)
 		kvm_mmu_set_base_ptes(VMX_EPT_READABLE_MASK |
 			VMX_EPT_WRITABLE_MASK |
 			VMX_EPT_DEFAULT_MT << VMX_EPT_MT_EPTE_SHIFT);
-		kvm_mmu_set_mask_ptes(0ull, VMX_EPT_FAKE_ACCESSED_MASK,
-				VMX_EPT_FAKE_DIRTY_MASK, 0ull,
+		kvm_mmu_set_mask_ptes(0ull, 0ull, 0ull, 0ull,
 				VMX_EPT_EXECUTABLE_MASK);
 		kvm_enable_tdp();
 	} else
diff --git a/arch/x86/kvm/vmx.h b/arch/x86/kvm/vmx.h
index 0c22e5f..41e8c10 100644
--- a/arch/x86/kvm/vmx.h
+++ b/arch/x86/kvm/vmx.h
@@ -370,8 +370,6 @@ enum vmcs_field {
 #define VMX_EPT_READABLE_MASK			0x1ull
 #define VMX_EPT_WRITABLE_MASK			0x2ull
 #define VMX_EPT_EXECUTABLE_MASK			0x4ull
-#define VMX_EPT_FAKE_ACCESSED_MASK		(1ull << 62)
-#define VMX_EPT_FAKE_DIRTY_MASK			(1ull << 63)
 
 #define VMX_EPT_I

Re: Test with VT-d patches

2008-09-03 Thread Yang, Sheng

On Wednesday 03 September 2008 16:37:36 [EMAIL PROTECTED] wrote:
> Hi,
>
> i make some more tests with
> 1) an old APCI1500/PCI card no linux driver support -> so i don't need to
> unload the module. 2) my second network-card Realtek 10/100 MBit.
>
> Both don't work at all. Only my first gigabit-onboard-network-card starts.
>
> Here the output from userspace/dmesg:

Hi Gregor

I have a question here: which version of the patch did you use? In fact,
Amit's latest userspace patch can't work due to a bug, so it seems you didn't
use Amit's latest userspace patch? What about the kernel part?

-- 
regards
Yang, Sheng
> 1) // Applied Micro Circuits Corp. APCI1500 Signal processing controller
>
> Warning: No DNS servers found
> Registered host PCI device 03:00.0 ("03:00.0") as guest device 00:03.0
> assigned_dev_update_irq: Input/output error
> assigned_dev_update_irq: Input/output error
> *** glibc detected *** qemu-system-x86_64: double free or corruption (out):
> 0x013edfb0 ***
>
> === Backtrace: =
> /lib/libc.so.6[0x7f18c6c609e8]
> /lib/libc.so.6(cfree+0x76)[0x7f18c6c63036]
> qemu-system-x86_64[0x470d0c]
> qemu-system-x86_64[0x418104]
> qemu-system-x86_64[0x469b62]
> qemu-system-x86_64[0x417fed]
> qemu-system-x86_64[0x4f937d]
> qemu-system-x86_64[0x5221ff]
> qemu-system-x86_64[0x522af3]
> qemu-system-x86_64[0x4f9906]
> qemu-system-x86_64[0x4f9c20]
> /lib/libpthread.so.0[0x7f18c76353ea]
> /lib/libc.so.6(clone+0x6d)[0x7f18c6ccdb9d]
> === Memory map: 
> 0040-00595000 r-xp  08:03 719950
> /usr/local/bin/qemu-system-x86_64 00794000-00795000 r--p 00194000 08:03
> 719950 /usr/local/bin/qemu-system-x86_64
> 00795000-007a5000 rw-p 00195000 08:03 719950
> /usr/local/bin/qemu-system-x86_64 007a5000-00b9b000 rw-p 007a5000 00:00 0
> 00b9b000-00b9c000 rwxp 00b9b000 00:00 0
> 00b9c000-00bb7000 rw-p 00b9c000 00:00 0
> 012e7000-01613000 rw-p 012e7000 00:00 0 
> [heap] 4073-48c31000 rwxp 4073 00:00 0
> 48c31000-48c32000 ---p 48c31000 00:00 0
> 48c32000-49432000 rw-p 48c32000 00:00 0
> 49432000-49433000 ---p 49432000 00:00 0
> 49433000-49436000 rw-p 49433000 00:00 0
> 7f189400-7f1894021000 rw-p 7f189400 00:00 0
> 7f1894021000-7f189800 ---p 7f1894021000 00:00 0
> 7f1899ae-7f1899af6000 r-xp  08:03 908548
> /lib/libgcc_s.so.1 7f1899af6000-7f1899cf5000 ---p 00016000 08:03 908548
> /lib/libgcc_s.so.1 7f1899cf5000-7f1899cf6000 r--p 00015000
> 08:03 908548 /lib/libgcc_s.so.1
> 7f1899cf6000-7f1899cf7000 rw-p 00016000 08:03 908548
> /lib/libgcc_s.so.1 7f1899d04000-7f1899e1e000 rw-s  00:08 589836
> /SYSV (deleted) 7f1899e1e000-7f189ae1f000 rw-p
> 7f1899e1e000 00:00 0
> 7f189af19000-7f189af1e000 r-xp  08:03 706832
> /usr/lib/libXfixes.so.3.1.0 7f189af1e000-7f189b11d000 ---p 5000 08:03
> 706832 /usr/lib/libXfixes.so.3.1.0
> 7f189b11d000-7f189b11e000 rw-p 4000 08:03 706832
> /usr/lib/libXfixes.so.3.1.0 7f189b11e000-7f189b127000 r-xp  08:03
> 707183 /usr/lib/libXcursor.so.1.0.2
> 7f189b127000-7f189b327000 ---p 9000 08:03 707183
> /usr/lib/libXcursor.so.1.0.2 7f189b327000-7f189b328000 rw-p 9000 08:03
> 707183 /usr/lib/libXcursor.so.1.0.2
> 7f189b328000-7f189b375000 rw-p 7f18c7f13000 00:00 0
> 7f189b3a9000-7f189b3b r-xp  08:03 707187
> /usr/lib/libXrandr.so.2.1.0 7f189b3b-7f189b5af000 ---p 7000 08:03
> 707187 /usr/lib/libXrandr.so.2.1.0
> 7f189b5af000-7f189b5b rw-p 6000 08:03 707187
> /usr/lib/libXrandr.so.2.1.0 7f189b5b-7f189b5b9000 r-xp  08:03
> 707141 /usr/lib/libXrender.so.1.3.0
> 7f189b5b9000-7f189b7b8000 ---p 9000 08:03 707141
> /usr/lib/libXrender.so.1.3.0 7f189b7b8000-7f189b7b9000 r--p 8000 08:03
> 707141 /usr/lib/libXrender.so.1.3.0
> 7f189b7b9000-7f189b7ba000 rw-p 9000 08:03 707141
> /usr/lib/libXrender.so.1.3.0 7f189b7ba000-7f189b7ca000 r-xp  08:03
> 706836 /usr/lib/libXext.so.6.4.0
> 7f189b7ca000-7f189b9ca000 ---p 0001 08:03 706836
> /usr/lib/libXext.so.6.4.0 7f189b9ca000-7f189b9cc000 rw-p 0001 08:03
> 706836 /usr/lib/libXext.so.6.4.0
> 7f189b9cc000-7f189b9d1000 r-xp  08:03 706262
> /usr/lib/libXdmcp.s

Re: Test with VT-d patches

2008-09-02 Thread Yang, Sheng

On Wednesday 03 September 2008 02:40:06 [EMAIL PROTECTED] wrote:
> Hi,
>
> >On Tue, Sep 2, 2008 at 4:17 PM,  <[EMAIL PROTECTED]> wrote:
> >> Hi,
> >>
> >> here comes a small part of the dmesg output. Qemu/KVM produces now a CPU
> >
> >usage
> >
> >> of about 90%.
> >>
> >>
> >> Sep  2 11:27:35 ubuntu klogd: [  335.057707] [ cut here
> >
> >]
> >
> >> Sep  2 11:27:35 ubuntu klogd: [  335.057711] WARNING: at
> >
> >kernel/irq/manage.c:180 __enable_irq+0x34/0x80()
> >
> >> Sep  2 11:27:35 ubuntu klogd: [  335.057713] Unbalanced enable for IRQ
> >> 21
> >
> >[...]
> >
> >> This messages comes endless.
> >> Something with IRQs?
> >
> >Hum, seems that interrupt has already been enabled... did you load the
> >driver for the NIC in the host? With pass-through the device is
> >"owned" by the guest.
>
> No.
> The driver was not loaded.
> It is not possible to start qemu with a pci-device used by the host.
> (I try this with the second network device.)
>
> Can this interrupt be shared with an other IRQ?

Oh, no, not currently...

The device's IRQ must not be shared with other devices; we haven't
implemented the shared IRQ logic yet, but will soon.
-- 
regards
Yang, Sheng
>
> Gregor



Re: Test with VT-d patches

2008-09-02 Thread Yang, Sheng

On Tuesday 02 September 2008 16:44:18 [EMAIL PROTECTED] wrote:
> Hi,
>
> i am interested in the use of the new VT-d hardware feature.
> My Dell-PC "OPTIPLEX" is capable for this.
> The Linux-System was an Ubuntu-8.10 (AMD64) with the current
> Linux-Kernel from the KVM-Kernel GIT repository. (2.6.27-rc4)
> To use VT-d i download kvm-74 and take the patches from "Amit Shah".
> 1) The KVM/userspace [PATCH1/1] was applied without errors.
> 2) The VTD [PATCH1/2] seems already be applied.
> 3) The VTD [PATCH2/2] was applied without errors.
>
> Now I use the command line option -pcidevice dev=00:03.19 to pass the
> Intel Pro Gigabit Network Device to my WindowsXP Guest-System.
>
> Qemu told me something like: "passing 00:03.19 as device 00:03.00 to the
> guest system". Windows starts normally fast. But then its slow rapidly
> down. My mouse stops for about 10sec and then goes again for 1sec.
>
> Windows remember the new hardware correctly. And after i install the new
> driver, the system seems to go a little bit faster.
>
> But the problem was, that no ping or other network packages was
> send/received.
>
> What can i do to find the problem? (Debugging?)

Hi Gregor

Can you have a look at your dmesg and post it here? I think we can get some
clues from it.

Thanks!
--
regards
Yang, Sheng
>
> Regards,
>
> Gregor Glomm

[PATCH] KVM: MMU: Fix overflow of SHADOW_PT_INDEX with EPT in 32pae

2008-09-01 Thread Yang, Sheng

From d04ca5ce11171da3ba0f3523767cdb4f35731476 Mon Sep 17 00:00:00 2001
From: Sheng Yang <[EMAIL PROTECTED]>
Date: Mon, 1 Sep 2008 17:28:59 +0800
Subject: [PATCH] KVM: MMU: Fix overflow of SHADOW_PT_INDEX with EPT in 32pae

EPT uses 4 levels (48 bits) by default even on 32-bit PAE hosts, but the
virtual address there has only 32 bits. This results in SHADOW_PT_INDEX()
overflowing when it tries to fetch the level 4 index.

Fix it by always extending the virtual address to 64 bits.

Signed-off-by: Sheng Yang <[EMAIL PROTECTED]>
---
 arch/x86/kvm/mmu.c |    8 +++++++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index f33c594..8ca9aad 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -943,6 +943,7 @@ static int walk_shadow(struct kvm_shadow_walk *walker,
 	int level;
 	int r;
 	u64 *sptep;
+	u64 ext_addr = addr;
 	unsigned index;
 
 	shadow_addr = vcpu->arch.mmu.root_hpa;
@@ -954,7 +955,12 @@ static int walk_shadow(struct kvm_shadow_walk *walker,
 	}
 
 	while (level >= PT_PAGE_TABLE_LEVEL) {
-		index = SHADOW_PT_INDEX(addr, level);
+		/*
+		 * SHADOW_PT_INDEX is overflow with EPT in 32pae mode. Because
+		 * EPT is 4 level (48bits) by default, but the addr got only 32
+		 * bits. Extend addr to 64 bit.
+		 */
+		index = SHADOW_PT_INDEX(ext_addr, level);
 		sptep = ((u64 *)__va(shadow_addr)) + index;
 		r = walker->entry(walker, vcpu, addr, sptep, level);
 		if (r)
-- 
1.5.4.5
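
The overflow described in the commit message can be sketched in plain C. The
level-4 index needs a right shift by 39 bits, which is not well defined on a
32-bit value, so the address is widened first, as the patch does with
ext_addr. PAGE_SHIFT and PT64_LEVEL_BITS below mirror the usual x86 values and
are assumptions of this sketch; pt_index is a hypothetical helper, not the
kernel macro.

```c
#include <stdint.h>

#define PAGE_SHIFT      12
#define PT64_LEVEL_BITS 9

/*
 * Page-table index of `addr` at `level` (1 = lowest level).
 * Widening to 64 bits first keeps the level-4 shift (39 bits) well defined
 * even though the caller's address type is only 32 bits wide.
 */
static unsigned int pt_index(uint32_t addr, int level)
{
	uint64_t ext_addr = addr;
	int shift = PAGE_SHIFT + (level - 1) * PT64_LEVEL_BITS;

	return (unsigned int)((ext_addr >> shift) & ((1u << PT64_LEVEL_BITS) - 1));
}
```

For any 32-bit address the level-4 index comes out as 0, which is the
expected result once the address has been extended.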

Re: [PATCH] KVM: MMU: Add shadow_accessed_shift

2008-08-31 Thread Yang, Sheng

On Sunday 31 August 2008 23:13:54 Avi Kivity wrote:
> [EMAIL PROTECTED] wrote:
> > From: Sheng Yang <[EMAIL PROTECTED]>
> >
> > We use a "fake" A/D bit for EPT, to keep epte behaviour consistent with
> > shadow spte. But it's not that good for MMU notifier. Now we can only
> > expect return young=0 for clean_flush_young() in most condition.
>
> Perhaps we are better off setting shadow_accessed_mask to 0 for ept, and
> adding a test for clear_flush_young()?  This is the only place that
> needs adjusting as far as I can tell.
>
> I don't see what having a fake accessed bit buys us, and I'd like the
> patch to be as small as possible, since it needs to go into
> 2.6.26-stable and 2.6.27-rc.

Though I still think the fake accessed bit makes the logic consistent here,
here is the patch following your comment. But I think it may not be necessary
for 2.6.26-stable?


From: Sheng Yang <[EMAIL PROTECTED]>
Date: Mon, 1 Sep 2008 13:22:09 +0800
Subject: [PATCH] KVM: VMX: Always return 0 for clear_flush_young() when using EPT

Also discard the fake accessed and dirty bits of EPT.

Signed-off-by: Sheng Yang <[EMAIL PROTECTED]>
---
 arch/x86/kvm/mmu.c |   15 +++++++++++----
 arch/x86/kvm/vmx.c |    3 +--
 arch/x86/kvm/vmx.h |    2 --
 3 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index f33c594..e437985 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -716,10 +716,17 @@ static int kvm_age_rmapp(struct kvm *kvm, unsigned long *rmapp)
 		int _young;
 		u64 _spte = *spte;
 		BUG_ON(!(_spte & PT_PRESENT_MASK));
-		_young = _spte & PT_ACCESSED_MASK;
-		if (_young) {
-			young = 1;
-			clear_bit(PT_ACCESSED_SHIFT, (unsigned long *)spte);
+
+		/* always return old for EPT */
+		if (!shadow_accessed_mask)
+			_young = 0;
+		else {
+			_young = _spte & PT_ACCESSED_MASK;
+			if (_young) {
+				young = 1;
+				clear_bit(PT_ACCESSED_SHIFT,
+					  (unsigned long *)spte);
+			}
 		}
 		spte = rmap_next(kvm, rmapp, spte);
 	}
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 81c121c..d637897 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3557,8 +3557,7 @@ static int __init vmx_init(void)
 		kvm_mmu_set_base_ptes(VMX_EPT_READABLE_MASK |
 			VMX_EPT_WRITABLE_MASK |
 			VMX_EPT_DEFAULT_MT << VMX_EPT_MT_EPTE_SHIFT);
-		kvm_mmu_set_mask_ptes(0ull, VMX_EPT_FAKE_ACCESSED_MASK,
-				VMX_EPT_FAKE_DIRTY_MASK, 0ull,
+		kvm_mmu_set_mask_ptes(0ull, 0ull, 0ull, 0ull,
 				VMX_EPT_EXECUTABLE_MASK);
 		kvm_enable_tdp();
 	} else
diff --git a/arch/x86/kvm/vmx.h b/arch/x86/kvm/vmx.h
index 0c22e5f..41e8c10 100644
--- a/arch/x86/kvm/vmx.h
+++ b/arch/x86/kvm/vmx.h
@@ -370,8 +370,6 @@ enum vmcs_field {
 #define VMX_EPT_READABLE_MASK			0x1ull
 #define VMX_EPT_WRITABLE_MASK			0x2ull
 #define VMX_EPT_EXECUTABLE_MASK			0x4ull
-#define VMX_EPT_FAKE_ACCESSED_MASK		(1ull << 62)
-#define VMX_EPT_FAKE_DIRTY_MASK			(1ull << 63)

 #define VMX_EPT_IDENTITY_PAGETABLE_ADDR		0xfffbc000ul

--
1.5.4.5




[PATCH] KVM: MMU: Add shadow_accessed_shift

2008-08-29 Thread Yang, Sheng

From 3a2cc947a656a6eb4e815e64a44cb3e77a162a89 Mon Sep 17 00:00:00 2001
From: Sheng Yang <[EMAIL PROTECTED]>
Date: Fri, 29 Aug 2008 14:02:29 +0800
Subject: [PATCH] KVM: MMU: Add shadow_accessed_shift


Signed-off-by: Sheng Yang <[EMAIL PROTECTED]>
---
 arch/x86/kvm/mmu.c |9 ++---
 1 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 3008279..0997d82 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -151,6 +151,7 @@ static u64 __read_mostly shadow_nx_mask;
 static u64 __read_mostly shadow_x_mask;	/* mutual exclusive with nx_mask */
 static u64 __read_mostly shadow_user_mask;
 static u64 __read_mostly shadow_accessed_mask;
+static u16 __read_mostly shadow_accessed_shift;
 static u64 __read_mostly shadow_dirty_mask;
 
 void kvm_mmu_set_nonpresent_ptes(u64 trap_pte, u64 notrap_pte)
@@ -171,6 +172,8 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
 {
 	shadow_user_mask = user_mask;
 	shadow_accessed_mask = accessed_mask;
+	shadow_accessed_shift = find_first_bit((unsigned long *)&accessed_mask,
+	   sizeof(accessed_mask));
 	shadow_dirty_mask = dirty_mask;
 	shadow_nx_mask = nx_mask;
 	shadow_x_mask = x_mask;
@@ -709,10 +712,10 @@ static int kvm_age_rmapp(struct kvm *kvm, unsigned long *rmapp)
 		int _young;
 		u64 _spte = *spte;
 		BUG_ON(!(_spte & PT_PRESENT_MASK));
-		_young = _spte & PT_ACCESSED_MASK;
+		_young = _spte & shadow_accessed_mask;
 		if (_young) {
 			young = 1;
-			clear_bit(PT_ACCESSED_SHIFT, (unsigned long *)spte);
+			clear_bit(shadow_accessed_shift, (unsigned long *)spte);
 		}
 		spte = rmap_next(kvm, rmapp, spte);
 	}
@@ -1785,7 +1788,7 @@ static void kvm_mmu_access_page(struct kvm_vcpu *vcpu, gfn_t gfn)
 	&& shadow_accessed_mask
 	&& !(*spte & shadow_accessed_mask)
 	&& is_shadow_present_pte(*spte))
-		set_bit(PT_ACCESSED_SHIFT, (unsigned long *)spte);
+		set_bit(shadow_accessed_shift, (unsigned long *)spte);
 }
 
 void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
-- 
1.5.4.5
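
One detail worth checking when reading this patch: the kernel's find_first_bit() takes its size argument in bits, so passing sizeof(accessed_mask) (8 for a u64) searches only bits 0 to 7. That is enough for the x86 accessed bit (bit 5), but it would not reach a mask such as 1ull << 62. The sketch below is a userspace illustration only; mask_to_shift is a hypothetical helper, and it follows the convention of returning the search size when no bit is found.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: index of the lowest set bit among the first
 * nbits bits of mask, or nbits if none of them is set.
 */
static unsigned int mask_to_shift(uint64_t mask, unsigned int nbits)
{
	unsigned int i;

	assert(nbits <= 64);
	for (i = 0; i < nbits; i++)
		if (mask & (1ull << i))
			return i;
	return nbits;	/* nothing found in the searched range */
}
```

With nbits set to 8, a mask of 1ull << 5 yields 5 but a mask of 1ull << 62 yields 8 (not found); with nbits set to 64 the second case yields 62. Passing sizeof(accessed_mask) * 8 would be the equivalent full-width search.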

Re: [ANNOUNCE] kvm-73 release

2008-08-21 Thread Yang, Sheng

On Thursday 21 August 2008 09:45:29 Yang, Sheng wrote:
> On Thursday 21 August 2008 00:43:22 Muli Ben-Yehuda wrote:
> > On Wed, Aug 20, 2008 at 06:04:26PM +0300, Avi Kivity wrote:
> > > Other noteworthy changes: speedups of both virtio-net and qcow2
> > > with cache=off.  Two important works-in-progress: device
> > > assignment (not usable yet, as dma support is still missing)
> > > and the real-mode emulation framework.
> >
> > Hi Avi,
> >
> > The latest version of the VT-d patches for device assignment was
> > posted by Ben on Aug 7th[1][2]. There were no substantial review
> > comments, so... what are we waiting for?
> >
> > [1] http://www.mail-archive.com/kvm@vger.kernel.org/msg02554.html
> > [2] http://www.mail-archive.com/kvm@vger.kernel.org/msg02553.html
>
> Hi Muli
>
> The next step should be to send the first patch to linux-pci (and CC
> Jesse Barnes and others) for review, since it involves DMAR
> modifications... I think we mentioned that in recent comments, and I
> meant to remind Ben about it, but forgot...

Hi Ben

I think you can CC Jesse Barnes, David Woodhouse
([EMAIL PROTECTED], current VT-d maintainer) and
Mark Gross (former VT-d maintainer).

-- 
regards
Yang, Sheng


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [ANNOUNCE] kvm-73 release

2008-08-20 Thread Yang, Sheng

On Thursday 21 August 2008 00:43:22 Muli Ben-Yehuda wrote:
> On Wed, Aug 20, 2008 at 06:04:26PM +0300, Avi Kivity wrote:
> > Other noteworthy changes: speedups of both virtio-net and qcow2
> > with cache=off.  Two important works-in-progress: device
> > assignment (not usable yet, as dma support is still missing) and
> > the real-mode emulation framework.
>
> Hi Avi,
>
> The latest version of the VT-d patches for device assignment was
> posted by Ben on Aug 7th[1][2]. There were no substantial review
> comments, so... what are we waiting for?
>
> [1] http://www.mail-archive.com/kvm@vger.kernel.org/msg02554.html
> [2] http://www.mail-archive.com/kvm@vger.kernel.org/msg02553.html
>
Hi Muli

The next step should be to send the first patch to linux-pci (and CC
Jesse Barnes and others) for review, since it involves DMAR
modifications... I think we mentioned that in recent comments, and I
meant to remind Ben about it, but forgot...

--
regards
Yang, Sheng

[PATCH] KVM: Fix wrong KVM_GET_LAPIC

2008-08-17 Thread Yang, Sheng

From a8ca7dd8f5fe0125e7b7d0a21f5caddacd754911 Mon Sep 17 00:00:00 2001
From: Sheng Yang <[EMAIL PROTECTED]>
Date: Mon, 18 Aug 2008 11:04:22 +0800
Subject: [PATCH] KVM: Fix wrong KVM_GET_LAPIC

This bug caused migration to fail after recent commits.

Signed-off-by: Sheng Yang <[EMAIL PROTECTED]>
---
 arch/x86/kvm/x86.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ee005a6..4a03375 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1555,7 +1555,7 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
 		if (r)
 			goto out;
 		r = -EFAULT;
-		if (copy_to_user(argp, &lapic, sizeof lapic))
+		if (copy_to_user(argp, lapic, sizeof(struct kvm_lapic_state)))
 			goto out;
 		r = 0;
 		break;
-- 
1.5.4.5
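
The one-line fix is easier to follow with a small userspace sketch. It assumes, as the diff suggests but this message does not show, that lapic had become a pointer to an allocated struct: in that case &lapic and sizeof lapic describe the pointer variable, not the state it points to. Here struct lapic_state, copy_out and get_lapic are hypothetical stand-ins, with memcpy() in place of copy_to_user().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for struct kvm_lapic_state. */
struct lapic_state {
	char regs[1024];
};

/* Stand-in for copy_to_user(): returns 0 on success. */
static int copy_out(void *dst, const void *src, size_t n)
{
	memcpy(dst, src, n);
	return 0;
}

static int get_lapic(struct lapic_state *user_buf)
{
	struct lapic_state *lapic = calloc(1, sizeof(*lapic));
	int r;

	assert(user_buf != NULL);
	if (!lapic)
		return -1;
	memset(lapic->regs, 0xab, sizeof(lapic->regs));	/* pretend state */

	/*
	 * Wrong once lapic is a pointer:
	 *   copy_out(user_buf, &lapic, sizeof lapic);
	 * copies only the pointer value (4 or 8 bytes).
	 * Copy the pointed-to struct with its full size instead.
	 */
	r = copy_out(user_buf, lapic, sizeof(*lapic));
	free(lapic);
	return r;
}
```

After get_lapic() succeeds, every byte of the caller's buffer holds the state that was filled in, rather than a few bytes of pointer followed by stale data.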
