qemu-devel@nongnu.org

2020-04-21 Thread Sasha Levin
Hi

[This is an automated email]

This commit has been processed because it contains a -stable tag.
The stable tag indicates that it's relevant for the following trees: all

The bot has tested the following trees: v5.6.5, v5.5.18, v5.4.33, v4.19.116, 
v4.14.176, v4.9.219, v4.4.219.

v5.6.5: Build OK!
v5.5.18: Build OK!
v5.4.33: Build OK!
v4.19.116: Build OK!
v4.14.176: Build OK!
v4.9.219: Build OK!
v4.4.219: Failed to apply! Possible dependencies:
029499b47738 ("KVM: x86: MMU: Make mmu_set_spte() return emulate value")
19d194c62b25 ("MIPS: KVM: Simplify TLB_* macros")
403015b323a2 ("MIPS: KVM: Move non-TLB handling code out of tlb.c")
7ee0e5b29d27 ("KVM: x86: MMU: Remove unused parameter of __direct_map()")
9fbfb06a4065 ("MIPS: KVM: Arrayify struct kvm_mips_tlb::tlb_lo*")
ba049e93aef7 ("kvm: rename pfn_t to kvm_pfn_t")
bdb7ed8608f8 ("MIPS: KVM: Convert headers to kernel sized types")
ca64c2beecd4 ("MIPS: KVM: Abstract guest ASID mask")
caa1faa7aba6 ("MIPS: KVM: Trivial whitespace and style fixes")
e6207bbea16c ("MIPS: KVM: Use MIPS_ENTRYLO_* defs from mipsregs.h")


NOTE: The patch will not be queued to stable trees until it is upstream.

How should we proceed with this patch?

-- 
Thanks
Sasha



Re: [PATCH 03/15] KVM: MIPS: Fix VPN2_MASK definition for variable cpu_vmbits

2020-04-21 Thread Sasha Levin
Hi

[This is an automated email]

This commit has been processed because it contains a -stable tag.
The stable tag indicates that it's relevant for the following trees: all

The bot has tested the following trees: v5.6.5, v5.5.18, v5.4.33, v4.19.116, 
v4.14.176, v4.9.219, v4.4.219.

v5.6.5: Build OK!
v5.5.18: Build OK!
v5.4.33: Build OK!
v4.19.116: Build OK!
v4.14.176: Build OK!
v4.9.219: Build OK!
v4.4.219: Failed to apply! Possible dependencies:
029499b47738 ("KVM: x86: MMU: Make mmu_set_spte() return emulate value")
19d194c62b25 ("MIPS: KVM: Simplify TLB_* macros")
403015b323a2 ("MIPS: KVM: Move non-TLB handling code out of tlb.c")
7ee0e5b29d27 ("KVM: x86: MMU: Remove unused parameter of __direct_map()")
9a99c4fd6586 ("KVM: MIPS: Define KVM_ENTRYHI_ASID to 
cpu_asid_mask(&boot_cpu_data)")
9fbfb06a4065 ("MIPS: KVM: Arrayify struct kvm_mips_tlb::tlb_lo*")
ba049e93aef7 ("kvm: rename pfn_t to kvm_pfn_t")
bdb7ed8608f8 ("MIPS: KVM: Convert headers to kernel sized types")
ca64c2beecd4 ("MIPS: KVM: Abstract guest ASID mask")
caa1faa7aba6 ("MIPS: KVM: Trivial whitespace and style fixes")
e6207bbea16c ("MIPS: KVM: Use MIPS_ENTRYLO_* defs from mipsregs.h")


NOTE: The patch will not be queued to stable trees until it is upstream.

How should we proceed with this patch?

-- 
Thanks
Sasha



qemu-devel@nongnu.org

2020-05-06 Thread Sasha Levin
Hi

[This is an automated email]

This commit has been processed because it contains a -stable tag.
The stable tag indicates that it's relevant for the following trees: all

The bot has tested the following trees: v5.6.10, v5.4.38, v4.19.120, v4.14.178, 
v4.9.221, v4.4.221.

v5.6.10: Build OK!
v5.4.38: Build OK!
v4.19.120: Build OK!
v4.14.178: Build OK!
v4.9.221: Build OK!
v4.4.221: Failed to apply! Possible dependencies:
029499b47738 ("KVM: x86: MMU: Make mmu_set_spte() return emulate value")
19d194c62b25 ("MIPS: KVM: Simplify TLB_* macros")
403015b323a2 ("MIPS: KVM: Move non-TLB handling code out of tlb.c")
7ee0e5b29d27 ("KVM: x86: MMU: Remove unused parameter of __direct_map()")
9fbfb06a4065 ("MIPS: KVM: Arrayify struct kvm_mips_tlb::tlb_lo*")
ba049e93aef7 ("kvm: rename pfn_t to kvm_pfn_t")
bdb7ed8608f8 ("MIPS: KVM: Convert headers to kernel sized types")
ca64c2beecd4 ("MIPS: KVM: Abstract guest ASID mask")
caa1faa7aba6 ("MIPS: KVM: Trivial whitespace and style fixes")
e6207bbea16c ("MIPS: KVM: Use MIPS_ENTRYLO_* defs from mipsregs.h")


NOTE: The patch will not be queued to stable trees until it is upstream.

How should we proceed with this patch?

-- 
Thanks
Sasha



Re: [Qemu-devel] [PATCH v7] kvm: notify host when the guest is panicked

2012-07-21 Thread Sasha Levin
On 07/21/2012 09:12 AM, Wen Congyang wrote:
> +#define KVM_PV_PORT  (0x505UL)
> +
>  #ifdef __KERNEL__
>  #include 
>  
> @@ -221,6 +223,11 @@ static inline void kvm_disable_steal_time(void)
>  }
>  #endif
>  
> +static inline unsigned int kvm_arch_pv_features(void)
> +{
> + return inl(KVM_PV_PORT);
> +}
> +

Why is this safe?

I'm not sure you can just pick any ioport you'd like and use it.



Re: [Qemu-devel] [PATCH v7.5] kvm: notify host when the guest is panicked

2012-07-22 Thread Sasha Levin
On 07/21/2012 10:44 AM, Wen Congyang wrote:
> We can know the guest is panicked when the guest runs on xen.
> But we do not have such a feature on kvm.
> 
> Another purpose of this feature is: management app (for example:
> libvirt) can do auto dump when the guest is panicked. If management
> app does not do auto dump, the guest's user can do a dump by hand if
> he sees the guest is panicked.
> 
> We have three solutions to implement this feature:
> 1. use vmcall
> 2. use I/O port
> 3. use virtio-serial.
> 
> We have decided to avoid touching the hypervisor. The reason why I chose
> the I/O port is:
> 1. it is easier to implement
> 2. it does not depend on any virtual device
> 3. it can work when starting the kernel

Was the option of implementing a virtio-watchdog driver considered?

You're basically re-implementing a watchdog, a guest-host interface and a set 
of protocols for guest-host communications.

Why can't we re-use everything we have now, push a virtio watchdog driver into 
drivers/watchdog/, and gain a more complete solution to detecting hangs inside 
the guest?



Re: [Qemu-devel] [PATCH v7.5] kvm: notify host when the guest is panicked

2012-07-22 Thread Sasha Levin
On 07/22/2012 09:14 PM, Anthony Liguori wrote:
> Sasha Levin  writes:
> 
>> On 07/21/2012 10:44 AM, Wen Congyang wrote:
>>> We can know the guest is panicked when the guest runs on xen.
>>> But we do not have such a feature on kvm.
>>>
>>> Another purpose of this feature is: management app (for example:
>>> libvirt) can do auto dump when the guest is panicked. If management
>>> app does not do auto dump, the guest's user can do a dump by hand if
>>> he sees the guest is panicked.
>>>
>>> We have three solutions to implement this feature:
>>> 1. use vmcall
>>> 2. use I/O port
>>> 3. use virtio-serial.
>>>
>>> We have decided to avoid touching the hypervisor. The reason why I chose
>>> the I/O port is:
>>> 1. it is easier to implement
>>> 2. it does not depend on any virtual device
>>> 3. it can work when starting the kernel
>>
>> Was the option of implementing a virtio-watchdog driver considered?
>>
>> You're basically re-implementing a watchdog, a guest-host interface and a 
>> set of protocols for guest-host communications.
>>
>> Why can't we re-use everything we have now, push a virtio watchdog
>> driver into drivers/watchdog/, and gain a more complete solution to
>> detecting hangs inside the guest?
> 
> The purpose of virtio is not to reinvent every possible type of device.
> There are plenty of hardware watchdogs that are very suitable to be used
> for this purpose.  QEMU implements quite a few already.
> 
> Watchdogs are not performance sensitive so there's no point in using
> virtio.

The issue here is not performance, but the addition of a brand new guest-host
interface.

virtio-rng isn't performance sensitive either, yet it was implemented using 
virtio so there wouldn't be yet another interface to communicate between guest 
and host.

This patch goes ahead to add an "arch pv features" interface using ioports,
without any idea what it might be used for beyond this watchdog.



Re: [Qemu-devel] [PATCH v7] kvm: notify host when the guest is panicked

2012-07-22 Thread Sasha Levin
On 07/22/2012 09:22 PM, Anthony Liguori wrote:
> Sasha Levin  writes:
> 
>> On 07/21/2012 09:12 AM, Wen Congyang wrote:
>>> +#define KVM_PV_PORT(0x505UL)
>>> +
>>>  #ifdef __KERNEL__
>>>  #include 
>>>  
>>> @@ -221,6 +223,11 @@ static inline void kvm_disable_steal_time(void)
>>>  }
>>>  #endif
>>>  
>>> +static inline unsigned int kvm_arch_pv_features(void)
>>> +{
>>> +   return inl(KVM_PV_PORT);
>>> +}
>>> +
>>
>> Why is this safe?
>>
>> I'm not sure you can just pick any ioport you'd like and use it.
> 
> There are three ways I/O ports get used on a PC:
> 
> 1) Platform devices
>  - This is well defined since the vast majority of platform devices are
>implemented within a single chip.  If you're emulating an i440fx
>chipset, the PIIX4 spec has an exhaustive list.
> 
> 2) PCI devices
>  - Typically, PCI only allocates ports starting at 0x0d00 to avoid
>conflicts with ISA devices.
> 
> 3) ISA devices
>  - ISA uses subtractive decoding so any ISA device can access.  In
>theory, an ISA device could attempt to use port 0x0505 but it's
>unlikely.  In a modern guest, there aren't really any ISA devices being
>added either.
> 
> So yes, picking port 0x0505 is safe for something like this (as long as
> you check to make sure that you really are under KVM).

Is there anything that actually prevents me from using PCI ports lower than 
0x0d00? As you said in (3), ISA isn't really used anymore (nor is it
implemented by lkvm for example), so placing PCI below 0x0d00 might even make
sense in that case.

Furthermore, I can place one of these brand new virtio-mmio devices which got
introduced recently wherever I want right now - having a device that uses 0x505
would cause a pretty non-obvious failure mode.

Either way, if we are going to grab an ioport, then:

 - It should be documented well somewhere in Documentation/virt/kvm
 - It should go through request_region() to actually claim those ioports.
 - It should fail gracefully if that port is taken for some reason, instead of 
not even checking it.
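
For illustration, claiming the port could look roughly like this (a
hypothetical sketch built on the kernel's request_region() API; KVM_PV_PORT
is the 0x505 constant from the patch, and 4 bytes are reserved since the
code reads the port with inl()):

#include <linux/errno.h>
#include <linux/ioport.h>

static int kvm_pv_claim_port(void)
{
	/* Reserve the range so a conflict shows up in /proc/ioports
	 * instead of two devices silently sharing 0x505. */
	if (!request_region(KVM_PV_PORT, 4, "kvm-pv-event"))
		return -EBUSY;	/* taken: fail gracefully, don't just inl() */
	return 0;
}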



Re: [Qemu-devel] [PATCH v7] kvm: notify host when the guest is panicked

2012-07-22 Thread Sasha Levin
On 07/22/2012 10:19 PM, Sasha Levin wrote:
> On 07/22/2012 09:22 PM, Anthony Liguori wrote:
>> Sasha Levin  writes:
>>
>>> On 07/21/2012 09:12 AM, Wen Congyang wrote:
>>>> +#define KVM_PV_PORT   (0x505UL)
>>>> +
>>>>  #ifdef __KERNEL__
>>>>  #include 
>>>>  
>>>> @@ -221,6 +223,11 @@ static inline void kvm_disable_steal_time(void)
>>>>  }
>>>>  #endif
>>>>  
>>>> +static inline unsigned int kvm_arch_pv_features(void)
>>>> +{
>>>> +  return inl(KVM_PV_PORT);
>>>> +}
>>>> +
>>>
>>> Why is this safe?
>>>
>>> I'm not sure you can just pick any ioport you'd like and use it.
>>
>> There are three ways I/O ports get used on a PC:
>>
>> 1) Platform devices
>>  - This is well defined since the vast majority of platform devices are
>>implemented within a single chip.  If you're emulating an i440fx
>>chipset, the PIIX4 spec has an exhaustive list.
>>
>> 2) PCI devices
>>  - Typically, PCI only allocates ports starting at 0x0d00 to avoid
>>conflicts with ISA devices.
>>
>> 3) ISA devices
>>  - ISA uses subtractive decoding so any ISA device can access.  In
>>theory, an ISA device could attempt to use port 0x0505 but it's
>>unlikely.  In a modern guest, there aren't really any ISA devices being
>>added either.
>>
>> So yes, picking port 0x0505 is safe for something like this (as long as
>> you check to make sure that you really are under KVM).
> 
> Is there anything that actually prevents me from using PCI ports lower than 
> 0x0d00? As you said in (3), ISA isn't really used anymore (nor is it
> implemented by lkvm for example), so placing PCI below 0x0d00 might even make
> sense in that case.
> 
> Furthermore, I can place one of these brand new virtio-mmio devices which got
> introduced recently wherever I want right now - having a device that uses
> 0x505 would cause a pretty non-obvious failure mode.
> 
> Either way, if we are going to grab an ioport, then:
> 
>  - It should be documented well somewhere in Documentation/virt/kvm
>  - It should go through request_region() to actually claim those ioports.
>  - It should fail gracefully if that port is taken for some reason, instead 
> of not even checking it.
> 

Out of curiosity I tested that, and apparently lkvm has no problem allocating 
virtio-pci devices in that range:

sh-4.2# pwd
/sys/devices/pci0000:00/0000:00:01.0
sh-4.2# cat resource | head -n1
0x0500 0x05ff 0x00040101

This was with the commit in question applied.



Re: [Qemu-devel] [PATCH v7] kvm: notify host when the guest is panicked

2012-07-22 Thread Sasha Levin
On 07/23/2012 12:29 AM, Anthony Liguori wrote:
> Sasha Levin  writes:
> 
>> On 07/22/2012 10:19 PM, Sasha Levin wrote:
>>> On 07/22/2012 09:22 PM, Anthony Liguori wrote:
>>>> Sasha Levin  writes:
>>>>
>>>>> On 07/21/2012 09:12 AM, Wen Congyang wrote:
>>>>>> +#define KVM_PV_PORT (0x505UL)
>>>>>> +
>>>>>>  #ifdef __KERNEL__
>>>>>>  #include 
>>>>>>  
>>>>>> @@ -221,6 +223,11 @@ static inline void kvm_disable_steal_time(void)
>>>>>>  }
>>>>>>  #endif
>>>>>>  
>>>>>> +static inline unsigned int kvm_arch_pv_features(void)
>>>>>> +{
>>>>>> +return inl(KVM_PV_PORT);
>>>>>> +}
>>>>>> +
>>>>>
>>>>> Why is this safe?
>>>>>
>>>>> I'm not sure you can just pick any ioport you'd like and use it.
>>>>
>>>> There are three ways I/O ports get used on a PC:
>>>>
>>>> 1) Platform devices
>>>>  - This is well defined since the vast majority of platform devices are
>>>>implemented within a single chip.  If you're emulating an i440fx
>>>>chipset, the PIIX4 spec has an exhaustive list.
>>>>
>>>> 2) PCI devices
>>>>  - Typically, PCI only allocates ports starting at 0x0d00 to avoid
>>>>conflicts with ISA devices.
>>>>
>>>> 3) ISA devices
>>>>  - ISA uses subtractive decoding so any ISA device can access.  In
>>>>theory, an ISA device could attempt to use port 0x0505 but it's
>>>>unlikely.  In a modern guest, there aren't really any ISA devices being
>>>>added either.
>>>>
>>>> So yes, picking port 0x0505 is safe for something like this (as long as
>>>> you check to make sure that you really are under KVM).
>>>
>>> Is there anything that actually prevents me from using PCI ports lower than 
>>> 0x0d00? As you said in (3), ISA isn't really used anymore (nor is it
>>> implemented by lkvm for example), so placing PCI below 0x0d00 might even
>>> make sense in that case.
>>>
>>> Furthermore, I can place one of these brand new virtio-mmio devices which
>>> got introduced recently wherever I want right now - having a device that
>>> uses 0x505 would cause a pretty non-obvious failure mode.
>>>
>>> Either way, if we are going to grab an ioport, then:
>>>
>>>  - It should be documented well somewhere in Documentation/virt/kvm
>>>  - It should go through request_region() to actually claim those ioports.
>>>  - It should fail gracefully if that port is taken for some reason, instead 
>>> of not even checking it.
>>>
>>
>> Out of curiosity I tested that, and apparently lkvm has no problem 
>> allocating virtio-pci devices in that range:
>>
>> sh-4.2# pwd
>> /sys/devices/pci0000:00/0000:00:01.0
>> sh-4.2# cat resource | head -n1
>> 0x0500 0x05ff 0x00040101
>>
>> This was with the commit in question applied.
> 
> With all due respect, lkvm has a half-baked implementation of PCI.  This
> is why you have to pass kernel parameters to disable ACPI and disable
> PCI BIOS probing.
> 
> So yeah, you can do funky things in lkvm but that doesn't mean a system
> that emulated actual hardware would ever do that.

We disable ACPI simply because we don't support it. MPtable is a perfectly 
valid mechanism to do everything we need so far, so implementing ACPI didn't 
interest either of us too much. What's more - why implement a "complete design 
disaster in every way" ;)

Regarding PCI probing, while we do force the use of direct memory probing, this
is because we lack anything which resembles a BIOS. Like the above, this
wasn't too interesting in a virtualized environment, and the kernel is pretty 
happy running without it. PCI probing does happen in a standard way.

I think that the interesting part in that test was not that you could actually 
put a PCI device in the 0x500 range, but that nothing failed and no one yelled 
at me (with the panic commit applied).

I'm not worried about port 0x505 being taken; I'm worried that it'll silently
break an (albeit not very common/reasonable/typical) perfectly valid use case.



Re: [Qemu-devel] [PATCH v7.5] kvm: notify host when the guest is panicked

2012-07-22 Thread Sasha Levin
On 07/23/2012 12:36 AM, Anthony Liguori wrote:
> Sasha Levin  writes:
> 
>> On 07/22/2012 09:14 PM, Anthony Liguori wrote:
>>> Sasha Levin  writes:
>>>
>>>> On 07/21/2012 10:44 AM, Wen Congyang wrote:
>>>>> We can know the guest is panicked when the guest runs on xen.
>>>>> But we do not have such a feature on kvm.
>>>>>
>>>>> Another purpose of this feature is: management app (for example:
>>>>> libvirt) can do auto dump when the guest is panicked. If management
>>>>> app does not do auto dump, the guest's user can do a dump by hand if
>>>>> he sees the guest is panicked.
>>>>>
>>>>> We have three solutions to implement this feature:
>>>>> 1. use vmcall
>>>>> 2. use I/O port
>>>>> 3. use virtio-serial.
>>>>>
>>>>> We have decided to avoid touching the hypervisor. The reason why I chose
>>>>> the I/O port is:
>>>>> 1. it is easier to implement
>>>>> 2. it does not depend on any virtual device
>>>>> 3. it can work when starting the kernel
>>>>
>>>> Was the option of implementing a virtio-watchdog driver considered?
>>>>
>>>> You're basically re-implementing a watchdog, a guest-host interface and a 
>>>> set of protocols for guest-host communications.
>>>>
>>>> Why can't we re-use everything we have now, push a virtio watchdog
>>>> driver into drivers/watchdog/, and gain a more complete solution to
>>>> detecting hangs inside the guest?
>>>
>>> The purpose of virtio is not to reinvent every possible type of device.
>>> There are plenty of hardware watchdogs that are very suitable to be used
>>> for this purpose.  QEMU implements quite a few already.
>>>
>>> Watchdogs are not performance sensitive so there's no point in using
>>> virtio.
>>
>> The issue here is not performance, but the addition of a brand new
>> guest-host interface.
> 
> We have:
> 
> 1) Virtio--this is our preferred PV interface.  It needs PCI to be fully
> initialized and probably will live as a module.
> 
> 2) Hypercalls--this a secondary PV interface but is available very
> early.  It's terminated in kvm.ko which means it can only operate on
> things that are logically part of the CPU and/or APIC complex.
> 
> This patch introduces a third interface which is available early like
> hypercalls but not necessarily terminated in kvm.ko.  That means it can
> have a broader scope in functionality than (2).
> 
> We could just as well use a hypercall and have multiple commands issued
> to that hypercall as a convention and add a new exit type to KVM that
> sent that specific hypercall to userspace for processing.
> 
> But a PIO operation already has this behavior and requires no changes to 
> kvm.ko.

I don't dispute that there may be a need for another guest-host interface, but 
this patch can basically be called "kvm: notify host when the guest is 
panicked, oh, btw, and add a brand new undocumented interface"

The new interface should at least come in its own patch, with documentation.



Re: [Qemu-devel] [PATCH v10] kvm: notify host when the guest is panicked

2012-08-29 Thread Sasha Levin
On 08/29/2012 07:18 AM, Wen Congyang wrote:
> We can know the guest is panicked when the guest runs on xen.
> But we do not have such a feature on kvm.
> 
> Another purpose of this feature is: management app (for example:
> libvirt) can do auto dump when the guest is panicked. If management
> app does not do auto dump, the guest's user can do a dump by hand if
> he sees the guest is panicked.
> 
> We have three solutions to implement this feature:
> 1. use vmcall
> 2. use I/O port
> 3. use virtio-serial.
> 
> We have decided to avoid touching the hypervisor. The reason why I chose
> the I/O port is:
> 1. it is easier to implement
> 2. it does not depend on any virtual device
> 3. it can work when starting the kernel
> 
> Signed-off-by: Wen Congyang 
> ---
>  Documentation/virtual/kvm/pv_event.txt |   32 
> 
>  arch/ia64/include/asm/kvm_para.h   |   14 ++
>  arch/powerpc/include/asm/kvm_para.h|   14 ++
>  arch/s390/include/asm/kvm_para.h   |   14 ++
>  arch/x86/include/asm/kvm_para.h|   27 +++
>  arch/x86/kernel/kvm.c  |   25 +
>  include/linux/kvm_para.h   |   23 +++
>  7 files changed, 149 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/virtual/kvm/pv_event.txt
> 
> diff --git a/Documentation/virtual/kvm/pv_event.txt 
> b/Documentation/virtual/kvm/pv_event.txt
> new file mode 100644
> index 0000000..bb04de0
> --- /dev/null
> +++ b/Documentation/virtual/kvm/pv_event.txt
> @@ -0,0 +1,32 @@
> +The KVM paravirtual event interface
> +===================================
> +
> +Initializing the paravirtual event interface
> +============================================
> +kvm_pv_event_init()
> +Arguments:
> + None
> +
> +Return Value:
> + 0: The guest kernel can use paravirtual event interface.
> + 1: The guest kernel can't use paravirtual event interface.
> +
> +Querying whether the event can be ejected
> +=========================================
> +kvm_pv_has_feature()
> +Arguments:
> + feature: The bit value of this paravirtual event to query
> +
> +Return Value:
> + 0 : The guest kernel can't eject this paravirtual event.
> + -1: The guest kernel can eject this paravirtual event.
> +
> +
> +Ejecting paravirtual event
> +==========================
> +kvm_pv_eject_event()
> +Arguments:
> + event: The event to be ejected.
> +
> +Return Value:
> + None

What's the protocol for communicating with the hypervisor? What is it supposed
to do on reads/writes to that ioport?

> diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> index 2f7712e..7d297f0 100644
> --- a/arch/x86/include/asm/kvm_para.h
> +++ b/arch/x86/include/asm/kvm_para.h
> @@ -96,8 +96,11 @@ struct kvm_vcpu_pv_apf_data {
>  #define KVM_PV_EOI_ENABLED KVM_PV_EOI_MASK
>  #define KVM_PV_EOI_DISABLED 0x0
>  
> +#define KVM_PV_EVENT_PORT(0x505UL)
> +
>  #ifdef __KERNEL__
>  #include 
> +#include 
>  
>  extern void kvmclock_init(void);
>  extern int kvm_register_clock(char *txt);
> @@ -228,6 +231,30 @@ static inline void kvm_disable_steal_time(void)
>  }
>  #endif
>  
> +static inline int kvm_arch_pv_event_init(void)
> +{
> + if (!request_region(KVM_PV_EVENT_PORT, 1, "KVM_PV_EVENT"))

Only one byte is requested here, but the rest of the code is reading/writing 
longs?

The struct resource * returned from request_region is simply being leaked here?

What happens if we go ahead with adding another event (let's say OOM event)?
request_region() is going to fail for anything but the first call.

> + return -1;

This return value doesn't correspond with the documentation.

> +
> + return 0;
> +}
> +
> +static inline unsigned int kvm_arch_pv_features(void)
> +{
> + unsigned int features = inl(KVM_PV_EVENT_PORT);
> +
> + /* Reading from an invalid I/O port will return -1 */

Just wondering, where is that documented? For lkvm for example the return value
from an ioport without a device on the other side is undefined, so it's possible
we're doing something wrong there.

> + if (features == ~0)
> + features = 0;
> +
> + return features;
> +}
> +
> +static inline void kvm_arch_pv_eject_event(unsigned int event)
> +{
> + outl(event, KVM_PV_EVENT_PORT);
> +}
> +
>  #endif /* __KERNEL__ */
>  
>  #endif /* _ASM_X86_KVM_PARA_H */
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index c1d61ee..6129459 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -368,6 +368,17 @@ static struct notifier_block kvm_pv_reboot_nb = {
>   .notifier_call = kvm_pv_reboot_notify,
>  };
>  
> +static int
> +kvm_pv_panic_notify(struct notifier_block *nb, unsigned long code, void 
> *unused)
> +{
> + kvm_pv_eject_event(KVM_PV_EVENT_PANICKED);
> + return NOTIFY_DONE;
> +}
> +
> +static struct notifier_block kvm_pv_panic_nb = {
> + .notifier_call = kvm_pv_panic_notify,
> +};
> +
>  sta

Re: [Qemu-devel] [PATCH v10] kvm: notify host when the guest is panicked

2012-08-30 Thread Sasha Levin
On 08/30/2012 04:03 AM, Wen Congyang wrote:
> At 08/29/2012 07:56 PM, Sasha Levin Wrote:
>> On 08/29/2012 07:18 AM, Wen Congyang wrote:
>>> diff --git a/Documentation/virtual/kvm/pv_event.txt 
>>> b/Documentation/virtual/kvm/pv_event.txt
>>> new file mode 100644
>>> index 0000000..bb04de0
>>> --- /dev/null
>>> +++ b/Documentation/virtual/kvm/pv_event.txt
>>> @@ -0,0 +1,32 @@
>>> +The KVM paravirtual event interface
>>> +===================================
>>> +
>>> +Initializing the paravirtual event interface
>>> +============================================
>>> +kvm_pv_event_init()
>>> +Arguments:
>>> +   None
>>> +
>>> +Return Value:
>>> +   0: The guest kernel can use paravirtual event interface.
>>> +   1: The guest kernel can't use paravirtual event interface.
>>> +
>>> +Querying whether the event can be ejected
>>> +=========================================
>>> +kvm_pv_has_feature()
>>> +Arguments:
>>> +   feature: The bit value of this paravirtual event to query
>>> +
>>> +Return Value:
>>> +   0 : The guest kernel can't eject this paravirtual event.
>>> +   -1: The guest kernel can eject this paravirtual event.
>>> +
>>> +
>>> +Ejecting paravirtual event
>>> +==========================
>>> +kvm_pv_eject_event()
>>> +Arguments:
>>> +   event: The event to be ejected.
>>> +
>>> +Return Value:
>>> +   None
>>
>> What's the protocol for communicating with the hypervisor? What is it 
>> supposed
>> to do on reads/writes to that ioport?
> 
> Not only ioport; other arches can use other ways. We can use
> these APIs to eject an event to the hypervisor. The caller does not care how
> to communicate with the hypervisor.

Right, it can work in several ways, but the protocol (whatever it may be)
between the hypervisor and the guest kernel should be documented here as well.

>>> diff --git a/arch/x86/include/asm/kvm_para.h 
>>> b/arch/x86/include/asm/kvm_para.h
>>> index 2f7712e..7d297f0 100644
>>> --- a/arch/x86/include/asm/kvm_para.h
>>> +++ b/arch/x86/include/asm/kvm_para.h
>>> @@ -96,8 +96,11 @@ struct kvm_vcpu_pv_apf_data {
>>>  #define KVM_PV_EOI_ENABLED KVM_PV_EOI_MASK
>>>  #define KVM_PV_EOI_DISABLED 0x0
>>>  
>>> +#define KVM_PV_EVENT_PORT  (0x505UL)
>>> +
>>>  #ifdef __KERNEL__
>>>  #include 
>>> +#include 
>>>  
>>>  extern void kvmclock_init(void);
>>>  extern int kvm_register_clock(char *txt);
>>> @@ -228,6 +231,30 @@ static inline void kvm_disable_steal_time(void)
>>>  }
>>>  #endif
>>>  
>>> +static inline int kvm_arch_pv_event_init(void)
>>> +{
>>> +   if (!request_region(KVM_PV_EVENT_PORT, 1, "KVM_PV_EVENT"))
>>
>> Only one byte is requested here, but the rest of the code is reading/writing 
>> longs?
>>
>> The struct resource * returned from request_region is simply being leaked 
>> here?
>>
>> What happens if we go ahead with adding another event (let's say OOM event)?
>> request_region() is going to fail for anything but the first call.
> 
> For x86, we use an ioport to communicate with the hypervisor. We can read a
> 32-bit value from the hypervisor. If bit 0 is set, it means the hypervisor
> supports the panicked event. If you want to add another event, you can use
> another unused bit.
> I think 32 events are enough now.
> 
> You can write a value to the ioport to eject the event. Only one event can be
> ejected at a time.
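
Taken literally, the guest side of this protocol boils down to something
like the sketch below (illustrative only; the panicked event is assumed to
be bit 0, following the description above):

#include <asm/io.h>

#define KVM_PV_EVENT_PORT	0x505
#define KVM_PV_EVENT_PANICKED	(1 << 0)

static int kvm_pv_has_panicked_event(void)
{
	/* inl() returns the 32-bit feature mask; bit 0 set means the
	 * hypervisor supports the panicked event. */
	return inl(KVM_PV_EVENT_PORT) & KVM_PV_EVENT_PANICKED;
}

static void kvm_pv_eject_panicked_event(void)
{
	/* A write ejects one event at a time. */
	outl(KVM_PV_EVENT_PANICKED, KVM_PV_EVENT_PORT);
}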

I was trying to point out that kvm_pv_event_init() would fail on anything but
the first call, while the API suggests it should be called to verify we can
write events.

>>> +static inline unsigned int kvm_arch_pv_features(void)
>>> +{
>>> +   unsigned int features = inl(KVM_PV_EVENT_PORT);
>>> +
>>> +   /* Reading from an invalid I/O port will return -1 */
>>
>> Just wondering, where is that documented? For lkvm for example the return 
>> value
>> from an ioport without a device on the other side is undefined, so it's 
>> possible
>> we're doing something wrong there.
> 
> Hmm, how do I use lkvm? Can you give me an example, so I can test this patch
> on lkvm?

You can grab lkvm from https://github.com/penberg/linux-kvm , it lives under
tools/kvm/ in the kernel tree.

> For qemu, it returns -1. I don't know which behavior is right. I will
> investigate it.

Thing is, unless x86 arch suggests it should return something specific in this
case, we can't assume a return value either from lkvm or qemu.


Thanks,
Sasha




Re: [Qemu-devel] [PATCH v11] kvm: notify host when the guest is panicked

2012-10-30 Thread Sasha Levin
On Tue, Oct 30, 2012 at 9:48 PM, Wen Congyang  wrote:
> At 10/31/2012 09:12 AM, Marcelo Tosatti Wrote:
>> It has been asked earlier why a simple virtio device is not usable
>> for this (with no response IIRC).
>
> 1. We can't use a virtio device when the kernel is booting.

So the issue here is the small window between the point the guest
becomes "self aware" and to the point virtio drivers are loaded,
right?

I agree that if something happens during that interval, a
"virtio-notifier" driver won't catch that, but anything beyond that is
better done with a virtio driver, so how is the generic infrastructure
added in this patch useful to anything beyond detecting panics in that
initial interval?

> 2. The virtio driver can be built as a module, and if it is not loaded
>    and the kernel panics, there is no way to notify the host.

Even if the suggested virtio-notifier driver is built as a module, it
would get auto-loaded when the guest is booting, so I'm not sure about
this point?

> 3. I/O port is more reliable than a virtio device.
>    If the virtio driver has some bug and it causes a kernel panic, we can't
>    use it. The I/O port is more reliable because it only depends on the
>    notifier chain (if we use a virtio device, it also depends on the
>    notifier chain).

This is like suggesting that we let KVM emulate virtio-blk on its
own, parallel to the virtio implementation, so that even if there's a
problem with virtio-blk, KVM can still emulate virtio-blk on its own.

Furthermore, why stop at virtio? What if the KVM code has a bug and it
doesn't pass IO properly? Or the x86 code? We still want panic
notifications if that happens...


Thanks,
Sasha



Re: [Qemu-devel] [PATCH v11] kvm: notify host when the guest is panicked

2012-11-09 Thread Sasha Levin
On Mon, Nov 5, 2012 at 8:58 PM, Hu Tao  wrote:
> But in the case of panic notification, more dependency means more
> chances of failure of panic notification. Say, if we use a virtio device
> to do panic notification, then we will fail if: virtio itself has
> problems, virtio for some reason can't be deployed (neither built-in nor
> as a module), or the guest doesn't support virtio, etc.

Add polling to your virtio device. If it hasn't notified of a panic but
takes more than 20 sec to answer your poll request, you can assume
it's dead.

Actually, just use virtio-serial and something in userspace on the guest.

> We chose IO because, compared to a virtio device, it is not that heavy and
> is less problematic.

Less problematic? Heavy? Are there any known issues with virtio that
should be fixed? You make virtio sound like an old IDE drive or
something.


Thanks,
Sasha



[Qemu-devel] [PATCH] tap: Add optional parameters to up/down script

2011-09-29 Thread Sasha Levin
This allows the user to add custom parameters to the up or down
scripts.

Cc: Anthony Liguori 
Signed-off-by: Sasha Levin 
---
 net.c |8 
 net/tap.c |   37 ++---
 2 files changed, 38 insertions(+), 7 deletions(-)

diff --git a/net.c b/net.c
index d05930c..bb27598 100644
--- a/net.c
+++ b/net.c
@@ -952,10 +952,18 @@ static const struct {
 .type = QEMU_OPT_STRING,
 .help = "script to initialize the interface",
 }, {
+.name = "scriptparams",
+.type = QEMU_OPT_STRING,
+.help = "parameters for the initialization script",
+}, {
 .name = "downscript",
 .type = QEMU_OPT_STRING,
 .help = "script to shut down the interface",
 }, {
+.name = "downscriptparams",
+.type = QEMU_OPT_STRING,
+.help = "parameters for the deinitialization script",
+}, {
 .name = "sndbuf",
 .type = QEMU_OPT_SIZE,
 .help = "send buffer limit"
diff --git a/net/tap.c b/net/tap.c
index 1f26dc9..5a9141e 100644
--- a/net/tap.c
+++ b/net/tap.c
@@ -52,7 +52,7 @@ typedef struct TAPState {
 VLANClientState nc;
 int fd;
 char down_script[1024];
-char down_script_arg[128];
+char down_script_arg[1024];
 uint8_t buf[TAP_BUFSIZE];
 unsigned int read_poll : 1;
 unsigned int write_poll : 1;
@@ -392,7 +392,8 @@ static int net_tap_init(QemuOpts *opts, int *vnet_hdr)
 {
 int fd, vnet_hdr_required;
 char ifname[128] = {0,};
-const char *setup_script;
+const char *setup_script, *setup_script_params;
+char setup_script_formatted[1024];
 
 if (qemu_opt_get(opts, "ifname")) {
 pstrcpy(ifname, sizeof(ifname), qemu_opt_get(opts, "ifname"));
@@ -411,10 +412,16 @@ static int net_tap_init(QemuOpts *opts, int *vnet_hdr)
 }
 
 setup_script = qemu_opt_get(opts, "script");
+setup_script_params = qemu_opt_get(opts, "scriptparams");
+if (setup_script_params == NULL)
+setup_script_params = "";
+
+snprintf(setup_script_formatted, sizeof(setup_script_formatted),
+ "%s %s", ifname, setup_script_params);
 if (setup_script &&
 setup_script[0] != '\0' &&
 strcmp(setup_script, "no") != 0 &&
-launch_script(setup_script, ifname, fd)) {
+launch_script(setup_script, setup_script_formatted, fd)) {
 close(fd);
 return -1;
 }
@@ -432,9 +439,12 @@ int net_init_tap(QemuOpts *opts, Monitor *mon, const char 
*name, VLANState *vlan
 if (qemu_opt_get(opts, "fd")) {
 if (qemu_opt_get(opts, "ifname") ||
 qemu_opt_get(opts, "script") ||
+qemu_opt_get(opts, "scriptparams") ||
 qemu_opt_get(opts, "downscript") ||
+qemu_opt_get(opts, "downscriptparams") ||
 qemu_opt_get(opts, "vnet_hdr")) {
-error_report("ifname=, script=, downscript= and vnet_hdr= is 
invalid with fd=");
+error_report("ifname=, script=, downscript=, scriptparams=, "
+ "downscriptparams= and vnet_hdr= is invalid with 
fd=");
 return -1;
 }
 
@@ -455,6 +465,14 @@ int net_init_tap(QemuOpts *opts, Monitor *mon, const char 
*name, VLANState *vlan
 qemu_opt_set(opts, "downscript", DEFAULT_NETWORK_DOWN_SCRIPT);
 }
 
+if (!qemu_opt_get(opts, "scriptparams")) {
+qemu_opt_set(opts, "scriptparams", "");
+}
+
+if (!qemu_opt_get(opts, "downscriptparams")) {
+qemu_opt_set(opts, "downscriptparams", "");
+}
+
 fd = net_tap_init(opts, &vnet_hdr);
 if (fd == -1) {
 return -1;
@@ -475,18 +493,23 @@ int net_init_tap(QemuOpts *opts, Monitor *mon, const char 
*name, VLANState *vlan
 snprintf(s->nc.info_str, sizeof(s->nc.info_str), "fd=%d", fd);
 } else {
 const char *ifname, *script, *downscript;
+   const char *scriptparams, *downscriptparams;
 
 ifname = qemu_opt_get(opts, "ifname");
 script = qemu_opt_get(opts, "script");
 downscript = qemu_opt_get(opts, "downscript");
+scriptparams = qemu_opt_get(opts, "scriptparams");
+downscriptparams = qemu_opt_get(opts, "downscriptparams");
 
 snprintf(s->nc.info_str, sizeof(s->nc.info_str),
- "ifname=%s,script=%s,downscript=%s",
- ifname, 

Re: [Qemu-devel] [PATCH] tap: Add optional parameters to up/down script

2011-09-29 Thread Sasha Levin
On Thu, 2011-09-29 at 17:20 +0200, Jan Kiszka wrote:
> On 2011-09-29 16:40, Thomas Jung wrote:
> > On 2011-09-29 16:11 jan kiszka wrote
> >> What kind of parameters would you want to pass? Something that tells VMs
> >> apart (which can be solved without these extensions) or more?
> >>
> >> Jan
> > 
> > In our Case:
> > 
> > We want to simulate a larger environment with multiple switches realized in
> > openvswitch.
> > Openvswitch requires an up- and downscript for each switch. So the idea was
> > to have a single, parameterizable up- and downscript and feed it with
> > parameters (like the switch to use, vlan id and so on) through the
> > scriptparams.
> 
> Sounds reasonable. A short note on the "why" in the description would
> have been nice.

Sure, I'll add the use case to the patch note.

> The patch has some style issues that should be addressed first
> (checkpatch will tell).

I wasn't aware that there was a checkpatch for qemu - now I am :)

I'll fix it and resend.

-- 

Sasha.




[Qemu-devel] [PATCH v2] tap: Add optional parameters to up/down script

2011-09-30 Thread Sasha Levin
This allows the user to add custom parameters to the up or down
scripts.

Extra parameters are useful in more complex networking scenarios
where we would like to configure network devices when starting
or stopping the guest.
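
A hypothetical invocation, to make the intended use concrete (script paths
and parameter values are made up; each script is invoked with the interface
name followed by the extra parameters):

qemu -hda guest.img -net nic \
 -net "tap,script=/etc/ovs-ifup,scriptparams=br0 42,downscript=/etc/ovs-ifdown,downscriptparams=br0 42"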

Cc: Anthony Liguori 
Signed-off-by: Sasha Levin 
---
 net.c |8 
 net/tap.c |   45 -
 2 files changed, 44 insertions(+), 9 deletions(-)

diff --git a/net.c b/net.c
index d05930c..bb27598 100644
--- a/net.c
+++ b/net.c
@@ -952,10 +952,18 @@ static const struct {
 .type = QEMU_OPT_STRING,
 .help = "script to initialize the interface",
 }, {
+.name = "scriptparams",
+.type = QEMU_OPT_STRING,
+.help = "parameters for the initialization script",
+}, {
 .name = "downscript",
 .type = QEMU_OPT_STRING,
 .help = "script to shut down the interface",
 }, {
+.name = "downscriptparams",
+.type = QEMU_OPT_STRING,
+.help = "parameters for the deinitialization script",
+}, {
 .name = "sndbuf",
 .type = QEMU_OPT_SIZE,
 .help = "send buffer limit"
diff --git a/net/tap.c b/net/tap.c
index 1f26dc9..f2f1fe5 100644
--- a/net/tap.c
+++ b/net/tap.c
@@ -53,6 +53,7 @@ typedef struct TAPState {
 int fd;
 char down_script[1024];
 char down_script_arg[128];
+char down_script_params[1024];
 uint8_t buf[TAP_BUFSIZE];
 unsigned int read_poll : 1;
 unsigned int write_poll : 1;
@@ -62,7 +63,8 @@ typedef struct TAPState {
 unsigned host_vnet_hdr_len;
 } TAPState;
 
-static int launch_script(const char *setup_script, const char *ifname, int fd);
+static int launch_script(const char *setup_script, const char *ifname,
+ const char *params, int fd);
 
 static int tap_can_send(void *opaque);
 static void tap_send(void *opaque);
@@ -287,7 +289,8 @@ static void tap_cleanup(VLANClientState *nc)
 qemu_purge_queued_packets(nc);
 
 if (s->down_script[0])
-launch_script(s->down_script, s->down_script_arg, s->fd);
+launch_script(s->down_script, s->down_script_arg,
+  s->down_script_params, s->fd);
 
 tap_read_poll(s, 0);
 tap_write_poll(s, 0);
@@ -344,11 +347,12 @@ static TAPState *net_tap_fd_init(VLANState *vlan,
 return s;
 }
 
-static int launch_script(const char *setup_script, const char *ifname, int fd)
+static int launch_script(const char *setup_script, const char *ifname,
+ const char *params, int fd)
 {
 sigset_t oldmask, mask;
 int pid, status;
-char *args[3];
+char *args[4];
 char **parg;
 
 sigemptyset(&mask);
@@ -371,6 +375,7 @@ static int launch_script(const char *setup_script, const 
char *ifname, int fd)
 parg = args;
 *parg++ = (char *)setup_script;
 *parg++ = (char *)ifname;
+*parg++ = (char *)params;
 *parg = NULL;
 execv(setup_script, args);
 _exit(1);
@@ -392,7 +397,7 @@ static int net_tap_init(QemuOpts *opts, int *vnet_hdr)
 {
 int fd, vnet_hdr_required;
 char ifname[128] = {0,};
-const char *setup_script;
+const char *setup_script, *setup_script_params;
 
 if (qemu_opt_get(opts, "ifname")) {
 pstrcpy(ifname, sizeof(ifname), qemu_opt_get(opts, "ifname"));
@@ -411,10 +416,15 @@ static int net_tap_init(QemuOpts *opts, int *vnet_hdr)
 }
 
 setup_script = qemu_opt_get(opts, "script");
+setup_script_params = qemu_opt_get(opts, "scriptparams");
+if (setup_script_params == NULL) {
+setup_script_params = "";
+}
+
 if (setup_script &&
 setup_script[0] != '\0' &&
 strcmp(setup_script, "no") != 0 &&
-launch_script(setup_script, ifname, fd)) {
+launch_script(setup_script, ifname, setup_script_params, fd)) {
 close(fd);
 return -1;
 }
@@ -432,9 +442,12 @@ int net_init_tap(QemuOpts *opts, Monitor *mon, const char 
*name, VLANState *vlan
 if (qemu_opt_get(opts, "fd")) {
 if (qemu_opt_get(opts, "ifname") ||
 qemu_opt_get(opts, "script") ||
+qemu_opt_get(opts, "scriptparams") ||
 qemu_opt_get(opts, "downscript") ||
+qemu_opt_get(opts, "downscriptparams") ||
 qemu_opt_get(opts, "vnet_hdr")) {
-error_report("ifname=, script=, downscript= and vnet_hdr= is 
invalid with fd=");
+error_report("ifname=, script=, downscript=, scriptparams=, "
+ "downscriptparams= and vn

Re: [Qemu-devel] [PATCH 1/1 V5 tuning] kernel/kvm: introduce KVM_SET_LINT1 and fix improper nmi emulation

2011-10-14 Thread Sasha Levin
On Fri, 2011-10-14 at 17:51 +0800, Lai Jiangshan wrote:
> Currently, NMI interrupt is blindly sent to all the vCPUs when NMI
> button event happens. This doesn't properly emulate real hardware on
> which NMI button event triggers LINT1. Because of this, NMI is sent to
> the processor even when LINT1 is masked in LVT. For example, this
> causes the problem that kdump initiated by NMI sometimes doesn't work
> on KVM, because kdump assumes NMI is masked on CPUs other than CPU0.
> 
> With this patch, we introduce KVM_SET_LINT1,
> and we can use KVM_SET_LINT1 to correctly emulate the NMI button
> without changing the old KVM_NMI behavior.
> 
> Signed-off-by: Lai Jiangshan 
> Reported-by: Kenji Kaneshige 
> ---

It could use a documentation update as well.

>  arch/x86/kvm/irq.h   |1 +
>  arch/x86/kvm/lapic.c |7 +++
>  arch/x86/kvm/x86.c   |8 
>  include/linux/kvm.h  |3 +++
>  4 files changed, 19 insertions(+), 0 deletions(-)
> diff --git a/arch/x86/kvm/irq.h b/arch/x86/kvm/irq.h
> index 53e2d08..0c96315 100644
> --- a/arch/x86/kvm/irq.h
> +++ b/arch/x86/kvm/irq.h
> @@ -95,6 +95,7 @@ void kvm_pic_reset(struct kvm_kpic_state *s);
>  void kvm_inject_pending_timer_irqs(struct kvm_vcpu *vcpu);
>  void kvm_inject_apic_timer_irqs(struct kvm_vcpu *vcpu);
>  void kvm_apic_nmi_wd_deliver(struct kvm_vcpu *vcpu);
> +void kvm_apic_lint1_deliver(struct kvm_vcpu *vcpu);
>  void __kvm_migrate_apic_timer(struct kvm_vcpu *vcpu);
>  void __kvm_migrate_pit_timer(struct kvm_vcpu *vcpu);
>  void __kvm_migrate_timers(struct kvm_vcpu *vcpu);
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 57dcbd4..87fe36a 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1039,6 +1039,13 @@ void kvm_apic_nmi_wd_deliver(struct kvm_vcpu *vcpu)
>   kvm_apic_local_deliver(apic, APIC_LVT0);
>  }
>  
> +void kvm_apic_lint1_deliver(struct kvm_vcpu *vcpu)
> +{
> + struct kvm_lapic *apic = vcpu->arch.apic;
> +
> + kvm_apic_local_deliver(apic, APIC_LVT1);
> +}
> +
>  static struct kvm_timer_ops lapic_timer_ops = {
>   .is_periodic = lapic_is_periodic,
>  };
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 84a28ea..fccd094 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2077,6 +2077,7 @@ int kvm_dev_ioctl_check_extension(long ext)
>   case KVM_CAP_XSAVE:
>   case KVM_CAP_ASYNC_PF:
>   case KVM_CAP_GET_TSC_KHZ:
> + case KVM_CAP_SET_LINT1:
>   r = 1;
>   break;
>   case KVM_CAP_COALESCED_MMIO:
> @@ -3264,6 +3265,13 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
>  
>   goto out;
>   }
> + case KVM_SET_LINT1: {
> + r = -EINVAL;
> + if (!irqchip_in_kernel(vcpu->kvm))
> + goto out;
> + r = 0;
> + kvm_apic_lint1_deliver(vcpu);

We simply ignore the return value of kvm_apic_local_deliver() and assume
it always works. Why?
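
Something along these lines would surface delivery failures (a hypothetical
sketch, assuming kvm_apic_lint1_deliver() is changed to pass through the
return value of kvm_apic_local_deliver()):

	case KVM_SET_LINT1: {
		r = -EINVAL;
		if (!irqchip_in_kernel(vcpu->kvm))
			goto out;
		/* Report whether LINT1 was actually delivered instead of
		 * returning 0 unconditionally. */
		r = kvm_apic_lint1_deliver(vcpu) ? 0 : -ENXIO;
		goto out;
	}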


-- 

Sasha.




Re: [Qemu-devel] [PATCH 2/3] use gettimeofday() instead of time()

2012-01-05 Thread Sasha Levin
On Fri, 2012-01-06 at 07:37 +, Zhang, Yang Z wrote:
> Please refer to patch zero for the description. 

Each patch should have a description of what it does in the changelog;
otherwise you're going to lose important information about the change.

Furthermore, cover letters don't get merged at all, which means that if
someone looks at this patch, he won't have anywhere to refer to for
the description (besides googling for the original mails, I guess).

-- 

Sasha.




Re: [Qemu-devel] [PATCH] KVM: Add wrapper script around QEMU to test kernels

2011-11-07 Thread Sasha Levin
On Mon, Nov 7, 2011 at 12:23 PM, Gerd Hoffmann  wrote:
>  Hi,
>
>> It's not just about code, it's as much about culture and development process.
>
> Indeed.  The BSDs have both kernel and the base system in a single
> repository.  There are probably good reasons for (and against) it.
>
> In Linux we don't have that culture.  No tool (except perf) lives in the
> kernel repo.  I fail to see why kvm-tool is that much different from
> udev, util-linux, iproute, filesystem tools, that it should be included.

tools/power was merged just two versions ago; do you think that
merging it was a mistake?



Re: [Qemu-devel] Secure KVM

2011-11-07 Thread Sasha Levin
Hi Anthony,

Thank you for your comments!

On Mon, 2011-11-07 at 11:37 -0600, Anthony Liguori wrote:
> On 11/06/2011 02:40 PM, Sasha Levin wrote:
> > Hi all,
> >
> > I'm planning on doing a small fork of the KVM tool to turn it into a
> > 'Secure KVM' enabled hypervisor. Now you probably ask yourself, Huh?
> >
> > The idea was discussed briefly couple of months ago, but never got off
> > the ground - which is a shame IMO.
> >
> > It's easy to explain the problem: If an attacker finds a security hole
> > in any of the devices which are exposed to the guest, the attacker would
> > be able to either crash the guest, or possibly run code on the host
> > itself.
> >
> > The solution is also simple to explain: Split the devices into different
> > processes and use seccomp to sandbox each device into the exact set of
> > resources it needs to operate, nothing more and nothing less.
> >
> > Since I'll be basing it on the KVM tool, which doesn't really emulate
> > that many legacy devices, I'll focus first on the virtio family for the
> > sake of simplicity (and covering 90% of the options).
> >
> > This is my basic overview of how I'm planning on implementing the
> > initial POC:
> >
> > 1. First I'll focus on the simple virtio-rng device, it's simple enough
> > to allow us to focus on the aspects which are important for the POC
> > while still covering most bases (i.e. sandbox to single file
> > - /dev/urandom and such).
> >
> > 2. Do it on a one process per device concept, where for each device
> > (notice - not device *type*) requested, a new process which handles it
> > will be spawned.
> >
> > 3. That process will be limited exactly to the resources it needs to
> > operate, for example - if we run a virtio-blk device, it would be able
> > to access only the image file which it should be using.
> >
> > 4. Connection between hypervisor and devices will be based on unix
> > sockets, this should allow for better separation compared to other
> > approaches such as shared memory.
> >
> > 5. While performance is an aspect, complete isolation is more important.
> > Security is primary, performance is secondary.
> >
> > 6. Share as much code as possible with current implementation of virtio
> > devices, make it possible to run virtio devices either like it's being
> > done now, or by spawning them as separate processes - the amount of
> > specific code for the separate process case should be minimal.
> >
> >
> > Thats all I have for now, comments are *very* welcome.
> 
> I thought about this a bit and have some ideas that may or may not help.
> 
> 1) If you add device save/load support, then it's something you can 
> potentially 
> use to give yourself quite a bit of flexibility in changing the sandbox.  At 
> any 
> point in run time, you can save the device model's state in the sandbox, 
> destroy 
> the sandbox, and then build a new sandbox and restore the device to its 
> former 
> state.
> 
> This might turn out to be very useful in supporting things like device 
> hotplug 
> and/or memory hot plug.
> 
> 2) I think it's largely possible to implement all device emulation without 
> doing 
> any dynamic memory allocation.  Since memory allocation DoS is something you 
> have to deal with anyway, I suspect most device emulation already uses a 
> fixed 
> amount of memory per device.   This can potentially dramatically simplify 
> things.
> 
> 3) I think virtio can/should be used as a generic "backend to frontend" 
> transport between the device model and the tool.

virtio requires server and client to have shared memory, so if we
already go with shared memory we can just let the device manage the
actual virtio driver directly, no?

Also, things like interrupts would require some sort of different
IPC, which would complicate things a bit.


> 4) Lack of select() is really challenging.  I understand why it's not there 
> since it can technically be emulated but it seems like a no-risk syscall to 
> whitelist and it would make programming in a sandbox so much easier.  Maybe 
> Andrea has some comments here?  I might be missing something here.

There are several of these which would be nice to have, and if we can
get seccomp filters we'd have good flexibility in which APIs we allow
for each device.
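
For reference, the only sandboxing primitive available today is mode-1
seccomp, which is far too strict for most devices - a minimal sketch, using
nothing beyond prctl():

#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <stdio.h>

/* Mode-1 ("strict") seccomp: after this call the process may only use
 * read(), write(), _exit() and sigreturn(); any other syscall kills it.
 * Seccomp filters would let us whitelist a per-device syscall set. */
static int enter_strict_sandbox(void)
{
	if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0) {
		perror("prctl(PR_SET_SECCOMP)");
		return -1;
	}
	return 0;
}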

> Regards,
> 
> Anthony Liguori
> 
> >
> 

-- 

Sasha.




Re: [Qemu-devel] [PATCH] KVM: Add wrapper script around QEMU to test kernels

2011-11-08 Thread Sasha Levin
On Tue, Nov 8, 2011 at 4:52 PM, Christoph Hellwig  wrote:
> On Tue, Nov 08, 2011 at 04:41:40PM +0200, Avi Kivity wrote:
>> On 11/06/2011 03:35 AM, Alexander Graf wrote:
>> > To quickly get going, just execute the following as user:
>> >
>> >     $ ./Documentation/run-qemu.sh -r / -a init=/bin/bash
>> >
>> > This will drop you into a shell on your rootfs.
>> >
>>
>> Doesn't work on Fedora 15.  F15's qemu-kvm doesn't have -machine or
>> -virtfs.  Even qemu.git on F15 won't build virtfs since xattr.h
>> detection is broken (patch posted).
>
> Nevermind that running virtfs as a rootfs is a really dumb idea.  You
> do not want to run a VM that has a rootfs that gets changed all the
> time behind your back.
>
> Running qemu -snapshot on the actual root block device is the only
> safe way to reuse the host installation, although it gets a bit
> complicated if people have multiple devices mounted into the namespace.

Using block devices also requires root.



Re: [Qemu-devel] [PATCH] i386: derive '-cpu host' from KVM_GET_SUPPORTED_CPUID

2011-11-09 Thread Sasha Levin
On Wed, 2011-11-09 at 20:00 +0200, Avi Kivity wrote:
> On 11/09/2011 07:56 PM, Anthony Liguori wrote:
> > On 11/09/2011 07:44 AM, Avi Kivity wrote:
> >> The fact that a host cpu supports a feature doesn't mean that QEMU
> >> and KVM
> >> will also support it, yet -cpuid host brings host features wholesale.
> >>
> >> We need to whitelist each feature separately to make sure we support it.
> >> This patch adds KVM whitelisting (by simply using
> >> KVM_GET_SUPPORTED_CPUID
> >> instead of the CPUID instruction).
> >>
> >> Signed-off-by: Avi Kivity
> >
> > This seems like a 1.0 candidate, yes?
> 
> There is a distinct possibility this will uncover bugs in kvm's
> KVM_GET_SUPPORTED_CPUID.  Those won't be qemu bugs, so I think it's good
> for 1.0.
> 

Avi, we have a problem in the KVM tool with KVM_GET_SUPPORTED_CPUID
sometimes returning -E2BIG. I sent a mail about it some time ago, but
we couldn't really find the reason.

It's somewhat non-deterministic, and there's no sure way to reproduce it,
but it doesn't happen that rarely.

The block of code that uses it from usermode is pretty simple:

struct kvm_cpuid2 *kvm_cpuid;

kvm_cpuid = calloc(1, sizeof(*kvm_cpuid) + 
                   MAX_KVM_CPUID_ENTRIES * sizeof(*kvm_cpuid->entries));

kvm_cpuid->nent = MAX_KVM_CPUID_ENTRIES;
if (ioctl(vcpu->kvm->sys_fd, KVM_GET_SUPPORTED_CPUID, kvm_cpuid) < 0)
die_perror("KVM_GET_SUPPORTED_CPUID failed");

MAX_KVM_CPUID_ENTRIES is set to 100, which is more than the 80 defined
in the kernel, so it shouldn't be an issue. It wouldn't explain the
non-deterministic behavior either.

QEMU's code around it allows it to hide the bug if it does happen:

uint32_t kvm_arch_get_supported_cpuid(KVMState *s, uint32_t function,
   uint32_t index, int reg)
{
 struct kvm_cpuid2 *cpuid;
 int i, max;
 uint32_t ret = 0;
 uint32_t cpuid_1_edx;
 int has_kvm_features = 0;

 max = 1;
 while ((cpuid = try_get_cpuid(s, max)) == NULL) {
 max *= 2;
 }
[snip]

Which means that if it fails it will silently retry until it makes it.

Any guess on why it might happen?
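
For what it's worth, a defensive variant of the snippet above that retries
on -E2BIG the way QEMU's loop does would look roughly like this (a sketch,
reusing the same names):

struct kvm_cpuid2 *kvm_cpuid;
int nent = MAX_KVM_CPUID_ENTRIES;

for (;;) {
	kvm_cpuid = calloc(1, sizeof(*kvm_cpuid) +
			   nent * sizeof(*kvm_cpuid->entries));
	kvm_cpuid->nent = nent;
	if (ioctl(vcpu->kvm->sys_fd, KVM_GET_SUPPORTED_CPUID, kvm_cpuid) == 0)
		break;
	if (errno != E2BIG)
		die_perror("KVM_GET_SUPPORTED_CPUID failed");
	free(kvm_cpuid);	/* buffer too small: double and retry */
	nent *= 2;
}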

-- 

Sasha.




Re: [Qemu-devel] [PATCH] ivshmem: use PIO for BAR0(Doorbell) instead of MMIO to reduce notification time

2011-11-17 Thread Sasha Levin
On Thu, 2011-11-17 at 16:36 +0200, Avi Kivity wrote:
> On 11/14/2011 05:56 AM, zanghongy...@huawei.com wrote:
> > From: Hongyong Zang 
> >
> > Ivshmem (nahanni) is a mechanism for sharing host memory with VMs running on
> > the same host. Currently, the guest notifies qemu by reading or writing the
> > ivshmem device's PCI MMIO BAR0 (Doorbell).
> >
> > This patch changes the PCI MMIO BAR0 (Doorbell) to PIO, and we find the guest
> > accesses the PIO BAR 30% faster than the MMIO BAR.
> >  
> >  CharDriverState **eventfd_chr;
> >  CharDriverState *server_chr;
> > -MemoryRegion ivshmem_mmio;
> > +MemoryRegion ivshmem_pio;
> >  
> > -pcibus_t mmio_addr;
> > +pcibus_t pio_addr;
> 
> 
> This is a backwards incompatible change.  The way to accomplish this is
> to add a new BAR which aliases the old one.  The new BAR should not be
> visible on guests created with -M pc-1.0 and below.  Please also update
> the spec so that driver authors can make use of the new feature.

Can we add an optional BAR 3 which does exactly what BAR 0 does, but is
in PIO space?

This will allow us to extend the spec instead of changing it, and in
turn drivers could remain compatible with QEMU and other device
implementations.

-- 

Sasha.




[Qemu-devel] [PATCH resend] vl.c: Don't limit node count by smp count

2011-06-29 Thread Sasha Levin
[I sent this patch a couple of months ago and noticed it
 didn't make its way in - so I'm sending it again]

It is possible to create CPU-less NUMA nodes, so the node count shouldn't be
limited by the number of CPUs.

Tested-by: Michael Roth 
Signed-off-by: Sasha Levin 
---
 vl.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/vl.c b/vl.c
index e0191e1..b95ae8d 100644
--- a/vl.c
+++ b/vl.c
@@ -3147,8 +3147,8 @@ int main(int argc, char **argv, char **envp)
 if (nb_numa_nodes > 0) {
 int i;
 
-if (nb_numa_nodes > smp_cpus) {
-nb_numa_nodes = smp_cpus;
+if (nb_numa_nodes > MAX_NODES) {
+nb_numa_nodes = MAX_NODES;
 }
 
 /* If no memory size if given for any node, assume the default case
-- 
1.7.6




Re: [Qemu-devel] [RFC v4 00/58] Memory API

2011-07-19 Thread Sasha Levin
On Tue, Jul 19, 2011 at 11:51 PM, Anthony Liguori  wrote:
> On 07/19/2011 11:10 AM, Avi Kivity wrote:
>>
>> On 07/19/2011 07:05 PM, Avi Kivity wrote:
>>>
>>> On 07/19/2011 05:50 PM, Anthony Liguori wrote:

>>
>> There's bits I don't like about the interface
>
> Which bits are these?

 Nothing I haven't already commented on. I think there's too much in
 the generic level. I don't think coalesced I/O belongs here. It's a
 concept that doesn't fit. I think a side-band API would be nicer.
>>>
>>> Well, it's impossible to do it in a side band. When a range that has
>>> coalesced mmio is exposed is completely orthogonal to programming the
>>> BAR register - it can happen, for example, due to another BAR being
>>> removed or the bridge window being programmed. You can also have a
>>> coalesced mmio region being partially clipped.
>>
>> Of course, it's not really impossible, just clumsy.
>
> There are exactly two devices that use coalesced I/O: VGA and e1000.
>
> VGA does coalesced I/O over the legacy VGA region (0xa0000 ... 0xc0000).
>  This region is very special in the PC and is directly routed by the I440FX
> to the appropriate first PCI graphics card.
>
> The VGA device knows exactly where this region is mapped.
>
> The e1000 does coalesced I/O for its memory registers.  But it's dubious 
> how much this actually matters anymore.  The original claim was a 10% boost
> with iperf.
>
> The e1000 is not performance competitive with virtio-net though so it
> certainly is reasonable to assume that no one would notice if we removed
> coalesced I/O from the e1000.
>
> The point is, it's so incredibly special cased that having it as part of
> such a general purpose API seems wrong.  Of the hundreds of devices, we only
> have one device that we know for sure really needs it and it could easily be
> done independently of the memory API for that device.

Sorry for sidetracking it a bit, but if coalesced mmio doesn't fit
nicely in a good memory model, maybe the problem is with coalesced
mmio and not the memory model itself.
It's something that's not really being used at the moment - neither in
qemu nor in kvm tools (we just use dirty log in a guest memory block
after finding out coalesced mmio is still slow).
What about just deprecating it and doing either the sockets approach
(whenever it's ready), or doing a brand new virtio-memory device to
avoid the possibility of sockets turning into something like coalesced
mmio?



Re: [Qemu-devel] [RFC v4 00/58] Memory API

2011-07-19 Thread Sasha Levin
On Tue, 2011-07-19 at 21:53 -0500, Anthony Liguori wrote:
> QEMU does use it and it's quite important.  Coalesced MMIO is really 
> about write caching MMIO exits.  It only works with devices that have 
> registers where writing has no side effects.  Moreover, it only really 
> works well when there are lots and lots of writes to these registers 
> simultaneously.
> 
> Couple that with the fact that the buffer is a fixed size and it's 
> really not flexible enough to be useful for a wide variety of devices.
> 
> But for VGA planar mode writes, it works wonders.  It would be terrible 
> to totally lose it.  That said, I'm not at all convinced it's useful for 
> much other than VGA planar mode.

Why was the coalesced approach taken in the first place? When I tried
using it for VGA in /tools/kvm it just seemed to me like a builtin
virtio-memory transport.

That's why I think planar VGA would be fine if we deprecate coalesced
mmio in favor of either socket ioeventfds or a new virtio-memory device.

-- 

Sasha.




[Qemu-devel] Guest kernel device compatibility auto-detection

2011-08-24 Thread Sasha Levin
Hi,

Currently when we run the guest we treat it as a black box: we're not
quite sure what it's going to start, or whether it supports the same
features we expect it to support when running it from the host.

This forces us to start the guest with the safest defaults possible, for
example: '-drive file=my_image.qcow2' will be started with slow IDE
emulation even though the guest is capable of virtio.

I'm currently working on a method to try and detect whether the guest
kernel has specific configurations enabled and either warn the user if
we know the kernel is not going to properly work or use better defaults
if we know some advanced features are going to work.

How am I planning to do it? First, we'll try finding which kernel the
guest is going to boot (easy when user does '-kernel', less easy when
the user boots an image). For simplicity's sake I'll stick with the
'-kernel' option for now.

Once we have the kernel we can do two things:
 1. See if the kernel was built with CONFIG_IKCONFIG.

 2. Try finding the System.map which belongs to the kernel; it's
provided with all distro kernels, so we can expect it to be around. If we
find it, we repeat the same process as in #1.

If we found one of the above, we start matching config sets ("we need
a,b,c,d for virtio, let's see if it's all there"). Once we find a good
config set, we use it for defaults. If we didn't find a good config set
we warn the user and don't even bother starting the guest.

If we couldn't find either, we can just default to whatever we have as
defaults now.
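
As a rough sketch of the probing in #1 (this keys on the IKCFG_ST/IKCFG_ED
markers that CONFIG_IKCONFIG wraps around the gzipped .config - the same
markers scripts/extract-ikconfig looks for. It assumes an uncompressed
vmlinux; a bzImage would need its payload decompressed first. Error
handling trimmed):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Return the offset of 'needle' inside buf, or -1 if it isn't there. */
static long find_blob(const char *buf, long len, const char *needle)
{
    long i, nlen = strlen(needle);

    for (i = 0; i + nlen <= len; i++)
        if (!memcmp(buf + i, needle, nlen))
            return i;
    return -1;
}

int main(int argc, char **argv)
{
    FILE *f = fopen(argv[1], "rb");
    long len, start, end;
    char *buf;

    fseek(f, 0, SEEK_END);
    len = ftell(f);
    rewind(f);
    buf = malloc(len);
    fread(buf, 1, len, f);
    fclose(f);

    start = find_blob(buf, len, "IKCFG_ST");
    end = find_blob(buf, len, "IKCFG_ED");
    if (start >= 0 && end > start)
        /* the gzipped .config sits between the markers; inflate it and
           match CONFIG_* options against the wanted config sets */
        printf("embedded config found (%ld bytes)\n", end - start);
    else
        printf("no IKCONFIG, fall back to System.map\n");
    return 0;
}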


To sum it up, I was wondering if this approach has been considered
before and whether it sounds interesting enough to try.

-- 

Sasha.




Re: [Qemu-devel] Guest kernel device compatibility auto-detection

2011-08-25 Thread Sasha Levin
On Thu, 2011-08-25 at 08:32 +0100, Richard W.M. Jones wrote:
> On Thu, Aug 25, 2011 at 08:33:04AM +0300, Avi Kivity wrote:
> > On 08/25/2011 08:21 AM, Sasha Levin wrote:
> > >Hi,
> > >
> > >Currently when we run the guest we treat it as a black box, we're not
> > >quite sure what it's going to start and whether it supports the same
> > >features we expect it to support when running it from the host.
> > >
> > >This forces us to start the guest with the safest defaults possible, for
> > >example: '-drive file=my_image.qcow2' will be started with slow IDE
> > >emulation even though the guest is capable of virtio.
> > >
> > >I'm currently working on a method to try and detect whether the guest
> > >kernel has specific configurations enabled and either warn the user if
> > >we know the kernel is not going to properly work or use better defaults
> > >if we know some advanced features are going to work.
> > >
> > >How am I planning to do it? First, we'll try finding which kernel the
> > >guest is going to boot (easy when user does '-kernel', less easy when
> > >the user boots an image). For simplicity's sake I'll stick with the
> > >'-kernel' option for now.
> > >
> > >Once we have the kernel we can do two things:
> > >  1. See if the kernel was built with CONFIG_IKCONFIG.
> > >
> > >  2. Try finding the System.map which belongs to the kernel, it's
> > >provided with all distro kernels so we can expect it to be around. If we
> > >did find it we repeat the same process as in #1.
> > >
> > >If we found one of the above, we start matching config sets ("we need
> > >a,b,c,d for virtio, let's see if it's all there"). Once we find a good
> > >config set, we use it for defaults. If we didn't find a good config set
> > >we warn the user and don't even bother starting the guest.
> > >
> > >If we couldn't find either, we can just default to whatever we have as
> > >defaults now.
> > >
> > >
> > >To sum it up, I was wondering if this approach has been considered
> > >before and whether it sounds interesting enough to try.
> > >
> > 
> > This is a similar problem to p2v or v2v - taking a guest that used
> > to run on physical or virtual hardware, and modifying it to run on
> > (different) virtual hardware.  The first step is what you're looking
> > for - detecting what the guest currently supports.
> > 
> > You can look at http://libguestfs.org/virt-v2v/ for an example.  I'm
> > also copying Richard Jones, who maintains libguestfs, which does the
> > actual poking around in the guest.
> 
> Yes, as Avi says, we do all of the above already.  Including
> for Windows guests.

From what I gathered, libguestfs only provides access to the guests'
image.

Which part is doing the IKCONFIG or System.map probing? Or is it done in
a different way?

-- 

Sasha.




[Qemu-devel] Questions regarding ivshmem spec

2011-08-25 Thread Sasha Levin
Hello,

I am looking to implement an ivshmem device for KVM tools; the purpose
is to provide the same functionality as QEMU and interoperability with it.

Going through the spec (found here:
https://gitorious.org/nahanni/guest-code/blobs/master/device_spec.txt )
and the code in QEMU, I have gathered several questions; I'd be happy
for some help with them.

1. File handles and guest IDs are passed between the server and the
peers using sockets, is the protocol itself documented anywhere? I would
like to be able to work alongside QEMU servers/peers.

2. The spec describes DOORBELL as an array of DWORDs: when one guest
wants to poke a different guest, it writes something into the offset
of the other guest in the DOORBELL array.
Looking at the implementation in QEMU, DOORBELL is one DWORD; when
writing to it, the upper WORD is the guest id and the lower WORD is the
value.
What am I missing here?

3. There are 3 ways for guests to communicate with each other, and
I'm assuming all guests using the same SHM block must use the same
method. Is it safe to assume we'll always use ioeventfds as in the
implementation now?

Thanks!

-- 

Sasha.




Re: [Qemu-devel] Questions regarding ivshmem spec

2011-08-25 Thread Sasha Levin
On Thu, 2011-08-25 at 17:00 +0300, Avi Kivity wrote:
> On 08/25/2011 04:29 PM, Sasha Levin wrote:
> > 2. The spec describes DOORBELL as an array of DWORDs, when one guest
> > wants to poke a different guest it would write something into the offset
> > of the other guest in the DOORBELL array.
> > Looking at the implementation in QEMU, DOORBELL is one DWORD, when
> > writing to it the upper WORD is the guest id and the lower WORD is the
> > value.
> > What am I missing here?
> >
> 
> The spec in qemu.git is accurate.  The intent is to use an ioeventfd 
> bound into an irqfd so a write into the doorbell injects an interrupt 
> directly into the other guest, without going through qemu^Wkvm tool.
> 

But the doorbell is a single DWORD, so if a guest writes to it we'd
still need to figure out which guest/vector he wants to poke from
userspace, no?

If it was an array of doorbells then yes, we could assign an ioeventfd
to each offset - but now I don't quite see how we can avoid passing
through the userspace.

-- 

Sasha.




Re: [Qemu-devel] Questions regarding ivshmem spec

2011-08-25 Thread Sasha Levin
On Thu, 2011-08-25 at 17:40 +0300, Avi Kivity wrote:
> On 08/25/2011 05:39 PM, Sasha Levin wrote:
> > On Thu, 2011-08-25 at 17:00 +0300, Avi Kivity wrote:
> > >  On 08/25/2011 04:29 PM, Sasha Levin wrote:
> > >  >  2. The spec describes DOORBELL as an array of DWORDs, when one guest
> > >  >  wants to poke a different guest it would write something into the 
> > > offset
> > >  >  of the other guest in the DOORBELL array.
> > >  >  Looking at the implementation in QEMU, DOORBELL is one DWORD, when
> > >  >  writing to it the upper WORD is the guest id and the lower WORD is the
> > >  >  value.
> > >  >  What am I missing here?
> > >  >
> > >
> > >  The spec in qemu.git is accurate.  The intent is to use an ioeventfd
> > >  bound into an irqfd so a write into the doorbell injects an interrupt
> > >  directly into the other guest, without going through qemu^Wkvm tool.
> > >
> >
> > But the doorbell is a single DWORD, so if a guest writes to it we'd
> > still need to figure out which guest/vector he wants to poke from
> > userspace, no?
> >
> > If it was an array of doorbells then yes, we could assign an ioeventfd
> > to each offset - but now I don't quite see how we can avoid passing
> > through the userspace.
> >
> 
> Use the datamatch facility.
> 
> We didn't want an array of registers to avoid scaling issues (PIO space 
> is quite small).
> 
> 

Ah, right.
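
For reference, a minimal sketch of the datamatch binding (the doorbell
encoding and the helper are illustrative, not taken from the ivshmem
spec):

#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* One eventfd per peer/vector, all bound to the same doorbell address,
 * each matching on a different written value. */
static int bind_doorbell(int vm_fd, __u64 doorbell_gpa,
                         __u16 peer_id, __u16 vector)
{
    int efd = eventfd(0, 0);
    struct kvm_ioeventfd args = {
        .datamatch = ((__u32)peer_id << 16) | vector,
        .addr      = doorbell_gpa,
        .len       = 4,
        .fd        = efd,
        .flags     = KVM_IOEVENTFD_FLAG_DATAMATCH,
    };

    if (efd < 0 || ioctl(vm_fd, KVM_IOEVENTFD, &args) < 0)
        return -1;
    return efd; /* wire this up as an irqfd into the target guest */
}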

Thanks!

-- 

Sasha.




Re: [Qemu-devel] Guest kernel device compatibility auto-detection

2011-08-25 Thread Sasha Levin
On Thu, 2011-08-25 at 16:48 -0500, Anthony Liguori wrote:
> On 08/25/2011 12:21 AM, Sasha Levin wrote:
> > Hi,
> >
> > Currently when we run the guest we treat it as a black box, we're not
> > quite sure what it's going to start and whether it supports the same
> > features we expect it to support when running it from the host.
> >
> > This forces us to start the guest with the safest defaults possible, for
> > example: '-drive file=my_image.qcow2' will be started with slow IDE
> > emulation even though the guest is capable of virtio.
> >
> > I'm currently working on a method to try and detect whether the guest
> > kernel has specific configurations enabled and either warn the user if
> > we know the kernel is not going to properly work or use better defaults
> > if we know some advanced features are going to work.
> >
> > How am I planning to do it? First, we'll try finding which kernel the
> > guest is going to boot (easy when user does '-kernel', less easy when
> > the user boots an image). For simplicity sake I'll stick with the
> > '-kernel' option for now.
> 
> Is the problem you're trying to solve determine whether the guest kernel 
> is going to work well under kvm tool or trying to choose the right 
> hardware profile to expose to the guest?
> 
> If it's the former, I think the path you're heading down is the most 
> likely to succeed (trying to guess based on what you can infer about the 
> kernel).
> 
> If it's the later, there's some interesting possibilities we never fully 
> explored in QEMU.
> 

I was thinking about both: I've considered kvm tools to be the 'easy'
case, where we only say whether it would work or not, and QEMU as the hard
one, where we need to build a working configuration.

> One would be exposing a well supported device (like IDE emulation) and 
> having a magic mode that allowed you to basically promote the device 
> from IDE emulation to virtio-blk.  Likewise, you could do something like 
> that to promote from the e1000 to virtio-net.
> 
> It might require some special support in the guest kernel and would 
> likely be impossible to do in Windows, but if you primarily care about 
> Linux guests, it ought to be possible.

You're thinking about trying to expose all interfaces during boot and
seeing which ones the kernel bites?

Another thing that comes to mind is that we could start this project
with a script that, given a kernel, would find the optimal hardware
configuration for it (and the matching QEMU command line).

It would simply work the other way around: try booting with the best
devices first; if it doesn't boot, we would 'demote' them one at a time
until the kernel does boot.
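
Something like this at its core, as a sketch (the profile lists and the
boot-test hook are made up for illustration):

#include <stddef.h>
#include <stdbool.h>

static const char *disk_profiles[] = { "virtio-blk", "scsi", "ide", NULL };
static const char *net_profiles[]  = { "virtio-net", "e1000", "rtl8139", NULL };

/* Walk a profile list from best to worst; boots_with() would launch the
 * guest with that device and watch for a successful boot, with a timeout. */
static const char *pick(const char **profiles,
                        bool (*boots_with)(const char *))
{
    size_t i;

    for (i = 0; profiles[i]; i++)
        if (boots_with(profiles[i]))
            return profiles[i];
    return NULL; /* nothing booted; warn the user */
}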

-- 

Sasha.




Re: [Qemu-devel] Guest kernel device compatibility auto-detection

2011-08-25 Thread Sasha Levin
On Thu, 2011-08-25 at 16:25 +, Decker, Schorschi wrote:
> I would ask two things be done in the design if it goes forward, 1)
> have an explicit way to disable this feature, where the hypervisor
> cannot interact with the guest OS directly in any way if disablement
> is selected.

I doubt that this (or anything similar), if introduced, will even be set
to on by default. It has the potential of breaking stuff that would
otherwise work (that's why the default boot is with the safest
configuration possible).

On Thu, 2011-08-25 at 16:25 +, Decker, Schorschi wrote:
> 2) implement the feature as an agent in the guest OS where the
> hypervisor can only query the guest OS agent, using a standard TCP/IP
> methodology.

I was planning to implement it by probing the image before actually
booting it.
This process is completely offline and doesn't require interaction with
the guest. The guest isn't even running at that point.

-- 

Sasha.




Re: [Qemu-devel] Guest kernel device compatibility auto-detection

2011-08-26 Thread Sasha Levin
On Fri, 2011-08-26 at 09:04 +0100, Richard W.M. Jones wrote:
> On Fri, Aug 26, 2011 at 09:22:45AM +0300, Sasha Levin wrote:
> > On Thu, 2011-08-25 at 16:25 +, Decker, Schorschi wrote:
> > > 2) implement the feature as an agent in the guest OS where the
> > > hypervisor can only query the guest OS agent, using a standard TCP/IP
> > > methodology.
> >
> > I was planning to implement it by probing the image before
> > actually booting it.  This process is completely offline and doesn't
> > require interaction with the guest. The guest isn't even running at
> > that point.
> 
> There are still plenty of security issues to be concerned about with
> handling an offline guest.  It is quite possible for such a guest to
> be booby-trapped in a way that allows an exploit.  I summarised some
> of the issues I thought about here, but there are likely to be others:
> 
> http://libguestfs.org/guestfs.3.html#security

That was an interesting read.

Are the concerns still valid if we were going to boot the image anyway
later on?

I'm assuming that probing would happen only before we're trying to boot
a guest, and not just probe any image file we find.

-- 

Sasha.




Re: [Qemu-devel] Questions regarding ivshmem spec

2011-08-29 Thread Sasha Levin
On Thu, 2011-08-25 at 16:29 +0300, Sasha Levin wrote:
> Hello,
> 
> I am looking to implement an ivshmem device for KVM tools; the purpose
> is to provide the same functionality as QEMU and interoperability with it.

[snip]

> 1. File handles and guest IDs are passed between the server and the
> peers using sockets, is the protocol itself documented anywhere? I would
> like to be able to work alongside QEMU servers/peers. 

I'm still wondering if someone could do a quick sketch of the
client-server protocol or possibly point me to something that documents
it?

-- 

Sasha.




Re: [Qemu-devel] [PATCH] virtio: Make memory barriers be memory barriers

2011-09-01 Thread Sasha Levin
On Thu, 2011-09-01 at 16:09 +1000, David Gibson wrote:
> The virtio code already has memory barrier wmb() macros in the code.
> However they were defined as no-ops.  The comment claims that real
> barriers are not necessary because the code does not run concurrently.
> However, with kvm and the io-thread enabled, this is not true and this qemu
> code can indeed run concurrently with the guest kernel.  This does not
> cause problems on x86 due to its strongly ordered storage model, but it
> causes a race leading to virtio errors on POWER, which has a relaxed storage
> ordering model.
> 
> Specifically, QEMU puts a new element into the "used" ring and then
> updates the ring's free-running counter.  Without a barrier between these,
> under the right circumstances the guest linux driver can receive an
> interrupt, read the counter change, but find that the ring element to be
> handled still has an old value, leading to an "id %u is not a head!\n"
> error message.
> 
> The problem is easy to reproduce on POWER using virtio-net with heavy
> traffic.
> 
> The patch defines wmb() as __sync_synchronize(), a cross platform memory
> barrier primitive available in sufficiently recent gcc versions (gcc 4.2
> and after?).  If we care about older gccs then this patch will need to
> be updated with some sort of fallback.
> 
> Signed-off-by: Alexey Kardashevskiy 
> Signed-off-by: David Gibson 
> ---
>  hw/virtio.c |   10 ++
>  1 files changed, 2 insertions(+), 8 deletions(-)
> 
> diff --git a/hw/virtio.c b/hw/virtio.c
> index 13aa0fa..c9f0e75 100644
> --- a/hw/virtio.c
> +++ b/hw/virtio.c
> @@ -21,14 +21,8 @@
>   * x86 pagesize again. */
>  #define VIRTIO_PCI_VRING_ALIGN 4096
>  
> -/* QEMU doesn't strictly need write barriers since everything runs in
> - * lock-step.  We'll leave the calls to wmb() in though to make it obvious for
> - * KVM or if kqemu gets SMP support.
> - * In any case, we must prevent the compiler from reordering the code.
> - * TODO: we likely need some rmb()/mb() as well.
> - */
> -
> -#define wmb() __asm__ __volatile__("": : :"memory")
> + /* TODO: we may also need rmb()s.  It hasn't bitten us yet, but.. */
> + #define wmb() __sync_synchronize()

That asm directive also implicitly provided a compiler barrier, I could
find whether __sync_synchronize() provides one as well.

Any idea if it does?
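
(For reference, the ordering described above as a sketch; the helper and
field access are a simplification of the real code, with wmb() as the
patch defines it:)

#include <stdint.h>
#include <linux/virtio_ring.h>

/* Producer side, e.g. QEMU completing a request.  The guest reads idx
 * first and only then the element, so these stores must not be
 * reordered. */
static void push_used(struct vring *vr, struct vring_used_elem elem,
                      uint16_t idx)
{
    vr->used->ring[idx % vr->num] = elem; /* 1. fill in the used element */
    wmb();                                /* 2. order the two stores     */
    vr->used->idx = idx + 1;              /* 3. only then bump the index */
}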

>  
>  typedef struct VRingDesc
>  {

-- 

Sasha.




Re: [Qemu-devel] [PATCH] virtio: Make memory barriers be memory barriers

2011-09-01 Thread Sasha Levin
On Thu, 2011-09-01 at 10:37 +0300, Sasha Levin wrote:
> On Thu, 2011-09-01 at 16:09 +1000, David Gibson wrote:
> > The virtio code already has memory barrier wmb() macros in the code.
> > However they were defined as no-ops.  The comment claims that real
> > barriers are not necessary because the code does not run concurrently.
> > However, with kvm and the io-thread enabled, this is not true and this qemu
> > code can indeed run concurrently with the guest kernel.  This does not
> > cause problems on x86 due to its strongly ordered storage model, but it
> > causes a race leading to virtio errors on POWER, which has a relaxed storage
> > ordering model.
> > 
> > Specifically, QEMU puts a new element into the "used" ring and then
> > updates the ring's free-running counter.  Without a barrier between these,
> > under the right circumstances the guest linux driver can receive an
> > interrupt, read the counter change, but find that the ring element to be
> > handled still has an old value, leading to an "id %u is not a head!\n"
> > error message.
> > 
> > The problem is easy to reproduce on POWER using virtio-net with heavy
> > traffic.
> > 
> > The patch defines wmb() as __sync_synchronize(), a cross platform memory
> > barrier primitive available in sufficiently recent gcc versions (gcc 4.2
> > and after?).  If we care about older gccs then this patch will need to
> > be updated with some sort of fallback.
> > 
> > Signed-off-by: Alexey Kardashevskiy 
> > Signed-off-by: David Gibson 
> > ---
> >  hw/virtio.c |   10 ++
> >  1 files changed, 2 insertions(+), 8 deletions(-)
> > 
> > diff --git a/hw/virtio.c b/hw/virtio.c
> > index 13aa0fa..c9f0e75 100644
> > --- a/hw/virtio.c
> > +++ b/hw/virtio.c
> > @@ -21,14 +21,8 @@
> >   * x86 pagesize again. */
> >  #define VIRTIO_PCI_VRING_ALIGN 4096
> >  
> > -/* QEMU doesn't strictly need write barriers since everything runs in
> > - * lock-step.  We'll leave the calls to wmb() in though to make it obvious for
> > - * KVM or if kqemu gets SMP support.
> > - * In any case, we must prevent the compiler from reordering the code.
> > - * TODO: we likely need some rmb()/mb() as well.
> > - */
> > -
> > -#define wmb() __asm__ __volatile__("": : :"memory")
> > + /* TODO: we may also need rmb()s.  It hasn't bitten us yet, but.. */
> > + #define wmb() __sync_synchronize()
> 
> That asm directive also implicitly provided a compiler barrier, I could

couldn't.

> find whether __sync_synchronize() provides one as well.
> 
> Any idea if it does?
> 
> >  
> >  typedef struct VRingDesc
> >  {
> 

-- 

Sasha.




[Qemu-devel] [PATCH] pc_init: Fail on bad kernel

2011-09-03 Thread Sasha Levin
When providing QEMU with a bad '-kernel' parameter, such as a file which
is not really a kernel, QEMU will attempt to allocate a huge amount of
memory and fail either with "Failed to allocate memory: Cannot allocate
memory" or a GLib error: "GLib-ERROR **: gmem.c:170: failed to allocate
18446744073709529965 bytes"

This patch handles the case where the magic sig wasn't located in the
provided kernel, and loading it as multiboot failed as well.

Cc: Anthony Liguori 
Signed-off-by: Sasha Levin 
---
 hw/pc.c |8 +++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/hw/pc.c b/hw/pc.c
index 6b3662e..428440b 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -691,8 +691,14 @@ static void load_linux(void *fw_cfg,
         /* This looks like a multiboot kernel. If it is, let's stop
            treating it like a Linux kernel. */
         if (load_multiboot(fw_cfg, f, kernel_filename, initrd_filename,
-                           kernel_cmdline, kernel_size, header))
+                           kernel_cmdline, kernel_size, header)) {
             return;
+        } else {
+            fprintf(stderr, "qemu: could not load kernel '%s': %s\n",
+                    kernel_filename, strerror(errno));
+            exit(1);
+        }
+
         protocol = 0;
 }
 
-- 
1.7.6.1




Re: [Qemu-devel] [PATCH] pc_init: Fail on bad kernel

2011-09-14 Thread Sasha Levin
Ping?

On Sat, 2011-09-03 at 22:35 +0300, Sasha Levin wrote:
> When providing QEMU with a bad '-kernel' parameter, such as a file which
> is not really a kernel, QEMU will attempt to allocate a huge amount of
> memory and fail either with "Failed to allocate memory: Cannot allocate
> memory" or a GLib error: "GLib-ERROR **: gmem.c:170: failed to allocate
> 18446744073709529965 bytes"
> 
> This patch handles the case where the magic sig wasn't located in the
> provided kernel, and loading it as multiboot failed as well.
> 
> Cc: Anthony Liguori 
> Signed-off-by: Sasha Levin 
> ---
>  hw/pc.c |8 +++-
>  1 files changed, 7 insertions(+), 1 deletions(-)
> 
> diff --git a/hw/pc.c b/hw/pc.c
> index 6b3662e..428440b 100644
> --- a/hw/pc.c
> +++ b/hw/pc.c
> @@ -691,8 +691,14 @@ static void load_linux(void *fw_cfg,
>          /* This looks like a multiboot kernel. If it is, let's stop
>             treating it like a Linux kernel. */
>          if (load_multiboot(fw_cfg, f, kernel_filename, initrd_filename,
> -                           kernel_cmdline, kernel_size, header))
> +                           kernel_cmdline, kernel_size, header)) {
>              return;
> +        } else {
> +            fprintf(stderr, "qemu: could not load kernel '%s': %s\n",
> +                    kernel_filename, strerror(errno));
> +            exit(1);
> +        }
> +
>          protocol = 0;
>      }
>  

-- 

Sasha.




Re: [Qemu-devel] [RFC]QEMU disk I/O limits

2011-06-01 Thread Sasha Levin
Hi,

On Mon, 2011-05-30 at 13:09 +0800, Zhi Yong Wu wrote:
> Hello, all,
> 
> I have prepared to work on a feature called "Disk I/O limits" for 
> qemu-kvm project.
> This feature will enable the user to cap disk I/O amount performed by a 
> VM.It is important for some storage resources to be shared among multi-VMs. 
> As you've known, if some of VMs are doing excessive disk I/O, they will hurt 
> the performance of other VMs.
> 
> More detail is available here:
> http://wiki.qemu.org/Features/DiskIOLimits
> 
> 1.) Why we need per-drive disk I/O limits 
> As you've known, for linux, cgroup blkio-controller has supported I/O 
> throttling on block devices. More importantly, there is no single mechanism 
> for disk I/O throttling across all underlying storage types (image file, LVM, 
> NFS, Ceph) and for some types there is no way to throttle at all. 
> 
> Disk I/O limits feature introduces QEMU block layer I/O limits together 
> with command-line and QMP interfaces for configuring limits. This allows I/O 
> limits to be imposed across all underlying storage types using a single 
> interface.
> 
> 2.) How disk I/O limits will be implemented
> QEMU block layer will introduce a per-drive disk I/O request queue for 
> those disks whose "disk I/O limits" feature is enabled. It can control disk 
> I/O limits individually for each disk when multiple disks are attached to a 
> VM, and enable use cases like unlimited local disk access but shared storage 
> access with limits. 
> In a multiple I/O threads scenario, when an application in a VM issues a 
> block I/O request, this request will be intercepted by QEMU block layer, then 
> it will calculate disk runtime I/O rate and determine if it has go beyond its 
> limits. If yes, this I/O request will enqueue to that introduced queue; 
> otherwise it will be serviced.
> 
> 3.) How the users enable and play with it
> QEMU -drive option will be extended so that disk I/O limits can be 
> specified on its command line, such as -drive [iops=xxx,][throughput=xxx] or 
> -drive [iops_rd=xxx,][iops_wr=xxx,][throughput=xxx] etc. When this argument 
> is specified, it means that "disk I/O limits" feature is enabled for this 
> drive disk.
> The feature will also provide users with the ability to change per-drive 
> disk I/O limits at runtime using QMP commands.

I'm wondering if you've considered adding a 'burst' parameter -
something which will not limit (or limit less) the io ops or the
throughput for the first 'x' ms in a given time window.

> Regards,
> 
> Zhiyong Wu
> 

-- 

Sasha.




Re: [Qemu-devel] [RFC]QEMU disk I/O limits

2011-06-02 Thread Sasha Levin
On Thu, 2011-06-02 at 14:29 +0800, Zhi Yong Wu wrote:
> On Thu, Jun 02, 2011 at 09:17:06AM +0300, Sasha Levin wrote:
> >Date: Thu, 02 Jun 2011 09:17:06 +0300
> >From: Sasha Levin 
> >To: Zhi Yong Wu 
> >Cc: qemu-devel@nongnu.org, k...@vger.kernel.org, kw...@redhat.com,
> > aligu...@us.ibm.com, herb...@gondor.apana.org.au,
> > guijianf...@cn.fujitsu.com, wu...@cn.ibm.com, luow...@cn.ibm.com,
> > zh...@cn.ibm.com, zhaoy...@cn.ibm.com, l...@redhat.com,
> > rahar...@us.ibm.com, vgo...@redhat.com, stefa...@linux.vnet.ibm.com
> >Subject: Re: [Qemu-devel] [RFC]QEMU disk I/O limits
> >X-Mailer: Evolution 2.32.2 
> >
> >Hi,
> >
> >On Mon, 2011-05-30 at 13:09 +0800, Zhi Yong Wu wrote:
> >> Hello, all,
> >> 
> >> I have prepared to work on a feature called "Disk I/O limits" for 
> >> qemu-kvm project.
> >> This feature will enable the user to cap disk I/O amount performed by 
> >> a VM.It is important for some storage resources to be shared among 
> >> multi-VMs. As you've known, if some of VMs are doing excessive disk I/O, 
> >> they will hurt the performance of other VMs.
> >> 
> >> More detail is available here:
> >> http://wiki.qemu.org/Features/DiskIOLimits
> >> 
> >> 1.) Why we need per-drive disk I/O limits 
> >> As you've known, for linux, cgroup blkio-controller has supported I/O 
> >> throttling on block devices. More importantly, there is no single 
> >> mechanism for disk I/O throttling across all underlying storage types 
> >> (image file, LVM, NFS, Ceph) and for some types there is no way to 
> >> throttle at all. 
> >> 
> >> Disk I/O limits feature introduces QEMU block layer I/O limits 
> >> together with command-line and QMP interfaces for configuring limits. This 
> >> allows I/O limits to be imposed across all underlying storage types using 
> >> a single interface.
> >> 
> >> 2.) How disk I/O limits will be implemented
> >> QEMU block layer will introduce a per-drive disk I/O request queue for 
> >> those disks whose "disk I/O limits" feature is enabled. It can control 
> >> disk I/O limits individually for each disk when multiple disks are 
> >> attached to a VM, and enable use cases like unlimited local disk access 
> >> but shared storage access with limits. 
> >> In a multiple I/O threads scenario, when an application in a VM issues a 
> >> block I/O request, this request will be intercepted by QEMU block layer, 
> >> then it will calculate disk runtime I/O rate and determine if it has go 
> >> beyond its limits. If yes, this I/O request will enqueue to that 
> >> introduced queue; otherwise it will be serviced.
> >> 
> >> 3.) How the users enable and play with it
> >> QEMU -drive option will be extended so that disk I/O limits can be 
> >> specified on its command line, such as -drive [iops=xxx,][throughput=xxx] 
> >> or -drive [iops_rd=xxx,][iops_wr=xxx,][throughput=xxx] etc. When this 
> >> argument is specified, it means that "disk I/O limits" feature is enabled 
> >> for this drive disk.
> >> The feature will also provide users with the ability to change 
> >> per-drive disk I/O limits at runtime using QMP commands.
> >
> >I'm wondering if you've considered adding a 'burst' parameter -
> >something which will not limit (or limit less) the io ops or the
> >throughput for the first 'x' ms in a given time window.
> Currently no. Could you let us know what scenario it would make sense in?

My assumption is that most guests are not doing constant disk I/O
access. Instead, the operations are usually short and happen on a small
scale (a relatively small number of bytes accessed).

For example: Multiple table DB lookup, serving a website, file servers.

Basically, if I need to do a DB lookup which needs 50MB of data from a
disk which is limited to 10MB/s, I'd rather let it burst for 1 second
and complete the lookup faster instead of having it read data for 5
seconds.

If the guest now starts running multiple lookups one after the other,
that's when I would like to limit.
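
What I have in mind is basically a token bucket whose capacity is larger
than one second's worth of rate; a sketch (the names are made up, not
from the proposed code):

#include <stdbool.h>

struct io_bucket {
    double tokens;   /* bytes currently available       */
    double capacity; /* burst budget, e.g. 50 MB        */
    double rate;     /* sustained limit, e.g. 10 MB/s   */
    double last;     /* time of the last refill, in sec */
};

static bool io_allowed(struct io_bucket *b, double now, double bytes)
{
    b->tokens += (now - b->last) * b->rate; /* refill at the sustained rate */
    if (b->tokens > b->capacity)
        b->tokens = b->capacity;            /* the cap is the burst budget  */
    b->last = now;
    if (b->tokens < bytes)
        return false;                       /* over budget: queue the request */
    b->tokens -= bytes;
    return true;
}

With capacity = 50MB and rate = 10MB/s, an idle guest gets the whole DB
lookup through at once, while back-to-back lookups degrade to the
sustained 10MB/s.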

> Regards,
> 
> Zhiyong Wu
> >
> >> Regards,
> >> 
> >> Zhiyong Wu
> >> 
> >
> >-- 
> >
> >Sasha.
> >

-- 

Sasha.




Re: [Qemu-devel] [PATCH v4 3/6] KVM: Initialize irqfd from kvm_init().

2013-04-03 Thread Sasha Levin
On 02/28/2013 04:22 AM, Cornelia Huck wrote:
> Currently, eventfd introduces module_init/module_exit functions
> to initialize/cleanup the irqfd workqueue. This only works, however,
> if no other module_init/module_exit functions are built into the
> same module.
> 
> Let's just move the initialization and cleanup to kvm_init and kvm_exit.
> This way, it is also clearer where kvm startup may fail.
> 
> Signed-off-by: Cornelia Huck 

I'm seeing this during boot:

[6.763302] ------------[ cut here ]------------
[6.763763] WARNING: at kernel/workqueue.c:4204 destroy_workqueue+0x1df/0x3d0()
[6.764507] Modules linked in:
[6.764792] Pid: 1, comm: swapper/0 Tainted: G        W    3.9.0-rc5-next-20130402-sasha-00015-g3522ec5 #324
[6.765654] Call Trace:
[6.765875]  [] warn_slowpath_common+0x8b/0xc0
[6.766436]  [] warn_slowpath_null+0x15/0x20
[6.766947]  [] destroy_workqueue+0x1df/0x3d0
[6.768631]  [] kvm_irqfd_exit+0x10/0x20
[6.77]  [] kvm_init+0x2ab/0x310
[6.770607]  [] ? cpu_has_kvm_support+0x4d/0x4d
[6.771241]  [] vmx_init+0x1f4/0x437
[6.771709]  [] ? cpu_has_kvm_support+0x4d/0x4d
[6.772266]  [] do_one_initcall+0xb2/0x1b0
[6.772995]  [] kernel_init_freeable+0x15d/0x1ef
[6.773857]  [] ? loglevel+0x31/0x31
[6.774609]  [] ? rest_init+0x140/0x140
[6.775551]  [] kernel_init+0x9/0xf0
[6.776162]  [] ret_from_fork+0x7c/0xb0
[6.776662]  [] ? rest_init+0x140/0x140
[6.777241] ---[ end trace 10bba684ced4346a ]---

And I think it has something to do with this patch.


Thanks,
Sasha



[Qemu-devel] [PATCH] Don't limit node count by smp count

2011-03-30 Thread Sasha Levin
It is possible to create CPU-less NUMA nodes, so the node count shouldn't
be limited by the number of CPUs.

Signed-off-by: Sasha Levin 
---
 vl.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/vl.c b/vl.c
index 8bcf2ae..8cc1aa8 100644
--- a/vl.c
+++ b/vl.c
@@ -3002,8 +3002,8 @@ int main(int argc, char **argv, char **envp)
     if (nb_numa_nodes > 0) {
         int i;
 
-        if (nb_numa_nodes > smp_cpus) {
-            nb_numa_nodes = smp_cpus;
+        if (nb_numa_nodes > MAX_NODES) {
+            nb_numa_nodes = MAX_NODES;
         }
 
         /* If no memory size if given for any node, assume the default case




Re: [Qemu-devel] [PATCH] Don't limit node count by smp count

2011-03-30 Thread Sasha Levin
On Wed, 2011-03-30 at 14:28 -0500, Anthony Liguori wrote:
> On 03/30/2011 02:14 PM, Sasha Levin wrote:
> > It is possible to create CPU-less NUMA nodes, so the node count shouldn't
> > be limited by the number of CPUs.
> 
> But does this actually work in the code today and does it work with any 
> guests?
> 

I've tested it on an Ubuntu desktop as a guest and it worked fine;
/sys/devices/system/node/has_cpu was as expected.
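
(Roughly what I ran - the exact sizes are from memory and only
illustrative:

    qemu-system-x86_64 -m 1024 -smp 2 \
        -numa node,cpus=0-1,mem=512 \
        -numa node,mem=512

and then checked the node directories under /sys/devices/system/node/ in
the guest.)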

> Regards,
> 
> Anthony Liguori
> 
> > Signed-off-by: Sasha Levin
> > ---
> >   vl.c |4 ++--
> >   1 files changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/vl.c b/vl.c
> > index 8bcf2ae..8cc1aa8 100644
> > --- a/vl.c
> > +++ b/vl.c
> > @@ -3002,8 +3002,8 @@ int main(int argc, char **argv, char **envp)
> >      if (nb_numa_nodes > 0) {
> >          int i;
> >
> > -        if (nb_numa_nodes > smp_cpus) {
> > -            nb_numa_nodes = smp_cpus;
> > +        if (nb_numa_nodes > MAX_NODES) {
> > +            nb_numa_nodes = MAX_NODES;
> >          }
> >
> >          /* If no memory size if given for any node, assume the default case
> >
> >
> 

-- 

Sasha.




Re: [Qemu-devel] [PATCH] Introduce QEMU_NEW()

2011-07-25 Thread Sasha Levin
On Mon, 2011-07-25 at 11:32 +0200, Alexander Graf wrote:
> On 25.07.2011, at 10:51, Avi Kivity wrote:
> 
> > qemu_malloc() is type-unsafe as it returns a void pointer.  Introduce
> > QEMU_NEW() (and QEMU_NEWZ()), which return the correct type.
> 
> What does this buy you over
> 
> type *x = qemu_malloc(sizeof(type));
> 
> ? I find the non-C++ version easier to read even.

It'll warn when you do silly things such as:

struct some_struct *k;

k = qemu_malloc(sizeof(k));
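
Presumably the macros boil down to something like this (my guess at the
shape, not the actual patch):

#define QEMU_NEW(type)  ((type *)qemu_malloc(sizeof(type)))
#define QEMU_NEWZ(type) ((type *)qemu_mallocz(sizeof(type)))

k = QEMU_NEW(struct some_struct) then always allocates
sizeof(struct some_struct) and returns a struct some_struct *, so both
the sizeof(k)-vs-sizeof(*k) slip above and an assignment to an
incompatible pointer type get flagged by the compiler.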

-- 

Sasha.




Re: [Qemu-devel] [net-next RFC PATCH 7/7] virtio-net changes

2011-08-12 Thread Sasha Levin
On Fri, 2011-08-12 at 09:55 +0800, Jason Wang wrote:
> From: Krishna Kumar 
> 
> Implement mq virtio-net driver.
> 
> Though struct virtio_net_config changes, it works with the old
> qemu since the last element is not accessed unless qemu sets
> VIRTIO_NET_F_MULTIQUEUE.
> 
> Signed-off-by: Krishna Kumar 
> Signed-off-by: Jason Wang 
> ---

Could these changes be documented in virtio-spec as well?

>  drivers/net/virtio_net.c   |  578 +++-
>  include/linux/virtio_net.h |3 
>  2 files changed, 411 insertions(+), 170 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 0c7321c..03a199d 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -49,16 +49,48 @@ struct virtnet_stats {
>   u64 rx_packets;
>  };
>  
> -struct virtnet_info {
> - struct virtio_device *vdev;
> - struct virtqueue *rvq, *svq, *cvq;
> - struct net_device *dev;
> +/* Internal representation of a send virtqueue */
> +struct send_queue {
> + /* Virtqueue associated with this send _queue */
> + struct virtqueue *svq;
> +
> + /* TX: fragments + linear part + virtio header */
> + struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
> +};
> +
> +/* Internal representation of a receive virtqueue */
> +struct receive_queue {
> + /* Virtqueue associated with this receive_queue */
> + struct virtqueue *rvq;
> +
> + /* Back pointer to the virtnet_info */
> + struct virtnet_info *vi;
> +
>   struct napi_struct napi;
> - unsigned int status;
>  
>   /* Number of input buffers, and max we've ever had. */
>   unsigned int num, max;
>  
> + /* Work struct for refilling if we run low on memory. */
> + struct delayed_work refill;
> +
> + /* Chain pages by the private ptr. */
> + struct page *pages;
> +
> + /* RX: fragments + linear part + virtio header */
> + struct scatterlist rx_sg[MAX_SKB_FRAGS + 2];
> +};
> +
> +struct virtnet_info {
> + struct send_queue **sq;
> + struct receive_queue **rq;
> +
> + int numtxqs; /* # of rxqs/txqs */
> + struct virtio_device *vdev;
> + struct virtqueue *cvq;
> + struct net_device *dev;
> + unsigned int status;
> +
>   /* I like... big packets and I cannot lie! */
>   bool big_packets;
>  
> @@ -67,16 +99,6 @@ struct virtnet_info {
>  
>   /* Active statistics */
>   struct virtnet_stats __percpu *stats;
> -
> - /* Work struct for refilling if we run low on memory. */
> - struct delayed_work refill;
> -
> - /* Chain pages by the private ptr. */
> - struct page *pages;
> -
> - /* fragments + linear part + virtio header */
> - struct scatterlist rx_sg[MAX_SKB_FRAGS + 2];
> - struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
>  };
>  
>  struct skb_vnet_hdr {
> @@ -106,22 +128,22 @@ static inline struct skb_vnet_hdr *skb_vnet_hdr(struct sk_buff *skb)
>   * private is used to chain pages for big packets, put the whole
>   * most recent used list in the beginning for reuse
>   */
> -static void give_pages(struct virtnet_info *vi, struct page *page)
> +static void give_pages(struct receive_queue *rq, struct page *page)
>  {
>   struct page *end;
>  
>   /* Find end of list, sew whole thing into vi->pages. */
>   for (end = page; end->private; end = (struct page *)end->private);
> - end->private = (unsigned long)vi->pages;
> - vi->pages = page;
> + end->private = (unsigned long)rq->pages;
> + rq->pages = page;
>  }
>  
> -static struct page *get_a_page(struct virtnet_info *vi, gfp_t gfp_mask)
> +static struct page *get_a_page(struct receive_queue *rq, gfp_t gfp_mask)
>  {
> - struct page *p = vi->pages;
> + struct page *p = rq->pages;
>  
>   if (p) {
> - vi->pages = (struct page *)p->private;
> + rq->pages = (struct page *)p->private;
>   /* clear private here, it is used to chain pages */
>   p->private = 0;
>   } else
> @@ -132,12 +154,13 @@ static struct page *get_a_page(struct virtnet_info *vi, gfp_t gfp_mask)
>  static void skb_xmit_done(struct virtqueue *svq)
>  {
>   struct virtnet_info *vi = svq->vdev->priv;
> + int qnum = svq->queue_index / 2; /* RX/TX vqs are allocated in pairs */
>  
>   /* Suppress further interrupts. */
>   virtqueue_disable_cb(svq);
>  
>   /* We were probably waiting for more output buffers. */
> - netif_wake_queue(vi->dev);
> + netif_wake_subqueue(vi->dev, qnum);
>  }
>  
>  static void set_skb_frag(struct sk_buff *skb, struct page *page,
> @@ -157,9 +180,10 @@ static void set_skb_frag(struct sk_buff *skb, struct page *page,
>   *len -= f->size;
>  }
>  
> -static struct sk_buff *page_to_skb(struct virtnet_info *vi,
> +static struct sk_buff *page_to_skb(struct receive_queue *rq,
>  struct page *page, unsigned int len)
>  {
> + struct virtnet_info *vi = rq->vi;
>   struct sk_bu