KSM memory saving amount

2011-09-08 Thread Mihamina Rakotomandimby

Hi all,

Running Ubuntu Natty, where I launched 3 Xubuntu guests and 1 Win 7 guest, I can see:
/sys/kernel/mm/ksm/full_scans: 34
/sys/kernel/mm/ksm/pages_shared: 8020
/sys/kernel/mm/ksm/pages_sharing: 36098
/sys/kernel/mm/ksm/pages_to_scan: 100
/sys/kernel/mm/ksm/pages_unshared: 247109
/sys/kernel/mm/ksm/pages_volatile: 32716
/sys/kernel/mm/ksm/run: 1
/sys/kernel/mm/ksm/sleep_millisecs: 200

I would like to evaluate the amount of memory saved:
how many bytes is a page?
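For a rough answer: a page is normally 4096 bytes on x86 (`getconf PAGESIZE`), and the kernel's Documentation/vm/ksm.txt describes pages_sharing as an indication of how much is saved. A minimal sketch of the arithmetic, assuming 4 KiB pages (the `ksm_saved_bytes` helper is illustrative, not an existing tool):

```python
import os

def ksm_saved_bytes(pages_sharing, page_size=None):
    """Rough KSM saving: pages_sharing counts page-table mappings that
    resolve to an already-shared page, i.e. pages that would otherwise
    each need their own physical copy (per Documentation/vm/ksm.txt)."""
    if page_size is None:
        page_size = os.sysconf("SC_PAGE_SIZE")  # typically 4096 on x86
    return pages_sharing * page_size

# With the pages_sharing counter quoted above and 4 KiB pages:
print(ksm_saved_bytes(36098, page_size=4096))  # 147857408 bytes, ~141 MiB
```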

--
RMA.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [net-next-2.6 PATCH 0/3 RFC] macvlan: MAC Address filtering support for passthru mode

2011-09-08 Thread Michael S. Tsirkin
On Thu, Sep 08, 2011 at 07:53:11PM -0700, Roopa Prabhu wrote:
> >> Phase 1: Goal: Enable hardware filtering for all macvlan modes
> >> - In macvlan passthru mode the single guest virtio-nic connected will
> >>   receive traffic that he requested for
> >> - In macvlan non-passthru mode all guest virtio-nics sharing the
> >>   physical nic will see all other guest traffic
> >>   but the filtering at guest virtio-nic
> > 
> > I don't think guests currently filter anything.
> > 
> I was referring to Qemu-kvm virtio-net in
> virtio_net_receive->receive_filter. I think it only passes pkts that the
> guest OS is interested in. It uses the filter table that I am passing to
> macvtap in this patch.

This happens after the userspace thread gets woken up and the data
is copied there. So relying on filtering at that level is
going to be very inefficient on a system with
multiple active guests. Further, and for that reason, vhost-net
doesn't do filtering at all, relying on the backends
to pass it correct packets.

-- 
MST


Re: [PATCH] qemu-kvm: Resolve PCI upstream diffs

2011-09-08 Thread Michael S. Tsirkin
On Thu, Sep 08, 2011 at 12:48:02PM +0200, Jan Kiszka wrote:
> Resolve all unneeded deviations from upstream code. No functional
> changes.
> 
> Signed-off-by: Jan Kiszka 

Acked-by: Michael S. Tsirkin 

> ---
>  hw/pci.c |   11 +++
>  hw/pci.h |5 -
>  2 files changed, 11 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/pci.c b/hw/pci.c
> index e4c166a..4d8845c 100644
> --- a/hw/pci.c
> +++ b/hw/pci.c
> @@ -2106,8 +2106,6 @@ int pci_add_capability(PCIDevice *pdev, uint8_t cap_id,
>  memset(pdev->wmask + offset, 0, size);
>  /* Check capability by default */
>  memset(pdev->cmask + offset, 0xFF, size);
> -
> -
>  return offset;
>  }
>  
> @@ -2125,9 +2123,14 @@ void pci_del_capability(PCIDevice *pdev, uint8_t cap_id, uint8_t size)
>  memset(pdev->cmask + offset, 0, size);
>  memset(pdev->used + offset, 0, size);
>  
> -if (!pdev->config[PCI_CAPABILITY_LIST]) {
> +if (!pdev->config[PCI_CAPABILITY_LIST])
>  pdev->config[PCI_STATUS] &= ~PCI_STATUS_CAP_LIST;
> -}
> +}
> +
> +/* Reserve space for capability at a known offset (to call after load). */
> +void pci_reserve_capability(PCIDevice *pdev, uint8_t offset, uint8_t size)
> +{
> +memset(pdev->used + offset, 0xff, size);
>  }
>  
>  uint8_t pci_find_capability(PCIDevice *pdev, uint8_t cap_id)
> diff --git a/hw/pci.h b/hw/pci.h
> index da0c2d2..70fcd9c 100644
> --- a/hw/pci.h
> +++ b/hw/pci.h
> @@ -147,7 +147,7 @@ struct PCIDevice {
>  /* Used to implement RW1C(Write 1 to Clear) bytes */
>  uint8_t *w1cmask;
>  
> -/* Used to allocate config space and track capabilities. */
> +/* Used to allocate config space for capabilities. */
>  uint8_t *used;
>  
>  /* the following fields are read only */
> @@ -230,8 +230,11 @@ int pci_add_capability(PCIDevice *pdev, uint8_t cap_id,
>  
>  void pci_del_capability(PCIDevice *pci_dev, uint8_t cap_id, uint8_t cap_size);
>  
> +void pci_reserve_capability(PCIDevice *pci_dev, uint8_t offset, uint8_t size);
> +
>  uint8_t pci_find_capability(PCIDevice *pci_dev, uint8_t cap_id);
>  
> +
>  uint32_t pci_default_read_config(PCIDevice *d,
>   uint32_t address, int len);
>  void pci_default_write_config(PCIDevice *d,
> -- 
> 1.7.3.4


Re: [net-next-2.6 PATCH 0/3 RFC] macvlan: MAC Address filtering support for passthru mode

2011-09-08 Thread Sridhar Samudrala

On 9/8/2011 8:00 PM, Roopa Prabhu wrote:



On 9/8/11 12:33 PM, "Michael S. Tsirkin"  wrote:


On Thu, Sep 08, 2011 at 12:23:56PM -0700, Roopa Prabhu wrote:

I think the main usecase for passthru mode is to assign a SR-IOV VF to
a single guest.


Yes and for the passthru usecase this patch should be enough to enable
filtering in hw (eventually like I indicated before I need to fix vlan
filtering too).

So with filtering in hw, and in sriov VF case, VFs
actually share a filtering table. How will that
be partitioned?

AFAIK, though it might maintain a single filter table space in hw, hw does
know which filter belongs to which VF. And the OS driver does not need to do
anything special. The VF driver exposes a VF netdev. And any uc/mc addresses
registered with a VF netdev are registered with the hw by the driver. And hw
will filter and send only pkts that the VF has expressed interest in.

Does your NIC & driver support adding multiple mac addresses to a VF?
I have tried a few other SR-IOV NICs some time back and they didn't
support this feature.


Currently, we don't have an interface to add multiple mac addresses to a
netdev other than the indirect way of creating a macvlan i/f on top of it.

Thanks
Sridhar



No special filter partitioning in hw is required.

Thanks,
Roopa






Re: [net-next-2.6 PATCH 0/3 RFC] macvlan: MAC Address filtering support for passthru mode

2011-09-08 Thread Roopa Prabhu



On 9/8/11 12:33 PM, "Michael S. Tsirkin"  wrote:

> On Thu, Sep 08, 2011 at 12:23:56PM -0700, Roopa Prabhu wrote:
>>> 
>>> I think the main usecase for passthru mode is to assign a SR-IOV VF to
>>> a single guest.
>>> 
>> Yes and for the passthru usecase this patch should be enough to enable
>> filtering in hw (eventually like I indicated before I need to fix vlan
>> filtering too).
> 
> So with filtering in hw, and in sriov VF case, VFs
> actually share a filtering table. How will that
> be partitioned?

AFAIK, though it might maintain a single filter table space in hw, hw does
know which filter belongs to which VF. And the OS driver does not need to do
anything special. The VF driver exposes a VF netdev. And any uc/mc addresses
registered with a VF netdev are registered with the hw by the driver. And hw
will filter and send only pkts that the VF has expressed interest in.

No special filter partitioning in hw is required.

Thanks,
Roopa



Re: [net-next-2.6 PATCH 0/3 RFC] macvlan: MAC Address filtering support for passthru mode

2011-09-08 Thread Roopa Prabhu



On 9/8/11 12:11 PM, "Michael S. Tsirkin"  wrote:

> On Thu, Sep 08, 2011 at 09:19:32AM -0700, Roopa Prabhu wrote:
> There are more features we'll want down the road though,
> so let's see whether the interface will be able to
> satisfy them in a backwards compatible way before we
> set it in stone. Here's what I came up with:
> 
> How will the filtering table be partitioned within guests?
 
 Since this patch supports macvlan PASSTHRU mode only, in which the lower
 device has 1-1 mapping to the guest nic, it does not require any
 partitioning of the filtering table within guests. Unless I misunderstood
 something.
 If the lower device were being shared by multiple guest network interfaces
 (non PASSTHRU mode), only then we will need to maintain separate filter
 tables for each guest network interface in macvlan and forward the pkt to
 respective guest interface after a filter lookup. This could affect
 performance too I think.
>>> 
>>> Not with hardware filtering support. Which is where we'd need to
>>> partition the host nic mac table between guests.
>>> 
>> I need to understand this more. In non passthru case when a VF or physical
>> nic is shared between guests,
> 
> For example, consider a VF given to each guest. Hardware supports a fixed
> total number of filters, which can be partitioned between VFs.
> 
Oh ok. But hw maintains VF filters separately for every VF as far as I know.
Filters received on a VF are programmed for that VF only. I am assuming all
hardware does this; at least our hardware does.
What I was referring to was a single VF shared between guests using macvtap
(could be bridge mode for example). All guests sharing the VF will register
 filters with the VF via macvlan. Hw makes sure whatever the VF asked for
is received at the VF. VF in hw does not know that it is shared by guests.
Only at macvlan we might need to re-filter the pkts received on the VF and
steer pkts to the individual guests based on what they asked for.
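To make that cost concrete, here is a toy sketch (Python, not kernel code) of the software re-filtering step described above. It assumes a hypothetical `guest_filters` table mapping each guest to the MACs it registered; every packet arriving on the shared VF needs a lookup like this before it can be steered:

```python
def steer(dst_mac, guest_filters):
    """Return the guests whose registered MAC set matches dst_mac.
    guest_filters: {guest_name: set of MAC address strings}.
    This per-packet lookup is the extra work of sharing one VF."""
    return [guest for guest, macs in guest_filters.items() if dst_mac in macs]

# Two guests sharing a VF, each having registered one unicast address:
filters = {"vm1": {"52:54:00:00:00:01"}, "vm2": {"52:54:00:00:00:02"}}
print(steer("52:54:00:00:00:01", filters))  # ['vm1']
```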


>> the nic does not really know about the guests,
>> so I was thinking we do the same thing as we do for the passthru case (ie
>> send all the address filters from macvlan to the physical nic). So at the
>> hardware, filtering is done for all guests sharing the nic. But if we want
>> each virtio-net nic or guest to get exactly what it asked for
>> macvlan/macvtap needs to maintain a copy of each guest filter and do a
>> lookup and send only the requested traffic to the guest. Here is the
>> performance hit that I was seeing. Please see my next comment for further
>> details. 
> 
> It won't be any slower than attaching a non-passthrough macvlan
> to a device, will it?
> 
I am not sure. The filter lookup in macvlan is the one I am concerned about.
I will need to try it out.

>> 
 I chose to support PASSTHRU Mode only at first because its simpler and all
 code additions are in control path only.
>>> 
>>> I agree. It would be a bit silly to have a dedicated interface
>>> for passthough and a completely separate one for
>>> non passthrough.
>>> 
>> Agree. The reason I did not focus on the non-passthru case in the initial
>> version was that I was thinking the things to do in the non-passthru case
>> would just be add-ons to the passthru case. But true, better to flesh out
>> the non-passthru case details.
>> 
>> After dwelling on this a bit more how about the below:
>> 
>> Phase 1: Goal: Enable hardware filtering for all macvlan modes
>> - In macvlan passthru mode the single guest virtio-nic connected will
>>   receive traffic that he requested for
>> - In macvlan non-passthru mode all guest virtio-nics sharing the
>>   physical nic will see all other guest traffic
>>   but the filtering at guest virtio-nic
> 
> I don't think guests currently filter anything.
> 
I was referring to Qemu-kvm virtio-net in
virtio_net_receive->receive_filter. I think it only passes pkts that the
guest OS is interested in. It uses the filter table that I am passing to
macvtap in this patch.
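As an illustration only (a Python toy, not the actual qemu code), the kind of accept/drop decision receive_filter makes can be sketched like this, assuming a hypothetical filter state of the shape the thread discusses: promisc/allmulti flags, the nic's own MAC, and unicast/multicast MAC tables:

```python
def mac_filter_accept(dst, flt):
    """Accept or drop a packet by destination MAC. `flt` is a
    hypothetical filter state: promisc/allmulti flags, the nic's own
    MAC, and uc/mc tables (as passed down via TUNSETTXFILTER)."""
    if flt["promisc"]:
        return True                 # promiscuous: accept everything
    if dst[0] & 0x01:               # I/G bit set: multicast (or broadcast)
        return flt["allmulti"] or dst in flt["mcast"]
    return dst == flt["mac"] or dst in flt["ucast"]

flt = {"promisc": False, "allmulti": False,
       "mac": b"\x52\x54\x00\x12\x34\x56",
       "ucast": set(), "mcast": set()}
print(mac_filter_accept(b"\x52\x54\x00\x12\x34\x56", flt))  # True
```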


>>   will make sure each guest
>>   eventually sees traffic he asked for. This is still better than
>>   putting the physical nic in promiscuous mode.
>> 
>> (This is mainly what my patch does...but will need to remove the passthru
>> check and see if there are any thing else needed for non-passthru case)
> 
> I'm fine with sticking with passthrough, make non passthrough
> a separate phase.
> 
Ok.

>> 
>> Phase 2: Goal: Enable filtering at macvlan so that each guest virtio-nic
>> receives only what he requested for.
>> - In this case, in addition to pushing the filters down to the physical
>>   nic we will have to maintain the same filter in macvlan and do a filter
>>   lookup before forwarding the traffic to a virtio-nic.
>> 
>> But I am thinking phase 2 might be redundant given virtio-nic already does
>> filtering for the guest.
> 
> It does? Do you mean the filter that qemu does in userspace?

Re: [net-next-2.6 PATCH 0/3 RFC] macvlan: MAC Address filtering support for passthru mode

2011-09-08 Thread Sridhar Samudrala
On Thu, 2011-09-08 at 09:19 -0700, Roopa Prabhu wrote:
> 
> 
> On 9/8/11 4:08 AM, "Michael S. Tsirkin"  wrote:
> 
> > On Wed, Sep 07, 2011 at 10:20:28PM -0700, Roopa Prabhu wrote:
> >> On 9/7/11 5:34 AM, "Michael S. Tsirkin"  wrote:
> >> 
> >>> On Tue, Sep 06, 2011 at 03:35:40PM -0700, Roopa Prabhu wrote:
>  This patch is an attempt at providing address filtering support for
>  macvtap devices in PASSTHRU mode. It's still a work in progress.
>  Briefly tested for basic functionality. Wanted to get some feedback on
>  the direction before proceeding.
>  
> >>> 
> >>> Good work, thanks.
> >>> 
> >> 
> >> Thanks.
> >> 
>  I have hopefully CC'ed all concerned people.
> >>> 
> >>> kvm crowd might also be interested.
> >>> Try using ./scripts/get_maintainer.pl as well.
> >>> 
> >> Thanks for the tip. Expanded CC list a bit more.
> >> 
>  PASSTHRU mode today sets the lowerdev in promiscuous mode. In PASSTHRU
>  mode there is a 1-1 mapping between macvtap device and physical nic or
>  VF. And all filtering is done in lowerdev hw. The lowerdev does not need
>  to be in promiscuous mode as long as the guest filters are passed down
>  to the lowerdev. This patch tries to remove the need for putting the
>  lowerdev in promiscuous mode.
>  I have also referred to the thread below, where TUNSETTXFILTER was
>  mentioned in this context:
>   http://patchwork.ozlabs.org/patch/69297/
>  
>  This patch basically passes the addresses received via TUNSETTXFILTER to
>  the macvlan lowerdev.
>  
>  I have looked at previous work and discussions on this for qemu-kvm
>  by Michael Tsirkin, Alex Williamson and Dragos Tatulea
>  http://patchwork.ozlabs.org/patch/78595/
>  http://patchwork.ozlabs.org/patch/47160/
>  https://patchwork.kernel.org/patch/474481/
>  
>  Redhat bugzilla by Michael Tsirkin:
>  https://bugzilla.redhat.com/show_bug.cgi?id=655013
>  
>  I used Michael's qemu-kvm patch for testing the changes with KVM
>  
>  I would like to cover both MAC and vlan filtering in this work.
>  
>  Open Questions/Issues:
>  - There is a need for vlan filtering to complete the patch. It will
>    require a new tap ioctl cmd for vlans.
>    Some ideas on this are:
>  
>    a) TUNSETVLANFILTER: This will entail sending the whole vlan bitmap
>       filter (similar to tun_filter for addresses). Passing the vlan ids
>       to the lower device will mean going through the whole list of vlans
>       every time.
>  
>    OR
>  
>    b) TUNSETVLAN with vlan id and flag to set/unset
>  
>    Does option 'b' sound ok ?
>  
>  - In this implementation we make the macvlan address list the same as the
>    address list that came in the filter with TUNSETTXFILTER. This will not
>    cover cases where the macvlan device needs to have other addresses that
>    are not necessarily in the filter. Is this a problem?
> >>> 
> >>> What cases do you have in mind?
> >>> 
> >> This patch targets only macvlan PASSTHRU mode and for PASSTHRU mode I don't
> >> see a problem with uc/mc address list being the same in all the stacked
> >> netdevs in the path. I called that out above to make sure I was not missing
>> any case in PASSTHRU mode where this might be invalid. Otherwise I don't
>> see a problem in the simple PASSTHRU use case this patch supports.
> >> 
>  - The patch currently only supports passing of IFF_PROMISC and
>    IFF_MULTICAST filter flags to lowerdev
>  
>  This patch series implements the following
>  01/3 - macvlan: Add support for unicast filtering in macvlan
>  02/3 - macvlan: Add function to set addr filter on lower device in
>         passthru mode
>  03/3 - macvtap: Add support for TUNSETTXFILTER
>  
>  Please comment. Thanks.
>  
>  Signed-off-by: Roopa Prabhu 
>  Signed-off-by: Christian Benvenuti 
>  Signed-off-by: David Wang 
> >>> 
> >>> The security isn't lower than with promisc, so I don't see
> >>> a problem with this as such.
> >>> 
> >>> There are more features we'll want down the road though,
> >>> so let's see whether the interface will be able to
> >>> satisfy them in a backwards compatible way before we
> >>> set it in stone. Here's what I came up with:
> >>> 
> >>> How will the filtering table be partitioned within guests?
> >> 
> >> Since this patch supports macvlan PASSTHRU mode only, in which the lower
> >> device has 1-1 mapping to the guest nic, it does not require any
> >> partitioning of the filtering table within guests. Unless I misunderstood
> >> something.
> >> If the lower device were being shared by multiple guest network interfaces
> >> (non PASSTHRU mode), only then we will need to maintain separate filter
> >> tables for each guest network interface in macvlan and fo

Re: [PATCH 1/2] kvm tools: fix repeated io emulation

2011-09-08 Thread Xiao Guangrong
On 08/18/2011 11:08 PM, Avi Kivity wrote:
> On 08/18/2011 12:35 AM, Sasha Levin wrote:
>> On Thu, 2011-08-18 at 09:13 +0300, Pekka Enberg wrote:
>> >  Hi,
>> >
>> >  On Thu, Aug 18, 2011 at 6:06 AM, Xiao Guangrong wrote:
>> >  >  When kvm emulates a repeated io read instruction, it can exit to
>> >  >  user-space with 'count' > 1; we need to emulate the io access
>> >  >  multiple times
>> >  >
>> >  >  Signed-off-by: Xiao Guangrong
>> >
>> >  The KVM tool is not actually maintained by Avi and Marcelo but by me
>> >  and few others. Our git repository is here:
>> >
>> >  https://github.com/penberg/linux-kvm
>> >
>> >  Ingo pulls that to -tip few times a week or so. Sasha, can you please
>> >  take a look at these patches and if you're OK with them, I'll apply
>> >  them.
>>
>> Pekka,
>>
>> I can only assume they're right, 'count' isn't documented anywhere :)
>>
>> If any of KVM maintainers could confirm it I'll add it into the docs.
>>
> 
> Count is indeed the number of repetitions.
> 

Hi Pekka,

Could you pick up this patchset please?



Re: [PATCH] KVM-test: Add two scripts to disable services for perf tests

2011-09-08 Thread Amos Kong
- Original Message -
> On Wed, 07 Sep 2011 14:07:49 +0800, Amos Kong 
> wrote:
> 
> > System services on the guest and host take uncertain resources, which
> > affect the perf results. We can use the below two scripts to disable some
> > services of the host and guest.
> >
> > stop_serivices_perf.sh is used to stop the running services.
> > off_service_perf.sh is used to turn off services when the host starts up.
> >
> > We can use them to prepare environment for performance testcases.
> 
> Which environment? I assume this should be secured by a distribution-
> specific guard:

To prepare a clean testing environment.

> E.g.
> 
> if [ -f /etc/redhat-release ]; then

yes.


Where to file bug report?

2011-09-08 Thread Emmett Culley
Yesterday I had a second VM guest fail.  The day before one failed in the same 
way.  Both suddenly had read only file systems.  In both cases these are the 
last lines in the syslog:

Sep  7 21:19:59 dev init: tty (/dev/tty1) main process ended, respawning
Sep  7 21:59:23 dev kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Sep  7 21:59:23 dev kernel: ata1.00: failed command: WRITE DMA
Sep  7 21:59:23 dev kernel: ata1.00: cmd ca/00:08:a0:02:31/00:00:00:00:00/e3 tag 0 dma 4096 out
Sep  7 21:59:23 dev kernel: res 40/00:01:00:00:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
Sep  7 21:59:23 dev kernel: ata1.00: status: { DRDY }
Sep  7 21:59:23 dev kernel: ata1: soft resetting link
Sep  7 21:59:23 dev kernel: ata1.00: configured for MWDMA2
Sep  7 21:59:23 dev kernel: ata1.00: device reported invalid CHS sector 0
Sep  7 21:59:23 dev kernel: ata1: EH complete

I assume this is the reason the root was suddenly read only.

I am running updated CentOS 6 on the host and all guests.  Which list should I
search and/or report this to?  Kernel or kvm?

Emmett


Re: [RFC PATCH 4/5] VFIO: Add PCI device support

2011-09-08 Thread Alex Williamson
On Thu, 2011-09-08 at 10:52 +0300, Avi Kivity wrote:
> On 09/07/2011 09:55 PM, Konrad Rzeszutek Wilk wrote:
> > >   If you don't know what to do here, say N.
> > >  +
> > >  +menuconfig VFIO_PCI
> > >  +bool "VFIO support for PCI devices"
> > >  +depends on VFIO&&  PCI
> > >  +default y if X86
> >
> > Hahah.. And Linus is going to tear your behind for that.
> >
> > Default should be 'n'
> 
> It depends on VFIO, which presumably defaults to n.

Yes, exactly.





[PATCH 3/3] KVM test: Introduce migrate.with_blkdebug variant

2011-09-08 Thread Lucas Meneghel Rodrigues
In order to solve task

http://autotest.kernel.org/ticket/50

Create a migrate.with_blkdebug variant, which will
migrate the vm with a state check before and after
migration. As the reporter of the original problem
on task 50 said, indeed we have a lot of migration
state mismatch problems with a current qemu-kvm.

Signed-off-by: Lucas Meneghel Rodrigues 
---
 client/tests/kvm/subtests.cfg.sample |6 ++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/client/tests/kvm/subtests.cfg.sample b/client/tests/kvm/subtests.cfg.sample
index 9d9dbba..d79b3af 100644
--- a/client/tests/kvm/subtests.cfg.sample
+++ b/client/tests/kvm/subtests.cfg.sample
@@ -164,6 +164,12 @@ variants:
 mig_cancel = yes
 variants:
 - @default:
+- with_blkdebug:
+drive_blkdebug_image1 = blkdebug/default.conf
+drive_rerror_image1 = stop
+drive_werror_image1 = stop
+vmstate_check = yes
+migration_living_guest = no
 - with_set_speed:
 mig_speed = 1G
 pre_migrate = "mig_set_speed"
-- 
1.7.6



[PATCH 2/3] KVM test: migration subtest - Introduce migration_living_guest param

2011-09-08 Thread Lucas Meneghel Rodrigues
In order to test a special guest configuration (special blkdebug
config files to inject errors on the block subsystem), added a
migration_living_guest param. If the param is set to "no", no
attempt to have a session on the guest will be made, since the
guest block subsystem will stop when the first error is injected.

Signed-off-by: Lucas Meneghel Rodrigues 
---
 client/tests/kvm/tests/migration.py |   78 ---
 1 files changed, 45 insertions(+), 33 deletions(-)

diff --git a/client/tests/kvm/tests/migration.py b/client/tests/kvm/tests/migration.py
index 7e877fe..1e02784 100644
--- a/client/tests/kvm/tests/migration.py
+++ b/client/tests/kvm/tests/migration.py
@@ -36,34 +36,40 @@ def run_migration(test, params, env):
 mig_speed = params.get("mig_speed", "1G")
 return vm.monitor.migrate_set_speed(mig_speed)
 
-vm = env.get_vm(params["main_vm"])
-vm.verify_alive()
-timeout = int(params.get("login_timeout", 360))
-session = vm.wait_for_login(timeout=timeout)
-
 mig_timeout = float(params.get("mig_timeout", "3600"))
 mig_protocol = params.get("migration_protocol", "tcp")
 mig_cancel_delay = int(params.get("mig_cancel") == "yes") * 2
 offline = params.get("offline", "no") == "yes"
 check = params.get("vmstate_check", "no") == "yes"
+living_guest_os = params.get("migration_living_guest", "yes") == "yes"
+
+vm = env.get_vm(params["main_vm"])
+vm.verify_alive()
+timeout = int(params.get("login_timeout", 360))
+if living_guest_os:
+session = vm.wait_for_login(timeout=timeout)
 
 # Get the output of migration_test_command
-test_command = params.get("migration_test_command")
-reference_output = session.cmd_output(test_command)
+if living_guest_os:
+test_command = params.get("migration_test_command")
+reference_output = session.cmd_output(test_command)
 
 # Start some process in the background (and leave the session open)
-background_command = params.get("migration_bg_command", "")
-session.sendline(background_command)
-time.sleep(5)
+if living_guest_os:
+background_command = params.get("migration_bg_command", "")
+session.sendline(background_command)
+time.sleep(5)
 
 # Start another session with the guest and make sure the background
 # process is running
-session2 = vm.wait_for_login(timeout=timeout)
+if living_guest_os:
+session2 = vm.wait_for_login(timeout=timeout)
 
 try:
-check_command = params.get("migration_bg_check_command", "")
-session2.cmd(check_command, timeout=30)
-session2.close()
+if living_guest_os:
+check_command = params.get("migration_bg_check_command", "")
+session2.cmd(check_command, timeout=30)
+session2.close()
 
 # run some functions before migrate start.
 pre_migrate = get_functions(params.get("pre_migrate"), locals())
@@ -79,32 +85,38 @@ def run_migration(test, params, env):
 func()
 
 # Log into the guest again
-logging.info("Logging into guest after migration...")
-session2 = vm.wait_for_login(timeout=30)
-logging.info("Logged in after migration")
+if living_guest_os:
+logging.info("Logging into guest after migration...")
+session2 = vm.wait_for_login(timeout=30)
+logging.info("Logged in after migration")
 
 # Make sure the background process is still running
-session2.cmd(check_command, timeout=30)
+if living_guest_os:
+session2.cmd(check_command, timeout=30)
 
 # Get the output of migration_test_command
-output = session2.cmd_output(test_command)
+if living_guest_os:
+output = session2.cmd_output(test_command)
 
 # Compare output to reference output
-if output != reference_output:
-logging.info("Command output before migration differs from "
- "command output after migration")
-logging.info("Command: %s", test_command)
-logging.info("Output before:" +
- virt_utils.format_str_for_message(reference_output))
-logging.info("Output after:" +
- virt_utils.format_str_for_message(output))
-raise error.TestFail("Command '%s' produced different output "
- "before and after migration" % test_command)
+if living_guest_os:
+if output != reference_output:
+logging.info("Command output before migration differs from "
+ "command output after migration")
+logging.info("Command: %s", test_command)
+logging.info("Output before:" +
+ 
virt_utils.format_str_for_message(reference_output))
+logging.info("Output after:" +
+   

[PATCH 0/3] Introduce blkdebug support to KVM autotest + migration test

2011-09-08 Thread Lucas Meneghel Rodrigues
In order to solve task:

http://autotest.kernel.org/ticket/50

Implemented support for providing blkdebug files to arbitrary guest images in
autotest. Also, modified the migrate test a bit and introduced the
migrate.with_blkdebug variant, which tries to mimic as much as possible the
testcase described in ticket 50. Kevin, sorry that I took so long to work
on this; let me know if you see any obvious problems in the implementation.

With this we can have people reproducing the test case mentioned with a
stock autotest clone. Indeed we see state migration mismatch problems
running the testcase:

Test  Status Seconds  Info
  -- ---  
(Result file: ../../results/default/status)
Fedora.15.64.migrate.with_blkdebug.tcp.smp2   FAIL   35   
Unhandled VMMigrateStateMismatchError: Mismatch of VM state before and after 
migration (cd739e82e72179821b0599169432729c != 
39920fecc39325cca5deab4e0c54792d)[context: migrating 'vm1' --> after 
migration]
Fedora.15.64.migrate.with_blkdebug.unix.smp2  FAIL   46   
Unhandled VMMigrateStateMismatchError: Mismatch of VM state before and after 
migration (9482fb1523f15e708b0e46058b22196c != 
60f7757c88e8a3a656f701fa97501a97)[context: migrating 'vm1' --> after 
migration]
Fedora.15.64.migrate.with_blkdebug.exec.smp2  FAIL   44   
Unhandled VMMigrateStateMismatchError: Mismatch of VM state before and after 
migration (7f5d1cce87383c001a83d0d3a6529c68 != 
76549ae93fc0b3d270dbf51e8cb49dd6)[context: migrating 'vm1' --> after 
migration]
Fedora.15.64.migrate.with_blkdebug.mig_cancel.smp2FAIL   101  
Unhandled VMMigrateCancelError: Cannot cancel migration[context: migrating 
'vm1']
  GOOD   261 

Cheers!

Lucas Meneghel Rodrigues (3):
  KVM test: Introduce blkdebug param to images
  KVM test: migration subtest - Introduce migration_living_guest param
  KVM test: Introduce migrate.with_blkdebug variant

 client/tests/kvm/base.cfg.sample   |9 
 client/tests/kvm/blkdebug/default.conf |   16 +++
 client/tests/kvm/subtests.cfg.sample   |6 +++
 client/tests/kvm/tests/migration.py|   78 ++-
 client/virt/kvm_vm.py  |   14 +-
 client/virt/virt_vm.py |   21 +
 6 files changed, 108 insertions(+), 36 deletions(-)
 create mode 100644 client/tests/kvm/blkdebug/default.conf

-- 
1.7.6



[PATCH 1/3] KVM test: Introduce blkdebug param to images

2011-09-08 Thread Lucas Meneghel Rodrigues
blkdebug files allow injecting errors in the block
subsystem, allowing more types of tests to be
executed. Now it is possible to set a blkdebug file
for an arbitrary image 'imagefoo' just like this:

drive_blkdebug_imagefoo = blkdebug/default.conf

The blkdebug files can be stored in the specific location:

client/tests/kvm/blkdebug

and then specified as above. An example of a debug
file can be seen here:

[inject-error]
state = "2"
event = "read_aio"
errno = "7"
immediately = "off"
once = "on"

[set-state]
state = "1"
event = "read_aio"
new_state = "2"

[set-state]
state = "2"
event = "read_aio"
new_state = "3"

Signed-off-by: Lucas Meneghel Rodrigues 
---
 client/tests/kvm/base.cfg.sample   |9 +
 client/tests/kvm/blkdebug/default.conf |   16 
 client/virt/kvm_vm.py  |   14 +++---
 client/virt/virt_vm.py |   21 +
 4 files changed, 57 insertions(+), 3 deletions(-)
 create mode 100644 client/tests/kvm/blkdebug/default.conf

diff --git a/client/tests/kvm/base.cfg.sample b/client/tests/kvm/base.cfg.sample
index 0e6d222..96e14a6 100644
--- a/client/tests/kvm/base.cfg.sample
+++ b/client/tests/kvm/base.cfg.sample
@@ -28,6 +28,15 @@ image_size = 10G
 image_raw_device = no
 drive_index_image1 = 0
 drive_cache = none
+# You can specify a blkdebug file here, relative to kvm/blkdebug dir
+# we have a premade default.conf in there. Important to note that you
+# can set this for any image defined in the config at a given time
+#drive_blkdebug_image1 = blkdebug/default.conf
+drive_blkdebug_image1 =
+# What to do when a read error is detected, such as 'stop'
+drive_rerror_image1 =
+# What to do when a write error is detected, such as 'stop'
+drive_werror_image1 =
 
 # Cdrom drive index
 drive_index_cd1 = 1
diff --git a/client/tests/kvm/blkdebug/default.conf b/client/tests/kvm/blkdebug/default.conf
new file mode 100644
index 000..0497405
--- /dev/null
+++ b/client/tests/kvm/blkdebug/default.conf
@@ -0,0 +1,16 @@
+[inject-error]
+state = "2"
+event = "read_aio"
+errno = "7"
+immediately = "off"
+once = "on"
+
+[set-state]
+state = "1"
+event = "read_aio"
+new_state = "2"
+
+[set-state]
+state = "2"
+event = "read_aio"
+new_state = "3"
diff --git a/client/virt/kvm_vm.py b/client/virt/kvm_vm.py
index f7b8345..d0b8015 100644
--- a/client/virt/kvm_vm.py
+++ b/client/virt/kvm_vm.py
@@ -215,7 +215,8 @@ class VM(virt_vm.BaseVM):
 return " -cdrom '%s'" % filename
 
 def add_drive(help, filename, index=None, format=None, cache=None,
-  werror=None, serial=None, snapshot=False, boot=False):
+  werror=None, rerror=None, serial=None, snapshot=False,
+  boot=False, blkdebug=None):
 name = None;
 dev = "";
 if format == "ahci":
@@ -229,13 +230,18 @@ class VM(virt_vm.BaseVM):
 dev += ",port=%d" % (int(index) + 1)
 format = "none"
 index = None
-cmd = " -drive file='%s'" % filename
+if blkdebug is not None:
+cmd = " -drive file=blkdebug:%s:%s" % (blkdebug, filename)
+else:
+cmd = " -drive file='%s'" % filename
 if index is not None:
 cmd += ",index=%s" % index
 if format:
 cmd += ",if=%s" % format
 if cache:
 cmd += ",cache=%s" % cache
+if rerror:
+cmd += ",rerror=%s" % rerror
 if werror:
 cmd += ",werror=%s" % werror
 if serial:
@@ -440,10 +446,12 @@ class VM(virt_vm.BaseVM):
   image_params.get("drive_index"),
   image_params.get("drive_format"),
   image_params.get("drive_cache"),
+  image_params.get("drive_rerror"),
   image_params.get("drive_werror"),
   image_params.get("drive_serial"),
   image_params.get("image_snapshot") == "yes",
-  image_params.get("image_boot") == "yes")
+  image_params.get("image_boot") == "yes",
+  virt_vm.get_image_blkdebug_filename(image_params, root_dir))
 
 redirs = []
 for redir_name in params.objects("redirs"):
diff --git a/client/virt/virt_vm.py b/client/virt/virt_vm.py
index 8815bf4..de321c4 100644
--- a/client/virt/virt_vm.py
+++ b/client/virt/virt_vm.py
@@ -188,6 +188,27 @@ class VMRebootError(VMError):
 class VMStatusError(VMError):
 pass
 
+
+def get_image_blkdebug_filename(params, root_dir):
+"""
+Generate a blkdebug file path from params and root_dir.
+
+blkdebug files allow error injection in the block subsystem.
+
+@param params: Dictionary containing the test par

Re: [net-next-2.6 PATCH 0/3 RFC] macvlan: MAC Address filtering support for passthru mode

2011-09-08 Thread Michael S. Tsirkin
On Thu, Sep 08, 2011 at 12:23:56PM -0700, Roopa Prabhu wrote:
> > 
> > I think the main use case for passthru mode is to assign a SR-IOV VF to
> > a single guest.
> > 
> Yes and for the passthru usecase this patch should be enough to enable
> filtering in hw (eventually like I indicated before I need to fix vlan
> filtering too).

So with filtering in hw, and in sriov VF case, VFs
actually share a filtering table. How will that
be partitioned?

-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
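To make the partitioning question above concrete: a NIC exposes a fixed number of hardware MAC filter slots that all VFs draw from. A rough Python sketch of that constraint (slot count and VF names are invented for illustration, not taken from any driver):

```python
# Toy model of a fixed-size hardware MAC filter table shared by VFs.
# The slot count and VF names are invented for illustration only.
class HwFilterTable:
    def __init__(self, total_slots):
        self.total_slots = total_slots
        self.per_vf = {}  # VF name -> list of programmed MACs

    def add_filter(self, vf, mac):
        in_use = sum(len(macs) for macs in self.per_vf.values())
        if in_use >= self.total_slots:
            # Table exhausted: without a partitioning policy, whichever
            # guest programs addresses first wins.
            return False
        self.per_vf.setdefault(vf, []).append(mac)
        return True

table = HwFilterTable(total_slots=4)
assert table.add_filter("vf0", "52:54:00:00:00:01")
```

The open question is how (or whether) to reserve slots per VF so that one guest cannot exhaust the table for the others.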


Re: [net-next-2.6 PATCH 0/3 RFC] macvlan: MAC Address filtering support for passthru mode

2011-09-08 Thread Roopa Prabhu



On 9/8/11 4:08 AM, "Michael S. Tsirkin"  wrote:

> On Wed, Sep 07, 2011 at 10:20:28PM -0700, Roopa Prabhu wrote:
>> On 9/7/11 5:34 AM, "Michael S. Tsirkin"  wrote:
>> 
>>> On Tue, Sep 06, 2011 at 03:35:40PM -0700, Roopa Prabhu wrote:
 This patch is an attempt at providing address filtering support for macvtap
 devices in PASSTHRU mode. It's still a work in progress.
 Briefly tested for basic functionality. Wanted to get some feedback on the
 direction before proceeding.
 
>>> 
>>> Good work, thanks.
>>> 
>> 
>> Thanks.
>> 
 I have hopefully CC'ed all concerned people.
>>> 
>>> kvm crowd might also be interested.
>>> Try using ./scripts/get_maintainer.pl as well.
>>> 
>> Thanks for the tip. Expanded CC list a bit more.
>> 
 PASSTHRU mode today sets the lowerdev in promiscuous mode. In PASSTHRU mode
 there is a 1-1 mapping between macvtap device and physical nic or VF. And
 all
 filtering is done in lowerdev hw. The lowerdev does not need to be in
 promiscuous mode as long as the guest filters are passed down to the
 lowerdev.
 This patch tries to remove the need for putting the lowerdev in promiscuous
 mode. 
 I have also referred to the thread below where TUNSETTXFILTER was mentioned
 in 
 this context: 
  http://patchwork.ozlabs.org/patch/69297/
 
 This patch basically passes the addresses got by TUNSETTXFILTER to macvlan
 lowerdev.
 
 I have looked at previous work and discussions on this for qemu-kvm
 by Michael Tsirkin, Alex Williamson and Dragos Tatulea
 http://patchwork.ozlabs.org/patch/78595/
 http://patchwork.ozlabs.org/patch/47160/
 https://patchwork.kernel.org/patch/474481/
 
 Redhat bugzilla by Michael Tsirkin:
 https://bugzilla.redhat.com/show_bug.cgi?id=655013
 
 I used Michael's qemu-kvm patch for testing the changes with KVM
 
 I would like to cover both MAC and vlan filtering in this work.
 
 Open Questions/Issues:
 - There is a need for vlan filtering to complete the patch. It will require
   a new tap ioctl cmd for vlans.
   Some ideas on this are:
 
   a) TUNSETVLANFILTER: This will entail we send the whole vlan bitmap
 filter
 (similar to tun_filter for addresses). Passing the vlan id's to lower
 device will mean going thru the whole list of vlans every time.
 
   OR
 
   b) TUNSETVLAN with vlan id and flag to set/unset
 
   Does option 'b' sound ok ?
 
 - In this implementation we make the macvlan address list same as the
 address
   list that came in the filter with TUNSETTXFILTER. This will not cover
 cases
   where the macvlan device needs to have other addresses that are not
   necessarily in the filter. Is this a problem ?
>>> 
>>> What cases do you have in mind?
>>> 
>> This patch targets only macvlan PASSTHRU mode and for PASSTHRU mode I don't
>> see a problem with uc/mc address list being the same in all the stacked
>> netdevs in the path. I called that out above to make sure I was not missing
>> any case in PASSTHRU mode where this might be invalid. Otherwise I don't see
>> a problem in the simple PASSTHRU use case this patch supports.
>> 
 - The patch currently only supports passing of IFF_PROMISC and
 IFF_MULTICAST
 filter flags to lowerdev
 
 This patch series implements the following
 01/3 - macvlan: Add support for unicast filtering in macvlan
 02/3 - macvlan: Add function to set addr filter on lower device in passthru
 mode
 03/3 - macvtap: Add support for TUNSETTXFILTER
 
 Please comment. Thanks.
 
 Signed-off-by: Roopa Prabhu 
 Signed-off-by: Christian Benvenuti 
 Signed-off-by: David Wang 
>>> 
>>> The security isn't lower than with promisc, so I don't see
>>> a problem with this as such.
>>> 
>>> There are more features we'll want down the road though,
>>> so let's see whether the interface will be able to
>>> satisfy them in a backwards compatible way before we
>>> set it in stone. Here's what I came up with:
>>> 
>>> How will the filtering table be partitioned within guests?
>> 
>> Since this patch supports macvlan PASSTHRU mode only, in which the lower
>> device has 1-1 mapping to the guest nic, it does not require any
>> partitioning of filtering table within guests. Unless I missed understanding
>> something. 
>> If the lower device were being shared by multiple guest network interfaces
>> (non PASSTHRU mode), only then we will need to maintain separate filter
>> tables for each guest network interface in macvlan and forward the pkt to
>> respective guest interface after a filter lookup. This could affect
>> performance too I think.
> 
> Not with hardware filtering support. Which is where we'd need to
> partition the host nic mac table between guests.
> 
I need to understand this more. In non passthru case when a VF or physical
nic is shared between guests, the nic does not really

Re: [net-next-2.6 PATCH 0/3 RFC] macvlan: MAC Address filtering support for passthru mode

2011-09-08 Thread Michael S. Tsirkin
On Thu, Sep 08, 2011 at 09:19:32AM -0700, Roopa Prabhu wrote:
> >>> There are more features we'll want down the road though,
> >>> so let's see whether the interface will be able to
> >>> satisfy them in a backwards compatible way before we
> >>> set it in stone. Here's what I came up with:
> >>> 
> >>> How will the filtering table be partitioned within guests?
> >> 
> >> Since this patch supports macvlan PASSTHRU mode only, in which the lower
> >> device has 1-1 mapping to the guest nic, it does not require any
> >> partitioning of filtering table within guests. Unless I missed 
> >> understanding
> >> something. 
> >> If the lower device were being shared by multiple guest network interfaces
> >> (non PASSTHRU mode), only then we will need to maintain separate filter
> >> tables for each guest network interface in macvlan and forward the pkt to
> >> respective guest interface after a filter lookup. This could affect
> >> performance too I think.
> > 
> > Not with hardware filtering support. Which is where we'd need to
> > partition the host nic mac table between guests.
> > 
> I need to understand this more. In non passthru case when a VF or physical
> nic is shared between guests,

For example, consider a VF given to each guest. Hardware supports a fixed
total number of filters, which can be partitioned between VFs.

> the nic does not really know about the guests,
> so I was thinking we do the same thing as we do for the passthru case (ie
> send all the address filters from macvlan to the physical nic). So at the
> hardware, filtering is done for all guests sharing the nic. But if we want
> each virtio-net nic or guest to get exactly what it asked for
> macvlan/macvtap needs to maintain a copy of each guest filter and do a
> lookup and send only the requested traffic to the guest. Here is the
> performance hit that I was seeing. Please see my next comment for further
> details. 

It won't be any slower than attaching a non-passthrough macvlan
to a device, will it?

> 
> >> I chose to support PASSTHRU Mode only at first because it's simpler and all
> >> code additions are in control path only.
> > 
> > I agree. It would be a bit silly to have a dedicated interface
> > for passthrough and a completely separate one for
> > non passthrough.
> >
> Agree. The reason I did not focus on non-passthru case in the initial
> version was because I was thinking things to do in the non-passthru case
> will be just add-ons to the passthru case. But true, better to flesh out the
> non-passthru case details.
> 
> After dwelling on this a bit more how about the below:
> 
> Phase 1: Goal: Enable hardware filtering for all macvlan modes
> - In macvlan passthru mode the single guest virtio-nic connected will
>   receive traffic that he requested for
> - In macvlan non-passthru mode all guest virtio-nics sharing the
>   physical nic will see all other guest traffic
>   but the filtering at guest virtio-nic

I don't think guests currently filter anything.

>   will make sure each guest
>   eventually sees traffic he asked for. This is still better than
>   putting the physical nic in promiscuous mode.
> 
> (This is mainly what my patch does...but will need to remove the passthru
> check and see if there are any thing else needed for non-passthru case)

I'm fine with sticking with passthrough, make non passthrough
a separate phase.

> 
> Phase 2: Goal: Enable filtering at macvlan so that each guest virtio-nic
> receives only what he requested for.
> - In this case, in addition to pushing the filters down to the physical
>   nic we will have to maintain the same filter in macvlan and do a filter
>   lookup before forwarding the traffic to a virtio-nic.
> 
> But I am thinking phase 2 might be redundant given virtio-nic already does
> filtering for the guest.

It does? Do you mean the filter that qemu does in userspace?

> In which case we might not need phase 2 at all. I
> might have been over complicating things.
> 
> Please comment. And please correct if I missed something.
>  
>  
> >>> 
> >>> A way to limit what the guest can do would also be useful.
> >>> How can this be done? selinux?
> >> 
> >> I vaguely remember a thread on the same context.. had a suggestion to
> >> maintain pre-approved address lists and allow guest filter registration of
> >> only those addresses for security. This seemed reasonable. Plus the ability
> >> to support additional address registration from guest could be made
> >> configurable (One of your ideas again from prior work).
> >> 
> >> I am not an selinux expert, but I am thinking we can use it to only allow 
> >> or
> >> disallow access or operations to the macvtap device. (?). I will check more
> >> on this.
> > 
> > We'd have to have a way to revoke that as well.
> > 
> Yes true.
> 
> 
> >>> 
> >>> Any thoughts on spoofing filtering?
> >> 
> >> I can only think of checking addresses against an allowed address list.
> >> Don't know of any other ways. An

Re: [RFC PATCH 0/5] VFIO-NG group/device/iommu framework

2011-09-08 Thread Alex Williamson
On Wed, 2011-09-07 at 13:58 +0200, Alexander Graf wrote:
> On 01.09.2011, at 21:50, Alex Williamson wrote:
> 
> > Trying to move beyond talking about how VFIO should work to
> > re-writing the code.  This is pre-alpha, known broken, will
> > probably crash your system but it illustrates some of how
> > I see groups, devices, and iommus interacting.  This is just
> > the framework, no code to actually support user space drivers
> > or device assignment yet.
> > 
> > The iommu portions are still using the "FIXME" PCI specific
> > hooks.  Once Joerg gets some buy-in on his bus specific iommu
> > patches, we can move to that.
> > 
> > The group management is more complicated than I'd like and
> > you can get groups into a bad state by killing the test program
> > with devices/iommus open.  The locking is overly simplistic.
> > But, it's a start.  Please make constructive comments and
> > suggestions.  Patches based on v3.0.  Thanks,
> 
> Looks pretty reasonable to me so far, but I guess we only know for sure once 
> we have non-PCI implemented and working with this scheme as well.
> Btw I couldn't find the PCI BAR regions mmaps and general config space 
> exposure. Where has that gone?

I ripped it out for now just to work on the group/device/iommu
framework.  I didn't see a need to make a functional RFC just to get
some buy-in on the framework.  Thanks,

Alex



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 08/13] xen/pvticketlock: disable interrupts while blocking

2011-09-08 Thread Jeremy Fitzhardinge
On 09/08/2011 12:51 AM, Avi Kivity wrote:
> On 09/07/2011 10:09 PM, Jeremy Fitzhardinge wrote:
>> On 09/07/2011 10:41 AM, Avi Kivity wrote:
>> >>  Hm, I'm interested to know what you're thinking in more detail. 
>> Can you
>> >>  leave an NMI pending before you block in the same way you can with
>> >>  "sti;halt" with normal interrupts?
>> >
>> >
>> >  Nope.  But you can do
>> >
>> > if (regs->rip in critical section)
>> > regs->rip = after_halt;
>> >
>> >  and effectively emulate it.  The critical section is something like
>> >
>> >  critical_section_start:
>> >  if (woken_up)
>> >  goto critical_section_end;
>> >  hlt
>> >  critical_section_end:
>>
>> Hm.  It's a pity you have to deliver an actual interrupt to implement
>> the kick though.
>
> I don't think it's that expensive, especially compared to the
> double-context-switch and vmexit of the spinner going to sleep.  On
> AMD we do have to take an extra vmexit (on IRET) though.

Fair enough - so if the vcpu blocks itself, it ends up being rescheduled
in the NMI handler, which then returns to the lock slowpath.  And if it's
a normal hlt, then you can also take interrupts if they're enabled while
spinning.

And if you get nested NMIs (since you can get multiple spurious kicks,
or from other NMI sources), then one NMI will get latched and any others
will get dropped?

> Well we could have a specialized sleep/wakeup hypercall pair like Xen,
> but I'd like to avoid it if at all possible.

Yeah, that's something that just falls out of the existing event channel
machinery, so it isn't something that I specifically added.  But it does
mean that you simply end up with a hypercall returning on kick, with no
real complexities.

J
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
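The hlt critical-section emulation quoted above can be sketched in a few lines (a Python model; the addresses are invented, and the real check runs on the saved register frame inside the guest's NMI handler):

```python
# Model of the "move RIP past the hlt" trick quoted above: if the NMI
# kick interrupts the vCPU between the woken_up check and the hlt, the
# handler skips the hlt so the wakeup cannot be lost.
# The section bounds are invented addresses for illustration.
CS_START = 0x1000   # critical_section_start label
CS_END = 0x1010     # critical_section_end label (just past the hlt)

def fixup_rip(rip):
    if CS_START <= rip < CS_END:
        return CS_END  # emulate "hlt already executed and returned"
    return rip
```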


Re: [net-next-2.6 PATCH 0/3 RFC] macvlan: MAC Address filtering support for passthru mode

2011-09-08 Thread Roopa Prabhu



On 9/8/11 10:42 AM, "Sridhar Samudrala"  wrote:

> On Thu, 2011-09-08 at 09:19 -0700, Roopa Prabhu wrote:
>> 
>> 
>> On 9/8/11 4:08 AM, "Michael S. Tsirkin"  wrote:
>> 
>>> On Wed, Sep 07, 2011 at 10:20:28PM -0700, Roopa Prabhu wrote:
 On 9/7/11 5:34 AM, "Michael S. Tsirkin"  wrote:
 
> On Tue, Sep 06, 2011 at 03:35:40PM -0700, Roopa Prabhu wrote:
>> This patch is an attempt at providing address filtering support for
>> macvtap
>> devices in PASSTHRU mode. It's still a work in progress.
>> Briefly tested for basic functionality. Wanted to get some feedback on
>> the
>> direction before proceeding.
>> 
> 
> Good work, thanks.
> 
 
 Thanks.
 
>> I have hopefully CC'ed all concerned people.
> 
> kvm crowd might also be interested.
> Try using ./scripts/get_maintainer.pl as well.
> 
 Thanks for the tip. Expanded CC list a bit more.
 
>> PASSTHRU mode today sets the lowerdev in promiscuous mode. In PASSTHRU
>> mode
>> there is a 1-1 mapping between macvtap device and physical nic or VF. And
>> all
>> filtering is done in lowerdev hw. The lowerdev does not need to be in
>> promiscuous mode as long as the guest filters are passed down to the
>> lowerdev.
>> This patch tries to remove the need for putting the lowerdev in
>> promiscuous
>> mode. 
>> I have also referred to the thread below where TUNSETTXFILTER was
>> mentioned
>> in 
>> this context:
>>  http://patchwork.ozlabs.org/patch/69297/
>> 
>> This patch basically passes the addresses got by TUNSETTXFILTER to
>> macvlan
>> lowerdev.
>> 
>> I have looked at previous work and discussions on this for qemu-kvm
>> by Michael Tsirkin, Alex Williamson and Dragos Tatulea
>> http://patchwork.ozlabs.org/patch/78595/
>> http://patchwork.ozlabs.org/patch/47160/
>> https://patchwork.kernel.org/patch/474481/
>> 
>> Redhat bugzilla by Michael Tsirkin:
>> https://bugzilla.redhat.com/show_bug.cgi?id=655013
>> 
>> I used Michael's qemu-kvm patch for testing the changes with KVM
>> 
>> I would like to cover both MAC and vlan filtering in this work.
>> 
>> Open Questions/Issues:
>> - There is a need for vlan filtering to complete the patch. It will
>> require
>>   a new tap ioctl cmd for vlans.
>>   Some ideas on this are:
>> 
>>   a) TUNSETVLANFILTER: This will entail we send the whole vlan bitmap
>> filter
>> (similar to tun_filter for addresses). Passing the vlan id's to lower
>> device will mean going thru the whole list of vlans every time.
>> 
>>   OR
>> 
>>   b) TUNSETVLAN with vlan id and flag to set/unset
>> 
>>   Does option 'b' sound ok ?
>> 
>> - In this implementation we make the macvlan address list same as the
>> address
>>   list that came in the filter with TUNSETTXFILTER. This will not cover
>> cases
>>   where the macvlan device needs to have other addresses that are not
>>   necessarily in the filter. Is this a problem ?
> 
> What cases do you have in mind?
> 
 This patch targets only macvlan PASSTHRU mode and for PASSTHRU mode I don't
 see a problem with uc/mc address list being the same in all the stacked
 netdevs in the path. I called that out above to make sure I was not missing
 any case in PASSTHRU mode where this might be invalid. Otherwise I don't
 see
 a problem in the simple PASSTHRU use case this patch supports.
 
>> - The patch currently only supports passing of IFF_PROMISC and
>> IFF_MULTICAST
>> filter flags to lowerdev
>> 
>> This patch series implements the following
>> 01/3 - macvlan: Add support for unicast filtering in macvlan
>> 02/3 - macvlan: Add function to set addr filter on lower device in
>> passthru
>> mode
>> 03/3 - macvtap: Add support for TUNSETTXFILTER
>> 
>> Please comment. Thanks.
>> 
>> Signed-off-by: Roopa Prabhu 
>> Signed-off-by: Christian Benvenuti 
>> Signed-off-by: David Wang 
> 
> The security isn't lower than with promisc, so I don't see
> a problem with this as such.
> 
> There are more features we'll want down the road though,
> so let's see whether the interface will be able to
> satisfy them in a backwards compatible way before we
> set it in stone. Here's what I came up with:
> 
> How will the filtering table be partitioned within guests?
 
 Since this patch supports macvlan PASSTHRU mode only, in which the lower
 device has 1-1 mapping to the guest nic, it does not require any
 partitioning of filtering table within guests. Unless I missed
 understanding
 something. 
 If the lower device were being shared by multiple guest network interfaces
 (non PASSTHRU mode), only then we will need to maintain separate filter
 tables for

RE: [PATCH] KVM: emulate lapic tsc deadline timer for hvm

2011-09-08 Thread Liu, Jinsong
>>> --- a/arch/x86/include/asm/msr-index.h
>>> +++ b/arch/x86/include/asm/msr-index.h
>>> @@ -229,6 +229,8 @@
>>>  #define MSR_IA32_APICBASE_ENABLE   (1<<11)
>>>  #define MSR_IA32_APICBASE_BASE (0xf<<12)
>>> 
>>> +#define MSR_IA32_TSCDEADLINE   0x06e0
>>> +
>>>  #define MSR_IA32_UCODE_WRITE   0x0079
>>>  #define MSR_IA32_UCODE_REV 0x008b
>> 
>> Need to add to msrs_to_save so live migration works.
> 
> MSR must be explicitly listed in qemu, also.
> 

Marcelo, it seems the MSR doesn't need to be explicitly listed in qemu?
Adding MSR_IA32_TSCDEADLINE to msrs_to_save on the KVM side is enough; qemu
will get it through KVM_GET_MSR_INDEX_LIST.
Am I missing something?

Thanks,
Jinsong
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
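Jinsong's point can be illustrated with a toy model of the flow (Python; the MSR values other than MSR_IA32_TSCDEADLINE and the helper name are invented, and the real interface is the KVM_GET_MSR_INDEX_LIST ioctl):

```python
# Toy model of why listing the MSR in qemu may be unnecessary: the
# kernel advertises its savable MSRs, and userspace migrates exactly
# that set.  kvm_get_msr_index_list() stands in for the real ioctl;
# the other MSR indices here are arbitrary placeholders.
MSR_IA32_TSCDEADLINE = 0x6e0

kernel_msrs_to_save = [0x174, 0x175, 0x176, MSR_IA32_TSCDEADLINE]

def kvm_get_msr_index_list():
    return list(kernel_msrs_to_save)

# Userspace picks up the new MSR automatically once KVM reports it.
assert MSR_IA32_TSCDEADLINE in kvm_get_msr_index_list()
```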


Re: KVM got slow after adding more physical memory to host - SOLVED

2011-09-08 Thread Michael Tokarev
On 06.09.2011 23:22, Nikola Ciprich wrote:
> Hello guys,
> 
> thanks to both of You for Your replies. The problem is solved,
> exactly as Avi said, the DMA in windows got somehow disabled.
> So this certainly was not related to adding the memory...
> 
> anyways, note for further generations:
> in windows XP, the DMA usage can be checked in 
> Device Manager->IDE ATA/ATAPI Controllers -> Properties->Advanced Settings
> Current Transfer Mode must be Multi-Word DMA2 or something similar,
> NOT PIO!
> 
> The way I enabled this, was to uninstall both primary and secondary controller
> in device controller, then also uninstall Intel controller, and THEN 
> rebooting (NOT sooner!)
> after reboot, controllers got detected and installed again, with DMA properly 
> enabled.
> Note that when controller is in PIO mode, this is really a patience test, 
> switching to DMA took
> me like half an hour to complete, so slow the system was :-/

You can open regedit and search for "NoIDE" - it should find one key
with that name, with value "1" or "yes" - just delete it, there's no
need to go that route with removing/reinstalling device drivers.

The reason why it has been disabled is most likely a timeout while
handling I/O -- e.g. when your host was loaded too heavily or was
swapping. You should be able to find something about that in the
Windows event log.

/mjt
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: CFQ I/O starvation problem triggered by RHEL6.0 KVM guests

2011-09-08 Thread Vivek Goyal
On Thu, Sep 08, 2011 at 06:13:53PM +0900, Takuya Yoshikawa wrote:
> This is a report of strange cfq behaviour which seems to be triggered by
> QEMU posix aio threads.
> 
> Host environment:
>   OS: RHEL6.0 KVM/qemu-kvm (with no patch applied)
>   IO scheduler: cfq (with the default parameters)

So you are using RHEL 6.0 for both the host and guest kernel? Can you
reproduce the same issue with upstream kernels? How easily/frequently
you can reproduce this with RHEL6.0 host.

> 
> On the host, we were running 3 linux guests to see if I/O from these guests
> would be handled fairly by host; each guest did dd write with oflag=direct.
> 
> Guest virtual disk:
>   We used a host local disk which had 3 partitions, and each guest was
>   allocated one of these as dd write target.
> 
> So our test was for checking if cfq could keep fairness for the 3 guests
> who shared the same disk.
> 
> The result (strange starvation):
>   Sometimes, one guest dominated cfq for more than 10sec and requests from
>   other guests were not handled at all during that time.
> 
> Below is the blktrace log which shows that a request to (8,27) in cfq2068S (*1)
> is not handled at all during cfq2095S and cfq2067S which hold requests to
> (8,26) are being handled alternately.
> 
> *1) WS 104920578 + 64
> 
> Question:
>   I guess that cfq_close_cooperator() was being called in an unusual manner.
>   If so, do you think that cfq is responsible for keeping fairness for this
>   kind of unusual write requests?

- If two guests are doing IO to separate partitions, they should really
  not be very close (until and unless partitions are really small).

- Even if there are close cooperators, these queues are merged and they
  are treated as single queue from slice point of view. So cooperating
  queues should be merged and get a single slice instead of starving
  other queues in the system.

Can you upload the blktrace logs somewhere which shows what happened 
during that 10 seconds.

> 
> Note:
>   With RHEL6.1, this problem could not be triggered. But I guess that was due to
>   QEMU's block layer updates.

You can try reproducing this with fio.

Thanks
Vivek
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
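A minimal fio job file approximating the reported workload (three O_DIRECT sequential writers, one per partition) might look like this -- the device names are placeholders for the three partitions backing the guests:

```
[global]
rw=write
direct=1
bs=1M
runtime=60
time_based

[guest1]
filename=/dev/sdX1
[guest2]
filename=/dev/sdX2
[guest3]
filename=/dev/sdX3
```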


Re: [net-next-2.6 PATCH 0/3 RFC] macvlan: MAC Address filtering support for passthru mode

2011-09-08 Thread Michael S. Tsirkin
On Wed, Sep 07, 2011 at 10:20:28PM -0700, Roopa Prabhu wrote:
> On 9/7/11 5:34 AM, "Michael S. Tsirkin"  wrote:
> 
> > On Tue, Sep 06, 2011 at 03:35:40PM -0700, Roopa Prabhu wrote:
> >> This patch is an attempt at providing address filtering support for macvtap
> >> devices in PASSTHRU mode. It's still a work in progress.
> >> Briefly tested for basic functionality. Wanted to get some feedback on the
> >> direction before proceeding.
> >> 
> > 
> > Good work, thanks.
> > 
> 
> Thanks.
> 
> >> I have hopefully CC'ed all concerned people.
> > 
> > kvm crowd might also be interested.
> > Try using ./scripts/get_maintainer.pl as well.
> > 
> Thanks for the tip. Expanded CC list a bit more.
> 
> >> PASSTHRU mode today sets the lowerdev in promiscuous mode. In PASSTHRU mode
> >> there is a 1-1 mapping between macvtap device and physical nic or VF. And 
> >> all
> >> filtering is done in lowerdev hw. The lowerdev does not need to be in
> >> promiscuous mode as long as the guest filters are passed down to the 
> >> lowerdev.
> >> This patch tries to remove the need for putting the lowerdev in promiscuous
> >> mode. 
> >> I have also referred to the thread below where TUNSETTXFILTER was mentioned
> >> in 
> >> this context: 
> >>  http://patchwork.ozlabs.org/patch/69297/
> >> 
> >> This patch basically passes the addresses got by TUNSETTXFILTER to macvlan
> >> lowerdev.
> >> 
> >> I have looked at previous work and discussions on this for qemu-kvm
> >> by Michael Tsirkin, Alex Williamson and Dragos Tatulea
> >> http://patchwork.ozlabs.org/patch/78595/
> >> http://patchwork.ozlabs.org/patch/47160/
> >> https://patchwork.kernel.org/patch/474481/
> >> 
> >> Redhat bugzilla by Michael Tsirkin:
> >> https://bugzilla.redhat.com/show_bug.cgi?id=655013
> >> 
> >> I used Michael's qemu-kvm patch for testing the changes with KVM
> >> 
> >> I would like to cover both MAC and vlan filtering in this work.
> >> 
> >> Open Questions/Issues:
> >> - There is a need for vlan filtering to complete the patch. It will require
> >>   a new tap ioctl cmd for vlans.
> >>   Some ideas on this are:
> >> 
> >>   a) TUNSETVLANFILTER: This will entail we send the whole vlan bitmap 
> >> filter
> >> (similar to tun_filter for addresses). Passing the vlan id's to lower
> >> device will mean going thru the whole list of vlans every time.
> >> 
> >>   OR
> >> 
> >>   b) TUNSETVLAN with vlan id and flag to set/unset
> >> 
> >>   Does option 'b' sound ok ?
> >> 
> >> - In this implementation we make the macvlan address list same as the 
> >> address
> >>   list that came in the filter with TUNSETTXFILTER. This will not cover 
> >> cases
> >>   where the macvlan device needs to have other addresses that are not
> >>   necessarily in the filter. Is this a problem ?
> > 
> > What cases do you have in mind?
> > 
> This patch targets only macvlan PASSTHRU mode and for PASSTHRU mode I don't
> see a problem with uc/mc address list being the same in all the stacked
> netdevs in the path. I called that out above to make sure I was not missing
> any case in PASSTHRU mode where this might be invalid. Otherwise I don't see
> a problem in the simple PASSTHRU use case this patch supports.
> 
> >> - The patch currently only supports passing of IFF_PROMISC and 
> >> IFF_MULTICAST
> >> filter flags to lowerdev
> >> 
> >> This patch series implements the following
> >> 01/3 - macvlan: Add support for unicast filtering in macvlan
> >> 02/3 - macvlan: Add function to set addr filter on lower device in passthru
> >> mode
> >> 03/3 - macvtap: Add support for TUNSETTXFILTER
> >> 
> >> Please comment. Thanks.
> >> 
> >> Signed-off-by: Roopa Prabhu 
> >> Signed-off-by: Christian Benvenuti 
> >> Signed-off-by: David Wang 
> > 
> > The security isn't lower than with promisc, so I don't see
> > a problem with this as such.
> > 
> > There are more features we'll want down the road though,
> > so let's see whether the interface will be able to
> > satisfy them in a backwards compatible way before we
> > set it in stone. Here's what I came up with:
> > 
> > How will the filtering table be partitioned within guests?
> 
> Since this patch supports macvlan PASSTHRU mode only, in which the lower
> device has 1-1 mapping to the guest nic, it does not require any
> partitioning of filtering table within guests. Unless I missed understanding
> something. 
> If the lower device were being shared by multiple guest network interfaces
> (non PASSTHRU mode), only then we will need to maintain separate filter
> tables for each guest network interface in macvlan and forward the pkt to
> respective guest interface after a filter lookup. This could affect
> performance too I think.

Not with hardware filtering support. Which is where we'd need to
partition the host nic mac table between guests.

> I chose to support PASSTHRU Mode only at first because it's simpler and all
> code additions are in control path only.

I agree. It would be a bit silly to have a dedicated interface
for passthrough and a completely separate one for non passthrough.

[PATCH 2/2] qemu-kvm: pc: Do not start APIC timer spuriously

2011-09-08 Thread Jan Kiszka
apic_timer_update not only calculates the next timer expiry that we need
to write out the vmstate, it may also start the timer of the user space
APIC model. That can cause spurious signals to the corresponding vCPU
thread when the timer expires. Fix this by using the new apic_next_timer
that does not start the timer.

Signed-off-by: Jan Kiszka 
---

Found while hunting user space exits of our never-exits-to-user-space
real-time guest.

 hw/apic.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/hw/apic.c b/hw/apic.c
index b3044aa..e43219f 100644
--- a/hw/apic.c
+++ b/hw/apic.c
@@ -950,7 +950,7 @@ static void kvm_kernel_lapic_save_to_user(APICState *s)
 s->count_shift = (v + 1) & 7;
 
 s->initial_count_load_time = qemu_get_clock_ns(vm_clock);
-apic_timer_update(s, s->initial_count_load_time);
+apic_next_timer(s, s->initial_count_load_time);
 }
 
 static void kvm_kernel_lapic_load_from_user(APICState *s)
-- 
1.7.3.4
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] qemu-kvm: pc: Factor out apic_next_timer

2011-09-08 Thread Jan Kiszka
Factor out apic_next_timer from apic_timer_update. The former can then
be used to update next_timer without actually starting the qemu timer.
KVM's in-kernel APIC model will make use of it.

Signed-off-by: Jan Kiszka 
---
 hw/apic.c |   45 -
 1 files changed, 28 insertions(+), 17 deletions(-)

diff --git a/hw/apic.c b/hw/apic.c
index 96173ac..b3044aa 100644
--- a/hw/apic.c
+++ b/hw/apic.c
@@ -661,27 +661,38 @@ static uint32_t apic_get_current_count(APICState *s)
 return val;
 }
 
-static void apic_timer_update(APICState *s, int64_t current_time)
+static bool apic_next_timer(APICState *s, int64_t current_time)
 {
-int64_t next_time, d;
-
-if (!(s->lvt[APIC_LVT_TIMER] & APIC_LVT_MASKED)) {
-d = (current_time - s->initial_count_load_time) >>
-s->count_shift;
-if (s->lvt[APIC_LVT_TIMER] & APIC_LVT_TIMER_PERIODIC) {
-if (!s->initial_count)
-goto no_timer;
-d = ((d / ((uint64_t)s->initial_count + 1)) + 1) * ((uint64_t)s->initial_count + 1);
-} else {
-if (d >= s->initial_count)
-goto no_timer;
-d = (uint64_t)s->initial_count + 1;
+int64_t d;
+
+if (s->lvt[APIC_LVT_TIMER] & APIC_LVT_MASKED) {
+return false;
+}
+
+d = (current_time - s->initial_count_load_time) >> s->count_shift;
+
+if (s->lvt[APIC_LVT_TIMER] & APIC_LVT_TIMER_PERIODIC) {
+if (!s->initial_count) {
+return false;
 }
-next_time = s->initial_count_load_time + (d << s->count_shift);
-qemu_mod_timer(s->timer, next_time);
-s->next_time = next_time;
+d = ((d / ((uint64_t)s->initial_count + 1)) + 1) *
+((uint64_t)s->initial_count + 1);
+} else {
+if (d >= s->initial_count) {
+return false;
+}
+d = (uint64_t)s->initial_count + 1;
+}
+s->next_time = s->initial_count_load_time + (d << s->count_shift);
+
+return true;
+}
+
+static void apic_timer_update(APICState *s, int64_t current_time)
+{
+if (apic_next_timer(s, current_time)) {
+qemu_mod_timer(s->timer, s->next_time);
 } else {
-no_timer:
 qemu_del_timer(s->timer);
 }
 }
-- 
1.7.3.4


[PATCH] qemu-kvm: Resolve PCI upstream diffs

2011-09-08 Thread Jan Kiszka
Resolve all unneeded deviations from upstream code. No functional
changes.

Signed-off-by: Jan Kiszka 
---
 hw/pci.c |   11 +++
 hw/pci.h |5 -
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/hw/pci.c b/hw/pci.c
index e4c166a..4d8845c 100644
--- a/hw/pci.c
+++ b/hw/pci.c
@@ -2106,8 +2106,6 @@ int pci_add_capability(PCIDevice *pdev, uint8_t cap_id,
 memset(pdev->wmask + offset, 0, size);
 /* Check capability by default */
 memset(pdev->cmask + offset, 0xFF, size);
-
-
 return offset;
 }
 
@@ -2125,9 +2123,14 @@ void pci_del_capability(PCIDevice *pdev, uint8_t cap_id, uint8_t size)
 memset(pdev->cmask + offset, 0, size);
 memset(pdev->used + offset, 0, size);
 
-if (!pdev->config[PCI_CAPABILITY_LIST]) {
+if (!pdev->config[PCI_CAPABILITY_LIST])
 pdev->config[PCI_STATUS] &= ~PCI_STATUS_CAP_LIST;
-}
+}
+
+/* Reserve space for capability at a known offset (to call after load). */
+void pci_reserve_capability(PCIDevice *pdev, uint8_t offset, uint8_t size)
+{
+memset(pdev->used + offset, 0xff, size);
 }
 
 uint8_t pci_find_capability(PCIDevice *pdev, uint8_t cap_id)
diff --git a/hw/pci.h b/hw/pci.h
index da0c2d2..70fcd9c 100644
--- a/hw/pci.h
+++ b/hw/pci.h
@@ -147,7 +147,7 @@ struct PCIDevice {
 /* Used to implement RW1C(Write 1 to Clear) bytes */
 uint8_t *w1cmask;
 
-/* Used to allocate config space and track capabilities. */
+/* Used to allocate config space for capabilities. */
 uint8_t *used;
 
 /* the following fields are read only */
@@ -230,8 +230,11 @@ int pci_add_capability(PCIDevice *pdev, uint8_t cap_id,
 
 void pci_del_capability(PCIDevice *pci_dev, uint8_t cap_id, uint8_t cap_size);
 
+void pci_reserve_capability(PCIDevice *pci_dev, uint8_t offset, uint8_t size);
+
 uint8_t pci_find_capability(PCIDevice *pci_dev, uint8_t cap_id);
 
+
 uint32_t pci_default_read_config(PCIDevice *d,
  uint32_t address, int len);
 void pci_default_write_config(PCIDevice *d,
-- 
1.7.3.4


[PATCH v8 0/4] The intro of QEMU block I/O throttling

2011-09-08 Thread Zhi Yong Wu
The main goal of this patch series is to effectively cap the disk I/O speed
or request rate of a single VM. It is still a draft, so it unavoidably has
some drawbacks; if you catch any, please let me know.

The patch will mainly introduce one block I/O throttling algorithm, one timer 
and one block queue for each I/O limits enabled drive.

When a block request comes in, the throttling algorithm checks whether its
I/O rate or request count exceeds the limits; if so, the request is enqueued
on the block queue, and the timer dispatches the queued requests later.

Some available features follow as below:
(1) global bps limit.
   -drive bps=xxx    in bytes/s
(2) only read bps limit
   -drive bps_rd=xxx in bytes/s
(3) only write bps limit
   -drive bps_wr=xxx in bytes/s
(4) global iops limit
   -drive iops=xxx   in ios/s
(5) only read iops limit
   -drive iops_rd=xxx in ios/s
(6) only write iops limit
   -drive iops_wr=xxx in ios/s
(7) the combination of some limits.
   -drive bps=xxx,iops=xxx

Known Limitations:
(1) #1 can not coexist with #2, #3
(2) #4 can not coexist with #5, #6
(3) When bps/iops limits are specified to a small value such as 511 bytes/s,
the VM will hang up. We are considering how to handle this scenario.

Changes since code V7:
  fix the build per patch based on stefan's comments.

Zhi Yong Wu (4):
  block: add the command line support
  block: add the block queue support
  block: add block timer and throttling algorithm
  qmp/hmp: add block_set_io_throttle

 v7: Mainly simplify the block queue.
 Adjust codes based on stefan's comments.

 v6: Mainly fix the aio callback issue for block queue.
 Adjust codes based on Ram Pai's comments.

 v5: add qmp/hmp support.
 Adjust the codes based on stefan's comments
 qmp/hmp: add block_set_io_throttle

 v4: fix memory leaking based on ryan's feedback.

 v3: Added the code for extending slice time, and modified the method to 
compute wait time for the timer.

 v2: The codes V2 for QEMU disk I/O limits.
 Modified the codes mainly based on stefan's comments.

 v1: Submit the codes for QEMU disk I/O limits.
 Only a code draft.


 Makefile.objs |2 +-
 block.c   |  344 +++--
 block.h   |6 +-
 block/blk-queue.c |  201 +++
 block/blk-queue.h |   59 +
 block_int.h   |   30 +
 blockdev.c|   98 +++
 blockdev.h|2 +
 hmp-commands.hx   |   15 +++
 qemu-config.c |   24 
 qemu-options.hx   |1 +
 qerror.c  |4 +
 qerror.h  |3 +
 qmp-commands.hx   |   52 -
 14 files changed, 825 insertions(+), 16 deletions(-)
 create mode 100644 block/blk-queue.c
 create mode 100644 block/blk-queue.h

-- 
1.7.6



[PATCH v8 4/4] qmp/hmp: add block_set_io_throttle

2011-09-08 Thread Zhi Yong Wu
This patch introduces a new command, block_set_io_throttle. If you have a
better idea for its usage syntax, please let me know.

Signed-off-by: Zhi Yong Wu 
---
 block.c |   26 +++-
 blockdev.c  |   69 +++
 blockdev.h  |2 +
 hmp-commands.hx |   15 
 qerror.c|4 +++
 qerror.h|3 ++
 qmp-commands.hx |   52 -
 7 files changed, 168 insertions(+), 3 deletions(-)

diff --git a/block.c b/block.c
index c08fde8..1d3f067 100644
--- a/block.c
+++ b/block.c
@@ -1938,6 +1938,16 @@ static void bdrv_print_dict(QObject *obj, void *opaque)
 qdict_get_bool(qdict, "ro"),
 qdict_get_str(qdict, "drv"),
 qdict_get_bool(qdict, "encrypted"));
+
+monitor_printf(mon, " bps=%" PRId64 " bps_rd=%" PRId64
+" bps_wr=%" PRId64 " iops=%" PRId64
+" iops_rd=%" PRId64 " iops_wr=%" PRId64,
+qdict_get_int(qdict, "bps"),
+qdict_get_int(qdict, "bps_rd"),
+qdict_get_int(qdict, "bps_wr"),
+qdict_get_int(qdict, "iops"),
+qdict_get_int(qdict, "iops_rd"),
+qdict_get_int(qdict, "iops_wr"));
 } else {
 monitor_printf(mon, " [not inserted]");
 }
@@ -1970,10 +1980,22 @@ void bdrv_info(Monitor *mon, QObject **ret_data)
 QDict *bs_dict = qobject_to_qdict(bs_obj);
 
 obj = qobject_from_jsonf("{ 'file': %s, 'ro': %i, 'drv': %s, "
- "'encrypted': %i }",
+ "'encrypted': %i, "
+ "'bps': %" PRId64 ","
+ "'bps_rd': %" PRId64 ","
+ "'bps_wr': %" PRId64 ","
+ "'iops': %" PRId64 ","
+ "'iops_rd': %" PRId64 ","
+ "'iops_wr': %" PRId64 "}",
  bs->filename, bs->read_only,
  bs->drv->format_name,
- bdrv_is_encrypted(bs));
+ bdrv_is_encrypted(bs),
+ bs->io_limits.bps[BLOCK_IO_LIMIT_TOTAL],
+ bs->io_limits.bps[BLOCK_IO_LIMIT_READ],
+ bs->io_limits.bps[BLOCK_IO_LIMIT_WRITE],
+ bs->io_limits.iops[BLOCK_IO_LIMIT_TOTAL],
+ bs->io_limits.iops[BLOCK_IO_LIMIT_READ],
+ bs->io_limits.iops[BLOCK_IO_LIMIT_WRITE]);
 if (bs->backing_file[0] != '\0') {
 QDict *qdict = qobject_to_qdict(obj);
 qdict_put(qdict, "backing_file",
diff --git a/blockdev.c b/blockdev.c
index 619ae9f..7f5c4df 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -747,6 +747,75 @@ int do_change_block(Monitor *mon, const char *device,
 return monitor_read_bdrv_key_start(mon, bs, NULL, NULL);
 }
 
+/* throttling disk I/O limits */
+int do_block_set_io_throttle(Monitor *mon,
+   const QDict *qdict, QObject **ret_data)
+{
+const char *devname = qdict_get_str(qdict, "device");
+uint64_t bps= qdict_get_try_int(qdict, "bps", -1);
+uint64_t bps_rd = qdict_get_try_int(qdict, "bps_rd", -1);
+uint64_t bps_wr = qdict_get_try_int(qdict, "bps_wr", -1);
+uint64_t iops   = qdict_get_try_int(qdict, "iops", -1);
+uint64_t iops_rd= qdict_get_try_int(qdict, "iops_rd", -1);
+uint64_t iops_wr= qdict_get_try_int(qdict, "iops_wr", -1);
+BlockDriverState *bs;
+
+bs = bdrv_find(devname);
+if (!bs) {
+qerror_report(QERR_DEVICE_NOT_FOUND, devname);
+return -1;
+}
+
+if ((bps == -1) && (bps_rd == -1) && (bps_wr == -1)
+&& (iops == -1) && (iops_rd == -1) && (iops_wr == -1)) {
+qerror_report(QERR_MISSING_PARAMETER,
+  "bps/bps_rd/bps_wr/iops/iops_rd/iops_wr");
+return -1;
+}
+
+if (((bps != -1) && ((bps_rd != -1) || (bps_wr != -1)))
+|| ((iops != -1) && ((iops_rd != -1) || (iops_wr != -1)))) {
+qerror_report(QERR_INVALID_PARAMETER_COMBINATION);
+return -1;
+}
+
+if (bps != -1) {
+bs->io_limits.bps[BLOCK_IO_LIMIT_TOTAL] = bps;
+bs->io_limits.bps[BLOCK_IO_LIMIT_READ]  = 0;
+bs->io_limits.bps[BLOCK_IO_LIMIT_WRITE] = 0;
+}
+
+if ((bps_rd != -1) || (bps_wr != -1)) {
+bs->io_limits.bps[BLOCK_IO_LIMIT_READ]   =
+   (bps_rd == -1) ? bs->io_limits.bps[BLOCK_IO_LIMIT_READ] : bps_rd;
+bs->io_limits.bps[BLOCK_IO_LIMIT_WRITE]  =
+  

[PATCH v8 3/4] block: add block timer and throttling algorithm

2011-09-08 Thread Zhi Yong Wu
Note:
 1.) When bps/iops limits are specified to a small value such as 511
bytes/s, the VM will hang up. We are considering how to handle this scenario.
 2.) When a "dd" command is issued in the guest with a large block size such
as "bs=1024K", the resulting speed will be slightly higher than the limits.

If you have any thoughts on these problems, please let us know. :)

Signed-off-by: Zhi Yong Wu 
---
 block.c |  259 ---
 block.h |1 -
 2 files changed, 248 insertions(+), 12 deletions(-)

diff --git a/block.c b/block.c
index cd75183..c08fde8 100644
--- a/block.c
+++ b/block.c
@@ -30,6 +30,9 @@
 #include "qemu-objects.h"
 #include "qemu-coroutine.h"
 
+#include "qemu-timer.h"
+#include "block/blk-queue.h"
+
 #ifdef CONFIG_BSD
 #include 
 #include 
@@ -72,6 +75,13 @@ static int coroutine_fn bdrv_co_writev_em(BlockDriverState *bs,
  QEMUIOVector *iov);
 static int coroutine_fn bdrv_co_flush_em(BlockDriverState *bs);
 
+static bool bdrv_exceed_bps_limits(BlockDriverState *bs, int nb_sectors,
+bool is_write, double elapsed_time, uint64_t *wait);
+static bool bdrv_exceed_iops_limits(BlockDriverState *bs, bool is_write,
+double elapsed_time, uint64_t *wait);
+static bool bdrv_exceed_io_limits(BlockDriverState *bs, int nb_sectors,
+bool is_write, int64_t *wait);
+
 static QTAILQ_HEAD(, BlockDriverState) bdrv_states =
 QTAILQ_HEAD_INITIALIZER(bdrv_states);
 
@@ -745,6 +755,11 @@ int bdrv_open(BlockDriverState *bs, const char *filename, int flags,
 bs->change_cb(bs->change_opaque, CHANGE_MEDIA);
 }
 
+/* throttling disk I/O limits */
+if (bs->io_limits_enabled) {
+bdrv_io_limits_enable(bs);
+}
+
 return 0;
 
 unlink_and_fail:
@@ -783,6 +798,18 @@ void bdrv_close(BlockDriverState *bs)
 if (bs->change_cb)
 bs->change_cb(bs->change_opaque, CHANGE_MEDIA);
 }
+
+/* throttling disk I/O limits */
+if (bs->block_queue) {
+qemu_del_block_queue(bs->block_queue);
+bs->block_queue = NULL;
+}
+
+if (bs->block_timer) {
+qemu_del_timer(bs->block_timer);
+qemu_free_timer(bs->block_timer);
+bs->block_timer = NULL;
+}
 }
 
 void bdrv_close_all(void)
@@ -2341,16 +2368,48 @@ BlockDriverAIOCB *bdrv_aio_readv(BlockDriverState *bs, int64_t sector_num,
  BlockDriverCompletionFunc *cb, void *opaque)
 {
 BlockDriver *drv = bs->drv;
-
+BlockDriverAIOCB *ret;
+int64_t wait_time = -1;
+printf("sector_num=%ld, nb_sectors=%d\n", sector_num, nb_sectors);
 trace_bdrv_aio_readv(bs, sector_num, nb_sectors, opaque);
 
-if (!drv)
-return NULL;
-if (bdrv_check_request(bs, sector_num, nb_sectors))
+if (!drv || bdrv_check_request(bs, sector_num, nb_sectors)) {
 return NULL;
+}
+
+/* throttling disk read I/O */
+if (bs->io_limits_enabled) {
+if (bdrv_exceed_io_limits(bs, nb_sectors, false, &wait_time)) {
+ret = qemu_block_queue_enqueue(bs->block_queue, bs, bdrv_aio_readv,
+   sector_num, qiov, nb_sectors, cb, opaque);
+printf("wait_time=%ld\n", wait_time);
+if (wait_time != -1) {
+printf("reset block timer\n");
+qemu_mod_timer(bs->block_timer,
+   wait_time + qemu_get_clock_ns(vm_clock));
+}
+
+if (ret) {
+printf("ori ret is not null\n");
+} else {
+printf("ori ret is null\n");
+}
+
+return ret;
+}
+}
 
-return drv->bdrv_aio_readv(bs, sector_num, qiov, nb_sectors,
+ret =  drv->bdrv_aio_readv(bs, sector_num, qiov, nb_sectors,
cb, opaque);
+if (ret) {
+if (bs->io_limits_enabled) {
+bs->io_disps.bytes[BLOCK_IO_LIMIT_READ] +=
+  (unsigned) nb_sectors * BDRV_SECTOR_SIZE;
+bs->io_disps.ios[BLOCK_IO_LIMIT_READ]++;
+}
+}
+
+return ret;
 }
 
 typedef struct BlockCompleteData {
@@ -2396,15 +2455,14 @@ BlockDriverAIOCB *bdrv_aio_writev(BlockDriverState *bs, int64_t sector_num,
 BlockDriver *drv = bs->drv;
 BlockDriverAIOCB *ret;
 BlockCompleteData *blk_cb_data;
+int64_t wait_time = -1;
 
 trace_bdrv_aio_writev(bs, sector_num, nb_sectors, opaque);
 
-if (!drv)
-return NULL;
-if (bs->read_only)
-return NULL;
-if (bdrv_check_request(bs, sector_num, nb_sectors))
+if (!drv || bs->read_only
+|| bdrv_check_request(bs, sector_num, nb_sectors)) {
 return NULL;
+}
 
 if (bs->dirty_bitmap) {
 blk_cb_data = blk_dirty_cb_alloc(bs, sector_num, nb_sectors, cb,
@@ -2413,13 +2471,32 @@ BlockDriverAIOCB *bdrv_aio_writev(BlockDriverState *bs, int64_t sector_num,
 opaque = blk_cb_data;
 }

[PATCH v8 2/4] block: add the command line support

2011-09-08 Thread Zhi Yong Wu
Signed-off-by: Zhi Yong Wu 
---
 block.c |   59 +++
 block.h |5 
 block_int.h |3 ++
 blockdev.c  |   29 +++
 qemu-config.c   |   24 ++
 qemu-options.hx |1 +
 6 files changed, 121 insertions(+), 0 deletions(-)

diff --git a/block.c b/block.c
index 43742b7..cd75183 100644
--- a/block.c
+++ b/block.c
@@ -104,6 +104,57 @@ int is_windows_drive(const char *filename)
 }
 #endif
 
+/* throttling disk I/O limits */
+void bdrv_io_limits_disable(BlockDriverState *bs)
+{
+bs->io_limits_enabled = false;
+
+if (bs->block_queue) {
+qemu_block_queue_flush(bs->block_queue);
+qemu_del_block_queue(bs->block_queue);
+bs->block_queue = NULL;
+}
+
+if (bs->block_timer) {
+qemu_del_timer(bs->block_timer);
+qemu_free_timer(bs->block_timer);
+bs->block_timer = NULL;
+}
+
+bs->slice_start = 0;
+
+bs->slice_end   = 0;
+}
+
+static void bdrv_block_timer(void *opaque)
+{
+BlockDriverState *bs = opaque;
+BlockQueue *queue= bs->block_queue;
+
+qemu_block_queue_flush(queue);
+}
+
+void bdrv_io_limits_enable(BlockDriverState *bs)
+{
+bs->block_queue = qemu_new_block_queue();
+bs->block_timer = qemu_new_timer_ns(vm_clock, bdrv_block_timer, bs);
+
+bs->slice_start = qemu_get_clock_ns(vm_clock);
+
+bs->slice_end   = bs->slice_start + BLOCK_IO_SLICE_TIME;
+}
+
+bool bdrv_io_limits_enabled(BlockDriverState *bs)
+{
+BlockIOLimit *io_limits = &bs->io_limits;
+return io_limits->bps[BLOCK_IO_LIMIT_READ]
+ || io_limits->bps[BLOCK_IO_LIMIT_WRITE]
+ || io_limits->bps[BLOCK_IO_LIMIT_TOTAL]
+ || io_limits->iops[BLOCK_IO_LIMIT_READ]
+ || io_limits->iops[BLOCK_IO_LIMIT_WRITE]
+ || io_limits->iops[BLOCK_IO_LIMIT_TOTAL];
+}
+
 /* check if the path starts with ":" */
 static int path_has_protocol(const char *path)
 {
@@ -1453,6 +1504,14 @@ void bdrv_get_geometry_hint(BlockDriverState *bs,
 *psecs = bs->secs;
 }
 
+/* throttling disk io limits */
+void bdrv_set_io_limits(BlockDriverState *bs,
+BlockIOLimit *io_limits)
+{
+bs->io_limits = *io_limits;
+bs->io_limits_enabled = bdrv_io_limits_enabled(bs);
+}
+
 /* Recognize floppy formats */
 typedef struct FDFormat {
 FDriveType drive;
diff --git a/block.h b/block.h
index 3ac0b94..a3e69db 100644
--- a/block.h
+++ b/block.h
@@ -58,6 +58,11 @@ void bdrv_info(Monitor *mon, QObject **ret_data);
 void bdrv_stats_print(Monitor *mon, const QObject *data);
 void bdrv_info_stats(Monitor *mon, QObject **ret_data);
 
+/* disk I/O throttling */
+void bdrv_io_limits_enable(BlockDriverState *bs);
+void bdrv_io_limits_disable(BlockDriverState *bs);
+bool bdrv_io_limits_enabled(BlockDriverState *bs);
+
 void bdrv_init(void);
 void bdrv_init_with_whitelist(void);
 BlockDriver *bdrv_find_protocol(const char *filename);
diff --git a/block_int.h b/block_int.h
index 201e635..368c776 100644
--- a/block_int.h
+++ b/block_int.h
@@ -257,6 +257,9 @@ void qemu_aio_release(void *p);
 
 void *qemu_blockalign(BlockDriverState *bs, size_t size);
 
+void bdrv_set_io_limits(BlockDriverState *bs,
+BlockIOLimit *io_limits);
+
 #ifdef _WIN32
 int is_windows_drive(const char *filename);
 #endif
diff --git a/blockdev.c b/blockdev.c
index 2602591..619ae9f 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -236,6 +236,7 @@ DriveInfo *drive_init(QemuOpts *opts, int default_to_scsi)
 int on_read_error, on_write_error;
 const char *devaddr;
 DriveInfo *dinfo;
+BlockIOLimit io_limits;
 int snapshot = 0;
 int ret;
 
@@ -354,6 +355,31 @@ DriveInfo *drive_init(QemuOpts *opts, int default_to_scsi)
 }
 }
 
+/* disk I/O throttling */
+io_limits.bps[BLOCK_IO_LIMIT_TOTAL]  =
+   qemu_opt_get_number(opts, "bps", 0);
+io_limits.bps[BLOCK_IO_LIMIT_READ]   =
+   qemu_opt_get_number(opts, "bps_rd", 0);
+io_limits.bps[BLOCK_IO_LIMIT_WRITE]  =
+   qemu_opt_get_number(opts, "bps_wr", 0);
+io_limits.iops[BLOCK_IO_LIMIT_TOTAL] =
+   qemu_opt_get_number(opts, "iops", 0);
+io_limits.iops[BLOCK_IO_LIMIT_READ]  =
+   qemu_opt_get_number(opts, "iops_rd", 0);
+io_limits.iops[BLOCK_IO_LIMIT_WRITE] =
+   qemu_opt_get_number(opts, "iops_wr", 0);
+
+if (((io_limits.bps[BLOCK_IO_LIMIT_TOTAL] != 0)
+&& ((io_limits.bps[BLOCK_IO_LIMIT_READ] != 0)
+|| (io_limits.bps[BLOCK_IO_LIMIT_WRITE] != 0)))
+|| ((io_limits.iops[BLOCK_IO_LIMIT_TOTAL] != 0)
+&& ((io_limits.iops[BLOCK_IO_LIMIT_READ] != 0)
+|| (io_limits.iops[BLOCK_IO_LIMIT_WRITE] != 0)))) {
+error_report("bps(iops) and bps_rd/bps_wr(iops_rd/iops_wr) "
+ "cannot be used at the same time");

[PATCH v8 1/4] block: add the block queue support

2011-09-08 Thread Zhi Yong Wu
Signed-off-by: Zhi Yong Wu 
---
 Makefile.objs |2 +-
 block/blk-queue.c |  201 +
 block/blk-queue.h |   59 
 block_int.h   |   27 +++
 4 files changed, 288 insertions(+), 1 deletions(-)
 create mode 100644 block/blk-queue.c
 create mode 100644 block/blk-queue.h

diff --git a/Makefile.objs b/Makefile.objs
index 26b885b..5dcf456 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -33,7 +33,7 @@ block-nested-y += raw.o cow.o qcow.o vdi.o vmdk.o cloop.o 
dmg.o bochs.o vpc.o vv
 block-nested-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o 
qcow2-cache.o
 block-nested-y += qed.o qed-gencb.o qed-l2-cache.o qed-table.o qed-cluster.o
 block-nested-y += qed-check.o
-block-nested-y += parallels.o nbd.o blkdebug.o sheepdog.o blkverify.o
+block-nested-y += parallels.o nbd.o blkdebug.o sheepdog.o blkverify.o blk-queue.o
 block-nested-$(CONFIG_WIN32) += raw-win32.o
 block-nested-$(CONFIG_POSIX) += raw-posix.o
 block-nested-$(CONFIG_CURL) += curl.o
diff --git a/block/blk-queue.c b/block/blk-queue.c
new file mode 100644
index 0000000..adef497
--- /dev/null
+++ b/block/blk-queue.c
@@ -0,0 +1,201 @@
+/*
+ * QEMU System Emulator queue definition for block layer
+ *
+ * Copyright (c) IBM, Corp. 2011
+ *
+ * Authors:
+ *  Zhi Yong Wu  
+ *  Stefan Hajnoczi 
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to 
deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING 
FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+
+#include "block_int.h"
+#include "block/blk-queue.h"
+#include "qemu-common.h"
+
+/* The APIs for block request queue on qemu block layer.
+ */
+
+struct BlockQueueAIOCB {
+BlockDriverAIOCB common;
+QTAILQ_ENTRY(BlockQueueAIOCB) entry;
+BlockRequestHandler *handler;
+BlockDriverAIOCB *real_acb;
+
+int64_t sector_num;
+QEMUIOVector *qiov;
+int nb_sectors;
+};
+
+typedef struct BlockQueueAIOCB BlockQueueAIOCB;
+
+struct BlockQueue {
+QTAILQ_HEAD(requests, BlockQueueAIOCB) requests;
+bool req_failed;
+bool flushing;
+};
+
+static void qemu_block_queue_dequeue(BlockQueue *queue,
+ BlockQueueAIOCB *request)
+{
+BlockQueueAIOCB *req;
+
+assert(queue);
+while (!QTAILQ_EMPTY(&queue->requests)) {
+req = QTAILQ_FIRST(&queue->requests);
+if (req == request) {
+QTAILQ_REMOVE(&queue->requests, req, entry);
+break;
+}
+}
+}
+
+static void qemu_block_queue_cancel(BlockDriverAIOCB *acb)
+{
+BlockQueueAIOCB *request = container_of(acb, BlockQueueAIOCB, common);
+if (request->real_acb) {
+bdrv_aio_cancel(request->real_acb);
+} else {
+assert(request->common.bs->block_queue);
+qemu_block_queue_dequeue(request->common.bs->block_queue,
+ request);
+}
+
+qemu_aio_release(request);
+}
+
+static AIOPool block_queue_pool = {
+.aiocb_size = sizeof(struct BlockQueueAIOCB),
+.cancel = qemu_block_queue_cancel,
+};
+
+static void qemu_block_queue_callback(void *opaque, int ret)
+{
+BlockQueueAIOCB *acb = opaque;
+
+if (acb->common.cb) {
+acb->common.cb(acb->common.opaque, ret);
+}
+
+qemu_aio_release(acb);
+}
+
+BlockQueue *qemu_new_block_queue(void)
+{
+BlockQueue *queue;
+
+queue = g_malloc0(sizeof(BlockQueue));
+
+QTAILQ_INIT(&queue->requests);
+
+queue->req_failed = true;
+queue->flushing   = false;
+
+return queue;
+}
+
+void qemu_del_block_queue(BlockQueue *queue)
+{
+BlockQueueAIOCB *request, *next;
+
+QTAILQ_FOREACH_SAFE(request, &queue->requests, entry, next) {
+QTAILQ_REMOVE(&queue->requests, request, entry);
+qemu_aio_release(request);
+}
+
+g_free(queue);
+}
+
+BlockDriverAIOCB *qemu_block_queue_enqueue(BlockQueue *queue,
+BlockDriverState *bs,
+BlockRequestHandler *handler,
+int64_t sector_num,
+QEMUIO

CFQ I/O starvation problem triggered by RHEL6.0 KVM guests

2011-09-08 Thread Takuya Yoshikawa
This is a report of strange cfq behaviour which seems to be triggered by
QEMU posix aio threads.

Host environment:
  OS: RHEL6.0 KVM/qemu-kvm (with no patch applied)
  IO scheduler: cfq (with the default parameters)

On the host, we were running 3 Linux guests to see if I/O from these guests
would be handled fairly by the host; each guest did dd write with oflag=direct.

Guest virtual disk:
  We used a host local disk which had 3 partitions, and each guest was
  allocated one of these as dd write target.

So our test was for checking if cfq could keep fairness for the 3 guests
who shared the same disk.

The result (strange starvation):
  Sometimes, one guest dominated cfq for more than 10 seconds, and requests
  from the other guests were not handled at all during that time.

Below is the blktrace log, which shows that a request to (8,27) in cfq2068S (*1)
is not handled at all while cfq2095S and cfq2067S, which hold requests to
(8,26), are being handled alternately.

*1) WS 104920578 + 64

Question:
  I guess that cfq_close_cooperator() was being called in an unusual manner.
  If so, do you think that cfq is responsible for keeping fairness for this
  kind of unusual write requests?

Note:
  With RHEL6.1, this problem could not be triggered. But I guess that was due
  to QEMU's block layer updates.

Thanks,
Takuya

--- blktrace log ---
  8,16   0 2010 0.275081840  2068  A  WS 104920578 + 64 <- (8,27) 0
  8,16   0 2011 0.275082180  2068  Q  WS 104920578 + 64 [qemu-kvm]
  8,16   00 0.275091369 0  m   N cfq2068S / alloced
  8,16   0 2012 0.275091909  2068  G  WS 104920578 + 64 [qemu-kvm]
  8,16   0 2013 0.275093352  2068  P   N [qemu-kvm]
  8,16   0 2014 0.275094059  2068  I   W 104920578 + 64 [qemu-kvm]
  8,16   00 0.275094887 0  m   N cfq2068S / insert_request
  8,16   00 0.275095742 0  m   N cfq2068S / add_to_rr
  8,16   0 2015 0.275097194  2068  U   N [qemu-kvm] 1
  8,16   2 2073 0.275189462  2095  A  WS 83979688 + 64 <- (8,26) 4
  8,16   2 2074 0.275189989  2095  Q  WS 83979688 + 64 [qemu-kvm]
  8,16   2 2075 0.275192534  2095  G  WS 83979688 + 64 [qemu-kvm]
  8,16   2 2076 0.275193909  2095  I   W 83979688 + 64 [qemu-kvm]
  8,16   20 0.275195609 0  m   N cfq2095S / insert_request
  8,16   20 0.275196404 0  m   N cfq2095S / add_to_rr
  8,16   20 0.275198004 0  m   N cfq2095S / preempt
  8,16   20 0.275198688 0  m   N cfq2067S / slice expired t=1
  8,16   20 0.275199631 0  m   N cfq2067S / resid=100
  8,16   20 0.275200413 0  m   N cfq2067S / sl_used=1
  8,16   20 0.275201521 0  m   N / served: vt=1671968768 
min_vt=1671966720
  8,16   20 0.275202323 0  m   N cfq2067S / del_from_rr
  8,16   20 0.275204263 0  m   N cfq2095S / set_active 
wl_prio:0 wl_type:2
  8,16   20 0.275205131 0  m   N cfq2095S / fifo=(null)
  8,16   20 0.275205851 0  m   N cfq2095S / dispatch_insert
  8,16   20 0.275207121 0  m   N cfq2095S / dispatched a request
  8,16   20 0.275207873 0  m   N cfq2095S / activate rq, drv=1
  8,16   2 2077 0.275208198  2095  D   W 83979688 + 64 [qemu-kvm]
  8,16   2 2078 0.275269567  2095  U   N [qemu-kvm] 2
  8,16   4  836 0.275483550 0  C   W 83979688 + 64 [0]
  8,16   40 0.275496745 0  m   N cfq2095S / complete rqnoidle 0
  8,16   40 0.275497825 0  m   N cfq2095S / set_slice=100
  8,16   40 0.275499512 0  m   N cfq2095S / arm_idle: 8
  8,16   40 0.275499862 0  m   N cfq schedule dispatch
  8,16   6   85 0.275626195  2067  A  WS 83979752 + 64 <- (8,26) 40064
  8,16   6   86 0.275626598  2067  Q  WS 83979752 + 64 [qemu-kvm]
  8,16   6   87 0.275628580  2067  G  WS 83979752 + 64 [qemu-kvm]
  8,16   6   88 0.275629630  2067  I   W 83979752 + 64 [qemu-kvm]
  8,16   60 0.275631047 0  m   N cfq2067S / insert_request
  8,16   60 0.275631730 0  m   N cfq2067S / add_to_rr
  8,16   60 0.275633567 0  m   N cfq2067S / preempt
  8,16   60 0.275634275 0  m   N cfq2095S / slice expired t=1
  8,16   60 0.275635285 0  m   N cfq2095S / resid=100
  8,16   60 0.275635985 0  m   N cfq2095S / sl_used=1
  8,16   60 0.275636882 0  m   N / served: vt=1671970816 
min_vt=1671968768
  8,16   60 0.275637585 0  m   N cfq2095S / del_from_rr
  8,16   60 0.275639382 0  m   N cfq2067S / set_active 
wl_prio:0 wl_type:2
  8,16   60 0.275640222 0  m   N cfq2067S / fifo=(null)
  8,16   60 0.275640809 0  m   N cfq2067S / dispatch_insert
  8,16   60 0.275641929 0  m   N cfq2067S / dispatched a request
  8

Re: [PATCH 02/10] Driver core: Add iommu_ops to bus_type

2011-09-08 Thread Joerg Roedel
On Wed, Sep 07, 2011 at 04:37:11PM -0400, Don Dutile wrote:
> On 09/07/2011 03:44 PM, Greg KH wrote:

>> Why are you pushing this down into the driver core?  What other busses
>> besides PCI use/need this?
>>
>> If you can have a different IOMMU driver on the same bus, then wouldn't
>> this be a per-device thing instead of a per-bus thing?
>>
> And given the dma api takes a struct device *, it'd be more efficient
> to be tied into the device structure.

More efficient in what way? Currently no dma-api implementation uses the
iommu-api directly. It is planned to change that, of course. But the
dma-api implementation will use the iommu_map/iommu_unmap functions which
get an iommu_domain parameter and no device structure. That's why the
domain structure keeps an iommu_ops pointer too.

> Device structure would get iommu ops set by parent(bus);
> if a bus (segment) doesn't provide a unique/different/layered IOMMU
> than the parent bus, it inherits the parent's iommu-ops.
> setting the iommu-ops in the root bus struct, seeds the iommu-ops
> for the (PCI) tree.
>
> For intel & amd IOMMUs, in early pci (bios,root?) init, you would
> seed the pci root busses with appropriate IOMMU support (based on
> dmar/drhd & ivrs/ivhd data structures, respectively), and
> then modify the PCI code to do the inheritance (PPB code inherits
> unless specific device driver for a given PPB vid-did loads a
> different iommu-ops for that segment/branch).
>
> This would enable different types of IOMMUs for different devices
> (or PCI segments, or branches of PCI trees) that are designed for
> different tasks -- simple IOMMUs for legacy devices; complicated, 
> io-page-faulting
> IOMMUs for plug-in, high-end devices on virtualizing servers for PCI (SRIOV) 
> endpoints.

This only works for pci, but the iommu-api is not pci-only.

> and as Greg indicates, is only relevant to PCI.

We already have non-pci iommu drivers implementing the iommu-api. So
this is certainly relevant outside pci too.

Joerg



Re: [PATCH 02/10] Driver core: Add iommu_ops to bus_type

2011-09-08 Thread Joerg Roedel
On Wed, Sep 07, 2011 at 12:44:45PM -0700, Greg KH wrote:
> On Wed, Sep 07, 2011 at 09:19:19PM +0200, Joerg Roedel wrote:
> > Hi Greg,
> > 
> > the bus_set_iommu() function will be called by the IOMMU driver. There
> > can be different drivers for the same bus, depending on the hardware. On
> > PCI for example, there can be the Intel or the AMD IOMMU driver that
> > implement the iommu-api and that register for that bus.
> 
> Why are you pushing this down into the driver core?  What other busses
> besides PCI use/need this?

Currently it is the platform_bus besides pci. The pci iommus are on x86
and ia64 while all arm iommus use the platform_bus (by 'iommus' I only
mean those implementing the iommu-api). Currently there are two drivers
for arm iommus in /drivers/iommu.

> If you can have a different IOMMU driver on the same bus, then wouldn't
> this be a per-device thing instead of a per-bus thing?

Well, I haven't seen a system yet where multiple iommus are on the same
bus. Or to state it better, multiple iommus of different type that
require different drivers. There is no 1-1 mapping between real hardware
iommus and iommu_ops. There is only such a mapping for iommu drivers and
iommu_ops. An iommu driver usually handles all hardware iommus of the
same type in the system.

So having iommu_ops per-device doesn't make much sense at this point.
With this patch-set they are accessible by dev->bus->iommu_ops anyway.
But if I am wrong on this I can change this of course.

This patch-set improves the current situation where only one
iommu driver is allowed to be active on a system (because of the global
iommu_ops). But the main reason to put this into the bus_type structure
is that it allows generalizing the device-handling on a bus between
iommu drivers.


> 
> 
> > On Wed, Sep 07, 2011 at 11:47:50AM -0700, Greg KH wrote:
> > > > +#ifdef CONFIG_IOMMU_API
> > > > +int bus_set_iommu(struct bus_type *bus, struct iommu_ops *ops)
> > > > +{
> > > > +   if (bus->iommu_ops != NULL)
> > > > +   return -EBUSY;
> > > 
> > > Busy?
> > 
> > Yes, it signals to the IOMMU driver that another driver has already
> > registered for that bus. In the previous register_iommu() interface this
> > was just a BUG(), but I think returning an error to the caller is
> > better. It can be turned back into a BUG() if it is considered better,
> > though.
> 
> Can you ever have more than one IOMMU driver per bus?  If so, this seems
> wrong (see above.)

As I said, I haven't seen such systems. But if they exist or are planned
I am happy to redesign the whole thing.

> > The IOMMUs are usually devices on the bus itself, so they are
> > initialized after the bus is set up and the devices on it are
> > populated.  So the function can not be called on bus initialization
> > because the IOMMU is not ready at this point.
> 
> Ok, that makes more sense, please state as much in the documentation.

Will do, thanks.


Joerg



Re: [PATCH 08/13] xen/pvticketlock: disable interrupts while blocking

2011-09-08 Thread Avi Kivity

On 09/07/2011 10:09 PM, Jeremy Fitzhardinge wrote:

On 09/07/2011 10:41 AM, Avi Kivity wrote:
>>  Hm, I'm interested to know what you're thinking in more detail.  Can you
>>  leave an NMI pending before you block in the same way you can with
>>  "sti;hlt" with normal interrupts?
>
>
>  Nope.  But you can do
>
> if (regs->rip in critical section)
> regs->rip = after_halt;
>
>  and effectively emulate it.  The critical section is something like
>
>  critical_section_start:
>  if (woken_up)
>  goto critical_section_end;
>  hlt
>  critical_section_end:

Hm.  It's a pity you have to deliver an actual interrupt to implement
the kick though.


I don't think it's that expensive, especially compared to the 
double-context-switch and vmexit of the spinner going to sleep.  On AMD 
we do have to take an extra vmexit (on IRET) though.



>>
>>  I was thinking you might want to do something with monitor/mwait to
>>  implement the blocking/kick ops. (Handwave)
>>
>
>  monitor/mwait are incredibly expensive to virtualize since they
>  require write-protecting a page, IPIs flying everywhere and flushing
>  tlbs, not to mention my lovely hugepages being broken up mercilessly.

Or what about a futex-like hypercall?



Well we could have a specialized sleep/wakeup hypercall pair like Xen, 
but I'd like to avoid it if at all possible.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [RFC PATCH 4/5] VFIO: Add PCI device support

2011-09-08 Thread Avi Kivity

On 09/07/2011 09:55 PM, Konrad Rzeszutek Wilk wrote:

> If you don't know what to do here, say N.
> +
> +menuconfig VFIO_PCI
> +	bool "VFIO support for PCI devices"
> +	depends on VFIO && PCI
> +	default y if X86

Hahah.. And Linus is going to tear your behind for that.

Default should be 'n'


It depends on VFIO, which presumably defaults to n.
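In Kconfig terms, a symbol's default is only reachable once its dependencies are satisfied, so `default y if X86` on VFIO_PCI has no effect while VFIO is off. A sketch (the VFIO prompt text here is made up for illustration):

```kconfig
# Hypothetical fragment: VFIO itself has no "default" line, so it is n
# unless the user turns it on.
menuconfig VFIO
	tristate "VFIO userspace driver framework"

# VFIO_PCI is gated entirely by its dependencies; the "default y" is
# only evaluated once VFIO && PCI already hold.
menuconfig VFIO_PCI
	bool "VFIO support for PCI devices"
	depends on VFIO && PCI
	default y if X86
```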

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
