KSM memory saving amount
Hi all,

Running an Ubuntu Natty host where I launched 3 Xubuntu guests and 1 Windows 7 guest, I can see:

/sys/kernel/mm/ksm/full_scans: 34
/sys/kernel/mm/ksm/pages_shared: 8020
/sys/kernel/mm/ksm/pages_sharing: 36098
/sys/kernel/mm/ksm/pages_to_scan: 100
/sys/kernel/mm/ksm/pages_unshared: 247109
/sys/kernel/mm/ksm/pages_volatile: 32716
/sys/kernel/mm/ksm/run: 1
/sys/kernel/mm/ksm/sleep_millisecs: 200

I would like to evaluate the amount of memory saved: how many bytes is a page?

-- RMA.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
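To sketch an answer: the page size on x86 is typically 4096 bytes (`getconf PAGE_SIZE` reports the real value), and the kernel's KSM documentation describes pages_sharing as the number of additional mappings that reuse an already-merged page, so it approximates the saving. A small illustrative sketch — the helper names are mine, not a standard API:

```python
import os

PAGE_SIZE = 4096  # typical on x86; os.sysconf("SC_PAGE_SIZE") reports the real value

def ksm_saved_bytes(pages_sharing, page_size=PAGE_SIZE):
    # pages_shared counts the merged pages kept in memory; pages_sharing
    # counts the mappings that reuse them, i.e. pages NOT duplicated.
    return pages_sharing * page_size

def read_ksm_counter(name, ksm_dir="/sys/kernel/mm/ksm"):
    # e.g. read_ksm_counter("pages_sharing") on a live host with KSM enabled
    with open(os.path.join(ksm_dir, name)) as f:
        return int(f.read())

# With the values reported above (36098 sharing mappings):
print("%.1f MiB saved" % (ksm_saved_bytes(36098) / 1048576.0))
```

With the counters shown in the report, that works out to roughly 141 MiB deduplicated across the four guests.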
Re: [net-next-2.6 PATCH 0/3 RFC] macvlan: MAC Address filtering support for passthru mode
On Thu, Sep 08, 2011 at 07:53:11PM -0700, Roopa Prabhu wrote:
> >> Phase 1: Goal: Enable hardware filtering for all macvlan modes
> >> - In macvlan passthru mode the single guest virtio-nic connected will
> >> receive the traffic that it requested
> >> - In macvlan non-passthru mode all guest virtio-nics sharing the
> >> physical nic will see all other guest traffic,
> >> but the filtering at the guest virtio-nic
> >
> > I don't think guests currently filter anything.
> >
> I was referring to qemu-kvm virtio-net in
> virtio_net_receive->receive_filter. I think it only passes pkts that the
> guest OS is interested in. It uses the filter table that I am passing to
> macvtap in this patch.

This happens after the userspace thread gets woken up and the data is copied there. So relying on filtering at that level is going to be very inefficient on a system with multiple active guests. Further, and for that reason, vhost-net doesn't do filtering at all, relying on the backends to pass it correct packets.

-- MST
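For context, the kind of destination-MAC check being discussed (qemu's receive_filter, driven by the table handed down with TUNSETTXFILTER) can be sketched roughly as follows. This is a simplified Python illustration, not the actual qemu C code; the dict layout is my own:

```python
def mac_filter_pass(dst_mac, filt):
    """Decide whether a received frame should be delivered to the guest.

    filt mimics the filter table discussed in the thread: a promiscuous
    flag, an all-multicast flag, and sets of exact 6-byte unicast and
    multicast addresses. Sketch only.
    """
    if filt["promisc"]:
        return True                              # accept everything
    if dst_mac == b"\xff\xff\xff\xff\xff\xff":
        return True                              # broadcast always passes
    if dst_mac[0] & 1:                           # I/G bit set => multicast
        return filt["allmulti"] or dst_mac in filt["multicast"]
    return dst_mac in filt["unicast"]            # exact unicast match
```

The point MST makes above is that running this check per-packet in userspace (after a wakeup and a copy) is far more expensive than letting the NIC drop the frame in hardware.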
Re: [PATCH] qemu-kvm: Resolve PCI upstream diffs
On Thu, Sep 08, 2011 at 12:48:02PM +0200, Jan Kiszka wrote:
> Resolve all unneeded deviations from upstream code. No functional
> changes.
>
> Signed-off-by: Jan Kiszka

Acked-by: Michael S. Tsirkin

> ---
>  hw/pci.c | 11 +++
>  hw/pci.h |  5 -
>  2 files changed, 11 insertions(+), 5 deletions(-)
>
> diff --git a/hw/pci.c b/hw/pci.c
> index e4c166a..4d8845c 100644
> --- a/hw/pci.c
> +++ b/hw/pci.c
> @@ -2106,8 +2106,6 @@ int pci_add_capability(PCIDevice *pdev, uint8_t cap_id,
>      memset(pdev->wmask + offset, 0, size);
>      /* Check capability by default */
>      memset(pdev->cmask + offset, 0xFF, size);
> -
> -
>      return offset;
>  }
>
> @@ -2125,9 +2123,14 @@ void pci_del_capability(PCIDevice *pdev, uint8_t cap_id, uint8_t size)
>      memset(pdev->cmask + offset, 0, size);
>      memset(pdev->used + offset, 0, size);
>
> -    if (!pdev->config[PCI_CAPABILITY_LIST]) {
> +    if (!pdev->config[PCI_CAPABILITY_LIST])
>          pdev->config[PCI_STATUS] &= ~PCI_STATUS_CAP_LIST;
> -    }
> +}
> +
> +/* Reserve space for capability at a known offset (to call after load). */
> +void pci_reserve_capability(PCIDevice *pdev, uint8_t offset, uint8_t size)
> +{
> +    memset(pdev->used + offset, 0xff, size);
>  }
>
>  uint8_t pci_find_capability(PCIDevice *pdev, uint8_t cap_id)
> diff --git a/hw/pci.h b/hw/pci.h
> index da0c2d2..70fcd9c 100644
> --- a/hw/pci.h
> +++ b/hw/pci.h
> @@ -147,7 +147,7 @@ struct PCIDevice {
>      /* Used to implement RW1C(Write 1 to Clear) bytes */
>      uint8_t *w1cmask;
>
> -    /* Used to allocate config space and track capabilities. */
> +    /* Used to allocate config space for capabilities. */
>      uint8_t *used;
>
>      /* the following fields are read only */
> @@ -230,8 +230,11 @@ int pci_add_capability(PCIDevice *pdev, uint8_t cap_id,
>
>  void pci_del_capability(PCIDevice *pci_dev, uint8_t cap_id, uint8_t cap_size);
>
> +void pci_reserve_capability(PCIDevice *pci_dev, uint8_t offset, uint8_t size);
> +
>  uint8_t pci_find_capability(PCIDevice *pci_dev, uint8_t cap_id);
>
> +
>  uint32_t pci_default_read_config(PCIDevice *d,
>                                   uint32_t address, int len);
>  void pci_default_write_config(PCIDevice *d,
> --
> 1.7.3.4
Re: [net-next-2.6 PATCH 0/3 RFC] macvlan: MAC Address filtering support for passthru mode
On 9/8/2011 8:00 PM, Roopa Prabhu wrote:
> On 9/8/11 12:33 PM, "Michael S. Tsirkin" wrote:
>> On Thu, Sep 08, 2011 at 12:23:56PM -0700, Roopa Prabhu wrote:
>>>> I think the main usecase for passthru mode is to assign a SR-IOV VF
>>>> to a single guest.
>>>
>>> Yes and for the passthru usecase this patch should be enough to enable
>>> filtering in hw (eventually like I indicated before I need to fix vlan
>>> filtering too).
>>
>> So with filtering in hw, and in the sriov VF case, VFs actually share a
>> filtering table. How will that be partitioned?
>
> AFAIK, though it might maintain a single filter table space in hw, hw
> does know which filter belongs to which VF. And the OS driver does not
> need to do anything special. The VF driver exposes a VF netdev. And any
> uc/mc addresses registered with a VF netdev are registered with the hw
> by the driver. And hw will filter and send only pkts that the VF has
> expressed interest in.

Does your NIC & driver support adding multiple mac addresses to a VF? I have tried a few other SR-IOV NICs sometime back and they didn't support this feature. Currently, we don't have an interface to add multiple mac addresses to a netdev other than an indirect way of creating a macvlan i/f on top of it.

Thanks
Sridhar

> No special filter partitioning in hw is required.
>
> Thanks,
> Roopa
Re: [net-next-2.6 PATCH 0/3 RFC] macvlan: MAC Address filtering support for passthru mode
On 9/8/11 12:33 PM, "Michael S. Tsirkin" wrote:
> On Thu, Sep 08, 2011 at 12:23:56PM -0700, Roopa Prabhu wrote:
>>>
>>> I think the main usecase for passthru mode is to assign a SR-IOV VF to
>>> a single guest.
>>>
>> Yes and for the passthru usecase this patch should be enough to enable
>> filtering in hw (eventually like I indicated before I need to fix vlan
>> filtering too).
>
> So with filtering in hw, and in sriov VF case, VFs
> actually share a filtering table. How will that
> be partitioned?

AFAIK, though it might maintain a single filter table space in hw, hw does know which filter belongs to which VF. And the OS driver does not need to do anything special. The VF driver exposes a VF netdev. And any uc/mc addresses registered with a VF netdev are registered with the hw by the driver. And hw will filter and send only pkts that the VF has expressed interest in. No special filter partitioning in hw is required.

Thanks,
Roopa
Re: [net-next-2.6 PATCH 0/3 RFC] macvlan: MAC Address filtering support for passthru mode
On 9/8/11 12:11 PM, "Michael S. Tsirkin" wrote: > On Thu, Sep 08, 2011 at 09:19:32AM -0700, Roopa Prabhu wrote: > There are more features we'll want down the road though, > so let's see whether the interface will be able to > satisfy them in a backwards compatible way before we > set it in stone. Here's what I came up with: > > How will the filtering table be partitioned within guests? Since this patch supports macvlan PASSTHRU mode only, in which the lower device has a 1-1 mapping to the guest nic, it does not require any partitioning of the filtering table within guests. Unless I missed understanding something. If the lower device were being shared by multiple guest network interfaces (non PASSTHRU mode), only then we will need to maintain separate filter tables for each guest network interface in macvlan and forward the pkt to the respective guest interface after a filter lookup. This could affect performance too I think. >>> >>> Not with hardware filtering support. Which is where we'd need to >>> partition the host nic mac table between guests. >>> >> I need to understand this more. In the non passthru case when a VF or physical >> nic is shared between guests, > > For example, consider a VF given to each guest. Hardware supports a fixed > total number of filters, which can be partitioned between VFs. > Oh, ok. But hw maintains VF filters separately for every VF as far as I know. Filters received on a VF are programmed for that VF only. I am assuming all hardware does this. At least our hardware does this. What I was referring to was a single VF shared between guests using macvtap (could be bridge mode for example). All guests sharing the VF will register filters with the VF via macvlan. Hw makes sure whatever the VF asked for is received at the VF. The VF in hw does not know that it is shared by guests. Only at macvlan we might need to re-filter the pkts received on the VF and steer pkts to the individual guests based on what they asked for. 
>> the nic does not really know about the guests, >> so I was thinking we do the same thing as we do for the passthru case (ie >> send all the address filters from macvlan to the physical nic). So at the >> hardware, filtering is done for all guests sharing the nic. But if we want >> each virtio-net nic or guest to get exactly what it asked for >> macvlan/macvtap needs to maintain a copy of each guest filter and do a >> lookup and send only the requested traffic to the guest. Here is the >> performance hit that I was seeing. Please see my next comment for further >> details. > > It won't be any slower than attaching a non-passthrough macvlan > to a device, will it? > Am not sure. The filter lookup in macvlan is the one I am concerned about. Will need to try it out. >> I chose to support PASSTHRU Mode only at first because its simpler and all code additions are in control path only. >>> >>> I agree. It would be a bit silly to have a dedicated interface >>> for passthough and a completely separate one for >>> non passthrough. >>> >> Agree. The reason I did not focus on the non-passthru case in the initial >> version was because I was thinking things to do in the non-passthru case >> will be just add-ons to the passthru case. But true. Better to flesh out the >> non-passthru case details. >> >> After dwelling on this a bit more how about the below: >> >> Phase 1: Goal: Enable hardware filtering for all macvlan modes >> - In macvlan passthru mode the single guest virtio-nic connected will >> receive traffic that he requested for >> - In macvlan non-passthru mode all guest virtio-nics sharing the >> physical nic will see all other guest traffic >> but the filtering at guest virtio-nic > > I don't think guests currently filter anything. > I was referring to qemu-kvm virtio-net in virtio_net_receive->receive_filter. I think it only passes pkts that the guest OS is interested in. It uses the filter table that I am passing to macvtap in this patch. 
>> will make sure each guest >> eventually sees traffic he asked for. This is still better than >> putting the physical nic in promiscuous mode. >> >> (This is mainly what my patch does...but will need to remove the passthru >> check and see if there are any thing else needed for non-passthru case) > > I'm fine with sticking with passthrough, make non passthrough > a separate phase. > Ok. >> >> Phase 2: Goal: Enable filtering at macvlan so that each guest virtio-nic >> receives only what he requested for. >> - In this case, in addition to pushing the filters down to the physical >> nic we will have to maintain the same filter in macvlan and do a filter >> lookup before forwarding the traffic to a virtio-nic. >> >> But I am thinking phase 2 might be redundant given virtio-nic already does >> filtering for the guest. > > It does? Do you mean the filter that qemu does in userspace?
Re: [net-next-2.6 PATCH 0/3 RFC] macvlan: MAC Address filtering support for passthru mode
On Thu, 2011-09-08 at 09:19 -0700, Roopa Prabhu wrote: > > > On 9/8/11 4:08 AM, "Michael S. Tsirkin" wrote: > > > On Wed, Sep 07, 2011 at 10:20:28PM -0700, Roopa Prabhu wrote: > >> On 9/7/11 5:34 AM, "Michael S. Tsirkin" wrote: > >> > >>> On Tue, Sep 06, 2011 at 03:35:40PM -0700, Roopa Prabhu wrote: > This patch is an attempt at providing address filtering support for > macvtap > devices in PASSTHRU mode. Its still a work in progress. > Briefly tested for basic functionality. Wanted to get some feedback on > the > direction before proceeding. > > >>> > >>> Good work, thanks. > >>> > >> > >> Thanks. > >> > I have hopefully CC'ed all concerned people. > >>> > >>> kvm crowd might also be interested. > >>> Try using ./scripts/get_maintainer.pl as well. > >>> > >> Thanks for the tip. Expanded CC list a bit more. > >> > PASSTHRU mode today sets the lowerdev in promiscous mode. In PASSTHRU > mode > there is a 1-1 mapping between macvtap device and physical nic or VF. And > all > filtering is done in lowerdev hw. The lowerdev does not need to be in > promiscous mode as long as the guest filters are passed down to the > lowerdev. > This patch tries to remove the need for putting the lowerdev in > promiscous > mode. > I have also referred to the thread below where TUNSETTXFILTER was > mentioned > in > this context: > http://patchwork.ozlabs.org/patch/69297/ > > This patch basically passes the addresses got by TUNSETTXFILTER to > macvlan > lowerdev. > > I have looked at previous work and discussions on this for qemu-kvm > by Michael Tsirkin, Alex Williamson and Dragos Tatulea > http://patchwork.ozlabs.org/patch/78595/ > http://patchwork.ozlabs.org/patch/47160/ > https://patchwork.kernel.org/patch/474481/ > > Redhat bugzilla by Michael Tsirkin: > https://bugzilla.redhat.com/show_bug.cgi?id=655013 > > I used Michael's qemu-kvm patch for testing the changes with KVM > > I would like to cover both MAC and vlan filtering in this work. 
> > Open Questions/Issues: > - There is a need for vlan filtering to complete the patch. It will > require > a new tap ioctl cmd for vlans. > Some ideas on this are: > > a) TUNSETVLANFILTER: This will entail we send the whole vlan bitmap > filter > (similar to tun_filter for addresses). Passing the vlan id's to lower > device will mean going thru the whole list of vlans every time. > > OR > > b) TUNSETVLAN with vlan id and flag to set/unset > > Does option 'b' sound ok ? > > - In this implementation we make the macvlan address list same as the > address > list that came in the filter with TUNSETTXFILTER. This will not cover > cases > where the macvlan device needs to have other addresses that are not > necessarily in the filter. Is this a problem ? > >>> > >>> What cases do you have in mind? > >>> > >> This patch targets only macvlan PASSTHRU mode and for PASSTHRU mode I don't > >> see a problem with uc/mc address list being the same in all the stacked > >> netdevs in the path. I called that out above to make sure I was not missing > >> any case in PASSTHRU mode where this might be invalid. Otherwise I don't > >> see > >> a problem in the simple PASSTHRU use case this patch supports. > >> > - The patch currently only supports passing of IFF_PROMISC and > IFF_MULTICAST > filter flags to lowerdev > > This patch series implements the following > 01/3 - macvlan: Add support for unicast filtering in macvlan > 02/3 - macvlan: Add function to set addr filter on lower device in > passthru > mode > 03/3 - macvtap: Add support for TUNSETTXFILTER > > Please comment. Thanks. > > Signed-off-by: Roopa Prabhu > Signed-off-by: Christian Benvenuti > Signed-off-by: David Wang > >>> > >>> The security isn't lower than with promisc, so I don't see > >>> a problem with this as such. 
> >>> > >>> There are more features we'll want down the road though, > >>> so let's see whether the interface will be able to > >>> satisfy them in a backwards compatible way before we > >>> set it in stone. Here's what I came up with: > >>> > >>> How will the filtering table be partitioned within guests? > >> > >> Since this patch supports macvlan PASSTHRU mode only, in which the lower > >> device has 1-1 mapping to the guest nic, it does not require any > >> partitioning of filtering table within guests. Unless I missed > >> understanding > >> something. > >> If the lower device were being shared by multiple guest network interfaces > >> (non PASSTHRU mode), only then we will need to maintain separate filter > >> tables for each guest network interface in macvlan and fo
Re: [PATCH 1/2] kvm tools: fix repeated io emulation
On 08/18/2011 11:08 PM, Avi Kivity wrote:
> On 08/18/2011 12:35 AM, Sasha Levin wrote:
>> On Thu, 2011-08-18 at 09:13 +0300, Pekka Enberg wrote:
>> > Hi,
>> >
>> > On Thu, Aug 18, 2011 at 6:06 AM, Xiao Guangrong wrote:
>> > > When kvm emulates a repeated io read instruction, it can exit to
>> > > user-space with 'count' > 1, we need to emulate the io access many
>> > > times
>> > >
>> > > Signed-off-by: Xiao Guangrong
>> >
>> > The KVM tool is not actually maintained by Avi and Marcelo but by me
>> > and a few others. Our git repository is here:
>> >
>> > https://github.com/penberg/linux-kvm
>> >
>> > Ingo pulls that to -tip a few times a week or so. Sasha, can you please
>> > take a look at these patches and if you're OK with them, I'll apply
>> > them.
>>
>> Pekka,
>>
>> I can only assume they're right, 'count' isn't documented anywhere :)
>>
>> If any of the KVM maintainers could confirm it I'll add it into the docs.
>
> Count is indeed the number of repetitions.

Hi Pekka,

Could you pick up this patchset please?
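The idea behind the fix can be sketched as follows: when a rep-prefixed I/O instruction exits to userspace with count > 1, the emulator must replay the access count times, advancing through the shared data buffer each time. This is illustrative Python, not the kvm tool's actual C code; the field names loosely mimic struct kvm_run's io block:

```python
def emulate_pio_exit(io, handle_io):
    """Replay a repeated (rep-prefixed) port I/O exit.

    io is a dict standing in for kvm_run's io fields: "direction",
    "port", "size", "count", and the shared "data" buffer laid out as
    count back-to-back chunks of size bytes. handle_io(port, direction,
    chunk) performs a single access. Sketch only; names are illustrative.
    """
    data = io["data"]
    for i in range(io["count"]):          # 'count' = number of repetitions
        off = i * io["size"]
        chunk = data[off:off + io["size"]]
        handle_io(io["port"], io["direction"], chunk)
```

(A real implementation would also copy device data back into the buffer for the "in" direction; the point here is only the count-times iteration that the patch adds.)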
Re: [PATCH] KVM-test: Add two scripts to disable services for perf tests
- Original Message -
> On Wed, 07 Sep 2011 14:07:49 +0800, Amos Kong wrote:
>
> > System services on guest and host take uncertain resources, which
> > affects the perf results. We can use the below two scripts to disable
> > some services of host and guest.
> >
> > stop_serivices_perf.sh is used to stop the running services.
> > off_service_perf.sh is used to turn off services when the host starts up.
> >
> > We can use them to prepare the environment for performance testcases.
>
> Which environment? I assume this should be secured by a
> distribution-specific guard: prepare a clean testing environment.
> E.g.
>
> if [ -f /etc/redhat-release ]; then

yes.
Where to file bug report?
Yesterday I had a second VM guest fail. The day before, one failed in the same way. Both suddenly had read-only file systems. In both cases these are the last lines in the syslog:

Sep 7 21:19:59 dev init: tty (/dev/tty1) main process ended, respawning
Sep 7 21:59:23 dev kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Sep 7 21:59:23 dev kernel: ata1.00: failed command: WRITE DMA
Sep 7 21:59:23 dev kernel: ata1.00: cmd ca/00:08:a0:02:31/00:00:00:00:00/e3 tag 0 dma 4096 out
Sep 7 21:59:23 dev kernel: res 40/00:01:00:00:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
Sep 7 21:59:23 dev kernel: ata1.00: status: { DRDY }
Sep 7 21:59:23 dev kernel: ata1: soft resetting link
Sep 7 21:59:23 dev kernel: ata1.00: configured for MWDMA2
Sep 7 21:59:23 dev kernel: ata1.00: device reported invalid CHS sector 0
Sep 7 21:59:23 dev kernel: ata1: EH complete

I assume this is the reason the root was suddenly read-only. I am running updated CentOS 6 on the host and all guests. Which list should I search and/or report this to? Kernel or kvm?

Emmett
Re: [RFC PATCH 4/5] VFIO: Add PCI device support
On Thu, 2011-09-08 at 10:52 +0300, Avi Kivity wrote:
> On 09/07/2011 09:55 PM, Konrad Rzeszutek Wilk wrote:
> > >  If you don't know what to do here, say N.
> > > +
> > > +menuconfig VFIO_PCI
> > > +	bool "VFIO support for PCI devices"
> > > +	depends on VFIO && PCI
> > > +	default y if X86
> >
> > Hahah.. And Linus is going to tear your behind for that.
> >
> > Default should be 'n'
>
> It depends on VFIO, which presumably defaults to n.

Yes, exactly.
[PATCH 3/3] KVM test: Introduce migrate.with_blkdebug variant
In order to solve task http://autotest.kernel.org/ticket/50, create a migrate.with_blkdebug variant that will migrate the vm with a state check before and after migration.

As the reporter of the original problem on task 50 said, indeed we have a lot of migration state mismatch problems with current qemu-kvm.

Signed-off-by: Lucas Meneghel Rodrigues
---
 client/tests/kvm/subtests.cfg.sample | 6 ++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/client/tests/kvm/subtests.cfg.sample b/client/tests/kvm/subtests.cfg.sample
index 9d9dbba..d79b3af 100644
--- a/client/tests/kvm/subtests.cfg.sample
+++ b/client/tests/kvm/subtests.cfg.sample
@@ -164,6 +164,12 @@ variants:
                 mig_cancel = yes
                 variants:
                     - @default:
+                    - with_blkdebug:
+                        drive_blkdebug_image1 = blkdebug/default.conf
+                        drive_rerror_image1 = stop
+                        drive_werror_image1 = stop
+                        vmstate_check = yes
+                        migration_living_guest = no
                     - with_set_speed:
                         mig_speed = 1G
                         pre_migrate = "mig_set_speed"
--
1.7.6
[PATCH 2/3] KVM test: migration subtest - Introduce migration_living_guest param
In order to test a special guest configuration (special blkdebug config files to inject errors on the block subsystem), added a migration_living_guest param. If the param is set to "no", no attempt to open a session on the guest will be made, since the guest block subsystem will stop when the first error is injected.

Signed-off-by: Lucas Meneghel Rodrigues
---
 client/tests/kvm/tests/migration.py | 78 ---
 1 files changed, 45 insertions(+), 33 deletions(-)

diff --git a/client/tests/kvm/tests/migration.py b/client/tests/kvm/tests/migration.py
index 7e877fe..1e02784 100644
--- a/client/tests/kvm/tests/migration.py
+++ b/client/tests/kvm/tests/migration.py
@@ -36,34 +36,40 @@ def run_migration(test, params, env):
         mig_speed = params.get("mig_speed", "1G")
         return vm.monitor.migrate_set_speed(mig_speed)
 
-    vm = env.get_vm(params["main_vm"])
-    vm.verify_alive()
-    timeout = int(params.get("login_timeout", 360))
-    session = vm.wait_for_login(timeout=timeout)
-
     mig_timeout = float(params.get("mig_timeout", "3600"))
     mig_protocol = params.get("migration_protocol", "tcp")
     mig_cancel_delay = int(params.get("mig_cancel") == "yes") * 2
     offline = params.get("offline", "no") == "yes"
     check = params.get("vmstate_check", "no") == "yes"
+    living_guest_os = params.get("migration_living_guest", "yes") == "yes"
+
+    vm = env.get_vm(params["main_vm"])
+    vm.verify_alive()
+    timeout = int(params.get("login_timeout", 360))
+    if living_guest_os:
+        session = vm.wait_for_login(timeout=timeout)
 
     # Get the output of migration_test_command
-    test_command = params.get("migration_test_command")
-    reference_output = session.cmd_output(test_command)
+    if living_guest_os:
+        test_command = params.get("migration_test_command")
+        reference_output = session.cmd_output(test_command)
 
     # Start some process in the background (and leave the session open)
-    background_command = params.get("migration_bg_command", "")
-    session.sendline(background_command)
-    time.sleep(5)
+    if living_guest_os:
+        background_command = params.get("migration_bg_command", "")
+        session.sendline(background_command)
+        time.sleep(5)
 
     # Start another session with the guest and make sure the background
     # process is running
-    session2 = vm.wait_for_login(timeout=timeout)
+    if living_guest_os:
+        session2 = vm.wait_for_login(timeout=timeout)
 
     try:
-        check_command = params.get("migration_bg_check_command", "")
-        session2.cmd(check_command, timeout=30)
-        session2.close()
+        if living_guest_os:
+            check_command = params.get("migration_bg_check_command", "")
+            session2.cmd(check_command, timeout=30)
+            session2.close()
 
         # run some functions before migrate start.
         pre_migrate = get_functions(params.get("pre_migrate"), locals())
@@ -79,32 +85,38 @@ def run_migration(test, params, env):
             func()
 
         # Log into the guest again
-        logging.info("Logging into guest after migration...")
-        session2 = vm.wait_for_login(timeout=30)
-        logging.info("Logged in after migration")
+        if living_guest_os:
+            logging.info("Logging into guest after migration...")
+            session2 = vm.wait_for_login(timeout=30)
+            logging.info("Logged in after migration")
 
         # Make sure the background process is still running
-        session2.cmd(check_command, timeout=30)
+        if living_guest_os:
+            session2.cmd(check_command, timeout=30)
 
         # Get the output of migration_test_command
-        output = session2.cmd_output(test_command)
+        if living_guest_os:
+            output = session2.cmd_output(test_command)
 
         # Compare output to reference output
-        if output != reference_output:
-            logging.info("Command output before migration differs from "
-                         "command output after migration")
-            logging.info("Command: %s", test_command)
-            logging.info("Output before:" +
-                         virt_utils.format_str_for_message(reference_output))
-            logging.info("Output after:" +
-                         virt_utils.format_str_for_message(output))
-            raise error.TestFail("Command '%s' produced different output "
-                                 "before and after migration" % test_command)
+        if living_guest_os:
+            if output != reference_output:
+                logging.info("Command output before migration differs from "
+                             "command output after migration")
+                logging.info("Command: %s", test_command)
+                logging.info("Output before:" +
+                             virt_utils.format_str_for_message(reference_output))
+                logging.info("Output after:" +
[PATCH 0/3] Introduce blkdebug support to KVM autotest + migration test
In order to solve task http://autotest.kernel.org/ticket/50, implemented support for providing blkdebug files to arbitrary guest images in autotest. Also modified the migrate test a bit and introduced the migrate.with_blkdebug variant, which tries to mimic as much as possible the testcase described in ticket 50.

Kevin, sorry that I took too long to work on this, let me know if you see any obvious problems with the implementation. With this we can have people reproducing the mentioned test case with a stock autotest clone. Indeed we see state migration mismatch problems running the testcase:

Test  Status  Seconds  Info
(Result file: ../../results/default/status)
Fedora.15.64.migrate.with_blkdebug.tcp.smp2  FAIL  35  Unhandled VMMigrateStateMismatchError: Mismatch of VM state before and after migration (cd739e82e72179821b0599169432729c != 39920fecc39325cca5deab4e0c54792d) [context: migrating 'vm1' --> after migration]
Fedora.15.64.migrate.with_blkdebug.unix.smp2  FAIL  46  Unhandled VMMigrateStateMismatchError: Mismatch of VM state before and after migration (9482fb1523f15e708b0e46058b22196c != 60f7757c88e8a3a656f701fa97501a97) [context: migrating 'vm1' --> after migration]
Fedora.15.64.migrate.with_blkdebug.exec.smp2  FAIL  44  Unhandled VMMigrateStateMismatchError: Mismatch of VM state before and after migration (7f5d1cce87383c001a83d0d3a6529c68 != 76549ae93fc0b3d270dbf51e8cb49dd6) [context: migrating 'vm1' --> after migration]
Fedora.15.64.migrate.with_blkdebug.mig_cancel.smp2  FAIL  101  Unhandled VMMigrateCancelError: Cannot cancel migration [context: migrating 'vm1']
  GOOD  261

Cheers!

Lucas Meneghel Rodrigues (3):
  KVM test: Introduce blkdebug param to images
  KVM test: migration subtest - Introduce migration_living_guest param
  KVM test: Introduce migrate.with_blkdebug variant

 client/tests/kvm/base.cfg.sample       |  9
 client/tests/kvm/blkdebug/default.conf | 16
 client/tests/kvm/subtests.cfg.sample   |  6 +++
 client/tests/kvm/tests/migration.py    | 78 ++-
 client/virt/kvm_vm.py                  | 14 +-
 client/virt/virt_vm.py                 | 21 +
 6 files changed, 108 insertions(+), 36 deletions(-)
 create mode 100644 client/tests/kvm/blkdebug/default.conf
--
1.7.6
[PATCH 1/3] KVM test: Introduce blkdebug param to images
blkdebug files allow injecting errors in the block subsystem, allowing more types of tests to be executed. Now it is possible to set a blkdebug file for an arbitrary image 'imagefoo', used just like this:

drive_blkdebug_imagefoo = blkdebug/default.conf

The blkdebug files can be stored in a specific location:

client/tests/kvm/blkdebug

And then specified like the above. An example of a debug file can be seen here:

[inject-error]
state = "2"
event = "read_aio"
errno = "7"
immediately = "off"
once = "on"

[set-state]
state = "1"
event = "read_aio"
new_state = "2"

[set-state]
state = "2"
event = "read_aio"
new_state = "3"

Signed-off-by: Lucas Meneghel Rodrigues
---
 client/tests/kvm/base.cfg.sample       |  9 +
 client/tests/kvm/blkdebug/default.conf | 16
 client/virt/kvm_vm.py                  | 14 +++---
 client/virt/virt_vm.py                 | 21 +
 4 files changed, 57 insertions(+), 3 deletions(-)
 create mode 100644 client/tests/kvm/blkdebug/default.conf

diff --git a/client/tests/kvm/base.cfg.sample b/client/tests/kvm/base.cfg.sample
index 0e6d222..96e14a6 100644
--- a/client/tests/kvm/base.cfg.sample
+++ b/client/tests/kvm/base.cfg.sample
@@ -28,6 +28,15 @@ image_size = 10G
 image_raw_device = no
 drive_index_image1 = 0
 drive_cache = none
+# You can specify a blkdebug file here, relative to kvm/blkdebug dir
+# we have a premade default.conf in there. Important to note that you
+# can set this for any image defined in the config at a given time
+#drive_blkdebug_image1 = blkdebug/default.conf
+drive_blkdebug_image1 =
+# What to do when a read error is detected, such as 'stop'
+drive_rerror_image1 =
+# What to do when a write error is detected, such as 'stop'
+drive_werror_image1 =
 
 # Cdrom drive index
 drive_index_cd1 = 1
diff --git a/client/tests/kvm/blkdebug/default.conf b/client/tests/kvm/blkdebug/default.conf
new file mode 100644
index 000..0497405
--- /dev/null
+++ b/client/tests/kvm/blkdebug/default.conf
@@ -0,0 +1,16 @@
+[inject-error]
+state = "2"
+event = "read_aio"
+errno = "7"
+immediately = "off"
+once = "on"
+
+[set-state]
+state = "1"
+event = "read_aio"
+new_state = "2"
+
+[set-state]
+state = "2"
+event = "read_aio"
+new_state = "3"
diff --git a/client/virt/kvm_vm.py b/client/virt/kvm_vm.py
index f7b8345..d0b8015 100644
--- a/client/virt/kvm_vm.py
+++ b/client/virt/kvm_vm.py
@@ -215,7 +215,8 @@ class VM(virt_vm.BaseVM):
             return " -cdrom '%s'" % filename
 
         def add_drive(help, filename, index=None, format=None, cache=None,
-                      werror=None, serial=None, snapshot=False, boot=False):
+                      werror=None, rerror=None, serial=None, snapshot=False,
+                      boot=False, blkdebug=None):
             name = None;
             dev = "";
             if format == "ahci":
@@ -229,13 +230,18 @@ class VM(virt_vm.BaseVM):
                 dev += ",port=%d" % (int(index) + 1)
                 format = "none"
                 index = None
-            cmd = " -drive file='%s'" % filename
+            if blkdebug is not None:
+                cmd = " -drive file=blkdebug:%s:%s" % (blkdebug, filename)
+            else:
+                cmd = " -drive file='%s'" % filename
             if index is not None:
                 cmd += ",index=%s" % index
             if format:
                 cmd += ",if=%s" % format
             if cache:
                 cmd += ",cache=%s" % cache
+            if rerror:
+                cmd += ",rerror=%s" % rerror
             if werror:
                 cmd += ",werror=%s" % werror
             if serial:
@@ -440,10 +446,12 @@ class VM(virt_vm.BaseVM):
                                  image_params.get("drive_index"),
                                  image_params.get("drive_format"),
                                  image_params.get("drive_cache"),
+                                 image_params.get("drive_rerror"),
                                  image_params.get("drive_werror"),
                                  image_params.get("drive_serial"),
                                  image_params.get("image_snapshot") == "yes",
-                                 image_params.get("image_boot") == "yes")
+                                 image_params.get("image_boot") == "yes",
+                                 virt_vm.get_image_blkdebug_filename(image_params, root_dir))
 
         redirs = []
         for redir_name in params.objects("redirs"):
diff --git a/client/virt/virt_vm.py b/client/virt/virt_vm.py
index 8815bf4..de321c4 100644
--- a/client/virt/virt_vm.py
+++ b/client/virt/virt_vm.py
@@ -188,6 +188,27 @@ class VMRebootError(VMError):
 class VMStatusError(VMError):
     pass
 
+
+def get_image_blkdebug_filename(params, root_dir):
+    """
+    Generate a blkdebug file path from params and root_dir.
+
+    blkdebug files allow error injection in the block subsystem.
+
+    @param params: Dictionary containing the test par
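As a rough illustration of what the add_drive change produces: with a blkdebug config, QEMU opens the image through its blkdebug protocol as file=blkdebug:config:image. The snippet below mirrors the string-building logic of the patch in simplified form (the image path is a hypothetical example, not from the patch):

```python
def drive_option(filename, blkdebug=None, cache=None, rerror=None, werror=None):
    # Simplified mirror of add_drive's string building: the blkdebug
    # protocol prefix wraps the real image file.
    if blkdebug is not None:
        cmd = " -drive file=blkdebug:%s:%s" % (blkdebug, filename)
    else:
        cmd = " -drive file='%s'" % filename
    if cache:
        cmd += ",cache=%s" % cache
    if rerror:
        cmd += ",rerror=%s" % rerror
    if werror:
        cmd += ",werror=%s" % werror
    return cmd

print(drive_option("images/f15.qcow2",
                   blkdebug="blkdebug/default.conf",
                   rerror="stop", werror="stop"))
# -> " -drive file=blkdebug:blkdebug/default.conf:images/f15.qcow2,rerror=stop,werror=stop"
```

With rerror=stop/werror=stop, the VM pauses when the injected error fires, which is exactly the state the with_blkdebug migration variant then migrates.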
Re: [net-next-2.6 PATCH 0/3 RFC] macvlan: MAC Address filtering support for passthru mode
On Thu, Sep 08, 2011 at 12:23:56PM -0700, Roopa Prabhu wrote: > > > > I think the main usecase for passthru mode is to assign a SR-IOV VF to > > a single guest. > > > Yes and for the passthru usecase this patch should be enough to enable > filtering in hw (eventually like I indicated before I need to fix vlan > filtering too). So with filtering in hw, and in sriov VF case, VFs actually share a filtering table. How will that be partitioned? -- MST -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [net-next-2.6 PATCH 0/3 RFC] macvlan: MAC Address filtering support for passthru mode
On 9/8/11 4:08 AM, "Michael S. Tsirkin" wrote: > On Wed, Sep 07, 2011 at 10:20:28PM -0700, Roopa Prabhu wrote: >> On 9/7/11 5:34 AM, "Michael S. Tsirkin" wrote: >> >>> On Tue, Sep 06, 2011 at 03:35:40PM -0700, Roopa Prabhu wrote: This patch is an attempt at providing address filtering support for macvtap devices in PASSTHRU mode. Its still a work in progress. Briefly tested for basic functionality. Wanted to get some feedback on the direction before proceeding. >>> >>> Good work, thanks. >>> >> >> Thanks. >> I have hopefully CC'ed all concerned people. >>> >>> kvm crowd might also be interested. >>> Try using ./scripts/get_maintainer.pl as well. >>> >> Thanks for the tip. Expanded CC list a bit more. >> PASSTHRU mode today sets the lowerdev in promiscous mode. In PASSTHRU mode there is a 1-1 mapping between macvtap device and physical nic or VF. And all filtering is done in lowerdev hw. The lowerdev does not need to be in promiscous mode as long as the guest filters are passed down to the lowerdev. This patch tries to remove the need for putting the lowerdev in promiscous mode. I have also referred to the thread below where TUNSETTXFILTER was mentioned in this context: http://patchwork.ozlabs.org/patch/69297/ This patch basically passes the addresses got by TUNSETTXFILTER to macvlan lowerdev. I have looked at previous work and discussions on this for qemu-kvm by Michael Tsirkin, Alex Williamson and Dragos Tatulea http://patchwork.ozlabs.org/patch/78595/ http://patchwork.ozlabs.org/patch/47160/ https://patchwork.kernel.org/patch/474481/ Redhat bugzilla by Michael Tsirkin: https://bugzilla.redhat.com/show_bug.cgi?id=655013 I used Michael's qemu-kvm patch for testing the changes with KVM I would like to cover both MAC and vlan filtering in this work. Open Questions/Issues: - There is a need for vlan filtering to complete the patch. It will require a new tap ioctl cmd for vlans. 
Some ideas on this are: a) TUNSETVLANFILTER: This will entail we send the whole vlan bitmap filter (similar to tun_filter for addresses). Passing the vlan id's to lower device will mean going thru the whole list of vlans every time. OR b) TUNSETVLAN with vlan id and flag to set/unset Does option 'b' sound ok ? - In this implementation we make the macvlan address list same as the address list that came in the filter with TUNSETTXFILTER. This will not cover cases where the macvlan device needs to have other addresses that are not necessarily in the filter. Is this a problem ? >>> >>> What cases do you have in mind? >>> >> This patch targets only macvlan PASSTHRU mode and for PASSTHRU mode I don't >> see a problem with uc/mc address list being the same in all the stacked >> netdevs in the path. I called that out above to make sure I was not missing >> any case in PASSTHRU mode where this might be invalid. Otherwise I don't see >> a problem in the simple PASSTHRU use case this patch supports. >> - The patch currently only supports passing of IFF_PROMISC and IFF_MULTICAST filter flags to lowerdev This patch series implements the following 01/3 - macvlan: Add support for unicast filtering in macvlan 02/3 - macvlan: Add function to set addr filter on lower device in passthru mode 03/3 - macvtap: Add support for TUNSETTXFILTER Please comment. Thanks. Signed-off-by: Roopa Prabhu Signed-off-by: Christian Benvenuti Signed-off-by: David Wang >>> >>> The security isn't lower than with promisc, so I don't see >>> a problem with this as such. >>> >>> There are more features we'll want down the road though, >>> so let's see whether the interface will be able to >>> satisfy them in a backwards compatible way before we >>> set it in stone. Here's what I came up with: >>> >>> How will the filtering table be partitioned within guests? 
>> >> Since this patch supports macvlan PASSTHRU mode only, in which the lower >> device has 1-1 mapping to the guest nic, it does not require any >> partitioning of filtering table within guests. Unless I missed understanding >> something. >> If the lower device were being shared by multiple guest network interfaces >> (non PASSTHRU mode), only then we will need to maintain separate filter >> tables for each guest network interface in macvlan and forward the pkt to >> respective guest interface after a filter lookup. This could affect >> performance too I think. > > Not with hardware filtering support. Which is where we'd need to > partition the host nic mac table between guests. > I need to understand this more. In non passthru case when a VF or physical nic is shared between guests, the nic does not really
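For reference, the TUNSETTXFILTER interface discussed in this thread takes a `struct tun_filter` argument: a u16 flags field, a u16 address count, then `count` 6-byte MAC addresses. A hedged Python sketch of how that variable-length payload is laid out — it only packs the bytes and does not open a tap device or issue the ioctl:

```python
import struct

ETH_ALEN = 6
TUN_FLT_ALLMULTI = 0x0001  # from linux/if_tun.h: accept all multicast

def pack_tun_filter(addrs, flags=0):
    """Pack a struct tun_filter payload: u16 flags, u16 count, then MACs."""
    for mac in addrs:
        if len(mac) != ETH_ALEN:
            raise ValueError("each MAC must be %d bytes" % ETH_ALEN)
    # "=HH" uses native byte order, matching what the kernel expects
    return struct.pack("=HH", flags, len(addrs)) + b"".join(addrs)

# one unicast address: 4-byte header + 6 address bytes = 10 bytes total
buf = pack_tun_filter([b"\x52\x54\x00\x12\x34\x56"])
assert len(buf) == 10
```

This is the same layout the patch series picks up in macvtap and forwards down as the address filter of the macvlan lower device.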
Re: [net-next-2.6 PATCH 0/3 RFC] macvlan: MAC Address filtering support for passthru mode
On Thu, Sep 08, 2011 at 09:19:32AM -0700, Roopa Prabhu wrote: > >>> There are more features we'll want down the road though, > >>> so let's see whether the interface will be able to > >>> satisfy them in a backwards compatible way before we > >>> set it in stone. Here's what I came up with: > >>> > >>> How will the filtering table be partitioned within guests? > >> > >> Since this patch supports macvlan PASSTHRU mode only, in which the lower > >> device has 1-1 mapping to the guest nic, it does not require any > >> partitioning of filtering table within guests. Unless I missed > >> understanding > >> something. > >> If the lower device were being shared by multiple guest network interfaces > >> (non PASSTHRU mode), only then we will need to maintain separate filter > >> tables for each guest network interface in macvlan and forward the pkt to > >> respective guest interface after a filter lookup. This could affect > >> performance too I think. > > > > Not with hardware filtering support. Which is where we'd need to > > partition the host nic mac table between guests. > > > I need to understand this more. In non passthru case when a VF or physical > nic is shared between guests, For example, consider a VF given to each guest. Hardware supports a fixed total number of filters, which can be partitioned between VFs. > the nic does not really know about the guests, > so I was thinking we do the same thing as we do for the passthru case (ie > send all the address filters from macvlan to the physical nic). So at the > hardware, filtering is done for all guests sharing the nic. But if we want > each virtio-net nic or guest to get exactly what it asked for > macvlan/macvtap needs to maintain a copy of each guest filter and do a > lookup and send only the requested traffic to the guest. Here is the > performance hit that I was seeing. Please see my next comment for further > details. It won't be any slower than attaching a non-passthrough macvlan to a device, will it? 
> > >> I chose to support PASSTHRU Mode only at first because its simpler and all > >> code additions are in control path only. > > > > I agree. It would be a bit silly to have a dedicated interface > > for passthough and a completely separate one for > > non passthrough. > > > Agree. The reason I did not focus on non-passthru case in the initial > version was because I was thinking things to do in the non-passthru case > will be just add-ons to the passthru case. But true, better to flesh out the > non-passthru case details. > > After dwelling on this a bit more how about the below: > > Phase 1: Goal: Enable hardware filtering for all macvlan modes > - In macvlan passthru mode the single guest virtio-nic connected will > receive traffic that he requested for > - In macvlan non-passthru mode all guest virtio-nics sharing the > physical nic will see all other guest traffic > but the filtering at guest virtio-nic I don't think guests currently filter anything. > will make sure each guest > eventually sees traffic he asked for. This is still better than > putting the physical nic in promiscuous mode. > > (This is mainly what my patch does...but will need to remove the passthru > check and see if there is anything else needed for non-passthru case) I'm fine with sticking with passthrough, make non passthrough a separate phase. > > Phase 2: Goal: Enable filtering at macvlan so that each guest virtio-nic > receives only what he requested for. > - In this case, in addition to pushing the filters down to the physical > nic we will have to maintain the same filter in macvlan and do a filter > lookup before forwarding the traffic to a virtio-nic. > > But I am thinking phase 2 might be redundant given virtio-nic already does > filtering for the guest. It does? Do you mean the filter that qemu does in userspace? > In which case we might not need phase 2 at all. I > might have been over complicating things. > > Please comment. And please correct if I missed something. 
> > > >>> > >>> A way to limit what the guest can do would also be useful. > >>> How can this be done? selinux? > >> > >> I vaguely remember a thread on the same context.. had a suggestion to > >> maintain pre-approved address lists and allow guest filter registration of > >> only those addresses for security. This seemed reasonable. Plus the ability > >> to support additional address registration from guest could be made > >> configurable (One of your ideas again from prior work). > >> > >> I am not an selinux expert, but I am thinking we can use it to only allow > >> or > >> disallow access or operations to the macvtap device. (?). I will check more > >> on this. > > > > We'd have to have a way to revoke that as well. > > > Yes true. > > > >>> > >>> Any thoughts on spoofing filtering? > >> > >> I can only think of checking addresses against an allowed address list. > >> Don't know of any other ways. An
Re: [RFC PATCH 0/5] VFIO-NG group/device/iommu framework
On Wed, 2011-09-07 at 13:58 +0200, Alexander Graf wrote: > On 01.09.2011, at 21:50, Alex Williamson wrote: > > > Trying to move beyond talking about how VFIO should work to > > re-writing the code. This is pre-alpha, known broken, will > > probably crash your system but it illustrates some of how > > I see groups, devices, and iommus interacting. This is just > > the framework, no code to actually support user space drivers > > or device assignment yet. > > > > The iommu portions are still using the "FIXME" PCI specific > > hooks. Once Joerg gets some buy-in on his bus specific iommu > > patches, we can move to that. > > > > The group management is more complicated than I'd like and > > you can get groups into a bad state by killing the test program > > with devices/iommus open. The locking is overly simplistic. > > But, it's a start. Please make constructive comments and > > suggestions. Patches based on v3.0. Thanks, > > Looks pretty reasonable to me so far, but I guess we only know for sure once > we have non-PCI implemented and working with this scheme as well. > Btw I couldn't find the PCI BAR regions mmaps and general config space > exposure. Where has that gone? I ripped it out for now just to work on the group/device/iommu framework. I didn't see a need to make a functional RFC just to get some buy-in on the framework. Thanks, Alex
Re: [PATCH 08/13] xen/pvticketlock: disable interrupts while blocking
On 09/08/2011 12:51 AM, Avi Kivity wrote: > On 09/07/2011 10:09 PM, Jeremy Fitzhardinge wrote: >> On 09/07/2011 10:41 AM, Avi Kivity wrote: >> >> Hm, I'm interested to know what you're thinking in more detail. >> Can you >> >> leave an NMI pending before you block in the same way you can with >> >> "sti;halt" with normal interrupts? >> > >> > >> > Nope. But you can do >> > >> > if (regs->rip in critical section) >> > regs->rip = after_halt; >> > >> > and effectively emulate it. The critical section is something like >> > >> > critical_section_start: >> > if (woken_up) >> > goto critical_section_end; >> > hlt >> > critical_section_end: >> >> Hm. It's a pity you have to deliver an actual interrupt to implement >> the kick though. > > I don't think it's that expensive, especially compared to the > double-context-switch and vmexit of the spinner going to sleep. On > AMD we do have to take an extra vmexit (on IRET) though. Fair enough - so if the vcpu blocks itself, it ends up being rescheduled in the NMI handler, which then returns to the lock slowpath. And if it's a normal hlt, then you can also take interrupts if they're enabled while spinning. And if you get nested NMIs (since you can get multiple spurious kicks, or from other NMI sources), then one NMI will get latched and any others will get dropped? > Well we could have a specialized sleep/wakeup hypercall pair like Xen, > but I'd like to avoid it if at all possible. Yeah, that's something that just falls out of the existing event channel machinery, so it isn't something that I specifically added. But it does mean that you simply end up with a hypercall returning on kick, with no real complexities. J
Re: [net-next-2.6 PATCH 0/3 RFC] macvlan: MAC Address filtering support for passthru mode
On 9/8/11 10:42 AM, "Sridhar Samudrala" wrote: > On Thu, 2011-09-08 at 09:19 -0700, Roopa Prabhu wrote: >> >> >> On 9/8/11 4:08 AM, "Michael S. Tsirkin" wrote: >> >>> On Wed, Sep 07, 2011 at 10:20:28PM -0700, Roopa Prabhu wrote: On 9/7/11 5:34 AM, "Michael S. Tsirkin" wrote: > On Tue, Sep 06, 2011 at 03:35:40PM -0700, Roopa Prabhu wrote: >> This patch is an attempt at providing address filtering support for >> macvtap >> devices in PASSTHRU mode. Its still a work in progress. >> Briefly tested for basic functionality. Wanted to get some feedback on >> the >> direction before proceeding. >> > > Good work, thanks. > Thanks. >> I have hopefully CC'ed all concerned people. > > kvm crowd might also be interested. > Try using ./scripts/get_maintainer.pl as well. > Thanks for the tip. Expanded CC list a bit more. >> PASSTHRU mode today sets the lowerdev in promiscous mode. In PASSTHRU >> mode >> there is a 1-1 mapping between macvtap device and physical nic or VF. And >> all >> filtering is done in lowerdev hw. The lowerdev does not need to be in >> promiscous mode as long as the guest filters are passed down to the >> lowerdev. >> This patch tries to remove the need for putting the lowerdev in >> promiscous >> mode. >> I have also referred to the thread below where TUNSETTXFILTER was >> mentioned >> in >> this context: >> http://patchwork.ozlabs.org/patch/69297/ >> >> This patch basically passes the addresses got by TUNSETTXFILTER to >> macvlan >> lowerdev. >> >> I have looked at previous work and discussions on this for qemu-kvm >> by Michael Tsirkin, Alex Williamson and Dragos Tatulea >> http://patchwork.ozlabs.org/patch/78595/ >> http://patchwork.ozlabs.org/patch/47160/ >> https://patchwork.kernel.org/patch/474481/ >> >> Redhat bugzilla by Michael Tsirkin: >> https://bugzilla.redhat.com/show_bug.cgi?id=655013 >> >> I used Michael's qemu-kvm patch for testing the changes with KVM >> >> I would like to cover both MAC and vlan filtering in this work. 
>> >> Open Questions/Issues: >> - There is a need for vlan filtering to complete the patch. It will >> require >> a new tap ioctl cmd for vlans. >> Some ideas on this are: >> >> a) TUNSETVLANFILTER: This will entail we send the whole vlan bitmap >> filter >> (similar to tun_filter for addresses). Passing the vlan id's to lower >> device will mean going thru the whole list of vlans every time. >> >> OR >> >> b) TUNSETVLAN with vlan id and flag to set/unset >> >> Does option 'b' sound ok ? >> >> - In this implementation we make the macvlan address list same as the >> address >> list that came in the filter with TUNSETTXFILTER. This will not cover >> cases >> where the macvlan device needs to have other addresses that are not >> necessarily in the filter. Is this a problem ? > > What cases do you have in mind? > This patch targets only macvlan PASSTHRU mode and for PASSTHRU mode I don't see a problem with uc/mc address list being the same in all the stacked netdevs in the path. I called that out above to make sure I was not missing any case in PASSTHRU mode where this might be invalid. Otherwise I don't see a problem in the simple PASSTHRU use case this patch supports. >> - The patch currently only supports passing of IFF_PROMISC and >> IFF_MULTICAST >> filter flags to lowerdev >> >> This patch series implements the following >> 01/3 - macvlan: Add support for unicast filtering in macvlan >> 02/3 - macvlan: Add function to set addr filter on lower device in >> passthru >> mode >> 03/3 - macvtap: Add support for TUNSETTXFILTER >> >> Please comment. Thanks. >> >> Signed-off-by: Roopa Prabhu >> Signed-off-by: Christian Benvenuti >> Signed-off-by: David Wang > > The security isn't lower than with promisc, so I don't see > a problem with this as such. > > There are more features we'll want down the road though, > so let's see whether the interface will be able to > satisfy them in a backwards compatible way before we > set it in stone. 
Here's what I came up with: > > How will the filtering table be partitioned within guests? Since this patch supports macvlan PASSTHRU mode only, in which the lower device has 1-1 mapping to the guest nic, it does not require any partitioning of filtering table within guests. Unless I missed understanding something. If the lower device were being shared by multiple guest network interfaces (non PASSTHRU mode), only then we will need to maintain separate filter tables for
RE: [PATCH] KVM: emulate lapic tsc deadline timer for hvm
>>> --- a/arch/x86/include/asm/msr-index.h >>> +++ b/arch/x86/include/asm/msr-index.h >>> @@ -229,6 +229,8 @@ >>> #define MSR_IA32_APICBASE_ENABLE (1<<11) >>> #define MSR_IA32_APICBASE_BASE (0xf<<12) >>> >>> +#define MSR_IA32_TSCDEADLINE 0x06e0 >>> + >>> #define MSR_IA32_UCODE_WRITE 0x0079 >>> #define MSR_IA32_UCODE_REV 0x008b >> >> Need to add to msrs_to_save so live migration works. > > MSR must be explicitly listed in qemu, also. > Marcelo, it seems the MSR doesn't need to be explicitly listed in qemu? Adding MSR_IA32_TSCDEADLINE to msrs_to_save on the KVM side is enough; qemu will get it through KVM_GET_MSR_INDEX_LIST. Do I miss something? Thanks, Jinsong
Re: KVM got slow after adding more physical memory to host - SOLVED
On 06.09.2011 23:22, Nikola Ciprich wrote: > Hello guys, > > thanks to both of You for Your replies. The problem is solved, > exactly as Avi said, the DMA in windows got somehow disabled. > So this certainly was not related to adding the memory... > > anyways, note for further generations: > in windows XP, the DMA usage can be checked in > Device Manager->IDE ATA/ATAPI Controllers -> Properties->Advanced Settings > Current Transfer Mode must be Multi-Word DMA2 or something similar, > NOT PIO! > > The way I enabled this, was to uninstall both primary and secondary controller > in device controller, then also uninstall Intel controller, and THEN > rebooting (NOT sooner!) > after reboot, controllers got detected and installed again, with DMA properly > enabled. > Note that when controller is in PIO mode, this is really a patience test, > switching to DMA took > me like half an hour to complete, so slow the system was :-/ You can open regedit and search for "NoIDE" - it should find one key with that name, with value "1" or "yes" - just delete it, there's no need to go that route with removing/reinstalling device drivers. The reason why it has been disabled is - most likely - due to some timeout while handling i/o -- eg, when your host was loaded too much or were swapping - you should be able to find something about that in windows event log. /mjt
Re: CFQ I/O starvation problem triggered by RHEL6.0 KVM guests
On Thu, Sep 08, 2011 at 06:13:53PM +0900, Takuya Yoshikawa wrote: > This is a report of strange cfq behaviour which seems to be triggered by > QEMU posix aio threads. > > Host environment: > OS: RHEL6.0 KVM/qemu-kvm (with no patch applied) > IO scheduler: cfq (with the default parameters) So you are using RHEL 6.0 in both the host and guest kernels? Can you reproduce the same issue with upstream kernels? How easily/frequently can you reproduce this with a RHEL6.0 host? > > On the host, we were running 3 linux guests to see if I/O from these guests > would be handled fairly by host; each guest did dd write with oflag=direct. > > Guest virtual disk: > We used a host local disk which had 3 partitions, and each guest was > allocated one of these as dd write target. > > So our test was for checking if cfq could keep fairness for the 3 guests > who shared the same disk. > > The result (strange starvation): > Sometimes, one guest dominated cfq for more than 10sec and requests from > other guests were not handled at all during that time. > > Below is the blktrace log which shows that a request to (8,27) in cfq2068S > (*1) > is not handled at all while cfq2095S and cfq2067S, which hold requests to > (8,26), are being handled alternately. > > *1) WS 104920578 + 64 > > Question: > I guess that cfq_close_cooperator() was being called in an unusual manner. > If so, do you think that cfq is responsible for keeping fairness for this > kind of unusual write requests? - If two guests are doing IO to separate partitions, they should really not be very close (until and unless partitions are really small). - Even if there are close cooperators, these queues are merged and they are treated as single queue from slice point of view. So cooperating queues should be merged and get a single slice instead of starving other queues in the system. Can you upload the blktrace logs somewhere which show what happened during those 10 seconds? > > Note: > With RHEL6.1, this problem could not be triggered. 
But I guess that was due to > QEMU's block layer updates. You can try reproducing this with fio. Thanks Vivek
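As suggested, the workload is easy to reproduce without guests by pointing fio at the same three partitions. A sketch of a job file matching the per-guest dd oflag=direct pattern — device names and runtime are placeholders, not values from the report:

```ini
; three direct sequential writers, one per partition,
; mimicking "dd ... oflag=direct" issued from each guest
[global]
rw=write
bs=1M
direct=1
ioengine=libaio
iodepth=1
runtime=60
time_based

[guest1]
filename=/dev/sdb1

[guest2]
filename=/dev/sdb2

[guest3]
filename=/dev/sdb3
```

Running this on the host with the cfq scheduler active should show whether one writer can starve the others independently of the QEMU block layer.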
Re: [net-next-2.6 PATCH 0/3 RFC] macvlan: MAC Address filtering support for passthru mode
On Wed, Sep 07, 2011 at 10:20:28PM -0700, Roopa Prabhu wrote: > On 9/7/11 5:34 AM, "Michael S. Tsirkin" wrote: > > > On Tue, Sep 06, 2011 at 03:35:40PM -0700, Roopa Prabhu wrote: > >> This patch is an attempt at providing address filtering support for macvtap > >> devices in PASSTHRU mode. Its still a work in progress. > >> Briefly tested for basic functionality. Wanted to get some feedback on the > >> direction before proceeding. > >> > > > > Good work, thanks. > > > > Thanks. > > >> I have hopefully CC'ed all concerned people. > > > > kvm crowd might also be interested. > > Try using ./scripts/get_maintainer.pl as well. > > > Thanks for the tip. Expanded CC list a bit more. > > >> PASSTHRU mode today sets the lowerdev in promiscous mode. In PASSTHRU mode > >> there is a 1-1 mapping between macvtap device and physical nic or VF. And > >> all > >> filtering is done in lowerdev hw. The lowerdev does not need to be in > >> promiscous mode as long as the guest filters are passed down to the > >> lowerdev. > >> This patch tries to remove the need for putting the lowerdev in promiscous > >> mode. > >> I have also referred to the thread below where TUNSETTXFILTER was mentioned > >> in > >> this context: > >> http://patchwork.ozlabs.org/patch/69297/ > >> > >> This patch basically passes the addresses got by TUNSETTXFILTER to macvlan > >> lowerdev. > >> > >> I have looked at previous work and discussions on this for qemu-kvm > >> by Michael Tsirkin, Alex Williamson and Dragos Tatulea > >> http://patchwork.ozlabs.org/patch/78595/ > >> http://patchwork.ozlabs.org/patch/47160/ > >> https://patchwork.kernel.org/patch/474481/ > >> > >> Redhat bugzilla by Michael Tsirkin: > >> https://bugzilla.redhat.com/show_bug.cgi?id=655013 > >> > >> I used Michael's qemu-kvm patch for testing the changes with KVM > >> > >> I would like to cover both MAC and vlan filtering in this work. > >> > >> Open Questions/Issues: > >> - There is a need for vlan filtering to complete the patch. 
It will require > >> a new tap ioctl cmd for vlans. > >> Some ideas on this are: > >> > >> a) TUNSETVLANFILTER: This will entail we send the whole vlan bitmap > >> filter > >> (similar to tun_filter for addresses). Passing the vlan id's to lower > >> device will mean going thru the whole list of vlans every time. > >> > >> OR > >> > >> b) TUNSETVLAN with vlan id and flag to set/unset > >> > >> Does option 'b' sound ok ? > >> > >> - In this implementation we make the macvlan address list same as the > >> address > >> list that came in the filter with TUNSETTXFILTER. This will not cover > >> cases > >> where the macvlan device needs to have other addresses that are not > >> necessarily in the filter. Is this a problem ? > > > > What cases do you have in mind? > > > This patch targets only macvlan PASSTHRU mode and for PASSTHRU mode I don't > see a problem with uc/mc address list being the same in all the stacked > netdevs in the path. I called that out above to make sure I was not missing > any case in PASSTHRU mode where this might be invalid. Otherwise I don't see > a problem in the simple PASSTHRU use case this patch supports. > > >> - The patch currently only supports passing of IFF_PROMISC and > >> IFF_MULTICAST > >> filter flags to lowerdev > >> > >> This patch series implements the following > >> 01/3 - macvlan: Add support for unicast filtering in macvlan > >> 02/3 - macvlan: Add function to set addr filter on lower device in passthru > >> mode > >> 03/3 - macvtap: Add support for TUNSETTXFILTER > >> > >> Please comment. Thanks. > >> > >> Signed-off-by: Roopa Prabhu > >> Signed-off-by: Christian Benvenuti > >> Signed-off-by: David Wang > > > > The security isn't lower than with promisc, so I don't see > > a problem with this as such. > > > > There are more features we'll want down the road though, > > so let's see whether the interface will be able to > > satisfy them in a backwards compatible way before we > > set it in stone. 
Here's what I came up with: > > > > How will the filtering table be partitioned within guests? > > Since this patch supports macvlan PASSTHRU mode only, in which the lower > device has 1-1 mapping to the guest nic, it does not require any > partitioning of filtering table within guests. Unless I missed understanding > something. > If the lower device were being shared by multiple guest network interfaces > (non PASSTHRU mode), only then we will need to maintain separate filter > tables for each guest network interface in macvlan and forward the pkt to > respective guest interface after a filter lookup. This could affect > performance too I think. Not with hardware filtering support. Which is where we'd need to partition the host nic mac table between guests. > I chose to support PASSTHRU Mode only at first because its simpler and all > code additions are in control path only. I agree. It would be a bit silly to have a dedicated interface for passthough and a co
[PATCH 2/2] qemu-kvm: pc: Do not start APIC timer spuriously
apic_timer_update not only calculates the next timer expiry that we need to write out the vmstate, it may also start the timer of the user space APIC model. That can cause spurious signals to the corresponding vCPU thread when the timer expires. Fix this by using the new apic_next_timer that does not start the timer. Signed-off-by: Jan Kiszka --- Found while hunting user space exits of our never-exits-to-user-space real-time guest. hw/apic.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/hw/apic.c b/hw/apic.c index b3044aa..e43219f 100644 --- a/hw/apic.c +++ b/hw/apic.c @@ -950,7 +950,7 @@ static void kvm_kernel_lapic_save_to_user(APICState *s) s->count_shift = (v + 1) & 7; s->initial_count_load_time = qemu_get_clock_ns(vm_clock); -apic_timer_update(s, s->initial_count_load_time); +apic_next_timer(s, s->initial_count_load_time); } static void kvm_kernel_lapic_load_from_user(APICState *s) -- 1.7.3.4
[PATCH 1/2] qemu-kvm: pc: Factor out apic_next_timer
Factor out apic_next_timer from apic_timer_update. The former can then be used to update next_timer without actually starting the qemu timer. KVM's in-kernel APIC model will make use of it. Signed-off-by: Jan Kiszka --- hw/apic.c | 45 - 1 files changed, 28 insertions(+), 17 deletions(-) diff --git a/hw/apic.c b/hw/apic.c index 96173ac..b3044aa 100644 --- a/hw/apic.c +++ b/hw/apic.c @@ -661,27 +661,38 @@ static uint32_t apic_get_current_count(APICState *s) return val; } -static void apic_timer_update(APICState *s, int64_t current_time) +static bool apic_next_timer(APICState *s, int64_t current_time) { -int64_t next_time, d; - -if (!(s->lvt[APIC_LVT_TIMER] & APIC_LVT_MASKED)) { -d = (current_time - s->initial_count_load_time) >> -s->count_shift; -if (s->lvt[APIC_LVT_TIMER] & APIC_LVT_TIMER_PERIODIC) { -if (!s->initial_count) -goto no_timer; -d = ((d / ((uint64_t)s->initial_count + 1)) + 1) * ((uint64_t)s->initial_count + 1); -} else { -if (d >= s->initial_count) -goto no_timer; -d = (uint64_t)s->initial_count + 1; +int64_t d; + +if (s->lvt[APIC_LVT_TIMER] & APIC_LVT_MASKED) { +return false; +} + +d = (current_time - s->initial_count_load_time) >> s->count_shift; + +if (s->lvt[APIC_LVT_TIMER] & APIC_LVT_TIMER_PERIODIC) { +if (!s->initial_count) { +return false; } -next_time = s->initial_count_load_time + (d << s->count_shift); -qemu_mod_timer(s->timer, next_time); -s->next_time = next_time; +d = ((d / ((uint64_t)s->initial_count + 1)) + 1) * +((uint64_t)s->initial_count + 1); +} else { +if (d >= s->initial_count) { +return false; +} +d = (uint64_t)s->initial_count + 1; +} +s->next_time = s->initial_count_load_time + (d << s->count_shift); + +return true; +} + +static void apic_timer_update(APICState *s, int64_t current_time) +{ +if (apic_next_timer(s, current_time)) { +qemu_mod_timer(s->timer, s->next_time); } else { -no_timer: qemu_del_timer(s->timer); } } -- 1.7.3.4 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to 
majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
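The deadline arithmetic that apic_next_timer factors out is compact: elapsed time is converted to timer ticks with count_shift, and the periodic case rounds the tick count up to the next multiple of initial_count + 1. A small Python sketch of that arithmetic (an illustration of the formula, not the QEMU code itself):

```python
def apic_next_deadline(load_time, now, initial_count, count_shift, periodic):
    """Return the next APIC timer deadline, or None if the timer stops.

    Mirrors the patch's logic: d is the elapsed tick count; the periodic
    case rounds d up to the next multiple of (initial_count + 1), while
    the one-shot case fires once at initial_count + 1 ticks.
    """
    d = (now - load_time) >> count_shift
    if periodic:
        if initial_count == 0:
            return None
        d = (d // (initial_count + 1) + 1) * (initial_count + 1)
    else:
        if d >= initial_count:
            return None  # one-shot timer has already fired
        d = initial_count + 1
    return load_time + (d << count_shift)

# periodic timer, period 101 ticks, shift 0, 150 ticks elapsed:
# the next multiple of 101 after 150 is 202
assert apic_next_deadline(0, 150, 100, 0, True) == 202
```

Splitting this pure computation from qemu_mod_timer is what lets kvm_kernel_lapic_save_to_user (patch 2/2) refresh next_time without arming the userspace timer.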
[PATCH] qemu-kvm: Resolve PCI upstream diffs
Resolve all unneeded deviations from upstream code. No functional
changes.

Signed-off-by: Jan Kiszka
---
 hw/pci.c |   11 +++
 hw/pci.h |    5 -
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/hw/pci.c b/hw/pci.c
index e4c166a..4d8845c 100644
--- a/hw/pci.c
+++ b/hw/pci.c
@@ -2106,8 +2106,6 @@ int pci_add_capability(PCIDevice *pdev, uint8_t cap_id,
     memset(pdev->wmask + offset, 0, size);
     /* Check capability by default */
     memset(pdev->cmask + offset, 0xFF, size);
-
-
     return offset;
 }
 
@@ -2125,9 +2123,14 @@ void pci_del_capability(PCIDevice *pdev, uint8_t cap_id, uint8_t size)
     memset(pdev->cmask + offset, 0, size);
     memset(pdev->used + offset, 0, size);
 
-    if (!pdev->config[PCI_CAPABILITY_LIST]) {
+    if (!pdev->config[PCI_CAPABILITY_LIST])
         pdev->config[PCI_STATUS] &= ~PCI_STATUS_CAP_LIST;
-    }
+}
+
+/* Reserve space for capability at a known offset (to call after load). */
+void pci_reserve_capability(PCIDevice *pdev, uint8_t offset, uint8_t size)
+{
+    memset(pdev->used + offset, 0xff, size);
 }
 
 uint8_t pci_find_capability(PCIDevice *pdev, uint8_t cap_id)
diff --git a/hw/pci.h b/hw/pci.h
index da0c2d2..70fcd9c 100644
--- a/hw/pci.h
+++ b/hw/pci.h
@@ -147,7 +147,7 @@ struct PCIDevice {
     /* Used to implement RW1C(Write 1 to Clear) bytes */
     uint8_t *w1cmask;
 
-    /* Used to allocate config space and track capabilities. */
+    /* Used to allocate config space for capabilities. */
     uint8_t *used;
 
     /* the following fields are read only */
@@ -230,8 +230,11 @@ int pci_add_capability(PCIDevice *pdev, uint8_t cap_id,
 void pci_del_capability(PCIDevice *pci_dev, uint8_t cap_id, uint8_t cap_size);
 
+void pci_reserve_capability(PCIDevice *pci_dev, uint8_t offset, uint8_t size);
+
 uint8_t pci_find_capability(PCIDevice *pci_dev, uint8_t cap_id);
+
 uint32_t pci_default_read_config(PCIDevice *d,
                                  uint32_t address, int len);
 void pci_default_write_config(PCIDevice *d,
-- 
1.7.3.4
[PATCH v8 0/4] The intro of QEMU block I/O throttling
The main goal of this patch series is to effectively cap the disk I/O speed or
request counts of a single VM. It is only a draft, so it unavoidably has some
drawbacks; if you catch them, please let me know.

The series mainly introduces one block I/O throttling algorithm, one timer and
one block queue for each I/O-limits-enabled drive. When a block request comes
in, the throttling algorithm checks whether its I/O rate or count exceeds the
limits; if yes, the request is enqueued on the block queue, and the timer
handles the queued I/O requests.

Some available features follow below:
(1) global bps limit
    -drive bps=xxx       in bytes/s
(2) only read bps limit
    -drive bps_rd=xxx    in bytes/s
(3) only write bps limit
    -drive bps_wr=xxx    in bytes/s
(4) global iops limit
    -drive iops=xxx      in ios/s
(5) only read iops limit
    -drive iops_rd=xxx   in ios/s
(6) only write iops limit
    -drive iops_wr=xxx   in ios/s
(7) the combination of some limits
    -drive bps=xxx,iops=xxx

Known limitations:
(1) #1 cannot coexist with #2, #3
(2) #4 cannot coexist with #5, #6
(3) When bps/iops limits are set to a small value such as 511 bytes/s, the VM
    will hang up. We are considering how to handle this scenario.

Changes since code V7: fix the per-patch build, based on Stefan's comments.

Zhi Yong Wu (4):
  block: add the command line support
  block: add the block queue support
  block: add block timer and throttling algorithm
  qmp/hmp: add block_set_io_throttle

v7: Mainly simplify the block queue. Adjust the code based on Stefan's
    comments.
v6: Mainly fix the aio callback issue for the block queue. Adjust the code
    based on Ram Pai's comments.
v5: Add qmp/hmp support. Adjust the code based on Stefan's comments.
v4: Fix a memory leak based on Ryan's feedback.
v3: Add the code for extending slice time, and modify the method to compute
    the wait time for the timer.
v2: The V2 code for QEMU disk I/O limits. Modified mainly based on Stefan's
    comments.
v1: Submit the code for QEMU disk I/O limits. Only a code draft.

 Makefile.objs     |    2 +-
 block.c           |  344 +++--
 block.h           |    6 +-
 block/blk-queue.c |  201 +++
 block/blk-queue.h |   59 +
 block_int.h       |   30 +
 blockdev.c        |   98 +++
 blockdev.h        |    2 +
 hmp-commands.hx   |   15 +++
 qemu-config.c     |   24
 qemu-options.hx   |    1 +
 qerror.c          |    4 +
 qerror.h          |    3 +
 qmp-commands.hx   |   52 -
 14 files changed, 825 insertions(+), 16 deletions(-)
 create mode 100644 block/blk-queue.c
 create mode 100644 block/blk-queue.h
-- 
1.7.6
[PATCH v8 4/4] qmp/hmp: add block_set_io_throttle
This patch introduces one new command, block_set_io_throttle. For its usage
syntax, if you have a better idea, please let me know.

Signed-off-by: Zhi Yong Wu
---
 block.c         |   26 +++-
 blockdev.c      |   69 +++
 blockdev.h      |    2 +
 hmp-commands.hx |   15
 qerror.c        |    4 +++
 qerror.h        |    3 ++
 qmp-commands.hx |   52 -
 7 files changed, 168 insertions(+), 3 deletions(-)

diff --git a/block.c b/block.c
index c08fde8..1d3f067 100644
--- a/block.c
+++ b/block.c
@@ -1938,6 +1938,16 @@ static void bdrv_print_dict(QObject *obj, void *opaque)
                        qdict_get_bool(qdict, "ro"),
                        qdict_get_str(qdict, "drv"),
                        qdict_get_bool(qdict, "encrypted"));
+
+        monitor_printf(mon, " bps=%" PRId64 " bps_rd=%" PRId64
+                        " bps_wr=%" PRId64 " iops=%" PRId64
+                        " iops_rd=%" PRId64 " iops_wr=%" PRId64,
+                        qdict_get_int(qdict, "bps"),
+                        qdict_get_int(qdict, "bps_rd"),
+                        qdict_get_int(qdict, "bps_wr"),
+                        qdict_get_int(qdict, "iops"),
+                        qdict_get_int(qdict, "iops_rd"),
+                        qdict_get_int(qdict, "iops_wr"));
     } else {
         monitor_printf(mon, " [not inserted]");
     }
@@ -1970,10 +1980,22 @@ void bdrv_info(Monitor *mon, QObject **ret_data)
             QDict *bs_dict = qobject_to_qdict(bs_obj);
 
             obj = qobject_from_jsonf("{ 'file': %s, 'ro': %i, 'drv': %s, "
-                                     "'encrypted': %i }",
+                                     "'encrypted': %i, "
+                                     "'bps': %" PRId64 ","
+                                     "'bps_rd': %" PRId64 ","
+                                     "'bps_wr': %" PRId64 ","
+                                     "'iops': %" PRId64 ","
+                                     "'iops_rd': %" PRId64 ","
+                                     "'iops_wr': %" PRId64 "}",
                                      bs->filename, bs->read_only,
                                      bs->drv->format_name,
-                                     bdrv_is_encrypted(bs));
+                                     bdrv_is_encrypted(bs),
+                                     bs->io_limits.bps[BLOCK_IO_LIMIT_TOTAL],
+                                     bs->io_limits.bps[BLOCK_IO_LIMIT_READ],
+                                     bs->io_limits.bps[BLOCK_IO_LIMIT_WRITE],
+                                     bs->io_limits.iops[BLOCK_IO_LIMIT_TOTAL],
+                                     bs->io_limits.iops[BLOCK_IO_LIMIT_READ],
+                                     bs->io_limits.iops[BLOCK_IO_LIMIT_WRITE]);
             if (bs->backing_file[0] != '\0') {
                 QDict *qdict = qobject_to_qdict(obj);
                 qdict_put(qdict, "backing_file",
diff --git a/blockdev.c b/blockdev.c
index 619ae9f..7f5c4df 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -747,6 +747,75 @@ int do_change_block(Monitor *mon, const char *device,
     return monitor_read_bdrv_key_start(mon, bs, NULL, NULL);
 }
 
+/* throttling disk I/O limits */
+int do_block_set_io_throttle(Monitor *mon,
+                             const QDict *qdict, QObject **ret_data)
+{
+    const char *devname = qdict_get_str(qdict, "device");
+    uint64_t bps        = qdict_get_try_int(qdict, "bps", -1);
+    uint64_t bps_rd     = qdict_get_try_int(qdict, "bps_rd", -1);
+    uint64_t bps_wr     = qdict_get_try_int(qdict, "bps_wr", -1);
+    uint64_t iops       = qdict_get_try_int(qdict, "iops", -1);
+    uint64_t iops_rd    = qdict_get_try_int(qdict, "iops_rd", -1);
+    uint64_t iops_wr    = qdict_get_try_int(qdict, "iops_wr", -1);
+    BlockDriverState *bs;
+
+    bs = bdrv_find(devname);
+    if (!bs) {
+        qerror_report(QERR_DEVICE_NOT_FOUND, devname);
+        return -1;
+    }
+
+    if ((bps == -1) && (bps_rd == -1) && (bps_wr == -1)
+        && (iops == -1) && (iops_rd == -1) && (iops_wr == -1)) {
+        qerror_report(QERR_MISSING_PARAMETER,
+                      "bps/bps_rd/bps_wr/iops/iops_rd/iops_wr");
+        return -1;
+    }
+
+    if (((bps != -1) && ((bps_rd != -1) || (bps_wr != -1)))
+        || ((iops != -1) && ((iops_rd != -1) || (iops_wr != -1)))) {
+        qerror_report(QERR_INVALID_PARAMETER_COMBINATION);
+        return -1;
+    }
+
+    if (bps != -1) {
+        bs->io_limits.bps[BLOCK_IO_LIMIT_TOTAL] = bps;
+        bs->io_limits.bps[BLOCK_IO_LIMIT_READ]  = 0;
+        bs->io_limits.bps[BLOCK_IO_LIMIT_WRITE] = 0;
+    }
+
+    if ((bps_rd != -1) || (bps_wr != -1)) {
+        bs->io_limits.bps[BLOCK_IO_LIMIT_READ] =
+            (bps_rd == -1) ? bs->io_limits.bps[BLOCK_IO_LIMIT_READ] : bps_rd;
+        bs->io_limits.bps[BLOCK_IO_LIMIT_WRITE] =
[PATCH v8 3/4] block: add block timer and throttling algorithm
Note:
1.) When bps/iops limits are specified to a small value such as 511 bytes/s,
    this VM will hang up. We are considering how to handle this scenario.
2.) When the "dd" command is issued in the guest, if its option bs is set to
    a large value such as "bs=1024K", the resulting speed will be slightly
    higher than the limits.

For these problems, if you have a nice thought, please let us know. :)

Signed-off-by: Zhi Yong Wu
---
 block.c |  259 ---
 block.h |    1 -
 2 files changed, 248 insertions(+), 12 deletions(-)

diff --git a/block.c b/block.c
index cd75183..c08fde8 100644
--- a/block.c
+++ b/block.c
@@ -30,6 +30,9 @@
 #include "qemu-objects.h"
 #include "qemu-coroutine.h"
 
+#include "qemu-timer.h"
+#include "block/blk-queue.h"
+
 #ifdef CONFIG_BSD
 #include 
 #include 
@@ -72,6 +75,13 @@ static int coroutine_fn bdrv_co_writev_em(BlockDriverState *bs,
                                           QEMUIOVector *iov);
 static int coroutine_fn bdrv_co_flush_em(BlockDriverState *bs);
 
+static bool bdrv_exceed_bps_limits(BlockDriverState *bs, int nb_sectors,
+        bool is_write, double elapsed_time, uint64_t *wait);
+static bool bdrv_exceed_iops_limits(BlockDriverState *bs, bool is_write,
+        double elapsed_time, uint64_t *wait);
+static bool bdrv_exceed_io_limits(BlockDriverState *bs, int nb_sectors,
+        bool is_write, int64_t *wait);
+
 static QTAILQ_HEAD(, BlockDriverState) bdrv_states =
     QTAILQ_HEAD_INITIALIZER(bdrv_states);
 
@@ -745,6 +755,11 @@ int bdrv_open(BlockDriverState *bs, const char *filename, int flags,
             bs->change_cb(bs->change_opaque, CHANGE_MEDIA);
     }
 
+    /* throttling disk I/O limits */
+    if (bs->io_limits_enabled) {
+        bdrv_io_limits_enable(bs);
+    }
+
     return 0;
 
 unlink_and_fail:
@@ -783,6 +798,18 @@ void bdrv_close(BlockDriverState *bs)
         if (bs->change_cb)
             bs->change_cb(bs->change_opaque, CHANGE_MEDIA);
     }
+
+    /* throttling disk I/O limits */
+    if (bs->block_queue) {
+        qemu_del_block_queue(bs->block_queue);
+        bs->block_queue = NULL;
+    }
+
+    if (bs->block_timer) {
+        qemu_del_timer(bs->block_timer);
+        qemu_free_timer(bs->block_timer);
+        bs->block_timer = NULL;
+    }
 }
 
 void bdrv_close_all(void)
@@ -2341,16 +2368,48 @@ BlockDriverAIOCB *bdrv_aio_readv(BlockDriverState *bs, int64_t sector_num,
                                  BlockDriverCompletionFunc *cb, void *opaque)
 {
     BlockDriver *drv = bs->drv;
-
+    BlockDriverAIOCB *ret;
+    int64_t wait_time = -1;
+    printf("sector_num=%ld, nb_sectors=%d\n", sector_num, nb_sectors);
     trace_bdrv_aio_readv(bs, sector_num, nb_sectors, opaque);
 
-    if (!drv)
-        return NULL;
-    if (bdrv_check_request(bs, sector_num, nb_sectors))
+    if (!drv || bdrv_check_request(bs, sector_num, nb_sectors)) {
         return NULL;
+    }
+
+    /* throttling disk read I/O */
+    if (bs->io_limits_enabled) {
+        if (bdrv_exceed_io_limits(bs, nb_sectors, false, &wait_time)) {
+            ret = qemu_block_queue_enqueue(bs->block_queue, bs, bdrv_aio_readv,
+                           sector_num, qiov, nb_sectors, cb, opaque);
+            printf("wait_time=%ld\n", wait_time);
+            if (wait_time != -1) {
+                printf("reset block timer\n");
+                qemu_mod_timer(bs->block_timer,
+                               wait_time + qemu_get_clock_ns(vm_clock));
+            }
+
+            if (ret) {
+                printf("ori ret is not null\n");
+            } else {
+                printf("ori ret is null\n");
+            }
+
+            return ret;
+        }
+    }
 
-    return drv->bdrv_aio_readv(bs, sector_num, qiov, nb_sectors,
+    ret = drv->bdrv_aio_readv(bs, sector_num, qiov, nb_sectors,
                                cb, opaque);
+    if (ret) {
+        if (bs->io_limits_enabled) {
+            bs->io_disps.bytes[BLOCK_IO_LIMIT_READ] +=
+                              (unsigned) nb_sectors * BDRV_SECTOR_SIZE;
+            bs->io_disps.ios[BLOCK_IO_LIMIT_READ]++;
+        }
+    }
+
+    return ret;
 }
 
 typedef struct BlockCompleteData {
@@ -2396,15 +2455,14 @@ BlockDriverAIOCB *bdrv_aio_writev(BlockDriverState *bs, int64_t sector_num,
     BlockDriver *drv = bs->drv;
     BlockDriverAIOCB *ret;
     BlockCompleteData *blk_cb_data;
+    int64_t wait_time = -1;
 
     trace_bdrv_aio_writev(bs, sector_num, nb_sectors, opaque);
 
-    if (!drv)
-        return NULL;
-    if (bs->read_only)
-        return NULL;
-    if (bdrv_check_request(bs, sector_num, nb_sectors))
+    if (!drv || bs->read_only
+        || bdrv_check_request(bs, sector_num, nb_sectors)) {
         return NULL;
+    }
 
     if (bs->dirty_bitmap) {
         blk_cb_data = blk_dirty_cb_alloc(bs, sector_num, nb_sectors, cb,
@@ -2413,13 +2471,32 @@ BlockDriverAIOCB *bdrv_aio_writev(BlockDriverState *bs, int64_t sector_num,
         opaque = blk_cb_data;
     }
[PATCH v8 2/4] block: add the command line support
Signed-off-by: Zhi Yong Wu
---
 block.c         |   59 +++
 block.h         |    5
 block_int.h     |    3 ++
 blockdev.c      |   29 +++
 qemu-config.c   |   24 ++
 qemu-options.hx |    1 +
 6 files changed, 121 insertions(+), 0 deletions(-)

diff --git a/block.c b/block.c
index 43742b7..cd75183 100644
--- a/block.c
+++ b/block.c
@@ -104,6 +104,57 @@ int is_windows_drive(const char *filename)
 }
 #endif
 
+/* throttling disk I/O limits */
+void bdrv_io_limits_disable(BlockDriverState *bs)
+{
+    bs->io_limits_enabled = false;
+
+    if (bs->block_queue) {
+        qemu_block_queue_flush(bs->block_queue);
+        qemu_del_block_queue(bs->block_queue);
+        bs->block_queue = NULL;
+    }
+
+    if (bs->block_timer) {
+        qemu_del_timer(bs->block_timer);
+        qemu_free_timer(bs->block_timer);
+        bs->block_timer = NULL;
+    }
+
+    bs->slice_start = 0;
+
+    bs->slice_end   = 0;
+}
+
+static void bdrv_block_timer(void *opaque)
+{
+    BlockDriverState *bs = opaque;
+    BlockQueue *queue    = bs->block_queue;
+
+    qemu_block_queue_flush(queue);
+}
+
+void bdrv_io_limits_enable(BlockDriverState *bs)
+{
+    bs->block_queue = qemu_new_block_queue();
+    bs->block_timer = qemu_new_timer_ns(vm_clock, bdrv_block_timer, bs);
+
+    bs->slice_start = qemu_get_clock_ns(vm_clock);
+
+    bs->slice_end   = bs->slice_start + BLOCK_IO_SLICE_TIME;
+}
+
+bool bdrv_io_limits_enabled(BlockDriverState *bs)
+{
+    BlockIOLimit *io_limits = &bs->io_limits;
+    return io_limits->bps[BLOCK_IO_LIMIT_READ]
+        || io_limits->bps[BLOCK_IO_LIMIT_WRITE]
+        || io_limits->bps[BLOCK_IO_LIMIT_TOTAL]
+        || io_limits->iops[BLOCK_IO_LIMIT_READ]
+        || io_limits->iops[BLOCK_IO_LIMIT_WRITE]
+        || io_limits->iops[BLOCK_IO_LIMIT_TOTAL];
+}
+
 /* check if the path starts with ":" */
 static int path_has_protocol(const char *path)
 {
@@ -1453,6 +1504,14 @@ void bdrv_get_geometry_hint(BlockDriverState *bs,
     *psecs = bs->secs;
 }
 
+/* throttling disk io limits */
+void bdrv_set_io_limits(BlockDriverState *bs,
+                        BlockIOLimit *io_limits)
+{
+    bs->io_limits         = *io_limits;
+    bs->io_limits_enabled = bdrv_io_limits_enabled(bs);
+}
+
 /* Recognize floppy formats */
 typedef struct FDFormat {
     FDriveType drive;
diff --git a/block.h b/block.h
index 3ac0b94..a3e69db 100644
--- a/block.h
+++ b/block.h
@@ -58,6 +58,11 @@ void bdrv_info(Monitor *mon, QObject **ret_data);
 void bdrv_stats_print(Monitor *mon, const QObject *data);
 void bdrv_info_stats(Monitor *mon, QObject **ret_data);
 
+/* disk I/O throttling */
+void bdrv_io_limits_enable(BlockDriverState *bs);
+void bdrv_io_limits_disable(BlockDriverState *bs);
+bool bdrv_io_limits_enabled(BlockDriverState *bs);
+
 void bdrv_init(void);
 void bdrv_init_with_whitelist(void);
 BlockDriver *bdrv_find_protocol(const char *filename);
diff --git a/block_int.h b/block_int.h
index 201e635..368c776 100644
--- a/block_int.h
+++ b/block_int.h
@@ -257,6 +257,9 @@ void qemu_aio_release(void *p);
 
 void *qemu_blockalign(BlockDriverState *bs, size_t size);
 
+void bdrv_set_io_limits(BlockDriverState *bs,
+                        BlockIOLimit *io_limits);
+
 #ifdef _WIN32
 int is_windows_drive(const char *filename);
 #endif
diff --git a/blockdev.c b/blockdev.c
index 2602591..619ae9f 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -236,6 +236,7 @@ DriveInfo *drive_init(QemuOpts *opts, int default_to_scsi)
     int on_read_error, on_write_error;
     const char *devaddr;
     DriveInfo *dinfo;
+    BlockIOLimit io_limits;
     int snapshot = 0;
     int ret;
 
@@ -354,6 +355,31 @@ DriveInfo *drive_init(QemuOpts *opts, int default_to_scsi)
         }
     }
 
+    /* disk I/O throttling */
+    io_limits.bps[BLOCK_IO_LIMIT_TOTAL]  =
+                           qemu_opt_get_number(opts, "bps", 0);
+    io_limits.bps[BLOCK_IO_LIMIT_READ]   =
+                           qemu_opt_get_number(opts, "bps_rd", 0);
+    io_limits.bps[BLOCK_IO_LIMIT_WRITE]  =
+                           qemu_opt_get_number(opts, "bps_wr", 0);
+    io_limits.iops[BLOCK_IO_LIMIT_TOTAL] =
+                           qemu_opt_get_number(opts, "iops", 0);
+    io_limits.iops[BLOCK_IO_LIMIT_READ]  =
+                           qemu_opt_get_number(opts, "iops_rd", 0);
+    io_limits.iops[BLOCK_IO_LIMIT_WRITE] =
+                           qemu_opt_get_number(opts, "iops_wr", 0);
+
+    if (((io_limits.bps[BLOCK_IO_LIMIT_TOTAL] != 0)
+        && ((io_limits.bps[BLOCK_IO_LIMIT_READ] != 0)
+        || (io_limits.bps[BLOCK_IO_LIMIT_WRITE] != 0)))
+        || ((io_limits.iops[BLOCK_IO_LIMIT_TOTAL] != 0)
+        && ((io_limits.iops[BLOCK_IO_LIMIT_READ] != 0)
+        || (io_limits.iops[BLOCK_IO_LIMIT_WRITE] != 0)))) {
+        error_report("bps(iops) and bps_rd/bps_wr(iops_rd/iops_wr)"
+                     "cannot be used at the same time")
[PATCH v8 1/4] block: add the block queue support
Signed-off-by: Zhi Yong Wu
---
 Makefile.objs     |    2 +-
 block/blk-queue.c |  201 +
 block/blk-queue.h |   59
 block_int.h       |   27 +++
 4 files changed, 288 insertions(+), 1 deletions(-)
 create mode 100644 block/blk-queue.c
 create mode 100644 block/blk-queue.h

diff --git a/Makefile.objs b/Makefile.objs
index 26b885b..5dcf456 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -33,7 +33,7 @@ block-nested-y += raw.o cow.o qcow.o vdi.o vmdk.o cloop.o dmg.o bochs.o vpc.o vv
 block-nested-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o qcow2-cache.o
 block-nested-y += qed.o qed-gencb.o qed-l2-cache.o qed-table.o qed-cluster.o
 block-nested-y += qed-check.o
-block-nested-y += parallels.o nbd.o blkdebug.o sheepdog.o blkverify.o
+block-nested-y += parallels.o nbd.o blkdebug.o sheepdog.o blkverify.o blk-queue.o
 block-nested-$(CONFIG_WIN32) += raw-win32.o
 block-nested-$(CONFIG_POSIX) += raw-posix.o
 block-nested-$(CONFIG_CURL) += curl.o
diff --git a/block/blk-queue.c b/block/blk-queue.c
new file mode 100644
index 000..adef497
--- /dev/null
+++ b/block/blk-queue.c
@@ -0,0 +1,201 @@
+/*
+ * QEMU System Emulator queue definition for block layer
+ *
+ * Copyright (c) IBM, Corp. 2011
+ *
+ * Authors:
+ *  Zhi Yong Wu
+ *  Stefan Hajnoczi
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+
+#include "block_int.h"
+#include "block/blk-queue.h"
+#include "qemu-common.h"
+
+/* The APIs for block request queue on qemu block layer.
+ */
+
+struct BlockQueueAIOCB {
+    BlockDriverAIOCB common;
+    QTAILQ_ENTRY(BlockQueueAIOCB) entry;
+    BlockRequestHandler *handler;
+    BlockDriverAIOCB *real_acb;
+
+    int64_t sector_num;
+    QEMUIOVector *qiov;
+    int nb_sectors;
+};
+
+typedef struct BlockQueueAIOCB BlockQueueAIOCB;
+
+struct BlockQueue {
+    QTAILQ_HEAD(requests, BlockQueueAIOCB) requests;
+    bool req_failed;
+    bool flushing;
+};
+
+static void qemu_block_queue_dequeue(BlockQueue *queue,
+                                     BlockQueueAIOCB *request)
+{
+    BlockQueueAIOCB *req;
+
+    assert(queue);
+    while (!QTAILQ_EMPTY(&queue->requests)) {
+        req = QTAILQ_FIRST(&queue->requests);
+        if (req == request) {
+            QTAILQ_REMOVE(&queue->requests, req, entry);
+            break;
+        }
+    }
+}
+
+static void qemu_block_queue_cancel(BlockDriverAIOCB *acb)
+{
+    BlockQueueAIOCB *request = container_of(acb, BlockQueueAIOCB, common);
+    if (request->real_acb) {
+        bdrv_aio_cancel(request->real_acb);
+    } else {
+        assert(request->common.bs->block_queue);
+        qemu_block_queue_dequeue(request->common.bs->block_queue,
+                                 request);
+    }
+
+    qemu_aio_release(request);
+}
+
+static AIOPool block_queue_pool = {
+    .aiocb_size = sizeof(struct BlockQueueAIOCB),
+    .cancel     = qemu_block_queue_cancel,
+};
+
+static void qemu_block_queue_callback(void *opaque, int ret)
+{
+    BlockQueueAIOCB *acb = opaque;
+
+    if (acb->common.cb) {
+        acb->common.cb(acb->common.opaque, ret);
+    }
+
+    qemu_aio_release(acb);
+}
+
+BlockQueue *qemu_new_block_queue(void)
+{
+    BlockQueue *queue;
+
+    queue = g_malloc0(sizeof(BlockQueue));
+
+    QTAILQ_INIT(&queue->requests);
+
+    queue->req_failed = true;
+    queue->flushing   = false;
+
+    return queue;
+}
+
+void qemu_del_block_queue(BlockQueue *queue)
+{
+    BlockQueueAIOCB *request, *next;
+
+    QTAILQ_FOREACH_SAFE(request, &queue->requests, entry, next) {
+        QTAILQ_REMOVE(&queue->requests, request, entry);
+        qemu_aio_release(request);
+    }
+
+    g_free(queue);
+}
+
+BlockDriverAIOCB *qemu_block_queue_enqueue(BlockQueue *queue,
+                                           BlockDriverState *bs,
+                                           BlockRequestHandler *handler,
+                                           int64_t sector_num,
+                                           QEMUIO
CFQ I/O starvation problem triggered by RHEL6.0 KVM guests
This is a report of strange cfq behaviour which seems to be triggered by
QEMU posix aio threads.

Host environment:
  OS: RHEL6.0 KVM/qemu-kvm (with no patch applied)
  IO scheduler: cfq (with the default parameters)

On the host, we were running 3 linux guests to see if I/O from these guests
would be handled fairly by the host; each guest did dd write with
oflag=direct.

Guest virtual disk:
  We used a host local disk which had 3 partitions, and each guest was
  allocated one of these as its dd write target.

So our test was for checking if cfq could keep fairness for the 3 guests
that shared the same disk.

The result (strange starvation):
  Sometimes, one guest dominated cfq for more than 10 sec and requests from
  the other guests were not handled at all during that time. Below is the
  blktrace log, which shows that a request to (8,27) in cfq2068S (*1) is not
  handled at all while cfq2095S and cfq2067S, which hold requests to (8,26),
  are being handled alternately.

  *1) WS 104920578 + 64

Question:
  I guess that cfq_close_cooperator() was being called in an unusual manner.
  If so, do you think that cfq is responsible for keeping fairness for this
  kind of unusual write requests?

Note:
  With RHEL6.1, this problem could not be triggered. But I guess that was
  due to QEMU's block layer updates.
Thanks,
Takuya

--- blktrace log ---
8,16  0  2010  0.275081840  2068  A  WS 104920578 + 64 <- (8,27) 0
8,16  0  2011  0.275082180  2068  Q  WS 104920578 + 64 [qemu-kvm]
8,16  0     0  0.275091369     0  m   N cfq2068S / alloced
8,16  0  2012  0.275091909  2068  G  WS 104920578 + 64 [qemu-kvm]
8,16  0  2013  0.275093352  2068  P   N [qemu-kvm]
8,16  0  2014  0.275094059  2068  I   W 104920578 + 64 [qemu-kvm]
8,16  0     0  0.275094887     0  m   N cfq2068S / insert_request
8,16  0     0  0.275095742     0  m   N cfq2068S / add_to_rr
8,16  0  2015  0.275097194  2068  U   N [qemu-kvm] 1
8,16  2  2073  0.275189462  2095  A  WS 83979688 + 64 <- (8,26) 4
8,16  2  2074  0.275189989  2095  Q  WS 83979688 + 64 [qemu-kvm]
8,16  2  2075  0.275192534  2095  G  WS 83979688 + 64 [qemu-kvm]
8,16  2  2076  0.275193909  2095  I   W 83979688 + 64 [qemu-kvm]
8,16  2     0  0.275195609     0  m   N cfq2095S / insert_request
8,16  2     0  0.275196404     0  m   N cfq2095S / add_to_rr
8,16  2     0  0.275198004     0  m   N cfq2095S / preempt
8,16  2     0  0.275198688     0  m   N cfq2067S / slice expired t=1
8,16  2     0  0.275199631     0  m   N cfq2067S / resid=100
8,16  2     0  0.275200413     0  m   N cfq2067S / sl_used=1
8,16  2     0  0.275201521     0  m   N / served: vt=1671968768 min_vt=1671966720
8,16  2     0  0.275202323     0  m   N cfq2067S / del_from_rr
8,16  2     0  0.275204263     0  m   N cfq2095S / set_active wl_prio:0 wl_type:2
8,16  2     0  0.275205131     0  m   N cfq2095S / fifo=(null)
8,16  2     0  0.275205851     0  m   N cfq2095S / dispatch_insert
8,16  2     0  0.275207121     0  m   N cfq2095S / dispatched a request
8,16  2     0  0.275207873     0  m   N cfq2095S / activate rq, drv=1
8,16  2  2077  0.275208198  2095  D   W 83979688 + 64 [qemu-kvm]
8,16  2  2078  0.275269567  2095  U   N [qemu-kvm] 2
8,16  4   836  0.275483550     0  C   W 83979688 + 64 [0]
8,16  4     0  0.275496745     0  m   N cfq2095S / complete rqnoidle 0
8,16  4     0  0.275497825     0  m   N cfq2095S / set_slice=100
8,16  4     0  0.275499512     0  m   N cfq2095S / arm_idle: 8
8,16  4     0  0.275499862     0  m   N cfq schedule dispatch
8,16  6    85  0.275626195  2067  A  WS 83979752 + 64 <- (8,26) 40064
8,16  6    86  0.275626598  2067  Q  WS 83979752 + 64 [qemu-kvm]
8,16  6    87  0.275628580  2067  G  WS 83979752 + 64 [qemu-kvm]
8,16  6    88  0.275629630  2067  I   W 83979752 + 64 [qemu-kvm]
8,16  6     0  0.275631047     0  m   N cfq2067S / insert_request
8,16  6     0  0.275631730     0  m   N cfq2067S / add_to_rr
8,16  6     0  0.275633567     0  m   N cfq2067S / preempt
8,16  6     0  0.275634275     0  m   N cfq2095S / slice expired t=1
8,16  6     0  0.275635285     0  m   N cfq2095S / resid=100
8,16  6     0  0.275635985     0  m   N cfq2095S / sl_used=1
8,16  6     0  0.275636882     0  m   N / served: vt=1671970816 min_vt=1671968768
8,16  6     0  0.275637585     0  m   N cfq2095S / del_from_rr
8,16  6     0  0.275639382     0  m   N cfq2067S / set_active wl_prio:0 wl_type:2
8,16  6     0  0.275640222     0  m   N cfq2067S / fifo=(null)
8,16  6     0  0.275640809     0  m   N cfq2067S / dispatch_insert
8,16  6     0  0.275641929     0  m   N cfq2067S / dispatched a request
8
Re: [PATCH 02/10] Driver core: Add iommu_ops to bus_type
On Wed, Sep 07, 2011 at 04:37:11PM -0400, Don Dutile wrote:
> On 09/07/2011 03:44 PM, Greg KH wrote:
>> Why are you pushing this down into the driver core? What other busses
>> besides PCI use/need this?
>>
>> If you can have a different IOMMU driver on the same bus, then wouldn't
>> this be a per-device thing instead of a per-bus thing?
>>
> And given the dma api takes a struct device *, it'd be more efficient
> to be tied into the device structure.

More efficient in what way? Currently no dma-api implementation uses the
iommu-api directly. It is planned to change that, of course. But the dma-api
implementation will use the iommu_map/iommu_unmap functions, which get an
iommu_domain parameter and no device structure. That is why the domain
structure keeps an iommu_ops pointer too.

> Device structure would get iommu ops set by parent(bus);
> if a bus (segment) doesn't provide a unique/different/layered IOMMU
> than the parent bus, it inherits the parent's iommu-ops.
> Setting the iommu-ops in the root bus struct seeds the iommu-ops
> for the (PCI) tree.
>
> For intel & amd IOMMUs, in early pci (bios, root?) init, you would
> seed the pci root busses with appropriate IOMMU support (based on
> dmar/drhd & ivrs/ivhd data structures, respectively), and
> then modify the PCI code to do the inheritance (PPB code inherits
> unless a specific device driver for a given PPB vid-did loads a
> different iommu-ops for that segment/branch).
>
> This would enable different types of IOMMUs for different devices
> (or PCI segments, or branches of PCI trees) that are designed for
> different tasks -- simple IOMMUs for legacy devices; complicated,
> io-page-faulting IOMMUs for plug-in, high-end devices on virtualizing
> servers for PCI (SRIOV) endpoints.

This only works for pci, but the iommu-api is not pci-only.

> and as Greg indicates, is only relevant to PCI.

We already have non-pci iommu drivers implementing the iommu-api. So this
is certainly relevant outside pci too.
	Joerg
Re: [PATCH 02/10] Driver core: Add iommu_ops to bus_type
On Wed, Sep 07, 2011 at 12:44:45PM -0700, Greg KH wrote:
> On Wed, Sep 07, 2011 at 09:19:19PM +0200, Joerg Roedel wrote:
> > Hi Greg,
> >
> > the bus_set_iommu() function will be called by the IOMMU driver. There
> > can be different drivers for the same bus, depending on the hardware. On
> > PCI for example, there can be the Intel or the AMD IOMMU driver that
> > implement the iommu-api and that register for that bus.
>
> Why are you pushing this down into the driver core? What other busses
> besides PCI use/need this?

Currently it is the platform_bus besides pci. The pci iommus are on x86 and
ia64 while all arm iommus use the platform_bus (by 'iommus' I only mean
those implementing the iommu-api). Currently there are two drivers for arm
iommus in /drivers/iommu.

> If you can have a different IOMMU driver on the same bus, then wouldn't
> this be a per-device thing instead of a per-bus thing?

Well, I haven't seen a system yet where multiple iommus are on the same
bus. Or to state it better, multiple iommus of different type that require
different drivers. There is no 1-1 mapping between real hardware iommus and
iommu_ops. There is only such a mapping for iommu drivers and iommu_ops. An
iommu driver usually handles all hardware iommus of the same type in the
system. So having iommu_ops per-device doesn't make much sense at this
point. With this patch-set they are accessible by dev->bus->iommu_ops
anyway. But if I am wrong on this I can change it, of course.

This patch-set improves the current situation, where only one iommu driver
is allowed to be active on a system (because of the global iommu_ops). But
the main reason to put this into the bus_type structure is that it allows
generalizing the device handling on a bus between iommu drivers.

> > On Wed, Sep 07, 2011 at 11:47:50AM -0700, Greg KH wrote:
> > > > +#ifdef CONFIG_IOMMU_API
> > > > +int bus_set_iommu(struct bus_type *bus, struct iommu_ops *ops)
> > > > +{
> > > > +	if (bus->iommu_ops != NULL)
> > > > +		return -EBUSY;
> > >
> > > Busy?
> >
> > Yes, it signals to the IOMMU driver that another driver has already
> > registered for that bus. In the previous register_iommu() interface this
> > was just a BUG(), but I think returning an error to the caller is
> > better. It can be turned back into a BUG() if that is considered better,
> > though.
>
> Can you ever have more than one IOMMU driver per bus? If so, this seems
> wrong (see above.)

As I said, I haven't seen such systems. But if they exist or are planned I
am happy to redesign the whole thing.

> > The IOMMUs are usually devices on the bus itself, so they are
> > initialized after the bus is set up and the devices on it are
> > populated. So the function can not be called on bus initialization
> > because the IOMMU is not ready at this point.
>
> Ok, that makes more sense, please state as much in the documentation.

Will do, thanks.

	Joerg
Re: [PATCH 08/13] xen/pvticketlock: disable interrupts while blocking
On 09/07/2011 10:09 PM, Jeremy Fitzhardinge wrote:
> On 09/07/2011 10:41 AM, Avi Kivity wrote:
>>> Hm, I'm interested to know what you're thinking in more detail. Can you
>>> leave an NMI pending before you block in the same way you can with
>>> "sti;halt" with normal interrupts?
>>
>> Nope. But you can do
>>
>>   if (regs->rip in critical section)
>>       regs->rip = after_halt;
>>
>> and effectively emulate it. The critical section is something like
>>
>>   critical_section_start:
>>       if (woken_up)
>>           goto critical_section_end;
>>       hlt
>>   critical_section_end:
>
> Hm. It's a pity you have to deliver an actual interrupt to implement the
> kick though.

I don't think it's that expensive, especially compared to the
double-context-switch and vmexit of the spinner going to sleep.

On AMD we do have to take an extra vmexit (on IRET) though.

>>> I was thinking you might want to do something with monitor/mwait to
>>> implement the blocking/kick ops. (Handwave)
>>
>> monitor/mwait are incredibly expensive to virtualize since they
>> require write-protecting a page, IPIs flying everywhere and flushing
>> tlbs, not to mention my lovely hugepages being broken up mercilessly.
>
> Or what about a futex-like hypercall?

Well, we could have a specialized sleep/wakeup hypercall pair like Xen, but
I'd like to avoid it if at all possible.

-- 
I have a truly marvellous patch that fixes the bug which this signature is
too narrow to contain.
Re: [RFC PATCH 4/5] VFIO: Add PCI device support
On 09/07/2011 09:55 PM, Konrad Rzeszutek Wilk wrote:
> >       If you don't know what to do here, say N.
> > +
> > +menuconfig VFIO_PCI
> > +       bool "VFIO support for PCI devices"
> > +       depends on VFIO && PCI
> > +       default y if X86
>
> Hahah.. And Linus is going to tear your behind for that. Default should
> be 'n'

It depends on VFIO, which presumably defaults to n.

-- 
I have a truly marvellous patch that fixes the bug which this signature is
too narrow to contain.