OOM kernel panic on bootup in debian/ubuntu guests
Hi, I'm not sure if this is the correct venue for this question; feel free to redirect. When I start debian or ubuntu guests on a debian host running KVM 0.12.5, my guests kernel-panic immediately after bootup -- the OOM killer kills every process on the system and then dies itself. This problem only occurs if I enable the virtio-balloon driver -- without it, the guests run just fine. It also does not happen on CentOS guests, which run fine and balloon up/down happily. The problem occurs just after the system gives me a getty -- if I'm attached to the guest console, I see a login prompt but never get a chance to use it. I've tried compiling a vanilla kernel on an ubuntu guest, but the problem persisted. I'm not sure where to go from here, so any tips on debugging the problem are appreciated. --Igor
Re: [PATCHv4 15/15] Pass boot device list to firmware.
On Mon, Nov 15, 2010 at 08:29:24PM +, Blue Swirl wrote: > 2010/11/15 Gleb Natapov : > > On Sun, Nov 14, 2010 at 10:50:13PM +, Blue Swirl wrote: > >> On Sun, Nov 14, 2010 at 3:39 PM, Gleb Natapov wrote: > >> > > >> > Signed-off-by: Gleb Natapov > >> > --- > >> > hw/fw_cfg.c | 14 ++ > >> > hw/fw_cfg.h | 4 +++- > >> > sysemu.h | 1 + > >> > vl.c | 51 +++ > >> > 4 files changed, 69 insertions(+), 1 deletions(-) > >> > > >> > diff --git a/hw/fw_cfg.c b/hw/fw_cfg.c > >> > index 7b9434f..f6a67db 100644 > >> > --- a/hw/fw_cfg.c > >> > +++ b/hw/fw_cfg.c > >> > @@ -53,6 +53,7 @@ struct FWCfgState { > >> > FWCfgFiles *files; > >> > uint16_t cur_entry; > >> > uint32_t cur_offset; > >> > + Notifier machine_ready; > >> > }; > >> > > >> > static void fw_cfg_write(FWCfgState *s, uint8_t value) > >> > @@ -315,6 +316,15 @@ int fw_cfg_add_file(FWCfgState *s, const char > >> > *filename, uint8_t *data, > >> > return 1; > >> > } > >> > > >> > +static void fw_cfg_machine_ready(struct Notifier* n) > >> > +{ > >> > + uint32_t len; > >> > + char *bootindex = get_boot_devices_list(&len); > >> > + > >> > + fw_cfg_add_bytes(container_of(n, FWCfgState, machine_ready), > >> > + FW_CFG_BOOTINDEX, (uint8_t*)bootindex, len); > >> > +} > >> > + > >> > FWCfgState *fw_cfg_init(uint32_t ctl_port, uint32_t data_port, > >> > target_phys_addr_t ctl_addr, target_phys_addr_t > >> > data_addr) > >> > { > >> > @@ -343,6 +353,10 @@ FWCfgState *fw_cfg_init(uint32_t ctl_port, uint32_t > >> > data_port, > >> > fw_cfg_add_i16(s, FW_CFG_MAX_CPUS, (uint16_t)max_cpus); > >> > fw_cfg_add_i16(s, FW_CFG_BOOT_MENU, (uint16_t)boot_menu); > >> > > >> > + > >> > + s->machine_ready.notify = fw_cfg_machine_ready; > >> > + qemu_add_machine_init_done_notifier(&s->machine_ready); > >> > + > >> > return s; > >> > } > >> > > >> > diff --git a/hw/fw_cfg.h b/hw/fw_cfg.h > >> > index 856bf91..4d61410 100644 > >> > --- a/hw/fw_cfg.h > >> > +++ b/hw/fw_cfg.h > >> > @@ -30,7 +30,9 @@ > >> > > >> > #define FW_CFG_FILE_FIRST 0x20 > >> > 
#define FW_CFG_FILE_SLOTS 0x10 > >> > -#define FW_CFG_MAX_ENTRY (FW_CFG_FILE_FIRST+FW_CFG_FILE_SLOTS) > >> > +#define FW_CFG_FILE_LAST_SLOT (FW_CFG_FILE_FIRST+FW_CFG_FILE_SLOTS) > >> > +#define FW_CFG_BOOTINDEX (FW_CFG_FILE_LAST_SLOT + 1) > >> > +#define FW_CFG_MAX_ENTRY FW_CFG_BOOTINDEX > >> > >> This should be > >> #define FW_CFG_MAX_ENTRY (FW_CFG_BOOTINDEX + 1) > >> because the check is like this: > >> if ((key & FW_CFG_ENTRY_MASK) >= FW_CFG_MAX_ENTRY) { > >> s->cur_entry = FW_CFG_INVALID; > >> > > Yeah, will fix. > > > >> With that change, I got the bootindex passed to OpenBIOS: > >> OpenBIOS for Sparc64 > >> Configuration device id QEMU version 1 machine id 0 > >> kernel cmdline > >> CPUs: 1 x SUNW,UltraSPARC-IIi > >> UUID: ---- > >> bootindex num_strings 1 > >> bootindex /p...@01fe/i...@5/dr...@1/d...@0 > >> > >> The device path does not match exactly, but it's close: > >> /p...@1fe,0/pci-...@5/i...@600/d...@0 > > > > pbm->pci should be solvable by the patch at the end. Where in the spec > > is it allowed to abbreviate 1fe as 1fe,0? The spec allows dropping > > leading zeroes, but the TARGET_FMT_plx definition in targphys.h has a 0 after > > the %. I can define another one without leading zeroes. Can you suggest > > a name? > > I think OpenBIOS for Sparc64 is not correct here, so it may be a bad > reference architecture. OBP on a real Ultra-5 used a path like this: > /p...@1f,0/p...@1,1/i...@3/d...@0,0 > > p...@1f,0 specifies the PCI host bridge at UPA bus port ID of 0x1f. According to the device name qemu creates, the pci controller is memory mapped at address 1fe, and by looking at the code I can see that this is indeed the case. How does UPA naming work? > p...@1,1 specifies a PCI-PCI bridge. > > > TARGET_FMT_lx is poisoned. As for ATA, there is no open firmware > > binding spec, so everyone does what they please. I based my > > implementation on what open firmware shows when running on qemu x86.
> > "pci-ata" should be "ide" according to PCI binding spec :) > Yes, for example there is no ATA in the Ultra-5 tree but in UltraAX it exists: > /p...@1f,4000/i...@3/a...@0,0/c...@0,0 > > > diff --git a/hw/apb_pci.c b/hw/apb_pci.c > > index c619112..643aa49 100644 > > --- a/hw/apb_pci.c > > +++ b/hw/apb_pci.c > > @@ -453,6 +453,7 @@ static PCIDeviceInfo pbm_pci_host_info = { > > > > static SysBusDeviceInfo pbm_host_info = { > > .qdev.name = "pbm", > > + .qdev.fw_name = "pci", > Perhaps the FW path should use device class names if no name is specified. What do you mean by "device class name"? We can do something like this: if (dev->child_bus.lh_first) return dev->child_bus.lh_first->info->name; i.e. if there is a child bus, use its bus n
Re: [PATCHv4 15/15] Pass boot device list to firmware.
On Tue, Nov 16, 2010 at 09:22:45AM +0200, Gleb Natapov wrote: > On Mon, Nov 15, 2010 at 09:52:19PM -0500, Kevin O'Connor wrote: > > I also have an ulterior motive here. If the boot order is exposed as > > a newline separated list via an entry in QEMU_CFG_FILE_DIR, then this > > becomes free for coreboot users as well. (On coreboot, the boot order > > could be placed in a "file" in flash with no change to the seabios > > code.) > > > You can define a get_boot_order() function and implement it differently > for qemu and coreboot. For coreboot it will be a one-liner. Just call > cbfs_copyfile("bootorder"). BTW why is newline separation important? Sure, but it'd be nice to just use romfile_copy("bootorder"). Using newline separation just makes it easier for users to vi and/or cat the file. -Kevin -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: limit conectivity of a VM
Hello, you may also have a look at VDE (Virtual Distributed Ethernet). You can connect your VMs to virtual switches and then use the tool 'wirefilter'[1] to modify different attributes (bandwidth, loss, delay, etc) of the virtual network. [1] http://wiki.virtualsquare.org/wiki/index.php/VDE Regards, Markus On 19.11.2010 20:47, hadi golestani wrote: > Hello, > I need to limit the port speed of a VM to 10 mbps ( or 5 mbps if it's > possible). > What's the way of doing so? > > Regards
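Another option that needs no VDE setup is shaping on the host side of the guest's tap interface with plain tc. A sketch assuming the VM's tap device is named tap0 (substitute your own device name):

```shell
# Cap egress on the tap device feeding the VM at 10 mbit/s with a
# token-bucket filter (run as root; tap0 is an assumed device name).
tc qdisc add dev tap0 root tbf rate 10mbit burst 32kbit latency 400ms

# Inspect the installed qdisc:
tc qdisc show dev tap0

# Remove the limit again:
tc qdisc del dev tap0 root
```

This only limits traffic leaving the host toward the guest's peer in one direction; shaping ingress needs an ifb device or policing, which is a separate exercise.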
Re: [PATCH v2] device-assignment: register a reset function
On Tue, 2010-11-16 at 15:05 +0100, Bernhard Kohl wrote: > This is necessary because during reboot of a VM the assigned devices > continue DMA transfers, which causes memory corruption. > > Signed-off-by: Thomas Ostler > Signed-off-by: Bernhard Kohl > --- > Changes v1 -> v2: > - use defined macros, e.g. PCI_COMMAND > - write all zero to the command register to disconnect the device logically > --- > hw/device-assignment.c | 12 > 1 files changed, 12 insertions(+), 0 deletions(-) Looks good to me. Acked-by: Alex Williamson > diff --git a/hw/device-assignment.c b/hw/device-assignment.c > index 5f5bde1..8d5a609 100644 > --- a/hw/device-assignment.c > +++ b/hw/device-assignment.c > @@ -1434,6 +1434,17 @@ static void > assigned_dev_unregister_msix_mmio(AssignedDevice *dev) > dev->msix_table_page = NULL; > } > > +static void reset_assigned_device(DeviceState *dev) > +{ > +PCIDevice *d = DO_UPCAST(PCIDevice, qdev, dev); > + > +/* > + * When a 0 is written to the command register, the device is logically > + * disconnected from the PCI bus. This avoids further DMA transfers. > + */ > +assigned_dev_pci_write_config(d, PCI_COMMAND, 0, 2); > +} > + > static int assigned_initfn(struct PCIDevice *pci_dev) > { > AssignedDevice *dev = DO_UPCAST(AssignedDevice, dev, pci_dev); > @@ -1544,6 +1555,7 @@ static PCIDeviceInfo assign_info = { > .qdev.name= "pci-assign", > .qdev.desc= "pass through host pci devices to the guest", > .qdev.size= sizeof(AssignedDevice), > +.qdev.reset = reset_assigned_device, > .init = assigned_initfn, > .exit = assigned_exitfn, > .config_read = assigned_dev_pci_read_config,
[PATCH v3 8/9] device-assignment: Make use of config_map
We can figure out the capability being touched much more quickly and efficiently with the config_map. Use it. Signed-off-by: Alex Williamson --- hw/device-assignment.c | 32 +++- 1 files changed, 19 insertions(+), 13 deletions(-) diff --git a/hw/device-assignment.c b/hw/device-assignment.c index 970ffa1..832c236 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -1254,28 +1254,34 @@ static void assigned_dev_update_msix(PCIDevice *pci_dev, unsigned int ctrl_pos) static void assigned_device_pci_cap_write_config(PCIDevice *pci_dev, uint32_t address, uint32_t val, int len) { -AssignedDevice *assigned_dev = container_of(pci_dev, AssignedDevice, dev); +uint8_t cap_id = pci_dev->config_map[address]; pci_default_write_config(pci_dev, address, val, len); +switch (cap_id) { #ifdef KVM_CAP_IRQ_ROUTING +case PCI_CAP_ID_MSI: #ifdef KVM_CAP_DEVICE_MSI -if (assigned_dev->cap.available & ASSIGNED_DEVICE_CAP_MSI) { -int pos = pci_find_capability(pci_dev, PCI_CAP_ID_MSI); -if (ranges_overlap(address, len, pos + PCI_MSI_FLAGS, 1)) { -assigned_dev_update_msi(pci_dev, pos + PCI_MSI_FLAGS); +{ +uint8_t cap = pci_find_capability(pci_dev, cap_id); +if (ranges_overlap(address - cap, len, PCI_MSI_FLAGS, 1)) { +assigned_dev_update_msi(pci_dev, cap + PCI_MSI_FLAGS); +} } -} #endif +break; + +case PCI_CAP_ID_MSIX: #ifdef KVM_CAP_DEVICE_MSIX -if (assigned_dev->cap.available & ASSIGNED_DEVICE_CAP_MSIX) { -int pos = pci_find_capability(pci_dev, PCI_CAP_ID_MSIX); -if (ranges_overlap(address, len, pos + PCI_MSIX_FLAGS + 1, 1)) { -assigned_dev_update_msix(pci_dev, pos + PCI_MSIX_FLAGS); - } -} +{ +uint8_t cap = pci_find_capability(pci_dev, cap_id); +if (ranges_overlap(address - cap, len, PCI_MSIX_FLAGS + 1, 1)) { +assigned_dev_update_msix(pci_dev, cap + PCI_MSIX_FLAGS); +} +} #endif +break; #endif -return; +} } static int assigned_device_pci_cap_init(PCIDevice *pci_dev)
[PATCH v3 9/9] device-assignment: pass through and stub more PCI caps
Some drivers depend on finding capabilities like power management, PCI express/X, vital product data, or vendor specific fields. Now that we have better capability support, we can pass more of these tables through to the guest. Note that VPD and VNDR are direct pass-through capabilities; the rest are mostly empty shells with a few writable bits where necessary. It may be possible to consolidate dummy capabilities into common files for other drivers to use, but I prefer to leave them here for now as we figure out what bits to handle directly with hardware and what bits are purely emulated. Signed-off-by: Alex Williamson --- hw/device-assignment.c | 213 +--- 1 files changed, 199 insertions(+), 14 deletions(-) diff --git a/hw/device-assignment.c b/hw/device-assignment.c index 832c236..cd62b5a 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -67,6 +67,9 @@ static void assigned_device_pci_cap_write_config(PCIDevice *pci_dev, uint32_t address, uint32_t val, int len); +static uint32_t assigned_device_pci_cap_read_config(PCIDevice *pci_dev, +uint32_t address, int len); + static uint32_t assigned_dev_ioport_rw(AssignedDevRegion *dev_region, uint32_t addr, int len, uint32_t *val) { @@ -370,6 +373,27 @@ static uint8_t assigned_dev_pci_read_byte(PCIDevice *d, int pos) return (uint8_t)assigned_dev_pci_read(d, pos, 1); } +static void assigned_dev_pci_write(PCIDevice *d, int pos, uint32_t val, int len) +{ +AssignedDevice *pci_dev = container_of(d, AssignedDevice, dev); +ssize_t ret; +int fd = pci_dev->real_device.config_fd; + +again: +ret = pwrite(fd, &val, len, pos); +if (ret != len) { + if ((ret < 0) && (errno == EINTR || errno == EAGAIN)) + goto again; + + fprintf(stderr, "%s: pwrite failed, ret = %zd errno = %d\n", + __func__, ret, errno); + + exit(1); +} + +return; +} + static uint8_t pci_find_cap_offset(PCIDevice *d, uint8_t cap) { int id; @@ -453,10 +477,13 @@ static uint32_t assigned_dev_pci_read_config(PCIDevice *d, uint32_t address, ssize_t ret;
AssignedDevice *pci_dev = container_of(d, AssignedDevice, dev); +if (address > PCI_CONFIG_HEADER_SIZE && d->config_map[address]) { +return assigned_device_pci_cap_read_config(d, address, len); +} + if (address < 0x4 || (pci_dev->need_emulate_cmd && address == 0x4) || (address >= 0x10 && address <= 0x24) || address == 0x30 || -address == 0x34 || address == 0x3c || address == 0x3d || -(address > PCI_CONFIG_HEADER_SIZE && d->config_map[address])) { +address == 0x34 || address == 0x3c || address == 0x3d) { val = pci_default_read_config(d, address, len); DEBUG("(%x.%x): address=%04x val=0x%08x len=%d\n", (d->devfn >> 3) & 0x1F, (d->devfn & 0x7), address, val, len); @@ -1251,36 +1278,72 @@ static void assigned_dev_update_msix(PCIDevice *pci_dev, unsigned int ctrl_pos) #endif #endif -static void assigned_device_pci_cap_write_config(PCIDevice *pci_dev, uint32_t address, +static uint32_t assigned_device_pci_cap_read_config(PCIDevice *pci_dev, +uint32_t address, int len) +{ +uint8_t cap, cap_id = pci_dev->config_map[address]; + +switch (cap_id) { + +case PCI_CAP_ID_VPD: +cap = pci_find_capability(pci_dev, cap_id); +if (!ranges_overlap(address, len, cap, PCI_CAP_FLAGS)) { +return assigned_dev_pci_read(pci_dev, address, len); +} +break; + +case PCI_CAP_ID_VNDR: +cap = pci_find_capability(pci_dev, cap_id); +if (!ranges_overlap(address, len, cap, PCI_CAP_FLAGS + 1)) { +return assigned_dev_pci_read(pci_dev, address, len); +} +break; +} + +return pci_default_read_config(pci_dev, address, len); +} + +static void assigned_device_pci_cap_write_config(PCIDevice *pci_dev, + uint32_t address, uint32_t val, int len) { -uint8_t cap_id = pci_dev->config_map[address]; +uint8_t cap, cap_id = pci_dev->config_map[address]; pci_default_write_config(pci_dev, address, val, len); switch (cap_id) { #ifdef KVM_CAP_IRQ_ROUTING case PCI_CAP_ID_MSI: #ifdef KVM_CAP_DEVICE_MSI -{ -uint8_t cap = pci_find_capability(pci_dev, cap_id); -if (ranges_overlap(address - cap, len, PCI_MSI_FLAGS, 1)) { 
-assigned_dev_update_msi(pci_dev, cap + PCI_MSI_FLAGS); -} +cap = pci_find_capability(pci_dev, cap_id); +if (ranges_overlap(address - cap, len, PCI_MSI_FLAGS, 1)) { +assigned_dev_update_msi(pci_dev,
[PATCH v3 6/9] device-assignment: Move PCI capabilities to match physical hardware
Now that common PCI code doesn't have a hangup on capabilities being contiguous, move assigned device capabilities to match their offset on physical hardware. This helps for drivers that assume a capability configuration and don't bother searching. We can also remove several calls to assigned_dev_pci_read_* because we're overlaying the capability at the same location as the initial copy we made of config space. We can therefore just use pci_get_*. Signed-off-by: Alex Williamson --- hw/device-assignment.c | 65 +++- 1 files changed, 20 insertions(+), 45 deletions(-) diff --git a/hw/device-assignment.c b/hw/device-assignment.c index 975d3cb..6314773 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -366,16 +366,6 @@ static uint8_t assigned_dev_pci_read_byte(PCIDevice *d, int pos) return (uint8_t)assigned_dev_pci_read(d, pos, 1); } -static uint16_t assigned_dev_pci_read_word(PCIDevice *d, int pos) -{ -return (uint16_t)assigned_dev_pci_read(d, pos, 2); -} - -static uint32_t assigned_dev_pci_read_long(PCIDevice *d, int pos) -{ -return assigned_dev_pci_read(d, pos, 4); -} - static uint8_t pci_find_cap_offset(PCIDevice *d, uint8_t cap) { int id; @@ -1285,6 +1275,7 @@ static int assigned_device_pci_cap_init(PCIDevice *pci_dev) { AssignedDevice *dev = container_of(pci_dev, AssignedDevice, dev); PCIRegion *pci_region = dev->real_device.regions; +int pos; /* Clear initial capabilities pointer and status copied from hw */ pci_set_byte(pci_dev->config + PCI_CAPABILITY_LIST, 0); @@ -1296,60 +1287,44 @@ static int assigned_device_pci_cap_init(PCIDevice *pci_dev) #ifdef KVM_CAP_DEVICE_MSI /* Expose MSI capability * MSI capability is the 1st capability in capability config */ -if (pci_find_cap_offset(pci_dev, PCI_CAP_ID_MSI)) { -int vpos, ppos; -uint16_t flags; - +if ((pos = pci_find_cap_offset(pci_dev, PCI_CAP_ID_MSI))) { dev->cap.available |= ASSIGNED_DEVICE_CAP_MSI; -vpos = pci_add_capability(pci_dev, PCI_CAP_ID_MSI, 0, - PCI_CAPABILITY_CONFIG_MSI_LENGTH); - 
-memset(pci_dev->config + vpos + PCI_CAP_FLAGS, 0, - PCI_CAPABILITY_CONFIG_MSI_LENGTH - PCI_CAP_FLAGS); +pci_add_capability(pci_dev, PCI_CAP_ID_MSI, pos, + PCI_CAPABILITY_CONFIG_MSI_LENGTH); /* Only 32-bit/no-mask currently supported */ -ppos = pci_find_cap_offset(pci_dev, PCI_CAP_ID_MSI); -flags = assigned_dev_pci_read_word(pci_dev, ppos + PCI_MSI_FLAGS); -flags &= PCI_MSI_FLAGS_QMASK; -pci_set_word(pci_dev->config + vpos + PCI_MSI_FLAGS, flags); +pci_set_word(pci_dev->config + pos + PCI_MSI_FLAGS, + pci_get_word(pci_dev->config + pos + PCI_MSI_FLAGS) & + PCI_MSI_FLAGS_QMASK); +pci_set_long(pci_dev->config + pos + PCI_MSI_ADDRESS_LO, 0); +pci_set_word(pci_dev->config + pos + PCI_MSI_DATA_32, 0); /* Set writable fields */ -pci_set_word(pci_dev->wmask + vpos + PCI_MSI_FLAGS, +pci_set_word(pci_dev->wmask + pos + PCI_MSI_FLAGS, PCI_MSI_FLAGS_QSIZE | PCI_MSI_FLAGS_ENABLE); -pci_set_long(pci_dev->wmask + vpos + PCI_MSI_ADDRESS_LO, 0xfffc); -pci_set_long(pci_dev->wmask + vpos + PCI_MSI_DATA_32, 0x); +pci_set_long(pci_dev->wmask + pos + PCI_MSI_ADDRESS_LO, 0xfffc); +pci_set_word(pci_dev->wmask + pos + PCI_MSI_DATA_32, 0x); } #endif #ifdef KVM_CAP_DEVICE_MSIX /* Expose MSI-X capability */ -if (pci_find_cap_offset(pci_dev, PCI_CAP_ID_MSIX)) { -int vpos, ppos, entry_nr, bar_nr; +if ((pos = pci_find_cap_offset(pci_dev, PCI_CAP_ID_MSIX))) { +int bar_nr; uint32_t msix_table_entry; dev->cap.available |= ASSIGNED_DEVICE_CAP_MSIX; -vpos = pci_add_capability(pci_dev, PCI_CAP_ID_MSIX, 0, +pci_add_capability(pci_dev, PCI_CAP_ID_MSIX, pos, PCI_CAPABILITY_CONFIG_MSIX_LENGTH); -memset(pci_dev->config + vpos + PCI_CAP_FLAGS, 0, - PCI_CAPABILITY_CONFIG_MSIX_LENGTH - PCI_CAP_FLAGS); +pci_set_word(pci_dev->config + pos + PCI_MSIX_FLAGS, + pci_get_word(pci_dev->config + pos + PCI_MSIX_FLAGS) & + PCI_MSIX_TABSIZE); /* Only enable and function mask bits are writable */ -pci_set_word(pci_dev->wmask + vpos + PCI_MSIX_FLAGS, +pci_set_word(pci_dev->wmask + pos + PCI_MSIX_FLAGS, 
PCI_MSIX_FLAGS_ENABLE | PCI_MSIX_FLAGS_MASKALL); -ppos = pci_find_cap_offset(pci_dev, PCI_CAP_ID_MSIX); - -entry_nr = assigned_dev_pci_read_word(pci_dev, ppos + PCI_MSIX_FLAGS); -entry_nr &= PCI_MSIX_TABSIZE; -pci_set_word(pci_dev->config + vpos + PCI_MSIX_FLAGS, entry_nr); - -msix_table_entry = assigned_
[PATCH v3 7/9] pci: Remove capability specific handlers
Drivers can break these out on their own if they need to. Signed-off-by: Alex Williamson --- hw/device-assignment.c | 16 - hw/pci.c | 61 ++-- hw/pci.h | 19 --- 3 files changed, 13 insertions(+), 83 deletions(-) diff --git a/hw/device-assignment.c b/hw/device-assignment.c index 6314773..970ffa1 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -63,6 +63,10 @@ static void assigned_dev_load_option_rom(AssignedDevice *dev); static void assigned_dev_unregister_msix_mmio(AssignedDevice *dev); +static void assigned_device_pci_cap_write_config(PCIDevice *pci_dev, + uint32_t address, + uint32_t val, int len); + static uint32_t assigned_dev_ioport_rw(AssignedDevRegion *dev_region, uint32_t addr, int len, uint32_t *val) { @@ -406,14 +410,17 @@ static void assigned_dev_pci_write_config(PCIDevice *d, uint32_t address, ((d->devfn >> 3) & 0x1F), (d->devfn & 0x7), (uint16_t) address, val, len); +if (address > PCI_CONFIG_HEADER_SIZE && d->config_map[address]) { +return assigned_device_pci_cap_write_config(d, address, val, len); +} + if (address == 0x4) { pci_default_write_config(d, address, val, len); /* Continue to program the card */ } if ((address >= 0x10 && address <= 0x24) || address == 0x30 || -address == 0x34 || address == 0x3c || address == 0x3d || -(address > PCI_CONFIG_HEADER_SIZE && d->config_map[address])) { +address == 0x34 || address == 0x3c || address == 0x3d) { /* used for update-mappings (BAR emulation) */ pci_default_write_config(d, address, val, len); return; @@ -1249,7 +1256,7 @@ static void assigned_device_pci_cap_write_config(PCIDevice *pci_dev, uint32_t ad { AssignedDevice *assigned_dev = container_of(pci_dev, AssignedDevice, dev); -pci_default_cap_write_config(pci_dev, address, val, len); +pci_default_write_config(pci_dev, address, val, len); #ifdef KVM_CAP_IRQ_ROUTING #ifdef KVM_CAP_DEVICE_MSI if (assigned_dev->cap.available & ASSIGNED_DEVICE_CAP_MSI) { @@ -1478,9 +1485,6 @@ static int assigned_initfn(struct PCIDevice *pci_dev) 
dev->h_busnr = dev->host.bus; dev->h_devfn = PCI_DEVFN(dev->host.dev, dev->host.func); -pci_register_capability_handlers(pci_dev, NULL, - assigned_device_pci_cap_write_config); - if (assigned_device_pci_cap_init(pci_dev) < 0) goto out; diff --git a/hw/pci.c b/hw/pci.c index 1cf62b6..e529793 100644 --- a/hw/pci.c +++ b/hw/pci.c @@ -772,8 +772,6 @@ static PCIDevice *do_pci_register_device(PCIDevice *pci_dev, PCIBus *bus, config_write = pci_default_write_config; pci_dev->config_read = config_read; pci_dev->config_write = config_write; -pci_dev->cap.config_read = pci_default_cap_read_config; -pci_dev->cap.config_write = pci_default_cap_write_config; bus->devices[devfn] = pci_dev; pci_dev->irq = qemu_allocate_irqs(pci_set_irq, pci_dev, PCI_NUM_PINS); pci_dev->version_id = 2; /* Current pci device vmstate version */ @@ -1070,57 +1068,21 @@ static void pci_update_irq_disabled(PCIDevice *d, int was_irq_disabled) } } -static uint32_t pci_read_config(PCIDevice *d, -uint32_t address, int len) +uint32_t pci_default_read_config(PCIDevice *d, + uint32_t address, int len) { uint32_t val = 0; - +assert(len == 1 || len == 2 || len == 4); len = MIN(len, pci_config_size(d) - address); memcpy(&val, d->config + address, len); return le32_to_cpu(val); } -uint32_t pci_default_read_config(PCIDevice *d, - uint32_t address, int len) -{ -assert(len == 1 || len == 2 || len == 4); - -if (address > PCI_CONFIG_HEADER_SIZE && d->config_map[address]) { -return d->cap.config_read(d, address, len); -} - -return pci_read_config(d, address, len); -} - -uint32_t pci_default_cap_read_config(PCIDevice *pci_dev, - uint32_t address, int len) -{ -return pci_read_config(pci_dev, address, len); -} - -void pci_default_cap_write_config(PCIDevice *pci_dev, - uint32_t address, uint32_t val, int len) -{ -uint32_t config_size = pci_config_size(pci_dev); -int i; - -for (i = 0; i < len && address + i < config_size; val >>= 8, ++i) { -uint8_t wmask = pci_dev->wmask[address + i]; -pci_dev->config[address + i] = 
-(pci_dev->config[address + i] & ~wmask) | (val & wmask); -} -} - void pci_default_write_config(PCIDevice *d, uint32_t addr, uint32_t val, int l) { int i, was_irq_disabled = pci_irq_disabled(d); uint32_t config_size =
[PATCH v3 5/9] pci: Remove cap.length, cap.start, cap.supported
Capabilities aren't required to be contiguous, so cap.length never really made much sense. Likewise, cap.start is mostly meaningless too. Both of these are better served by the capability map. We can also get rid of cap.supported, since it's really now unused and redundant with flag in the status word anyway. Signed-off-by: Alex Williamson --- hw/device-assignment.c |4 hw/pci.c |3 --- hw/pci.h |2 -- 3 files changed, 0 insertions(+), 9 deletions(-) diff --git a/hw/device-assignment.c b/hw/device-assignment.c index a4fad60..975d3cb 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -1292,8 +1292,6 @@ static int assigned_device_pci_cap_init(PCIDevice *pci_dev) pci_get_word(pci_dev->config + PCI_STATUS) & ~PCI_STATUS_CAP_LIST); -pci_dev->cap.length = 0; - #ifdef KVM_CAP_IRQ_ROUTING #ifdef KVM_CAP_DEVICE_MSI /* Expose MSI capability @@ -1320,7 +1318,6 @@ static int assigned_device_pci_cap_init(PCIDevice *pci_dev) PCI_MSI_FLAGS_QSIZE | PCI_MSI_FLAGS_ENABLE); pci_set_long(pci_dev->wmask + vpos + PCI_MSI_ADDRESS_LO, 0xfffc); pci_set_long(pci_dev->wmask + vpos + PCI_MSI_DATA_32, 0x); -pci_dev->cap.length += PCI_CAPABILITY_CONFIG_MSI_LENGTH; } #endif #ifdef KVM_CAP_DEVICE_MSIX @@ -1356,7 +1353,6 @@ static int assigned_device_pci_cap_init(PCIDevice *pci_dev) bar_nr = msix_table_entry & PCI_MSIX_BIR; msix_table_entry &= ~PCI_MSIX_BIR; dev->msix_table_addr = pci_region[bar_nr].base_addr + msix_table_entry; -pci_dev->cap.length += PCI_CAPABILITY_CONFIG_MSIX_LENGTH; } #endif #endif diff --git a/hw/pci.c b/hw/pci.c index cbe6fb7..1cf62b6 100644 --- a/hw/pci.c +++ b/hw/pci.c @@ -1909,8 +1909,6 @@ int pci_add_capability(PCIDevice *pdev, uint8_t cap_id, memset(pdev->cmask + offset, 0xFF, size); pdev->config[PCI_STATUS] |= PCI_STATUS_CAP_LIST; -pdev->cap.supported = 1; -pdev->cap.start = pdev->cap.start ? 
MIN(pdev->cap.start, offset) : offset; return offset; } @@ -1931,7 +1929,6 @@ void pci_del_capability(PCIDevice *pdev, uint8_t cap_id, uint8_t size) if (!pdev->config[PCI_CAPABILITY_LIST]) { pdev->config[PCI_STATUS] &= ~PCI_STATUS_CAP_LIST; -pdev->cap.start = pdev->cap.length = 0; } } diff --git a/hw/pci.h b/hw/pci.h index bd15b43..146f81d 100644 --- a/hw/pci.h +++ b/hw/pci.h @@ -220,8 +220,6 @@ struct PCIDevice { /* Device capability configuration space */ struct { -int supported; -unsigned int start, length; PCICapConfigReadFunc *config_read; PCICapConfigWriteFunc *config_write; } cap;
[PATCH v3 4/9] pci: Replace used bitmap with config byte map
Capabilities are allocated in bytes, so we can track both whether a byte is used and by what capability in the same structure. Remove pci_reserve_capability() as there are no users, remove pci_access_cap_config() since it's now a trivial lookup. Signed-off-by: Alex Williamson --- hw/device-assignment.c |4 ++-- hw/pci.c | 30 +- hw/pci.h |8 ++-- 3 files changed, 13 insertions(+), 29 deletions(-) diff --git a/hw/device-assignment.c b/hw/device-assignment.c index 76aacac..a4fad60 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -423,7 +423,7 @@ static void assigned_dev_pci_write_config(PCIDevice *d, uint32_t address, if ((address >= 0x10 && address <= 0x24) || address == 0x30 || address == 0x34 || address == 0x3c || address == 0x3d || -pci_access_cap_config(d, address, len)) { +(address > PCI_CONFIG_HEADER_SIZE && d->config_map[address])) { /* used for update-mappings (BAR emulation) */ pci_default_write_config(d, address, val, len); return; @@ -459,7 +459,7 @@ static uint32_t assigned_dev_pci_read_config(PCIDevice *d, uint32_t address, if (address < 0x4 || (pci_dev->need_emulate_cmd && address == 0x4) || (address >= 0x10 && address <= 0x24) || address == 0x30 || address == 0x34 || address == 0x3c || address == 0x3d || -pci_access_cap_config(d, address, len)) { +(address > PCI_CONFIG_HEADER_SIZE && d->config_map[address])) { val = pci_default_read_config(d, address, len); DEBUG("(%x.%x): address=%04x val=0x%08x len=%d\n", (d->devfn >> 3) & 0x1F, (d->devfn & 0x7), address, val, len); diff --git a/hw/pci.c b/hw/pci.c index f2896d9..cbe6fb7 100644 --- a/hw/pci.c +++ b/hw/pci.c @@ -712,7 +712,7 @@ static void pci_config_alloc(PCIDevice *pci_dev) pci_dev->cmask = qemu_mallocz(config_size); pci_dev->wmask = qemu_mallocz(config_size); pci_dev->w1cmask = qemu_mallocz(config_size); -pci_dev->used = qemu_mallocz(config_size); +pci_dev->config_map = qemu_mallocz(config_size); } static void pci_config_free(PCIDevice *pci_dev) @@ -721,7 +721,7 @@ static void 
pci_config_free(PCIDevice *pci_dev) qemu_free(pci_dev->cmask); qemu_free(pci_dev->wmask); qemu_free(pci_dev->w1cmask); -qemu_free(pci_dev->used); +qemu_free(pci_dev->config_map); } /* -1 for devfn means auto assign */ @@ -751,6 +751,8 @@ static PCIDevice *do_pci_register_device(PCIDevice *pci_dev, PCIBus *bus, pci_dev->irq_state = 0; pci_config_alloc(pci_dev); +memset(pci_dev->config_map, 0xff, PCI_CONFIG_HEADER_SIZE); + if (!is_bridge) { pci_set_default_subsystem_id(pci_dev); } @@ -1083,21 +1085,13 @@ uint32_t pci_default_read_config(PCIDevice *d, { assert(len == 1 || len == 2 || len == 4); -if (pci_access_cap_config(d, address, len)) { +if (address > PCI_CONFIG_HEADER_SIZE && d->config_map[address]) { return d->cap.config_read(d, address, len); } return pci_read_config(d, address, len); } -int pci_access_cap_config(PCIDevice *pci_dev, uint32_t address, int len) -{ -if (pci_dev->cap.supported && address >= pci_dev->cap.start && -(address + len) < pci_dev->cap.start + pci_dev->cap.length) -return 1; -return 0; -} - uint32_t pci_default_cap_read_config(PCIDevice *pci_dev, uint32_t address, int len) { @@ -1122,7 +1116,7 @@ void pci_default_write_config(PCIDevice *d, uint32_t addr, uint32_t val, int l) int i, was_irq_disabled = pci_irq_disabled(d); uint32_t config_size = pci_config_size(d); -if (pci_access_cap_config(d, addr, l)) { +if (addr > PCI_CONFIG_HEADER_SIZE && d->config_map[addr]) { d->cap.config_write(d, addr, val, l); return; } @@ -1789,7 +1783,7 @@ static int pci_find_space(PCIDevice *pdev, uint8_t size) int offset = PCI_CONFIG_HEADER_SIZE; int i; for (i = PCI_CONFIG_HEADER_SIZE; i < config_size; ++i) -if (pdev->used[i]) +if (pdev->config_map[i]) offset = i + 1; else if (i - offset + 1 == size) return offset; @@ -1908,7 +1902,7 @@ int pci_add_capability(PCIDevice *pdev, uint8_t cap_id, config[PCI_CAP_LIST_ID] = cap_id; config[PCI_CAP_LIST_NEXT] = pdev->config[PCI_CAPABILITY_LIST]; pdev->config[PCI_CAPABILITY_LIST] = offset; -memset(pdev->used + offset, 
0xFF, size); +memset(pdev->config_map + offset, cap_id, size); /* Make capability read-only by default */ memset(pdev->wmask + offset, 0, size); /* Check capability by default */ @@ -1933,7 +1927,7 @@ void pci_del_capability(PCIDevice *pdev, uint8_t cap_id, uint8_t size) memset(pdev->w1cmask + offset, 0, size); /* Clear cmask as device-specific registers can't be checked */ memset(pdev->cmask + offset, 0, size); -memset(pdev->u
[PATCH v3 3/9] device-assignment: Use PCI capabilities support
Convert to use common pci_add_capabilities() rather than creating our own mess. Signed-off-by: Alex Williamson --- hw/device-assignment.c | 112 +++- 1 files changed, 63 insertions(+), 49 deletions(-) diff --git a/hw/device-assignment.c b/hw/device-assignment.c index a297cb4..76aacac 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -38,6 +38,7 @@ #include "device-assignment.h" #include "loader.h" #include "monitor.h" +#include "range.h" #include /* From linux/ioport.h */ @@ -1075,17 +1076,17 @@ static void assigned_dev_update_msi(PCIDevice *pci_dev, unsigned int ctrl_pos) } if (ctrl_byte & PCI_MSI_FLAGS_ENABLE) { +int pos = ctrl_pos - PCI_MSI_FLAGS; assigned_dev->entry = calloc(1, sizeof(struct kvm_irq_routing_entry)); if (!assigned_dev->entry) { perror("assigned_dev_update_msi: "); return; } assigned_dev->entry->u.msi.address_lo = -*(uint32_t *)(pci_dev->config + pci_dev->cap.start + - PCI_MSI_ADDRESS_LO); +pci_get_long(pci_dev->config + pos + PCI_MSI_ADDRESS_LO); assigned_dev->entry->u.msi.address_hi = 0; -assigned_dev->entry->u.msi.data = *(uint16_t *)(pci_dev->config + -pci_dev->cap.start + PCI_MSI_DATA_32); +assigned_dev->entry->u.msi.data = +pci_get_word(pci_dev->config + pos + PCI_MSI_DATA_32); assigned_dev->entry->type = KVM_IRQ_ROUTING_MSI; r = kvm_get_irq_route_gsi(); if (r < 0) { @@ -1123,10 +1124,7 @@ static int assigned_dev_update_msix_mmio(PCIDevice *pci_dev) struct kvm_assigned_msix_entry msix_entry; void *va = adev->msix_table_page; -if (adev->cap.available & ASSIGNED_DEVICE_CAP_MSI) -pos = pci_dev->cap.start + PCI_CAPABILITY_CONFIG_MSI_LENGTH; -else -pos = pci_dev->cap.start; +pos = pci_find_capability(pci_dev, PCI_CAP_ID_MSIX); entries_max_nr = *(uint16_t *)(pci_dev->config + pos + 2); entries_max_nr &= PCI_MSIX_TABSIZE; @@ -1260,26 +1258,23 @@ static void assigned_device_pci_cap_write_config(PCIDevice *pci_dev, uint32_t ad uint32_t val, int len) { AssignedDevice *assigned_dev = container_of(pci_dev, AssignedDevice, dev); 
-unsigned int pos = pci_dev->cap.start, ctrl_pos; pci_default_cap_write_config(pci_dev, address, val, len); #ifdef KVM_CAP_IRQ_ROUTING #ifdef KVM_CAP_DEVICE_MSI if (assigned_dev->cap.available & ASSIGNED_DEVICE_CAP_MSI) { -ctrl_pos = pos + PCI_MSI_FLAGS; -if (address <= ctrl_pos && address + len > ctrl_pos) -assigned_dev_update_msi(pci_dev, ctrl_pos); -pos += PCI_CAPABILITY_CONFIG_MSI_LENGTH; +int pos = pci_find_capability(pci_dev, PCI_CAP_ID_MSI); +if (ranges_overlap(address, len, pos + PCI_MSI_FLAGS, 1)) { +assigned_dev_update_msi(pci_dev, pos + PCI_MSI_FLAGS); +} } #endif #ifdef KVM_CAP_DEVICE_MSIX if (assigned_dev->cap.available & ASSIGNED_DEVICE_CAP_MSIX) { -ctrl_pos = pos + 3; -if (address <= ctrl_pos && address + len > ctrl_pos) { -ctrl_pos--; /* control is word long */ -assigned_dev_update_msix(pci_dev, ctrl_pos); +int pos = pci_find_capability(pci_dev, PCI_CAP_ID_MSIX); +if (ranges_overlap(address, len, pos + PCI_MSIX_FLAGS + 1, 1)) { +assigned_dev_update_msix(pci_dev, pos + PCI_MSIX_FLAGS); } -pos += PCI_CAPABILITY_CONFIG_MSIX_LENGTH; } #endif #endif @@ -1290,58 +1285,77 @@ static int assigned_device_pci_cap_init(PCIDevice *pci_dev) { AssignedDevice *dev = container_of(pci_dev, AssignedDevice, dev); PCIRegion *pci_region = dev->real_device.regions; -int next_cap_pt = 0; -pci_dev->cap.supported = 1; -pci_dev->cap.start = PCI_CAPABILITY_CONFIG_DEFAULT_START_ADDR; +/* Clear initial capabilities pointer and status copied from hw */ +pci_set_byte(pci_dev->config + PCI_CAPABILITY_LIST, 0); +pci_set_word(pci_dev->config + PCI_STATUS, + pci_get_word(pci_dev->config + PCI_STATUS) & + ~PCI_STATUS_CAP_LIST); + pci_dev->cap.length = 0; -pci_dev->config[PCI_STATUS] |= PCI_STATUS_CAP_LIST; -pci_dev->config[PCI_CAPABILITY_LIST] = pci_dev->cap.start; #ifdef KVM_CAP_IRQ_ROUTING #ifdef KVM_CAP_DEVICE_MSI /* Expose MSI capability * MSI capability is the 1st capability in capability config */ if (pci_find_cap_offset(pci_dev, PCI_CAP_ID_MSI)) { +int vpos, ppos; +uint16_t 
flags; + dev->cap.available |= ASSIGNED_DEVICE_CAP_MSI; -memset(&pci_dev->config[pci_dev->cap.start + pci_dev->cap.length], - 0, PCI_CAPABILITY_CONFIG_MSI_LENGTH); -pci_dev->config[pci_dev->cap.start + pci_dev->cap.length] = -PCI_CAP_ID_MSI; +
[PATCH v3 2/9] pci: Remove pci_enable_capability_support()
This interface doesn't make much sense, adding a capability can take care of everything, just provide a means to register capability read/write handlers. Device assignment does it's own thing, so requires a couple ugly hacks that will be cleaned by subsequent patches. Signed-off-by: Alex Williamson --- hw/device-assignment.c | 12 --- hw/pci.c | 52 +--- hw/pci.h |9 +++- 3 files changed, 35 insertions(+), 38 deletions(-) diff --git a/hw/device-assignment.c b/hw/device-assignment.c index 369bff9..a297cb4 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -1292,7 +1292,12 @@ static int assigned_device_pci_cap_init(PCIDevice *pci_dev) PCIRegion *pci_region = dev->real_device.regions; int next_cap_pt = 0; +pci_dev->cap.supported = 1; +pci_dev->cap.start = PCI_CAPABILITY_CONFIG_DEFAULT_START_ADDR; pci_dev->cap.length = 0; +pci_dev->config[PCI_STATUS] |= PCI_STATUS_CAP_LIST; +pci_dev->config[PCI_CAPABILITY_LIST] = pci_dev->cap.start; + #ifdef KVM_CAP_IRQ_ROUTING #ifdef KVM_CAP_DEVICE_MSI /* Expose MSI capability @@ -1488,9 +1493,10 @@ static int assigned_initfn(struct PCIDevice *pci_dev) dev->h_busnr = dev->host.bus; dev->h_devfn = PCI_DEVFN(dev->host.dev, dev->host.func); -if (pci_enable_capability_support(pci_dev, 0, NULL, -assigned_device_pci_cap_write_config, -assigned_device_pci_cap_init) < 0) +pci_register_capability_handlers(pci_dev, NULL, + assigned_device_pci_cap_write_config); + +if (assigned_device_pci_cap_init(pci_dev) < 0) goto out; /* assign device to guest */ diff --git a/hw/pci.c b/hw/pci.c index 8e99746..f2896d9 100644 --- a/hw/pci.c +++ b/hw/pci.c @@ -770,6 +770,8 @@ static PCIDevice *do_pci_register_device(PCIDevice *pci_dev, PCIBus *bus, config_write = pci_default_write_config; pci_dev->config_read = config_read; pci_dev->config_write = config_write; +pci_dev->cap.config_read = pci_default_cap_read_config; +pci_dev->cap.config_write = pci_default_cap_write_config; bus->devices[devfn] = pci_dev; pci_dev->irq = 
qemu_allocate_irqs(pci_set_irq, pci_dev, PCI_NUM_PINS); pci_dev->version_id = 2; /* Current pci device vmstate version */ @@ -1754,35 +1756,21 @@ PCIDevice *pci_create_simple_multifunction(PCIBus *bus, int devfn, return dev; } -int pci_enable_capability_support(PCIDevice *pci_dev, - uint32_t config_start, - PCICapConfigReadFunc *config_read, - PCICapConfigWriteFunc *config_write, - PCICapConfigInitFunc *config_init) +void pci_register_capability_handlers(PCIDevice *pdev, + PCICapConfigReadFunc *config_read, + PCICapConfigWriteFunc *config_write) { -if (!pci_dev) -return -ENODEV; - -pci_dev->config[0x06] |= 0x10; // status = capabilities - -if (config_start == 0) - pci_dev->cap.start = PCI_CAPABILITY_CONFIG_DEFAULT_START_ADDR; -else if (config_start >= 0x40 && config_start < 0xff) -pci_dev->cap.start = config_start; -else -return -EINVAL; +if (config_read) { +pdev->cap.config_read = config_read; +} else { +pdev->cap.config_read = pci_default_cap_read_config; +} -if (config_read) -pci_dev->cap.config_read = config_read; -else -pci_dev->cap.config_read = pci_default_cap_read_config; -if (config_write) -pci_dev->cap.config_write = config_write; -else -pci_dev->cap.config_write = pci_default_cap_write_config; -pci_dev->cap.supported = 1; -pci_dev->config[PCI_CAPABILITY_LIST] = pci_dev->cap.start; -return config_init(pci_dev); +if (config_write) { +pdev->cap.config_write = config_write; +} else { +pdev->cap.config_write = pci_default_cap_write_config; +} } PCIDevice *pci_create(PCIBus *bus, int devfn, const char *name) @@ -1920,12 +1908,16 @@ int pci_add_capability(PCIDevice *pdev, uint8_t cap_id, config[PCI_CAP_LIST_ID] = cap_id; config[PCI_CAP_LIST_NEXT] = pdev->config[PCI_CAPABILITY_LIST]; pdev->config[PCI_CAPABILITY_LIST] = offset; -pdev->config[PCI_STATUS] |= PCI_STATUS_CAP_LIST; memset(pdev->used + offset, 0xFF, size); /* Make capability read-only by default */ memset(pdev->wmask + offset, 0, size); /* Check capability by default */ memset(pdev->cmask + offset, 
0xFF, size); + +pdev->config[PCI_STATUS] |= PCI_STATUS_CAP_LIST; +pdev->cap.supported = 1; +pdev->cap.start = pdev->cap.start ? MIN(pdev->cap.start, offset) : offset; + return offset; } @@ -1943,8 +1935,10 @@ void pci_del_capability(PCIDevice *pdev, uint8_t cap_id, uint8_t siz
[PATCH v3 1/9] pci: pci_default_cap_write_config ignores wmask
Make use of wmask, just like the rest of config space. This duplicates code in pci_default_write_config, but we plan to get rid of this function anyway, so avoid the code churn. Signed-off-by: Alex Williamson --- hw/pci.c | 19 --- 1 files changed, 8 insertions(+), 11 deletions(-) diff --git a/hw/pci.c b/hw/pci.c index c3b5048..8e99746 100644 --- a/hw/pci.c +++ b/hw/pci.c @@ -1088,16 +1088,6 @@ uint32_t pci_default_read_config(PCIDevice *d, return pci_read_config(d, address, len); } -static void pci_write_config(PCIDevice *pci_dev, - uint32_t address, uint32_t val, int len) -{ -int i; -for (i = 0; i < len; i++) { -pci_dev->config[address + i] = val & 0xff; -val >>= 8; -} -} - int pci_access_cap_config(PCIDevice *pci_dev, uint32_t address, int len) { if (pci_dev->cap.supported && address >= pci_dev->cap.start && @@ -1115,7 +1105,14 @@ uint32_t pci_default_cap_read_config(PCIDevice *pci_dev, void pci_default_cap_write_config(PCIDevice *pci_dev, uint32_t address, uint32_t val, int len) { -pci_write_config(pci_dev, address, val, len); +uint32_t config_size = pci_config_size(pci_dev); +int i; + +for (i = 0; i < len && address + i < config_size; val >>= 8, ++i) { +uint8_t wmask = pci_dev->wmask[address + i]; +pci_dev->config[address + i] = +(pci_dev->config[address + i] & ~wmask) | (val & wmask); +} } void pci_default_write_config(PCIDevice *d, uint32_t addr, uint32_t val, int l) -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 0/9] PCI capability and device assignment improvements
v3: - Rework to avoid introducing conflicts with qemu.git hw/pci - Drop capability lookup table - Add back minimal device assignment PM, EXP, X, VPD, VNDR capabilities This version should do a much better job at not introducing new differences between qemu-kvm.git and qemu.git hw/pci. The diff between the pci files after this series is significantly reduced. I added back the previous RFC patch that adds new capabilities for assigned devices. I'm sure we'll want to add passthrough for various fields, but that can come later (along with figuring out whether we can consolidate emulation should other drivers want it). Thanks, Alex v2: - Fixed the function name in 1/8 per Michael's suggestion. - Removed capability specific config read/write registration - Added more checks to add_capability - Added capability lookup table to PCIDevice I've dropped the RFC patch to add more capabilities to device assignment while I do some more work on it. Please feel free to comment on the v1 version though. Patches still against qemu-kvm, but I hope some of this makes it easier to bring qemu & qemu-kvm closer here. v1: This series attempts to clean up capability support between common code and device assignment. In doing so, we can move existing MSI & MSI-X capabilities to offsets matching real hardware, and further enable more capabilities to be exposed. The last patch is only for RFC, I'd like some input on what we should pass directly and where we should only provide read-only/emulated access. Patches 1-7 are submitted for commit. 
--- Alex Williamson (9): device-assignment: pass through and stub more PCI caps device-assignment: Make use of config_map pci: Remove capability specific handlers device-assignment: Move PCI capabilities to match physical hardware pci: Remove cap.length, cap.start, cap.supported pci: Replace used bitmap with config byte map device-assignment: Use PCI capabilities support pci: Remove pci_enable_capability_support() pci: pci_default_cap_write_config ignores wmask hw/device-assignment.c | 332 +--- hw/pci.c | 109 ++-- hw/pci.h | 32 - 3 files changed, 276 insertions(+), 197 deletions(-)
[RFC PATCH] kvm: PCI error stub driver
When a device is driven from userspace (as in KVM's device assignment), it's essential to provide proper PCI error handling support to the corresponding driver. Implementation == The PCI error stub driver is implemented on top of the uio framework. PCI errors are reported to userspace by signaling on an eventfd. Error codes and error results are exposed through a 'logical' BAR that can be mmap'ed. Userspace acknowledges the kernel using another eventfd, which is consumed internally by the driver in a fashion similar to KVM's irqfd. The uio implementation accepts 32-bit writes to the device file; this interface is used to multiplex the passing of both eventfds to the driver. The guest-visible interface is not yet defined. In theory it could be abstracted as part of the Q35 PCIe chipset, or it could be defined as a simple QEMU device model. From a performance perspective, the error signaling path can reach the guest directly using KVM's irqfd. Both the error code and error result can be made directly available to the guest. The guest acknowledgment can signal directly on an ioeventfd. Background information == Error handling happens in two phases: (1) Reporting, (2) Recovery. Phase 1, Reporting (.error_detected callback) = The purpose of the reporting phase is to A) notify the driver so that it can stop all I/O access to the device quickly and B) provide an exit path out of I/O spinloops or inconsistent internal state. The kernel assumes that once the corresponding '.error_detected' callback returns, the driver is in a quiescent state. Ideally, interrupts to the device should also be disabled. If the driver doesn't respond to the '.error_detected' callback in a timely fashion, an accumulation of posted writes may choke the PCIe switch and eventually the CPU complex.
Phase 2, Recovery (.mmio_enabled, .link_reset, .slot_reset, .resume) == The purpose of the recovery phase is to provide a mechanism for the driver to 'reconnect' with the device in a sequenced fashion. The time it takes to perform the recovery process usually doesn't impact the host. Reporting and recovery in userspace === Both the error reporting and error recovery phases are driven by a state machine where the next state depends on the return value of the current callback. Depending on the state the state machine is in, the host will perform various operations such as 'slot reset' or 'link reset'. Because of this, the only way to maintain consistency between the device and its corresponding driver in userspace is to perform both phases in lockstep with the host. The need for a policy; error handling in userspace? === When the 'policy' is set to 'paranoid', the host kernel never relies on userspace: as soon as an error related to the device is detected, the corresponding driver's process is terminated. When the 'policy' is set to 'strict', the entire reporting and recovery phases must succeed in lockstep; a timeout in the lockstep sequence terminates the corresponding driver's process. The 'lazy' mode is similar to 'strict' except that if a timeout happens in the lockstep sequence, the host kernel takes a default path and keeps the process up and running. Considerations == - This is work in progress with minimal testing; looking for comments/ideas. - For performance reasons, the 'lazy' mode is interesting because it may well operate asynchronously, without the expensive lockstep mode. Obviously, the driver's expected recovery sequence _must_ match the default state machine sequence on the host. - Assuming that error recovery is directed by qemu/Q35 PCIe, how do we get the recovery action for every error callback, since AER handling merges callback return codes from multiple sources?
It's hard to discriminate what value should be given back to the host for the corresponding assigned device. - Upon slot_reset, who should restore the device's config space? Can the guest's driver do it? - Should suspend and resume for assigned devices also be part of this driver? - The only policy implemented is 'lazy'. thanks, -Etienne Signed-off-by: Etienne Martineau --- drivers/uio/Kconfig | 11 + drivers/uio/Makefile |2 + drivers/uio/uio_pci_stub.c | 558 ++ include/linux/Kbuild |1 + include/linux/uio_pci_stub.h | 46 5 files changed, 618 insertions(+), 0 deletions(-) create mode 100644 drivers/uio/uio_pci_stub.c create mode 100644 include/linux/uio_pci_stub.h diff --git a/drivers/uio/Kconfig b/drivers/uio/Kconfig index bb44079..d381c0e 100644 --- a/drivers/uio/Kconfig +++ b/drivers/uio/Kconfig @@ -93,5 +93,16 @@ config UIO_N
Re: limit connectivity of a VM
On Fri, Nov 19, 2010 at 2:47 PM, hadi golestani wrote: > Hello, > I need to limit the port speed of a VM to 10 mbps ( or 5 mbps if it's > possible). > What's the way of doing so? tc check http://lartc.org/howto/lartc.qdisc.html -- Javier
limit connectivity of a VM
Hello, I need to limit the port speed of a VM to 10 mbps ( or 5 mbps if it's possible). What's the way of doing so? Regards
Re: [PATCH] kvm: fast-path msi injection with irqfd
>>> On 11/19/2010 at 10:54 AM, in message <20101119155427.ga20...@amt.cnet>, Marcelo Tosatti wrote: > On Thu, Nov 18, 2010 at 07:09:08PM +0200, Michael S. Tsirkin wrote: >> Store irq routing table pointer in the irqfd object, >> and use that to inject MSI directly without bouncing out to >> a kernel thread. >> >> While we touch this structure, rearrange irqfd fields to make fastpath >> better packed for better cache utilization. >> >> This also adds some comments about locking rules and rcu usage in code. >> >> Some notes on the design: >> - Use pointer into the rt instead of copying an entry, >> to make it possible to use rcu, thus side-stepping >> locking complexities. We also save some memory this way. >> - Old workqueue code is still used for level irqs. >> I don't think we DTRT with level anyway, however, >> it seems easier to keep the code around as >> it has been thought through and debugged, and fix level later than >> rip out and re-instate it later. >> >> Signed-off-by: Michael S. Tsirkin >> --- >> >> OK, this seems to work fine for me. Tested with virtio-net in guest >> with and without vhost-net. Pls review/apply if appropriate. > > Acked-by: Marcelo Tosatti Acked-by: Gregory Haskins
Re: [PATCH] kvm: fast-path msi injection with irqfd
On Thu, Nov 18, 2010 at 07:09:08PM +0200, Michael S. Tsirkin wrote: > Store irq routing table pointer in the irqfd object, > and use that to inject MSI directly without bouncing out to > a kernel thread. > > While we touch this structure, rearrange irqfd fields to make fastpath > better packed for better cache utilization. > > This also adds some comments about locking rules and rcu usage in code. > > Some notes on the design: > - Use pointer into the rt instead of copying an entry, > to make it possible to use rcu, thus side-stepping > locking complexities. We also save some memory this way. > - Old workqueue code is still used for level irqs. > I don't think we DTRT with level anyway, however, > it seems easier to keep the code around as > it has been thought through and debugged, and fix level later than > rip out and re-instate it later. > > Signed-off-by: Michael S. Tsirkin > --- > > OK, this seems to work fine for me. Tested with virtio-net in guest > with and without vhost-net. Pls review/apply if appropriate. Acked-by: Marcelo Tosatti
Re: [PATCH v3 6/6] KVM: MMU: delay flush all tlbs on sync_page path
On Fri, Nov 19, 2010 at 05:05:38PM +0800, Xiao Guangrong wrote: > Quote from Avi: > | I don't think we need to flush immediately; set a "tlb dirty" bit somewhere > | that is cleareded when we flush the tlb. kvm_mmu_notifier_invalidate_page() > | can consult the bit and force a flush if set. > > Signed-off-by: Xiao Guangrong > --- > arch/x86/kvm/paging_tmpl.h |4 ++-- > include/linux/kvm_host.h |7 +++ > virt/kvm/kvm_main.c|8 +++- > 3 files changed, 16 insertions(+), 3 deletions(-) > > diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h > index dfb906f..d5c302a 100644 > --- a/arch/x86/kvm/paging_tmpl.h > +++ b/arch/x86/kvm/paging_tmpl.h > @@ -781,14 +781,14 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, > struct kvm_mmu_page *sp) > gfn = gpte_to_gfn(gpte); > > if (FNAME(map_invalid_gpte)(vcpu, sp, &sp->spt[i], gpte)) { > - kvm_flush_remote_tlbs(vcpu->kvm); > + kvm_mark_tlbs_dirty(vcpu->kvm); > continue; > } > > if (gfn != sp->gfns[i]) { > drop_spte(vcpu->kvm, &sp->spt[i], > shadow_trap_nonpresent_pte); > - kvm_flush_remote_tlbs(vcpu->kvm); > + kvm_mark_tlbs_dirty(vcpu->kvm); > continue; > } > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > index 4bd663d..82d449b 100644 > --- a/include/linux/kvm_host.h > +++ b/include/linux/kvm_host.h > @@ -249,6 +249,7 @@ struct kvm { > struct mmu_notifier mmu_notifier; > unsigned long mmu_notifier_seq; > long mmu_notifier_count; > + atomic_t tlbs_dirty; > #endif > }; > > @@ -377,6 +378,12 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu); > void kvm_resched(struct kvm_vcpu *vcpu); > void kvm_load_guest_fpu(struct kvm_vcpu *vcpu); > void kvm_put_guest_fpu(struct kvm_vcpu *vcpu); > + > +static inline void kvm_mark_tlbs_dirty(struct kvm *kvm) > +{ > + atomic_inc(&kvm->tlbs_dirty); > +} > + > void kvm_flush_remote_tlbs(struct kvm *kvm); > void kvm_reload_remote_mmus(struct kvm *kvm); > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index fb93ff9..2962569 100644 > --- 
a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -168,8 +168,12 @@ static bool make_all_cpus_request(struct kvm *kvm, > unsigned int req) > > void kvm_flush_remote_tlbs(struct kvm *kvm) > { > + int dirty_count = atomic_read(&kvm->tlbs_dirty); > + > + smp_mb(); > if (make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH)) > ++kvm->stat.remote_tlb_flush; > + atomic_sub(dirty_count, &kvm->tlbs_dirty); > } This is racy because kvm_flush_remote_tlbs might be called without mmu_lock protection. You could decrease the counter on invalidate_page/invalidate_range_start only, these are not fast paths anyway. > > void kvm_reload_remote_mmus(struct kvm *kvm) > @@ -249,7 +253,8 @@ static void kvm_mmu_notifier_invalidate_page(struct > mmu_notifier *mn, > idx = srcu_read_lock(&kvm->srcu); > spin_lock(&kvm->mmu_lock); > kvm->mmu_notifier_seq++; > - need_tlb_flush = kvm_unmap_hva(kvm, address); > + need_tlb_flush = kvm_unmap_hva(kvm, address) | > + atomic_read(&kvm->tlbs_dirty); > spin_unlock(&kvm->mmu_lock); > srcu_read_unlock(&kvm->srcu, idx); > > @@ -293,6 +298,7 @@ static void > kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, > kvm->mmu_notifier_count++; > for (; start < end; start += PAGE_SIZE) > need_tlb_flush |= kvm_unmap_hva(kvm, start); > + need_tlb_flush |= atomic_read(&kvm->tlbs_dirty); > spin_unlock(&kvm->mmu_lock); > srcu_read_unlock(&kvm->srcu, idx); > -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: seabios 0.6.1 regression
On 11/16/2010 04:17 PM, Alexander Graf wrote: On 16.11.2010, at 15:15, Avi Kivity wrote: > On 11/16/2010 03:19 PM, Alexander Graf wrote: >> Rewriting it to use inb / stos works (jecxz ; insb; loop doesn't) so it looks like a kernel bug in insb emulation. > > Turns out it was a subtle bug in the tpr optimization we do for Windows XP. The problem happens when we load the vapic option rom from the firmware config interface. With inb / movb, writing the vapic area happens in guest context, which the kernel is prepared to handle. >> With insb, the write happens from kvm, which is then undone on the next entry, leading to the tpr being set to a high value. >> >> Shouldn't the vapic area be mapped in on demand? Then we could map it on option rom init time and everyone's happy. > Mapped in? It's an option rom. According to your wording, the kernel handles writes to the vapic area. If the vapic area is only made special after initialization of the option rom, we wouldn't have this issue. Ah, yes, and I'm even now writing a reset handler to make this happen. -- error compiling committee.c: too many arguments to function
Re: ppc32 build failed
On 19.11.2010, at 07:24, Takuya Yoshikawa wrote: > (2010/11/19 15:01), Yang Rui Rui wrote: >> Hi, >> >> I searched the archive and found some discussions about this; not fixed yet? >> Could someone tell me, is g4 kvm available now? > > Hi, (added kvm-ppc to Cc) > > I'm using g4 (Mac mini box) to run KVM. > - though not tried 2.6.37-rc2 yet. > > Aren't you using upstream qemu? > IIRC, ppc kvm needs to use upstream qemu. > > > You can see useful information on the KVM "PowerPC port" page. > http://www.linux-kvm.org/page/PowerPC > > -> no g4 example but we can find enough information. > > > BTW, why is the entry "Book3S PPC32" in the processor support table > still "NO", anyone? > http://www.linux-kvm.org/page/Processor_support Because I'm better at writing code than at maintaining wikis :). Thanks a lot for the reminder. I changed it. Alex
[PATCH v3 6/6] KVM: MMU: delay flush all tlbs on sync_page path
Quote from Avi: | I don't think we need to flush immediately; set a "tlb dirty" bit somewhere | that is cleareded when we flush the tlb. kvm_mmu_notifier_invalidate_page() | can consult the bit and force a flush if set. Signed-off-by: Xiao Guangrong --- arch/x86/kvm/paging_tmpl.h |4 ++-- include/linux/kvm_host.h |7 +++ virt/kvm/kvm_main.c|8 +++- 3 files changed, 16 insertions(+), 3 deletions(-) diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h index dfb906f..d5c302a 100644 --- a/arch/x86/kvm/paging_tmpl.h +++ b/arch/x86/kvm/paging_tmpl.h @@ -781,14 +781,14 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp) gfn = gpte_to_gfn(gpte); if (FNAME(map_invalid_gpte)(vcpu, sp, &sp->spt[i], gpte)) { - kvm_flush_remote_tlbs(vcpu->kvm); + kvm_mark_tlbs_dirty(vcpu->kvm); continue; } if (gfn != sp->gfns[i]) { drop_spte(vcpu->kvm, &sp->spt[i], shadow_trap_nonpresent_pte); - kvm_flush_remote_tlbs(vcpu->kvm); + kvm_mark_tlbs_dirty(vcpu->kvm); continue; } diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 4bd663d..82d449b 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -249,6 +249,7 @@ struct kvm { struct mmu_notifier mmu_notifier; unsigned long mmu_notifier_seq; long mmu_notifier_count; + atomic_t tlbs_dirty; #endif }; @@ -377,6 +378,12 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu); void kvm_resched(struct kvm_vcpu *vcpu); void kvm_load_guest_fpu(struct kvm_vcpu *vcpu); void kvm_put_guest_fpu(struct kvm_vcpu *vcpu); + +static inline void kvm_mark_tlbs_dirty(struct kvm *kvm) +{ + atomic_inc(&kvm->tlbs_dirty); +} + void kvm_flush_remote_tlbs(struct kvm *kvm); void kvm_reload_remote_mmus(struct kvm *kvm); diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index fb93ff9..2962569 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -168,8 +168,12 @@ static bool make_all_cpus_request(struct kvm *kvm, unsigned int req) void kvm_flush_remote_tlbs(struct kvm *kvm) { + int dirty_count = 
atomic_read(&kvm->tlbs_dirty); + + smp_mb(); if (make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH)) ++kvm->stat.remote_tlb_flush; + atomic_sub(dirty_count, &kvm->tlbs_dirty); } void kvm_reload_remote_mmus(struct kvm *kvm) @@ -249,7 +253,8 @@ static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn, idx = srcu_read_lock(&kvm->srcu); spin_lock(&kvm->mmu_lock); kvm->mmu_notifier_seq++; - need_tlb_flush = kvm_unmap_hva(kvm, address); + need_tlb_flush = kvm_unmap_hva(kvm, address) | + atomic_read(&kvm->tlbs_dirty); spin_unlock(&kvm->mmu_lock); srcu_read_unlock(&kvm->srcu, idx); @@ -293,6 +298,7 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, kvm->mmu_notifier_count++; for (; start < end; start += PAGE_SIZE) need_tlb_flush |= kvm_unmap_hva(kvm, start); + need_tlb_flush |= atomic_read(&kvm->tlbs_dirty); spin_unlock(&kvm->mmu_lock); srcu_read_unlock(&kvm->srcu, idx); -- 1.7.0.4 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 5/6] KVM: MMU: abstract invalid guest pte mapping
Introduce a common function to map invalid gpte Signed-off-by: Xiao Guangrong --- arch/x86/kvm/mmu.c |3 -- arch/x86/kvm/paging_tmpl.h | 71 +++- 2 files changed, 37 insertions(+), 37 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index d878dd1..e3d2ee0 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -3074,9 +3074,6 @@ static void mmu_pte_write_new_pte(struct kvm_vcpu *vcpu, return; } - if (is_rsvd_bits_set(&vcpu->arch.mmu, *(u64 *)new, PT_PAGE_TABLE_LEVEL)) - return; - ++vcpu->kvm->stat.mmu_pte_updated; if (!sp->role.cr4_pae) paging32_update_pte(vcpu, sp, spte, new); diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h index 60f00db..dfb906f 100644 --- a/arch/x86/kvm/paging_tmpl.h +++ b/arch/x86/kvm/paging_tmpl.h @@ -299,25 +299,42 @@ static int FNAME(walk_addr_nested)(struct guest_walker *walker, addr, access); } +static bool FNAME(map_invalid_gpte)(struct kvm_vcpu *vcpu, + struct kvm_mmu_page *sp, u64 *spte, + pt_element_t gpte) +{ + u64 nonpresent = shadow_trap_nonpresent_pte; + + if (is_rsvd_bits_set(&vcpu->arch.mmu, gpte, PT_PAGE_TABLE_LEVEL)) + goto no_present; + + if (!is_present_gpte(gpte)) { + if (!sp->unsync) + nonpresent = shadow_notrap_nonpresent_pte; + goto no_present; + } + + if (!(gpte & PT_ACCESSED_MASK)) + goto no_present; + + return false; + +no_present: + drop_spte(vcpu->kvm, spte, nonpresent); + return true; +} + static void FNAME(update_pte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, u64 *spte, const void *pte) { pt_element_t gpte; unsigned pte_access; pfn_t pfn; - u64 new_spte; gpte = *(const pt_element_t *)pte; - if (~gpte & (PT_PRESENT_MASK | PT_ACCESSED_MASK)) { - if (!is_present_gpte(gpte)) { - if (sp->unsync) - new_spte = shadow_trap_nonpresent_pte; - else - new_spte = shadow_notrap_nonpresent_pte; - __set_spte(spte, new_spte); - } + if (FNAME(map_invalid_gpte)(vcpu, sp, spte, gpte)) return; - } + pgprintk("%s: gpte %llx spte %p\n", __func__, (u64)gpte, spte); pte_access = sp->role.access 
& FNAME(gpte_access)(vcpu, gpte); if (gpte_to_gfn(gpte) != vcpu->arch.update_pte.gfn) @@ -364,7 +381,6 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, struct guest_walker *gw, u64 *sptep) { struct kvm_mmu_page *sp; - struct kvm_mmu *mmu = &vcpu->arch.mmu; pt_element_t *gptep = gw->prefetch_ptes; u64 *spte; int i; @@ -395,16 +411,7 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, struct guest_walker *gw, gpte = gptep[i]; - if (is_rsvd_bits_set(mmu, gpte, PT_PAGE_TABLE_LEVEL)) - continue; - - if (!is_present_gpte(gpte)) { - if (!sp->unsync) - __set_spte(spte, shadow_notrap_nonpresent_pte); - continue; - } - - if (!(gpte & PT_ACCESSED_MASK)) + if (FNAME(map_invalid_gpte)(vcpu, sp, spte, gpte)) continue; pte_access = sp->role.access & FNAME(gpte_access)(vcpu, gpte); @@ -761,7 +768,6 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp) pt_element_t gpte; gpa_t pte_gpa; gfn_t gfn; - bool rsvd_bits_set; if (!is_shadow_present_pte(sp->spt[i])) continue; @@ -773,18 +779,15 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp) return -EINVAL; gfn = gpte_to_gfn(gpte); - rsvd_bits_set = is_rsvd_bits_set(&vcpu->arch.mmu, gpte, -PT_PAGE_TABLE_LEVEL); - if (rsvd_bits_set || gfn != sp->gfns[i] || - !is_present_gpte(gpte) || !(gpte & PT_ACCESSED_MASK)) { - u64 nonpresent; - - if (rsvd_bits_set || is_present_gpte(gpte) || - sp->unsync) - nonpresent = shadow_trap_nonpresent_pte; - else - nonpresent = shadow_notrap_nonpresent_pte; - drop_spte(vcpu->kvm, &sp->spt[i], nonpresent); + + if (FNAME(map_invalid_gpte)(vcpu, sp, &sp->sp
[PATCH v3 4/6] KVM: MMU: remove 'clear_unsync' parameter
Remove it since we can judge it by using sp->unsync

Signed-off-by: Xiao Guangrong
---
 arch/x86/include/asm/kvm_host.h |    2 +-
 arch/x86/kvm/mmu.c              |    8 ++++----
 arch/x86/kvm/paging_tmpl.h      |    5 ++---
 3 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index b04c0fa..ce8c1e4 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -250,7 +250,7 @@ struct kvm_mmu {
 	void (*prefetch_page)(struct kvm_vcpu *vcpu,
 			      struct kvm_mmu_page *page);
 	int (*sync_page)(struct kvm_vcpu *vcpu,
-			 struct kvm_mmu_page *sp, bool clear_unsync);
+			 struct kvm_mmu_page *sp);
 	void (*invlpg)(struct kvm_vcpu *vcpu, gva_t gva);
 	hpa_t root_hpa;
 	int root_level;
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index dc64cd6..d878dd1 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1156,7 +1156,7 @@ static void nonpaging_prefetch_page(struct kvm_vcpu *vcpu,
 }

 static int nonpaging_sync_page(struct kvm_vcpu *vcpu,
-			       struct kvm_mmu_page *sp, bool clear_unsync)
+			       struct kvm_mmu_page *sp)
 {
 	return 1;
 }
@@ -1286,7 +1286,7 @@ static int __kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	if (clear_unsync)
 		kvm_unlink_unsync_page(vcpu->kvm, sp);

-	if (vcpu->arch.mmu.sync_page(vcpu, sp, clear_unsync)) {
+	if (vcpu->arch.mmu.sync_page(vcpu, sp)) {
 		kvm_mmu_prepare_zap_page(vcpu->kvm, sp, invalid_list);
 		return 1;
 	}
@@ -1327,12 +1327,12 @@ static void kvm_sync_pages(struct kvm_vcpu *vcpu, gfn_t gfn)
 			continue;

 		WARN_ON(s->role.level != PT_PAGE_TABLE_LEVEL);
+		kvm_unlink_unsync_page(vcpu->kvm, s);
 		if ((s->role.cr4_pae != !!is_pae(vcpu)) ||
-		    (vcpu->arch.mmu.sync_page(vcpu, s, true))) {
+		    (vcpu->arch.mmu.sync_page(vcpu, s))) {
 			kvm_mmu_prepare_zap_page(vcpu->kvm, s, &invalid_list);
 			continue;
 		}
-		kvm_unlink_unsync_page(vcpu->kvm, s);
 		flush = true;
 	}
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 57619ed..60f00db 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -740,8 +740,7 @@ static void FNAME(prefetch_page)(struct kvm_vcpu *vcpu,
  * - The spte has a reference to the struct page, so the pfn for a given gfn
  *   can't change unless all sptes pointing to it are nuked first.
  */
-static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
-			    bool clear_unsync)
+static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 {
 	int i, offset, nr_present;
 	bool host_writable;
@@ -781,7 +780,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 			u64 nonpresent;

 			if (rsvd_bits_set || is_present_gpte(gpte) ||
-			    !clear_unsync)
+			    sp->unsync)
 				nonpresent = shadow_trap_nonpresent_pte;
 			else
 				nonpresent = shadow_notrap_nonpresent_pte;
--
1.7.0.4
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 3/6] KVM: MMU: rename 'reset_host_protection' to 'host_writable'
From: Lai Jiangshan

Rename it to better fit its meaning

Signed-off-by: Lai Jiangshan
Signed-off-by: Xiao Guangrong
---
 arch/x86/kvm/mmu.c         |    8 ++++----
 arch/x86/kvm/paging_tmpl.h |   10 +++++-----
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 59b5bd2..dc64cd6 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1958,7 +1958,7 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 		    unsigned pte_access, int user_fault,
 		    int write_fault, int dirty, int level,
 		    gfn_t gfn, pfn_t pfn, bool speculative,
-		    bool can_unsync, bool reset_host_protection)
+		    bool can_unsync, bool host_writable)
 {
 	u64 spte, entry = *sptep;
 	int ret = 0;
@@ -1985,7 +1985,7 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 		spte |= kvm_x86_ops->get_mt_mask(vcpu, gfn,
 			kvm_is_mmio_pfn(pfn));

-	if (reset_host_protection)
+	if (host_writable)
 		spte |= SPTE_HOST_WRITEABLE;

 	spte |= (u64)pfn << PAGE_SHIFT;
@@ -2048,7 +2048,7 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 			 int user_fault, int write_fault, int dirty,
 			 int *ptwrite, int level, gfn_t gfn,
 			 pfn_t pfn, bool speculative,
-			 bool reset_host_protection)
+			 bool host_writable)
 {
 	int was_rmapped = 0;
 	int rmap_count;
@@ -2083,7 +2083,7 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,

 	if (set_spte(vcpu, sptep, pte_access, user_fault, write_fault,
 		      dirty, level, gfn, pfn, speculative, true,
-		      reset_host_protection)) {
+		      host_writable)) {
 		if (write_fault)
 			*ptwrite = 1;
 		kvm_mmu_flush_tlb(vcpu);
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index ca0e5e8..57619ed 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -329,7 +329,7 @@ static void FNAME(update_pte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 		return;
 	kvm_get_pfn(pfn);
 	/*
-	 * we call mmu_set_spte() with reset_host_protection = true because
+	 * we call mmu_set_spte() with host_writable = true because
 	 * vcpu->arch.update_pte.pfn was fetched from get_user_pages(write = 1).
 	 */
 	mmu_set_spte(vcpu, spte, sp->role.access, pte_access, 0, 0,
@@ -744,7 +744,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 			    bool clear_unsync)
 {
 	int i, offset, nr_present;
-	bool reset_host_protection;
+	bool host_writable;
 	gpa_t first_pte_gpa;

 	offset = nr_present = 0;
@@ -794,14 +794,14 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 		pte_access = sp->role.access & FNAME(gpte_access)(vcpu, gpte);
 		if (!(sp->spt[i] & SPTE_HOST_WRITEABLE)) {
 			pte_access &= ~ACC_WRITE_MASK;
-			reset_host_protection = 0;
+			host_writable = 0;
 		} else {
-			reset_host_protection = 1;
+			host_writable = 1;
 		}
 		set_spte(vcpu, &sp->spt[i], pte_access, 0, 0,
 			 is_dirty_gpte(gpte), PT_PAGE_TABLE_LEVEL, gfn,
 			 spte_to_pfn(sp->spt[i]), true, false,
-			 reset_host_protection);
+			 host_writable);
 	}

 	return !nr_present;
--
1.7.0.4
[PATCH v3 2/6] KVM: MMU: don't drop spte when overwriting it from W to RO
We just need to flush the TLB if we overwrite a writable spte with a
read-only one, and this operation should be moved to set_spte() so that
the sync_page path is covered as well.

Signed-off-by: Xiao Guangrong
---
 arch/x86/kvm/mmu.c |   20 +++++++++-----------
 1 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index bdb9fa9..59b5bd2 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1960,7 +1960,7 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 		    gfn_t gfn, pfn_t pfn, bool speculative,
 		    bool can_unsync, bool reset_host_protection)
 {
-	u64 spte;
+	u64 spte, entry = *sptep;
 	int ret = 0;

 	/*
@@ -2031,6 +2031,14 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,

 set_pte:
 	update_spte(sptep, spte);
+	/*
+	 * If we overwrite a writable spte with a read-only one we
+	 * should flush remote TLBs. Otherwise rmap_write_protect
+	 * will find a read-only spte, even though the writable spte
+	 * might be cached on a CPU's TLB.
+	 */
+	if (is_writable_pte(entry) && !is_writable_pte(*sptep))
+		kvm_flush_remote_tlbs(vcpu->kvm);
 done:
 	return ret;
 }
@@ -2069,16 +2077,6 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 				 spte_to_pfn(*sptep), pfn);
 			drop_spte(vcpu->kvm, sptep, shadow_trap_nonpresent_pte);
 			kvm_flush_remote_tlbs(vcpu->kvm);
-		/*
-		 * If we overwrite a writable spte with a read-only one,
-		 * drop it and flush remote TLBs. Otherwise rmap_write_protect
-		 * will find a read-only spte, even though the writable spte
-		 * might be cached on a CPU's TLB.
-		 */
-		} else if (is_writable_pte(*sptep) &&
-			  (!(pte_access & ACC_WRITE_MASK) || !dirty)) {
-			drop_spte(vcpu->kvm, sptep, shadow_trap_nonpresent_pte);
-			kvm_flush_remote_tlbs(vcpu->kvm);
 		} else
 			was_rmapped = 1;
 	}
--
1.7.0.4
[PATCH v3 1/6] KVM: MMU: fix forgotten tlb flush on sync_page path
We should flush all TLBs after dropping an spte on the sync_page path,
since otherwise the following race is possible. Quote from Avi:

| sync_page
| drop_spte
|                              kvm_mmu_notifier_invalidate_page
|                              kvm_unmap_rmapp
|                              spte doesn't exist -> no flush
|                              page is freed
| guest can write into freed page?

Signed-off-by: Xiao Guangrong
---
 arch/x86/kvm/paging_tmpl.h |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 590bf12..ca0e5e8 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -786,6 +786,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 			else
 				nonpresent = shadow_notrap_nonpresent_pte;
 			drop_spte(vcpu->kvm, &sp->spt[i], nonpresent);
+			kvm_flush_remote_tlbs(vcpu->kvm);
 			continue;
 		}

--
1.7.0.4
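The window in Avi's interleaving can be modeled with two booleans: whether the spte is present, and whether a stale copy of the translation still sits in another vcpu's TLB. This is a toy model under stated assumptions, not kernel code; the function names only echo the real paths. The mmu notifier flushes only when it still finds a present spte, so if sync_page drops the spte without flushing, the page is freed while a writable TLB entry survives.

```c
#include <stdbool.h>

static bool spte_present;   /* the shadow pte                       */
static bool tlb_has_entry;  /* stale translation in another vcpu's TLB */

/* sync_page drops the spte; with the fix it also flushes remote TLBs. */
static void sync_page_drop_spte(bool with_fix)
{
    spte_present = false;
    if (with_fix)
        tlb_has_entry = false;  /* models kvm_flush_remote_tlbs() */
}

/* mmu notifier path: kvm_unmap_rmapp only flushes if it finds a
 * present spte; then the host frees the page. Returns true iff the
 * free is safe (no stale TLB entry can still reach the page). */
static bool invalidate_page_then_free(void)
{
    if (spte_present)
        tlb_has_entry = false;  /* rmap walk found the spte -> flush */
    return !tlb_has_entry;      /* page freed; stale entry == bug    */
}
```

Without the fix the model reaches the "guest can write into freed page" state; with the added kvm_flush_remote_tlbs() call the free is always safe.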
Re: [RFC PATCH 2/2] KVM: selective write protection using dirty bitmap
(2010/11/18 22:06), Avi Kivity wrote:
> On 11/18/2010 07:15 AM, Takuya Yoshikawa wrote:
>> We can also use this to selectively write protect pages to reduce
>> unwanted page faults in the future.
>
> Looks like a good approach. Any measurements?

OK, I'll do some tests to select the right approach. Personally, and
some as a job if possible, I'm planning various things:

a) Frame buffer: small bitmap size (my hobby now)

   Performance: measurements at both the function level and the guest
   side are necessary. *1 For interactivity, qemu-side intelligence may
   be needed.
   Note: the qemu-side dirty bitmap optimization by the Kemari team is
   pending now.

   Correctness: I suspect the qemu side, but I'm not good at
   device-level VGA details yet, sorry.

   *1) I once tried to update the dirty bitmap without srcu sync by
   using atomic_clear_mask() word by word. This seemed to be working,
   and get_dirty_log() was not bad. But when I tested x11perf inside
   the guest, the numbers were worse than with the original
   implementation (LOCK_PREFIX might do bad things to the cache and
   the srcu reader side suffered, I guess). Good SRCU!

b) Live migration: large bitmap size (my target is ~4GB RAM now)

   Performance: scalability is important from both the RAM size and the
   number-of-dirty-pages point of view. We need to achieve this goal
   without sacrificing the low-workload case.

   Correctness: I'm trying to make live migration fail, without mst's
   fix, under multi-threaded heavy workload. SMP is necessary.

>> +
>> +		if (!(gfn_offset || (gfn % huge)))
>> +			break;
>
> Why?

I preserved these lines from Lai's original:
http://www.spinics.net/lists/kvm/msg35871.html
But OK, I probably found some duplication. I'll fix it.

> wrt. O(1) write protection: hard to tell if the two methods can
> coexist. For direct mapped shadow pages (i.e. ept/npt) I think we can
> use the mask to speed up clearing of an individual sp's sptes.

OK, I'll test both with and without NPT. The result may lead us to the
right direction.

Takuya
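The "selectively write protect using the dirty bitmap" idea discussed above can be sketched per bitmap word: visit only the set bits and clear the writable bit of the corresponding sptes, so the next guest write faults and re-dirties the page. This is a minimal illustration, not the patch's code; PT_WRITABLE_MASK and write_protect_by_mask are stand-ins, and it assumes each bit of the word maps to one spte in an array.

```c
#include <stdint.h>

#define PT_WRITABLE_MASK (1ULL << 1)  /* stand-in for KVM's writable bit */

/* Write-protect only the sptes whose bit is set in dirty_mask.
 * Returns the number of sptes touched; clean pages are skipped
 * entirely, which is the point of the selective approach. */
static int write_protect_by_mask(uint64_t *sptes, unsigned long dirty_mask)
{
    int n = 0;

    while (dirty_mask) {
        int i = __builtin_ctzl(dirty_mask);  /* index of lowest set bit */
        sptes[i] &= ~PT_WRITABLE_MASK;       /* next write will fault   */
        dirty_mask &= dirty_mask - 1;        /* clear that bit          */
        n++;
    }
    return n;
}
```

The cost scales with the number of dirty pages in the word rather than the word width, which matters for the low-workload case Takuya mentions.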
Re: [PATCH -v3] Monitor command: x-gpa2hva, translate guest physical address to host virtual address
On Tue, 2010-11-16 at 10:23 +0800, Huang Ying wrote:
> Author: Max Asbock
>
> Add command x-gpa2hva to translate a guest physical address to a host
> virtual address. Because the gpa to hva translation is not consistent,
> this command is only used for debugging.
>
> The x-gpa2hva command provides one step in a chain of translations
> from guest virtual to guest physical to host virtual to host physical.
> Host physical is then used to inject a machine check error. As a
> consequence the HWPOISON code on the host and the MCE injection code
> in qemu-kvm are exercised.
>
> v3:
>
> - Rename to x-gpa2hva
> - Remove QMP support, because gpa2hva is not consistent

Is this patch an acceptable solution for now? This command is useful
for our testing.

Best Regards,
Huang Ying
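The gpa-to-hva step in the translation chain boils down to memory-slot arithmetic: find the slot containing the gfn, then offset into the userspace mapping that backs it. The sketch below is illustrative, not qemu-kvm's actual implementation; the struct and field names mirror KVM's memslot concept but are defined here for the example, and 4KiB pages are assumed.

```c
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_MASK  ((1ULL << PAGE_SHIFT) - 1)

/* Minimal memory-slot descriptor: a run of guest frames backed by a
 * contiguous host-userspace mapping. */
struct memslot {
    uint64_t base_gfn;        /* first guest frame number in the slot */
    uint64_t npages;          /* number of pages in the slot          */
    uint64_t userspace_addr;  /* hva backing base_gfn                 */
};

/* Translate a guest physical address to a host virtual address.
 * Returns 0 if no slot covers the gpa (e.g. MMIO holes), which is one
 * reason the translation is "not consistent" and debug-only. */
static uint64_t gpa2hva(const struct memslot *slots, int n, uint64_t gpa)
{
    uint64_t gfn = gpa >> PAGE_SHIFT;

    for (int i = 0; i < n; i++) {
        const struct memslot *s = &slots[i];

        if (gfn >= s->base_gfn && gfn < s->base_gfn + s->npages)
            return s->userspace_addr +
                   ((gfn - s->base_gfn) << PAGE_SHIFT) +
                   (gpa & PAGE_MASK);
    }
    return 0;
}
```

From the hva, /proc/self/pagemap (or an equivalent) gives the host physical address, completing the chain the patch description mentions for MCE injection.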