OOM kernel panic on bootup in debian/ubuntu guests
Hi, I'm not sure if this is the correct venue for this question; feel free to redirect. When I start debian or ubuntu guests on a debian host running KVM 0.12.5, my guests kernel-panic immediately after bootup -- the OOM killer kills every process on the system and then dies itself. This problem only occurs if I enable the virtio-balloon driver -- without it, the guests run just fine. It also does not happen on CentOS guests, which run fine and balloon up/down happily. The problem occurs just after the system gives me a getty -- if I'm attached to the guest console, I see a login prompt but never get a chance to use it. I've tried compiling a vanilla kernel on an ubuntu guest, but the problem persisted. I'm not sure where to go from here, so any tips on debugging the problem are appreciated. --Igor
Re: [PATCHv4 15/15] Pass boot device list to firmware.
On Mon, Nov 15, 2010 at 08:29:24PM +, Blue Swirl wrote: > 2010/11/15 Gleb Natapov : > > On Sun, Nov 14, 2010 at 10:50:13PM +, Blue Swirl wrote: > >> On Sun, Nov 14, 2010 at 3:39 PM, Gleb Natapov wrote: > >> > > >> > Signed-off-by: Gleb Natapov > >> > --- > >> > hw/fw_cfg.c | 14 ++ > >> > hw/fw_cfg.h | 4 +++- > >> > sysemu.h | 1 + > >> > vl.c | 51 +++ > >> > 4 files changed, 69 insertions(+), 1 deletions(-) > >> > > >> > diff --git a/hw/fw_cfg.c b/hw/fw_cfg.c > >> > index 7b9434f..f6a67db 100644 > >> > --- a/hw/fw_cfg.c > >> > +++ b/hw/fw_cfg.c > >> > @@ -53,6 +53,7 @@ struct FWCfgState { > >> > FWCfgFiles *files; > >> > uint16_t cur_entry; > >> > uint32_t cur_offset; > >> > + Notifier machine_ready; > >> > }; > >> > > >> > static void fw_cfg_write(FWCfgState *s, uint8_t value) > >> > @@ -315,6 +316,15 @@ int fw_cfg_add_file(FWCfgState *s, const char > >> > *filename, uint8_t *data, > >> > return 1; > >> > } > >> > > >> > +static void fw_cfg_machine_ready(struct Notifier* n) > >> > +{ > >> > + uint32_t len; > >> > + char *bootindex = get_boot_devices_list(&len); > >> > + > >> > + fw_cfg_add_bytes(container_of(n, FWCfgState, machine_ready), > >> > + FW_CFG_BOOTINDEX, (uint8_t*)bootindex, len); > >> > +} > >> > + > >> > FWCfgState *fw_cfg_init(uint32_t ctl_port, uint32_t data_port, > >> > target_phys_addr_t ctl_addr, target_phys_addr_t > >> > data_addr) > >> > { > >> > @@ -343,6 +353,10 @@ FWCfgState *fw_cfg_init(uint32_t ctl_port, uint32_t > >> > data_port, > >> > fw_cfg_add_i16(s, FW_CFG_MAX_CPUS, (uint16_t)max_cpus); > >> > fw_cfg_add_i16(s, FW_CFG_BOOT_MENU, (uint16_t)boot_menu); > >> > > >> > + > >> > + s->machine_ready.notify = fw_cfg_machine_ready; > >> > + qemu_add_machine_init_done_notifier(&s->machine_ready); > >> > + > >> > return s; > >> > } > >> > > >> > diff --git a/hw/fw_cfg.h b/hw/fw_cfg.h > >> > index 856bf91..4d61410 100644 > >> > --- a/hw/fw_cfg.h > >> > +++ b/hw/fw_cfg.h > >> > @@ -30,7 +30,9 @@ > >> > > >> > #define FW_CFG_FILE_FIRST 0x20 > >> > 
#define FW_CFG_FILE_SLOTS 0x10 > >> > -#define FW_CFG_MAX_ENTRY (FW_CFG_FILE_FIRST+FW_CFG_FILE_SLOTS) > >> > +#define FW_CFG_FILE_LAST_SLOT (FW_CFG_FILE_FIRST+FW_CFG_FILE_SLOTS) > >> > +#define FW_CFG_BOOTINDEX (FW_CFG_FILE_LAST_SLOT + 1) > >> > +#define FW_CFG_MAX_ENTRY FW_CFG_BOOTINDEX > >> > >> This should be > >> #define FW_CFG_MAX_ENTRY (FW_CFG_BOOTINDEX + 1) > >> because the check is like this: > >> if ((key & FW_CFG_ENTRY_MASK) >= FW_CFG_MAX_ENTRY) { > >> s->cur_entry = FW_CFG_INVALID; > >> > > Yeah, will fix. > > > >> With that change, I got the bootindex passed to OpenBIOS: > >> OpenBIOS for Sparc64 > >> Configuration device id QEMU version 1 machine id 0 > >> kernel cmdline > >> CPUs: 1 x SUNW,UltraSPARC-IIi > >> UUID: ---- > >> bootindex num_strings 1 > >> bootindex /p...@01fe/i...@5/dr...@1/d...@0 > >> > >> The device path does not match exactly, but it's close: > >> /p...@1fe,0/pci-...@5/i...@600/d...@0 > > > > pbm->pci should be solvable by the patch at the end. Where in the spec > > is it allowed to abbreviate 1fe as 1fe,0? The spec allows dropping > > leading zeroes, but the TARGET_FMT_plx definition in targphys.h has a 0 after > > the %. I can define another one without leading zeroes. Can you suggest > > a name? > > I think OpenBIOS for Sparc64 is not correct here, so it may be a bad > reference architecture. OBP on a real Ultra-5 used a path like this: > /p...@1f,0/p...@1,1/i...@3/d...@0,0 > > p...@1f,0 specifies the PCI host bridge at UPA bus port ID of 0x1f. According to the device name qemu creates, the pci controller is memory mapped at address 1fe, and by looking at the code I can see that this is indeed the case. How does UPA naming work? > p...@1,1 specifies a PCI-PCI bridge. > > > TARGET_FMT_lx is poisoned. As for ATA, there is no open firmware > > binding spec, so everyone does what they please. I based my > > implementation on what open firmware shows when running on qemu x86.
> > "pci-ata" should be "ide" according to PCI binding spec :) > Yes, for example there is no ATA in the Ultra-5 tree but in UltraAX it exists: > /p...@1f,4000/i...@3/a...@0,0/c...@0,0 > > > diff --git a/hw/apb_pci.c b/hw/apb_pci.c > > index c619112..643aa49 100644 > > --- a/hw/apb_pci.c > > +++ b/hw/apb_pci.c > > @@ -453,6 +453,7 @@ static PCIDeviceInfo pbm_pci_host_info = { > > > > static SysBusDeviceInfo pbm_host_info = { > > .qdev.name = "pbm", > > + .qdev.fw_name = "pci", > Perhaps the FW path should use device class names if no name is specified. What do you mean by "device class name"? We can do something like this: if (dev->child_bus.lh_first) return dev->child_bus.lh_first->info->name; i.e. if there is a child bus, use its bus n
Re: [PATCHv4 15/15] Pass boot device list to firmware.
On Tue, Nov 16, 2010 at 09:22:45AM +0200, Gleb Natapov wrote: > On Mon, Nov 15, 2010 at 09:52:19PM -0500, Kevin O'Connor wrote: > > I also have an ulterior motive here. If the boot order is exposed as > > a newline separated list via an entry in QEMU_CFG_FILE_DIR, then this > > becomes free for coreboot users as well. (On coreboot, the boot order > > could be placed in a "file" in flash with no change to the seabios > > code.) > > > You can define a get_boot_order() function and implement it differently > for qemu and coreboot. For coreboot it will be a one-liner. Just call > cbfs_copyfile("bootorder"). BTW why is newline separation important? Sure, but it'd be nice to just use romfile_copy("bootorder"). Using newline separation just makes it easier for users to vi and/or cat the file. -Kevin -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: limit conectivity of a VM
Hello, you may also have a look at VDE (Virtual Distributed Ethernet). You can connect your VMs to virtual switches and then use the tool 'wirefilter'[1] to modify different attributes (bandwidth, loss, delay, etc) of the virtual network. [1] http://wiki.virtualsquare.org/wiki/index.php/VDE Regards, Markus On 19.11.2010 20:47, hadi golestani wrote: > Hello, > I need to limit the port speed of a VM to 10 mbps ( or 5 mbps if it's > possible). > What's the way of doing so? > > Regards
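Another option that needs no VDE setup is shaping on the host side of the guest's tap interface with plain tc. A sketch assuming the VM's tap device is named tap0 (substitute your own device name):

```shell
# Cap egress on the tap device feeding the VM at 10 mbit/s with a
# token-bucket filter (run as root; tap0 is an assumed device name).
tc qdisc add dev tap0 root tbf rate 10mbit burst 32kbit latency 400ms

# Inspect the installed qdisc:
tc qdisc show dev tap0

# Remove the limit again:
tc qdisc del dev tap0 root
```

This only limits traffic leaving the host toward the guest's peer in one direction; shaping ingress needs an ifb device or policing, which is a separate exercise.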
Re: [PATCH v2] device-assignment: register a reset function
On Tue, 2010-11-16 at 15:05 +0100, Bernhard Kohl wrote: > This is necessary because during reboot of a VM the assigned devices > continue DMA transfers, which causes memory corruption. > > Signed-off-by: Thomas Ostler > Signed-off-by: Bernhard Kohl > --- > Changes v1 -> v2: > - use defined macros, e.g. PCI_COMMAND > - write all zero to the command register to disconnect the device logically > --- > hw/device-assignment.c | 12 > 1 files changed, 12 insertions(+), 0 deletions(-) Looks good to me. Acked-by: Alex Williamson > diff --git a/hw/device-assignment.c b/hw/device-assignment.c > index 5f5bde1..8d5a609 100644 > --- a/hw/device-assignment.c > +++ b/hw/device-assignment.c > @@ -1434,6 +1434,17 @@ static void > assigned_dev_unregister_msix_mmio(AssignedDevice *dev) > dev->msix_table_page = NULL; > } > > +static void reset_assigned_device(DeviceState *dev) > +{ > +PCIDevice *d = DO_UPCAST(PCIDevice, qdev, dev); > + > +/* > + * When a 0 is written to the command register, the device is logically > + * disconnected from the PCI bus. This avoids further DMA transfers. > + */ > +assigned_dev_pci_write_config(d, PCI_COMMAND, 0, 2); > +} > + > static int assigned_initfn(struct PCIDevice *pci_dev) > { > AssignedDevice *dev = DO_UPCAST(AssignedDevice, dev, pci_dev); > @@ -1544,6 +1555,7 @@ static PCIDeviceInfo assign_info = { > .qdev.name= "pci-assign", > .qdev.desc= "pass through host pci devices to the guest", > .qdev.size= sizeof(AssignedDevice), > +.qdev.reset = reset_assigned_device, > .init = assigned_initfn, > .exit = assigned_exitfn, > .config_read = assigned_dev_pci_read_config,
[PATCH v3 8/9] device-assignment: Make use of config_map
We can figure out the capability being touched much more quickly and efficiently with the config_map. Use it. Signed-off-by: Alex Williamson --- hw/device-assignment.c | 32 +++- 1 files changed, 19 insertions(+), 13 deletions(-) diff --git a/hw/device-assignment.c b/hw/device-assignment.c index 970ffa1..832c236 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -1254,28 +1254,34 @@ static void assigned_dev_update_msix(PCIDevice *pci_dev, unsigned int ctrl_pos) static void assigned_device_pci_cap_write_config(PCIDevice *pci_dev, uint32_t address, uint32_t val, int len) { -AssignedDevice *assigned_dev = container_of(pci_dev, AssignedDevice, dev); +uint8_t cap_id = pci_dev->config_map[address]; pci_default_write_config(pci_dev, address, val, len); +switch (cap_id) { #ifdef KVM_CAP_IRQ_ROUTING +case PCI_CAP_ID_MSI: #ifdef KVM_CAP_DEVICE_MSI -if (assigned_dev->cap.available & ASSIGNED_DEVICE_CAP_MSI) { -int pos = pci_find_capability(pci_dev, PCI_CAP_ID_MSI); -if (ranges_overlap(address, len, pos + PCI_MSI_FLAGS, 1)) { -assigned_dev_update_msi(pci_dev, pos + PCI_MSI_FLAGS); +{ +uint8_t cap = pci_find_capability(pci_dev, cap_id); +if (ranges_overlap(address - cap, len, PCI_MSI_FLAGS, 1)) { +assigned_dev_update_msi(pci_dev, cap + PCI_MSI_FLAGS); +} } -} #endif +break; + +case PCI_CAP_ID_MSIX: #ifdef KVM_CAP_DEVICE_MSIX -if (assigned_dev->cap.available & ASSIGNED_DEVICE_CAP_MSIX) { -int pos = pci_find_capability(pci_dev, PCI_CAP_ID_MSIX); -if (ranges_overlap(address, len, pos + PCI_MSIX_FLAGS + 1, 1)) { -assigned_dev_update_msix(pci_dev, pos + PCI_MSIX_FLAGS); - } -} +{ +uint8_t cap = pci_find_capability(pci_dev, cap_id); +if (ranges_overlap(address - cap, len, PCI_MSIX_FLAGS + 1, 1)) { +assigned_dev_update_msix(pci_dev, cap + PCI_MSIX_FLAGS); +} +} #endif +break; #endif -return; +} } static int assigned_device_pci_cap_init(PCIDevice *pci_dev)
[PATCH v3 9/9] device-assignment: pass through and stub more PCI caps
Some drivers depend on finding capabilities like power management, PCI express/X, vital product data, or vendor specific fields. Now that we have better capability support, we can pass more of these tables through to the guest. Note that VPD and VNDR are direct pass-through capabilities; the rest are mostly empty shells with a few writable bits where necessary. It may be possible to consolidate dummy capabilities into common files for other drivers to use, but I prefer to leave them here for now as we figure out what bits to handle directly with hardware and what bits are purely emulated. Signed-off-by: Alex Williamson --- hw/device-assignment.c | 213 +--- 1 files changed, 199 insertions(+), 14 deletions(-) diff --git a/hw/device-assignment.c b/hw/device-assignment.c index 832c236..cd62b5a 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -67,6 +67,9 @@ static void assigned_device_pci_cap_write_config(PCIDevice *pci_dev, uint32_t address, uint32_t val, int len); +static uint32_t assigned_device_pci_cap_read_config(PCIDevice *pci_dev, +uint32_t address, int len); + static uint32_t assigned_dev_ioport_rw(AssignedDevRegion *dev_region, uint32_t addr, int len, uint32_t *val) { @@ -370,6 +373,27 @@ static uint8_t assigned_dev_pci_read_byte(PCIDevice *d, int pos) return (uint8_t)assigned_dev_pci_read(d, pos, 1); } +static void assigned_dev_pci_write(PCIDevice *d, int pos, uint32_t val, int len) +{ +AssignedDevice *pci_dev = container_of(d, AssignedDevice, dev); +ssize_t ret; +int fd = pci_dev->real_device.config_fd; + +again: +ret = pwrite(fd, &val, len, pos); +if (ret != len) { + if ((ret < 0) && (errno == EINTR || errno == EAGAIN)) + goto again; + + fprintf(stderr, "%s: pwrite failed, ret = %zd errno = %d\n", + __func__, ret, errno); + + exit(1); +} + +return; +} + static uint8_t pci_find_cap_offset(PCIDevice *d, uint8_t cap) { int id; @@ -453,10 +477,13 @@ static uint32_t assigned_dev_pci_read_config(PCIDevice *d, uint32_t address, ssize_t ret;
AssignedDevice *pci_dev = container_of(d, AssignedDevice, dev); +if (address > PCI_CONFIG_HEADER_SIZE && d->config_map[address]) { +return assigned_device_pci_cap_read_config(d, address, len); +} + if (address < 0x4 || (pci_dev->need_emulate_cmd && address == 0x4) || (address >= 0x10 && address <= 0x24) || address == 0x30 || -address == 0x34 || address == 0x3c || address == 0x3d || -(address > PCI_CONFIG_HEADER_SIZE && d->config_map[address])) { +address == 0x34 || address == 0x3c || address == 0x3d) { val = pci_default_read_config(d, address, len); DEBUG("(%x.%x): address=%04x val=0x%08x len=%d\n", (d->devfn >> 3) & 0x1F, (d->devfn & 0x7), address, val, len); @@ -1251,36 +1278,72 @@ static void assigned_dev_update_msix(PCIDevice *pci_dev, unsigned int ctrl_pos) #endif #endif -static void assigned_device_pci_cap_write_config(PCIDevice *pci_dev, uint32_t address, +static uint32_t assigned_device_pci_cap_read_config(PCIDevice *pci_dev, +uint32_t address, int len) +{ +uint8_t cap, cap_id = pci_dev->config_map[address]; + +switch (cap_id) { + +case PCI_CAP_ID_VPD: +cap = pci_find_capability(pci_dev, cap_id); +if (!ranges_overlap(address, len, cap, PCI_CAP_FLAGS)) { +return assigned_dev_pci_read(pci_dev, address, len); +} +break; + +case PCI_CAP_ID_VNDR: +cap = pci_find_capability(pci_dev, cap_id); +if (!ranges_overlap(address, len, cap, PCI_CAP_FLAGS + 1)) { +return assigned_dev_pci_read(pci_dev, address, len); +} +break; +} + +return pci_default_read_config(pci_dev, address, len); +} + +static void assigned_device_pci_cap_write_config(PCIDevice *pci_dev, + uint32_t address, uint32_t val, int len) { -uint8_t cap_id = pci_dev->config_map[address]; +uint8_t cap, cap_id = pci_dev->config_map[address]; pci_default_write_config(pci_dev, address, val, len); switch (cap_id) { #ifdef KVM_CAP_IRQ_ROUTING case PCI_CAP_ID_MSI: #ifdef KVM_CAP_DEVICE_MSI -{ -uint8_t cap = pci_find_capability(pci_dev, cap_id); -if (ranges_overlap(address - cap, len, PCI_MSI_FLAGS, 1)) { 
-assigned_dev_update_msi(pci_dev, cap + PCI_MSI_FLAGS); -} +cap = pci_find_capability(pci_dev, cap_id); +if (ranges_overlap(address - cap, len, PCI_MSI_FLAGS, 1)) { +assigned_dev_update_msi(pci_dev,
[PATCH v3 6/9] device-assignment: Move PCI capabilities to match physical hardware
Now that common PCI code doesn't have a hangup on capabilities being contiguous, move assigned device capabilities to match their offset on physical hardware. This helps for drivers that assume a capability configuration and don't bother searching. We can also remove several calls to assigned_dev_pci_read_* because we're overlaying the capability at the same location as the initial copy we made of config space. We can therefore just use pci_get_*. Signed-off-by: Alex Williamson --- hw/device-assignment.c | 65 +++- 1 files changed, 20 insertions(+), 45 deletions(-) diff --git a/hw/device-assignment.c b/hw/device-assignment.c index 975d3cb..6314773 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -366,16 +366,6 @@ static uint8_t assigned_dev_pci_read_byte(PCIDevice *d, int pos) return (uint8_t)assigned_dev_pci_read(d, pos, 1); } -static uint16_t assigned_dev_pci_read_word(PCIDevice *d, int pos) -{ -return (uint16_t)assigned_dev_pci_read(d, pos, 2); -} - -static uint32_t assigned_dev_pci_read_long(PCIDevice *d, int pos) -{ -return assigned_dev_pci_read(d, pos, 4); -} - static uint8_t pci_find_cap_offset(PCIDevice *d, uint8_t cap) { int id; @@ -1285,6 +1275,7 @@ static int assigned_device_pci_cap_init(PCIDevice *pci_dev) { AssignedDevice *dev = container_of(pci_dev, AssignedDevice, dev); PCIRegion *pci_region = dev->real_device.regions; +int pos; /* Clear initial capabilities pointer and status copied from hw */ pci_set_byte(pci_dev->config + PCI_CAPABILITY_LIST, 0); @@ -1296,60 +1287,44 @@ static int assigned_device_pci_cap_init(PCIDevice *pci_dev) #ifdef KVM_CAP_DEVICE_MSI /* Expose MSI capability * MSI capability is the 1st capability in capability config */ -if (pci_find_cap_offset(pci_dev, PCI_CAP_ID_MSI)) { -int vpos, ppos; -uint16_t flags; - +if ((pos = pci_find_cap_offset(pci_dev, PCI_CAP_ID_MSI))) { dev->cap.available |= ASSIGNED_DEVICE_CAP_MSI; -vpos = pci_add_capability(pci_dev, PCI_CAP_ID_MSI, 0, - PCI_CAPABILITY_CONFIG_MSI_LENGTH); - 
-memset(pci_dev->config + vpos + PCI_CAP_FLAGS, 0, - PCI_CAPABILITY_CONFIG_MSI_LENGTH - PCI_CAP_FLAGS); +pci_add_capability(pci_dev, PCI_CAP_ID_MSI, pos, + PCI_CAPABILITY_CONFIG_MSI_LENGTH); /* Only 32-bit/no-mask currently supported */ -ppos = pci_find_cap_offset(pci_dev, PCI_CAP_ID_MSI); -flags = assigned_dev_pci_read_word(pci_dev, ppos + PCI_MSI_FLAGS); -flags &= PCI_MSI_FLAGS_QMASK; -pci_set_word(pci_dev->config + vpos + PCI_MSI_FLAGS, flags); +pci_set_word(pci_dev->config + pos + PCI_MSI_FLAGS, + pci_get_word(pci_dev->config + pos + PCI_MSI_FLAGS) & + PCI_MSI_FLAGS_QMASK); +pci_set_long(pci_dev->config + pos + PCI_MSI_ADDRESS_LO, 0); +pci_set_word(pci_dev->config + pos + PCI_MSI_DATA_32, 0); /* Set writable fields */ -pci_set_word(pci_dev->wmask + vpos + PCI_MSI_FLAGS, +pci_set_word(pci_dev->wmask + pos + PCI_MSI_FLAGS, PCI_MSI_FLAGS_QSIZE | PCI_MSI_FLAGS_ENABLE); -pci_set_long(pci_dev->wmask + vpos + PCI_MSI_ADDRESS_LO, 0xfffc); -pci_set_long(pci_dev->wmask + vpos + PCI_MSI_DATA_32, 0x); +pci_set_long(pci_dev->wmask + pos + PCI_MSI_ADDRESS_LO, 0xfffc); +pci_set_word(pci_dev->wmask + pos + PCI_MSI_DATA_32, 0x); } #endif #ifdef KVM_CAP_DEVICE_MSIX /* Expose MSI-X capability */ -if (pci_find_cap_offset(pci_dev, PCI_CAP_ID_MSIX)) { -int vpos, ppos, entry_nr, bar_nr; +if ((pos = pci_find_cap_offset(pci_dev, PCI_CAP_ID_MSIX))) { +int bar_nr; uint32_t msix_table_entry; dev->cap.available |= ASSIGNED_DEVICE_CAP_MSIX; -vpos = pci_add_capability(pci_dev, PCI_CAP_ID_MSIX, 0, +pci_add_capability(pci_dev, PCI_CAP_ID_MSIX, pos, PCI_CAPABILITY_CONFIG_MSIX_LENGTH); -memset(pci_dev->config + vpos + PCI_CAP_FLAGS, 0, - PCI_CAPABILITY_CONFIG_MSIX_LENGTH - PCI_CAP_FLAGS); +pci_set_word(pci_dev->config + pos + PCI_MSIX_FLAGS, + pci_get_word(pci_dev->config + pos + PCI_MSIX_FLAGS) & + PCI_MSIX_TABSIZE); /* Only enable and function mask bits are writable */ -pci_set_word(pci_dev->wmask + vpos + PCI_MSIX_FLAGS, +pci_set_word(pci_dev->wmask + pos + PCI_MSIX_FLAGS, 
PCI_MSIX_FLAGS_ENABLE | PCI_MSIX_FLAGS_MASKALL); -ppos = pci_find_cap_offset(pci_dev, PCI_CAP_ID_MSIX); - -entry_nr = assigned_dev_pci_read_word(pci_dev, ppos + PCI_MSIX_FLAGS); -entry_nr &= PCI_MSIX_TABSIZE; -pci_set_word(pci_dev->config + vpos + PCI_MSIX_FLAGS, entry_nr); - -msix_table_entry = assigned_
[PATCH v3 7/9] pci: Remove capability specific handlers
Drivers can break these out on their own if they need to. Signed-off-by: Alex Williamson --- hw/device-assignment.c | 16 - hw/pci.c | 61 ++-- hw/pci.h | 19 --- 3 files changed, 13 insertions(+), 83 deletions(-) diff --git a/hw/device-assignment.c b/hw/device-assignment.c index 6314773..970ffa1 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -63,6 +63,10 @@ static void assigned_dev_load_option_rom(AssignedDevice *dev); static void assigned_dev_unregister_msix_mmio(AssignedDevice *dev); +static void assigned_device_pci_cap_write_config(PCIDevice *pci_dev, + uint32_t address, + uint32_t val, int len); + static uint32_t assigned_dev_ioport_rw(AssignedDevRegion *dev_region, uint32_t addr, int len, uint32_t *val) { @@ -406,14 +410,17 @@ static void assigned_dev_pci_write_config(PCIDevice *d, uint32_t address, ((d->devfn >> 3) & 0x1F), (d->devfn & 0x7), (uint16_t) address, val, len); +if (address > PCI_CONFIG_HEADER_SIZE && d->config_map[address]) { +return assigned_device_pci_cap_write_config(d, address, val, len); +} + if (address == 0x4) { pci_default_write_config(d, address, val, len); /* Continue to program the card */ } if ((address >= 0x10 && address <= 0x24) || address == 0x30 || -address == 0x34 || address == 0x3c || address == 0x3d || -(address > PCI_CONFIG_HEADER_SIZE && d->config_map[address])) { +address == 0x34 || address == 0x3c || address == 0x3d) { /* used for update-mappings (BAR emulation) */ pci_default_write_config(d, address, val, len); return; @@ -1249,7 +1256,7 @@ static void assigned_device_pci_cap_write_config(PCIDevice *pci_dev, uint32_t ad { AssignedDevice *assigned_dev = container_of(pci_dev, AssignedDevice, dev); -pci_default_cap_write_config(pci_dev, address, val, len); +pci_default_write_config(pci_dev, address, val, len); #ifdef KVM_CAP_IRQ_ROUTING #ifdef KVM_CAP_DEVICE_MSI if (assigned_dev->cap.available & ASSIGNED_DEVICE_CAP_MSI) { @@ -1478,9 +1485,6 @@ static int assigned_initfn(struct PCIDevice *pci_dev) 
dev->h_busnr = dev->host.bus; dev->h_devfn = PCI_DEVFN(dev->host.dev, dev->host.func); -pci_register_capability_handlers(pci_dev, NULL, - assigned_device_pci_cap_write_config); - if (assigned_device_pci_cap_init(pci_dev) < 0) goto out; diff --git a/hw/pci.c b/hw/pci.c index 1cf62b6..e529793 100644 --- a/hw/pci.c +++ b/hw/pci.c @@ -772,8 +772,6 @@ static PCIDevice *do_pci_register_device(PCIDevice *pci_dev, PCIBus *bus, config_write = pci_default_write_config; pci_dev->config_read = config_read; pci_dev->config_write = config_write; -pci_dev->cap.config_read = pci_default_cap_read_config; -pci_dev->cap.config_write = pci_default_cap_write_config; bus->devices[devfn] = pci_dev; pci_dev->irq = qemu_allocate_irqs(pci_set_irq, pci_dev, PCI_NUM_PINS); pci_dev->version_id = 2; /* Current pci device vmstate version */ @@ -1070,57 +1068,21 @@ static void pci_update_irq_disabled(PCIDevice *d, int was_irq_disabled) } } -static uint32_t pci_read_config(PCIDevice *d, -uint32_t address, int len) +uint32_t pci_default_read_config(PCIDevice *d, + uint32_t address, int len) { uint32_t val = 0; - +assert(len == 1 || len == 2 || len == 4); len = MIN(len, pci_config_size(d) - address); memcpy(&val, d->config + address, len); return le32_to_cpu(val); } -uint32_t pci_default_read_config(PCIDevice *d, - uint32_t address, int len) -{ -assert(len == 1 || len == 2 || len == 4); - -if (address > PCI_CONFIG_HEADER_SIZE && d->config_map[address]) { -return d->cap.config_read(d, address, len); -} - -return pci_read_config(d, address, len); -} - -uint32_t pci_default_cap_read_config(PCIDevice *pci_dev, - uint32_t address, int len) -{ -return pci_read_config(pci_dev, address, len); -} - -void pci_default_cap_write_config(PCIDevice *pci_dev, - uint32_t address, uint32_t val, int len) -{ -uint32_t config_size = pci_config_size(pci_dev); -int i; - -for (i = 0; i < len && address + i < config_size; val >>= 8, ++i) { -uint8_t wmask = pci_dev->wmask[address + i]; -pci_dev->config[address + i] = 
-(pci_dev->config[address + i] & ~wmask) | (val & wmask); -} -} - void pci_default_write_config(PCIDevice *d, uint32_t addr, uint32_t val, int l) { int i, was_irq_disabled = pci_irq_disabled(d); uint32_t config_size =
[PATCH v3 5/9] pci: Remove cap.length, cap.start, cap.supported
Capabilities aren't required to be contiguous, so cap.length never really made much sense. Likewise, cap.start is mostly meaningless too. Both of these are better served by the capability map. We can also get rid of cap.supported, since it's really now unused and redundant with flag in the status word anyway. Signed-off-by: Alex Williamson --- hw/device-assignment.c |4 hw/pci.c |3 --- hw/pci.h |2 -- 3 files changed, 0 insertions(+), 9 deletions(-) diff --git a/hw/device-assignment.c b/hw/device-assignment.c index a4fad60..975d3cb 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -1292,8 +1292,6 @@ static int assigned_device_pci_cap_init(PCIDevice *pci_dev) pci_get_word(pci_dev->config + PCI_STATUS) & ~PCI_STATUS_CAP_LIST); -pci_dev->cap.length = 0; - #ifdef KVM_CAP_IRQ_ROUTING #ifdef KVM_CAP_DEVICE_MSI /* Expose MSI capability @@ -1320,7 +1318,6 @@ static int assigned_device_pci_cap_init(PCIDevice *pci_dev) PCI_MSI_FLAGS_QSIZE | PCI_MSI_FLAGS_ENABLE); pci_set_long(pci_dev->wmask + vpos + PCI_MSI_ADDRESS_LO, 0xfffc); pci_set_long(pci_dev->wmask + vpos + PCI_MSI_DATA_32, 0x); -pci_dev->cap.length += PCI_CAPABILITY_CONFIG_MSI_LENGTH; } #endif #ifdef KVM_CAP_DEVICE_MSIX @@ -1356,7 +1353,6 @@ static int assigned_device_pci_cap_init(PCIDevice *pci_dev) bar_nr = msix_table_entry & PCI_MSIX_BIR; msix_table_entry &= ~PCI_MSIX_BIR; dev->msix_table_addr = pci_region[bar_nr].base_addr + msix_table_entry; -pci_dev->cap.length += PCI_CAPABILITY_CONFIG_MSIX_LENGTH; } #endif #endif diff --git a/hw/pci.c b/hw/pci.c index cbe6fb7..1cf62b6 100644 --- a/hw/pci.c +++ b/hw/pci.c @@ -1909,8 +1909,6 @@ int pci_add_capability(PCIDevice *pdev, uint8_t cap_id, memset(pdev->cmask + offset, 0xFF, size); pdev->config[PCI_STATUS] |= PCI_STATUS_CAP_LIST; -pdev->cap.supported = 1; -pdev->cap.start = pdev->cap.start ? 
MIN(pdev->cap.start, offset) : offset; return offset; } @@ -1931,7 +1929,6 @@ void pci_del_capability(PCIDevice *pdev, uint8_t cap_id, uint8_t size) if (!pdev->config[PCI_CAPABILITY_LIST]) { pdev->config[PCI_STATUS] &= ~PCI_STATUS_CAP_LIST; -pdev->cap.start = pdev->cap.length = 0; } } diff --git a/hw/pci.h b/hw/pci.h index bd15b43..146f81d 100644 --- a/hw/pci.h +++ b/hw/pci.h @@ -220,8 +220,6 @@ struct PCIDevice { /* Device capability configuration space */ struct { -int supported; -unsigned int start, length; PCICapConfigReadFunc *config_read; PCICapConfigWriteFunc *config_write; } cap;
[PATCH v3 4/9] pci: Replace used bitmap with config byte map
Capabilities are allocated in bytes, so we can track both whether a byte is used and by what capability in the same structure. Remove pci_reserve_capability() as there are no users, remove pci_access_cap_config() since it's now a trivial lookup. Signed-off-by: Alex Williamson --- hw/device-assignment.c |4 ++-- hw/pci.c | 30 +- hw/pci.h |8 ++-- 3 files changed, 13 insertions(+), 29 deletions(-) diff --git a/hw/device-assignment.c b/hw/device-assignment.c index 76aacac..a4fad60 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -423,7 +423,7 @@ static void assigned_dev_pci_write_config(PCIDevice *d, uint32_t address, if ((address >= 0x10 && address <= 0x24) || address == 0x30 || address == 0x34 || address == 0x3c || address == 0x3d || -pci_access_cap_config(d, address, len)) { +(address > PCI_CONFIG_HEADER_SIZE && d->config_map[address])) { /* used for update-mappings (BAR emulation) */ pci_default_write_config(d, address, val, len); return; @@ -459,7 +459,7 @@ static uint32_t assigned_dev_pci_read_config(PCIDevice *d, uint32_t address, if (address < 0x4 || (pci_dev->need_emulate_cmd && address == 0x4) || (address >= 0x10 && address <= 0x24) || address == 0x30 || address == 0x34 || address == 0x3c || address == 0x3d || -pci_access_cap_config(d, address, len)) { +(address > PCI_CONFIG_HEADER_SIZE && d->config_map[address])) { val = pci_default_read_config(d, address, len); DEBUG("(%x.%x): address=%04x val=0x%08x len=%d\n", (d->devfn >> 3) & 0x1F, (d->devfn & 0x7), address, val, len); diff --git a/hw/pci.c b/hw/pci.c index f2896d9..cbe6fb7 100644 --- a/hw/pci.c +++ b/hw/pci.c @@ -712,7 +712,7 @@ static void pci_config_alloc(PCIDevice *pci_dev) pci_dev->cmask = qemu_mallocz(config_size); pci_dev->wmask = qemu_mallocz(config_size); pci_dev->w1cmask = qemu_mallocz(config_size); -pci_dev->used = qemu_mallocz(config_size); +pci_dev->config_map = qemu_mallocz(config_size); } static void pci_config_free(PCIDevice *pci_dev) @@ -721,7 +721,7 @@ static void 
pci_config_free(PCIDevice *pci_dev) qemu_free(pci_dev->cmask); qemu_free(pci_dev->wmask); qemu_free(pci_dev->w1cmask); -qemu_free(pci_dev->used); +qemu_free(pci_dev->config_map); } /* -1 for devfn means auto assign */ @@ -751,6 +751,8 @@ static PCIDevice *do_pci_register_device(PCIDevice *pci_dev, PCIBus *bus, pci_dev->irq_state = 0; pci_config_alloc(pci_dev); +memset(pci_dev->config_map, 0xff, PCI_CONFIG_HEADER_SIZE); + if (!is_bridge) { pci_set_default_subsystem_id(pci_dev); } @@ -1083,21 +1085,13 @@ uint32_t pci_default_read_config(PCIDevice *d, { assert(len == 1 || len == 2 || len == 4); -if (pci_access_cap_config(d, address, len)) { +if (address > PCI_CONFIG_HEADER_SIZE && d->config_map[address]) { return d->cap.config_read(d, address, len); } return pci_read_config(d, address, len); } -int pci_access_cap_config(PCIDevice *pci_dev, uint32_t address, int len) -{ -if (pci_dev->cap.supported && address >= pci_dev->cap.start && -(address + len) < pci_dev->cap.start + pci_dev->cap.length) -return 1; -return 0; -} - uint32_t pci_default_cap_read_config(PCIDevice *pci_dev, uint32_t address, int len) { @@ -1122,7 +1116,7 @@ void pci_default_write_config(PCIDevice *d, uint32_t addr, uint32_t val, int l) int i, was_irq_disabled = pci_irq_disabled(d); uint32_t config_size = pci_config_size(d); -if (pci_access_cap_config(d, addr, l)) { +if (addr > PCI_CONFIG_HEADER_SIZE && d->config_map[addr]) { d->cap.config_write(d, addr, val, l); return; } @@ -1789,7 +1783,7 @@ static int pci_find_space(PCIDevice *pdev, uint8_t size) int offset = PCI_CONFIG_HEADER_SIZE; int i; for (i = PCI_CONFIG_HEADER_SIZE; i < config_size; ++i) -if (pdev->used[i]) +if (pdev->config_map[i]) offset = i + 1; else if (i - offset + 1 == size) return offset; @@ -1908,7 +1902,7 @@ int pci_add_capability(PCIDevice *pdev, uint8_t cap_id, config[PCI_CAP_LIST_ID] = cap_id; config[PCI_CAP_LIST_NEXT] = pdev->config[PCI_CAPABILITY_LIST]; pdev->config[PCI_CAPABILITY_LIST] = offset; -memset(pdev->used + offset, 
0xFF, size); +memset(pdev->config_map + offset, cap_id, size); /* Make capability read-only by default */ memset(pdev->wmask + offset, 0, size); /* Check capability by default */ @@ -1933,7 +1927,7 @@ void pci_del_capability(PCIDevice *pdev, uint8_t cap_id, uint8_t size) memset(pdev->w1cmask + offset, 0, size); /* Clear cmask as device-specific registers can't be checked */ memset(pdev->cmask + offset, 0, size); -memset(pdev->u
[PATCH v3 3/9] device-assignment: Use PCI capabilities support
Convert to use common pci_add_capabilities() rather than creating our own mess. Signed-off-by: Alex Williamson --- hw/device-assignment.c | 112 +++- 1 files changed, 63 insertions(+), 49 deletions(-) diff --git a/hw/device-assignment.c b/hw/device-assignment.c index a297cb4..76aacac 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -38,6 +38,7 @@ #include "device-assignment.h" #include "loader.h" #include "monitor.h" +#include "range.h" #include /* From linux/ioport.h */ @@ -1075,17 +1076,17 @@ static void assigned_dev_update_msi(PCIDevice *pci_dev, unsigned int ctrl_pos) } if (ctrl_byte & PCI_MSI_FLAGS_ENABLE) { +int pos = ctrl_pos - PCI_MSI_FLAGS; assigned_dev->entry = calloc(1, sizeof(struct kvm_irq_routing_entry)); if (!assigned_dev->entry) { perror("assigned_dev_update_msi: "); return; } assigned_dev->entry->u.msi.address_lo = -*(uint32_t *)(pci_dev->config + pci_dev->cap.start + - PCI_MSI_ADDRESS_LO); +pci_get_long(pci_dev->config + pos + PCI_MSI_ADDRESS_LO); assigned_dev->entry->u.msi.address_hi = 0; -assigned_dev->entry->u.msi.data = *(uint16_t *)(pci_dev->config + -pci_dev->cap.start + PCI_MSI_DATA_32); +assigned_dev->entry->u.msi.data = +pci_get_word(pci_dev->config + pos + PCI_MSI_DATA_32); assigned_dev->entry->type = KVM_IRQ_ROUTING_MSI; r = kvm_get_irq_route_gsi(); if (r < 0) { @@ -1123,10 +1124,7 @@ static int assigned_dev_update_msix_mmio(PCIDevice *pci_dev) struct kvm_assigned_msix_entry msix_entry; void *va = adev->msix_table_page; -if (adev->cap.available & ASSIGNED_DEVICE_CAP_MSI) -pos = pci_dev->cap.start + PCI_CAPABILITY_CONFIG_MSI_LENGTH; -else -pos = pci_dev->cap.start; +pos = pci_find_capability(pci_dev, PCI_CAP_ID_MSIX); entries_max_nr = *(uint16_t *)(pci_dev->config + pos + 2); entries_max_nr &= PCI_MSIX_TABSIZE; @@ -1260,26 +1258,23 @@ static void assigned_device_pci_cap_write_config(PCIDevice *pci_dev, uint32_t ad uint32_t val, int len) { AssignedDevice *assigned_dev = container_of(pci_dev, AssignedDevice, dev); 
-unsigned int pos = pci_dev->cap.start, ctrl_pos; pci_default_cap_write_config(pci_dev, address, val, len); #ifdef KVM_CAP_IRQ_ROUTING #ifdef KVM_CAP_DEVICE_MSI if (assigned_dev->cap.available & ASSIGNED_DEVICE_CAP_MSI) { -ctrl_pos = pos + PCI_MSI_FLAGS; -if (address <= ctrl_pos && address + len > ctrl_pos) -assigned_dev_update_msi(pci_dev, ctrl_pos); -pos += PCI_CAPABILITY_CONFIG_MSI_LENGTH; +int pos = pci_find_capability(pci_dev, PCI_CAP_ID_MSI); +if (ranges_overlap(address, len, pos + PCI_MSI_FLAGS, 1)) { +assigned_dev_update_msi(pci_dev, pos + PCI_MSI_FLAGS); +} } #endif #ifdef KVM_CAP_DEVICE_MSIX if (assigned_dev->cap.available & ASSIGNED_DEVICE_CAP_MSIX) { -ctrl_pos = pos + 3; -if (address <= ctrl_pos && address + len > ctrl_pos) { -ctrl_pos--; /* control is word long */ -assigned_dev_update_msix(pci_dev, ctrl_pos); +int pos = pci_find_capability(pci_dev, PCI_CAP_ID_MSIX); +if (ranges_overlap(address, len, pos + PCI_MSIX_FLAGS + 1, 1)) { +assigned_dev_update_msix(pci_dev, pos + PCI_MSIX_FLAGS); } -pos += PCI_CAPABILITY_CONFIG_MSIX_LENGTH; } #endif #endif @@ -1290,58 +1285,77 @@ static int assigned_device_pci_cap_init(PCIDevice *pci_dev) { AssignedDevice *dev = container_of(pci_dev, AssignedDevice, dev); PCIRegion *pci_region = dev->real_device.regions; -int next_cap_pt = 0; -pci_dev->cap.supported = 1; -pci_dev->cap.start = PCI_CAPABILITY_CONFIG_DEFAULT_START_ADDR; +/* Clear initial capabilities pointer and status copied from hw */ +pci_set_byte(pci_dev->config + PCI_CAPABILITY_LIST, 0); +pci_set_word(pci_dev->config + PCI_STATUS, + pci_get_word(pci_dev->config + PCI_STATUS) & + ~PCI_STATUS_CAP_LIST); + pci_dev->cap.length = 0; -pci_dev->config[PCI_STATUS] |= PCI_STATUS_CAP_LIST; -pci_dev->config[PCI_CAPABILITY_LIST] = pci_dev->cap.start; #ifdef KVM_CAP_IRQ_ROUTING #ifdef KVM_CAP_DEVICE_MSI /* Expose MSI capability * MSI capability is the 1st capability in capability config */ if (pci_find_cap_offset(pci_dev, PCI_CAP_ID_MSI)) { +int vpos, ppos; +uint16_t 
flags; + dev->cap.available |= ASSIGNED_DEVICE_CAP_MSI; -memset(&pci_dev->config[pci_dev->cap.start + pci_dev->cap.length], - 0, PCI_CAPABILITY_CONFIG_MSI_LENGTH); -pci_dev->config[pci_dev->cap.start + pci_dev->cap.length] = -PCI_CAP_ID_MSI; +
[PATCH v3 2/9] pci: Remove pci_enable_capability_support()
This interface doesn't make much sense, adding a capability can take care of everything, just provide a means to register capability read/write handlers. Device assignment does it's own thing, so requires a couple ugly hacks that will be cleaned by subsequent patches. Signed-off-by: Alex Williamson --- hw/device-assignment.c | 12 --- hw/pci.c | 52 +--- hw/pci.h |9 +++- 3 files changed, 35 insertions(+), 38 deletions(-) diff --git a/hw/device-assignment.c b/hw/device-assignment.c index 369bff9..a297cb4 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -1292,7 +1292,12 @@ static int assigned_device_pci_cap_init(PCIDevice *pci_dev) PCIRegion *pci_region = dev->real_device.regions; int next_cap_pt = 0; +pci_dev->cap.supported = 1; +pci_dev->cap.start = PCI_CAPABILITY_CONFIG_DEFAULT_START_ADDR; pci_dev->cap.length = 0; +pci_dev->config[PCI_STATUS] |= PCI_STATUS_CAP_LIST; +pci_dev->config[PCI_CAPABILITY_LIST] = pci_dev->cap.start; + #ifdef KVM_CAP_IRQ_ROUTING #ifdef KVM_CAP_DEVICE_MSI /* Expose MSI capability @@ -1488,9 +1493,10 @@ static int assigned_initfn(struct PCIDevice *pci_dev) dev->h_busnr = dev->host.bus; dev->h_devfn = PCI_DEVFN(dev->host.dev, dev->host.func); -if (pci_enable_capability_support(pci_dev, 0, NULL, -assigned_device_pci_cap_write_config, -assigned_device_pci_cap_init) < 0) +pci_register_capability_handlers(pci_dev, NULL, + assigned_device_pci_cap_write_config); + +if (assigned_device_pci_cap_init(pci_dev) < 0) goto out; /* assign device to guest */ diff --git a/hw/pci.c b/hw/pci.c index 8e99746..f2896d9 100644 --- a/hw/pci.c +++ b/hw/pci.c @@ -770,6 +770,8 @@ static PCIDevice *do_pci_register_device(PCIDevice *pci_dev, PCIBus *bus, config_write = pci_default_write_config; pci_dev->config_read = config_read; pci_dev->config_write = config_write; +pci_dev->cap.config_read = pci_default_cap_read_config; +pci_dev->cap.config_write = pci_default_cap_write_config; bus->devices[devfn] = pci_dev; pci_dev->irq = 
qemu_allocate_irqs(pci_set_irq, pci_dev, PCI_NUM_PINS); pci_dev->version_id = 2; /* Current pci device vmstate version */ @@ -1754,35 +1756,21 @@ PCIDevice *pci_create_simple_multifunction(PCIBus *bus, int devfn, return dev; } -int pci_enable_capability_support(PCIDevice *pci_dev, - uint32_t config_start, - PCICapConfigReadFunc *config_read, - PCICapConfigWriteFunc *config_write, - PCICapConfigInitFunc *config_init) +void pci_register_capability_handlers(PCIDevice *pdev, + PCICapConfigReadFunc *config_read, + PCICapConfigWriteFunc *config_write) { -if (!pci_dev) -return -ENODEV; - -pci_dev->config[0x06] |= 0x10; // status = capabilities - -if (config_start == 0) - pci_dev->cap.start = PCI_CAPABILITY_CONFIG_DEFAULT_START_ADDR; -else if (config_start >= 0x40 && config_start < 0xff) -pci_dev->cap.start = config_start; -else -return -EINVAL; +if (config_read) { +pdev->cap.config_read = config_read; +} else { +pdev->cap.config_read = pci_default_cap_read_config; +} -if (config_read) -pci_dev->cap.config_read = config_read; -else -pci_dev->cap.config_read = pci_default_cap_read_config; -if (config_write) -pci_dev->cap.config_write = config_write; -else -pci_dev->cap.config_write = pci_default_cap_write_config; -pci_dev->cap.supported = 1; -pci_dev->config[PCI_CAPABILITY_LIST] = pci_dev->cap.start; -return config_init(pci_dev); +if (config_write) { +pdev->cap.config_write = config_write; +} else { +pdev->cap.config_write = pci_default_cap_write_config; +} } PCIDevice *pci_create(PCIBus *bus, int devfn, const char *name) @@ -1920,12 +1908,16 @@ int pci_add_capability(PCIDevice *pdev, uint8_t cap_id, config[PCI_CAP_LIST_ID] = cap_id; config[PCI_CAP_LIST_NEXT] = pdev->config[PCI_CAPABILITY_LIST]; pdev->config[PCI_CAPABILITY_LIST] = offset; -pdev->config[PCI_STATUS] |= PCI_STATUS_CAP_LIST; memset(pdev->used + offset, 0xFF, size); /* Make capability read-only by default */ memset(pdev->wmask + offset, 0, size); /* Check capability by default */ memset(pdev->cmask + offset, 
0xFF, size); + +pdev->config[PCI_STATUS] |= PCI_STATUS_CAP_LIST; +pdev->cap.supported = 1; +pdev->cap.start = pdev->cap.start ? MIN(pdev->cap.start, offset) : offset; + return offset; } @@ -1943,8 +1935,10 @@ void pci_del_capability(PCIDevice *pdev, uint8_t cap_id, uint8_t siz
[PATCH v3 1/9] pci: pci_default_cap_write_config ignores wmask
Make use of wmask, just like the rest of config space. This duplicates code in pci_default_write_config, but we plan to get rid of this function anyway, so avoid the code churn. Signed-off-by: Alex Williamson --- hw/pci.c | 19 --- 1 files changed, 8 insertions(+), 11 deletions(-) diff --git a/hw/pci.c b/hw/pci.c index c3b5048..8e99746 100644 --- a/hw/pci.c +++ b/hw/pci.c @@ -1088,16 +1088,6 @@ uint32_t pci_default_read_config(PCIDevice *d, return pci_read_config(d, address, len); } -static void pci_write_config(PCIDevice *pci_dev, - uint32_t address, uint32_t val, int len) -{ -int i; -for (i = 0; i < len; i++) { -pci_dev->config[address + i] = val & 0xff; -val >>= 8; -} -} - int pci_access_cap_config(PCIDevice *pci_dev, uint32_t address, int len) { if (pci_dev->cap.supported && address >= pci_dev->cap.start && @@ -1115,7 +1105,14 @@ uint32_t pci_default_cap_read_config(PCIDevice *pci_dev, void pci_default_cap_write_config(PCIDevice *pci_dev, uint32_t address, uint32_t val, int len) { -pci_write_config(pci_dev, address, val, len); +uint32_t config_size = pci_config_size(pci_dev); +int i; + +for (i = 0; i < len && address + i < config_size; val >>= 8, ++i) { +uint8_t wmask = pci_dev->wmask[address + i]; +pci_dev->config[address + i] = +(pci_dev->config[address + i] & ~wmask) | (val & wmask); +} } void pci_default_write_config(PCIDevice *d, uint32_t addr, uint32_t val, int l) -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 0/9] PCI capability and device assignment improvements
v3: - Rework to avoid introducing conflicts with qemu.git hw/pci - Drop capability lookup table - Add back minimal device assignment PM, EXP, X, VPD, VNDR capabilities This version should do a much better job at not introducing new differences between qemu-kvm.git and qemu.git hw/pci. The diff between the pci files after this series is significantly reduced. I added back the previous RFC patch that adds new capabilities for assigned devices. I'm sure we'll want to add passthrough for various fields, but that can come later (along with figuring out whether we can consolidate emulation should other drivers want it). Thanks, Alex v2: - Fixed the function name in 1/8 per Michael's suggestion. - Removed capability specific config read/write registration - Added more checks to add_capability - Added capability lookup table to PCIDevice I've dropped the RFC patch to add more capabilities to device assignment while I do some more work on it. Please feel free to comment on the v1 version though. Patches still against qemu-kvm, but I hope some of this makes it easier to bring qemu & qemu-kvm closer here. v1: This series attempts to clean up capability support between common code and device assignment. In doing so, we can move existing MSI & MSI-X capabilities to offsets matching real hardware, and further enable more capabilities to be exposed. The last patch is only for RFC, I'd like some input on what we should pass directly and where we should only provide read-only/emulated access. Patches 1-7 are submitted for commit. 
--- Alex Williamson (9): device-assignment: pass through and stub more PCI caps device-assignment: Make use of config_map pci: Remove capability specific handlers device-assignment: Move PCI capabilities to match physical hardware pci: Remove cap.length, cap.start, cap.supported pci: Replace used bitmap with config byte map device-assignment: Use PCI capabilities support pci: Remove pci_enable_capability_support() pci: pci_default_cap_write_config ignores wmask hw/device-assignment.c | 332 +--- hw/pci.c | 109 ++-- hw/pci.h | 32 - 3 files changed, 276 insertions(+), 197 deletions(-)
[RFC PATCH] kvm: PCI error stub driver
When a device is driven from userspace (as in KVM's device assignment), it's essential to provide proper PCI error handling support to the corresponding driver. Implementation == The PCI error stub driver is implemented on top of the uio framework. PCI errors are reported to userspace by signaling on an eventfd. Error codes and error results are exposed through a 'logical' BAR that can be mmap'ed. Userspace acknowledges the kernel using another eventfd, which is consumed internally by the driver in a fashion similar to KVM's irqfd. The uio implementation accepts 32-bit writes to the device file; this interface is used to multiplex the passing of both eventfds to the driver. The guest-visible interface is not yet defined. In theory it could be abstracted as part of the Q35 PCIe chipset, or it could be defined as a simple QEMU device model. From a performance perspective, the error signaling path can reach the guest directly using KVM's irqfd. Both the error code and error result can be made directly available to the guest. The guest acknowledgment can signal directly on an ioeventfd. Background information == Error handling happens in two phases: (1) Reporting, (2) Recovery. Phase 1, Reporting (.error_detected callback) = The purpose of the reporting phase is to A) notify the driver so that it can stop all I/O access to the device quickly and B) provide an exit path out of I/O spinloops or inconsistent internal state. The kernel assumes that once the corresponding '.error_detected' callback returns, the driver is in a quiescent state. Ideally, interrupts to the device should also be disabled. If the driver doesn't respond to the '.error_detected' callback in a timely fashion, an accumulation of posted writes may choke the PCIe switch and eventually the CPU complex.
Phase 2, Recovery (.mmio_enabled, .link_reset, .slot_reset, .resume) == The purpose of the recovery phase is to provide a mechanism for the driver to 'reconnect' with the device in a sequenced fashion. The time it takes to perform the recovery process usually doesn't impact the host. Reporting and recovery in userspace === Both the error reporting and error recovery phases are driven by a state machine where the next state depends on the return value of the current callback. Depending on the state the state machine is in, the host will perform various operations such as 'slot reset' or 'link reset'. Because of this, the only way to maintain consistency between the device and its corresponding driver in userspace is to perform both phases in lockstep with the host. The need for a policy; error handling in userspace? === When the 'policy' is set to 'paranoid', the host kernel never relies on userspace: as soon as an error related to the device is detected, the corresponding driver's process is terminated. When the 'policy' is set to 'strict', the entire reporting and recovery phases must succeed in lockstep; a timeout in the lockstep sequence terminates the corresponding driver's process. The 'lazy' mode is similar to 'strict' except that if a timeout happens in the lockstep sequence, the host kernel takes a default path and keeps the process up and running. Considerations == - This is work in progress with minimal testing; looking for comments/ideas. - For performance reasons, the 'lazy' mode is interesting because it may well operate asynchronously, without the expensive lockstep mode. Obviously, the driver's expected recovery sequence _must_ match the default state machine sequence on the host. - Assuming that error recovery is directed by qemu/Q35 PCIe, how do we get the recovery action for every error callback, since AER handling merges callback return codes from multiple sources?
It's hard to discriminate what value should be given back to the host for the corresponding assigned device. - Upon slot_reset, who should restore the device's config space? Can the guest's driver do it? - Should suspend and resume for assigned devices also be part of this driver? - The only policy implemented is 'lazy'. thanks, -Etienne Signed-off-by: Etienne Martineau --- drivers/uio/Kconfig | 11 + drivers/uio/Makefile |2 + drivers/uio/uio_pci_stub.c | 558 ++ include/linux/Kbuild |1 + include/linux/uio_pci_stub.h | 46 5 files changed, 618 insertions(+), 0 deletions(-) create mode 100644 drivers/uio/uio_pci_stub.c create mode 100644 include/linux/uio_pci_stub.h diff --git a/drivers/uio/Kconfig b/drivers/uio/Kconfig index bb44079..d381c0e 100644 --- a/drivers/uio/Kconfig +++ b/drivers/uio/Kconfig @@ -93,5 +93,16 @@ config UIO_N
Re: limit connectivity of a VM
On Fri, Nov 19, 2010 at 2:47 PM, hadi golestani wrote: > Hello, > I need to limit the port speed of a VM to 10 mbps ( or 5 mbps if it's > possible). > What's the way of doing so? tc check http://lartc.org/howto/lartc.qdisc.html -- Javier
limit connectivity of a VM
Hello, I need to limit the port speed of a VM to 10 mbps ( or 5 mbps if it's possible). What's the way of doing so? Regards
Re: [PATCH] kvm: fast-path msi injection with irqfd
>>> On 11/19/2010 at 10:54 AM, in message <20101119155427.ga20...@amt.cnet>, Marcelo Tosatti wrote: > On Thu, Nov 18, 2010 at 07:09:08PM +0200, Michael S. Tsirkin wrote: >> Store irq routing table pointer in the irqfd object, >> and use that to inject MSI directly without bouncing out to >> a kernel thread. >> >> While we touch this structure, rearrange irqfd fields to make fastpath >> better packed for better cache utilization. >> >> This also adds some comments about locking rules and rcu usage in code. >> >> Some notes on the design: >> - Use pointer into the rt instead of copying an entry, >> to make it possible to use rcu, thus side-stepping >> locking complexities. We also save some memory this way. >> - Old workqueue code is still used for level irqs. >> I don't think we DTRT with level anyway, however, >> it seems easier to keep the code around as >> it has been thought through and debugged, and fix level later than >> rip out and re-instate it later. >> >> Signed-off-by: Michael S. Tsirkin >> --- >> >> OK, this seems to work fine for me. Tested with virtio-net in guest >> with and without vhost-net. Pls review/apply if appropriate. > > Acked-by: Marcelo Tosatti Acked-by: Gregory Haskins
Re: [PATCH] kvm: fast-path msi injection with irqfd
On Thu, Nov 18, 2010 at 07:09:08PM +0200, Michael S. Tsirkin wrote: > Store irq routing table pointer in the irqfd object, > and use that to inject MSI directly without bouncing out to > a kernel thread. > > While we touch this structure, rearrange irqfd fields to make fastpath > better packed for better cache utilization. > > This also adds some comments about locking rules and rcu usage in code. > > Some notes on the design: > - Use pointer into the rt instead of copying an entry, > to make it possible to use rcu, thus side-stepping > locking complexities. We also save some memory this way. > - Old workqueue code is still used for level irqs. > I don't think we DTRT with level anyway, however, > it seems easier to keep the code around as > it has been thought through and debugged, and fix level later than > rip out and re-instate it later. > > Signed-off-by: Michael S. Tsirkin > --- > > OK, this seems to work fine for me. Tested with virtio-net in guest > with and without vhost-net. Pls review/apply if appropriate. Acked-by: Marcelo Tosatti
Re: [PATCH v3 6/6] KVM: MMU: delay flush all tlbs on sync_page path
On Fri, Nov 19, 2010 at 05:05:38PM +0800, Xiao Guangrong wrote: > Quote from Avi: > | I don't think we need to flush immediately; set a "tlb dirty" bit somewhere > | that is cleareded when we flush the tlb. kvm_mmu_notifier_invalidate_page() > | can consult the bit and force a flush if set. > > Signed-off-by: Xiao Guangrong > --- > arch/x86/kvm/paging_tmpl.h |4 ++-- > include/linux/kvm_host.h |7 +++ > virt/kvm/kvm_main.c|8 +++- > 3 files changed, 16 insertions(+), 3 deletions(-) > > diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h > index dfb906f..d5c302a 100644 > --- a/arch/x86/kvm/paging_tmpl.h > +++ b/arch/x86/kvm/paging_tmpl.h > @@ -781,14 +781,14 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, > struct kvm_mmu_page *sp) > gfn = gpte_to_gfn(gpte); > > if (FNAME(map_invalid_gpte)(vcpu, sp, &sp->spt[i], gpte)) { > - kvm_flush_remote_tlbs(vcpu->kvm); > + kvm_mark_tlbs_dirty(vcpu->kvm); > continue; > } > > if (gfn != sp->gfns[i]) { > drop_spte(vcpu->kvm, &sp->spt[i], > shadow_trap_nonpresent_pte); > - kvm_flush_remote_tlbs(vcpu->kvm); > + kvm_mark_tlbs_dirty(vcpu->kvm); > continue; > } > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > index 4bd663d..82d449b 100644 > --- a/include/linux/kvm_host.h > +++ b/include/linux/kvm_host.h > @@ -249,6 +249,7 @@ struct kvm { > struct mmu_notifier mmu_notifier; > unsigned long mmu_notifier_seq; > long mmu_notifier_count; > + atomic_t tlbs_dirty; > #endif > }; > > @@ -377,6 +378,12 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu); > void kvm_resched(struct kvm_vcpu *vcpu); > void kvm_load_guest_fpu(struct kvm_vcpu *vcpu); > void kvm_put_guest_fpu(struct kvm_vcpu *vcpu); > + > +static inline void kvm_mark_tlbs_dirty(struct kvm *kvm) > +{ > + atomic_inc(&kvm->tlbs_dirty); > +} > + > void kvm_flush_remote_tlbs(struct kvm *kvm); > void kvm_reload_remote_mmus(struct kvm *kvm); > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index fb93ff9..2962569 100644 > --- 
a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -168,8 +168,12 @@ static bool make_all_cpus_request(struct kvm *kvm, > unsigned int req) > > void kvm_flush_remote_tlbs(struct kvm *kvm) > { > + int dirty_count = atomic_read(&kvm->tlbs_dirty); > + > + smp_mb(); > if (make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH)) > ++kvm->stat.remote_tlb_flush; > + atomic_sub(dirty_count, &kvm->tlbs_dirty); > } This is racy because kvm_flush_remote_tlbs might be called without mmu_lock protection. You could decrease the counter on invalidate_page/invalidate_range_start only, these are not fast paths anyway. > > void kvm_reload_remote_mmus(struct kvm *kvm) > @@ -249,7 +253,8 @@ static void kvm_mmu_notifier_invalidate_page(struct > mmu_notifier *mn, > idx = srcu_read_lock(&kvm->srcu); > spin_lock(&kvm->mmu_lock); > kvm->mmu_notifier_seq++; > - need_tlb_flush = kvm_unmap_hva(kvm, address); > + need_tlb_flush = kvm_unmap_hva(kvm, address) | > + atomic_read(&kvm->tlbs_dirty); > spin_unlock(&kvm->mmu_lock); > srcu_read_unlock(&kvm->srcu, idx); > > @@ -293,6 +298,7 @@ static void > kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, > kvm->mmu_notifier_count++; > for (; start < end; start += PAGE_SIZE) > need_tlb_flush |= kvm_unmap_hva(kvm, start); > + need_tlb_flush |= atomic_read(&kvm->tlbs_dirty); > spin_unlock(&kvm->mmu_lock); > srcu_read_unlock(&kvm->srcu, idx); > -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: seabios 0.6.1 regression
On 11/16/2010 04:17 PM, Alexander Graf wrote: On 16.11.2010, at 15:15, Avi Kivity wrote: > On 11/16/2010 03:19 PM, Alexander Graf wrote: >> Rewriting it to use inb / stos works (jecxz ; insb; loop doesn't) so it looks like a kernel bug in insb emulation. > > Turns out it was a subtle bug in the tpr optimization we do for Windows XP. The problem happens when we load the vapic option rom from the firmware config interface. With inb / movb, writing the vapic area happens in guest context, which the kernel is prepared to handle. >> With insb, the write happens from kvm, which is then undone on the next entry, leading to the tpr being set to a high value. >> >> Shouldn't the vapic area be mapped in on demand? Then we could map it on option rom init time and everyone's happy. > Mapped in? It's an option rom. According to your wording, the kernel handles writes to the vapic area. If the vapic area is only made special after initialization of the option rom, we wouldn't have this issue. Ah, yes, and I'm even now writing a reset handler to make this happen. -- error compiling committee.c: too many arguments to function
Re: ppc32 build failed
On 19.11.2010, at 07:24, Takuya Yoshikawa wrote: > (2010/11/19 15:01), Yang Rui Rui wrote: >> Hi, >> >> I searched the archive and found some discussions about this; not fixed yet? >> Could someone tell me, is g4 kvm available now? > > Hi, (added kvm-ppc to Cc) > > I'm using g4 (Mac mini box) to run KVM. > - though not tried 2.6.37-rc2 yet. > > Aren't you using upstream qemu? > IIRC, ppc kvm needs to use upstream qemu. > > > You can see useful information on the KVM "PowerPC port" page. > http://www.linux-kvm.org/page/PowerPC > > -> no g4 example but we can find enough information. > > > BTW, why is the entry "Book3S PPC32" in the processor support table > still "NO", anyone? > http://www.linux-kvm.org/page/Processor_support Because I'm better at writing code than at maintaining wikis :). Thanks a lot for the reminder. I changed it. Alex
[PATCH v3 6/6] KVM: MMU: delay flush all tlbs on sync_page path
Quote from Avi: | I don't think we need to flush immediately; set a "tlb dirty" bit somewhere | that is cleareded when we flush the tlb. kvm_mmu_notifier_invalidate_page() | can consult the bit and force a flush if set. Signed-off-by: Xiao Guangrong --- arch/x86/kvm/paging_tmpl.h |4 ++-- include/linux/kvm_host.h |7 +++ virt/kvm/kvm_main.c|8 +++- 3 files changed, 16 insertions(+), 3 deletions(-) diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h index dfb906f..d5c302a 100644 --- a/arch/x86/kvm/paging_tmpl.h +++ b/arch/x86/kvm/paging_tmpl.h @@ -781,14 +781,14 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp) gfn = gpte_to_gfn(gpte); if (FNAME(map_invalid_gpte)(vcpu, sp, &sp->spt[i], gpte)) { - kvm_flush_remote_tlbs(vcpu->kvm); + kvm_mark_tlbs_dirty(vcpu->kvm); continue; } if (gfn != sp->gfns[i]) { drop_spte(vcpu->kvm, &sp->spt[i], shadow_trap_nonpresent_pte); - kvm_flush_remote_tlbs(vcpu->kvm); + kvm_mark_tlbs_dirty(vcpu->kvm); continue; } diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 4bd663d..82d449b 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -249,6 +249,7 @@ struct kvm { struct mmu_notifier mmu_notifier; unsigned long mmu_notifier_seq; long mmu_notifier_count; + atomic_t tlbs_dirty; #endif }; @@ -377,6 +378,12 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu); void kvm_resched(struct kvm_vcpu *vcpu); void kvm_load_guest_fpu(struct kvm_vcpu *vcpu); void kvm_put_guest_fpu(struct kvm_vcpu *vcpu); + +static inline void kvm_mark_tlbs_dirty(struct kvm *kvm) +{ + atomic_inc(&kvm->tlbs_dirty); +} + void kvm_flush_remote_tlbs(struct kvm *kvm); void kvm_reload_remote_mmus(struct kvm *kvm); diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index fb93ff9..2962569 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -168,8 +168,12 @@ static bool make_all_cpus_request(struct kvm *kvm, unsigned int req) void kvm_flush_remote_tlbs(struct kvm *kvm) { + int dirty_count = 
atomic_read(&kvm->tlbs_dirty); + + smp_mb(); if (make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH)) ++kvm->stat.remote_tlb_flush; + atomic_sub(dirty_count, &kvm->tlbs_dirty); } void kvm_reload_remote_mmus(struct kvm *kvm) @@ -249,7 +253,8 @@ static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn, idx = srcu_read_lock(&kvm->srcu); spin_lock(&kvm->mmu_lock); kvm->mmu_notifier_seq++; - need_tlb_flush = kvm_unmap_hva(kvm, address); + need_tlb_flush = kvm_unmap_hva(kvm, address) | + atomic_read(&kvm->tlbs_dirty); spin_unlock(&kvm->mmu_lock); srcu_read_unlock(&kvm->srcu, idx); @@ -293,6 +298,7 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, kvm->mmu_notifier_count++; for (; start < end; start += PAGE_SIZE) need_tlb_flush |= kvm_unmap_hva(kvm, start); + need_tlb_flush |= atomic_read(&kvm->tlbs_dirty); spin_unlock(&kvm->mmu_lock); srcu_read_unlock(&kvm->srcu, idx); -- 1.7.0.4 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 5/6] KVM: MMU: abstract invalid guest pte mapping
Introduce a common function to map invalid gpte Signed-off-by: Xiao Guangrong --- arch/x86/kvm/mmu.c |3 -- arch/x86/kvm/paging_tmpl.h | 71 +++- 2 files changed, 37 insertions(+), 37 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index d878dd1..e3d2ee0 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -3074,9 +3074,6 @@ static void mmu_pte_write_new_pte(struct kvm_vcpu *vcpu, return; } - if (is_rsvd_bits_set(&vcpu->arch.mmu, *(u64 *)new, PT_PAGE_TABLE_LEVEL)) - return; - ++vcpu->kvm->stat.mmu_pte_updated; if (!sp->role.cr4_pae) paging32_update_pte(vcpu, sp, spte, new); diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h index 60f00db..dfb906f 100644 --- a/arch/x86/kvm/paging_tmpl.h +++ b/arch/x86/kvm/paging_tmpl.h @@ -299,25 +299,42 @@ static int FNAME(walk_addr_nested)(struct guest_walker *walker, addr, access); } +static bool FNAME(map_invalid_gpte)(struct kvm_vcpu *vcpu, + struct kvm_mmu_page *sp, u64 *spte, + pt_element_t gpte) +{ + u64 nonpresent = shadow_trap_nonpresent_pte; + + if (is_rsvd_bits_set(&vcpu->arch.mmu, gpte, PT_PAGE_TABLE_LEVEL)) + goto no_present; + + if (!is_present_gpte(gpte)) { + if (!sp->unsync) + nonpresent = shadow_notrap_nonpresent_pte; + goto no_present; + } + + if (!(gpte & PT_ACCESSED_MASK)) + goto no_present; + + return false; + +no_present: + drop_spte(vcpu->kvm, spte, nonpresent); + return true; +} + static void FNAME(update_pte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, u64 *spte, const void *pte) { pt_element_t gpte; unsigned pte_access; pfn_t pfn; - u64 new_spte; gpte = *(const pt_element_t *)pte; - if (~gpte & (PT_PRESENT_MASK | PT_ACCESSED_MASK)) { - if (!is_present_gpte(gpte)) { - if (sp->unsync) - new_spte = shadow_trap_nonpresent_pte; - else - new_spte = shadow_notrap_nonpresent_pte; - __set_spte(spte, new_spte); - } + if (FNAME(map_invalid_gpte)(vcpu, sp, spte, gpte)) return; - } + pgprintk("%s: gpte %llx spte %p\n", __func__, (u64)gpte, spte); pte_access = sp->role.access 
& FNAME(gpte_access)(vcpu, gpte); if (gpte_to_gfn(gpte) != vcpu->arch.update_pte.gfn) @@ -364,7 +381,6 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, struct guest_walker *gw, u64 *sptep) { struct kvm_mmu_page *sp; - struct kvm_mmu *mmu = &vcpu->arch.mmu; pt_element_t *gptep = gw->prefetch_ptes; u64 *spte; int i; @@ -395,16 +411,7 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, struct guest_walker *gw, gpte = gptep[i]; - if (is_rsvd_bits_set(mmu, gpte, PT_PAGE_TABLE_LEVEL)) - continue; - - if (!is_present_gpte(gpte)) { - if (!sp->unsync) - __set_spte(spte, shadow_notrap_nonpresent_pte); - continue; - } - - if (!(gpte & PT_ACCESSED_MASK)) + if (FNAME(map_invalid_gpte)(vcpu, sp, spte, gpte)) continue; pte_access = sp->role.access & FNAME(gpte_access)(vcpu, gpte); @@ -761,7 +768,6 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp) pt_element_t gpte; gpa_t pte_gpa; gfn_t gfn; - bool rsvd_bits_set; if (!is_shadow_present_pte(sp->spt[i])) continue; @@ -773,18 +779,15 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp) return -EINVAL; gfn = gpte_to_gfn(gpte); - rsvd_bits_set = is_rsvd_bits_set(&vcpu->arch.mmu, gpte, -PT_PAGE_TABLE_LEVEL); - if (rsvd_bits_set || gfn != sp->gfns[i] || - !is_present_gpte(gpte) || !(gpte & PT_ACCESSED_MASK)) { - u64 nonpresent; - - if (rsvd_bits_set || is_present_gpte(gpte) || - sp->unsync) - nonpresent = shadow_trap_nonpresent_pte; - else - nonpresent = shadow_notrap_nonpresent_pte; - drop_spte(vcpu->kvm, &sp->spt[i], nonpresent); + + if (FNAME(map_invalid_gpte)(vcpu, sp, &sp->sp
[PATCH v3 4/6] KVM: MMU: remove 'clear_unsync' parameter
Remove it since we can judge it by using sp->unsync

Signed-off-by: Xiao Guangrong
---
 arch/x86/include/asm/kvm_host.h |    2 +-
 arch/x86/kvm/mmu.c              |    8 ++++----
 arch/x86/kvm/paging_tmpl.h      |    5 ++---
 3 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index b04c0fa..ce8c1e4 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -250,7 +250,7 @@ struct kvm_mmu {
 	void (*prefetch_page)(struct kvm_vcpu *vcpu,
 			      struct kvm_mmu_page *page);
 	int (*sync_page)(struct kvm_vcpu *vcpu,
-			 struct kvm_mmu_page *sp, bool clear_unsync);
+			 struct kvm_mmu_page *sp);
 	void (*invlpg)(struct kvm_vcpu *vcpu, gva_t gva);
 	hpa_t root_hpa;
 	int root_level;
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index dc64cd6..d878dd1 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1156,7 +1156,7 @@ static void nonpaging_prefetch_page(struct kvm_vcpu *vcpu,
 }

 static int nonpaging_sync_page(struct kvm_vcpu *vcpu,
-			       struct kvm_mmu_page *sp, bool clear_unsync)
+			       struct kvm_mmu_page *sp)
 {
 	return 1;
 }
@@ -1286,7 +1286,7 @@ static int __kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	if (clear_unsync)
 		kvm_unlink_unsync_page(vcpu->kvm, sp);

-	if (vcpu->arch.mmu.sync_page(vcpu, sp, clear_unsync)) {
+	if (vcpu->arch.mmu.sync_page(vcpu, sp)) {
 		kvm_mmu_prepare_zap_page(vcpu->kvm, sp, invalid_list);
 		return 1;
 	}
@@ -1327,12 +1327,12 @@ static void kvm_sync_pages(struct kvm_vcpu *vcpu, gfn_t gfn)
 			continue;

 		WARN_ON(s->role.level != PT_PAGE_TABLE_LEVEL);
+		kvm_unlink_unsync_page(vcpu->kvm, s);
 		if ((s->role.cr4_pae != !!is_pae(vcpu)) ||
-		    (vcpu->arch.mmu.sync_page(vcpu, s, true))) {
+		    (vcpu->arch.mmu.sync_page(vcpu, s))) {
 			kvm_mmu_prepare_zap_page(vcpu->kvm, s, &invalid_list);
 			continue;
 		}
-		kvm_unlink_unsync_page(vcpu->kvm, s);
 		flush = true;
 	}
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 57619ed..60f00db 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -740,8 +740,7 @@ static void FNAME(prefetch_page)(struct kvm_vcpu *vcpu,
  * - The spte has a reference to the struct page, so the pfn for a given gfn
  *   can't change unless all sptes pointing to it are nuked first.
  */
-static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
-			    bool clear_unsync)
+static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 {
 	int i, offset, nr_present;
 	bool host_writable;
@@ -781,7 +780,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 			u64 nonpresent;

 			if (rsvd_bits_set || is_present_gpte(gpte) ||
-			    !clear_unsync)
+			    sp->unsync)
 				nonpresent = shadow_trap_nonpresent_pte;
 			else
 				nonpresent = shadow_notrap_nonpresent_pte;
--
1.7.0.4
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 3/6] KVM: MMU: rename 'reset_host_protection' to 'host_writable'
From: Lai Jiangshan

Rename it to better fit its meaning

Signed-off-by: Lai Jiangshan
Signed-off-by: Xiao Guangrong
---
 arch/x86/kvm/mmu.c         |    8 ++++----
 arch/x86/kvm/paging_tmpl.h |   10 +++++-----
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 59b5bd2..dc64cd6 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1958,7 +1958,7 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 		    unsigned pte_access, int user_fault,
 		    int write_fault, int dirty, int level,
 		    gfn_t gfn, pfn_t pfn, bool speculative,
-		    bool can_unsync, bool reset_host_protection)
+		    bool can_unsync, bool host_writable)
 {
 	u64 spte, entry = *sptep;
 	int ret = 0;
@@ -1985,7 +1985,7 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 		spte |= kvm_x86_ops->get_mt_mask(vcpu, gfn,
 			kvm_is_mmio_pfn(pfn));

-	if (reset_host_protection)
+	if (host_writable)
 		spte |= SPTE_HOST_WRITEABLE;

 	spte |= (u64)pfn << PAGE_SHIFT;
@@ -2048,7 +2048,7 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 			 int user_fault, int write_fault, int dirty,
 			 int *ptwrite, int level, gfn_t gfn,
 			 pfn_t pfn, bool speculative,
-			 bool reset_host_protection)
+			 bool host_writable)
 {
 	int was_rmapped = 0;
 	int rmap_count;
@@ -2083,7 +2083,7 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,

 	if (set_spte(vcpu, sptep, pte_access, user_fault, write_fault,
 		      dirty, level, gfn, pfn, speculative, true,
-		      reset_host_protection)) {
+		      host_writable)) {
 		if (write_fault)
 			*ptwrite = 1;
 		kvm_mmu_flush_tlb(vcpu);
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index ca0e5e8..57619ed 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -329,7 +329,7 @@ static void FNAME(update_pte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 		return;
 	kvm_get_pfn(pfn);
 	/*
-	 * we call mmu_set_spte() with reset_host_protection = true because
+	 * we call mmu_set_spte() with host_writable = true because
 	 * vcpu->arch.update_pte.pfn was fetched from get_user_pages(write = 1).
 	 */
 	mmu_set_spte(vcpu, spte, sp->role.access, pte_access, 0, 0,
@@ -744,7 +744,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 			    bool clear_unsync)
 {
 	int i, offset, nr_present;
-	bool reset_host_protection;
+	bool host_writable;
 	gpa_t first_pte_gpa;

 	offset = nr_present = 0;
@@ -794,14 +794,14 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 		pte_access = sp->role.access & FNAME(gpte_access)(vcpu, gpte);
 		if (!(sp->spt[i] & SPTE_HOST_WRITEABLE)) {
 			pte_access &= ~ACC_WRITE_MASK;
-			reset_host_protection = 0;
+			host_writable = 0;
 		} else {
-			reset_host_protection = 1;
+			host_writable = 1;
 		}
 		set_spte(vcpu, &sp->spt[i], pte_access, 0, 0,
 			 is_dirty_gpte(gpte), PT_PAGE_TABLE_LEVEL, gfn,
 			 spte_to_pfn(sp->spt[i]), true, false,
-			 reset_host_protection);
+			 host_writable);
 	}

 	return !nr_present;
--
1.7.0.4
[PATCH v3 2/6] KVM: MMU: don't drop spte when overwriting it from W to RO
We just need to flush the TLB if we overwrite a writable spte with a
read-only one, and this operation should be moved to set_spte() so that
the sync_page path is covered as well.

Signed-off-by: Xiao Guangrong
---
 arch/x86/kvm/mmu.c |   20 +++++++++-----------
 1 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index bdb9fa9..59b5bd2 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1960,7 +1960,7 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 		    gfn_t gfn, pfn_t pfn, bool speculative,
 		    bool can_unsync, bool reset_host_protection)
 {
-	u64 spte;
+	u64 spte, entry = *sptep;
 	int ret = 0;

 	/*
@@ -2031,6 +2031,14 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,

 set_pte:
 	update_spte(sptep, spte);
+	/*
+	 * If we overwrite a writable spte with a read-only one we
+	 * should flush remote TLBs. Otherwise rmap_write_protect
+	 * will find a read-only spte, even though the writable spte
+	 * might be cached on a CPU's TLB.
+	 */
+	if (is_writable_pte(entry) && !is_writable_pte(*sptep))
+		kvm_flush_remote_tlbs(vcpu->kvm);
 done:
 	return ret;
 }
@@ -2069,16 +2077,6 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 				 spte_to_pfn(*sptep), pfn);
 			drop_spte(vcpu->kvm, sptep, shadow_trap_nonpresent_pte);
 			kvm_flush_remote_tlbs(vcpu->kvm);
-		/*
-		 * If we overwrite a writable spte with a read-only one,
-		 * drop it and flush remote TLBs. Otherwise rmap_write_protect
-		 * will find a read-only spte, even though the writable spte
-		 * might be cached on a CPU's TLB.
-		 */
-		} else if (is_writable_pte(*sptep) &&
-			  (!(pte_access & ACC_WRITE_MASK) || !dirty)) {
-			drop_spte(vcpu->kvm, sptep, shadow_trap_nonpresent_pte);
-			kvm_flush_remote_tlbs(vcpu->kvm);
 		} else
 			was_rmapped = 1;
 	}
--
1.7.0.4
[PATCH v3 1/6] KVM: MMU: fix forgotten tlb flush on sync_page path
We should flush all TLBs after dropping an spte on the sync_page path,
since otherwise the following race is possible. Quote from Avi:

| sync_page
| drop_spte
|                              kvm_mmu_notifier_invalidate_page
|                              kvm_unmap_rmapp
|                              spte doesn't exist -> no flush
|                              page is freed
| guest can write into freed page?

Signed-off-by: Xiao Guangrong
---
 arch/x86/kvm/paging_tmpl.h |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 590bf12..ca0e5e8 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -786,6 +786,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 			else
 				nonpresent = shadow_notrap_nonpresent_pte;
 			drop_spte(vcpu->kvm, &sp->spt[i], nonpresent);
+			kvm_flush_remote_tlbs(vcpu->kvm);
 			continue;
 		}

--
1.7.0.4
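The window in Avi's interleaving can be modeled with two booleans: whether the spte is present, and whether a stale copy of the translation still sits in another vcpu's TLB. This is a toy model under stated assumptions, not kernel code; the function names only echo the real paths. The mmu notifier flushes only when it still finds a present spte, so if sync_page drops the spte without flushing, the page is freed while a writable TLB entry survives.

```c
#include <stdbool.h>

static bool spte_present;   /* the shadow pte                       */
static bool tlb_has_entry;  /* stale translation in another vcpu's TLB */

/* sync_page drops the spte; with the fix it also flushes remote TLBs. */
static void sync_page_drop_spte(bool with_fix)
{
    spte_present = false;
    if (with_fix)
        tlb_has_entry = false;  /* models kvm_flush_remote_tlbs() */
}

/* mmu notifier path: kvm_unmap_rmapp only flushes if it finds a
 * present spte; then the host frees the page. Returns true iff the
 * free is safe (no stale TLB entry can still reach the page). */
static bool invalidate_page_then_free(void)
{
    if (spte_present)
        tlb_has_entry = false;  /* rmap walk found the spte -> flush */
    return !tlb_has_entry;      /* page freed; stale entry == bug    */
}
```

Without the fix the model reaches the "guest can write into freed page" state; with the added kvm_flush_remote_tlbs() call the free is always safe.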
Re: [RFC PATCH 2/2] KVM: selective write protection using dirty bitmap
(2010/11/18 22:06), Avi Kivity wrote:
> On 11/18/2010 07:15 AM, Takuya Yoshikawa wrote:
>> We can also use this to selectively write protect pages to reduce
>> unwanted page faults in the future.
>
> Looks like a good approach. Any measurements?

OK, I'll do some tests to select the right approach. Personally, and
some as a job if possible, I'm planning various things:

a) Frame buffer: small bitmap size (my hobby now)

   Performance: measurements at both the function level and the guest
   side are necessary. *1 For interactivity, qemu-side intelligence may
   be needed.
   Note: the qemu-side dirty bitmap optimization by the Kemari team is
   pending now.

   Correctness: I suspect the qemu side, but I'm not good at
   device-level VGA details yet, sorry.

   *1) I once tried to update the dirty bitmap without srcu sync by
   using atomic_clear_mask() word by word. This seemed to be working,
   and get_dirty_log() was not bad. But when I tested x11perf inside
   the guest, the numbers were worse than with the original
   implementation (LOCK_PREFIX might do bad things to the cache and
   the srcu reader side suffered, I guess). Good SRCU!

b) Live migration: large bitmap size (my target is ~4GB RAM now)

   Performance: scalability is important from both the RAM size and the
   number-of-dirty-pages point of view. We need to achieve this goal
   without sacrificing the low-workload case.

   Correctness: I'm trying to make live migration fail, without mst's
   fix, under multi-threaded heavy workload. SMP is necessary.

>> +
>> +		if (!(gfn_offset || (gfn % huge)))
>> +			break;
>
> Why?

I preserved these lines from Lai's original:
http://www.spinics.net/lists/kvm/msg35871.html
But OK, I probably found some duplication. I'll fix it.

> wrt. O(1) write protection: hard to tell if the two methods can
> coexist. For direct mapped shadow pages (i.e. ept/npt) I think we can
> use the mask to speed up clearing of an individual sp's sptes.

OK, I'll test both with and without NPT. The result may lead us to the
right direction.

Takuya
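The "selectively write protect using the dirty bitmap" idea discussed above can be sketched per bitmap word: visit only the set bits and clear the writable bit of the corresponding sptes, so the next guest write faults and re-dirties the page. This is a minimal illustration, not the patch's code; PT_WRITABLE_MASK and write_protect_by_mask are stand-ins, and it assumes each bit of the word maps to one spte in an array.

```c
#include <stdint.h>

#define PT_WRITABLE_MASK (1ULL << 1)  /* stand-in for KVM's writable bit */

/* Write-protect only the sptes whose bit is set in dirty_mask.
 * Returns the number of sptes touched; clean pages are skipped
 * entirely, which is the point of the selective approach. */
static int write_protect_by_mask(uint64_t *sptes, unsigned long dirty_mask)
{
    int n = 0;

    while (dirty_mask) {
        int i = __builtin_ctzl(dirty_mask);  /* index of lowest set bit */
        sptes[i] &= ~PT_WRITABLE_MASK;       /* next write will fault   */
        dirty_mask &= dirty_mask - 1;        /* clear that bit          */
        n++;
    }
    return n;
}
```

The cost scales with the number of dirty pages in the word rather than the word width, which matters for the low-workload case Takuya mentions.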
Re: [PATCH -v3] Monitor command: x-gpa2hva, translate guest physical address to host virtual address
On Tue, 2010-11-16 at 10:23 +0800, Huang Ying wrote:
> Author: Max Asbock
>
> Add command x-gpa2hva to translate a guest physical address to a host
> virtual address. Because the gpa to hva translation is not consistent,
> this command is only used for debugging.
>
> The x-gpa2hva command provides one step in a chain of translations
> from guest virtual to guest physical to host virtual to host physical.
> Host physical is then used to inject a machine check error. As a
> consequence the HWPOISON code on the host and the MCE injection code
> in qemu-kvm are exercised.
>
> v3:
>
> - Rename to x-gpa2hva
> - Remove QMP support, because gpa2hva is not consistent

Is this patch an acceptable solution for now? This command is useful
for our testing.

Best Regards,
Huang Ying
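The gpa-to-hva step in the translation chain boils down to memory-slot arithmetic: find the slot containing the gfn, then offset into the userspace mapping that backs it. The sketch below is illustrative, not qemu-kvm's actual implementation; the struct and field names mirror KVM's memslot concept but are defined here for the example, and 4KiB pages are assumed.

```c
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_MASK  ((1ULL << PAGE_SHIFT) - 1)

/* Minimal memory-slot descriptor: a run of guest frames backed by a
 * contiguous host-userspace mapping. */
struct memslot {
    uint64_t base_gfn;        /* first guest frame number in the slot */
    uint64_t npages;          /* number of pages in the slot          */
    uint64_t userspace_addr;  /* hva backing base_gfn                 */
};

/* Translate a guest physical address to a host virtual address.
 * Returns 0 if no slot covers the gpa (e.g. MMIO holes), which is one
 * reason the translation is "not consistent" and debug-only. */
static uint64_t gpa2hva(const struct memslot *slots, int n, uint64_t gpa)
{
    uint64_t gfn = gpa >> PAGE_SHIFT;

    for (int i = 0; i < n; i++) {
        const struct memslot *s = &slots[i];

        if (gfn >= s->base_gfn && gfn < s->base_gfn + s->npages)
            return s->userspace_addr +
                   ((gfn - s->base_gfn) << PAGE_SHIFT) +
                   (gpa & PAGE_MASK);
    }
    return 0;
}
```

From the hva, /proc/self/pagemap (or an equivalent) gives the host physical address, completing the chain the patch description mentions for MCE injection.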