Re: [PATCH] kvm: add missing void __user * cast to access_ok() call

2011-05-24 Thread Takuya Yoshikawa
On Tue, 24 May 2011 07:51:27 +0200
Heiko Carstens heiko.carst...@de.ibm.com wrote:

 From: Heiko Carstens heiko.carst...@de.ibm.com
 
 fa3d315a KVM: Validate userspace_addr of memslot when registered introduced
 this new warning onn s390:
 
 kvm_main.c: In function '__kvm_set_memory_region':
 kvm_main.c:654:7: warning: passing argument 1 of '__access_ok' makes pointer 
 from integer without a cast
 arch/s390/include/asm/uaccess.h:53:19: note: expected 'const void *' but 
 argument is of type '__u64'
 
 Add the missing cast to get rid of it again...
 

Looks good to me, thank you!

I should have checked s390's type checking...

  Takuya


 Cc: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp
 Signed-off-by: Heiko Carstens heiko.carst...@de.ibm.com
 ---
  virt/kvm/kvm_main.c |3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)
 
 --- a/virt/kvm/kvm_main.c
 +++ b/virt/kvm/kvm_main.c
 @@ -651,7 +651,8 @@ int __kvm_set_memory_region(struct kvm *
   /* We can read the guest memory with __xxx_user() later on. */
   if (user_alloc 
   ((mem-userspace_addr  (PAGE_SIZE - 1)) ||
 -  !access_ok(VERIFY_WRITE, mem-userspace_addr, mem-memory_size)))
 +  !access_ok(VERIFY_WRITE, (void __user *)mem-userspace_addr,
 + mem-memory_size)))
   goto out;
   if (mem-slot = KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS)
   goto out;


-- 
Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/5] Setup private bridge in KVM autotest, get rid of ifup scripts

2011-05-24 Thread Lucas Meneghel Rodrigues
This patch combines 2 awesome patchsets by Amos and Jason to
single handedly slay 2 long time annoyances for KVM autotest users:

* By default we were using qemu userspace networking, a mode unsupported
by KVM developers and not terribly useful, and many network tests are
not supposed to work properly with it
* Use of ifup scripts that tried to do a very naive form of bridge
autodetection. If we control our bridge creation, we don't need no
autodetection.

Carefully made, verified, but not tested in several setups.
This might need some more testing love before we can get it
upstream once for all.

Changes from v2:
* Try to close tap file descriptors in order to run multiple iterations

Lucas Meneghel Rodrigues (5):
  KVM test: Adding framework code to control bridge creation
  KVM test: Add helpers to control the TAP/bridge
  KVM test: virt_env_process: Setup private bridge during
postprocessing
  KVM test: setup tap fd and pass it to qemu-kvm v3
  KVM test: Changing KVM autotest default to private bridge

 client/tests/kvm/scripts/qemu-ifup |   11 --
 client/tests/kvm/tests_base.cfg.sample |   12 +-
 client/virt/kvm_vm.py  |   59 ++---
 client/virt/virt_env_process.py|9 ++
 client/virt/virt_test_setup.py |  103 +++
 client/virt/virt_utils.py  |  226 
 6 files changed, 384 insertions(+), 36 deletions(-)
 delete mode 100755 client/tests/kvm/scripts/qemu-ifup

-- 
1.7.5.1

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/5] KVM test: Adding framework code to control bridge creation

2011-05-24 Thread Lucas Meneghel Rodrigues
Provide in framework utility code to control the creation
of a bridge, in order to provide TAP functionality for
autotest users without relying on previous setup made by
the user.

This is a reimplementation of Amos's code, the differences
are:

 * Implemented as a setup class, taking advantage of object
internal state to use in different places of the code
 * Split up the operations to make it easier to understand
the steps and why we are doing them
 * Use of autotest API instead of commands

Signed-off-by: Lucas Meneghel Rodrigues l...@redhat.com
Signed-off-by: Amos Kong ak...@redhat.com
---
 client/virt/virt_test_setup.py |  103 
 1 files changed, 103 insertions(+), 0 deletions(-)

diff --git a/client/virt/virt_test_setup.py b/client/virt/virt_test_setup.py
index f915c1b..cdc0e23 100644
--- a/client/virt/virt_test_setup.py
+++ b/client/virt/virt_test_setup.py
@@ -105,3 +105,106 @@ class HugePageConfig(object):
 return
 utils.system(echo 0  %s % self.kernel_hp_file)
 logging.debug(Hugepage memory successfuly dealocated)
+
+
+class PrivateBridgeError(Exception):
+def __init__(self, brname):
+self.brname = brname
+
+def __str__(self):
+return Bridge %s not available after setup % self.brname
+
+
+class PrivateBridgeConfig(object):
+__shared_state = {}
+def __init__(self, params=None):
+self.__dict__ = self.__shared_state
+if params is not None:
+self.brname = params.get(priv_brname, 'atbr0')
+self.subnet = params.get(priv_subnet, '192.168.58')
+self.ip_version = params.get(bridge_ip_version, ipv4)
+self.dhcp_server_pid = None
+
+
+def _add_bridge(self):
+utils.system(brctl addbr %s % self.brname)
+ip_fwd_path = /proc/sys/net/%s/ip_forward % self.ip_version
+ip_fwd = open(ip_fwd_path, w)
+ip_fwd.write(1\n)
+utils.system(brctl stp %s on % self.brname)
+utils.system(brctl setfd %s 0 % self.brname)
+
+
+def _bring_bridge_up(self):
+utils.system(ifconfig %s %s.1 up % (self.brname, self.subnet))
+
+
+def _enable_nat(self):
+utils.system(iptables -t nat -A POSTROUTING -s %s.254/24 ! 
+ -d %s.254/24 -j MASQUERADE % (self.subnet, self.subnet))
+
+
+def _start_dhcp_server(self):
+utils.system(service dnsmasq stop)
+utils.system(dnsmasq --strict-order --bind-interfaces 
+ --listen-address %s.1 --dhcp-range %s.1,%s.254 
+ --pid-file=/tmp/dnsmasq.pid 
+ --log-facility=/tmp/dnsmasq.log %
+ (self.subnet, self.subnet, self.subnet))
+try:
+self.dhcp_server_pid = int(open('/tmp/dnsmasq.pid', 'r').read())
+except ValueError:
+raise PrivateBridgeError(self.brname)
+logging.debug(Started internal DHCP server with PID %s,
+  self.dhcp_server_pid)
+
+
+def _verify_bridge(self):
+brctl_output = utils.system_output(brctl show)
+if self.brname not in brctl_output:
+raise PrivateBridgeError(self.brname)
+
+
+def setup(self):
+brctl_output = utils.system_output(brctl show)
+if self.brname not in brctl_output:
+logging.info(Configuring KVM test private bridge %s, self.brname)
+self._add_bridge()
+self._bring_bridge_up()
+self._enable_nat()
+self._start_dhcp_server()
+self._verify_bridge()
+
+
+def _stop_dhcp_server(self):
+os.kill(self.dhcp_server_pid, 15)
+
+
+def _bring_bridge_down(self):
+utils.system(ifconfig %s down % self.brname)
+
+
+def _disable_nat(self):
+utils.system(iptables -t nat -D POSTROUTING -s %s.254/24 !
+  -d %s.254/24 -j MASQUERADE % (self.subnet, 
self.subnet))
+
+
+def _remove_bridge(self):
+utils.system(brctl delbr %s % self.brname)
+
+
+def cleanup(self):
+brctl_output = utils.system_output(brctl show)
+cleanup = False
+for line in brctl_output:
+if line.startswith(self.brname):
+# len == 4 means there is a TAP using the bridge
+# so don't try to clean up
+if len(line.split())  4:
+cleanup = True
+if cleanup:
+logging.debug(Cleaning up KVM test private bridge %s, 
self.brname)
+self._stop_dhcp_server()
+self._bring_bridge_down()
+self._disable_nat()
+self._remove_bridge()
-- 
1.7.5.1

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/5] KVM test: Add helpers to control the TAP/bridge

2011-05-24 Thread Lucas Meneghel Rodrigues
This patch adds some helpers to assist virt test to setup the bridge or
macvtap based guest networking.

Changes from v1:
 * Fixed undefined variable errors on the exception class definitions

Signed-off-by: Jason Wang jasow...@redhat.com
Signed-off-by: Lucas Meneghel Rodrigues l...@redhat.com
---
 client/virt/virt_utils.py |  222 +
 1 files changed, 222 insertions(+), 0 deletions(-)

diff --git a/client/virt/virt_utils.py b/client/virt/virt_utils.py
index 5510c89..7eeaf4f 100644
--- a/client/virt/virt_utils.py
+++ b/client/virt/virt_utils.py
@@ -6,6 +6,7 @@ KVM test utility functions.
 
 import time, string, random, socket, os, signal, re, logging, commands, cPickle
 import fcntl, shelve, ConfigParser, threading, sys, UserDict, inspect
+import struct
 from autotest_lib.client.bin import utils, os_dep
 from autotest_lib.client.common_lib import error, logging_config
 import rss_client, aexpect
@@ -15,6 +16,20 @@ try:
 except ImportError:
 KOJI_INSTALLED = False
 
+# From include/linux/sockios.h
+SIOCSIFHWADDR = 0x8924
+SIOCGIFHWADDR = 0x8927
+SIOCSIFFLAGS = 0x8914
+SIOCGIFINDEX = 0x8933
+SIOCBRADDIF = 0x89a2
+# From linux/include/linux/if_tun.h
+TUNSETIFF = 0x400454ca
+TUNGETIFF = 0x800454d2
+TUNGETFEATURES = 0x800454cf
+IFF_UP = 0x1
+IFF_TAP = 0x0002
+IFF_NO_PI = 0x1000
+IFF_VNET_HDR = 0x4000
 
 def _lock_file(filename):
 f = open(filename, w)
@@ -36,6 +51,80 @@ def is_vm(obj):
 return obj.__class__.__name__ == VM
 
 
+class NetError(Exception):
+pass
+
+
+class TAPModuleError(NetError):
+def __init__(self, devname):
+NetError.__init__(self, devname)
+self.devname = devname
+
+def __str__(self):
+return Can't open %s % self.devname
+
+class TAPNotExistError(NetError):
+def __init__(self, ifname):
+NetError.__init__(self, ifname)
+self.ifname = ifname
+
+def __str__(self):
+return Interface %s does not exist % self.ifname
+
+
+class TAPCreationError(NetError):
+def __init__(self, ifname, details=None):
+NetError.__init__(self, ifname, details)
+self.ifname = ifname
+self.details = details
+
+def __str__(self):
+e_msg = Cannot create TAP device %s % self.ifname
+if self.details is not None:
+e_msg += : %s % self.details
+return e_msg
+
+
+class TAPBringUpError(NetError):
+def __init__(self, ifname):
+NetError.__init__(self, ifname)
+self.ifname = ifname
+
+def __str__(self):
+return Cannot bring up TAP %s % self.ifname
+
+
+class BRAddIfError(NetError):
+def __init__(self, ifname, brname, details):
+NetError.__init__(self, ifname, brname, details)
+self.ifname = ifname
+self.brname = brname
+self.details = details
+
+def __str__(self):
+return (Can not add if %s to bridge %s: %s %
+(self.ifname, self.brname, self.details))
+
+
+class HwAddrSetError(NetError):
+def __init__(self, ifname, mac):
+NetError.__init__(self, ifname, mac)
+self.ifname = ifname
+self.mac = mac
+
+def __str__(self):
+return Can not set mac %s to interface %s % (self.mac, self.ifname)
+
+
+class HwAddrGetError(NetError):
+def __init__(self, ifname):
+NetError.__init__(self, ifname)
+self.ifname = ifname
+
+def __str__(self):
+return Can not get mac of interface %s % self.ifname
+
+
 class Env(UserDict.IterableUserDict):
 
 A dict-like object containing global objects used by tests.
@@ -2307,3 +2396,136 @@ def install_host_kernel(job, params):
 else:
 logging.info('Chose %s, using the current kernel for the host',
  install_type)
+
+
+def bridge_auto_detect():
+
+Automatically detect a bridge for tap through brctl.
+
+try:
+brctl_output = utils.system_output(ip route list,
+   retain_output=True)
+brname = re.findall(default.*dev (.*) , brctl_output)[0]
+except:
+raise BRAutoDetectError
+return brname
+
+
+def if_nametoindex(ifname):
+
+Map an interface name into its corresponding index.
+Returns 0 on error, as 0 is not a valid index
+
+@param ifname: interface name
+
+index = 0
+ctrl_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, 0)
+ifr = struct.pack(16si, ifname, 0)
+r = fcntl.ioctl(ctrl_sock, SIOCGIFINDEX, ifr)
+index = struct.unpack(16si, r)[1]
+ctrl_sock.close()
+return index
+
+
+def vnet_hdr_probe(tapfd):
+
+Check if the IFF_VNET_HDR is support by tun.
+
+@param tapfd: the file descriptor of /dev/net/tun
+
+u = struct.pack(I, 0)
+r = fcntl.ioctl(tapfd, TUNGETFEATURES, u)
+flags = struct.unpack(I, r)[0]
+if flags  IFF_VNET_HDR:
+return True
+else:
+return False
+
+
+def open_tap(devname, ifname, vnet_hdr=True):
+
+Open a tap device and 

[PATCH 4/5] KVM test: setup tap fd and pass it to qemu-kvm v3

2011-05-24 Thread Lucas Meneghel Rodrigues
We used to use qemu-ifup to manage the tap which have several
limitations:

1) If we want to specify a bridge, we must create a customized
qemu-ifup file as the default script always match the first bridge.
2) It's hard to add support for macvtap device.

So this patch let kvm subtest control the tap creation and setup then
pass it to qemu-kvm. User could specify the bridge he want to used in
configuration file.

The original autoconfiguration was changed by private bridge setup.

Changes from v1:
* Combine the private bridge config and TAP fd in one patchset,
dropped the auto mode
* Close TAP fds on VM.destroy() (thanks to Amos Kong for finding
the problem)

Signed-off-by: Jason Wang jasow...@redhat.com
Signed-off-by: Lucas Meneghel Rodrigues l...@redhat.com
---
 client/tests/kvm/scripts/qemu-ifup |   11 ---
 client/virt/kvm_vm.py  |   59 +++
 client/virt/virt_utils.py  |8 -
 3 files changed, 45 insertions(+), 33 deletions(-)
 delete mode 100755 client/tests/kvm/scripts/qemu-ifup

diff --git a/client/tests/kvm/scripts/qemu-ifup 
b/client/tests/kvm/scripts/qemu-ifup
deleted file mode 100755
index c4debf5..000
--- a/client/tests/kvm/scripts/qemu-ifup
+++ /dev/null
@@ -1,11 +0,0 @@
-#!/bin/sh
-
-# The following expression selects the first bridge listed by 'brctl show'.
-# Modify it to suit your needs.
-switch=$(/usr/sbin/brctl show | awk 'NR==2 { print $1 }')
-
-/bin/echo 1  /proc/sys/net/ipv6/conf/${switch}/disable_ipv6
-/sbin/ifconfig $1 0.0.0.0 up
-/usr/sbin/brctl addif ${switch} $1
-/usr/sbin/brctl setfd ${switch} 0
-/usr/sbin/brctl stp ${switch} off
diff --git a/client/virt/kvm_vm.py b/client/virt/kvm_vm.py
index c9bb273..9b26dee 100644
--- a/client/virt/kvm_vm.py
+++ b/client/virt/kvm_vm.py
@@ -7,7 +7,7 @@ Utility classes and functions to handle Virtual Machine 
creation using qemu.
 import time, os, logging, fcntl, re, commands, glob
 from autotest_lib.client.common_lib import error
 from autotest_lib.client.bin import utils
-import virt_utils, virt_vm, kvm_monitor, aexpect
+import virt_utils, virt_vm, virt_test_setup, kvm_monitor, aexpect
 
 
 class VM(virt_vm.BaseVM):
@@ -41,6 +41,7 @@ class VM(virt_vm.BaseVM):
 self.pci_assignable = None
 self.netdev_id = []
 self.device_id = []
+self.tapfds = []
 self.uuid = None
 
 
@@ -240,19 +241,17 @@ class VM(virt_vm.BaseVM):
 cmd += ,id='%s' % device_id
 return cmd
 
-def add_net(help, vlan, mode, ifname=None, script=None,
-downscript=None, tftp=None, bootfile=None, hostfwd=[],
-netdev_id=None, netdev_extra_params=None):
+def add_net(help, vlan, mode, ifname=None, tftp=None, bootfile=None,
+hostfwd=[], netdev_id=None, netdev_extra_params=None,
+tapfd=None):
 if has_option(help, netdev):
 cmd =  -netdev %s,id=%s % (mode, netdev_id)
 if netdev_extra_params:
 cmd += ,%s % netdev_extra_params
 else:
 cmd =  -net %s,vlan=%d % (mode, vlan)
-if mode == tap:
-if ifname: cmd += ,ifname='%s' % ifname
-if script: cmd += ,script='%s' % script
-cmd += ,downscript='%s' % (downscript or no)
+if mode == tap and tapfd:
+cmd += ,fd=%d % tapfd
 elif mode == user:
 if tftp and [,tftp= in help:
 cmd += ,tftp='%s' % tftp
@@ -422,20 +421,22 @@ class VM(virt_vm.BaseVM):
 qemu_cmd += add_nic(help, vlan, nic_params.get(nic_model), mac,
 device_id, netdev_id, 
nic_params.get(nic_extra_params))
 # Handle the '-net tap' or '-net user' or '-netdev' part
-script = nic_params.get(nic_script)
-downscript = nic_params.get(nic_downscript)
 tftp = nic_params.get(tftp)
-if script:
-script = virt_utils.get_path(root_dir, script)
-if downscript:
-downscript = virt_utils.get_path(root_dir, downscript)
 if tftp:
 tftp = virt_utils.get_path(root_dir, tftp)
-qemu_cmd += add_net(help, vlan, nic_params.get(nic_mode, user),
-vm.get_ifname(vlan),
-script, downscript, tftp,
+if nic_params.get(nic_mode) == tap:
+try:
+tapfd = vm.tapfds[vlan]
+except:
+tapfd = None
+else:
+tapfd = None
+qemu_cmd += add_net(help, vlan,
+nic_params.get(nic_mode, user),
+vm.get_ifname(vlan), tftp,
 nic_params.get(bootp), redirs, netdev_id,
-

[PATCH 5/5] KVM test: Changing KVM autotest default to private bridge

2011-05-24 Thread Lucas Meneghel Rodrigues
Rather than the unsupported userspace mode, which still is
an option. This way we are giving users a default setup
much closer to a real world usage scenario, and enable
people to run all the network tests that don't work properly
in user mode.

Signed-off-by: Lucas Meneghel Rodrigues l...@redhat.com
---
 client/tests/kvm/tests_base.cfg.sample |   12 +++-
 1 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/client/tests/kvm/tests_base.cfg.sample 
b/client/tests/kvm/tests_base.cfg.sample
index 2b0d63c..5276d23 100644
--- a/client/tests/kvm/tests_base.cfg.sample
+++ b/client/tests/kvm/tests_base.cfg.sample
@@ -57,10 +57,13 @@ redirs = remote_shell
 guest_port_remote_shell = 22
 
 # NIC parameters
-nic_mode = user
-#nic_mode = tap
-nic_script = scripts/qemu-ifup
-#nic_script = scripts/qemu-ifup-ipv6
+#nic_mode = user
+nic_mode = tap
+bridge = private
+# You can set bridge to
+# be a specific bridge
+# name, such as 'virbr0'
+#bridge = virbr0
 run_tcpdump = yes
 
 # Misc
@@ -117,7 +120,6 @@ variants:
 # Install guest from cdrom 
 - cdrom:
 medium = cdrom
-nic_mode = user
 redirs +=  unattended_install
 # Install guest from http/ftp url
 - url:
-- 
1.7.5.1

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/5] KVM test: virt_env_process: Setup private bridge during postprocessing

2011-05-24 Thread Lucas Meneghel Rodrigues
Call bridge setup at preprocessing and cleanup at postprocessing.
The bridge can be cleaned up when no tap interfaces are using it.

Signed-off-by: Lucas Meneghel Rodrigues l...@redhat.com
---
 client/virt/virt_env_process.py |9 +
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/client/virt/virt_env_process.py b/client/virt/virt_env_process.py
index 625b422..1742a02 100644
--- a/client/virt/virt_env_process.py
+++ b/client/virt/virt_env_process.py
@@ -196,6 +196,11 @@ def preprocess(test, params, env):
 @param env: The environment (a dict-like object).
 
 error.context(preprocessing)
+
+if params.get(bridge) == private:
+brcfg = virt_test_setup.PrivateBridgeConfig(params)
+brcfg.setup()
+
 # Start tcpdump if it isn't already running
 if address_cache not in env:
 env[address_cache] = {}
@@ -365,6 +370,10 @@ def postprocess(test, params, env):
 int(params.get(post_command_timeout, 600)),
 params.get(post_command_noncritical) == yes)
 
+if params.get(bridge) == private:
+brcfg = virt_test_setup.PrivateBridgeConfig()
+brcfg.cleanup()
+
 
 def postprocess_on_error(test, params, env):
 
-- 
1.7.5.1

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/4] Make possible to run client tests as subtests

2011-05-24 Thread Lucas Meneghel Rodrigues
In order to avoid duplication of code, make it possible to run the
existing autotest client tests as subtests. This patchset is result
of work on Jiri Zupka original single patch, the differences:

* Removed example subtest KVM autotest test
* Renamed some API introduced to net_utils for consistency
* Rewrote netperf in terms of the new 'subtest' infrastructure

Lucas Meneghel Rodrigues (4):
  client.bin.net.net_utils: Introduce get_local_ip()
  client: Make it possible to run subtests in autotest
  tools: Make html_report to deal with subtest results
  KVM test: Rewrite netperf in terms of subtest

 client/bin/client_logging_config.py |5 +-
 client/bin/net/net_utils.py |   17 +
 client/common_lib/base_job.py   |2 +
 client/common_lib/logging_config.py |3 +-
 client/common_lib/test.py   |   21 ++-
 client/tools/html_report.py |  124 +++---
 client/virt/tests/netperf.py|  117 +
 7 files changed, 143 insertions(+), 146 deletions(-)

-- 
1.7.5.1

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/4] client.bin.net.net_utils: Introduce get_local_ip()

2011-05-24 Thread Lucas Meneghel Rodrigues
Get ip address in local system which can communicate
with a given ip, that will be useful for subtests like
netperf2.

Signed-off-by: Jiri Zupka jzu...@redhat.com
---
 client/bin/net/net_utils.py |   17 +
 1 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/client/bin/net/net_utils.py b/client/bin/net/net_utils.py
index 868958c..7c96ba0 100644
--- a/client/bin/net/net_utils.py
+++ b/client/bin/net/net_utils.py
@@ -5,6 +5,7 @@ This library is to release in the public repository.
 
 import commands, os, re, socket, sys, time, struct
 from autotest_lib.client.common_lib import error
+from autotest_lib.client.bin import utils as client_utils
 import utils
 
 TIMEOUT = 10 # Used for socket timeout and barrier timeout
@@ -27,6 +28,22 @@ class network_utils(object):
 utils.system('/sbin/ifconfig -a')
 
 
+def get_ip_local(self, query_ip, netmask=24):
+
+Get ip address in local system which can communicate with query_ip.
+
+@param query_ip: IP of client which wants to communicate with
+autotest machine.
+@return: IP address which can communicate with query_ip
+
+ip = client_utils.system_output(ip addr show to %s/%s %
+(query_ip, netmask))
+ip = re.search(rinet ([0-9.]*)/,ip)
+if ip is None:
+return ip
+return ip.group(1)
+
+
 def disable_ip_local_loopback(self, ignore_status=False):
 utils.system(echo '1'  /proc/sys/net/ipv4/route/no_local_loopback,
  ignore_status=ignore_status)
-- 
1.7.5.1

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/4] tools: Make html_report to deal with subtest results

2011-05-24 Thread Lucas Meneghel Rodrigues
Signed-off-by: Jiri Zupka jzu...@redhat.com
---
 client/tools/html_report.py |  124 ---
 1 files changed, 69 insertions(+), 55 deletions(-)

diff --git a/client/tools/html_report.py b/client/tools/html_report.py
index c4e97b2..563a7a9 100755
--- a/client/tools/html_report.py
+++ b/client/tools/html_report.py
@@ -1372,7 +1372,7 @@ function processList(ul) {
 }
 
 
-stimelist = []
+
 
 
 def make_html_file(metadata, results, tag, host, output_file_name, dirname):
@@ -1430,11 +1430,12 @@ return true;
 total_failed = 0
 total_passed = 0
 for res in results:
-total_executed += 1
-if res['status'] == 'GOOD':
-total_passed += 1
-else:
-total_failed += 1
+if results[res][2] != None:
+total_executed += 1
+if results[res][2]['status'] == 'GOOD':
+total_passed += 1
+else:
+total_failed += 1
 stat_str = 'No test cases executed'
 if total_executed  0:
 failed_perct = int(float(total_failed)/float(total_executed)*100)
@@ -1471,39 +1472,46 @@ id=t1 class=stats table-autosort:4 table-autofilter 
table-stripeclass:alterna
 tbody
 
 print  output, result_table_prefix
-for res in results:
-print  output, 'tr'
-print  output, 'td align=left%s/td' % res['time']
-print  output, 'td align=left%s/td' % res['testcase']
-if res['status'] == 'GOOD':
-print  output, 'td align=\left\bfont 
color=#00CC00PASS/font/b/td'
-elif res['status'] == 'FAIL':
-print  output, 'td align=\left\bfont 
color=redFAIL/font/b/td'
-elif res['status'] == 'ERROR':
-print  output, 'td align=\left\bfont 
color=redERROR!/font/b/td'
-else:
-print  output, 'td align=\left\%s/td' % res['status']
-# print exec time (seconds)
-print  output, 'td align=left%s/td' % res['exec_time_sec']
-# print log only if test failed..
-if res['log']:
-#chop all '\n' from log text (to prevent html errors)
-rx1 = re.compile('(\s+)')
-log_text = rx1.sub(' ', res['log'])
-
-# allow only a-zA-Z0-9_ in html title name
-# (due to bug in MS-explorer)
-rx2 = re.compile('([^a-zA-Z_0-9])')
-updated_tag = rx2.sub('_', res['title'])
-
-html_body_text = 
'htmlheadtitle%s/title/headbody%s/body/html' % 
(str(updated_tag), log_text)
-print  output, 'td align=\left\A HREF=\#\ 
onClick=\popup(\'%s\',\'%s\')\Info/A/td' % (str(updated_tag), 
str(html_body_text))
-else:
-print  output, 'td align=\left\/td'
-# print execution time
-print  output, 'td align=leftA HREF=\%s\Debug/A/td' % 
os.path.join(dirname, res['title'], debug)
+def print_result(result, indent):
+while result != []:
+r = result.pop(0)
+print r
+res = results[r][2]
+print  output, 'tr'
+print  output, 'td align=left%s/td' % res['time']
+print  output, 'td align=left 
style=padding-left:%dpx%s/td' % (indent * 20, res['title'])
+if res['status'] == 'GOOD':
+print  output, 'td align=\left\bfont 
color=#00CC00PASS/font/b/td'
+elif res['status'] == 'FAIL':
+print  output, 'td align=\left\bfont 
color=redFAIL/font/b/td'
+elif res['status'] == 'ERROR':
+print  output, 'td align=\left\bfont 
color=redERROR!/font/b/td'
+else:
+print  output, 'td align=\left\%s/td' % res['status']
+# print exec time (seconds)
+print  output, 'td align=left%s/td' % res['exec_time_sec']
+# print log only if test failed..
+if res['log']:
+#chop all '\n' from log text (to prevent html errors)
+rx1 = re.compile('(\s+)')
+log_text = rx1.sub(' ', res['log'])
+
+# allow only a-zA-Z0-9_ in html title name
+# (due to bug in MS-explorer)
+rx2 = re.compile('([^a-zA-Z_0-9])')
+updated_tag = rx2.sub('_', res['title'])
+
+html_body_text = 
'htmlheadtitle%s/title/headbody%s/body/html' % 
(str(updated_tag), log_text)
+print  output, 'td align=\left\A HREF=\#\ 
onClick=\popup(\'%s\',\'%s\')\Info/A/td' % (str(updated_tag), 
str(html_body_text))
+else:
+print  output, 'td align=\left\/td'
+# print execution time
+print  output, 'td align=leftA HREF=\%s\Debug/A/td' 
% os.path.join(dirname, res['subdir'], debug)
 
-print  output, '/tr'
+print  output, '/tr'
+print_result(results[r][1], indent + 1)
+
+print_result(results[][1], 0)
 print  output, /tbody/table
 
 
@@ -1531,21 +1539,27 @@ id=t1 class=stats table-autosort:4 table-autofilter 

[PATCH 2/4] client: Make it possible to run subtests in autotest

2011-05-24 Thread Lucas Meneghel Rodrigues
Do that by adding an additional utility function in the
test object, that will call another test from inside the
test scope.

Signed-off-by: Jiri Zupka jzu...@redhat.com
---
 client/bin/client_logging_config.py |5 +++--
 client/common_lib/base_job.py   |2 ++
 client/common_lib/logging_config.py |3 ++-
 client/common_lib/test.py   |   21 -
 4 files changed, 27 insertions(+), 4 deletions(-)

diff --git a/client/bin/client_logging_config.py 
b/client/bin/client_logging_config.py
index a59b078..28c007d 100644
--- a/client/bin/client_logging_config.py
+++ b/client/bin/client_logging_config.py
@@ -12,8 +12,9 @@ class ClientLoggingConfig(logging_config.LoggingConfig):
 
 
 def configure_logging(self, results_dir=None, verbose=False):
-super(ClientLoggingConfig, self).configure_logging(use_console=True,
-   verbose=verbose)
+super(ClientLoggingConfig, self).configure_logging(
+  use_console=self.use_console,
+  verbose=verbose)
 
 if results_dir:
 log_dir = os.path.join(results_dir, 'debug')
diff --git a/client/common_lib/base_job.py b/client/common_lib/base_job.py
index 843c0e8..eef9efc 100644
--- a/client/common_lib/base_job.py
+++ b/client/common_lib/base_job.py
@@ -1117,6 +1117,7 @@ class base_job(object):
 tag_parts = []
 
 # build up the parts of the tag used for the test name
+master_testpath = dargs.get('master_testpath', )
 base_tag = dargs.pop('tag', None)
 if base_tag:
 tag_parts.append(str(base_tag))
@@ -1132,6 +1133,7 @@ class base_job(object):
 if subdir_tag:
 tag_parts.append(subdir_tag)
 subdir = '.'.join([testname] + tag_parts)
+subdir = os.path.join(master_testpath, subdir)
 tag = '.'.join(tag_parts)
 
 return full_testname, subdir, tag
diff --git a/client/common_lib/logging_config.py 
b/client/common_lib/logging_config.py
index afe754a..9114d7a 100644
--- a/client/common_lib/logging_config.py
+++ b/client/common_lib/logging_config.py
@@ -32,9 +32,10 @@ class LoggingConfig(object):
 fmt='%(asctime)s %(levelname)-5.5s| %(message)s',
 datefmt='%H:%M:%S')
 
-def __init__(self):
+def __init__(self, use_console=True):
 self.logger = logging.getLogger()
 self.global_level = logging.DEBUG
+self.use_console = use_console
 
 
 @classmethod
diff --git a/client/common_lib/test.py b/client/common_lib/test.py
index c55d23b..d5564c3 100644
--- a/client/common_lib/test.py
+++ b/client/common_lib/test.py
@@ -465,6 +465,24 @@ class base_test(object):
 self.job.enable_warnings(NETWORK)
 
 
+def runsubtest(self, url, *args, **dargs):
+
+Execute another autotest test from inside the current test's scope.
+
+@param test: Parent test.
+@param url: Url of new test.
+@param tag: Tag added to test name.
+@param args: Args for subtest.
+@param dargs: Dictionary with args for subtest.
+@iterations: Number of subtest iterations.
+@profile_only: If true execute one profiled run.
+
+dargs[profile_only] = dargs.get(profile_only, False)
+test_basepath = self.outputdir[len(self.job.resultdir + /):]
+self.job.run_test(url, master_testpath=test_basepath,
+  *args, **dargs)
+
+
 def _get_nonstar_args(func):
 Extract all the (normal) function parameter names.
 
@@ -658,7 +676,8 @@ def runtest(job, url, tag, args, dargs,
 if not bindir:
 raise error.TestError(testname + ': test does not exist')
 
-outputdir = os.path.join(job.resultdir, testname)
+subdir = os.path.join(dargs.pop('master_testpath', ), testname)
+outputdir = os.path.join(job.resultdir, subdir)
 if tag:
 outputdir += '.' + tag
 
-- 
1.7.5.1

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 4/4] KVM test: Rewrite netperf in terms of subtest

2011-05-24 Thread Lucas Meneghel Rodrigues
As the first usage of the new subtest function, reimplement
netperf using subtest. This way we just don't have to care
about replicating build and other boilerplate code that is
better handled by autotest itself.

Signed-off-by: Lucas Meneghel Rodrigues l...@redhat.com
---
 client/virt/tests/netperf.py |  117 +++---
 1 files changed, 30 insertions(+), 87 deletions(-)

diff --git a/client/virt/tests/netperf.py b/client/virt/tests/netperf.py
index 8a80d13..ab742f1 100644
--- a/client/virt/tests/netperf.py
+++ b/client/virt/tests/netperf.py
@@ -1,17 +1,18 @@
 import logging, os, signal
 from autotest_lib.client.common_lib import error
 from autotest_lib.client.bin import utils
-from autotest_lib.client.virt import aexpect, virt_utils
+from autotest_lib.client.bin.net import net_utils
+from autotest_lib.client.virt import aexpect, virt_utils, virt_test_utils
+
 
 def run_netperf(test, params, env):
 
 Network stress test with netperf.
 
 1) Boot up a VM with multiple nics.
-2) Launch netserver on guest.
-3) Execute multiple netperf clients on host in parallel
-   with different protocols.
-4) Output the test result.
+2) Launch netperf server on host.
+3) Execute netperf client on guest.
+4) Output the test results.
 
 @param test: KVM test object.
 @param params: Dictionary with the test parameters.
@@ -21,86 +22,28 @@ def run_netperf(test, params, env):
 vm.verify_alive()
 login_timeout = int(params.get(login_timeout, 360))
 session = vm.wait_for_login(timeout=login_timeout)
-session.close()
-session_serial = vm.wait_for_serial_login(timeout=login_timeout)
-
-netperf_dir = os.path.join(os.environ['AUTODIR'], tests/netperf2)
-setup_cmd = params.get(setup_cmd)
-
-firewall_flush = iptables -F
-session_serial.cmd_output(firewall_flush)
-try:
-utils.run(iptables -F)
-except:
-pass
-
-for i in params.get(netperf_files).split():
-vm.copy_files_to(os.path.join(netperf_dir, i), /tmp)
-
-try:
-session_serial.cmd(firewall_flush)
-except aexpect.ShellError:
-logging.warning(Could not flush firewall rules on guest)
-
-session_serial.cmd(setup_cmd % /tmp, timeout=200)
-session_serial.cmd(params.get(netserver_cmd) % /tmp)
-
-if tcpdump in env and env[tcpdump].is_alive():
-# Stop the background tcpdump process
-try:
-logging.debug(Stopping the background tcpdump)
-env[tcpdump].close()
-except:
-pass
-
-def netperf(i=0):
-guest_ip = vm.get_address(i)
-logging.info(Netperf_%s: netserver %s % (i, guest_ip))
-result_file = os.path.join(test.resultsdir, output_%s_%s
-   % (test.iteration, i ))
-list_fail = []
-result = open(result_file, w)
-result.write(Netperf test results\n)
-
-for p in params.get(protocols).split():
-packet_size = params.get(packet_size, 1500)
-for size in packet_size.split():
-cmd = params.get(netperf_cmd) % (netperf_dir, p,
-   guest_ip, size)
-logging.info(Netperf_%s: protocol %s % (i, p))
-try:
-netperf_output = utils.system_output(cmd,
- retain_output=True)
-result.write(%s\n % netperf_output)
-except:
-logging.error(Test of protocol %s failed, p)
-list_fail.append(p)
-
-result.close()
-if list_fail:
-raise error.TestFail(Some netperf tests failed: %s %
- , .join(list_fail))
-
-try:
-logging.info(Setup and run netperf clients on host)
-utils.run(setup_cmd % netperf_dir)
-
-bg = []
-nic_num = len(params.get(nics).split())
-for i in range(nic_num):
-bg.append(virt_utils.Thread(netperf, (i,)))
-bg[i].start()
 
-completed = False
-while not completed:
-completed = True
-for b in bg:
-if b.is_alive():
-completed = False
-finally:
-try:
-for b in bg:
-if b:
-b.join()
-finally:
-session_serial.cmd_output(killall netserver)
+session.cmd(iptables -F)
+
+timeout = int(params.get(test_timeout, 300))
+control_path = os.path.join(test.tmpdir, netperf_client.control)
+
+guest_ip = vm.get_address()
+host_ip = net_utils.network().get_ip_local(guest_ip)
+if host_ip is not None:
+c = open(control_path, 'w')
+c.write('job.run_test(url=netperf2, server_ip=%s, client_ip=%s, '
+'role=client, tag=client)' % (host_ip, guest_ip))
+c.close()
+guest = 

Restoring saved guest causes guest to reboot

2011-05-24 Thread Markus Schade

Hello,

after upgrading the kernel from 2.6.37.6 to 2.6.38.X the guest domain
reboots immediately after restore from a saved state. The OS of the guest is not
a factor as it is the same for Linux (Debian, Ubuntu, CentOS, etc.), FreeBSD
and Windows.

While save and restore seems to work on a suspended guest. But once the guest
is resumed, the domain reboots. Last working kernel version is 2.6.37.6.

I can confirm this behaviour with qemu-kvm 0.13.0 to latest 0.14.1
and vanilla kernels up to 2.6.39.

This also won't work on unmodified stock installs of Ubuntu 11.04 or
Fedora 15.

Please let me know, if you need further information or testing.

Best regards,
Markus
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] Enable CPU SMEP feature for KVM

2011-05-24 Thread Avi Kivity

On 05/24/2011 05:53 AM, Haitao Shan wrote:



I don't understand why.  Can you elaborate?

Shadow implements the U bit, which is all that is needed by SMEP
as far as I can tell.

Basically, all SMEP-capable platform has EPT, which is on by default 
in KVM. Thus, we naturally thought there was little value to add it to 
SPT.


We try to keep features orthogonal.  That has value for testing, and 
results in clearer code.


Another thing that we are not so sure of is whether SPT has tricky 
usages on U bit (for optimization or whatever). With SMEP, this trick 
may be easily broken.


In fact it does, we play with the U bit to emulate cr0.wp.  I'll be 
happy to write the patch to handle this issue, since I'm familiar with 
the code.



Anyway, we are investigating enabling SMEP with SPT now.



Great, thanks.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv2 10/14] virtio_net: limit xmit polling

2011-05-24 Thread Krishna Kumar2
Michael S. Tsirkin m...@redhat.com wrote on 05/23/2011 04:49:00 PM:

  To do this properly, we should really be using the actual number of sg
  elements needed, but we'd have to do most of xmit_skb beforehand so we
  know how many.
 
  Cheers,
  Rusty.

 Maybe I'm confused here.  The problem isn't the failing
 add_buf for the given skb IIUC.  What we are trying to do here is stop
 the queue *before xmit_skb fails*. We can't look at the
 number of fragments in the current skb - the next one can be
 much larger.  That's why we check capacity after xmit_skb,
 not before it, right?

Maybe Rusty means it is a simpler model to free the amount
of space that this xmit needs. We will still fail anyway
at some time but it is unlikely, since earlier iteration
freed up atleast the space that it was going to use. The
code could become much simpler:

start_xmit()
{
{
num_sgs = get num_sgs for this skb;

/* Free enough pending old buffers to enable queueing this one */
free_old_xmit_skbs(vi, num_sgs * 2); /* ?? */

if (virtqueue_get_capacity()  num_sgs) {
netif_stop_queue(dev);
if (virtqueue_enable_cb_delayed(vi-svq) ||
free_old_xmit_skbs(vi, num_sgs)) {
/* Nothing freed up, or not enough freed up */
kfree_skb(skb);
return NETDEV_TX_OK;
}
netif_start_queue(dev);
virtqueue_disable_cb(vi-svq);
}

/* xmit_skb cannot fail now, also pass 'num_sgs' */
xmit_skb(vi, skb, num_sgs);
virtqueue_kick(vi-svq);

skb_orphan(skb);
nf_reset(skb);

return NETDEV_TX_OK;
}

We could even return TX_BUSY since that makes the dequeue
code more efficient. See dev_dequeue_skb() - you can skip a
lot of code (and avoid taking locks) to check if the queue
is already stopped but that code runs only if you return
TX_BUSY in the earlier iteration.

BTW, shouldn't the check in start_xmit be:
if (likely(!free_old_xmit_skbs(vi, 2+MAX_SKB_FRAGS))) {
...
}

Thanks,

- KK

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling

2011-05-24 Thread Nadav Har'El
On Tue, May 24, 2011, Tian, Kevin wrote about RE: [PATCH 08/31] nVMX: Fix 
local_vcpus_link handling:
  +static inline void loaded_vmcs_init(struct loaded_vmcs *loaded_vmcs)
  +{
  +   vmcs_clear(loaded_vmcs-vmcs);
  +   loaded_vmcs-cpu = -1;
  +   loaded_vmcs-launched = 0;
  +}
  +
 
 call it vmcs_init instead since you now remove original vmcs_init invocation,
 which more reflect the necessity of adding VMCLEAR for a new vmcs?

The best name for this function, I think, would have been loaded_vmcs_clear,
because this function isn't necessarily used to init - it's also called to
VMCLEAR an old vmcs (and flush its content back to memory) - in that sense
it is definitely not a vmcs_init.

Unfortunately, I already have a whole chain of functions with this name :(
the existing loaded_vmcs_clear() does an IPI to the CPU which has this VMCS
loaded, and causes it to run __loaded_vmcs_clear(), which in turn calls
the above loaded_vmcs_init(). I wish I could call all three functions
loaded_vmcs_clear(), but of course I can't. If anyone reading this has a good
suggestion on how to name these three functions, please let me know.

  +static void __loaded_vmcs_clear(void *arg)
   {
  -   struct vcpu_vmx *vmx = arg;
  +   struct loaded_vmcs *loaded_vmcs = arg;
  int cpu = raw_smp_processor_id();
  
  -   if (vmx-vcpu.cpu == cpu)
  -   vmcs_clear(vmx-vmcs);
  -   if (per_cpu(current_vmcs, cpu) == vmx-vmcs)
  +   if (loaded_vmcs-cpu != cpu)
  +   return; /* cpu migration can race with cpu offline */
 
 what do you mean by cpu migration here? why does 'cpu offline'
 matter here regarding to the cpu change?

__loaded_vmcs_clear() is typically called in one of two cases: cpu migration
means that a guest that used to run on one CPU, and had its VMCS loaded there,
suddenly needs to run on a different CPU, so we need to clear the VMCS on
the old CPU. cpu offline means that we want to take a certain CPU offline,
and before we do that we should VMCLEAR all VMCSs which were loaded on it.

The (vmx-cpu.cpu != cpu) case in __loaded_vmcs_clear should ideally never
happen: In the cpu offline path, we only call it for the loaded_vmcss which
we know for sure are loaded on the current cpu. In the cpu migration path,
loaded_vmcs_clear runs __loaded_vmcs_clear on the right CPU, which ensures that
equality.

But, there can be a race condition (this was actually explained to me a while
back by Avi - I never seen this happening in practice): Imagine that cpu
migration calls loaded_vmcs_clear, which tells the old cpu (via IPI) to
VMCLEAR this vmcs. But before that old CPU gets a chance to act on that IPI,
a decision is made to take it offline, and all loaded_vmcs loaded on it
(including the one in question) are cleared. When that CPU acts on this IPI,
it notices that vmx-cpu.cpu==-1, i.e., != cpu, so it doesn't need to do
anything (in the new version of the code, I made this more explicit, by
returning immediately in this case).

At least this is the theory. As I said, I didn't see this problem in practice
(unsurprising, since I never offlined any CPU). Maybe Avi or someone else can
comment more about this (vmx-cpu.cpu == cpu) check, which existed before
my patch - in __vcpu_clear().

  +static void free_loaded_vmcs(struct loaded_vmcs *loaded_vmcs)
  +{
  +   if (!loaded_vmcs-vmcs)
  +   return;
  +   loaded_vmcs_clear(loaded_vmcs);
  +   free_vmcs(loaded_vmcs-vmcs);
  +   loaded_vmcs-vmcs = NULL;
  +}
 
 prefer to not do cleanup work through loaded_vmcs since it's just a pointer
 to a loaded vmcs structure. Though you can carefully arrange the nested
 vmcs cleanup happening before it, it's not very clean and a bit error prone
 simply from this function itself. It's clearer to directly cleanup vmcs01, and
 if you want an assertion could be added to make sure loaded_vmcs doesn't
 point to any stale vmcs02 structure after nested cleanup step.

I'm afraid I didn't understand what you meant here... Basically, this
free_loaded_vmcs() is just a shortcut for loaded_vmcs_clear() and free_vmcs(),
as doing both is needed in 3 places: nested_free_vmcs02,
nested_free_all_saved_vmcss, vmx_free_vcpu. The same function is needed
for both vmcs01 and vmcs02 VMCSs - in both cases when we don't need them any
more we need to VMCLEAR them and then free the VMCS memory. Note that this
function does *not* free the loaded_vmcs structure itself.

What's wrong with this?
Would you prefer that I remove this function and explictly call
loaded_vmcs_clear() and then free_vmcs() in all three places?

Thanks,
Nadav.

-- 
Nadav Har'El|  Tuesday, May 24 2011, 20 Iyyar 5771
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |Linux: Because a PC is a terrible thing
http://nadav.harel.org.il   |to waste.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  

RE: [PATCH 17/31] nVMX: Prepare vmcs02 from vmcs01 and vmcs12

2011-05-24 Thread Tian, Kevin
 From: Nadav Har'El
 Sent: Tuesday, May 17, 2011 3:53 AM

 This patch contains code to prepare the VMCS which can be used to actually
 run the L2 guest, vmcs02. prepare_vmcs02 appropriately merges the
 information
 in vmcs12 (the vmcs that L1 built for L2) and in vmcs01 (our desires for our
 own guests).

 Signed-off-by: Nadav Har'El n...@il.ibm.com
 ---
  arch/x86/kvm/vmx.c |  269
 +++
  1 file changed, 269 insertions(+)

 --- .before/arch/x86/kvm/vmx.c2011-05-16 22:36:48.0 +0300
 +++ .after/arch/x86/kvm/vmx.c 2011-05-16 22:36:48.0 +0300
 @@ -347,6 +347,12 @@ struct nested_vmx {
   /* vmcs02_list cache of VMCSs recently used to run L2 guests */
   struct list_head vmcs02_pool;
   int vmcs02_num;
 + u64 vmcs01_tsc_offset;
 + /*
 +  * Guest pages referred to in vmcs02 with host-physical pointers, so
 +  * we must keep them pinned while L2 runs.
 +  */
 + struct page *apic_access_page;
  };

  struct vcpu_vmx {
 @@ -849,6 +855,18 @@ static inline bool report_flexpriority(v
   return flexpriority_enabled;
  }

 +static inline bool nested_cpu_has(struct vmcs12 *vmcs12, u32 bit)
 +{
 + return vmcs12-cpu_based_vm_exec_control  bit;
 +}
 +
 +static inline bool nested_cpu_has2(struct vmcs12 *vmcs12, u32 bit)
 +{
 + return (vmcs12-cpu_based_vm_exec_control 
 + CPU_BASED_ACTIVATE_SECONDARY_CONTROLS) 
 + (vmcs12-secondary_vm_exec_control  bit);
 +}
 +
  static int __find_msr_index(struct vcpu_vmx *vmx, u32 msr)
  {
   int i;
 @@ -1435,6 +1453,22 @@ static void vmx_fpu_activate(struct kvm_

  static void vmx_decache_cr0_guest_bits(struct kvm_vcpu *vcpu);

 +/*
 + * Return the cr0 value that a nested guest would read. This is a combination
 + * of the real cr0 used to run the guest (guest_cr0), and the bits shadowed 
 by
 + * its hypervisor (cr0_read_shadow).
 + */
 +static inline unsigned long guest_readable_cr0(struct vmcs12 *fields)
 +{
 + return (fields-guest_cr0  ~fields-cr0_guest_host_mask) |
 + (fields-cr0_read_shadow  fields-cr0_guest_host_mask);
 +}
 +static inline unsigned long guest_readable_cr4(struct vmcs12 *fields)
 +{
 + return (fields-guest_cr4  ~fields-cr4_guest_host_mask) |
 + (fields-cr4_read_shadow  fields-cr4_guest_host_mask);
 +}
 +

will guest_ prefix look confusing here? The 'guest' has a broad range which 
makes
above two functions look like they can be used in non-nested case. Should we 
stick
to nested_prefix for nested specific facilities?

  static void vmx_fpu_deactivate(struct kvm_vcpu *vcpu)
  {
   vmx_decache_cr0_guest_bits(vcpu);
 @@ -3423,6 +3457,9 @@ static void set_cr4_guest_host_mask(stru
   vmx-vcpu.arch.cr4_guest_owned_bits =
 KVM_CR4_GUEST_OWNED_BITS;
   if (enable_ept)
   vmx-vcpu.arch.cr4_guest_owned_bits |= X86_CR4_PGE;
 + if (is_guest_mode(vmx-vcpu))
 + vmx-vcpu.arch.cr4_guest_owned_bits =
 + ~get_vmcs12(vmx-vcpu)-cr4_guest_host_mask;

why not is_nested_mode()? :-P

   vmcs_writel(CR4_GUEST_HOST_MASK,
 ~vmx-vcpu.arch.cr4_guest_owned_bits);
  }

 @@ -4760,6 +4797,11 @@ static void free_nested(struct vcpu_vmx
   vmx-nested.current_vmptr = -1ull;
   vmx-nested.current_vmcs12 = NULL;
   }
 + /* Unpin physical memory we referred to in current vmcs02 */
 + if (vmx-nested.apic_access_page) {
 + nested_release_page(vmx-nested.apic_access_page);
 + vmx-nested.apic_access_page = 0;
 + }

   nested_free_all_saved_vmcss(vmx);
  }
 @@ -5829,6 +5871,233 @@ static void vmx_set_supported_cpuid(u32
  }

  /*
 + * prepare_vmcs02 is called when the L1 guest hypervisor runs its nested
 + * L2 guest. L1 has a vmcs for L2 (vmcs12), and this function merges it
 + * with L0's requirements for its guest (a.k.a. vmsc01), so we can run the L2
 + * guest in a way that will both be appropriate to L1's requests, and our
 + * needs. In addition to modifying the active vmcs (which is vmcs02), this
 + * function also has additional necessary side-effects, like setting various
 + * vcpu-arch fields.
 + */
 +static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
 +{
 + struct vcpu_vmx *vmx = to_vmx(vcpu);
 + u32 exec_control;
 +
 + vmcs_write16(GUEST_ES_SELECTOR, vmcs12-guest_es_selector);
 + vmcs_write16(GUEST_CS_SELECTOR, vmcs12-guest_cs_selector);
 + vmcs_write16(GUEST_SS_SELECTOR, vmcs12-guest_ss_selector);
 + vmcs_write16(GUEST_DS_SELECTOR, vmcs12-guest_ds_selector);
 + vmcs_write16(GUEST_FS_SELECTOR, vmcs12-guest_fs_selector);
 + vmcs_write16(GUEST_GS_SELECTOR, vmcs12-guest_gs_selector);
 + vmcs_write16(GUEST_LDTR_SELECTOR, vmcs12-guest_ldtr_selector);
 + vmcs_write16(GUEST_TR_SELECTOR, vmcs12-guest_tr_selector);
 + vmcs_write32(GUEST_ES_LIMIT, vmcs12-guest_es_limit);
 + vmcs_write32(GUEST_CS_LIMIT, vmcs12-guest_cs_limit);
 + 

Re: memory zones and the KVM guest kernel

2011-05-24 Thread Avi Kivity

On 05/23/2011 11:19 PM, David Evensky wrote:

Hi,

When I boot my guest kernel with KVM, the dmesg output says that:

...
[0.00] Zone PFN ranges:
[0.00]   DMA  0x0010 -  0x1000
[0.00]   DMA320x1000 -  0x0010
[0.00]   Normal   empty
[0.00] Movable zone start PFN for each node
[0.00] early_node_map[2] active PFN ranges
[0.00] 0: 0x0010 -  0x009f
[0.00] 0: 0x0100 -  0x0007fffd
...

Why is the Normal Zone empty?


The DMA32 zone covers the first 4GB of memory.  If you have less than 
that (actually less than somewhat less than 4GB), nothing spills over to 
the normal zone.



  Is it possible to have some of the
guest's memory mapped in the Normal zone?


-m 8G


Is there a good reference that talks about the normal, movable,
etc. memory zones?


I expect Mel Gorman's book covers it.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 08/31] nVMX: Fix local_vcpus_link handling

2011-05-24 Thread Tian, Kevin
 From: Nadav Har'El [mailto:n...@math.technion.ac.il]
 Sent: Tuesday, May 24, 2011 3:57 PM
 
 On Tue, May 24, 2011, Tian, Kevin wrote about RE: [PATCH 08/31] nVMX: Fix
 local_vcpus_link handling:
   +static inline void loaded_vmcs_init(struct loaded_vmcs *loaded_vmcs)
   +{
   + vmcs_clear(loaded_vmcs-vmcs);
   + loaded_vmcs-cpu = -1;
   + loaded_vmcs-launched = 0;
   +}
   +
 
  call it vmcs_init instead since you now remove original vmcs_init 
  invocation,
  which more reflect the necessity of adding VMCLEAR for a new vmcs?
 
 The best name for this function, I think, would have been loaded_vmcs_clear,
 because this function isn't necessarily used to init - it's also called to
 VMCLEAR an old vmcs (and flush its content back to memory) - in that sense
 it is definitely not a vmcs_init.
 
 Unfortunately, I already have a whole chain of functions with this name :(
 the existing loaded_vmcs_clear() does an IPI to the CPU which has this VMCS
 loaded, and causes it to run __loaded_vmcs_clear(), which in turn calls
 the above loaded_vmcs_init(). I wish I could call all three functions
 loaded_vmcs_clear(), but of course I can't. If anyone reading this has a good
 suggestion on how to name these three functions, please let me know.

how about loaded_vmcs_reset?

 
   +static void __loaded_vmcs_clear(void *arg)
{
   - struct vcpu_vmx *vmx = arg;
   + struct loaded_vmcs *loaded_vmcs = arg;
 int cpu = raw_smp_processor_id();
  
   - if (vmx-vcpu.cpu == cpu)
   - vmcs_clear(vmx-vmcs);
   - if (per_cpu(current_vmcs, cpu) == vmx-vmcs)
   + if (loaded_vmcs-cpu != cpu)
   + return; /* cpu migration can race with cpu offline */
 
  what do you mean by cpu migration here? why does 'cpu offline'
  matter here regarding to the cpu change?
 
 __loaded_vmcs_clear() is typically called in one of two cases: cpu migration
 means that a guest that used to run on one CPU, and had its VMCS loaded
 there,
 suddenly needs to run on a different CPU, so we need to clear the VMCS on
 the old CPU. cpu offline means that we want to take a certain CPU offline,
 and before we do that we should VMCLEAR all VMCSs which were loaded on it.

So here you need explicitly differentiate a vcpu and a real cpu. For the 1st 
case
it's just 'vcpu migration, and the latter it's the real 'cpu offline'. 'cpu 
migration' 
is generally a RAS feature in mission critical world. :-) 

 
 The (vmx-cpu.cpu != cpu) case in __loaded_vmcs_clear should ideally never
 happen: In the cpu offline path, we only call it for the loaded_vmcss which
 we know for sure are loaded on the current cpu. In the cpu migration path,
 loaded_vmcs_clear runs __loaded_vmcs_clear on the right CPU, which ensures
 that
 equality.
 
 But, there can be a race condition (this was actually explained to me a while
 back by Avi - I never seen this happening in practice): Imagine that cpu
 migration calls loaded_vmcs_clear, which tells the old cpu (via IPI) to
 VMCLEAR this vmcs. But before that old CPU gets a chance to act on that IPI,
 a decision is made to take it offline, and all loaded_vmcs loaded on it
 (including the one in question) are cleared. When that CPU acts on this IPI,
 it notices that vmx-cpu.cpu==-1, i.e., != cpu, so it doesn't need to do
 anything (in the new version of the code, I made this more explicit, by
 returning immediately in this case).

the reverse also holds true. Right between the point where cpu_offline hits
a loaded_vmcs and the point where it calls __loaded_vmcs_clear, it's possible
that the vcpu is migrated to another cpu, and it's likely that migration path
(vmx_vcpu_load) has invoked loaded_vmcs_clear but hasn't delete this vmcs
from old cpu's linked list. This way later when __loaded_vmcs_clear is
invoked on the offlined cpu, there's still chance to observe cpu as -1.

 
 At least this is the theory. As I said, I didn't see this problem in practice
 (unsurprising, since I never offlined any CPU). Maybe Avi or someone else can
 comment more about this (vmx-cpu.cpu == cpu) check, which existed before
 my patch - in __vcpu_clear().

I agree this check is necessary, but just want you to make the comment clear
with right term.

 
   +static void free_loaded_vmcs(struct loaded_vmcs *loaded_vmcs)
   +{
   + if (!loaded_vmcs-vmcs)
   + return;
   + loaded_vmcs_clear(loaded_vmcs);
   + free_vmcs(loaded_vmcs-vmcs);
   + loaded_vmcs-vmcs = NULL;
   +}
 
  prefer to not do cleanup work through loaded_vmcs since it's just a pointer
  to a loaded vmcs structure. Though you can carefully arrange the nested
  vmcs cleanup happening before it, it's not very clean and a bit error prone
  simply from this function itself. It's clearer to directly cleanup vmcs01, 
  and
  if you want an assertion could be added to make sure loaded_vmcs doesn't
  point to any stale vmcs02 structure after nested cleanup step.
 
 I'm afraid I didn't understand what you meant here... Basically, this
 free_loaded_vmcs() is just a shortcut for loaded_vmcs_clear() 

Re: [PATCH 5/5 V2] kvm tools: Initialize and use VESA and VNC

2011-05-24 Thread Paolo Bonzini

On 05/23/2011 01:38 PM, Ingo Molnar wrote:

Later on even this could be removed: using section
tricks we can put init functions into a section


This is not kernel space, the C library provides a way to do that with 
__attribute__((constructor))...


Paolo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 18/31] nVMX: Implement VMLAUNCH and VMRESUME

2011-05-24 Thread Tian, Kevin
 From: Nadav Har'El
 Sent: Tuesday, May 17, 2011 3:53 AM
 
 Implement the VMLAUNCH and VMRESUME instructions, allowing a guest
 hypervisor to run its own guests.
 
 This patch does not include some of the necessary validity checks on
 vmcs12 fields before the entry. These will appear in a separate patch
 below.
 
 Signed-off-by: Nadav Har'El n...@il.ibm.com
 ---
  arch/x86/kvm/vmx.c |   84
 +--
  1 file changed, 82 insertions(+), 2 deletions(-)
 
 --- .before/arch/x86/kvm/vmx.c2011-05-16 22:36:49.0 +0300
 +++ .after/arch/x86/kvm/vmx.c 2011-05-16 22:36:49.0 +0300
 @@ -347,6 +347,9 @@ struct nested_vmx {
   /* vmcs02_list cache of VMCSs recently used to run L2 guests */
   struct list_head vmcs02_pool;
   int vmcs02_num;
 +
 + /* Saving the VMCS that we used for running L1 */
 + struct saved_vmcs saved_vmcs01;
   u64 vmcs01_tsc_offset;
   /*
* Guest pages referred to in vmcs02 with host-physical pointers, so
 @@ -4668,6 +4671,8 @@ static void nested_free_all_saved_vmcss(
   kfree(item);
   }
   vmx-nested.vmcs02_num = 0;
 + if (is_guest_mode(vmx-vcpu))
 + nested_free_saved_vmcs(vmx, vmx-nested.saved_vmcs01);
  }
 
  /* Get a vmcs02 for the current vmcs12. */
 @@ -4959,6 +4964,21 @@ static int handle_vmclear(struct kvm_vcp
   return 1;
  }
 
 +static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch);
 +
 +/* Emulate the VMLAUNCH instruction */
 +static int handle_vmlaunch(struct kvm_vcpu *vcpu)
 +{
 + return nested_vmx_run(vcpu, true);
 +}
 +
 +/* Emulate the VMRESUME instruction */
 +static int handle_vmresume(struct kvm_vcpu *vcpu)
 +{
 +
 + return nested_vmx_run(vcpu, false);
 +}
 +
  enum vmcs_field_type {
   VMCS_FIELD_TYPE_U16 = 0,
   VMCS_FIELD_TYPE_U64 = 1,
 @@ -5239,11 +5259,11 @@ static int (*kvm_vmx_exit_handlers[])(st
   [EXIT_REASON_INVLPG]  = handle_invlpg,
   [EXIT_REASON_VMCALL]  = handle_vmcall,
   [EXIT_REASON_VMCLEAR] = handle_vmclear,
 - [EXIT_REASON_VMLAUNCH]= handle_vmx_insn,
 + [EXIT_REASON_VMLAUNCH]= handle_vmlaunch,
   [EXIT_REASON_VMPTRLD] = handle_vmptrld,
   [EXIT_REASON_VMPTRST] = handle_vmptrst,
   [EXIT_REASON_VMREAD]  = handle_vmread,
 - [EXIT_REASON_VMRESUME]= handle_vmx_insn,
 + [EXIT_REASON_VMRESUME]= handle_vmresume,
   [EXIT_REASON_VMWRITE] = handle_vmwrite,
   [EXIT_REASON_VMOFF]   = handle_vmoff,
   [EXIT_REASON_VMON]= handle_vmon,
 @@ -6129,6 +6149,66 @@ static void nested_maintain_per_cpu_list
   }
  }
 
 +/*
 + * nested_vmx_run() handles a nested entry, i.e., a VMLAUNCH or
 VMRESUME on L1
 + * for running an L2 nested guest.
 + */
 +static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
 +{
 + struct vmcs12 *vmcs12;
 + struct vcpu_vmx *vmx = to_vmx(vcpu);
 + int cpu;
 + struct saved_vmcs *saved_vmcs02;
 +
 + if (!nested_vmx_check_permission(vcpu))
 + return 1;
 + skip_emulated_instruction(vcpu);
 +
 + vmcs12 = get_vmcs12(vcpu);
 +
 + enter_guest_mode(vcpu);
 +
 + vmx-nested.vmcs01_tsc_offset = vmcs_read64(TSC_OFFSET);
 +
 + /*
 +  * Switch from L1's VMCS (vmcs01), to L2's VMCS (vmcs02). Remember
 +  * vmcs01, on which CPU it was last loaded, and whether it was launched
 +  * (we need all these values next time we will use L1). Then recall
 +  * these values from the last time vmcs02 was used.
 +  */
 + saved_vmcs02 = nested_get_current_vmcs02(vmx);
 + if (!saved_vmcs02)
 + return -ENOMEM;
 +
 + cpu = get_cpu();
 + vmx-nested.saved_vmcs01.vmcs = vmx-vmcs;
 + vmx-nested.saved_vmcs01.cpu = vcpu-cpu;
 + vmx-nested.saved_vmcs01.launched = vmx-launched;
 + vmx-vmcs = saved_vmcs02-vmcs;
 + vcpu-cpu = saved_vmcs02-cpu;

this may be another valid reason for your check on cpu_online in your
latest [08/31] local_vcpus_link fix, since cpu may be offlined after
this assignment. :-)

 + vmx-launched = saved_vmcs02-launched;
 +
 + nested_maintain_per_cpu_lists(vmx,
 + saved_vmcs02, vmx-nested.saved_vmcs01);
 +
 + vmx_vcpu_put(vcpu);
 + vmx_vcpu_load(vcpu, cpu);
 + vcpu-cpu = cpu;
 + put_cpu();
 +
 + vmcs12-launch_state = 1;
 +
 + prepare_vmcs02(vcpu, vmcs12);

Since prepare_vmcs may fail, add a check here and move launch_state
assignment after its success?

Thanks
Kevin

 +
 + /*
 +  * Note no nested_vmx_succeed or nested_vmx_fail here. At this point
 +  * we are no longer running L1, and VMLAUNCH/VMRESUME has not yet
 +  * returned as far as L1 is concerned. It will only return (and set
 +  * the success flag) when L2 exits (see nested_vmx_vmexit()).
 +  */
 + return 

Re: [PATCH 5/5 V2] kvm tools: Initialize and use VESA and VNC

2011-05-24 Thread Ingo Molnar

* Paolo Bonzini pbonz...@redhat.com wrote:

 On 05/23/2011 01:38 PM, Ingo Molnar wrote:

  Later on even this could be removed: using section tricks we can put init 
  functions into a section
 
 This is not kernel space, the C library provides a way to do that
 with __attribute__((constructor))...

Yeah, that would certainly work for simple things but there's several reasons 
why explicit control over initcalls is preferred in a tool like tools/kvm/ over 
__attribute__((constructor)):

 - C constructors run before main() so any generic environment that tools/kvm/
   might want to set up is not available yet - including but not limited to the
   command line arguments.

 - C constructors have random limitations like apparently not being executed if
   the constructor is linked into a .a static library.

 - It has advantages to have explicit control over initcalls - that way
   debugging and other instrumentation can be added freely. For example the
   kernel has a debug_initcall boot parameter to print the initcalls as they
   are executed. They can also be traced.

 - The kernel has several classes of initcalls with different call priority,
   i'm not aware of an equivalent __attribute__((constructor)) capability.
   We could also do more sophisticated layers of initcalls or an explicit
   initcall dependency tree, should the need arise.

Using .section tricks directly isnt particularly complex and the result is 
rather flexible, well-defined and visible.

Thanks,

Ingo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/5 V2] kvm tools: Initialize and use VESA and VNC

2011-05-24 Thread Cyrill Gorcunov
On 05/24/2011 12:37 PM, Paolo Bonzini wrote:
 On 05/23/2011 01:38 PM, Ingo Molnar wrote:
 Later on even this could be removed: using section
 tricks we can put init functions into a section
 
 This is not kernel space, the C library provides a way to do that with 
 __attribute__((constructor))...
 
 Paolo

Ingo meant to use the same trick as we do in kernel space for late initcalls.

-- 
Cyrill
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/5 V2] kvm tools: Initialize and use VESA and VNC

2011-05-24 Thread Paolo Bonzini

On 05/24/2011 10:50 AM, Ingo Molnar wrote:

Yeah, that would certainly work for simple things but there's several reasons
why explicit control over initcalls is preferred in a tool like tools/kvm/ over
__attribute__((constructor)):


Some advantages you mention are real indeed, but they are general; 
there's no reason why they apply only to tools/kvm.  You can achieve the 
same by doing only minimal work (registering your 
subsystems/devices/whatever in a linked list) in the constructors.  Then 
you iterate on the list and call function pointers.


I know portability is not relevant to tools/kvm/, but using unportable 
tricks for the sake of using them is a direct way to NIH.  But oh well 
all of tools/kvm/ is NIH after all. :)



  - The kernel has several classes of initcalls with different call priority,
i'm not aware of an equivalent __attribute__((constructor)) capability.
We could also do more sophisticated layers of initcalls or an explicit
initcall dependency tree, should the need arise.


Constructors and destructors can have a priority (low runs first for 
constructors, low runs last for destructors).


Paolo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv2 10/14] virtio_net: limit xmit polling

2011-05-24 Thread Michael S. Tsirkin
On Tue, May 24, 2011 at 01:24:15PM +0530, Krishna Kumar2 wrote:
 Michael S. Tsirkin m...@redhat.com wrote on 05/23/2011 04:49:00 PM:
 
   To do this properly, we should really be using the actual number of sg
   elements needed, but we'd have to do most of xmit_skb beforehand so we
   know how many.
  
   Cheers,
   Rusty.
 
  Maybe I'm confused here.  The problem isn't the failing
  add_buf for the given skb IIUC.  What we are trying to do here is stop
  the queue *before xmit_skb fails*. We can't look at the
  number of fragments in the current skb - the next one can be
  much larger.  That's why we check capacity after xmit_skb,
  not before it, right?
 
 Maybe Rusty means it is a simpler model to free the amount
 of space that this xmit needs. We will still fail anyway
 at some time but it is unlikely, since earlier iteration
 freed up atleast the space that it was going to use.

Not sure I nderstand.  We can't know space is freed in the previous
iteration as buffers might not have been used by then.

 The
 code could become much simpler:
 
 start_xmit()
 {
 {
 num_sgs = get num_sgs for this skb;
 
 /* Free enough pending old buffers to enable queueing this one */
 free_old_xmit_skbs(vi, num_sgs * 2); /* ?? */
 
 if (virtqueue_get_capacity()  num_sgs) {
 netif_stop_queue(dev);
 if (virtqueue_enable_cb_delayed(vi-svq) ||
 free_old_xmit_skbs(vi, num_sgs)) {
 /* Nothing freed up, or not enough freed up */
 kfree_skb(skb);
 return NETDEV_TX_OK;

This packet drop is what we wanted to avoid.


 }
 netif_start_queue(dev);
 virtqueue_disable_cb(vi-svq);
 }
 
 /* xmit_skb cannot fail now, also pass 'num_sgs' */
 xmit_skb(vi, skb, num_sgs);
 virtqueue_kick(vi-svq);
 
 skb_orphan(skb);
 nf_reset(skb);
 
 return NETDEV_TX_OK;
 }
 
 We could even return TX_BUSY since that makes the dequeue
 code more efficient. See dev_dequeue_skb() - you can skip a
 lot of code (and avoid taking locks) to check if the queue
 is already stopped but that code runs only if you return
 TX_BUSY in the earlier iteration.
 
 BTW, shouldn't the check in start_xmit be:
   if (likely(!free_old_xmit_skbs(vi, 2+MAX_SKB_FRAGS))) {
   ...
   }
 
 Thanks,
 
 - KK

I thought we used to do basically this but other devices moved to a
model where they stop *before* queueing fails, so we did too.

-- 
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Restoring saved guest causes guest to reboot

2011-05-24 Thread Joerg Roedel
On Tue, May 24, 2011 at 09:11:44AM +0200, Markus Schade wrote:
 after upgrading the kernel from 2.6.37.6 to 2.6.38.X the guest domain
 reboots immediately after restore from a saved state. The OS of the guest is 
 not
 a factor as it is the same for Linux (Debian, Ubuntu, CentOS, etc.), FreeBSD
 and Windows.

 While save and restore seems to work on a suspended guest. But once the guest
 is resumed, the domain reboots. Last working kernel version is 2.6.37.6.

 I can confirm this behaviour with qemu-kvm 0.13.0 to latest 0.14.1
 and vanilla kernels up to 2.6.39.

 This also won't work on unmodified stock installs of Ubuntu 11.04 or
 Fedora 15.

 Please let me know, if you need further information or testing.

What hardware does your host run on?

Joerg

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/5 V2] kvm tools: Initialize and use VESA and VNC

2011-05-24 Thread Paolo Bonzini

On 05/24/2011 10:50 AM, Ingo Molnar wrote:

  - C constructors have random limitations like apparently not being executed if
the constructor is linked into a .a static library.


Ah, forgot about this.  Given _why_ this happens (for static libraries, 
the linker omits object modules that are not required to fulfill 
undefined references in previous objects), I'd be surprised if explicit 
section tricks do not have the same limitation.  After all the 
constructor attribute is only syntactic sugar for section tricks.


Paolo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 17/31] nVMX: Prepare vmcs02 from vmcs01 and vmcs12

2011-05-24 Thread Nadav Har'El
On Tue, May 24, 2011, Tian, Kevin wrote about RE: [PATCH 17/31] nVMX: Prepare 
vmcs02 from vmcs01 and vmcs12:
  +static inline unsigned long guest_readable_cr4(struct vmcs12 *fields)
  +{
  + return (fields-guest_cr4  ~fields-cr4_guest_host_mask) |
  + (fields-cr4_read_shadow  fields-cr4_guest_host_mask);
  +}
  +
 
 will guest_ prefix look confusing here? The 'guest' has a broad range which 
 makes
 above two functions look like they can be used in non-nested case. Should we 
 stick
 to nested_prefix for nested specific facilities?

I don't know, I thought it made calls like

vmcs_writel(CR0_READ_SHADOW, guest_readable_cr0(vmcs12));

readable, and the comments (and the parameters) make it obvious it's for
nested only.

I now renamed these functions nested_read_cr0(), nested_read_cr4() - I hope
you like these names better.

  + if (is_guest_mode(vmx-vcpu))
  + vmx-vcpu.arch.cr4_guest_owned_bits =
  + ~get_vmcs12(vmx-vcpu)-cr4_guest_host_mask;
 
 why not is_nested_mode()? :-P

I assume you're wondering why the function is called is_guest_mode(), and
not is_nested_mode()?

This name was chosen by Avi Kivity in November last year, for the function
previously introduced by Joerg Roedel. My original code (before Joerg added
this function to x86.c) indeed used the term nested_mode, not guest_mode.

In January, I pointed to the possibility of confusion between the new
is_guest_mode() and other things called guest mode, and Avi Kivity said
he will rename it to is_nested_guest() - see
http://lkml.indiana.edu/hypermail/linux/kernel/1101.1/01418.html
But as you can see, he never did this renaming.

That being said, after half a year, I got used to the name is_guest_mode(),
and am no longer convinced it should be changed. It checks whether the vcpu
(not the underlying CPU) is in Intel-SDM-terminology guest mode. Just like
is_long_mode() checks if the vcpu is in long mode. So I'm fine with leaving
its current name.

  +static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
  +{
...
  + if (!vmx-rdtscp_enabled)
  + exec_control = ~SECONDARY_EXEC_RDTSCP;
  + /* Take the following fields only from vmcs12 */
  + exec_control = ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
  + if (nested_cpu_has(vmcs12,
  + CPU_BASED_ACTIVATE_SECONDARY_CONTROLS))
  + exec_control |= vmcs12-secondary_vm_exec_control;
 
 should this 2nd exec_control be merged in clear case-by-case flavor?
 
 what about L0 sets virtualize x2APIC bit while L1 doesn't?
 
 Or what about L0 disables EPT while L1 sets it?
 
 I think it's better to scrutinize every 2nd exec_control feature with a
 clear policy:
 - whether we want to use the stricter policy which is only set when both L0 
 and
 L1 set it
 - whether we want to use L1 setting absolutely regardless of L0 setting like
 what you did for virtualize APIC access

Please note that most of the examples you give cannot happen in practice,
because we tell L1 (via MSR) which features it is allowed to use, and we
fail entry if it tries to use disallowed features (before ever reaching
the merge code you're commenting on). So we don't allow L1, for example,
to use the EPT feature (and when nested-EPT support is added, we won't
allow L1 to use EPT if L0 didn't). The general thinking was that for most
fields that we do explicitly allow, OR is the right choice.

I'll add this to my bugzilla, and think about it again later.

Thanks,
Nadav

-- 
Nadav Har'El|  Tuesday, May 24 2011, 20 Iyyar 5771
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |If you choke a Smurf, what color does it
http://nadav.harel.org.il   |turn?
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling

2011-05-24 Thread Roedel, Joerg
On Mon, May 23, 2011 at 12:51:55PM -0400, Avi Kivity wrote:
 On 05/23/2011 07:43 PM, Roedel, Joerg wrote:
  On Mon, May 23, 2011 at 11:49:17AM -0400, Avi Kivity wrote:
 
Joerg, is
  
 if (unlikely(cpu != vcpu-cpu)) {
 svm-asid_generation = 0;
 mark_all_dirty(svm-vmcb);
 }
  
susceptible to cpu offline/online?
 
  I don't think so. This should be safe for cpu offline/online as long as
  the cpu-number value is not reused for another physical cpu. But that
  should be the case afaik.
 
 
 Why not? offline/online does reuse cpu numbers AFAIK (and it must, if 
 you have a fully populated machine and offline/online just one cpu).

Yes, you are right. There is a slight possibility that the asid is not
updated when a vcpu has asid_generation == 1 and hasn't been running on
another cpu while this given cpu was offlined/onlined. Very unlikely,
but we can not rule it out.

Probably we should make the local_vcpu_list from vmx generic, use it
from svm  and fix it this way.

Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv2 10/14] virtio_net: limit xmit polling

2011-05-24 Thread Krishna Kumar2
Michael S. Tsirkin m...@redhat.com wrote on 05/24/2011 02:42:55 PM:

To do this properly, we should really be using the actual number of
sg
elements needed, but we'd have to do most of xmit_skb beforehand so
we
know how many.
   
Cheers,
Rusty.
  
   Maybe I'm confused here.  The problem isn't the failing
   add_buf for the given skb IIUC.  What we are trying to do here is
stop
   the queue *before xmit_skb fails*. We can't look at the
   number of fragments in the current skb - the next one can be
   much larger.  That's why we check capacity after xmit_skb,
   not before it, right?
 
  Maybe Rusty means it is a simpler model to free the amount
  of space that this xmit needs. We will still fail anyway
  at some time but it is unlikely, since earlier iteration
  freed up atleast the space that it was going to use.

 Not sure I nderstand.  We can't know space is freed in the previous
 iteration as buffers might not have been used by then.

Yes, the first few iterations may not have freed up space, but
later ones should. The amount of free space should increase
from then on, especially since we try to free double of what
we consume.

  The
  code could become much simpler:
 
  start_xmit()
  {
  {
  num_sgs = get num_sgs for this skb;
 
  /* Free enough pending old buffers to enable queueing this one
*/
  free_old_xmit_skbs(vi, num_sgs * 2); /* ?? */
 
  if (virtqueue_get_capacity()  num_sgs) {
  netif_stop_queue(dev);
  if (virtqueue_enable_cb_delayed(vi-svq) ||
  free_old_xmit_skbs(vi, num_sgs)) {
  /* Nothing freed up, or not enough freed up */
  kfree_skb(skb);
  return NETDEV_TX_OK;

 This packet drop is what we wanted to avoid.

Please see below on returning NETDEV_TX_BUSY.


  }
  netif_start_queue(dev);
  virtqueue_disable_cb(vi-svq);
  }
 
  /* xmit_skb cannot fail now, also pass 'num_sgs' */
  xmit_skb(vi, skb, num_sgs);
  virtqueue_kick(vi-svq);
 
  skb_orphan(skb);
  nf_reset(skb);
 
  return NETDEV_TX_OK;
  }
 
  We could even return TX_BUSY since that makes the dequeue
  code more efficient. See dev_dequeue_skb() - you can skip a
  lot of code (and avoid taking locks) to check if the queue
  is already stopped but that code runs only if you return
  TX_BUSY in the earlier iteration.
 
  BTW, shouldn't the check in start_xmit be:
 if (likely(!free_old_xmit_skbs(vi, 2+MAX_SKB_FRAGS))) {
...
 }
 
  Thanks,
 
  - KK

 I thought we used to do basically this but other devices moved to a
 model where they stop *before* queueing fails, so we did too.

I am not sure of why it was changed, since returning TX_BUSY
seems more efficient IMHO. qdisc_restart() handles requeue'd
packets much better than a stopped queue, as a significant
part of this code is skipped if gso_skb is present (qdisc
will eventually start dropping packets when tx_queue_len is
exceeded anyway).

Thanks,

- KK

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling

2011-05-24 Thread Nadav Har'El
On Tue, May 24, 2011, Roedel, Joerg wrote about Re: [PATCH 08/31] nVMX: Fix 
local_vcpus_link handling:
 Probably we should make the local_vcpu_list from vmx generic, use it
 from svm  and fix it this way.

The point is, local_vcpu_list is now gone, replaced by a loaded_vmcss_on_cpu,
and vcpu-cpu is not set to -1 for any vcpu when a CPU is offlined - also in
VMX...

-- 
Nadav Har'El|  Tuesday, May 24 2011, 20 Iyyar 5771
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |In case of emergency, this box may be
http://nadav.harel.org.il   |used as a quotation device.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Restoring saved guest causes guest to reboot

2011-05-24 Thread Markus Schade

On Tue, 24 May 2011, Joerg Roedel wrote:


On Tue, May 24, 2011 at 09:11:44AM +0200, Markus Schade wrote:

after upgrading the kernel from 2.6.37.6 to 2.6.38.X the guest domain
reboots immediately after restore from a saved state. The OS of the guest is not
a factor as it is the same for Linux (Debian, Ubuntu, CentOS, etc.), FreeBSD
and Windows.

While save and restore seems to work on a suspended guest. But once the guest
is resumed, the domain reboots. Last working kernel version is 2.6.37.6.

I can confirm this behaviour with qemu-kvm 0.13.0 to latest 0.14.1
and vanilla kernels up to 2.6.39.

This also won't work on unmodified stock installs of Ubuntu 11.04 or
Fedora 15.

Please let me know, if you need further information or testing.


What hardware does your host run on?


The main target systems are Intel Core i7 (920 to 950) on a MSI X58 Pro
mainboard with 24GB RAM and an Adaptec 5405 RAID controller with 3 disks
in a RAID-5 configuration. The controller is not a factor (happens also
with MD RAID).

To rule out an issue with this specific mainboard and CPU, i have also
check this (only with F15) on a Core i5 and and Asus P55 mainboard.

I could try to also confirm this on AMD X2, but it will take me a little
to setup a machine.

Gruß,
Markus

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Restoring saved guest causes guest to reboot

2011-05-24 Thread Joerg Roedel
On Tue, May 24, 2011 at 11:35:27AM +0200, Markus Schade wrote:
 On Tue, 24 May 2011, Joerg Roedel wrote:

 On Tue, May 24, 2011 at 09:11:44AM +0200, Markus Schade wrote:
 after upgrading the kernel from 2.6.37.6 to 2.6.38.X the guest domain
 reboots immediately after restore from a saved state. The OS of the guest 
 is not
 a factor as it is the same for Linux (Debian, Ubuntu, CentOS, etc.), FreeBSD
 and Windows.

 While save and restore seems to work on a suspended guest. But once the 
 guest
 is resumed, the domain reboots. Last working kernel version is 2.6.37.6.

 I can confirm this behaviour with qemu-kvm 0.13.0 to latest 0.14.1
 and vanilla kernels up to 2.6.39.

 This also won't work on unmodified stock installs of Ubuntu 11.04 or
 Fedora 15.

 Please let me know, if you need further information or testing.

 What hardware does your host run on?

 The main target systems are Intel Core i7 (920 to 950) on a MSI X58 Pro
 mainboard with 24GB RAM and an Adaptec 5405 RAID controller with 3 disks
 in a RAID-5 configuration. The controller is not a factor (happens also
 with MD RAID).

 To rule out an issue with this specific mainboard and CPU, i have also
 check this (only with F15) on a Core i5 and and Asus P55 mainboard.

 I could try to also confirm this on AMD X2, but it will take me a little
 to setup a machine.

Your description points to some problem in the kernel. Would be good if
you could try the same on your AMD X2 to find out if the problem is in
generic kvm code or in the vmx part.

Joerg

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 18/31] nVMX: Implement VMLAUNCH and VMRESUME

2011-05-24 Thread Nadav Har'El
On Tue, May 24, 2011, Tian, Kevin wrote about RE: [PATCH 18/31] nVMX: 
Implement VMLAUNCH and VMRESUME:
  +   /*
  +* Switch from L1's VMCS (vmcs01), to L2's VMCS (vmcs02). Remember
  +* vmcs01, on which CPU it was last loaded, and whether it was launched
  +* (we need all these values next time we will use L1). Then recall
  +* these values from the last time vmcs02 was used.
  +*/
  +   saved_vmcs02 = nested_get_current_vmcs02(vmx);
  +   if (!saved_vmcs02)
  +   return -ENOMEM;
  +
  +   cpu = get_cpu();
  +   vmx-nested.saved_vmcs01.vmcs = vmx-vmcs;
  +   vmx-nested.saved_vmcs01.cpu = vcpu-cpu;
  +   vmx-nested.saved_vmcs01.launched = vmx-launched;
  +   vmx-vmcs = saved_vmcs02-vmcs;
  +   vcpu-cpu = saved_vmcs02-cpu;
 
 this may be another valid reason for your check on cpu_online in your
 latest [08/31] local_vcpus_link fix, since cpu may be offlined after
 this assignment. :-)

I believe that wrapping this part of the code with get_cpu()/put_cpu()
protected me from these kinds of race conditions.

By the way, please note that this part of the code was changed after my
latest loaded_vmcs overhaul. It now looks like this:

vmcs02 = nested_get_current_vmcs02(vmx);
if (!vmcs02)
return -ENOMEM;

cpu = get_cpu();
vmx-loaded_vmcs = vmcs02;
vmx_vcpu_put(vcpu);
vmx_vcpu_load(vcpu, cpu);
vcpu-cpu = cpu;
put_cpu();

(if Avi gives me the green light, I'll send the entire, up-to-date, patch set
again).

  +   vmcs12-launch_state = 1;
  +
  +   prepare_vmcs02(vcpu, vmcs12);
 
 Since prepare_vmcs may fail, add a check here and move launch_state
 assignment after its success?

prepare_vmcs02() cannot fail. All the checks that need to be done on vmcs12
are done before calling it, in nested_vmx_run().

Currently, there's a single case where prepare_vmcs02 fails when it fails
to access apic_access_addr memory. This is wrong - the check should have been
done earlier. I'll fix that, and make prepare_vmcs02() void.


-- 
Nadav Har'El|  Tuesday, May 24 2011, 20 Iyyar 5771
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |It's no use crying over spilt milk -- it
http://nadav.harel.org.il   |only makes it salty for the cat.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/5 V2] kvm tools: Initialize and use VESA and VNC

2011-05-24 Thread Pekka Enberg
On Tue, May 24, 2011 at 12:10 PM, Paolo Bonzini pbonz...@redhat.com wrote:
 I know portability is not relevant to tools/kvm/, but using unportable
 tricks for the sake of using them is a direct way to NIH.  But oh well all
 of tools/kvm/ is NIH after all. :)

Hmm?

The point is to follow Linux kernel conventions and idioms (and share
code) as much as possible so it's familiar to devs who are already
working on the kernel. That's why section tricks seem more appropriate
than using constructor to me. Or is there some technical advantage to
using constructors?

   Pekka
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling

2011-05-24 Thread Roedel, Joerg
On Tue, May 24, 2011 at 05:28:38AM -0400, Nadav Har'El wrote:
 On Tue, May 24, 2011, Roedel, Joerg wrote about Re: [PATCH 08/31] nVMX: Fix 
 local_vcpus_link handling:
  Probably we should make the local_vcpu_list from vmx generic, use it
  from svm  and fix it this way.
 
 The point is, local_vcpu_list is now gone, replaced by a loaded_vmcss_on_cpu,
 and vcpu-cpu is not set to -1 for any vcpu when a CPU is offlined - also in
 VMX...

loaded_vmcss_on_cpu sound similar, probably this can be generalized. Is
this code already upstream or is this changed with your nVMX patch-set?

Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling

2011-05-24 Thread Avi Kivity

On 05/24/2011 12:57 PM, Roedel, Joerg wrote:

On Tue, May 24, 2011 at 05:28:38AM -0400, Nadav Har'El wrote:
  On Tue, May 24, 2011, Roedel, Joerg wrote about Re: [PATCH 08/31] nVMX: Fix 
local_vcpus_link handling:
Probably we should make the local_vcpu_list from vmx generic, use it
from svm  and fix it this way.

  The point is, local_vcpu_list is now gone, replaced by a loaded_vmcss_on_cpu,
  and vcpu-cpu is not set to -1 for any vcpu when a CPU is offlined - also in
  VMX...

loaded_vmcss_on_cpu sound similar, probably this can be generalized.


It's not the same: there is a main:1 relationship between vmcss and 
vcpus (like vmcbs and vcpus).


However, it may be that the general case for svm also needs to treat 
individual vmcbs differently.




Is
this code already upstream or is this changed with your nVMX patch-set?



Not upstream yet (however generalization, if needed, will be done after 
it's upstream).


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling

2011-05-24 Thread Nadav Har'El
On Tue, May 24, 2011, Roedel, Joerg wrote about Re: [PATCH 08/31] nVMX: Fix 
local_vcpus_link handling:
 loaded_vmcss_on_cpu sound similar, probably this can be generalized.

I don't think so - now that a VCPU may have several VMCSs (L1, L2), each
of those may be loaded on a different cpu so we have a list of VMCSs
(the new loaded_vmcs structure), not vcpus. When we offline a CPU, we recall
all VMCSs loaded on it from this list, and clear them; We mark cpu=-1 for
each of those vmcs, but vcpu-cpu remains untouched (and not set to -1)
for all the vcpus.

 Is this code already upstream or is this changed with your nVMX patch-set?

Avi asked me to send the patch that does this *before* nvmx. But he did not
yet merge it.


-- 
Nadav Har'El|  Tuesday, May 24 2011, 20 Iyyar 5771
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |If you notice this notice, you'll notice
http://nadav.harel.org.il   |it's not worth noticing but is noticable.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 17/31] nVMX: Prepare vmcs02 from vmcs01 and vmcs12

2011-05-24 Thread Tian, Kevin
 From: Nadav Har'El
 Sent: Tuesday, May 24, 2011 5:19 PM
 
 On Tue, May 24, 2011, Tian, Kevin wrote about RE: [PATCH 17/31] nVMX:
 Prepare vmcs02 from vmcs01 and vmcs12:
   +static inline unsigned long guest_readable_cr4(struct vmcs12 *fields)
   +{
   + return (fields-guest_cr4  ~fields-cr4_guest_host_mask) |
   + (fields-cr4_read_shadow 
 fields-cr4_guest_host_mask);
   +}
   +
 
  will guest_ prefix look confusing here? The 'guest' has a broad range which
 makes
  above two functions look like they can be used in non-nested case. Should we
 stick
  to nested_prefix for nested specific facilities?
 
 I don't know, I thought it made calls like
 
   vmcs_writel(CR0_READ_SHADOW, guest_readable_cr0(vmcs12));
 
 readable, and the comments (and the parameters) make it obvious it's for
 nested only.
 
 I now renamed these functions nested_read_cr0(), nested_read_cr4() - I hope
 you like these names better.

yes.

 
   + if (is_guest_mode(vmx-vcpu))
   + vmx-vcpu.arch.cr4_guest_owned_bits =
   +
 ~get_vmcs12(vmx-vcpu)-cr4_guest_host_mask;
 
  why not is_nested_mode()? :-P
 
 I assume you're wondering why the function is called is_guest_mode(), and
 not is_nested_mode()?

yes

 
 This name was chosen by Avi Kivity in November last year, for the function
 previously introduced by Joerg Roedel. My original code (before Joerg added
 this function to x86.c) indeed used the term nested_mode, not
 guest_mode.
 
 In January, I pointed to the possibility of confusion between the new
 is_guest_mode() and other things called guest mode, and Avi Kivity said
 he will rename it to is_nested_guest() - see
 http://lkml.indiana.edu/hypermail/linux/kernel/1101.1/01418.html
 But as you can see, he never did this renaming.
 
 That being said, after half a year, I got used to the name is_guest_mode(),
 and am no longer convinced it should be changed. It checks whether the vcpu
 (not the underlying CPU) is in Intel-SDM-terminology guest mode. Just like
 is_long_mode() checks if the vcpu is in long mode. So I'm fine with leaving
 its current name.

well, it's a small issue, and I'm fine with leaving it though I don't like 
'guest' here. :-)

 
   +static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12
 *vmcs12)
   +{
 ...
   + if (!vmx-rdtscp_enabled)
   + exec_control = ~SECONDARY_EXEC_RDTSCP;
   + /* Take the following fields only from vmcs12 */
   + exec_control =
 ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
   + if (nested_cpu_has(vmcs12,
   +
 CPU_BASED_ACTIVATE_SECONDARY_CONTROLS))
   + exec_control |=
 vmcs12-secondary_vm_exec_control;
 
  should this 2nd exec_control be merged in clear case-by-case flavor?
 
  what about L0 sets virtualize x2APIC bit while L1 doesn't?
 
  Or what about L0 disables EPT while L1 sets it?
 
  I think it's better to scrutinize every 2nd exec_control feature with a
  clear policy:
  - whether we want to use the stricter policy which is only set when both L0
 and
  L1 set it
  - whether we want to use L1 setting absolutely regardless of L0 setting like
  what you did for virtualize APIC access
 
 Please note that most of the examples you give cannot happen in practice,
 because we tell L1 (via MSR) which features it is allowed to use, and we
 fail entry if it tries to use disallowed features (before ever reaching
 the merge code you're commenting on). So we don't allow L1, for example,
 to use the EPT feature (and when nested-EPT support is added, we won't
 allow L1 to use EPT if L0 didn't). The general thinking was that for most
 fields that we do explicitly allow, OR is the right choice.

This really bases on the value of the control bit. To achieve the strictest
setting between L0/L1, sometimes you want to use AND and sometimes you
want to use OR.

From a design p.o.v, it's better not to have such implicit assumption on other
places. Just make it clean and correct. Also in your example it doesn't cover
the case where L0 sets some bits which are not exposed to L1 via MSR. For
example as I said earlier, what about L0 sets virtualize X2APIC mode while
it's not enabled by or not exposed to L1. With OR, you then also enable this 
mode for L2 absolutely, while L1 has no logic to handle it.

I'd like to see a clean policy for the known control bits here, even with a 
strict policy to incur most VM-exits which can be optimized in the future.

 
 I'll add this to my bugzilla, and think about it again later.


Thanks
Kevin
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 18/31] nVMX: Implement VMLAUNCH and VMRESUME

2011-05-24 Thread Tian, Kevin
 From: Nadav Har'El [mailto:n...@math.technion.ac.il]
 Sent: Tuesday, May 24, 2011 5:45 PM
 
 On Tue, May 24, 2011, Tian, Kevin wrote about RE: [PATCH 18/31] nVMX:
 Implement VMLAUNCH and VMRESUME:
   + /*
   +  * Switch from L1's VMCS (vmcs01), to L2's VMCS (vmcs02).
 Remember
   +  * vmcs01, on which CPU it was last loaded, and whether it was
 launched
   +  * (we need all these values next time we will use L1). Then recall
   +  * these values from the last time vmcs02 was used.
   +  */
   + saved_vmcs02 = nested_get_current_vmcs02(vmx);
   + if (!saved_vmcs02)
   + return -ENOMEM;
   +
   + cpu = get_cpu();
   + vmx-nested.saved_vmcs01.vmcs = vmx-vmcs;
   + vmx-nested.saved_vmcs01.cpu = vcpu-cpu;
   + vmx-nested.saved_vmcs01.launched = vmx-launched;
   + vmx-vmcs = saved_vmcs02-vmcs;
   + vcpu-cpu = saved_vmcs02-cpu;
 
  this may be another valid reason for your check on cpu_online in your
  latest [08/31] local_vcpus_link fix, since cpu may be offlined after
  this assignment. :-)
 
 I believe that wrapping this part of the code with get_cpu()/put_cpu()
 protected me from these kinds of race conditions.

you're right.

 
 By the way, please note that this part of the code was changed after my
 latest loaded_vmcs overhaul. It now looks like this:
 
   vmcs02 = nested_get_current_vmcs02(vmx);
   if (!vmcs02)
   return -ENOMEM;
 
   cpu = get_cpu();
   vmx-loaded_vmcs = vmcs02;
   vmx_vcpu_put(vcpu);
   vmx_vcpu_load(vcpu, cpu);
   vcpu-cpu = cpu;
   put_cpu();
 
 (if Avi gives me the green light, I'll send the entire, up-to-date, patch set
 again).

Generally your new patch looks good.

 
   + vmcs12-launch_state = 1;
   +
   + prepare_vmcs02(vcpu, vmcs12);
 
  Since prepare_vmcs may fail, add a check here and move launch_state
  assignment after its success?
 
 prepare_vmcs02() cannot fail. All the checks that need to be done on vmcs12
 are done before calling it, in nested_vmx_run().
 
 Currently, there's a single case where prepare_vmcs02 fails when it fails
 to access apic_access_addr memory. This is wrong - the check should have
 been
 done earlier. I'll fix that, and make prepare_vmcs02() void.
 

then no problem, as long as you keep this choice clear.

Thanks
Kevin
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling

2011-05-24 Thread Avi Kivity

On 05/24/2011 11:20 AM, Tian, Kevin wrote:


  The (vmx-cpu.cpu != cpu) case in __loaded_vmcs_clear should ideally never
  happen: In the cpu offline path, we only call it for the loaded_vmcss which
  we know for sure are loaded on the current cpu. In the cpu migration path,
  loaded_vmcs_clear runs __loaded_vmcs_clear on the right CPU, which ensures
  that
  equality.

  But, there can be a race condition (this was actually explained to me a while
  back by Avi - I never seen this happening in practice): Imagine that cpu
  migration calls loaded_vmcs_clear, which tells the old cpu (via IPI) to
  VMCLEAR this vmcs. But before that old CPU gets a chance to act on that IPI,
  a decision is made to take it offline, and all loaded_vmcs loaded on it
  (including the one in question) are cleared. When that CPU acts on this IPI,
  it notices that vmx-cpu.cpu==-1, i.e., != cpu, so it doesn't need to do
  anything (in the new version of the code, I made this more explicit, by
  returning immediately in this case).

the reverse also holds true. Right between the point where cpu_offline hits
a loaded_vmcs and the point where it calls __loaded_vmcs_clear, it's possible
that the vcpu is migrated to another cpu, and it's likely that migration path
(vmx_vcpu_load) has invoked loaded_vmcs_clear but hasn't delete this vmcs
from old cpu's linked list. This way later when __loaded_vmcs_clear is
invoked on the offlined cpu, there's still chance to observe cpu as -1.


I don't think it's possible.  Both calls are done with interrupts disabled.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/30] nVMX: Nested VMX, v9

2011-05-24 Thread Avi Kivity

On 05/23/2011 09:06 PM, Alexander Graf wrote:

On 23.05.2011, at 17:23, Avi Kivity wrote:

  On 05/23/2011 05:44 PM, Nadav Har'El wrote:
  On Mon, May 23, 2011, Avi Kivity wrote about Re: [PATCH 0/30] nVMX: Nested VMX, 
v9:
 vmcs01 and vmcs02 will both be generated from vmcs12.

  If you don't do a clean nested exit (from L2 to L1), vmcs02 can't be 
generated
  from vmcs12... while L2 runs, it is possible that it modifies vmcs02 (e.g.,
  non-trapped bits of guest_cr0), and these modifications are not copied back
  to vmcs12 until the nested exit (when prepare_vmcs12() is called to perform
  this task).

  If you do a nested exit (a fake one), vmcs12 is made up to date, and then
  indeed vmcs02 can be thrown away and regenerated.

  You would flush this state back to the vmcs.  But that just confirms Joerg's 
statement that a fake vmexit/vmrun is more or less equivalent.

  The question is whether %rip points to the VMRUN/VMLAUNCH instruction, 
HOST_RIP (or the next instruction for svm), or to guest code.  But the actual 
things we need to do are all very similar subsets of a vmexit.

%rip should certainly point to VMRUN. That way there is no need to save any 
information whatsoever, as the VMCB is already in sane state and nothing needs 
to be special cased, as the next VCPU_RUN would simply go back into guest mode 
- which is exactly what we want.

The only tricky part is how we distinguish between I need to live migrate and 
info registers. In the former case, %rip should be on VMRUN. In the latter, on the 
guest rip.


We can split vmrun emulation into save host state, load guest state 
and prepare nested vmcb.  Then, when we load registers, if we see that 
we're in guest mode, we do just the prepare nested vmcb bit.


This way register state is always nested guest state.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 08/31] nVMX: Fix local_vcpus_link handling

2011-05-24 Thread Tian, Kevin
 From: Avi Kivity [mailto:a...@redhat.com]
 Sent: Tuesday, May 24, 2011 7:06 PM
 
 On 05/24/2011 11:20 AM, Tian, Kevin wrote:
  
The (vmx-cpu.cpu != cpu) case in __loaded_vmcs_clear should ideally
 never
happen: In the cpu offline path, we only call it for the loaded_vmcss 
   which
we know for sure are loaded on the current cpu. In the cpu migration
 path,
loaded_vmcs_clear runs __loaded_vmcs_clear on the right CPU, which
 ensures
that
equality.
  
But, there can be a race condition (this was actually explained to me a
 while
back by Avi - I never seen this happening in practice): Imagine that cpu
migration calls loaded_vmcs_clear, which tells the old cpu (via IPI) to
VMCLEAR this vmcs. But before that old CPU gets a chance to act on that
 IPI,
a decision is made to take it offline, and all loaded_vmcs loaded on it
(including the one in question) are cleared. When that CPU acts on this
 IPI,
it notices that vmx-cpu.cpu==-1, i.e., != cpu, so it doesn't need to do
anything (in the new version of the code, I made this more explicit, by
returning immediately in this case).
 
  the reverse also holds true. Right between the point where cpu_offline hits
  a loaded_vmcs and the point where it calls __loaded_vmcs_clear, it's 
  possible
  that the vcpu is migrated to another cpu, and it's likely that migration 
  path
  (vmx_vcpu_load) has invoked loaded_vmcs_clear but hasn't delete this vmcs
  from old cpu's linked list. This way later when __loaded_vmcs_clear is
  invoked on the offlined cpu, there's still chance to observe cpu as -1.
 
 I don't think it's possible.  Both calls are done with interrupts disabled.

If that's the case then there's another potential issue. Deadlock may happen
when calling smp_call_function_single with interrupt disabled. 

Thanks
Kevin
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/5 V2] kvm tools: Initialize and use VESA and VNC

2011-05-24 Thread Avi Kivity

On 05/24/2011 12:55 PM, Pekka Enberg wrote:

On Tue, May 24, 2011 at 12:10 PM, Paolo Bonzinipbonz...@redhat.com  wrote:
  I know portability is not relevant to tools/kvm/, but using unportable
  tricks for the sake of using them is a direct way to NIH.  But oh well all
  of tools/kvm/ is NIH after all. :)

Hmm?

The point is to follow Linux kernel conventions and idioms (and share
code) as much as possible so it's familiar to devs who are already
working on the kernel. That's why section tricks seem more appropriate
than using constructor to me. Or is there some technical advantage to
using constructors?


You get to reuse infrastructure that's already there.

Things like using sections and s/uint64_t/u64/ look anti-reuse to me.  
Userspace isn't the kernel, for better or for worse.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/5 V2] kvm tools: Initialize and use VESA and VNC

2011-05-24 Thread Pekka Enberg
Hi Avi,

On Tue, May 24, 2011 at 2:22 PM, Avi Kivity a...@redhat.com wrote:
 The point is to follow Linux kernel conventions and idioms (and share
 code) as much as possible so it's familiar to devs who are already
 working on the kernel. That's why section tricks seem more appropriate
 than using constructor to me. Or is there some technical advantage to
 using constructors?

 You get to reuse infrastructure that's already there.

 Things like using sections and s/uint64_t/u64/ look anti-reuse to me.
  Userspace isn't the kernel, for better or for worse.

Not really. The type thing is pretty much required once you start
using kernel code (as we learned the hard way).

Btw, constructor attribute doesn't really seem like a good fit for
late_initcall type of thing:

  The constructor attribute causes the function to be called
automatically before execution enters main ()

What am I missing here?

Pekka
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling

2011-05-24 Thread Avi Kivity

On 05/24/2011 02:20 PM, Tian, Kevin wrote:

  I don't think it's possible.  Both calls are done with interrupts disabled.

If that's the case then there's another potential issue. Deadlock may happen
when calling smp_call_function_single with interrupt disabled.


We don't do that.  vcpu migration calls vcpu_clear() with interrupts 
enabled, which then calls smp_call_function_single(), which calls 
__vcpu_clear() with interrupts disabled.  vmclear_local_vcpus() is 
called from interrupts disabled (and calls __vcpu_clear() directly).


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv2 10/14] virtio_net: limit xmit polling

2011-05-24 Thread Michael S. Tsirkin
On Tue, May 24, 2011 at 02:57:43PM +0530, Krishna Kumar2 wrote:
 Michael S. Tsirkin m...@redhat.com wrote on 05/24/2011 02:42:55 PM:
 
 To do this properly, we should really be using the actual number of
 sg
 elements needed, but we'd have to do most of xmit_skb beforehand so
 we
 know how many.

 Cheers,
 Rusty.
   
Maybe I'm confused here.  The problem isn't the failing
add_buf for the given skb IIUC.  What we are trying to do here is
 stop
the queue *before xmit_skb fails*. We can't look at the
number of fragments in the current skb - the next one can be
much larger.  That's why we check capacity after xmit_skb,
not before it, right?
  
   Maybe Rusty means it is a simpler model to free the amount
   of space that this xmit needs. We will still fail anyway
   at some time but it is unlikely, since earlier iteration
   freed up atleast the space that it was going to use.
 
  Not sure I nderstand.  We can't know space is freed in the previous
  iteration as buffers might not have been used by then.
 
 Yes, the first few iterations may not have freed up space, but
 later ones should. The amount of free space should increase
 from then on, especially since we try to free double of what
 we consume.

Hmm. This is only an upper limit on the # of entries in the queue.
Assume that vq size is 4 and we transmit 4 enties without
getting anything in the used ring. The next transmit will fail.

So I don't really see why it's unlikely that we reach the packet
drop code with your patch.

   The
   code could become much simpler:
  
   start_xmit()
   {
   {
   num_sgs = get num_sgs for this skb;
  
   /* Free enough pending old buffers to enable queueing this one
 */
   free_old_xmit_skbs(vi, num_sgs * 2); /* ?? */
  
   if (virtqueue_get_capacity()  num_sgs) {
   netif_stop_queue(dev);
   if (virtqueue_enable_cb_delayed(vi-svq) ||
   free_old_xmit_skbs(vi, num_sgs)) {
   /* Nothing freed up, or not enough freed up */
   kfree_skb(skb);
   return NETDEV_TX_OK;
 
  This packet drop is what we wanted to avoid.
 
 Please see below on returning NETDEV_TX_BUSY.
 
 
   }
   netif_start_queue(dev);
   virtqueue_disable_cb(vi-svq);
   }
  
   /* xmit_skb cannot fail now, also pass 'num_sgs' */
   xmit_skb(vi, skb, num_sgs);
   virtqueue_kick(vi-svq);
  
   skb_orphan(skb);
   nf_reset(skb);
  
   return NETDEV_TX_OK;
   }
  
   We could even return TX_BUSY since that makes the dequeue
   code more efficient. See dev_dequeue_skb() - you can skip a
   lot of code (and avoid taking locks) to check if the queue
   is already stopped but that code runs only if you return
   TX_BUSY in the earlier iteration.
  
   BTW, shouldn't the check in start_xmit be:
  if (likely(!free_old_xmit_skbs(vi, 2+MAX_SKB_FRAGS))) {
 ...
  }
  
   Thanks,
  
   - KK
 
  I thought we used to do basically this but other devices moved to a
  model where they stop *before* queueing fails, so we did too.
 
 I am not sure of why it was changed, since returning TX_BUSY
 seems more efficient IMHO.
 qdisc_restart() handles requeue'd
 packets much better than a stopped queue, as a significant
 part of this code is skipped if gso_skb is present

I think this is the argument:
http://www.mail-archive.com/virtualization@lists.linux-foundation.org/msg06364.html


 (qdisc
 will eventually start dropping packets when tx_queue_len is
 exceeded anyway).
 
 Thanks,
 
 - KK

tx_queue_len is a pretty large buffer so maybe no.
I think the packet drops from the scheduler queue can also be
done intelligently (e.g. with CHOKe) which should
work better than dropping a random packet?

-- 
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/5 V2] kvm tools: Initialize and use VESA and VNC

2011-05-24 Thread Sasha Levin
On Tue, 2011-05-24 at 14:26 +0300, Pekka Enberg wrote:
 Hi Avi,
 
 On Tue, May 24, 2011 at 2:22 PM, Avi Kivity a...@redhat.com wrote:
  The point is to follow Linux kernel conventions and idioms (and share
  code) as much as possible so it's familiar to devs who are already
  working on the kernel. That's why section tricks seem more appropriate
  than using constructor to me. Or is there some technical advantage to
  using constructors?
 
  You get to reuse infrastructure that's already there.
 
  Things like using sections and s/uint64_t/u64/ look anti-reuse to me.
   Userspace isn't the kernel, for better or for worse.
 
 Not really. The type thing is pretty much required once you start
 using kernel code (as we learned the hard way).
 
 Btw, constructor attribute doesn't really seem like a good fit for
 late_initcall type of thing:
 
   The constructor attribute causes the function to be called
 automatically before execution enters main ()

You could add a small constructor function that'll add a pointer to the
real initialization function to a list which will get called after we
get everything initialized and ready.

-- 

Sasha.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/5 V2] kvm tools: Initialize and use VESA and VNC

2011-05-24 Thread Avi Kivity

On 05/24/2011 02:26 PM, Pekka Enberg wrote:

Hi Avi,

On Tue, May 24, 2011 at 2:22 PM, Avi Kivitya...@redhat.com  wrote:
  The point is to follow Linux kernel conventions and idioms (and share
  code) as much as possible so it's familiar to devs who are already
  working on the kernel. That's why section tricks seem more appropriate
  than using constructor to me. Or is there some technical advantage to
  using constructors?

  You get to reuse infrastructure that's already there.

  Things like using sections and s/uint64_t/u64/ look anti-reuse to me.
Userspace isn't the kernel, for better or for worse.

Not really. The type thing is pretty much required once you start
using kernel code (as we learned the hard way).


What happens when you start using userspace libraries?

Eventually you'll have a lot more of that than kernel code.


Btw, constructor attribute doesn't really seem like a good fit for
late_initcall type of thing:

   The constructor attribute causes the function to be called
automatically before execution enters main ()

What am I missing here?


Like Paolo said, you can have the constructor register a function or 
structure to be called any time you like.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 08/31] nVMX: Fix local_vcpus_link handling

2011-05-24 Thread Tian, Kevin
 From: Avi Kivity [mailto:a...@redhat.com]
 Sent: Tuesday, May 24, 2011 7:27 PM
 
 On 05/24/2011 02:20 PM, Tian, Kevin wrote:
I don't think it's possible.  Both calls are done with interrupts 
   disabled.
 
  If that's the case then there's another potential issue. Deadlock may happen
  when calling smp_call_function_single with interrupt disabled.
 
 We don't do that.  vcpu migration calls vcpu_clear() with interrupts
 enabled, which then calls smp_call_function_single(), which calls
 __vcpu_clear() with interrupts disabled.  vmclear_local_vcpus() is
 called from interrupts disabled (and calls __vcpu_clear() directly).
 

OK, that's clear to me now. 

Thanks
Kevin
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling

2011-05-24 Thread Avi Kivity

On 05/24/2011 02:30 PM, Tian, Kevin wrote:


  We don't do that.  vcpu migration calls vcpu_clear() with interrupts
  enabled, which then calls smp_call_function_single(), which calls
  __vcpu_clear() with interrupts disabled.  vmclear_local_vcpus() is
  called from interrupts disabled (and calls __vcpu_clear() directly).


OK, that's clear to me now.


Are there still open issues about the patch?

(Nadav, please post patches in the future in new threads so they're 
easier to find)


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/5 V2] kvm tools: Initialize and use VESA and VNC

2011-05-24 Thread Pekka Enberg
Hi Avi,

On Tue, May 24, 2011 at 2:30 PM, Avi Kivity a...@redhat.com wrote:
 What happens when you start using userspace libraries?

 Eventually you'll have a lot more of that than kernel code.

I don't quite understand what problems you think we might have. We're
already using userspace libraries (libvnc) and most of us code is
non-kernel code.

We switched to u64 and friends for two reasons: (1) using uint*_t
turned out to be painful when using kernel headers (e.g., mptables,
e820, etc.) and (2) we want to be as close as possible to the coding
style of tools/perf to be able to reuse their code in the future.

On Tue, May 24, 2011 at 2:30 PM, Avi Kivity a...@redhat.com wrote:
 Like Paolo said, you can have the constructor register a function or
 structure to be called any time you like.

Oh, okay. We should probably start out with that and switch to section
tricks if we have to.

 Pekka
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 08/31] nVMX: Fix local_vcpus_link handling

2011-05-24 Thread Tian, Kevin
 From: Avi Kivity [mailto:a...@redhat.com]
 Sent: Tuesday, May 24, 2011 7:37 PM
 
 On 05/24/2011 02:30 PM, Tian, Kevin wrote:
  
We don't do that.  vcpu migration calls vcpu_clear() with interrupts
enabled, which then calls smp_call_function_single(), which calls
__vcpu_clear() with interrupts disabled.  vmclear_local_vcpus() is
called from interrupts disabled (and calls __vcpu_clear() directly).
  
 
  OK, that's clear to me now.
 
 Are there still open issues about the patch?
 
 (Nadav, please post patches in the future in new threads so they're
 easier to find)
 

I'm fine with this patch except that Nadav needs to clarify the comment
in __loaded_vmcs_clear (regarding to 'cpu migration' and 'cpu offline' part
which I replied in another mail)

Thanks
Kevin
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/5 V2] kvm tools: Initialize and use VESA and VNC

2011-05-24 Thread Avi Kivity

On 05/24/2011 02:38 PM, Pekka Enberg wrote:

Hi Avi,

On Tue, May 24, 2011 at 2:30 PM, Avi Kivitya...@redhat.com  wrote:
  What happens when you start using userspace libraries?

  Eventually you'll have a lot more of that than kernel code.

I don't quite understand what problems you think we might have. We're
already using userspace libraries (libvnc) and most of us code is
non-kernel code.


If uint64_t is defined differently than u64, you won't be able to pass 
it by reference if an external library expects it.



We switched to u64 and friends for two reasons: (1) using uint*_t
turned out to be painful when using kernel headers (e.g., mptables,
e820, etc.) and (2) we want to be as close as possible to the coding
style of tools/perf to be able to reuse their code in the future.



Regarding this reuse, I see it's done by copy/paste.  Won't it be better 
to have tools/lib and have tools/perf and tools/kvm use that?


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Restoring saved guest causes guest to reboot

2011-05-24 Thread Markus Schade

Hi,

On Tue, 24 May 2011, Joerg Roedel wrote:


I could try to also confirm this on AMD X2, but it will take me a little
to setup a machine.


Your description points to some problem in the kernel. Would be good if
you could try the same on your AMD X2 to find out if the problem is in
generic kvm code or in the vmx part.


Okay, it seems to be an Intel-specific bug. I have tested multiple qemu-kvm
(0.13, 0.14 and 0.14.1) and kernel versions (2.6.37.6, 2.6.38.6 and
2.6.39) on an AMD X2 6000+. Neither of them causes the guest to reboot.
Everything works as expected.

Best regards,
Markus

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/5 V2] kvm tools: Initialize and use VESA and VNC

2011-05-24 Thread Pekka Enberg
Hi Avi,

On Tue, May 24, 2011 at 2:41 PM, Avi Kivity a...@redhat.com wrote:
 I don't quite understand what problems you think we might have. We're
 already using userspace libraries (libvnc) and most of us code is
 non-kernel code.

 If uint64_t is defined differently than u64, you won't be able to pass it by
 reference if an external library expects it.

Oh, the sizes must be the same for obvious reasons so that shouldn't
be a problem and it's better to do the casting in the external API
calls than in core code.

On Tue, May 24, 2011 at 2:41 PM, Avi Kivity a...@redhat.com wrote:
 Regarding this reuse, I see it's done by copy/paste.  Won't it be better to
 have tools/lib and have tools/perf and tools/kvm use that?

Yes, absolutely. We're keeping things well-contained under tools/kvm
with copy-paste until we hit Linus' tree in the future, though, for
merge reasons.

   Pekka
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 08/31] nVMX: Fix local_vcpus_link handling

2011-05-24 Thread Nadav Har'El
On Tue, May 24, 2011, Tian, Kevin wrote about RE: [PATCH 08/31] nVMX: Fix 
local_vcpus_link handling:
 I'm fine with this patch except that Nadav needs to clarify the comment
 in __loaded_vmcs_clear (regarding to 'cpu migration' and 'cpu offline' part
 which I replied in another mail)

I added a single letter, v, to my comment ;-)

Please note that the same code existed previously, and didn't have any comment.
If you find this short comment more confusing (or wrong) than helpful, then I
can just remove it.

Avi, I'll send a new version of patch 1 in a few minutes, in a new thread
this time ;-) Please let me know when (or if) you are prepared to apply the
rest of the patches, so I can send a new version, rebased to the current
trunk and with all the fixes people asked for in the last few days.


-- 
Nadav Har'El|  Tuesday, May 24 2011, 20 Iyyar 5771
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |Ms Piggy's last words: I'm pink,
http://nadav.harel.org.il   |therefore I'm ham.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] new version of loaded_vmcs patch

2011-05-24 Thread Nadav Har'El
Hi Avi, here is a updated version of the loaded_vmcs patch which you asked me
to send before the rest of the nvmx patches.

Please let me know when you'd like me to send you an updated version of
the rest of the patches.


Subject: [PATCH 01/31] nVMX: Keep list of loaded VMCSs, instead of vcpus.

In VMX, before we bring down a CPU we must VMCLEAR all VMCSs loaded on it
because (at least in theory) the processor might not have written all of its
content back to memory. Since a patch from June 26, 2008, this is done using
a per-cpu vcpus_on_cpu linked list of vcpus loaded on each CPU.

The problem is that with nested VMX, we no longer have the concept of a
vcpu being loaded on a cpu: A vcpu has multiple VMCSs (one for L1, a pool for
L2s), and each of those may be have been last loaded on a different cpu.

So instead of linking the vcpus, we link the VMCSs, using a new structure
loaded_vmcs. This structure contains the VMCS, and the information pertaining
to its loading on a specific cpu (namely, the cpu number, and whether it
was already launched on this cpu once). In nested we will also use the same
structure to hold L2 VMCSs, and vmx-loaded_vmcs is a pointer to the
currently active VMCS.

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
 arch/x86/kvm/vmx.c |  150 ---
 1 file changed, 86 insertions(+), 64 deletions(-)

--- .before/arch/x86/kvm/vmx.c  2011-05-24 15:12:22.0 +0300
+++ .after/arch/x86/kvm/vmx.c   2011-05-24 15:12:22.0 +0300
@@ -116,6 +116,18 @@ struct vmcs {
char data[0];
 };
 
+/*
+ * Track a VMCS that may be loaded on a certain CPU. If it is (cpu!=-1), also
+ * remember whether it was VMLAUNCHed, and maintain a linked list of all VMCSs
+ * loaded on this CPU (so we can clear them if the CPU goes down).
+ */
+struct loaded_vmcs {
+   struct vmcs *vmcs;
+   int cpu;
+   int launched;
+   struct list_head loaded_vmcss_on_cpu_link;
+};
+
 struct shared_msr_entry {
unsigned index;
u64 data;
@@ -124,9 +136,7 @@ struct shared_msr_entry {
 
 struct vcpu_vmx {
struct kvm_vcpu   vcpu;
-   struct list_head  local_vcpus_link;
unsigned long host_rsp;
-   int   launched;
u8fail;
u8cpl;
bool  nmi_known_unmasked;
@@ -140,7 +150,14 @@ struct vcpu_vmx {
u64   msr_host_kernel_gs_base;
u64   msr_guest_kernel_gs_base;
 #endif
-   struct vmcs  *vmcs;
+   /*
+* loaded_vmcs points to the VMCS currently used in this vcpu. For a
+* non-nested (L1) guest, it always points to vmcs01. For a nested
+* guest (L2), it points to a different VMCS.
+*/
+   struct loaded_vmcsvmcs01;
+   struct loaded_vmcs   *loaded_vmcs;
+   bool  __launched; /* temporary, used in vmx_vcpu_run */
struct msr_autoload {
unsigned nr;
struct vmx_msr_entry guest[NR_AUTOLOAD_MSRS];
@@ -200,7 +217,11 @@ static int vmx_set_tss_addr(struct kvm *
 
 static DEFINE_PER_CPU(struct vmcs *, vmxarea);
 static DEFINE_PER_CPU(struct vmcs *, current_vmcs);
-static DEFINE_PER_CPU(struct list_head, vcpus_on_cpu);
+/*
+ * We maintain a per-CPU linked-list of VMCS loaded on that CPU. This is needed
+ * when a CPU is brought down, and we need to VMCLEAR all VMCSs loaded on it.
+ */
+static DEFINE_PER_CPU(struct list_head, loaded_vmcss_on_cpu);
 static DEFINE_PER_CPU(struct desc_ptr, host_gdt);
 
 static unsigned long *vmx_io_bitmap_a;
@@ -501,6 +522,13 @@ static void vmcs_clear(struct vmcs *vmcs
   vmcs, phys_addr);
 }
 
+static inline void loaded_vmcs_init(struct loaded_vmcs *loaded_vmcs)
+{
+   vmcs_clear(loaded_vmcs-vmcs);
+   loaded_vmcs-cpu = -1;
+   loaded_vmcs-launched = 0;
+}
+
 static void vmcs_load(struct vmcs *vmcs)
 {
u64 phys_addr = __pa(vmcs);
@@ -514,25 +542,24 @@ static void vmcs_load(struct vmcs *vmcs)
   vmcs, phys_addr);
 }
 
-static void __vcpu_clear(void *arg)
+static void __loaded_vmcs_clear(void *arg)
 {
-   struct vcpu_vmx *vmx = arg;
+   struct loaded_vmcs *loaded_vmcs = arg;
int cpu = raw_smp_processor_id();
 
-   if (vmx-vcpu.cpu == cpu)
-   vmcs_clear(vmx-vmcs);
-   if (per_cpu(current_vmcs, cpu) == vmx-vmcs)
+   if (loaded_vmcs-cpu != cpu)
+   return; /* vcpu migration can race with cpu offline */
+   if (per_cpu(current_vmcs, cpu) == loaded_vmcs-vmcs)
per_cpu(current_vmcs, cpu) = NULL;
-   list_del(vmx-local_vcpus_link);
-   vmx-vcpu.cpu = -1;
-   vmx-launched = 0;
+   list_del(loaded_vmcs-loaded_vmcss_on_cpu_link);
+   loaded_vmcs_init(loaded_vmcs);
 }
 
-static void vcpu_clear(struct vcpu_vmx *vmx)
+static void 

Re: [PATCH 5/5 V2] kvm tools: Initialize and use VESA and VNC

2011-05-24 Thread Paolo Bonzini

On 05/24/2011 01:56 PM, Pekka Enberg wrote:

  If uint64_t is defined differently than u64, you won't be able to pass it by
  reference if an external library expects it.


Oh, the sizes must be the same for obvious reasons so that shouldn't
be a problem and it's better to do the casting in the external API
calls than in core code.


Be careful about unsigned long vs. unsigned long long.  If you want to 
use the u64 spelling that's fine, but I would use a custom header that 
defines it from uint64_t.


Paolo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


FreeBSD startup timer hangs on boot

2011-05-24 Thread Nikita A Menkovich
Hello,

I receive strange problem with running FreeBSD guest (7 and 8, x86 and
amd64) on KVM.
When begins boot menu, there is time countdown 10..9..8 etc, and time
between 10 and 9 can be different than 1 sec 0.9, 1.1 etc.
And sometimes countdown hangs on some number.

QEMU emulator version 0.14.0 (qemu-kvm-0.14.0 Debian
0.14.0+dfsg-1~tls), Copyright (c) 2003-2008 Fabrice Bellard
libvirt 0.9.1
Linux  2.6.38  x86_64

libvirt.xml http://pastebin.com/03vJKhVc

strace -c -p VMPID


% time seconds  usecs/call callserrors syscall
-- --- --- - - 
 99.812.432429 334  7277   select
  0.110.002591   0 18310   timer_gettime
  0.060.001355   0 14404   ioctl
  0.010.000218   0 10542  3514 read
  0.010.000217   0  3514   timer_settime
  0.000.70   0  3514   rt_sigaction
  0.000.53   0  3869   write
  0.000.33   0   303   recvmsg
-- --- --- - - 
100.002.436966 61733  3514 total


strace -p VMPID


write(7, \1\0\0\0\0\0\0\0, 8) = 8
read(13, 0x7fff119dae90, 128)   = -1 EAGAIN (Resource
temporarily unavailable)
timer_gettime(0x2, {it_interval={0, 0}, it_value={0, 0}}) = 0
timer_settime(0x2, 0, {it_interval={0, 0}, it_value={0, 25}}, NULL) = 0
timer_gettime(0x2, {it_interval={0, 0}, it_value={0, 138172}}) = 0
ioctl(5, KVM_IRQ_LINE_STATUS, 0x7fff119daeb0) = 0
timer_gettime(0x2, {it_interval={0, 0}, it_value={0, 1}}) = 0
ioctl(5, KVM_IRQ_LINE_STATUS, 0x7fff119daeb0) = 0
timer_gettime(0x2, {it_interval={0, 0}, it_value={0, 1}}) = 0
ioctl(5, KVM_IRQ_LINE_STATUS, 0x7fff119daeb0) = 0
select(16, [6 9 12 13 14 15], [], [], {1, 0}) = 1 (in [6], left {0, 98})
read(6, \1\0\0\0\0\0\0\0, 512)= 8
timer_gettime(0x2, {it_interval={0, 0}, it_value={0, 1}}) = 0
ioctl(5, KVM_IRQ_LINE_STATUS, 0x7fff119daeb0) = 0
select(16, [6 9 12 13 14 15], [], [], {1, 0}) = 1 (in [13], left {0, 998108})
read(13, 
\16\0\0\0\0\0\0\0\376\377\377\377\0\0\0\0\0\0\0\0\0\0\0\0\2\0\0\0\0\0\0\0...,
128) = 128
rt_sigaction(SIGALRM, NULL, {0x4d0490, ~[KILL STOP RTMIN RT_1],
SA_RESTORER, 0x7fb09b6c9f60}, 8) = 0
write(7, \1\0\0\0\0\0\0\0, 8) = 8
read(13, 0x7fff119dae90, 128)   = -1 EAGAIN (Resource
temporarily unavailable)
timer_gettime(0x2, {it_interval={0, 0}, it_value={0, 0}}) = 0
timer_settime(0x2, 0, {it_interval={0, 0}, it_value={0, 25}}, NULL) = 0
timer_gettime(0x2, {it_interval={0, 0}, it_value={0, 144317}}) = 0
ioctl(5, KVM_IRQ_LINE_STATUS, 0x7fff119daeb0) = 0
timer_gettime(0x2, {it_interval={0, 0}, it_value={0, 1}}) = 0
ioctl(5, KVM_IRQ_LINE_STATUS, 0x7fff119daeb0) = 0
timer_gettime(0x2, {it_interval={0, 0}, it_value={0, 1}}) = 0
ioctl(5, KVM_IRQ_LINE_STATUS, 0x7fff119daeb0) = 0
select(16, [6 9 12 13 14 15], [], [], {1, 0}) = 1 (in [6], left {0, 98})
read(6, \1\0\0\0\0\0\0\0, 512)= 8
timer_gettime(0x2, {it_interval={0, 0}, it_value={0, 1}}) = 0
ioctl(5, KVM_IRQ_LINE_STATUS, 0x7fff119daeb0) = 0
select(16, [6 9 12 13 14 15], [], [], {1, 0}) = 1 (in [13], left {0, 997811})
read(13, 
\16\0\0\0\0\0\0\0\376\377\377\377\0\0\0\0\0\0\0\0\0\0\0\0\2\0\0\0\0\0\0\0...,
128) = 128
rt_sigaction(SIGALRM, NULL, {0x4d0490, ~[KILL STOP RTMIN RT_1],
SA_RESTORER, 0x7fb09b6c9f60}, 8) = 0
write(7, \1\0\0\0\0\0\0\0, 8) = 8
read(13, 0x7fff119dae90, 128)   = -1 EAGAIN (Resource
temporarily unavailable)
timer_gettime(0x2, {it_interval={0, 0}, it_value={0, 0}}) = 0
timer_settime(0x2, 0, {it_interval={0, 0}, it_value={0, 25}}, NULL) = 0
timer_gettime(0x2, {it_interval={0, 0}, it_value={0, 133651}}) = 0
ioctl(5, KVM_IRQ_LINE_STATUS, 0x7fff119daeb0) = 0
timer_gettime(0x2, {it_interval={0, 0}, it_value={0, 1}}) = 0
ioctl(5, KVM_IRQ_LINE_STATUS, 0x7fff119daeb0) = 0
timer_gettime(0x2, {it_interval={0, 0}, it_value={0, 1}}) = 0
ioctl(5, KVM_IRQ_LINE_STATUS, 0x7fff119daeb0) = 0
select(16, [6 9 12 13 14 15], [], [], {1, 0}) = 1 (in [6], left {0, 98})
read(6, \1\0\0\0\0\0\0\0, 512)= 8
timer_gettime(0x2, {it_interval={0, 0}, it_value={0, 1}}) = 0
ioctl(5, KVM_IRQ_LINE_STATUS, 0x7fff119daeb0) = 0
select(16, [6 9 12 13 14 15], [], [], {1, 0}) = 1 (in [13], left {0, 998027})
read(13, 
\16\0\0\0\0\0\0\0\376\377\377\377\0\0\0\0\0\0\0\0\0\0\0\0\2\0\0\0\0\0\0\0...,
128) = 128
rt_sigaction(SIGALRM, NULL, {0x4d0490, ~[KILL STOP RTMIN RT_1],
SA_RESTORER, 0x7fb09b6c9f60}, 8) = 0
write(7, \1\0\0\0\0\0\0\0, 8) = 8
read(13, 0x7fff119dae90, 128)   = -1 EAGAIN (Resource
temporarily unavailable)
timer_gettime(0x2, {it_interval={0, 0}, it_value={0, 0}}) = 0
timer_settime(0x2, 0, {it_interval={0, 0}, it_value={0, 25}}, NULL) = 0
timer_gettime(0x2, {it_interval={0, 0}, it_value={0, 136813}}) = 0
ioctl(5, KVM_IRQ_LINE_STATUS, 0x7fff119daeb0) = 0
timer_gettime(0x2, 

Re: [PATCHv2 10/14] virtio_net: limit xmit polling

2011-05-24 Thread Krishna Kumar2
Michael S. Tsirkin m...@redhat.com wrote on 05/24/2011 04:59:39 PM:

Maybe Rusty means it is a simpler model to free the amount
of space that this xmit needs. We will still fail anyway
at some time but it is unlikely, since earlier iteration
freed up atleast the space that it was going to use.
  
   Not sure I nderstand.  We can't know space is freed in the previous
   iteration as buffers might not have been used by then.
 
  Yes, the first few iterations may not have freed up space, but
  later ones should. The amount of free space should increase
  from then on, especially since we try to free double of what
  we consume.

 Hmm. This is only an upper limit on the # of entries in the queue.
 Assume that vq size is 4 and we transmit 4 enties without
 getting anything in the used ring. The next transmit will fail.

 So I don't really see why it's unlikely that we reach the packet
 drop code with your patch.

I was assuming 256 entries :) I will try to get some
numbers to see how often it is true tomorrow.

  I am not sure of why it was changed, since returning TX_BUSY
  seems more efficient IMHO.
  qdisc_restart() handles requeue'd
  packets much better than a stopped queue, as a significant
  part of this code is skipped if gso_skb is present

 I think this is the argument:
 http://www.mail-archive.com/virtualization@lists.linux-
 foundation.org/msg06364.html

Thanks for digging up that thread! Yes, that one skb would get
sent first ahead of possibly higher priority skbs. However,
from a performance point, TX_BUSY code skips a lot of checks
and code for all subsequent packets till the device is
restarted. I can test performance with both cases and report
what I find (the requeue code has become very simple and clean
from horribly complex, thanks to Herbert and Dave).

  (qdisc
  will eventually start dropping packets when tx_queue_len is

 tx_queue_len is a pretty large buffer so maybe no.

I remember seeing tons of drops (pfifo_fast_enqueue) when
xmit returns TX_BUSY.

 I think the packet drops from the scheduler queue can also be
 done intelligently (e.g. with CHOKe) which should
 work better than dropping a random packet?

I am not sure of that - choke_enqueue checks against a random
skb to drop current skb, and also during congestion. But for
my sample driver xmit, returning TX_BUSY could still allow
to be used with CHOKe.

thanks,

- KK

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 20/31] nVMX: Exiting from L2 to L1

2011-05-24 Thread Tian, Kevin
 From: Nadav Har'El
 Sent: Tuesday, May 17, 2011 3:54 AM
 
 This patch implements nested_vmx_vmexit(), called when the nested L2 guest
 exits and we want to run its L1 parent and let it handle this exit.
 
 Note that this will not necessarily be called on every L2 exit. L0 may decide
 to handle a particular exit on its own, without L1's involvement; In that
 case, L0 will handle the exit, and resume running L2, without running L1 and
 without calling nested_vmx_vmexit(). The logic for deciding whether to handle
 a particular exit in L1 or in L0, i.e., whether to call nested_vmx_vmexit(),
 will appear in a separate patch below.
 
 Signed-off-by: Nadav Har'El n...@il.ibm.com
 ---
  arch/x86/kvm/vmx.c |  257
 +++
  1 file changed, 257 insertions(+)
 
 --- .before/arch/x86/kvm/vmx.c2011-05-16 22:36:49.0 +0300
 +++ .after/arch/x86/kvm/vmx.c 2011-05-16 22:36:49.0 +0300
 @@ -6203,6 +6203,263 @@ static int nested_vmx_run(struct kvm_vcp
   return 1;
  }
 
 +/*
 + * On a nested exit from L2 to L1, vmcs12.guest_cr0 might not be up-to-date
 + * because L2 may have changed some cr0 bits directly (see
 CRO_GUEST_HOST_MASK)
 + * without L0 trapping the change and updating vmcs12.
 + * This function returns the value we should put in vmcs12.guest_cr0. It's 
 not
 + * enough to just return the current (vmcs02) GUEST_CR0 - that may not be
 the
 + * guest cr0 that L1 thought it was giving its L2 guest; It is possible that
 + * L1 wished to allow its guest to set some cr0 bit directly, but we (L0) 
 asked
 + * to trap this change and instead set just the read shadow bit. If this is 
 the
 + * case, we need to copy these read-shadow bits back to vmcs12.guest_cr0,
 where
 + * L1 believes they already are.
 + */
 +static inline unsigned long
 +vmcs12_guest_cr0(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
 +{
 + /*
 +  * As explained above, we take a bit from GUEST_CR0 if we allowed the
 +  * guest to modify it untrapped (vcpu-arch.cr0_guest_owned_bits), or
 +  * if we did trap it - if we did so because L1 asked to trap this bit
 +  * (vmcs12-cr0_guest_host_mask). Otherwise (bits we trapped but L1
 +  * didn't expect us to trap) we read from CR0_READ_SHADOW.
 +  */
 + unsigned long guest_cr0_bits =
 + vcpu-arch.cr0_guest_owned_bits | vmcs12-cr0_guest_host_mask;
 + return (vmcs_readl(GUEST_CR0)  guest_cr0_bits) |
 +(vmcs_readl(CR0_READ_SHADOW)  ~guest_cr0_bits);
 +}

Hi, Nadav,

Not sure whether I get above operation wrong. But it looks not exactly correct 
to me
in a glimpse. Say a bit set both in L0/L1's cr0_guest_host_mask. In such case 
that
bit from vmcs12_GUEST_CR0 resides in vmcs02_CR0_READ_SHADOW, however above
operation will make vmcs02_GUEST_CR0 bit returned instead.

Instead of constructing vmcs12_GUEST_CR0 completely from vmcs02_GUEST_CR0,
why not just updating bits which can be altered while keeping the rest bits from
vmcs12_GUEST_CR0? Say something like:

vmcs12-guest_cr0 = vmcs12-cr0_guest_host_mask; /* keep unchanged bits */
vmcs12-guest_cr0 |= (vmcs_readl(GUEST_CR0)  vcpu-arch.cr0_guest_owned_bits) |
(vmcs_readl(CR0_READ_SHADOW)  ~( vcpu-arch.cr0_guest_owned_bits | 
vmcs12-cr0_guest_host_mask))

Thanks
Kevin
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Restoring saved guest causes guest to reboot

2011-05-24 Thread Joerg Roedel
On Tue, May 24, 2011 at 01:42:23PM +0200, Markus Schade wrote:
 Hi,

 On Tue, 24 May 2011, Joerg Roedel wrote:

 I could try to also confirm this on AMD X2, but it will take me a little
 to setup a machine.

 Your description points to some problem in the kernel. Would be good if
 you could try the same on your AMD X2 to find out if the problem is in
 generic kvm code or in the vmx part.

 Okay, it seems to be an Intel-specific bug. I have tested multiple qemu-kvm
 (0.13, 0.14 and 0.14.1) and kernel versions (2.6.37.6, 2.6.38.6 and
 2.6.39) on an AMD X2 6000+. Neither of them causes the guest to reboot.
 Everything works as expected.

Hmm, only 19 patches between .37 and .38 touch vmx.c. Avi, any idea which
of those patches could cause this?

Regards,

Joerg
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH] new version of loaded_vmcs patch

2011-05-24 Thread Tian, Kevin
 From: Nadav Har'El
 Sent: Tuesday, May 24, 2011 8:26 PM
 
 Hi Avi, here is a updated version of the loaded_vmcs patch which you asked me
 to send before the rest of the nvmx patches.
 
 Please let me know when you'd like me to send you an updated version of
 the rest of the patches.
 
 
 Subject: [PATCH 01/31] nVMX: Keep list of loaded VMCSs, instead of vcpus.
 
 In VMX, before we bring down a CPU we must VMCLEAR all VMCSs loaded on it
 because (at least in theory) the processor might not have written all of its
 content back to memory. Since a patch from June 26, 2008, this is done using
 a per-cpu vcpus_on_cpu linked list of vcpus loaded on each CPU.
 
 The problem is that with nested VMX, we no longer have the concept of a
 vcpu being loaded on a cpu: A vcpu has multiple VMCSs (one for L1, a pool for
 L2s), and each of those may be have been last loaded on a different cpu.
 
 So instead of linking the vcpus, we link the VMCSs, using a new structure
 loaded_vmcs. This structure contains the VMCS, and the information
 pertaining
 to its loading on a specific cpu (namely, the cpu number, and whether it
 was already launched on this cpu once). In nested we will also use the same
 structure to hold L2 VMCSs, and vmx-loaded_vmcs is a pointer to the
 currently active VMCS.
 
 Signed-off-by: Nadav Har'El n...@il.ibm.com

Acked-by: Kevin Tian kevin.t...@intel.com

Thanks
Kevin

 ---
  arch/x86/kvm/vmx.c |  150 ---
  1 file changed, 86 insertions(+), 64 deletions(-)
 
 --- .before/arch/x86/kvm/vmx.c2011-05-24 15:12:22.0 +0300
 +++ .after/arch/x86/kvm/vmx.c 2011-05-24 15:12:22.0 +0300
 @@ -116,6 +116,18 @@ struct vmcs {
   char data[0];
  };
 
 +/*
 + * Track a VMCS that may be loaded on a certain CPU. If it is (cpu!=-1), also
 + * remember whether it was VMLAUNCHed, and maintain a linked list of all
 VMCSs
 + * loaded on this CPU (so we can clear them if the CPU goes down).
 + */
 +struct loaded_vmcs {
 + struct vmcs *vmcs;
 + int cpu;
 + int launched;
 + struct list_head loaded_vmcss_on_cpu_link;
 +};
 +
  struct shared_msr_entry {
   unsigned index;
   u64 data;
 @@ -124,9 +136,7 @@ struct shared_msr_entry {
 
  struct vcpu_vmx {
   struct kvm_vcpu   vcpu;
 - struct list_head  local_vcpus_link;
   unsigned long host_rsp;
 - int   launched;
   u8fail;
   u8cpl;
   bool  nmi_known_unmasked;
 @@ -140,7 +150,14 @@ struct vcpu_vmx {
   u64   msr_host_kernel_gs_base;
   u64   msr_guest_kernel_gs_base;
  #endif
 - struct vmcs  *vmcs;
 + /*
 +  * loaded_vmcs points to the VMCS currently used in this vcpu. For a
 +  * non-nested (L1) guest, it always points to vmcs01. For a nested
 +  * guest (L2), it points to a different VMCS.
 +  */
 + struct loaded_vmcsvmcs01;
 + struct loaded_vmcs   *loaded_vmcs;
 + bool  __launched; /* temporary, used in
 vmx_vcpu_run */
   struct msr_autoload {
   unsigned nr;
   struct vmx_msr_entry guest[NR_AUTOLOAD_MSRS];
 @@ -200,7 +217,11 @@ static int vmx_set_tss_addr(struct kvm *
 
  static DEFINE_PER_CPU(struct vmcs *, vmxarea);
  static DEFINE_PER_CPU(struct vmcs *, current_vmcs);
 -static DEFINE_PER_CPU(struct list_head, vcpus_on_cpu);
 +/*
 + * We maintain a per-CPU linked-list of VMCS loaded on that CPU. This is
 needed
 + * when a CPU is brought down, and we need to VMCLEAR all VMCSs loaded
 on it.
 + */
 +static DEFINE_PER_CPU(struct list_head, loaded_vmcss_on_cpu);
  static DEFINE_PER_CPU(struct desc_ptr, host_gdt);
 
  static unsigned long *vmx_io_bitmap_a;
 @@ -501,6 +522,13 @@ static void vmcs_clear(struct vmcs *vmcs
  vmcs, phys_addr);
  }
 
 +static inline void loaded_vmcs_init(struct loaded_vmcs *loaded_vmcs)
 +{
 + vmcs_clear(loaded_vmcs-vmcs);
 + loaded_vmcs-cpu = -1;
 + loaded_vmcs-launched = 0;
 +}
 +
  static void vmcs_load(struct vmcs *vmcs)
  {
   u64 phys_addr = __pa(vmcs);
 @@ -514,25 +542,24 @@ static void vmcs_load(struct vmcs *vmcs)
  vmcs, phys_addr);
  }
 
 -static void __vcpu_clear(void *arg)
 +static void __loaded_vmcs_clear(void *arg)
  {
 - struct vcpu_vmx *vmx = arg;
 + struct loaded_vmcs *loaded_vmcs = arg;
   int cpu = raw_smp_processor_id();
 
 - if (vmx-vcpu.cpu == cpu)
 - vmcs_clear(vmx-vmcs);
 - if (per_cpu(current_vmcs, cpu) == vmx-vmcs)
 + if (loaded_vmcs-cpu != cpu)
 + return; /* vcpu migration can race with cpu offline */
 + if (per_cpu(current_vmcs, cpu) == loaded_vmcs-vmcs)
   per_cpu(current_vmcs, cpu) = NULL;
 - list_del(vmx-local_vcpus_link);
 - vmx-vcpu.cpu = -1;
 - vmx-launched = 0;
 + 

Re: KVM call agenda dfor May 24th

2011-05-24 Thread Juan Quintela
Juan Quintela quint...@redhat.com wrote:
 Please send in any agenda items you are interested in covering.

As there have been any items for the agenda, call gets cancelled.

Have a nice week, Juan.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/30] nVMX: Nested VMX, v9

2011-05-24 Thread Joerg Roedel
On Tue, May 24, 2011 at 02:09:00PM +0300, Avi Kivity wrote:
 On 05/23/2011 09:06 PM, Alexander Graf wrote:
 On 23.05.2011, at 17:23, Avi Kivity wrote:

   On 05/23/2011 05:44 PM, Nadav Har'El wrote:
   On Mon, May 23, 2011, Avi Kivity wrote about Re: [PATCH 0/30] nVMX: 
  Nested VMX, v9:
  vmcs01 and vmcs02 will both be generated from vmcs12.
 
   If you don't do a clean nested exit (from L2 to L1), vmcs02 can't be 
  generated
   from vmcs12... while L2 runs, it is possible that it modifies vmcs02 
  (e.g.,
   non-trapped bits of guest_cr0), and these modifications are not copied 
  back
   to vmcs12 until the nested exit (when prepare_vmcs12() is called to 
  perform
   this task).
 
   If you do a nested exit (a fake one), vmcs12 is made up to date, and 
  then
   indeed vmcs02 can be thrown away and regenerated.
 
   You would flush this state back to the vmcs.  But that just confirms 
  Joerg's statement that a fake vmexit/vmrun is more or less equivalent.
 
   The question is whether %rip points to the VMRUN/VMLAUNCH instruction, 
  HOST_RIP (or the next instruction for svm), or to guest code.  But the 
  actual things we need to do are all very similar subsets of a vmexit.

 %rip should certainly point to VMRUN. That way there is no need to save any 
 information whatsoever, as the VMCB is already in sane state and nothing 
 needs to be special cased, as the next VCPU_RUN would simply go back into 
 guest mode - which is exactly what we want.

 The only tricky part is how we distinguish between I need to live migrate 
 and info registers. In the former case, %rip should be on VMRUN. In the 
 latter, on the guest rip.

 We can split vmrun emulation into save host state, load guest state  
 and prepare nested vmcb.  Then, when we load registers, if we see that  
 we're in guest mode, we do just the prepare nested vmcb bit.

Or we just emulate a VMEXIT in the VCPU_FREEZE ioctl and set the
%rip back to the VMRUN that entered the L2 guest. For 'info registers'
the VCPU_FREEZE ioctl will not be issued and the guest registers be
displayed.
That way we don't need to migrate any additional state for SVM.

Joerg

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] new version of loaded_vmcs patch

2011-05-24 Thread Avi Kivity

On 05/24/2011 03:26 PM, Nadav Har'El wrote:

Hi Avi, here is a updated version of the loaded_vmcs patch which you asked me
to send before the rest of the nvmx patches.

Please let me know when you'd like me to send you an updated version of
the rest of the patches.


Subject: [PATCH 01/31] nVMX: Keep list of loaded VMCSs, instead of vcpus.

In VMX, before we bring down a CPU we must VMCLEAR all VMCSs loaded on it
because (at least in theory) the processor might not have written all of its
content back to memory. Since a patch from June 26, 2008, this is done using
a per-cpu vcpus_on_cpu linked list of vcpus loaded on each CPU.

The problem is that with nested VMX, we no longer have the concept of a
vcpu being loaded on a cpu: A vcpu has multiple VMCSs (one for L1, a pool for
L2s), and each of those may be have been last loaded on a different cpu.

So instead of linking the vcpus, we link the VMCSs, using a new structure
loaded_vmcs. This structure contains the VMCS, and the information pertaining
to its loading on a specific cpu (namely, the cpu number, and whether it
was already launched on this cpu once). In nested we will also use the same
structure to hold L2 VMCSs, and vmx-loaded_vmcs is a pointer to the
currently active VMCS.



Applied, thanks.

Do you have a list of unresolved issues?

If not, please repost the series.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Restoring saved guest causes guest to reboot

2011-05-24 Thread Avi Kivity

On 05/24/2011 04:02 PM, Joerg Roedel wrote:

On Tue, May 24, 2011 at 01:42:23PM +0200, Markus Schade wrote:
  Hi,

  On Tue, 24 May 2011, Joerg Roedel wrote:

  I could try to also confirm this on AMD X2, but it will take me a little
  to setup a machine.

  Your description points to some problem in the kernel. Would be good if
  you could try the same on your AMD X2 to find out if the problem is in
  generic kvm code or in the vmx part.

  Okay, it seems to be an Intel-specific bug. I have tested multiple qemu-kvm
  (0.13, 0.14 and 0.14.1) and kernel versions (2.6.37.6, 2.6.38.6 and
  2.6.39) on an AMD X2 6000+. Neither of them causes the guest to reboot.
  Everything works as expected.

Hmm, only 19 patches between .37 and .38 touch vmx.c. Avi, any idea which
of those patches could cause this?


Might be several.

Markus, can you try a bisect?

The command

 $ git bisect start v2.6.38 v2.6.37 arch/x86/kvm

will generate test kernels for you to compile and run.  There will be 
7-8 tests needed, and most of the compiles should be short (esp. if you 
install ccache).


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Restoring saved guest causes guest to reboot

2011-05-24 Thread Markus Schade

On Tue, 24 May 2011, Avi Kivity wrote:


Hmm, only 19 patches between .37 and .38 touch vmx.c. Avi, any idea which
of those patches could cause this?


Might be several.

Markus, can you try a bisect?

The command

$ git bisect start v2.6.38 v2.6.37 arch/x86/kvm

will generate test kernels for you to compile and run.  There will be 7-8 
tests needed, and most of the compiles should be short (esp. if you install 
ccache).


Sure. I will take some time, though. Shall I use Linus git tree or the
kvm one?

Best regards,
Markus
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Restoring saved guest causes guest to reboot

2011-05-24 Thread Avi Kivity

On 05/24/2011 04:37 PM, Markus Schade wrote:

On Tue, 24 May 2011, Avi Kivity wrote:

Hmm, only 19 patches between .37 and .38 touch vmx.c. Avi, any idea 
which

of those patches could cause this?


Might be several.

Markus, can you try a bisect?

The command

$ git bisect start v2.6.38 v2.6.37 arch/x86/kvm

will generate test kernels for you to compile and run.  There will be 
7-8 tests needed, and most of the compiles should be short (esp. if 
you install ccache).


Sure. I will take some time, though. Shall I use Linus git tree or the
kvm one?


Either will work - the kvm tree includes the Linus tree.  Since you 
tested .37 and .38, best to give those to git as starting points.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 20/31] nVMX: Exiting from L2 to L1

2011-05-24 Thread Nadav Har'El
On Tue, May 24, 2011, Tian, Kevin wrote about RE: [PATCH 20/31] nVMX: Exiting 
from L2 to L1:
  +vmcs12_guest_cr0(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
  +{
  +   /*
  +* As explained above, we take a bit from GUEST_CR0 if we allowed the
  +* guest to modify it untrapped (vcpu-arch.cr0_guest_owned_bits), or
  +* if we did trap it - if we did so because L1 asked to trap this bit
  +* (vmcs12-cr0_guest_host_mask). Otherwise (bits we trapped but L1
  +* didn't expect us to trap) we read from CR0_READ_SHADOW.
  +*/
  +   unsigned long guest_cr0_bits =
  +   vcpu-arch.cr0_guest_owned_bits | vmcs12-cr0_guest_host_mask;
  +   return (vmcs_readl(GUEST_CR0)  guest_cr0_bits) |
  +  (vmcs_readl(CR0_READ_SHADOW)  ~guest_cr0_bits);
  +}
 
 Hi, Nadav,
 
 Not sure whether I get above operation wrong.

This is one of the trickiest functions in nested VMX, which is why I added
15 lines of comments (!) on just two statements of code.

 But it looks not exactly correct to me
 in a glimpse. Say a bit set both in L0/L1's cr0_guest_host_mask. In such case 
 that
 bit from vmcs12_GUEST_CR0 resides in vmcs02_CR0_READ_SHADOW, however above
 operation will make vmcs02_GUEST_CR0 bit returned instead.

This behavior is correct: If a bit is set in L1's cr0_guest_host_mask (and
in particular, if it is set in both L0's and L1's), we always exit to L1 when
L2 changes this bit, and this bit cannot change while L2 is running, so
naturally after the run vmcs02.guest_cr0 and vmcs12.guest_cr0 are still
identical in that be.
Copying that bit from vmcs02_CR0_READ_SHADOW, like you suggested, would be
completely wrong in this case: When L1 set a bit in cr0_guest_host_mask,
the vmcs02-cr0_read_shadow is vmcs12-cr0_read_shadow (see nested_read_cr0),
and is just a pretense that L1 set up for L2 - it is NOT the real bit of
guest_cr0, so copying it into guest_cr0 would be wrong.

Note that this function is completely different from nested_read_cr0 (the
new name), which behaves similar to what you suggested but serves a completely
different (and in some respect, opposite) function.

I think my comments in the code are clearer than what I just wrote here, so
please take a look at them again, and let me know if you find any errors.

 Instead of constructing vmcs12_GUEST_CR0 completely from vmcs02_GUEST_CR0,
 why not just updating bits which can be altered while keeping the rest bits 
 from
 vmcs12_GUEST_CR0? Say something like:
 
 vmcs12-guest_cr0 = vmcs12-cr0_guest_host_mask; /* keep unchanged bits */
 vmcs12-guest_cr0 |= (vmcs_readl(GUEST_CR0)  
 vcpu-arch.cr0_guest_owned_bits) |
   (vmcs_readl(CR0_READ_SHADOW)  ~( vcpu-arch.cr0_guest_owned_bits | 
 vmcs12-cr0_guest_host_mask))

I guess I could do something like this, but do you think it's clearer?
I don't. Behind all the details, my formula emphasises that MOST cr0 bits
can be just copied from vmcs02 to vmcs12 as is - and we only have to do
something strange for special bits - where L0 wanted to trap but L1 didn't.
In your formula, it looks like there are 3 different cases instead of 2.

In any case, your formula is definitely not more correct, because the formulas
are in fact equivalent - let me prove:

If, instead of taking the unchanged bits (as you call them) from
vmcs12-guest_cr0, you take them from vmcs02-guest_cr0 (you can,
because they couldn't have changed), you end up with *exactly* the same
formula I used. Here is the proof:

 yourformula =
(vmcs12-guest_cr0  vmcs12-cr0_guest_host_mask) |
(vmcs_readl(GUEST_CR0)  vcpu-arch.cr0_guest_owned_bits) |
(vmcs_readl(CR0_READ_SHADOW)  ~( vcpu-arch.cr0_guest_owned_bits | 
vmcs12-cr0_guest_host_mask))

Now because of the unchanged bits,
(vmcs12-guest_cr0  vmcs12-cr0_guest_host_mask) ==
(vmcs02-guest_cr0  vmcs12-cr0_guest_host_mask) ==

  (and note that vmcs02-guest_cr0 is vmcs_readl(GUEST_CR0))

so this in yourformula, it becomes

 yourformula =
(vmcs_readl(GUEST_CR0)  vmcs12-cr0_guest_host_mask) |
(vmcs_readl(GUEST_CR0)  vcpu-arch.cr0_guest_owned_bits) |
(vmcs_readl(CR0_READ_SHADOW)  ~( vcpu-arch.cr0_guest_owned_bits | 
vmcs12-cr0_guest_host_mask))

or, simplifying

 yourformula =
(vmcs_readl(GUEST_CR0)  (vmcs12-cr0_guest_host_mask | 
vcpu-arch.cr0_guest_owned_bits) |
(vmcs_readl(CR0_READ_SHADOW)  ~( vcpu-arch.cr0_guest_owned_bits | 
vmcs12-cr0_guest_host_mask))

now, using the name I used:
unsigned long guest_cr0_bits =
vcpu-arch.cr0_guest_owned_bits | vmcs12-cr0_guest_host_mask;

you end up with

 yourforumula =
(vmcs_readl(GUEST_CR0)  guest_cr0_bits) |
(vmcs_readl(CR0_READ_SHADOW)  ~guest_cr0_bits )

Which is, believe it or not, exactly my formula :-)


-- 
Nadav Har'El|  Tuesday, May 24 2011, 20 Iyyar 5771
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |From the Linux 

Using KVM in user mode with public bridge

2011-05-24 Thread c . monty
Hello!
I have installed KVM/QEMU on my LMDE (Linux Mint Debian Edition) system.

The guest should use a public bridge that I have set up with 
/etc/network/interfaces. As you can see the bridge is configured to use a 
static IP:
 s@pc1-lmde ~ $ cat /etc/network/interfaces
# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5).

# The loopback network interface
auto lo
iface lo inet loopback

auto eth1
iface eth1 inet manual

auto br1
iface br1 inet static
    address 192.168.178.11      
    netmask 255.255.255.0
    gateway 192.168.178.1
    bridge_ports eth1
    bridge_stp off
    bridge_fd 0
    bridge_maxwait 0

Starting a VM with kvm command returns this error:
 s@pc1-lmde ~ $ kvm -hda 
~/.virtualmachines/aqemu/virtualdisk_converted/vm2-windowsxp-HD01-15G.qcow 
-boot c -cdrom /dev/cdrom -m 1024 -smp 1 -net nic,macaddr=00:88:4e:ad:2d:61 
-net tap
kvm: -net tap: could not configure /dev/net/tun (tap%d): Operation not permitted
kvm: -net tap: Device 'tap' could not be initialized

Checking the permissions of file /dev/net/tun:
 s@pc1-lmde ~ $ ls -l /dev/net/tun
crw-rw-rw- 1 root root 10, 200 21. Mai 09:52 /dev/net/tun
 Starting kvm as root by means of using sudo works. Then all processes runs 
as root.
Can you advice how to troubleshoot this problem? Why do I have to be root for 
starting KVM?
I thought the group assignment is correct:
 s@pc1-lmde ~ $ groups
sudo kvm libvirt

I would prefer to run kvm in user mode, means not as root.
I assume I have to initiate the TUN/TAP interface before the VM is started.
How can I do this using a script? Is there a Howto available providing 
instructions?

THX
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv2 10/14] virtio_net: limit xmit polling

2011-05-24 Thread Michael S. Tsirkin
On Tue, May 24, 2011 at 06:20:35PM +0530, Krishna Kumar2 wrote:
 Michael S. Tsirkin m...@redhat.com wrote on 05/24/2011 04:59:39 PM:
 
 Maybe Rusty means it is a simpler model to free the amount
 of space that this xmit needs. We will still fail anyway
 at some time but it is unlikely, since earlier iteration
 freed up atleast the space that it was going to use.
   
Not sure I nderstand.  We can't know space is freed in the previous
iteration as buffers might not have been used by then.
  
   Yes, the first few iterations may not have freed up space, but
   later ones should. The amount of free space should increase
   from then on, especially since we try to free double of what
   we consume.
 
  Hmm. This is only an upper limit on the # of entries in the queue.
  Assume that vq size is 4 and we transmit 4 enties without
  getting anything in the used ring. The next transmit will fail.
 
  So I don't really see why it's unlikely that we reach the packet
  drop code with your patch.
 
 I was assuming 256 entries :) I will try to get some
 numbers to see how often it is true tomorrow.

That would depend on how fast the hypervisor is.
Try doing something to make hypervisor slower than the guest.  I don't
think we need measurements to realize that with the host being slower
than guest that would happen a lot, though.

   I am not sure of why it was changed, since returning TX_BUSY
   seems more efficient IMHO.
   qdisc_restart() handles requeue'd
   packets much better than a stopped queue, as a significant
   part of this code is skipped if gso_skb is present
 
  I think this is the argument:
  http://www.mail-archive.com/virtualization@lists.linux-
  foundation.org/msg06364.html
 
 Thanks for digging up that thread! Yes, that one skb would get
 sent first ahead of possibly higher priority skbs. However,
 from a performance point, TX_BUSY code skips a lot of checks
 and code for all subsequent packets till the device is
 restarted. I can test performance with both cases and report
 what I find (the requeue code has become very simple and clean
 from horribly complex, thanks to Herbert and Dave).

Cc Herbert, and try to convince him :)

   (qdisc
   will eventually start dropping packets when tx_queue_len is
 
  tx_queue_len is a pretty large buffer so maybe no.
 
 I remember seeing tons of drops (pfifo_fast_enqueue) when
 xmit returns TX_BUSY.
 
  I think the packet drops from the scheduler queue can also be
  done intelligently (e.g. with CHOKe) which should
  work better than dropping a random packet?
 
 I am not sure of that - choke_enqueue checks against a random
 skb to drop current skb, and also during congestion. But for
 my sample driver xmit, returning TX_BUSY could still allow
 to be used with CHOKe.
 
 thanks,
 
 - KK
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] new version of loaded_vmcs patch

2011-05-24 Thread Avi Kivity

On 05/24/2011 04:14 PM, Avi Kivity wrote:

On 05/24/2011 03:26 PM, Nadav Har'El wrote:
Hi Avi, here is a updated version of the loaded_vmcs patch which you 
asked me

to send before the rest of the nvmx patches.

Please let me know when you'd like me to send you an updated version of
the rest of the patches.


Do you have a list of unresolved issues?

If not, please repost the series.



btw, since Kevin is now plowing through the patchset, please wait for 
the rest of his review.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] new version of loaded_vmcs patch

2011-05-24 Thread Nadav Har'El
On Tue, May 24, 2011, Avi Kivity wrote about Re: [PATCH] new version of 
loaded_vmcs patch:
 btw, since Kevin is now plowing through the patchset, please wait for 
 the rest of his review.

Ok, I will, but note that I'll also be here after the merge, and will continue
to explain the code to Kevin and others - and to modify the code - afterwards.

With nvmx in the tree, other people will also be able to directly modify it
themselves - if they feel strongly about some issue that I don't.

-- 
Nadav Har'El|  Tuesday, May 24 2011, 20 Iyyar 5771
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |:(){ :|:};: # DANGER: DO NOT run this,
http://nadav.harel.org.il   |unless you REALLY know what you're doing!
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/4] Make possible to run client tests as subtests

2011-05-24 Thread Lucas Meneghel Rodrigues
On Tue, 2011-05-24 at 04:08 -0300, Lucas Meneghel Rodrigues wrote:
 In order to avoid duplication of code, make it possible to run the
 existing autotest client tests as subtests. This patchset is result
 of work on Jiri Zupka original single patch, the differences:
 
 * Removed example subtest KVM autotest test
 * Renamed some API introduced to net_utils for consistency
 * Rewrote netperf in terms of the new 'subtest' infrastructure

For the record of documentation, there are still some things bothering
me with this patchset:

* I still didn't check whether the changes to autotest core (test class)
do break unittests. We might want to write a unittest for the new
methods of the test class.
* Maybe rather than simply calling job.runtest() we would be better of
executing the subtest on a new thread?
* The netperf reimplementation clearly lacks the functionality of the
current implementation. We need more work on it.

I asked Jiri to pick the patchset and look at these points of
improvement.

 Lucas Meneghel Rodrigues (4):
   client.bin.net.net_utils: Introduce get_local_ip()
   client: Make it possible to run subtests in autotest
   tools: Make html_report to deal with subtest results
   KVM test: Rewrite netperf in terms of subtest
 
  client/bin/client_logging_config.py |5 +-
  client/bin/net/net_utils.py |   17 +
  client/common_lib/base_job.py   |2 +
  client/common_lib/logging_config.py |3 +-
  client/common_lib/test.py   |   21 ++-
  client/tools/html_report.py |  124 +++---
  client/virt/tests/netperf.py|  117 +
  7 files changed, 143 insertions(+), 146 deletions(-)
 


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] new version of loaded_vmcs patch

2011-05-24 Thread Avi Kivity

On 05/24/2011 05:10 PM, Nadav Har'El wrote:

On Tue, May 24, 2011, Avi Kivity wrote about Re: [PATCH] new version of loaded_vmcs 
patch:
  btw, since Kevin is now plowing through the patchset, please wait for
  the rest of his review.

Ok, I will, but note that I'll also be here after the merge, and will continue
to explain the code to Kevin and others - and to modify the code - afterwards.


Certainly.


With nvmx in the tree, other people will also be able to directly modify it
themselves - if they feel strongly about some issue that I don't.


Sure.  I'll still ask you to review those changes, of course.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/5 V2] kvm tools: Initialize and use VESA and VNC

2011-05-24 Thread Avi Kivity

On 05/24/2011 02:41 PM, Avi Kivity wrote:



We switched to u64 and friends for two reasons: (1) using uint*_t
turned out to be painful when using kernel headers (e.g., mptables,
e820, etc.) and (2) we want to be as close as possible to the coding
style of tools/perf to be able to reuse their code in the future.



Regarding this reuse, I see it's done by copy/paste.  Won't it be 
better to have tools/lib and have tools/perf and tools/kvm use that?




Another thing, I believe reuse of actual kernel code has limited 
utility.  The kernel has vast infrastructure and porting all of it to 
userspace would be a huge undertaking.  It's pretty common for code to 
use printk(), kmalloc(), slab caches, rcu, various locks, per-cpu 
variables.  Are you going to port all of it?  and maintain it when it 
changes?


It's a lot easier to use the native userspace stacks.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/5 V2] kvm tools: Initialize and use VESA and VNC

2011-05-24 Thread Avi Kivity

On 05/24/2011 03:27 PM, Paolo Bonzini wrote:

On 05/24/2011 01:56 PM, Pekka Enberg wrote:
  If uint64_t is defined differently than u64, you won't be able to 
pass it by

  reference if an external library expects it.


Oh, the sizes must be the same for obvious reasons so that shouldn't
be a problem and it's better to do the casting in the external API
calls than in core code.


Be careful about unsigned long vs. unsigned long long. 


Yes, that's the issue I had in mind.

If you want to use the u64 spelling that's fine, but I would use a 
custom header that defines it from uint64_t.


And now you'll have printf/printk format mismatches if they happen to be 
different.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] new version of loaded_vmcs patch

2011-05-24 Thread Gleb Natapov
On Tue, May 24, 2011 at 05:10:56PM +0300, Nadav Har'El wrote:
 On Tue, May 24, 2011, Avi Kivity wrote about Re: [PATCH] new version of 
 loaded_vmcs patch:
  btw, since Kevin is now plowing through the patchset, please wait for 
  the rest of his review.
 
 Ok, I will, but note that I'll also be here after the merge, and will continue
 to explain the code to Kevin and others - and to modify the code - afterwards.
 
 With nvmx in the tree, other people will also be able to directly modify it
 themselves - if they feel strongly about some issue that I don't.
 
Like event handling? 

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] kvm: add missing void __user * cast to access_ok() call

2011-05-24 Thread Avi Kivity

On 05/24/2011 08:51 AM, Heiko Carstens wrote:

From: Heiko Carstensheiko.carst...@de.ibm.com

fa3d315a KVM: Validate userspace_addr of memslot when registered introduced
this new warning onn s390:

kvm_main.c: In function '__kvm_set_memory_region':
kvm_main.c:654:7: warning: passing argument 1 of '__access_ok' makes pointer 
from integer without a cast
arch/s390/include/asm/uaccess.h:53:19: note: expected 'const void *' but 
argument is of type '__u64'

Add the missing cast to get rid of it again...



Thanks, applied, and queued for 2.6.40/2.8.0/3.0/2011.1, as the case may be.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/5 V2] kvm tools: Initialize and use VESA and VNC

2011-05-24 Thread Pekka Enberg
Hi Avi,

On Tue, May 24, 2011 at 5:37 PM, Avi Kivity a...@redhat.com wrote:
 Another thing, I believe reuse of actual kernel code has limited utility.

OK, I don't agree with that.

On Tue, May 24, 2011 at 5:37 PM, Avi Kivity a...@redhat.com wrote:
  The kernel has vast infrastructure and porting all of it to userspace would
 be a huge undertaking.  It's pretty common for code to use printk(),
 kmalloc(), slab caches, rcu, various locks, per-cpu variables.  Are you
 going to port all of it?  and maintain it when it changes?

We will port whatever is useful and makes sense. Nothing new here -
we're following perf's lead, that's all.

As for the specific issues you mention, only RCU and locking
mechanisms seem like something that are not trivial to port (although
userspace RCU already exists).

[ Porting lockdep to userspace would be one cool feature, though. ]

On Tue, May 24, 2011 at 5:37 PM, Avi Kivity a...@redhat.com wrote:
 It's a lot easier to use the native userspace stacks.

We will be using userspace libs where appropriate (like with libvnc).

Pekka
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Using KVM in user mode with public bridge

2011-05-24 Thread David Mair

On 05/24/2011 07:44 AM, c.mo...@web.de wrote:

Hello!
I have installed KVM/QEMU on my LMDE (Linux Mint Debian Edition) system.

The guest should use a public bridge that I have set up with 
/etc/network/interfaces. As you can see the bridge is configured to use a 
static IP:
  s@pc1-lmde ~ $ cat /etc/network/interfaces
# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5).

# The loopback network interface
auto lo
iface lo inet loopback

auto eth1
iface eth1 inet manual

auto br1
iface br1 inet static
 address 192.168.178.11
 netmask 255.255.255.0
 gateway 192.168.178.1
 bridge_ports eth1
 bridge_stp off
 bridge_fd 0
 bridge_maxwait 0

Starting a VM with kvm command returns this error:
  s@pc1-lmde ~ $ kvm -hda 
~/.virtualmachines/aqemu/virtualdisk_converted/vm2-windowsxp-HD01-15G.qcow 
-boot c -cdrom /dev/cdrom -m 1024 -smp 1 -net nic,macaddr=00:88:4e:ad:2d:61 
-net tap
kvm: -net tap: could not configure /dev/net/tun (tap%d): Operation not permitted
kvm: -net tap: Device 'tap' could not be initialized

Checking the permissions of file /dev/net/tun:
  s@pc1-lmde ~ $ ls -l /dev/net/tun
crw-rw-rw- 1 root root 10, 200 21. Mai 09:52 /dev/net/tun
  Starting kvm as root by means of using sudo works. Then all processes runs 
as root.
Can you advice how to troubleshoot this problem? Why do I have to be root for 
starting KVM?
I thought the group assignment is correct:
  s@pc1-lmde ~ $ groups
sudo kvm libvirt

I would prefer to run kvm in user mode, means not as root.
I assume I have to initiate the TUN/TAP interface before the VM is started.
How can I do this using a script? Is there a Howto available providing 
instructions?


That's pretty much how I do it and have the following sudo configuration 
(paths from my environment):


myuserid ALL = NOPASSWD: \
/bin/tunctl -u myuserid -t kvmtap*, \
/bin/tunctl -d kvmtap*, \
/sbin/ifconfig kvmtap* up 0.0.0.0 promisc, \
/sbin/ifconfig kvmtap* down, \
/sbin/brctl addif * kvmtap*, \
/sbin/brctl delif * kvmtap*, \

Note that there is no qemu option in my sudoers. Then, my order of use 
is as follows but I use a script to avoid having to come up with the 
referenced random digits on demand:


sudo /bin/tunctl -u myuserid -t kvmtapfour random digits
That creates a tap interface
sudo /sbin/ifconfig kvmtapsame digits as in tunctl up 0.0.0.0 promisc
That brings the tap interface up with no configuration
sudo /sbin/brctl addif br1 kvmtapsame digits as in tunctl
That adds the tap to the bridge
qemu-system-x86_64 qemu args -net nic,vlan=0,model=your 
choice,macaddr=random mac generated when config was created -net 
tap,ifname=kvmtapsame digits as in tunctl,script=no,vlan=0

Delay while you use the VM until you shut it down and the
qemu proc exits, then:
sudo /sbin/brctl delif br1 kvmtapsame digits as in tunctl
That removes the tap from the bridge
sudo /sbin/ifconfig kvmtapsame digits as in tunctl down
That brings the tap interface down
sudo /bin/tunctl -d kvmtapsame digits as in tunctl
That deletes the tap interface

I load the kvm drivers as root but as far as I can tell qemu doesn't run 
as root. So, a real world example would look like this:


sudo /bin/tunctl -u myuserid -t kvmtap1981
sudo /sbin/ifconfig kvmtap1981 up 0.0.0.0 promisc
sudo /sbin/brctl addif br1 kvmtap1981
qemu-system-x86_64 -smp 1 -m 512 -usb -localtime etc -net 
nic,vlan=0,model=e1000,macaddr=ac:de:48:6c:5c:a3 -net 
tap,ifname=kvmtap1981,script=no,vlan=0 more etc

sudo /sbin/brctl delif br1 kvmtap1981
sudo /sbin/ifconfig kvmtap1981 down
sudo /bin/tunctl -d kvmtap1981

A repeat of the same script might do this the very next time:

sudo /bin/tunctl -u myuserid -t kvmtap8358
sudo /sbin/ifconfig kvmtap8358 up 0.0.0.0 promisc
sudo /sbin/brctl addif br8 kvmtap8358
qemu-system-x86_64 -smp 1 -m 512 -usb -localtime etc -net 
nic,vlan=0,model=e1000,macaddr=ac:de:48:6c:5c:a3 -net 
tap,ifname=kvmtap8358,script=no,vlan=0 more etc

sudo /sbin/brctl delif br8 kvmtap8358
sudo /sbin/ifconfig kvmtap8358 down
sudo /bin/tunctl -d kvmtap8358

Note that the MAC address doesn't change and if you were to check, it 
should be a legitimate IEEE 802.3 private MAC address, it was generated 
when the VM was created by the same script and is stored in a 
configuration file the script uses to launch VMs. My own br1 has no host 
physical NICs connected and has no routing but I also have a br8 with 
NAT routing via netfilter/iptables and a br0 that has my physical NIC 
connected so that I can do VMs that appear on the real network. I run 
the ISC dhcp server with configuration for br1 and br8. IOW, pretty much 
a copy of the default VMWare network installation.


Hopefully that saves you repeating some of the effort I had picking out 
the tunctl/brctl operation. Best of luck.


--
David.
--
To unsubscribe from this list: 

KVM: x86: use proper port value when checking io instruction permission

2011-05-24 Thread Marcelo Tosatti

Commit fa4491a6b667304 moved the permission check for io instructions
to the -check_perm callback. It failed to copy the port value from RDX
register for string and in,out ax,dx instructions. Fix it.

Fixes FC8.32 installation.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 3bc6b7a..df354a4 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -2944,6 +2944,15 @@ static int check_perm_in(struct x86_emulate_ctxt *ctxt)
 {
struct decode_cache *c = ctxt-decode;
 
+   switch (c-b) {
+   case 0x6c: /* insb */
+   case 0x6d: /* insw/insd */
+   case 0xec: /* in al,dx */
+   case 0xed: /* in (e/r)ax,dx */
+   c-src.val = c-regs[VCPU_REGS_RDX];
+   break;
+   }
+
c-dst.bytes = min(c-dst.bytes, 4u);
if (!emulator_io_permited(ctxt, c-src.val, c-dst.bytes))
return emulate_gp(ctxt, 0);
@@ -2955,6 +2964,15 @@ static int check_perm_out(struct x86_emulate_ctxt *ctxt)
 {
struct decode_cache *c = ctxt-decode;
 
+   switch (c-b) {
+   case 0x6e: /* outsb */
+   case 0x6f: /* outsw/outsd */
+   case 0xee: /* out dx,al */
+   case 0xef: /* out dx,(e/r)ax */
+   c-dst.val = c-regs[VCPU_REGS_RDX];
+   break;
+   }
+
c-src.bytes = min(c-src.bytes, 4u);
if (!emulator_io_permited(ctxt, c-dst.val, c-src.bytes))
return emulate_gp(ctxt, 0);
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM: x86: use proper port value when checking io instruction permission

2011-05-24 Thread Gleb Natapov
On Tue, May 24, 2011 at 02:11:20PM -0300, Marcelo Tosatti wrote:
 
 Commit fa4491a6b667304 moved the permission check for io instructions
 to the -check_perm callback. It failed to copy the port value from RDX
 register for string and in,out ax,dx instructions. Fix it.
 
 Fixes FC8.32 installation.
 
 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
 
 diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
 index 3bc6b7a..df354a4 100644
 --- a/arch/x86/kvm/emulate.c
 +++ b/arch/x86/kvm/emulate.c
 @@ -2944,6 +2944,15 @@ static int check_perm_in(struct x86_emulate_ctxt *ctxt)
  {
   struct decode_cache *c = ctxt-decode;
  
 + switch (c-b) {
 + case 0x6c: /* insb */
 + case 0x6d: /* insw/insd */
 + case 0xec: /* in al,dx */
 + case 0xed: /* in (e/r)ax,dx */
 + c-src.val = c-regs[VCPU_REGS_RDX];
 + break;
 + }
 +
   c-dst.bytes = min(c-dst.bytes, 4u);
   if (!emulator_io_permited(ctxt, c-src.val, c-dst.bytes))
   return emulate_gp(ctxt, 0);
 @@ -2955,6 +2964,15 @@ static int check_perm_out(struct x86_emulate_ctxt 
 *ctxt)
  {
   struct decode_cache *c = ctxt-decode;
  
 + switch (c-b) {
 + case 0x6e: /* outsb */
 + case 0x6f: /* outsw/outsd */
 + case 0xee: /* out dx,al */
 + case 0xef: /* out dx,(e/r)ax */
 + c-dst.val = c-regs[VCPU_REGS_RDX];
 + break;
 + }
 +
   c-src.bytes = min(c-src.bytes, 4u);
   if (!emulator_io_permited(ctxt, c-dst.val, c-src.bytes))
   return emulate_gp(ctxt, 0);
I'd rather do it at decoding stage by adding SrcDX/DstDX.

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v4 0/4] rbd improvements

2011-05-24 Thread Josh Durgin
This patchset moves the complexity of the rbd format into librbd and
adds truncation support.

Changes since v3:
 * trivially rebased
 * updated copyright header

Changes since v2:
 * return values are checked in rbd_aio_rw_vector
 * bdrv_truncate added

Josh Durgin (4):
  rbd: use the higher level librbd instead of just librados
  rbd: allow configuration of rados from the rbd filename
  rbd: check return values when scheduling aio
  rbd: Add bdrv_truncate implementation

 block/rbd.c   |  892 +++--
 block/rbd_types.h |   71 -
 configure |   33 +--
 3 files changed, 333 insertions(+), 663 deletions(-)
 delete mode 100644 block/rbd_types.h

-- 
1.7.2.3

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v4 1/4] rbd: use the higher level librbd instead of just librados

2011-05-24 Thread Josh Durgin
librbd stacks on top of librados to provide access
to rbd images.

Using librbd simplifies the qemu code, and allows
qemu to use new versions of the rbd format
with few (if any) changes.

Signed-off-by: Josh Durgin josh.dur...@dreamhost.com
Signed-off-by: Yehuda Sadeh yeh...@hq.newdream.net
---
 block/rbd.c   |  790 +++--
 block/rbd_types.h |   71 -
 configure |   33 +--
 3 files changed, 224 insertions(+), 670 deletions(-)
 delete mode 100644 block/rbd_types.h

diff --git a/block/rbd.c b/block/rbd.c
index 249a590..1c8e7c7 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -1,20 +1,22 @@
 /*
  * QEMU Block driver for RADOS (Ceph)
  *
- * Copyright (C) 2010 Christian Brunner c...@muc.de
+ * Copyright (C) 2010-2011 Christian Brunner c...@muc.de,
+ * Josh Durgin josh.dur...@dreamhost.com
  *
  * This work is licensed under the terms of the GNU GPL, version 2.  See
  * the COPYING file in the top-level directory.
  *
  */
 
+#include inttypes.h
+
 #include qemu-common.h
 #include qemu-error.h
 
-#include rbd_types.h
 #include block_int.h
 
-#include rados/librados.h
+#include rbd/librbd.h
 
 
 
@@ -40,6 +42,12 @@
 
 #define OBJ_MAX_SIZE (1UL  OBJ_DEFAULT_OBJ_ORDER)
 
+#define RBD_MAX_CONF_NAME_SIZE 128
+#define RBD_MAX_CONF_VAL_SIZE 512
+#define RBD_MAX_CONF_SIZE 1024
+#define RBD_MAX_POOL_NAME_SIZE 128
+#define RBD_MAX_SNAP_NAME_SIZE 128
+
 typedef struct RBDAIOCB {
 BlockDriverAIOCB common;
 QEMUBH *bh;
@@ -48,7 +56,6 @@ typedef struct RBDAIOCB {
 char *bounce;
 int write;
 int64_t sector_num;
-int aiocnt;
 int error;
 struct BDRVRBDState *s;
 int cancelled;
@@ -59,7 +66,7 @@ typedef struct RADOSCB {
 RBDAIOCB *acb;
 struct BDRVRBDState *s;
 int done;
-int64_t segsize;
+int64_t size;
 char *buf;
 int ret;
 } RADOSCB;
@@ -69,25 +76,22 @@ typedef struct RADOSCB {
 
 typedef struct BDRVRBDState {
 int fds[2];
-rados_pool_t pool;
-rados_pool_t header_pool;
-char name[RBD_MAX_OBJ_NAME_SIZE];
-char block_name[RBD_MAX_BLOCK_NAME_SIZE];
-uint64_t size;
-uint64_t objsize;
+rados_t cluster;
+rados_ioctx_t io_ctx;
+rbd_image_t image;
+char name[RBD_MAX_IMAGE_NAME_SIZE];
 int qemu_aio_count;
+char *snap;
 int event_reader_pos;
 RADOSCB *event_rcb;
 } BDRVRBDState;
 
-typedef struct rbd_obj_header_ondisk RbdHeader1;
-
 static void rbd_aio_bh_cb(void *opaque);
 
-static int rbd_next_tok(char *dst, int dst_len,
-char *src, char delim,
-const char *name,
-char **p)
+static int qemu_rbd_next_tok(char *dst, int dst_len,
+ char *src, char delim,
+ const char *name,
+ char **p)
 {
 int l;
 char *end;
@@ -115,10 +119,10 @@ static int rbd_next_tok(char *dst, int dst_len,
 return 0;
 }
 
-static int rbd_parsename(const char *filename,
- char *pool, int pool_len,
- char *snap, int snap_len,
- char *name, int name_len)
+static int qemu_rbd_parsename(const char *filename,
+  char *pool, int pool_len,
+  char *snap, int snap_len,
+  char *name, int name_len)
 {
 const char *start;
 char *p, *buf;
@@ -131,12 +135,12 @@ static int rbd_parsename(const char *filename,
 buf = qemu_strdup(start);
 p = buf;
 
-ret = rbd_next_tok(pool, pool_len, p, '/', pool name, p);
+ret = qemu_rbd_next_tok(pool, pool_len, p, '/', pool name, p);
 if (ret  0 || !p) {
 ret = -EINVAL;
 goto done;
 }
-ret = rbd_next_tok(name, name_len, p, '@', object name, p);
+ret = qemu_rbd_next_tok(name, name_len, p, '@', object name, p);
 if (ret  0) {
 goto done;
 }
@@ -145,123 +149,35 @@ static int rbd_parsename(const char *filename,
 goto done;
 }
 
-ret = rbd_next_tok(snap, snap_len, p, '\0', snap name, p);
+ret = qemu_rbd_next_tok(snap, snap_len, p, '\0', snap name, p);
 
 done:
 qemu_free(buf);
 return ret;
 }
 
-static int create_tmap_op(uint8_t op, const char *name, char **tmap_desc)
-{
-uint32_t len = strlen(name);
-uint32_t len_le = cpu_to_le32(len);
-/* total_len = encoding op + name + empty buffer */
-uint32_t total_len = 1 + (sizeof(uint32_t) + len) + sizeof(uint32_t);
-uint8_t *desc = NULL;
-
-desc = qemu_malloc(total_len);
-
-*tmap_desc = (char *)desc;
-
-*desc = op;
-desc++;
-memcpy(desc, len_le, sizeof(len_le));
-desc += sizeof(len_le);
-memcpy(desc, name, len);
-desc += len;
-len = 0; /* no need for endian conversion for 0 */
-memcpy(desc, len, sizeof(len));
-desc += sizeof(len);
-
-return (char *)desc - *tmap_desc;
-}
-
-static void free_tmap_op(char *tmap_desc)
-{
-   

[PATCH v4 2/4] rbd: allow configuration of rados from the rbd filename

2011-05-24 Thread Josh Durgin
The new format is rbd:pool/image[@snapshot][:option1=value1[:option2=value2...]]
Each option is used to configure rados, and may be any Ceph option, or conf.
The conf option specifies a Ceph configuration file to read.

This allows rbd volumes from more than one Ceph cluster to be used by
specifying different monitor addresses, as well as having different
logging levels or locations for different volumes.

Signed-off-by: Josh Durgin josh.dur...@dreamhost.com
---
 block/rbd.c |  119 ++
 1 files changed, 102 insertions(+), 17 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index 1c8e7c7..ff74e8b 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -23,13 +23,17 @@
 /*
  * When specifying the image filename use:
  *
- * rbd:poolname/devicename
+ * rbd:poolname/devicename[@snapshotname][:option1=value1[:option2=value2...]]
  *
  * poolname must be the name of an existing rados pool
  *
  * devicename is the basename for all objects used to
  * emulate the raw device.
  *
+ * Each option given is used to configure rados, and may be
+ * any Ceph option, or conf. The conf option specifies
+ * a Ceph configuration file to read.
+ *
  * Metadata information (image size, ...) is stored in an
  * object with the name devicename.rbd.
  *
@@ -122,7 +126,8 @@ static int qemu_rbd_next_tok(char *dst, int dst_len,
 static int qemu_rbd_parsename(const char *filename,
   char *pool, int pool_len,
   char *snap, int snap_len,
-  char *name, int name_len)
+  char *name, int name_len,
+  char *conf, int conf_len)
 {
 const char *start;
 char *p, *buf;
@@ -134,28 +139,84 @@ static int qemu_rbd_parsename(const char *filename,
 
 buf = qemu_strdup(start);
 p = buf;
+*snap = '\0';
+*conf = '\0';
 
 ret = qemu_rbd_next_tok(pool, pool_len, p, '/', pool name, p);
 if (ret  0 || !p) {
 ret = -EINVAL;
 goto done;
 }
-ret = qemu_rbd_next_tok(name, name_len, p, '@', object name, p);
-if (ret  0) {
-goto done;
+
+if (strchr(p, '@')) {
+ret = qemu_rbd_next_tok(name, name_len, p, '@', object name, p);
+if (ret  0) {
+goto done;
+}
+ret = qemu_rbd_next_tok(snap, snap_len, p, ':', snap name, p);
+} else {
+ret = qemu_rbd_next_tok(name, name_len, p, ':', object name, p);
 }
-if (!p) {
-*snap = '\0';
+if (ret  0 || !p) {
 goto done;
 }
 
-ret = qemu_rbd_next_tok(snap, snap_len, p, '\0', snap name, p);
+ret = qemu_rbd_next_tok(conf, conf_len, p, '\0', configuration, p);
 
 done:
 qemu_free(buf);
 return ret;
 }
 
+static int qemu_rbd_set_conf(rados_t cluster, const char *conf)
+{
+char *p, *buf;
+char name[RBD_MAX_CONF_NAME_SIZE];
+char value[RBD_MAX_CONF_VAL_SIZE];
+int ret = 0;
+
+buf = qemu_strdup(conf);
+p = buf;
+
+while (p) {
+ret = qemu_rbd_next_tok(name, sizeof(name), p,
+'=', conf option name, p);
+if (ret  0) {
+break;
+}
+
+if (!p) {
+error_report(conf option %s has no value, name);
+ret = -EINVAL;
+break;
+}
+
+ret = qemu_rbd_next_tok(value, sizeof(value), p,
+':', conf option value, p);
+if (ret  0) {
+break;
+}
+
+if (strncmp(name, conf, strlen(conf))) {
+ret = rados_conf_set(cluster, name, value);
+if (ret  0) {
+error_report(invalid conf option %s, name);
+ret = -EINVAL;
+break;
+}
+} else {
+ret = rados_conf_read_file(cluster, value);
+if (ret  0) {
+error_report(error reading conf file %s, value);
+break;
+}
+}
+}
+
+qemu_free(buf);
+return ret;
+}
+
 static int qemu_rbd_create(const char *filename, QEMUOptionParameter *options)
 {
 int64_t bytes = 0;
@@ -164,6 +225,7 @@ static int qemu_rbd_create(const char *filename, 
QEMUOptionParameter *options)
 char pool[RBD_MAX_POOL_NAME_SIZE];
 char name[RBD_MAX_IMAGE_NAME_SIZE];
 char snap_buf[RBD_MAX_SNAP_NAME_SIZE];
+char conf[RBD_MAX_CONF_SIZE];
 char *snap = NULL;
 rados_t cluster;
 rados_ioctx_t io_ctx;
@@ -171,7 +233,8 @@ static int qemu_rbd_create(const char *filename, 
QEMUOptionParameter *options)
 
 if (qemu_rbd_parsename(filename, pool, sizeof(pool),
snap_buf, sizeof(snap_buf),
-   name, sizeof(name))  0) {
+   name, sizeof(name),
+   conf, sizeof(conf))  0) {
 return -EINVAL;
 }
 if (snap_buf[0] != '\0') {
@@ -204,8 +267,17 @@ static int qemu_rbd_create(const char 

[PATCH v4 4/4] rbd: Add bdrv_truncate implementation

2011-05-24 Thread Josh Durgin
Signed-off-by: Josh Durgin josh.dur...@dreamhost.com
---
 block/rbd.c |   15 +++
 1 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index c9f32e4..015ae8e 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -687,6 +687,20 @@ static int64_t qemu_rbd_getlength(BlockDriverState *bs)
 return info.size;
 }
 
+static int qemu_rbd_truncate(BlockDriverState *bs, int64_t offset)
+{
+BDRVRBDState *s = bs-opaque;
+int r;
+
+r = rbd_resize(s-image, offset);
+if (r  0) {
+error_report(failed to resize rbd image);
+return -EIO;
+}
+
+return 0;
+}
+
 static int qemu_rbd_snap_create(BlockDriverState *bs,
 QEMUSnapshotInfo *sn_info)
 {
@@ -785,6 +799,7 @@ static BlockDriver bdrv_rbd = {
 .bdrv_get_info  = qemu_rbd_getinfo,
 .create_options = qemu_rbd_create_options,
 .bdrv_getlength = qemu_rbd_getlength,
+.bdrv_truncate  = qemu_rbd_truncate,
 .protocol_name  = rbd,
 
 .bdrv_aio_readv = qemu_rbd_aio_readv,
-- 
1.7.2.3

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v4 3/4] rbd: check return values when scheduling aio

2011-05-24 Thread Josh Durgin
If scheduling fails, the number of outstanding I/Os must be correct,
or there will be a hang when waiting for everything to be flushed.

Reported-by: Stefan Hajnoczi stefa...@gmail.com
Signed-off-by: Josh Durgin josh.dur...@dreamhost.com
---
 block/rbd.c |   24 
 1 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index ff74e8b..c9f32e4 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -580,10 +580,14 @@ static BlockDriverAIOCB 
*rbd_aio_rw_vector(BlockDriverState *bs,
 rbd_completion_t c;
 int64_t off, size;
 char *buf;
+int r;
 
 BDRVRBDState *s = bs-opaque;
 
 acb = qemu_aio_get(rbd_aio_pool, bs, cb, opaque);
+if (!acb) {
+return NULL;
+}
 acb-write = write;
 acb-qiov = qiov;
 acb-bounce = qemu_blockalign(bs, qiov-size);
@@ -610,16 +614,28 @@ static BlockDriverAIOCB 
*rbd_aio_rw_vector(BlockDriverState *bs,
 rcb-buf = buf;
 rcb-s = acb-s;
 rcb-size = size;
+r = rbd_aio_create_completion(rcb, (rbd_callback_t) rbd_finish_aiocb, c);
+if (r  0) {
+goto failed;
+}
 
 if (write) {
-rbd_aio_create_completion(rcb, (rbd_callback_t) rbd_finish_aiocb, c);
-rbd_aio_write(s-image, off, size, buf, c);
+r = rbd_aio_write(s-image, off, size, buf, c);
 } else {
-rbd_aio_create_completion(rcb, (rbd_callback_t) rbd_finish_aiocb, c);
-rbd_aio_read(s-image, off, size, buf, c);
+r = rbd_aio_read(s-image, off, size, buf, c);
+}
+
+if (r  0) {
+goto failed;
 }
 
 return acb-common;
+
+failed:
+qemu_free(rcb);
+s-qemu_aio_count--;
+qemu_aio_release(acb);
+return NULL;
 }
 
 static BlockDriverAIOCB *qemu_rbd_aio_readv(BlockDriverState *bs,
-- 
1.7.2.3

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/5 V2] kvm tools: Initialize and use VESA and VNC

2011-05-24 Thread Ingo Molnar

* Pekka Enberg penb...@kernel.org wrote:

 We switched to u64 and friends for two reasons: (1) using uint*_t turned out 
 to be painful when using kernel headers (e.g., mptables, e820, etc.) and (2) 
 we want to be as close as possible to the coding style of tools/perf to be 
 able to reuse their code in the future.

3)

uint*_t caused frequent type mismatches and breakages with format strings if 
accidentally an uint*_t was stuck into a regular printf format - it would 
trigger a warning only on 64-bit or only on 32-bit systems, causing frequent 
flux. This is not a very good data type model.

4)

uint*_t also causes absolute brain-damaged printf format hacks like:

fprintf(stderr, %s: % PRIu64  % PRIu64  % PRIu64 \n,
event_name(counter), count[0], count[1], count[2]);

instead of the much cleaner:

fprintf(stderr, %s: %Lu %Lu %Lu\n,
event_name(counter), count[0], count[1], count[2]);

So we can use uint*_t where absolutely necessary due to external ABI 
constraints, but otherwise it's discouraged for tools/kvm/ internal code.

Thanks,

Ingo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/5 V2] kvm tools: Initialize and use VESA and VNC

2011-05-24 Thread Ingo Molnar

* Pekka Enberg penb...@kernel.org wrote:

 As for the specific issues you mention, only RCU and locking mechanisms seem 
 like something that are not trivial to port (although userspace RCU already 
 exists).

Indeed.

 [ Porting lockdep to userspace would be one cool feature, though. ]

Agreed - we already ran into a few locking bugs that were IMO needlessly 
difficult to debug - our expectations got spoiled by lockdep i guess! :-)

Porting RCU would also have its distinct advantages in terms of scalability. 
While tools/perf/ probably wont need it (all its scalability-relevant code runs 
in the kernel), tools/kvm/ definitely looks like a candidate for good 
user-space RCU primitives.

 On Tue, May 24, 2011 at 5:37 PM, Avi Kivity a...@redhat.com wrote:
  It's a lot easier to use the native userspace stacks.
 
 We will be using userspace libs where appropriate (like with libvnc).

We are using a fair amount of userspace libraries in perf too: -ldw, -lelf, 
-lnewt, -lslang, we used -laudit too for some time.

Thanks,

Ingo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM: x86: use proper port value when checking io instruction permission

2011-05-24 Thread Avi Kivity

On 05/24/2011 08:27 PM, Gleb Natapov wrote:

On Tue, May 24, 2011 at 02:11:20PM -0300, Marcelo Tosatti wrote:

  Commit fa4491a6b667304 moved the permission check for io instructions
  to the -check_perm callback. It failed to copy the port value from RDX
  register for string and in,out ax,dx instructions. Fix it.

  Fixes FC8.32 installation.


Ouch.



  @@ -2955,6 +2964,15 @@ static int check_perm_out(struct x86_emulate_ctxt 
*ctxt)
   {
struct decode_cache *c =ctxt-decode;

  + switch (c-b) {
  + case 0x6e: /* outsb */
  + case 0x6f: /* outsw/outsd */
  + case 0xee: /* out dx,al */
  + case 0xef: /* out dx,(e/r)ax */
  + c-dst.val = c-regs[VCPU_REGS_RDX];
  + break;
  + }
  +
c-src.bytes = min(c-src.bytes, 4u);
if (!emulator_io_permited(ctxt, c-dst.val, c-src.bytes))
return emulate_gp(ctxt, 0);
I'd rather do it at decoding stage by adding SrcDX/DstDX.



Note we haven't decoded operands yet.  And this doesn't fix in $imm8, %al.

Maybe we need an additional check site after operands are fetched.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/5 V2] kvm tools: Initialize and use VESA and VNC

2011-05-24 Thread Ingo Molnar

* Paolo Bonzini pbonz...@redhat.com wrote:

 On 05/24/2011 10:50 AM, Ingo Molnar wrote:
 Yeah, that would certainly work for simple things but there's several reasons
 why explicit control over initcalls is preferred in a tool like tools/kvm/ 
 over
 __attribute__((constructor)):
 
 Some advantages you mention are real indeed, but they are general;
 there's no reason why they apply only to tools/kvm.  You can achieve
 the same by doing only minimal work (registering your
 subsystems/devices/whatever in a linked list) in the constructors.
 Then you iterate on the list and call function pointers.

Well, the plain fact that __attribute__((constructor)) gets called on binary 
and shared library startup before main() is called is a show-stopper for us as 
i described it in my previous mail, so why are we even arguing about it?

((constructor)) has showstopper properties:

 - We don't have access to the program arguments

 - stdio is probably not set up yet (this is undefined AFAICS)

 - Also, over the years i have grown to be suspicious of GCC defined 
   extensions. More often than not the GCC project is fixing regressions not by 
   fixing the compiler but by changing the documentation ;-) We got bitten by 
   regressions in asm() behavior in the kernel rather often.

   In that sense ((section)) is way more robust: there's not really that many 
   ways to screw that up. Fiddling with the ((constructor)) environment on the 
   other hand ...

Note that in terms of explicit iterations we do that with 
__attribute__((section)) as well: except that *we* define where and how the 
functions get called.

 I know portability is not relevant to tools/kvm/, but using unportable tricks 
 for the sake of using them is a direct way to NIH. But oh well all of 
 tools/kvm/ is NIH after all. :)

__attribute__((constructor)) is not particularly portable to begin with: does 
the MSVC compiler support it for example?

So __attribute__ ((section)), which is used by the initcall() machinery is 
similarly portable: GCC and LLVM/Clang support it.

Thanks,

Ingo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM: x86: use proper port value when checking io instruction permission

2011-05-24 Thread Gleb Natapov
On Tue, May 24, 2011 at 10:07:48PM +0300, Avi Kivity wrote:
   @@ -2955,6 +2964,15 @@ static int check_perm_out(struct x86_emulate_ctxt 
  *ctxt)
{
 struct decode_cache *c =ctxt-decode;
 
   + switch (c-b) {
   + case 0x6e: /* outsb */
   + case 0x6f: /* outsw/outsd */
   + case 0xee: /* out dx,al */
   + case 0xef: /* out dx,(e/r)ax */
   + c-dst.val = c-regs[VCPU_REGS_RDX];
   + break;
   + }
   +
 c-src.bytes = min(c-src.bytes, 4u);
 if (!emulator_io_permited(ctxt, c-dst.val, c-src.bytes))
 return emulate_gp(ctxt, 0);
 I'd rather do it at decoding stage by adding SrcDX/DstDX.
 
 
 Note we haven't decoded operands yet.  And this doesn't fix in $imm8, %al.
 
We haven't? check_perm is called from x86_emulate_insn() and operands are
decode in x86_decode_insn(). So $imm8, %al should work now. Or am I
missing something?

 Maybe we need an additional check site after operands are fetched.
 
 -- 
 I have a truly marvellous patch that fixes the bug which this
 signature is too narrow to contain.

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM: x86: use proper port value when checking io instruction permission

2011-05-24 Thread Avi Kivity

On 05/24/2011 10:18 PM, Gleb Natapov wrote:

  Note we haven't decoded operands yet.  And this doesn't fix in $imm8, %al.

We haven't? check_perm is called from x86_emulate_insn() and operands are
decode in x86_decode_insn(). So $imm8, %al should work now. Or am I
missing something?


No, you're right.  SrcDX seems the simplest fix then.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/5 V2] kvm tools: Initialize and use VESA and VNC

2011-05-24 Thread Ingo Molnar

* Paolo Bonzini pbonz...@redhat.com wrote:

 On 05/24/2011 10:50 AM, Ingo Molnar wrote:
   - C constructors have random limitations like apparently not being 
  executed if
 the constructor is linked into a .a static library.
 
 Ah, forgot about this.  Given _why_ this happens (for static libraries, the 
 linker omits object modules that are not required to fulfill undefined 
 references in previous objects), I'd be surprised if explicit section tricks 
 do not have the same limitation.  [...]

Since we create an actual array (data) and iterate it the worst i can imagine 
is the linker dropping those sections - in which case the build will fail and 
we are alerted to the problem.

With ((constructor)) the program will misbehave silently - not good.

 [...] After all the constructor attribute is only syntactic sugar for section 
 tricks.

Yeah but not just syntactic sugar but also code built into glibc to execute 
them at some point before main().

And this last bit really matters as well.

Thanks,

Ingo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 20/31] nVMX: Exiting from L2 to L1

2011-05-24 Thread Tian, Kevin
 From: Nadav Har'El [mailto:n...@math.technion.ac.il]
 Sent: Tuesday, May 24, 2011 9:43 PM
 
 On Tue, May 24, 2011, Tian, Kevin wrote about RE: [PATCH 20/31] nVMX:
 Exiting from L2 to L1:
   +vmcs12_guest_cr0(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
   +{
   + /*
   +  * As explained above, we take a bit from GUEST_CR0 if we allowed
 the
   +  * guest to modify it untrapped (vcpu-arch.cr0_guest_owned_bits),
 or
   +  * if we did trap it - if we did so because L1 asked to trap this bit
   +  * (vmcs12-cr0_guest_host_mask). Otherwise (bits we trapped but
 L1
   +  * didn't expect us to trap) we read from CR0_READ_SHADOW.
   +  */
   + unsigned long guest_cr0_bits =
   + vcpu-arch.cr0_guest_owned_bits |
 vmcs12-cr0_guest_host_mask;
   + return (vmcs_readl(GUEST_CR0)  guest_cr0_bits) |
   +(vmcs_readl(CR0_READ_SHADOW)  ~guest_cr0_bits);
   +}
 
  Hi, Nadav,
 
  Not sure whether I get above operation wrong.
 
 This is one of the trickiest functions in nested VMX, which is why I added
 15 lines of comments (!) on just two statements of code.

I read the comment carefully, and the scenario I described is not covered there.

 
  But it looks not exactly correct to me
  in a glimpse. Say a bit set both in L0/L1's cr0_guest_host_mask. In such 
  case
 that
  bit from vmcs12_GUEST_CR0 resides in vmcs02_CR0_READ_SHADOW,
 however above
  operation will make vmcs02_GUEST_CR0 bit returned instead.
 
 This behavior is correct: If a bit is set in L1's cr0_guest_host_mask (and
 in particular, if it is set in both L0's and L1's), we always exit to L1 when
 L2 changes this bit, and this bit cannot change while L2 is running, so
 naturally after the run vmcs02.guest_cr0 and vmcs12.guest_cr0 are still
 identical in that be.

Are you sure this is the case? vmcs12.guest_cr0 is identical to an operation
that L1 tries to update GUEST_CR0 when you prepare vmcs02 which is why
you use vmx_set_cr0(vcpu, vmcs12-guest_cr0) in prepare_vmcs02. If L0 
has one bit set in L0's cr0_guest_host_mask, the corresponding bit in 
vmcs12.guest_cr0 will be cached in vmcs02.cr0_read_shadow anyway. This
is not related to whether L2 changes that bit.

IOW, I disagree that if L0/L1 set same bit in cr0_guest_host_mask, then
the bit is identical in vmcs02.guest_cr0 and vmcs12.guest_cr0 because L1
has no permission to set its bit effectively in this case.

 Copying that bit from vmcs02_CR0_READ_SHADOW, like you suggested, would
 be
 completely wrong in this case: When L1 set a bit in cr0_guest_host_mask,
 the vmcs02-cr0_read_shadow is vmcs12-cr0_read_shadow (see
 nested_read_cr0),
 and is just a pretense that L1 set up for L2 - it is NOT the real bit of
 guest_cr0, so copying it into guest_cr0 would be wrong.

So I'm talking about reserving that bit from vmcs12.guest_cr0 when it's set
in vmcs12.cr0_guest_host_mask which is a natural output.

 
 Note that this function is completely different from nested_read_cr0 (the
 new name), which behaves similar to what you suggested but serves a
 completely
 different (and in some respect, opposite) function.
 
 I think my comments in the code are clearer than what I just wrote here, so
 please take a look at them again, and let me know if you find any errors.
 
  Instead of constructing vmcs12_GUEST_CR0 completely from
 vmcs02_GUEST_CR0,
  why not just updating bits which can be altered while keeping the rest bits
 from
  vmcs12_GUEST_CR0? Say something like:
 
  vmcs12-guest_cr0 = vmcs12-cr0_guest_host_mask; /* keep unchanged
 bits */
  vmcs12-guest_cr0 |= (vmcs_readl(GUEST_CR0) 
 vcpu-arch.cr0_guest_owned_bits) |
  (vmcs_readl(CR0_READ_SHADOW) 
 ~( vcpu-arch.cr0_guest_owned_bits | vmcs12-cr0_guest_host_mask))
 
 I guess I could do something like this, but do you think it's clearer?
 I don't. Behind all the details, my formula emphasises that MOST cr0 bits
 can be just copied from vmcs02 to vmcs12 as is - and we only have to do
 something strange for special bits - where L0 wanted to trap but L1 didn't.
 In your formula, it looks like there are 3 different cases instead of 2.

But my formula is more clear given that it sticks to the implication of the
cr0_guest_host_mask. You only need to update cr0 bits which can be modified
by the L2 w/o trap while just keeping the rest.

 
 In any case, your formula is definitely not more correct, because the formulas
 are in fact equivalent - let me prove:
 
 If, instead of taking the unchanged bits (as you call them) from
 vmcs12-guest_cr0, you take them from vmcs02-guest_cr0 (you can,
 because they couldn't have changed), you end up with *exactly* the same
 formula I used. Here is the proof:
 
  yourformula =
   (vmcs12-guest_cr0  vmcs12-cr0_guest_host_mask) |
   (vmcs_readl(GUEST_CR0)  vcpu-arch.cr0_guest_owned_bits) |
   (vmcs_readl(CR0_READ_SHADOW) 
 ~( vcpu-arch.cr0_guest_owned_bits | vmcs12-cr0_guest_host_mask))
 
 Now because of the unchanged bits,
   (vmcs12-guest_cr0  vmcs12-cr0_guest_host_mask) ==
   (vmcs02-guest_cr0  

Re: Virtual SCSI disks hangs on heavy IO

2011-05-24 Thread Scott
Guido Winkelmann guido-kvml at thisisnotatest.de writes:

 
 Hi,
 
 I am experiencing hangs in disk IO in systems hosted inside a virtual KVM 
 machine. When the virtual system disk is SCSI and when I am doing a lot of 
 I/O 
 on it, I will eventually get error messages on the console and in dmesg like 
 these:


I do not have an explanation, but I have seen the same error under similar
circumstances. The server here is running libvirt 0.8.3-4 with QEMU 0.13.0 on
Fedora 14, on a multiprocessing Intel Xeon system. Only one guest is running at
the moment, but the error occurred during a large file copy operation. The
messages in the log were virtually identical to the ones you have seen.

Kind regards,

Scott




--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] KVM: x86 emulator: Clean up init_emulate_ctxt()

2011-05-24 Thread Takuya Yoshikawa
From: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp

Use a local pointer to the emulate_ctxt for simplicity.  Then, arrange
the hard-to-read mode selection lines neatly.

Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp
---
 arch/x86/kvm/x86.c |   21 +++--
 1 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index da48622..e7d337e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4508,7 +4508,8 @@ static void inject_emulated_exception(struct kvm_vcpu 
*vcpu)
 
 static void init_emulate_ctxt(struct kvm_vcpu *vcpu)
 {
-   struct decode_cache *c = vcpu-arch.emulate_ctxt.decode;
+   struct x86_emulate_ctxt *ctxt = vcpu-arch.emulate_ctxt;
+   struct decode_cache *c = ctxt-decode;
int cs_db, cs_l;
 
/*
@@ -4521,15 +4522,15 @@ static void init_emulate_ctxt(struct kvm_vcpu *vcpu)
 
kvm_x86_ops-get_cs_db_l_bits(vcpu, cs_db, cs_l);
 
-   vcpu-arch.emulate_ctxt.eflags = kvm_get_rflags(vcpu);
-   vcpu-arch.emulate_ctxt.eip = kvm_rip_read(vcpu);
-   vcpu-arch.emulate_ctxt.mode =
-   (!is_protmode(vcpu)) ? X86EMUL_MODE_REAL :
-   (vcpu-arch.emulate_ctxt.eflags  X86_EFLAGS_VM)
-   ? X86EMUL_MODE_VM86 : cs_l
-   ? X86EMUL_MODE_PROT64 : cs_db
-   ? X86EMUL_MODE_PROT32 : X86EMUL_MODE_PROT16;
-   vcpu-arch.emulate_ctxt.guest_mode = is_guest_mode(vcpu);
+   ctxt-eflags = kvm_get_rflags(vcpu);
+   ctxt-eip = kvm_rip_read(vcpu);
+   ctxt-mode = (!is_protmode(vcpu))   ? X86EMUL_MODE_REAL :
+(ctxt-eflags  X86_EFLAGS_VM) ? X86EMUL_MODE_VM86 :
+cs_l   ? X86EMUL_MODE_PROT64 :
+cs_db  ? X86EMUL_MODE_PROT32 :
+ X86EMUL_MODE_PROT16;
+   ctxt-guest_mode = is_guest_mode(vcpu);
+
memset(c, 0, sizeof(struct decode_cache));
memcpy(c-regs, vcpu-arch.regs, sizeof c-regs);
vcpu-arch.emulate_regs_need_sync_from_vcpu = false;
-- 
1.7.4.1

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


  1   2   >