device passthrough
Hi, All: Is there any method to directly assign a device to Guest OS without VT-d? Thanks Mu-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 00/10] Redirect and make use of the guest serial console
Lucas Meneghel Rodrigues wrote:
> On Tue, 2010-05-11 at 17:03 +0800, Jason Wang wrote:
>> The guest console is useful for failure troubleshooting, especially
>> for failures that produce a call trace. And as we plan to push the
>> network related tests in the next few weeks, we found the serial
>> session is more reliable during the network testing. So this patchset
>> logs the guest serial output through the redirected serial port of the
>> guest and also enables the ability to log into the guest through the
>> serial console. I only open the serial console for linux; I will do
>> some investigation on windows guests.
>>
>> Changes from v1:
>> - Coding style improvement according to the suggestions from Michael
>>   Goldish
>> - Improve the username sending handling in remote_login()
>> - Change the matching re of login to [Ll]ogin:\s*$
>> - Check whether the vm is already dead in the dumping thread
>> - Return None rather than raise an exception when an unknown
>>   shell_client is met
>> - Keep tty0 for all linux guests
>> - Enable the serial console in unattended installation
>> - Add a helper to check whether panic information has occurred
>> - Keep the process() at its original location in preprocess()
>
> Jason, after a long conversation I've had with Michael during the
> previous week, we reached some common points:
>
> 1 - We believe it is possible to be able to both log in *and* log
> serial console output. That will require changes to kvm_subprocess and
> might take a little bit more time.

Yes, I've tried an ugly implementation of a server in serial_dump_thread (finally dropped from my patch), so I agree that it would be much more elegant if I can get the support from kvm_subprocess.

> 2 - We know you guys are depending on this patchset to be accepted in
> order to proceed with the network related cases. However, we ask for a
> little more patience, and we'd like to get your opinions on the patches
> that we are going to roll out. This way we can get to a better solution
> for all of us.
>
> So, please bear with us and I'll try to see with Michael and Dor if we
> can prioritize this work to not block work items for you guys.
>
> Cheers,
>
> Lucas

Ok, no problem.
Re: [PATCH 2/3] KVM test: Do not use the hard-coded address during unattended installation
Lucas Meneghel Rodrigues wrote:
> On Wed, 2010-05-19 at 17:20 +0800, Jason Wang wrote:
>> When we do the unattended installation in tap mode, we should use
>> vm.get_address() instead of 'localhost' in order to connect to the
>> finish program running in the guest.
>>
>> Signed-off-by: Jason Wang jasow...@redhat.com
>> ---
>>  client/tests/kvm/tests/unattended_install.py |   25 +
>>  1 files changed, 13 insertions(+), 12 deletions(-)
>>
>> diff --git a/client/tests/kvm/tests/unattended_install.py b/client/tests/kvm/tests/unattended_install.py
>> index e2cec8e..e71f993 100644
>> --- a/client/tests/kvm/tests/unattended_install.py
>> +++ b/client/tests/kvm/tests/unattended_install.py
>> @@ -17,7 +17,6 @@ def run_unattended_install(test, params, env):
>>      vm = kvm_test_utils.get_living_vm(env, params.get("main_vm"))
>>      port = vm.get_port(int(params.get("guest_port_unattended_install")))
>> -    addr = ('localhost', port)
>>      if params.get("post_install_delay"):
>>          post_install_delay = int(params.get("post_install_delay"))
>>      else:
>> @@ -31,17 +30,19 @@ def run_unattended_install(test, params, env):
>>      time_elapsed = 0
>>      while time_elapsed < install_timeout:
>>          client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>> -        try:
>> -            client.connect(addr)
>> -            msg = client.recv(1024)
>> -            if msg == 'done':
>> -                if post_install_delay:
>> -                    logging.debug("Post install delay specified, "
>> -                                  "waiting %ss...", post_install_delay)
>> -                    time.sleep(post_install_delay)
>> -                break
>> -        except socket.error:
>> -            pass
>> +        addr = vm.get_address()
>> +        if addr:
>
> ^ Per coding style, we should check for is None:
>
> if addr is not None:
>
>> +            try:
>> +                client.connect((addr, port))
>> +                msg = client.recv(1024)
>> +                if msg == 'done':
>> +                    if post_install_delay:
>> +                        logging.debug("Post install delay specified, "
>> +                                      "waiting %ss...", post_install_delay)
>> +                        time.sleep(post_install_delay)
>> +                    break
>> +            except socket.error:
>> +                pass
>
> ^ If vm.get_address() returns None, we'll have to fail the test; if we
> don't, we'll get a false PASS.

A VM may not get its IP address during startup, and because we have a timeout, I think it's safe here.

>>          time.sleep(1)
>>          client.close()
>>          end_time = time.time()
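The disagreement above is about what happens while vm.get_address() keeps returning None. A minimal Python sketch of one way to reconcile both views: tolerate a missing address during startup, but return an explicit failure once the timeout expires so the caller cannot report a false PASS. The names wait_for_install_done, get_address and probe are hypothetical stand-ins for illustration, not part of the autotest API.

```python
import time


def wait_for_install_done(get_address, probe, timeout, interval=1.0,
                          clock=time.time, sleep=time.sleep):
    """Poll until probe(addr) returns True or the timeout expires.

    get_address() may return None while the guest has no IP address yet;
    that case is tolerated and simply retried. Returns True on success
    and False on timeout, so the caller can fail the test explicitly.
    All names here are hypothetical, not part of the autotest API.
    """
    deadline = clock() + timeout
    while clock() < deadline:
        addr = get_address()
        if addr is not None:
            try:
                if probe(addr):
                    return True
            except OSError:
                # Connection refused or reset: the guest is not ready yet.
                pass
        sleep(interval)
    return False
```

A caller would raise a test failure when the function returns False, which covers the case where the guest never obtains an address.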
[PATCH 1/3] KVM test: Add the support of kernel and initrd option for qemu-kvm
-kernel option is useful for both unattended installation and the unittest in /kvm/user/test.

Signed-off-by: Jason Wang jasow...@redhat.com
---
 client/tests/kvm/kvm_vm.py |   10 ++
 1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/client/tests/kvm/kvm_vm.py b/client/tests/kvm/kvm_vm.py
index bca9d15..c7eed56 100755
--- a/client/tests/kvm/kvm_vm.py
+++ b/client/tests/kvm/kvm_vm.py
@@ -360,6 +360,16 @@ class VM:
             tftp = kvm_utils.get_path(root_dir, tftp)
             qemu_cmd += add_tftp(help, tftp)
 
+        kernel = params.get("kernel")
+        if kernel:
+            kernel = kvm_utils.get_path(root_dir, kernel)
+            qemu_cmd += " -kernel %s" % kernel
+
+        initrd = params.get("initrd")
+        if initrd:
+            initrd = kvm_utils.get_path(root_dir, initrd)
+            qemu_cmd += " -initrd %s" % initrd
+
         for redir_name in kvm_utils.get_sub_dict_names(params, "redirs"):
             redir_params = kvm_utils.get_sub_dict(params, redir_name)
             guest_port = int(redir_params.get("guest_port"))
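As a standalone illustration of what the hunk above does, here is a small Python sketch that appends the two options to a command string. The function add_boot_options is hypothetical, and os.path.join is only assumed to approximate kvm_utils.get_path(), which is not reproduced here.

```python
import os


def add_boot_options(qemu_cmd, params, root_dir):
    """Append -kernel/-initrd to qemu_cmd when params define them.

    Relative paths are resolved against root_dir and absolute paths are
    kept as they are. This is an assumption about how path resolution
    behaves, made for the sketch only.
    """
    for option in ("kernel", "initrd"):
        value = params.get(option)
        if value:
            path = os.path.join(root_dir, value)
            qemu_cmd += " -%s %s" % (option, path)
    return qemu_cmd
```

When neither parameter is set the command string is returned unchanged, matching the `if kernel:` and `if initrd:` guards in the patch.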
[PATCH 2/3] KVM test: Do not use the hard-coded address during unattended installation
When we do the unattended installation in tap mode, we should use vm.get_address() instead of 'localhost' in order to connect to the finish program running in the guest.

Signed-off-by: Jason Wang jasow...@redhat.com
---
 client/tests/kvm/tests/unattended_install.py |   25 +
 1 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/client/tests/kvm/tests/unattended_install.py b/client/tests/kvm/tests/unattended_install.py
index e2cec8e..8928575 100644
--- a/client/tests/kvm/tests/unattended_install.py
+++ b/client/tests/kvm/tests/unattended_install.py
@@ -17,7 +17,6 @@ def run_unattended_install(test, params, env):
     vm = kvm_test_utils.get_living_vm(env, params.get("main_vm"))
     port = vm.get_port(int(params.get("guest_port_unattended_install")))
-    addr = ('localhost', port)
     if params.get("post_install_delay"):
         post_install_delay = int(params.get("post_install_delay"))
     else:
@@ -31,17 +30,19 @@ def run_unattended_install(test, params, env):
     time_elapsed = 0
     while time_elapsed < install_timeout:
         client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
-        try:
-            client.connect(addr)
-            msg = client.recv(1024)
-            if msg == 'done':
-                if post_install_delay:
-                    logging.debug("Post install delay specified, "
-                                  "waiting %ss...", post_install_delay)
-                    time.sleep(post_install_delay)
-                break
-        except socket.error:
-            pass
+        addr = vm.get_address()
+        if addr is not None:
+            try:
+                client.connect((addr, port))
+                msg = client.recv(1024)
+                if msg == 'done':
+                    if post_install_delay:
+                        logging.debug("Post install delay specified, "
+                                      "waiting %ss...", post_install_delay)
+                        time.sleep(post_install_delay)
+                    break
+            except socket.error:
+                pass
         time.sleep(1)
         client.close()
         end_time = time.time()
[PATCH 3/3] KVM test: Add implementation of network based unattended installation
This patch lets the unattended installation be done through the following methods:
- unattended.cdrom: the original method, which does the installation from cdrom
- unattended.url: installing the linux guest from http or ftp; the tree url is specified through url
- unattended.nfs: installing the linux guest from nfs; the server address is specified through nfs_server, and the directory is specified through nfs_dir
- unattended.remote_ks: installing the linux guest through a remote kickstart file

For url and nfs installation, extra_params needs to be configured to specify the location of the unattended files:
- If the unattended file in the tree is used, extra_params= append ks=floppy and the unattended_file param need to be specified in the configuration file.
- If the unattended file located at a remote server is used, the unattended_file option must be none and extra_params= append ks=http://xxx needs to be specified in the configuration file. Don't forget to add the finish notification part.

The --kernel and --initrd options are used directly for the network installation instead of the tftp/bootp param because user mode network is too slow to do this. Only the unattended files for RHEL and Fedora guests are modified; others are kept unmodified and can still do the installation from cdrom.
Signed-off-by: Jason Wang jasow...@redhat.com --- client/tests/kvm/scripts/unattended.py | 107 +++- client/tests/kvm/tests_base.cfg.sample | 172 +++--- client/tests/kvm/unattended/Fedora-10.ks |2 client/tests/kvm/unattended/Fedora-11.ks |2 client/tests/kvm/unattended/Fedora-12.ks |2 client/tests/kvm/unattended/Fedora-8.ks |2 client/tests/kvm/unattended/Fedora-9.ks |2 client/tests/kvm/unattended/RHEL-3-series.ks |2 client/tests/kvm/unattended/RHEL-4-series.ks |2 client/tests/kvm/unattended/RHEL-5-series.ks |2 10 files changed, 206 insertions(+), 89 deletions(-) diff --git a/client/tests/kvm/scripts/unattended.py b/client/tests/kvm/scripts/unattended.py index fdadd03..0377d83 100755 --- a/client/tests/kvm/scripts/unattended.py +++ b/client/tests/kvm/scripts/unattended.py @@ -50,8 +50,9 @@ class UnattendedInstall(object): self.cdrom_iso = os.path.join(kvm_test_dir, cdrom_iso) self.floppy_mount = tempfile.mkdtemp(prefix='floppy_', dir='/tmp') self.cdrom_mount = tempfile.mkdtemp(prefix='cdrom_', dir='/tmp') -flopy_name = os.environ['KVM_TEST_floppy'] -self.floppy_img = os.path.join(kvm_test_dir, flopy_name) +self.nfs_mount = tempfile.mkdtemp(prefix='nfs_', dir='/tmp') +floppy_name = os.environ['KVM_TEST_floppy'] +self.floppy_img = os.path.join(kvm_test_dir, floppy_name) floppy_dir = os.path.dirname(self.floppy_img) if not os.path.isdir(floppy_dir): os.makedirs(floppy_dir) @@ -60,6 +61,16 @@ class UnattendedInstall(object): self.pxe_image = os.environ.get('KVM_TEST_pxe_image', '') self.pxe_initrd = os.environ.get('KVM_TEST_pxe_initrd', '') +self.medium = os.environ.get('KVM_TEST_medium', '') +self.url = os.environ.get('KVM_TEST_url', '') +self.kernel = os.environ.get('KVM_TEST_kernel', '') +self.initrd = os.environ.get('KVM_TEST_initrd', '') +self.nfs_server = os.environ.get('KVM_TEST_nfs_server', '') +self.nfs_dir = os.environ.get('KVM_TEST_nfs_dir', '') +self.image_path = kvm_test_dir +self.kernel_path = os.path.join(self.image_path, self.kernel) +self.initrd_path = 
os.path.join(self.image_path, self.initrd) + def create_boot_floppy(self): @@ -106,7 +117,8 @@ class UnattendedInstall(object): dest = os.path.join(self.floppy_mount, dest_fname) # Replace KVM_TEST_CDKEY (in the unattended file) with the cdkey -# provided for this test +# provided for this test and replace the KVM_TEST_MEDIUM with +# the tree url or nfs address provided for this test. unattended_contents = open(self.unattended_file).read() dummy_cdkey_re = r'\bKVM_TEST_CDKEY\b' real_cdkey = os.environ.get('KVM_TEST_cdkey') @@ -117,7 +129,20 @@ class UnattendedInstall(object): else: print (WARNING: 'cdkey' required but not specified for this unattended installation) + +dummy_re = r'\bKVM_TEST_MEDIUM\b' +if self.medium == cdrom: +content = cdrom +elif self.medium == url: +content = url --url %s % self.url +elif self.medium == nfs: +content = nfs --server=%s --dir=%s % (self.nfs_server, self.nfs_dir) +else: +raise SetupError(Unexpected installation medium %s % self.url) + +unattended_contents =
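The placeholder substitution that the truncated hunk above begins can be sketched in Python as follows. This is a minimal illustration under stated assumptions: install_source_line and fill_kickstart are hypothetical names, and the error message reports the medium itself, whereas the archived hunk formats self.url into it, which may be an oversight.

```python
import re


def install_source_line(medium, url="", nfs_server="", nfs_dir=""):
    """Return the kickstart installation-source line for a medium."""
    if medium == "cdrom":
        return "cdrom"
    if medium == "url":
        return "url --url %s" % url
    if medium == "nfs":
        return "nfs --server=%s --dir=%s" % (nfs_server, nfs_dir)
    raise ValueError("Unexpected installation medium %r" % medium)


def fill_kickstart(template, medium, **kwargs):
    """Replace the KVM_TEST_MEDIUM placeholder in a kickstart template."""
    line = install_source_line(medium, **kwargs)
    # A function replacement avoids backslash-escape processing in re.sub.
    return re.sub(r"\bKVM_TEST_MEDIUM\b", lambda m: line, template)
```

Keeping the medium-to-line mapping in its own function makes the unknown-medium case easy to test separately from the template handling.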
Re: [PATCHv2-RFC 0/2] virtio: put last seen used index into ring itself
On 05/26/10 21:50, Michael S. Tsirkin wrote:
> Here's a rewrite of the original patch with a new layout.
> I haven't tested it yet so no idea how this performs, but
> I think this addresses the cache bounce issue raised by Avi.
> Posting for early flames/comments.
>
> Generally, the Host end of the virtio ring doesn't need to see where
> Guest is up to in consuming the ring. However, to completely understand
> what's going on from the outside, this information must be exposed.
> For example, host can reduce the number of interrupts by detecting
> that the guest is currently handling previous buffers.
>
> We add a feature bit so the guest can tell the host that it's writing
> out the current value there, if it wants to use that.
>
> This differs from original approach in that the used index
> is put after avail index (they are typically written out together).
> To avoid cache bounces on descriptor access,
> and make future extensions easier, we put the ring itself at start of
> page, and move the control after it.

Hi Michael,

It looks pretty good to me, however one thing I have been thinking of while reading through it:

Rather than storing a pointer within the ring struct, pointing into a position within the same struct, how about storing a byte offset instead and using a cast to get to the pointer position? That would avoid the pointer dereference, which is less effective cache wise and harder for the CPU to predict.

Not sure whether it really matters performance wise, just a thought.

Cheers,
Jes
Re: [Qemu-devel] [PATCH 0/2] Fix scsi-generic breakage in upstream qemu-kvm.git
On 27.05.2010 17:56, Nicholas A. Bellinger wrote:
> On Thu, 2010-05-20 at 15:18 +0200, Kevin Wolf wrote:
>> On 17.05.2010 18:45, Nicholas A. Bellinger wrote:
>>> From: Nicholas Bellinger n...@linux-iscsi.org
>>>
>>> Greetings,
>>>
>>> Attached are the updated patches following hch's comments to fix
>>> scsi-generic device breakage with find_image_format() and
>>> refresh_total_sectors().
>>>
>>> These are being resent as the last attachments were in MBOX format
>>> from git-format-patch.
>>>
>>> Signed-off-by: Nicholas A. Bellinger n...@linux-iscsi.org
>>
>> Thanks, applied all to the block branch, even though I forgot to
>> reply here.
>>
>> Kevin
>
> Hi Kevin,
>
> Thanks for accepting the series. There is one more piece of breakage
> that Chris Krumme found in block.c:find_image_format() in the original
> patch. Please apply the patch to add the missing bdrv_delete() for the
> SG_IO case below.
>
> Thanks for pointing this out Chris!

Right, thanks for the fix. I've applied it to the block branch.

Kevin
Question about IOMMU
Hi,

I'm trying to use KVM virtualization at home and I've run into what I think is a limitation of my hardware. I'm trying to pass a PCI card (WinTV NOVA-T 500) through to a guest OS but I get the error 'IOMMU not found'.

Now I've read that this is because my motherboard does not have an IOMMU to perform hardware accelerated device virtualization. I have an AMD based system and apparently they call it AMD-Vi and Intel call it VT-d.

Now, my 790GX based board does not have an IOMMU, but the latest chipset, the 890FX, apparently does; it has IOMMU v1.2.

Before I go and spend lots of money on what is a very expensive board, can somebody confirm that an AMD 890FX based board with a Phenom II X3 processor with Ubuntu 10.04 server 64bit using KVM/QEMU will allow me to pass a PCI card through to the guest OS?

On another note, now that I've subscribed to this mailing list I notice that there is an email named "AMD IOMMU Emulation". Is that what I think it is, and can I compile up a patched version for my server?

Many Thanks,
James.
[PATCH] KVM test: Measure the timedrift after continuing a stopped vm
This test extends the timedifrt test and measures the timedirft across the vm stopping and continuing. Two helpers function are also added in kvm_test_utils to do the stop and continue. Signed-off-by: Jason Wang jasow...@redhat.com --- client/tests/kvm/tests/timedrift_with_stop.py | 97 + 1 files changed, 97 insertions(+), 0 deletions(-) create mode 100644 client/tests/kvm/tests/timedrift_with_stop.py diff --git a/client/tests/kvm/kvm_test_utils.py b/client/tests/kvm/kvm_test_utils.py index 24e2bf5..7e158a3 100644 --- a/client/tests/kvm/kvm_test_utils.py +++ b/client/tests/kvm/kvm_test_utils.py @@ -181,6 +181,30 @@ def migrate(vm, env=None): raise +def stop(vm): + +Stop a running vm + +s, o = vm.send_monitor_cmd(stop) +if s != 0: +raise error.TestError(Could not send the stop command) +s, o = vm.send_monitor_cmd(info status) +if paused not in o: +raise error.TestFail(VM does not stop afer send stop command) + + +def cont(vm): + +Continue a stopped vm + +s, o = vm.send_monitor_cmd(cont) +if s != 0: +raise error.TestError(Could not send the cont command) +s, o = vm.send_monitor_cmd(info status) +if running not in o: +raise error.TestFail(VM still in paused status after sending cont) + + def get_time(session, time_command, time_filter_re, time_format): Return the host time and guest time. If the guest time cannot be fetched diff --git a/client/tests/kvm/tests/timedrift_with_stop.py b/client/tests/kvm/tests/timedrift_with_stop.py new file mode 100644 index 000..b99dd40 --- /dev/null +++ b/client/tests/kvm/tests/timedrift_with_stop.py @@ -0,0 +1,97 @@ +import logging, time, commands, re +from autotest_lib.client.common_lib import error +import kvm_subprocess, kvm_test_utils, kvm_utils + + +def run_timedrift_with_stop(test, params, env): + +Time drift test with stop/continue the guest: + +1) Log into a guest. +2) Take a time reading from the guest and host. 
+3) Stop the running of the guest +4) Sleep for a while +5) Continue the guest running +6) Take a second time reading. +7) If the drift (in seconds) is higher than a user specified value, fail. + +@param test: KVM test object. +@param params: Dictionary with test parameters. +@param env: Dictionary with the test environment. + +login_timeout = int(params.get(login_timeout, 360)) +vm = kvm_test_utils.get_living_vm(env, params.get(main_vm)) +session = kvm_test_utils.wait_for_login(vm, timeout=login_timeout) + +# Collect test parameters: +# Command to run to get the current time +time_command = params.get(time_command) +# Filter which should match a string to be passed to time.strptime() +time_filter_re = params.get(time_filter_re) +# Time format for time.strptime() +time_format = params.get(time_format) +drift_threshold = float(params.get(drift_threshold, 10)) +drift_threshold_single = float(params.get(drift_threshold_single, 3)) +stop_iterations = int(params.get(stop_iterations, 1)) +stop_time = int(params.get(stop_time, 60)) + +try: +# Get initial time +# (ht stands for host time, gt stands for guest time) +(ht0, gt0) = kvm_test_utils.get_time(session, time_command, + time_filter_re, time_format) + +# Stop the guest +for i in range(stop_iterations): +# Get time before current iteration +(ht0_, gt0_) = kvm_test_utils.get_time(session, time_command, + time_filter_re, time_format) +# Run current iteration +logging.info(Stop %s second: iteration %d of %d... 
% + (stop_time, i + 1, stop_iterations)) + +kvm_test_utils.stop(vm) +time.sleep(stop_time) +kvm_test_utils.cont(vm) + +# Get time after current iteration +(ht1_, gt1_) = kvm_test_utils.get_time(session, time_command, + time_filter_re, time_format) +# Report iteration results +host_delta = ht1_ - ht0_ +guest_delta = gt1_ - gt0_ +drift = abs(host_delta - guest_delta) +logging.info(Host duration (iteration %d): %.2f % + (i + 1, host_delta)) +logging.info(Guest duration (iteration %d): %.2f % + (i + 1, guest_delta)) +logging.info(Drift at iteration %d: %.2f seconds % + (i + 1, drift)) +# Fail if necessary +if drift drift_threshold_single: +raise error.TestFail(Time drift too large at iteration %d: + %.2f seconds % (i + 1, drift)) + +# Get final time +(ht1, gt1) =
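The per-iteration check in the test above compares how much time passed on the host with how much passed in the guest. A minimal Python sketch of that arithmetic follows; check_drift is a hypothetical helper, and it assumes the comparison operator lost from the archived text ("if drift drift_threshold_single") was `>`.

```python
def check_drift(host_before, guest_before, host_after, guest_after,
                threshold):
    """Return the drift in seconds, raising if it exceeds the threshold.

    Drift is the absolute difference between the time elapsed on the
    host and the time elapsed in the guest over the same interval.
    """
    host_delta = host_after - host_before
    guest_delta = guest_after - guest_before
    drift = abs(host_delta - guest_delta)
    if drift > threshold:
        raise RuntimeError("Time drift too large: %.2f seconds" % drift)
    return drift
```

For example, if the host advances 60 seconds while the guest advances 59, the drift is 1 second, which passes with the default single-iteration threshold of 3 used in the patch.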
KVM: MMU: always invalidate and flush on spte page size change
Always invalidate spte and flush TLBs when changing page size, to make sure different sized translations for the same address are never cached in a CPU's TLB.

The first case where this occurs is when a non-leaf spte pointer is overwritten by a leaf, large spte entry. This can happen after dirty logging is disabled on a memslot, for example.

The second case is a leaf, large spte entry is overwritten with a non-leaf spte pointer, in __direct_map. Note this cannot happen now because the only potential source of such overwrite is dirty logging being enabled, which zaps all MMU pages. But this might change in the future, so better be robust against it.

Noticed by Andrea.

KVM-Stable-Tag
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: kvm/arch/x86/kvm/mmu.c
===================================================================
--- kvm.orig/arch/x86/kvm/mmu.c
+++ kvm/arch/x86/kvm/mmu.c
@@ -1952,6 +1952,8 @@ static void mmu_set_spte(struct kvm_vcpu
 
 			child = page_header(pte & PT64_BASE_ADDR_MASK);
 			mmu_page_remove_parent_pte(child, sptep);
+			__set_spte(sptep, shadow_trap_nonpresent_pte);
+			kvm_flush_remote_tlbs(vcpu->kvm);
 		} else if (pfn != spte_to_pfn(*sptep)) {
 			pgprintk("hfn old %lx new %lx\n",
 				 spte_to_pfn(*sptep), pfn);
@@ -2015,6 +2017,16 @@ static int __direct_map(struct kvm_vcpu
 			break;
 		}
 
+		if (is_shadow_present_pte(*iterator.sptep) &&
+		    !is_large_pte(*iterator.sptep))
+			continue;
+
+		if (is_large_pte(*iterator.sptep)) {
+			rmap_remove(vcpu->kvm, iterator.sptep);
+			__set_spte(iterator.sptep, shadow_trap_nonpresent_pte);
+			kvm_flush_remote_tlbs(vcpu->kvm);
+		}
+
 		if (*iterator.sptep == shadow_trap_nonpresent_pte) {
 			u64 base_addr = iterator.addr;
 
Re: [PATCH 2/3] workqueue: Add an API to create a singlethread workqueue attached to the current task's cgroup
On Thu, May 27, 2010 at 11:20:22PM +0200, Tejun Heo wrote:
> Hello, Michael.
>
> On 05/27/2010 07:32 PM, Michael S. Tsirkin wrote:
>> Well, this is why I proposed adding a new API for creating
>> workqueue within workqueue.c, rather than exposing the task
>> and attaching it to cgroups in our driver: so that workqueue
>> maintainers can fix the implementation if it ever changes.
>>
>> And after all, it's an internal API, we can always change
>> it later if we need.
> ...
>> Well, yes but we are using APIs like flush_work etc. These are very
>> handy. It seems much easier than rolling our own queue on top of
>> kthread.
>
> The thing is that this kind of one-off usage becomes problematic when
> you're trying to change the implementation detail. All current
> workqueue users don't care which thread they run on and they shouldn't
> as each work owns the context only for the duration the work is
> executing. If this sort of fundamental guideline is followed, the
> implementation can be improved in a pretty much transparent way, but
> when you start depending on specific implementation details, things
> become messy pretty quickly.
>
> If this type of usage were more common, adding a proper way to account
> work usage according to cgroups would make sense, but that's not the
> case here. I removed the only exception case recently while trying to
> implement cmwq, so if this is added, it would be the only one which
> makes such extra assumptions in the whole kernel. One way or the
> other, workqueue needs to be improved and I don't really think adding
> the single exception at this point is a good idea.
>
> The thing I realized after the stop_machine conversion was that there
> was no reason to use workqueue there at all. There already are more
> than enough not-too-difficult synchronization constructs and if you're
> using a thread for dedicated purposes, code complexity isn't that
> different either way. Plus, it would also be clearer that dedicated
> threads are required there, and for what reason. So, I strongly
> suggest using a kthread. If there are issues which are noticeably
> difficult to solve with kthread, we can definitely talk about that and
> think about things again.
>
> Thank you.

Well, we have create_singlethread_workqueue, right? This is not very different ... is it?

Just copying structures and code from workqueue.c, adding vhost_ in front of it will definitely work: there is nothing magic about the workqueue library. But this just involves cut and paste which might be best avoided.

One final idea before we go the cut and paste way: how about 'create_workqueue_from_task' that would get a thread and have workqueue run there?

> --
> tejun
Re: [PATCH 2/3] workqueue: Add an API to create a singlethread workqueue attached to the current task's cgroup
Hello,

On 05/28/2010 05:08 PM, Michael S. Tsirkin wrote:
> Well, we have create_singlethread_workqueue, right?
> This is not very different ... is it?
>
> Just copying structures and code from workqueue.c,
> adding vhost_ in front of it will definitely work:

Sure it will, but you'll probably be able to get away with much less.

> there is nothing magic about the workqueue library.
> But this just involves cut and paste which might be best avoided.

What I'm saying is that some magic needs to be added to workqueue and if you add this single(!) exception, it will have to be backed out pretty soon, so it would be better to do it properly now.

> One final idea before we go the cut and paste way: how about
> 'create_workqueue_from_task' that would get a thread and have workqueue
> run there?

You can currently depend on that implementation detail but it's not what the workqueue interface is meant to do. The single threadedness is there as an execution ordering and concurrency specification and it doesn't (or rather won't) necessarily mean that a specific single thread is bound to a certain workqueue.

Can you please direct me to the code so I can have a look? I'll be happy to do the conversion for you.

Thanks.

--
tejun
Re: [Qemu-devel] [PATCH 1/1] ceph/rbd block driver for qemu-kvm (v2)
On 27.05.2010 21:11, Christian Brunner wrote:
> This is a block driver for the distributed file system Ceph
> (http://ceph.newdream.net/). This driver uses librados (which
> is part of the Ceph server) for direct access to the Ceph object
> store and is running entirely in userspace. Therefore it is
> called rbd - rados block device.
>
> To compile the driver a recent version of ceph (unstable/testing git
> head or 0.20.3 once it is released) is needed and you have to
> --enable-rbd when running configure.
>
> Additional information is available on the Ceph-Wiki:
>
> http://ceph.newdream.net/wiki/Kvm-rbd
>
> The patch is based on git://repo.or.cz/qemu/kevin.git block

Signed-off-by line is missing.

> ---
>  Makefile          |    3 +
>  Makefile.objs     |    1 +
>  block/rbd.c       |  584 +
>  block/rbd_types.h |   52 +
>  configure         |   27 +++
>  5 files changed, 667 insertions(+), 0 deletions(-)
>  create mode 100644 block/rbd.c
>  create mode 100644 block/rbd_types.h
>
> diff --git a/Makefile b/Makefile
> index 7986bf6..8d09612 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -27,6 +27,9 @@ configure: ;
>  $(call set-vpath, $(SRC_PATH):$(SRC_PATH)/hw)
>
>  LIBS+=-lz $(LIBS_TOOLS)
> +ifdef CONFIG_RBD
> +LIBS+=-lrados
> +endif

You already write the -lrados option to config-host.mak in configure, so this looks unnecessary.
ifdef BUILD_DOCS DOCS=qemu-doc.html qemu-tech.html qemu.1 qemu-img.1 qemu-nbd.8 diff --git a/Makefile.objs b/Makefile.objs index 1a942e5..08dc11f 100644 --- a/Makefile.objs +++ b/Makefile.objs @@ -18,6 +18,7 @@ block-nested-y += parallels.o nbd.o blkdebug.o block-nested-$(CONFIG_WIN32) += raw-win32.o block-nested-$(CONFIG_POSIX) += raw-posix.o block-nested-$(CONFIG_CURL) += curl.o +block-nested-$(CONFIG_RBD) += rbd.o block-obj-y += $(addprefix block/, $(block-nested-y)) diff --git a/block/rbd.c b/block/rbd.c new file mode 100644 index 000..375ae9d --- /dev/null +++ b/block/rbd.c @@ -0,0 +1,584 @@ +/* + * QEMU Block driver for RADOS (Ceph) + * + * Copyright (C) 2010 Christian Brunner c...@muc.de + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. + * + */ + +#include qemu-common.h +#include sys/types.h +#include stdbool.h + +#include qemu-common.h + +#include rbd_types.h +#include module.h +#include block_int.h + +#include stdio.h +#include stdlib.h +#include rados/librados.h + +#include signal.h + +/* + * When specifying the image filename use: + * + * rbd:poolname/devicename + * + * poolname must be the name of an existing rados pool + * + * devicename is the basename for all objects used to + * emulate the raw device. + * + * Metadata information (image size, ...) is stored in an + * object with the name devicename.rbd. + * + * The raw device is split into 4MB sized objects by default. + * The sequencenumber is encoded in a 12 byte long hex-string, + * and is attached to the devicename, separated by a dot. + * e.g. 
devicename.1234567890ab + * + */ + +#define OBJ_MAX_SIZE (1UL OBJ_DEFAULT_OBJ_ORDER) + +typedef struct RBDAIOCB { +BlockDriverAIOCB common; +QEMUBH *bh; +int ret; +QEMUIOVector *qiov; +char *bounce; +int write; +int64_t sector_num; +int aiocnt; +int error; +} RBDAIOCB; + +typedef struct RADOSCB { +int rcbid; +RBDAIOCB *acb; +int done; +int64_t segsize; +char *buf; +} RADOSCB; + +typedef struct RBDRVRBDState { +rados_pool_t pool; +char name[RBD_MAX_OBJ_NAME_SIZE]; +int name_len; name_len looks unused. +uint64_t size; +uint64_t objsize; +} RBDRVRBDState; Hm, you mean BDRVRBDState? Maybe ceph would have been a better driver name to avoid such type names. ;-) + +typedef struct rbd_obj_header_ondisk RbdHeader1; + +static int rbd_parsename(const char *filename, char *pool, char *name) +{ +const char *rbdname; +char *p, *n; +int l; + +if (!strstart(filename, rbd:, rbdname)) { +return -EINVAL; +} + +pstrcpy(pool, 2 * RBD_MAX_SEG_NAME_SIZE, rbdname); Why twice the size? The callers pass a char[RBD_MAX_SEG_NAME_SIZE], so doesn't this allow buffer overflows? +p = strchr(pool, '/'); +if (p == NULL) { +return -EINVAL; +} + +*p = '\0'; +n = ++p; Why introduce a new variable here? p isn't used any more afterwards. + +l = strlen(n); + +if (l RBD_MAX_OBJ_NAME_SIZE) { +fprintf(stderr, object name to long\n); Off by one, you need to consider the trailing '\0'. Also, please use error_report instead of fprintf(stderr, ...) for real error messages. Directly printing to stderr is okay for debug code. +return -EINVAL; +} else if (l = 0) { +fprintf(stderr, object name to short\n); +return -EINVAL; +} + +strcpy(name, n); + +return l;
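The naming scheme described in the driver's header comment (an image given as rbd:poolname/devicename, split into 4MB objects named devicename plus a 12-character hex sequence number) can be sketched in Python as follows. Both function names are hypothetical, the zero-padded lowercase hex format is an assumption made for this sketch, and the empty-pool check is stricter than the quoted C code.

```python
def parse_rbd_filename(filename):
    """Split 'rbd:poolname/devicename' into (pool, name)."""
    prefix = "rbd:"
    if not filename.startswith(prefix):
        raise ValueError("not an rbd filename: %r" % filename)
    pool, sep, name = filename[len(prefix):].partition("/")
    if not sep or not pool or not name:
        raise ValueError("expected rbd:poolname/devicename: %r" % filename)
    return pool, name


def object_name(devicename, offset, obj_order=22):
    """Name of the object holding byte `offset` of the device.

    obj_order=22 gives 4 MB objects (1 << 22 bytes), the default size
    stated in the driver comment. The zero-padded lowercase hex format
    is an assumption made for this sketch.
    """
    seq = offset >> obj_order
    return "%s.%012x" % (devicename, seq)
```

Python strings carry their own length, so the off-by-one and buffer-size concerns raised in the review apply only to the C implementation, not to this sketch.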
Re: Perf trace event parse errors for KVM events
I get parse errors when using Steven Rostedt's trace-cmd tool, too. Any ideas what is going on here? I can provide more info (e.g. trace files) if necessary.

Stefan
[Reminder] KVM Forum 2010: Early Bird Registration
Just a reminder: the early bird registration period ends on May 30th. It's shaping up to be an excellent KVM Forum, and we look forward to seeing you there. Registration link is here: http://events.linuxfoundation.org/component/registrationpro/?func=detailsdid=34 thanks, -KVM Forum 2010 Program Committee
Re: device passthrough
* Mu Lin (m...@juniper.net) wrote: Is there any method to directly assign a device to Guest OS without VT-d? Assuming you mean a PCI device, no, there isn't. Without an IOMMU[1] you can't directly assign a PCI device to a guest (nor is it safe). There have been patches floating around to allow this, but they don't maintain secure isolation. thanks, -chris [1] VT-d is an Intel chipset feature, so you could certainly do it on an AMD platform that has an AMD IOMMU.
Re: [Qemu-devel] Re: Another SIGFPE in display code, now in cirrus
12.05.2010 22:11, Stefano Stabellini wrote: On Wed, 12 May 2010, Jamie Lokier wrote: Stefano Stabellini wrote: On Wed, 12 May 2010, Avi Kivity wrote: It's useful if you have a one-line horizontal pattern you want to propagate all over. It might be useful all right, but it is not entirely clear what the hardware should do in this situation from the documentation we have, and certainly the current state of the cirrus emulation code doesn't help. It's quite a reasonable thing for hardware to do, even if not documented. It would be surprising if the hardware didn't copy the one-line pattern. All right then, you convinced me :) This is my proposed solution, however it is untested with Windows NT. Signed-off-by: Stefano Stabellinistefano.stabell...@eu.citrix.com So.. what's the status of this, after all? :) As far as I can tell, it has been agreed that the patch is good, and verified that it fixes the problem. Should we just throw it away and start from scratch, or what? :) Thanks! diff --git a/hw/cirrus_vga.c b/hw/cirrus_vga.c index 9f61a01..a7f0d3c 100644 --- a/hw/cirrus_vga.c +++ b/hw/cirrus_vga.c @@ -676,15 +676,17 @@ static void cirrus_do_copy(CirrusVGAState *s, int dst, int src, int w, int h) int sx, sy; int dx, dy; int width, height; +uint32_t start_addr, line_offset, line_compare; int depth; int notify = 0; depth = s-vga.get_bpp(s-vga) / 8; s-vga.get_resolution(s-vga,width,height); +s-vga.get_offsets(s-vga,line_offset,start_addr,line_compare); /* extra x, y */ -sx = (src % ABS(s-cirrus_blt_srcpitch)) / depth; -sy = (src / ABS(s-cirrus_blt_srcpitch)); +sx = (src % line_offset) / depth; +sy = (src / line_offset); dx = (dst % ABS(s-cirrus_blt_dstpitch)) / depth; dy = (dst / ABS(s-cirrus_blt_dstpitch)); @@ -725,18 +727,23 @@ static void cirrus_do_copy(CirrusVGAState *s, int dst, int src, int w, int h) s-cirrus_blt_dstpitch, s-cirrus_blt_srcpitch, s-cirrus_blt_width, s-cirrus_blt_height); -if (notify) - qemu_console_copy(s-vga.ds, - sx, sy, dx, dy, - 
s-cirrus_blt_width / depth, - s-cirrus_blt_height); - -/* we don't have to notify the display that this portion has - changed since qemu_console_copy implies this */ - -cirrus_invalidate_region(s, s-cirrus_blt_dstaddr, - s-cirrus_blt_dstpitch, s-cirrus_blt_width, - s-cirrus_blt_height); + if (ABS(s-cirrus_blt_dstpitch) != line_offset || + ABS(s-cirrus_blt_srcpitch) != line_offset) { + /* this is not going to happen very often */ + vga_hw_invalidate(); + } else { + if (notify) + /* we don't have to notify the display that this portion has +changed since qemu_console_copy implies this */ + qemu_console_copy(s-vga.ds, + sx, sy, dx, dy, + s-cirrus_blt_width / depth, + s-cirrus_blt_height); + else + cirrus_invalidate_region(s, s-cirrus_blt_dstaddr, + s-cirrus_blt_dstpitch, s-cirrus_blt_width, + s-cirrus_blt_height); + } } static int cirrus_bitblt_videotovideo_copy(CirrusVGAState * s) diff --git a/hw/cirrus_vga_rop.h b/hw/cirrus_vga_rop.h index 39a7b72..80f135b 100644 --- a/hw/cirrus_vga_rop.h +++ b/hw/cirrus_vga_rop.h @@ -32,10 +32,10 @@ glue(cirrus_bitblt_rop_fwd_, ROP_NAME)(CirrusVGAState *s, dstpitch -= bltwidth; srcpitch -= bltwidth; -if (dstpitch 0 || srcpitch 0) { -/* is 0 valid? 
srcpitch == 0 could be useful */ +if (dstpitch 0) return; -} +if (srcpitch 0) +srcpitch = 0; for (y = 0; y bltheight; y++) { for (x = 0; x bltwidth; x++) { @@ -57,6 +57,12 @@ glue(cirrus_bitblt_rop_bkwd_, ROP_NAME)(CirrusVGAState *s, int x,y; dstpitch += bltwidth; srcpitch += bltwidth; + +if (dstpitch 0) +return; +if (srcpitch 0) +srcpitch = 0; + for (y = 0; y bltheight; y++) { for (x = 0; x bltwidth; x++) { ROP_OP(*dst, *src); @@ -78,6 +84,12 @@ glue(glue(cirrus_bitblt_rop_fwd_transp_, ROP_NAME),_8)(CirrusVGAState *s, uint8_t p; dstpitch -= bltwidth; srcpitch -= bltwidth; + +if (dstpitch 0) +return; +if (srcpitch 0) +srcpitch = 0; + for (y = 0; y bltheight; y++) { for (x = 0; x bltwidth; x++) { p = *dst; @@ -101,6 +113,12 @@ glue(glue(cirrus_bitblt_rop_bkwd_transp_, ROP_NAME),_8)(CirrusVGAState *s, uint8_t p; dstpitch += bltwidth; srcpitch += bltwidth; + +if (dstpitch 0) +return; +if (srcpitch 0) +srcpitch = 0; + for (y = 0;
Re: [Qemu-devel] [PATCH 1/1] ceph/rbd block driver for qemu-kvm (v2)
Hi Kevin,

thanks for your review notes. Yehuda and I have already worked this into the git tree on the ceph site. I'll do some testing on Monday. After that I'll send an updated patch.

Regards,
Christian

2010/5/28 Kevin Wolf kw...@redhat.com:

On 27.05.2010 21:11, Christian Brunner wrote:

This is a block driver for the distributed file system Ceph (http://ceph.newdream.net/). This driver uses librados (which is part of the Ceph server) for direct access to the Ceph object store and is running entirely in userspace. Therefore it is called rbd - rados block device.

To compile the driver a recent version of ceph (unstable/testing git head or 0.20.3 once it is released) is needed and you have to --enable-rbd when running configure.

Additional information is available on the Ceph-Wiki: http://ceph.newdream.net/wiki/Kvm-rbd

The patch is based on git://repo.or.cz/qemu/kevin.git block

Signed-off-by line is missing.

---
 Makefile          |   3 +
 Makefile.objs     |   1 +
 block/rbd.c       | 584 +
 block/rbd_types.h |  52 +
 configure         |  27 +++
 5 files changed, 667 insertions(+), 0 deletions(-)
 create mode 100644 block/rbd.c
 create mode 100644 block/rbd_types.h

diff --git a/Makefile b/Makefile
index 7986bf6..8d09612 100644
--- a/Makefile
+++ b/Makefile
@@ -27,6 +27,9 @@ configure: ;
 $(call set-vpath, $(SRC_PATH):$(SRC_PATH)/hw)

 LIBS+=-lz $(LIBS_TOOLS)

+ifdef CONFIG_RBD
+LIBS+=-lrados
+endif

You already write the -lrados option to config-host.mak in configure, so this looks unnecessary.

 ifdef BUILD_DOCS
 DOCS=qemu-doc.html qemu-tech.html qemu.1 qemu-img.1 qemu-nbd.8
diff --git a/Makefile.objs b/Makefile.objs
index 1a942e5..08dc11f 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -18,6 +18,7 @@ block-nested-y += parallels.o nbd.o blkdebug.o
 block-nested-$(CONFIG_WIN32) += raw-win32.o
 block-nested-$(CONFIG_POSIX) += raw-posix.o
 block-nested-$(CONFIG_CURL) += curl.o
+block-nested-$(CONFIG_RBD) += rbd.o

 block-obj-y += $(addprefix block/, $(block-nested-y))

diff --git a/block/rbd.c b/block/rbd.c
new file mode 100644
index 000..375ae9d
--- /dev/null
+++ b/block/rbd.c
@@ -0,0 +1,584 @@
+/*
+ * QEMU Block driver for RADOS (Ceph)
+ *
+ * Copyright (C) 2010 Christian Brunner c...@muc.de
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu-common.h"
+#include <sys/types.h>
+#include <stdbool.h>
+
+#include <qemu-common.h>
+
+#include "rbd_types.h"
+#include "module.h"
+#include "block_int.h"
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <rados/librados.h>
+
+#include <signal.h>
+
+/*
+ * When specifying the image filename use:
+ *
+ * rbd:poolname/devicename
+ *
+ * poolname must be the name of an existing rados pool
+ *
+ * devicename is the basename for all objects used to
+ * emulate the raw device.
+ *
+ * Metadata information (image size, ...) is stored in an
+ * object with the name devicename.rbd.
+ *
+ * The raw device is split into 4MB sized objects by default.
+ * The sequencenumber is encoded in a 12 byte long hex-string,
+ * and is attached to the devicename, separated by a dot.
+ * e.g. devicename.1234567890ab
+ *
+ */
+
+#define OBJ_MAX_SIZE (1UL << OBJ_DEFAULT_OBJ_ORDER)
+
+typedef struct RBDAIOCB {
+    BlockDriverAIOCB common;
+    QEMUBH *bh;
+    int ret;
+    QEMUIOVector *qiov;
+    char *bounce;
+    int write;
+    int64_t sector_num;
+    int aiocnt;
+    int error;
+} RBDAIOCB;
+
+typedef struct RADOSCB {
+    int rcbid;
+    RBDAIOCB *acb;
+    int done;
+    int64_t segsize;
+    char *buf;
+} RADOSCB;
+
+typedef struct RBDRVRBDState {
+    rados_pool_t pool;
+    char name[RBD_MAX_OBJ_NAME_SIZE];
+    int name_len;

name_len looks unused.

+    uint64_t size;
+    uint64_t objsize;
+} RBDRVRBDState;

Hm, you mean BDRVRBDState? Maybe ceph would have been a better driver name to avoid such type names. ;-)

+
+typedef struct rbd_obj_header_ondisk RbdHeader1;
+
+static int rbd_parsename(const char *filename, char *pool, char *name)
+{
+    const char *rbdname;
+    char *p, *n;
+    int l;
+
+    if (!strstart(filename, "rbd:", &rbdname)) {
+        return -EINVAL;
+    }
+
+    pstrcpy(pool, 2 * RBD_MAX_SEG_NAME_SIZE, rbdname);

Why twice the size? The callers pass a char[RBD_MAX_SEG_NAME_SIZE], so doesn't this allow buffer overflows?

+    p = strchr(pool, '/');
+    if (p == NULL) {
+        return -EINVAL;
+    }
+
+    *p = '\0';
+    n = ++p;

Why introduce a new variable here? p isn't used any more afterwards.

+
+    l = strlen(n);
+
+    if (l > RBD_MAX_OBJ_NAME_SIZE) {
+        fprintf(stderr, "object name to long\n");

Off by one, you need to consider the trailing '\0'. Also, please use error_report instead of fprintf(stderr, ...) for real error
Re: Perf trace event parse errors for KVM events
On Fri, May 28, 2010 at 05:42:51PM +0100, Stefan Hajnoczi wrote: I get parse errors when using Steven Rostedt's trace-cmd tool, too. Any ideas what is going on here? I can provide more info (e.g. trace files) if necessary. Non standard print_format for the problematic entries?
Re: Perf trace event parse errors for KVM events
On Fri, 2010-05-28 at 17:42 +0100, Stefan Hajnoczi wrote: I get parse errors when using Steven Rostedt's trace-cmd tool, too. Any ideas what is going on here? I can provide more info (e.g. trace files) if necessary. Does trace-cmd fail on the same tracepoints? Have you checked out the latest code? I do know it fails on some of the KVM tracepoints since the formatting they use is obnoxious. Could you show the print-fmt of the failing events? Thanks, -- Steve
RE: device passthrough
Thanks, Chris. Do you know where the patch is? I just need something quick and dirty for now: my shiny new board does have VT-d, but the BIOS is not ready yet, and I want to have something working now. Mu -Original Message- From: Chris Wright [mailto:chr...@sous-sol.org] Sent: Friday, May 28, 2010 12:33 PM To: Mu Lin Cc: kvm@vger.kernel.org Subject: Re: device passthrough * Mu Lin (m...@juniper.net) wrote: Is there any method to directly assign a device to Guest OS without VT-d? Assuming you mean a PCI device, no, there isn't. Without an IOMMU[1] you can't directly assign a PCI device to a guest (nor is it safe). There have been patches floating around to allow this, but they don't maintain secure isolation. thanks, -chris [1] VT-d is an Intel chipset feature, so you could certainly do it on an AMD platform that has an AMD IOMMU.
Re: device passthrough
* Mu Lin (m...@juniper.net) wrote: Do you know where is the patch, I just need something quick and dirty for now, my shining new board does have VT-d but the BIOS is not ready yet, I want to have something working now. Sorry, I don't have a handy pointer. You can search for either pv dma changes (paravirtualizing the guest's request for dma addrs so that it gets host physical addr to program card for dma) or reserved-ram for pci-passthrough (1:1 mapping of guest to host physical memory). I don't recall a recent attempt to bring them forward, so expect anything you find to be quite stale. thanks, -chris
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
Hi,

On Fri, 28 May 2010 16:07:38 -0700 Tom Lyon wrote:

Missing diffstat -p1 -w 70:

 Documentation/vfio.txt         | 176
 MAINTAINERS                    |   7
 drivers/Kconfig                |   2
 drivers/Makefile               |   1
 drivers/vfio/Kconfig           |   9
 drivers/vfio/Makefile          |   5
 drivers/vfio/vfio_dma.c        | 372 ++
 drivers/vfio/vfio_intrs.c      | 189 +
 drivers/vfio/vfio_main.c       | 627 +++
 drivers/vfio/vfio_pci_config.c | 554 +++
 drivers/vfio/vfio_rdwr.c       | 147 +++
 drivers/vfio/vfio_sysfs.c      | 153 +++
 include/linux/vfio.h           | 193 +
 13 files changed, 2435 insertions(+)

which shows that the patch is missing an update to Documentation/ioctl/ioctl-number.txt for ioctl code ';'. Please add that.

diff -uprN linux-2.6.34/drivers/vfio/Kconfig vfio-linux-2.6.34/drivers/vfio/Kconfig
--- linux-2.6.34/drivers/vfio/Kconfig 1969-12-31 16:00:00.0 -0800
+++ vfio-linux-2.6.34/drivers/vfio/Kconfig 2010-05-27 17:07:25.0 -0700
@@ -0,0 +1,9 @@
+menuconfig VFIO
+ tristate "Non-Priv User Space PCI drivers"
Non-privileged
+ depends on PCI
+ help
+ Driver to allow advanced user space drivers for PCI, PCI-X,
+ and PCIe devices. Requires IOMMU to allow non-privilged
non-privileged
+ processes to directly control the PCI devices.
+
+ If you don't know what to do here, say N.

---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***
Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
On Fri, 28 May 2010 16:07:38 -0700 Tom Lyon wrote:

diff -uprN linux-2.6.34/Documentation/vfio.txt vfio-linux-2.6.34/Documentation/vfio.txt
--- linux-2.6.34/Documentation/vfio.txt 1969-12-31 16:00:00.0 -0800
+++ vfio-linux-2.6.34/Documentation/vfio.txt 2010-05-28 14:03:05.0 -0700
@@ -0,0 +1,176 @@
+---
+The VFIO driver is used to allow privileged AND non-privileged processes to
+implement user-level device drivers for any well-behaved PCI, PCI-X, and PCIe
+devices.
+
+Why is this interesting? Some applications, especially in the high performance
+computing field, need access to hardware functions with as little overhead as
+possible. Examples are in network adapters (typically non tcp/ip based) and
non-TCP/IP-based)
+in compute accelerators - i.e., array processors, FPGA processors, etc.
+Previous to the VFIO drivers these apps would need either a kernel-level
+driver (with corrsponding overheads), or else root permissions to directly
corresponding
+access the hardware. The VFIO driver allows generic access to the hardware
+from non-privileged apps IF the hardware is well-behaved enough for this
+to be safe.
+
+While there have long been ways to implement user-level drivers using specific
+corresponding drivers in the kernel, it was not until the introduction of the
+UIO driver framework, and the uio_pci_generic driver that one could have a
+generic kernel component supporting many types of user level drivers. However,
+even with the uio_pci_generic driver, processes implementing the user level
+drivers had to be trusted - they could do dangerous manipulation of DMA
+addreses and were required to be root to write PCI configuration space
+registers.
+
+Recent hardware technologies - I/O MMUs and PCI I/O Virtualization - provide
+new hardware capabilities which the VFIO solution exploits to allow non-root
+user level drivers. The main role of the IOMMU is to ensure that DMA accesses
+from devices go only to the appropriate memory locations, this allows VFIO to
locations;
+ensure that user level drivers do not corrupt inappropriate memory. PCI I/O
+virtualization (SR-IOV) was defined to allow pass-through of virtual devices
+to guest virtual machines. VFIO in essence implements pass-through of devices
+to user processes, not virtual machines. SR-IOV devices implement a
+traditional PCI device (the physical function) and a dynamic number of special
+PCI devices (virtual functions) whose feature set is somewhat restricted - in
+order to allow the operating system or virtual machine monitor to ensure the
+safe operation of the system.
+
+Any SR-IOV virtual function meets the VFIO definition of well-behaved, but
+there are many other non-IOV PCI devices which also meet the defintion.
+Elements of this definition are:
+- The size of any memory BARs to be mmap'ed into the user process space must be
+  a multiple of the system page size.
+- If MSI-X interrupts are used, the device driver must not attempt to mmap or
+  write the MSI-X vector area.
+- If the device is a PCI device (not PCI-X or PCIe), it must conform to PCI
+  revision 2.3 to allow its interrupts to be masked in a generic way.
+- The device must not use the PCI configuration space in any non-standard way,
+  i.e., the user level driver will be permitted only to read and write standard
+  fields of the PCI config space, and only if those fields cannot cause harm to
+  the system. In addition, some fields are virtualized, so that the user
+  driver can read/write them like a kernel driver, but they do not affect the
+  real device.
+- For now, there is no support for user access to the PCIe and PCI-X extended
+  capabilities configuration space.
+
+Even with these restrictions, there are bound to be devices which are unsafe
+for user level use - it is still up to the system admin to decide whether to
+grant access to the device. When the vfio module is loaded, it will have
+access to no devices until the desired PCI devices are bound to the driver.
+First, make sure the devices are not bound to another kernel driver. You can
+unload that driver if you wish to unbind all its devices, or else enter the
+driver's sysfs directory, and unbind a specific device:
+ cd /sys/bus/pci/drivers/drivername
+ echo :06:02.00 > unbind
+(The :06:02.00 is a fully qualified PCI device name - different for each
+device). Now, to bind to the vfio driver, go to /sys/bus/pci/drivers/vfio and
+write the PCI device type of the target device to the new_id file:
+ echo 8086 10ca > new_id
+(8086 10ca are the vendor and device type for the Intel 82576 virtual function
+devices). A /dev/vfioN entry will be created for each device bound.
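The bind step described in the quoted documentation amounts to writing a "vendor device" pair into the driver's new_id attribute. A program that wants to do this without a shell could look roughly like the sketch below. The helper names write_attr() and bind_by_id() are made up for illustration and are not part of the posted patch; the sketch assumes only that the attribute behaves like any other sysfs file that accepts a short text write.

```c
#include <errno.h>
#include <stdio.h>

/*
 * Hypothetical helper: write a short string to a sysfs-style attribute file.
 * Returns 0 on success or a negative errno-style value on failure.
 */
static int write_attr(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");
    int err = 0;

    if (f == NULL) {
        return -errno;
    }
    if (fputs(value, f) == EOF) {
        err = -EIO;
    }
    /* Buffered data is only pushed to the kernel on flush/close, so a
     * rejected write typically shows up here rather than in fputs(). */
    if (fclose(f) == EOF && err == 0) {
        err = errno ? -errno : -EIO;
    }
    return err;
}

/*
 * Hypothetical wrapper: format the vendor and device IDs as two hex words,
 * the same text the "echo 8086 10ca" example writes, and store them in the
 * given new_id file.
 */
static int bind_by_id(const char *new_id_path,
                      unsigned int vendor, unsigned int device)
{
    char ids[32];

    snprintf(ids, sizeof(ids), "%04x %04x\n", vendor, device);
    return write_attr(new_id_path, ids);
}
```

With the paths from the documentation, the call would be bind_by_id("/sys/bus/pci/drivers/vfio/new_id", 0x8086, 0x10ca), run with sufficient privileges and after the device has been unbound from its previous driver. A negative return value means the kernel refused the write or the file could not be opened.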