Re: [PATCH] virtio_pci: properly clean up MSI-X state when initialization fails
Hi Michael,

On Mon, Sep 15, 2014 at 1:32 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
> On Sun, Sep 14, 2014 at 08:23:26PM -0700, Anthony Liguori wrote:
>> From: Anthony Liguori <aligu...@amazon.com>
>>
>> If MSI-X initialization fails after setting msix_enabled = 1, then the
>> device is left in an inconsistent state. This would normally only
>> happen if there was a bug in the device emulation, but it should still
>> be handled correctly.
>
> This might happen if the host runs out of resources when trying to map
> VQs to vectors, so it doesn't have to be a bug.
>
> But I don't see what the problem is: msix_used_vectors reflects the
> number of used vectors and msix_enabled is set, thus vp_free_vectors
> will free all IRQs and then disable MSI-X. Where is the inconsistency
> you speak about?

I missed the fact that vp_free_vectors() conditionally sets
msix_enabled = 0. It seems a bit kludgy, especially since it is called
both before and after setting msix_enabled = 1.

I ran into a number of weird problems due to read/write reordering that
was causing this code path to fail. The impact was non-deterministic.
I'll go back and try to better understand what code path is causing it.
Cc: Matt Wilson <m...@amazon.com>
Cc: Rusty Russell <ru...@rustcorp.com.au>
Cc: Michael Tsirkin <m...@redhat.com>
Signed-off-by: Anthony Liguori <aligu...@amazon.com>
---
 drivers/virtio/virtio_pci.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
index 9cbac33..3d2c2a5 100644
--- a/drivers/virtio/virtio_pci.c
+++ b/drivers/virtio/virtio_pci.c
@@ -357,7 +357,7 @@ static int vp_request_msix_vectors(struct virtio_device *vdev, int nvectors,
 	v = ioread16(vp_dev->ioaddr + VIRTIO_MSI_CONFIG_VECTOR);
 	if (v == VIRTIO_MSI_NO_VECTOR) {
 		err = -EBUSY;
-		goto error;
+		goto error_msix_used;
 	}

 	if (!per_vq_vectors) {
@@ -369,11 +369,15 @@ static int vp_request_msix_vectors(struct virtio_device *vdev, int nvectors,
 				  vp_vring_interrupt, 0, vp_dev->msix_names[v],
 				  vp_dev);
 		if (err)
-			goto error;
+			goto error_msix_used;
 		++vp_dev->msix_used_vectors;
 	}
 	return 0;
+error_msix_used:
+	v = --vp_dev->msix_used_vectors;
+	free_irq(vp_dev->msix_entries[v].vector, vp_dev);
 error:
+	vp_dev->msix_enabled = 0;

> As far as I can see, if you do this, the guest will not call
> pci_disable_msix, thus leaving the device with MSI-X enabled.

I don't understand this comment. How is the work done in this path any
different from what's done in vp_free_vectors()?

Regards,

Anthony Liguori

> I'm not sure this won't break drivers if they then try to use the
> device without MSI-X, and it definitely seems less elegant than
> reverting the device to the original state.

 	vp_free_vectors(vdev);
 	return err;
 }
--
1.7.9.5
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH] virtio-pci: also bind to Amazon PCI vendor ID
Hi Michael,

On Mon, Sep 15, 2014 at 1:20 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
> On Sun, Sep 14, 2014 at 08:29:33PM -0700, Anthony Liguori wrote:
>> From: Anthony Liguori <aligu...@amazon.com>
>>
>> See https://issues.oasis-open.org/browse/VIRTIO-16 although it was
>> prematurely closed.
>
> The reason it was closed is described in the comments, and I quote:
>
>   I think anyone can use a different subsystem vendor id and whql the
>   driver. virtio-102 will make this even easier, by making the
>   subsystem id flexible. Let's close this and re-open if someone tries
>   to do this and runs into a problem.
>
> Look here for example:
> https://github.com/YanVugenfirer/kvm-guest-drivers-windows/blob/master/NetKVM/NDIS5/wxp/netkvm.inf
>
> Replace SUBSYS_00011AF4 with SUBSYS_00011D0F, and you will get a
> virtio-net driver that (I think) you should be able to WHQL.

The string you are referencing is the device description, which is part
of what forms the device instance ID. You can read more at
http://msdn.microsoft.com/en-us/library/windows/hardware/ff541327%28v=vs.85%29.aspx.
Including a SUBSYS entry is mandatory in modern Windows drivers, but
this is independent of WHQL certification.

My understanding is that Microsoft will only allow the owner of the PCI
vendor ID to WHQL drivers. As best as I know, this is not a publicly
documented process. Do you have any examples of anyone else
successfully WHQL'ing drivers by just changing the subsystem ID?

Microsoft has specific rules about the usage of the subsystem ID. See
http://msdn.microsoft.com/en-us/library/windows/hardware/dn653567%28v=vs.85%29.aspx.
I don't think it is intended to enable a totally different
implementation of the driver based on subsystem ID. Certainly, this is
not expected with typical PCI devices.

> On the host side, you will need a QEMU patch to allow libvirt control
> of the subsystem vendor ID. All this while all Linux guests will keep
> working without changes, which seems like a better approach.
> Looking on the web, I found:
> http://download.microsoft.com/download/1/6/1/161ba512-40e2-4cc9-843a-923143f3456c/pciidspec-11.doc
>
>   Test will read and properly concatenate PCI IDs and verify uniqueness
>
> This is likely what you are running into: IDs must be unique, so if
> you want to put your driver in Microsoft's database, it must match a
> different set of IDs. But you should not need to change the vendor ID
> to make them unique; changing the subsystem vendor ID will do. Did you
> try this?

>> You cannot submit a modified kvm-guest-drivers.git for WHQL
>> certification as the licensing is constructed in such a way as to
>> prevent that. Red Hat has non-redistributable Windows drivers and
>> Microsoft will not allow anyone else to WHQL certify drivers using
>> that vendor ID.
>
> Don't see what Red Hat's Windows drivers have to do with Linux,
> really. Amazon.com can do whatever it wants with its vendor ID, and if
> there is a hypervisor with a different vendor ID that can use the
> virtio drivers, this patch is required. The following would be a
> reasonable commit log in that case:
>
>   Amazon.com uses PV devices with vendor ID 0x1d0f that are otherwise
>   compatible with Linux virtio drivers. Add the 0x1d0f ID to the list
>   to make Linux work with these devices.
>
> Feel free to use :) But I'd like to note that by doing this on the
> hypervisor side, you lose the ability to run older Linux guests, and
> create work for all distros who now need to update their kernels to
> work with your ID, apparently for no good reason. So if this isn't out
> in the field yet, I would suggest examining the alternative listed
> above.

I am very happy to use any alternative mechanism that allows for virtio
to be used with a wide variety of guests. Many other companies have
also struggled with this and AFAIK no one has had much success.

> OTOH if it *is* decided this is going to be out there in the field,
> please add the new devices to the PCI IDs list.
> http://pci-ids.ucw.cz/
>
> Otherwise there's no way to be sure someone won't try to use these IDs
> for something else.

PCI-SIG assigns vendor IDs, and 0x1d0f is assigned to Amazon. See
https://www.pcisig.com/membership/vid_search/ Vendors self-manage
device IDs, and we have allocated 0x1000-0x103f to virtio devices.

>> That makes it impossible to use virtio drivers with a Windows guest
>> without changing the vendor ID.
>
> Hardly impossible: virtio drivers are available from a variety of
> sources.

Examples?

Regards,

Anthony Liguori

> But this is IMO beside the point.
>
>> Cc: Matt Wilson <m...@amazon.com>
>> Cc: Rusty Russell <ru...@rustcorp.com.au>
>> Cc: Michael Tsirkin <m...@redhat.com>
>> Signed-off-by: Anthony Liguori <aligu...@amazon.com>
>
> I'd like to see the response/confirmation of the above, and/or the
> commit log replaced, before this patch is applied. Thanks!
>
>> ---
>>  drivers/virtio/virtio_pci.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
>> index 101db3f..9cbac33 100644
>> --- a/drivers/virtio
[PATCH] virtio_pci: properly clean up MSI-X state when initialization fails
From: Anthony Liguori <aligu...@amazon.com>

If MSI-X initialization fails after setting msix_enabled = 1, then the
device is left in an inconsistent state. This would normally only
happen if there was a bug in the device emulation but it still should
be handled correctly.

Cc: Matt Wilson <m...@amazon.com>
Cc: Rusty Russell <ru...@rustcorp.com.au>
Cc: Michael Tsirkin <m...@redhat.com>
Signed-off-by: Anthony Liguori <aligu...@amazon.com>
---
 drivers/virtio/virtio_pci.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
index 9cbac33..3d2c2a5 100644
--- a/drivers/virtio/virtio_pci.c
+++ b/drivers/virtio/virtio_pci.c
@@ -357,7 +357,7 @@ static int vp_request_msix_vectors(struct virtio_device *vdev, int nvectors,
 	v = ioread16(vp_dev->ioaddr + VIRTIO_MSI_CONFIG_VECTOR);
 	if (v == VIRTIO_MSI_NO_VECTOR) {
 		err = -EBUSY;
-		goto error;
+		goto error_msix_used;
 	}

 	if (!per_vq_vectors) {
@@ -369,11 +369,15 @@ static int vp_request_msix_vectors(struct virtio_device *vdev, int nvectors,
 				  vp_vring_interrupt, 0, vp_dev->msix_names[v],
 				  vp_dev);
 		if (err)
-			goto error;
+			goto error_msix_used;
 		++vp_dev->msix_used_vectors;
 	}
 	return 0;
+error_msix_used:
+	v = --vp_dev->msix_used_vectors;
+	free_irq(vp_dev->msix_entries[v].vector, vp_dev);
 error:
+	vp_dev->msix_enabled = 0;
 	vp_free_vectors(vdev);
 	return err;
 }
--
1.7.9.5
[PATCH] virtio-pci: also bind to Amazon PCI vendor ID
From: Anthony Liguori <aligu...@amazon.com>

See https://issues.oasis-open.org/browse/VIRTIO-16 although it was
prematurely closed.

Red Hat has non-redistributable Windows drivers and Microsoft will not
allow anyone else to WHQL certify drivers using that vendor ID. That
makes it impossible to use virtio drivers with a Windows guest without
changing the vendor ID.

Cc: Matt Wilson <m...@amazon.com>
Cc: Rusty Russell <ru...@rustcorp.com.au>
Cc: Michael Tsirkin <m...@redhat.com>
Signed-off-by: Anthony Liguori <aligu...@amazon.com>
---
 drivers/virtio/virtio_pci.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
index 101db3f..9cbac33 100644
--- a/drivers/virtio/virtio_pci.c
+++ b/drivers/virtio/virtio_pci.c
@@ -93,6 +93,8 @@ struct virtio_pci_vq_info
 /* Qumranet donated their vendor ID for devices 0x1000 thru 0x10FF. */
 static DEFINE_PCI_DEVICE_TABLE(virtio_pci_id_table) = {
 	{ PCI_DEVICE(0x1af4, PCI_ANY_ID) },
+	/* Amazon.com vendor ID */
+	{ PCI_DEVICE(0x1d0f, PCI_ANY_ID) },
 	{ 0 }
 };
--
1.7.9.5
KVM Forum 2013 Call for Participation - Extended to August 4th
We have received numerous requests to extend the CFP deadline, and so
we are happy to announce that the CFP deadline has been moved by two
weeks to August 4th.

=========================================================
KVM Forum 2013: Call For Participation
October 21-23, 2013 - Edinburgh International Conference Centre -
Edinburgh, UK

(All submissions must be received before midnight August 4, 2013)
=========================================================

KVM is an industry-leading open source hypervisor that provides an
ideal platform for datacenter virtualization, virtual desktop
infrastructure, and cloud computing. Once again, it's time to bring
together the community of developers and users that define the KVM
ecosystem for our annual technical conference. We will discuss the
current state of affairs and plan for the future of KVM, its
surrounding infrastructure, and management tools. The oVirt Workshop
will run in parallel with the KVM Forum again, bringing in a community
focused on enterprise datacenter virtualization management built on
KVM. For topics which overlap we will have shared sessions.

So mark your calendar and join us in advancing KVM.
http://events.linuxfoundation.org/events/kvm-forum/

Once again we are colocated with The Linux Foundation's LinuxCon
Europe. KVM Forum attendees will be able to attend oVirt Workshop
sessions and are eligible to attend LinuxCon Europe for a discounted
rate.
http://events.linuxfoundation.org/events/kvm-forum/register

We invite you to lead part of the discussion by submitting a speaking
proposal for KVM Forum 2013.
http://events.linuxfoundation.org/cfp

Suggested topics:

KVM/Kernel
- Scaling and performance
- Nested virtualization
- I/O improvements
- VFIO, device assignment, SR-IOV
- Driver domains
- Time keeping
- Resource management (cpu, memory, i/o)
- Memory management (page sharing, swapping, huge pages, etc)
- Network virtualization
- Security
- Architecture ports

QEMU
- Device model improvements
- New devices and chipsets
- Scaling and performance
- Desktop virtualization
- Spice
- Increasing robustness and hardening
- Security model
- Management interfaces
- QMP protocol and implementation
- Image formats
- Firmware (SeaBIOS, OVMF, UEFI, etc)
- Live migration
- Live snapshots and merging
- Fault tolerance, high availability, continuous backup
- Real-time guest support

Virtio
- Speeding up existing devices
- Alternatives
- Virtio on non-Linux or non-virtualized

Management infrastructure
- oVirt (shared track w/ oVirt Workshop)
- Libvirt
- KVM autotest
- OpenStack
- Network virtualization management
- Enterprise storage management

Cloud computing
- Scalable storage
- Virtual networking
- Security
- Provisioning

SUBMISSION REQUIREMENTS

Abstracts due: August 4, 2013
Notification: August 1, 2013

Please submit a short abstract (~150 words) describing your
presentation proposal. In your submission please note how long your
talk will take. Slots vary in length up to 45 minutes. Also include in
your proposal the proposal type -- one of:

- technical talk
- end-user talk
- birds of a feather (BOF) session

Submit your proposal here:
http://events.linuxfoundation.org/cfp

You will receive a notification whether or not your presentation
proposal was accepted by Aug 1st.

END-USER COLLABORATION

One of the big challenges as developers is to know what, where and how
people actually use our software. We will reserve a few slots for end
users talking about their deployment challenges and achievements. If
you are using KVM in production you are encouraged to submit a speaking
proposal.
Simply mark it as an end-user collaboration proposal. As an end user,
this is a unique opportunity to get your input to developers.

BOF SESSION

We will reserve some slots in the evening after the main conference
tracks for birds of a feather (BOF) sessions. These sessions will be
less formal than presentation tracks and targeted at people who would
like to discuss specific issues with other developers and/or users. If
you are interested in getting developers and/or users together to
discuss a specific problem, please submit a BOF proposal.

HOTEL / TRAVEL

The KVM Forum 2013 will be held in Edinburgh, UK at the Edinburgh
International Conference Centre.
http://events.linuxfoundation.org/events/kvm-forum/hotel

Thank you for your interest in KVM. We're looking forward to your
submissions and seeing you at the KVM Forum 2013 in October!

Thanks,
- your KVM Forum 2013 Program Committee

Please contact us with any questions or comments:
kvm-forum-2013...@redhat.com
Re: updated: kvm networking todo wiki
Rusty Russell <ru...@rustcorp.com.au> writes:
> Anthony Liguori <anth...@codemonkey.ws> writes:
>> Rusty Russell <ru...@rustcorp.com.au> writes:
>>> On Fri, May 24, 2013 at 08:47:58AM -0500, Anthony Liguori wrote:
>>>> FWIW, I think what's more interesting is using vhost-net as a
>>>> networking backend with virtio-net in QEMU being what's guest
>>>> facing. In theory, this gives you the best of both worlds: QEMU
>>>> acts as a first line of defense against a malicious guest while
>>>> still getting the performance advantages of vhost-net (zero-copy).
>>>
>>> It would be an interesting idea if we didn't already have the vhost
>>> model where we don't need the userspace bounce.
>>
>> The model is very interesting for QEMU because then we can use vhost
>> as a backend for other types of network adapters (like vmxnet3 or
>> even e1000). It also helps for things like fault tolerance where we
>> need to be able to control packet flow within QEMU.
>
> (CC's reduced, context added, Dmitry Fleytman added for vmxnet3
> thoughts.)
>
> Then I'm really confused as to what this would look like. A zero copy
> sendmsg? We should be able to implement that today.

The only trouble with sendmsg would be doing batch submission and
asynchronous completion. A thread pool could certainly be used for
this, I guess.

Regards,

Anthony Liguori

> On the receive side, what can we do better than readv? If we need to
> return to userspace to tell the guest that we've got a new packet, we
> don't win on latency. We might reduce syscall overhead with a
> multi-dimensional readv to read multiple packets at once?
>
> Confused,
> Rusty.
Re: updated: kvm networking todo wiki
Stefan Hajnoczi <stefa...@gmail.com> writes:
> On Thu, May 30, 2013 at 7:23 AM, Rusty Russell <ru...@rustcorp.com.au> wrote:
>> Anthony Liguori <anth...@codemonkey.ws> writes:
>>> Rusty Russell <ru...@rustcorp.com.au> writes:
>>>> On Fri, May 24, 2013 at 08:47:58AM -0500, Anthony Liguori wrote:
>>>>> FWIW, I think what's more interesting is using vhost-net as a
>>>>> networking backend with virtio-net in QEMU being what's guest
>>>>> facing. In theory, this gives you the best of both worlds: QEMU
>>>>> acts as a first line of defense against a malicious guest while
>>>>> still getting the performance advantages of vhost-net (zero-copy).
>>>>
>>>> It would be an interesting idea if we didn't already have the vhost
>>>> model where we don't need the userspace bounce.
>>>
>>> The model is very interesting for QEMU because then we can use vhost
>>> as a backend for other types of network adapters (like vmxnet3 or
>>> even e1000). It also helps for things like fault tolerance where we
>>> need to be able to control packet flow within QEMU.
>>
>> (CC's reduced, context added, Dmitry Fleytman added for vmxnet3
>> thoughts.)
>>
>> Then I'm really confused as to what this would look like. A zero copy
>> sendmsg? We should be able to implement that today.
>>
>> On the receive side, what can we do better than readv? If we need to
>> return to userspace to tell the guest that we've got a new packet, we
>> don't win on latency. We might reduce syscall overhead with a
>> multi-dimensional readv to read multiple packets at once?
>
> Sounds like recvmmsg(2).

Could we map this to mergeable rx buffers, though?

Regards,

Anthony Liguori

> Stefan
Re: [PATCH RFC] virtio-pci: new config layout: using memory BAR
Rusty Russell <ru...@rustcorp.com.au> writes:
> Anthony Liguori <aligu...@us.ibm.com> writes:
>> Forcing a guest driver change is a really big deal and I see no
>> reason to do that unless there's a compelling reason to. So we're
>> stuck with the 1.0 config layout for a very long time.
>
> We definitely must not force a guest change. The explicit aim of the
> standard is that legacy and 1.0 be backward compatible. One
> deliverable is a document detailing how this is done (effectively a
> summary of changes between what we have and 1.0).

If 2.0 is fully backwards compatible, great. It seems like such a large
difference that that would be impossible, but I need to investigate
further.

Regards,

Anthony Liguori

> It's a delicate balancing act. My plan is to accompany any changes in
> the standard with a qemu implementation, so we can see how painful
> those changes are. And if there are performance implications, measure
> them.
>
> Cheers,
> Rusty.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: updated: kvm networking todo wiki
Michael S. Tsirkin <m...@redhat.com> writes:
> On Thu, May 30, 2013 at 08:40:47AM -0500, Anthony Liguori wrote:
>> Stefan Hajnoczi <stefa...@gmail.com> writes:
>>> On Thu, May 30, 2013 at 7:23 AM, Rusty Russell <ru...@rustcorp.com.au> wrote:
>>>> [earlier vhost-net backend context snipped]
>>>>
>>>> On the receive side, what can we do better than readv? If we need
>>>> to return to userspace to tell the guest that we've got a new
>>>> packet, we don't win on latency. We might reduce syscall overhead
>>>> with a multi-dimensional readv to read multiple packets at once?
>>>
>>> Sounds like recvmmsg(2).
>>
>> Could we map this to mergeable rx buffers, though?
>
> Yes, because we don't have to complete buffers in order.

What I meant, though, was for GRO: we don't know how large the received
packet is going to be. Mergeable rx buffers let us allocate a pool of
data for all incoming packets instead of allocating max packet size *
max packets. recvmmsg expects an array of msghdrs, and I presume each
needs to be given a fixed size. So this seems incompatible with
mergeable rx buffers.

Regards,

Anthony Liguori

>>> Stefan
Re: [PATCH RFC] virtio-pci: new config layout: using memory BAR
Rusty Russell <ru...@rustcorp.com.au> writes:
> Anthony Liguori <aligu...@us.ibm.com> writes:
>> Michael S. Tsirkin <m...@redhat.com> writes:
>>> +    case offsetof(struct virtio_pci_common_cfg, device_feature_select):
>>> +        return proxy->device_feature_select;
>>
>> Oh dear no... Please use defines like the rest of QEMU.
>
> It is pretty ugly.

I think beauty is in the eye of the beholder here... Pretty much every
device we have has a switch statement like this. Consistency wins when
it comes to qualitative arguments like this.

> Yet the structure definitions are descriptive, capturing layout, size
> and endianness in a natural format readable by any C programmer.

From an API design point of view, here are the problems I see:

1) C makes no guarantees about structure layout beyond the first
   member. Yes, if it's naturally aligned or has a packed attribute,
   GCC does the right thing. But this isn't kernel land anymore;
   portability matters and there are more compilers than GCC.

2) If we ever introduce anything like latching, this doesn't work out
   so well anymore, because it's hard to express in a single C
   structure the register layout at that point. Perhaps a union could
   be used, but padding may make it a bit challenging.

3) I suspect it's harder to review because a subtle change could more
   easily have broad impact. If someone changed the type of a field
   from u32 to u16, it changes the offset of every other field. That's
   not terribly obvious in the patch itself unless you understand how
   the structure is used elsewhere. This may not be a problem for
   virtio because we all understand that the structures are part of an
   ABI, but if we used this pattern more in QEMU, it would be a lot
   less obvious.

> So AFAICT the question is, do we put the required
>
>   #define VIRTIO_PCI_CFG_FEATURE_SEL \
>       (offsetof(struct virtio_pci_common_cfg, device_feature_select))
>
> etc. in the kernel headers or qemu?

I'm pretty sure we would end up just having our own integer defines. We
carry our own virtio headers today because we can't easily import the
kernel headers.

>> Haven't looked at the proposed new ring layout yet.
>
> No change, but there's an open question on whether we should nail it
> to little endian (or define the endian by the transport). Of course, I
> can't rule out that the 1.0 standard *may* decide to frob the ring
> layout somehow,

Well, given that virtio is widely deployed today, I would think the 1.0
standard should strictly reflect what's deployed today, no? Any new
config layout would be 2.0 material, right?

Re: the new config layout, I don't think we would want to use it for
anything but new devices. Forcing a guest driver change is a really big
deal and I see no reason to do that unless there's a compelling reason
to. So we're stuck with the 1.0 config layout for a very long time.

Regards,

Anthony Liguori

> but I'd think it would require a compelling reason. I suggest that's
> 2.0 material...
>
> Cheers,
> Rusty.
Re: updated: kvm networking todo wiki
Rusty Russell <ru...@rustcorp.com.au> writes:
> Michael S. Tsirkin <m...@redhat.com> writes:
>> On Fri, May 24, 2013 at 08:47:58AM -0500, Anthony Liguori wrote:
>>> Michael S. Tsirkin <m...@redhat.com> writes:
>>>> On Fri, May 24, 2013 at 05:41:11PM +0800, Jason Wang wrote:
>>>>> On 05/23/2013 04:50 PM, Michael S. Tsirkin wrote:
>>>>>> Hey guys,
>>>>>> I've updated the kvm networking todo wiki with current projects.
>>>>>> Will try to keep it up to date more often. Original announcement
>>>>>> below.
>>>>>
>>>>> Thanks a lot. I've added the tasks I'm currently working on to the
>>>>> wiki. Btw, I notice the virtio-net data plane was missed in the
>>>>> wiki. Is the project still being considered?
>>>>
>>>> It might have been interesting several years ago, but now that
>>>> Linux has vhost-net in kernel, the only point seems to be to speed
>>>> up networking on non-Linux hosts.
>>>
>>> Data plane just means having a dedicated thread for virtqueue
>>> processing that doesn't hold qemu_mutex. Of course we're going to do
>>> this in QEMU. It's a no-brainer. But not as a separate device, just
>>> as an improvement to the existing userspace virtio-net.
>>
>> Since non-Linux does not have kvm, I doubt virtio is a bottleneck.
>>
>>> FWIW, I think what's more interesting is using vhost-net as a
>>> networking backend with virtio-net in QEMU being what's guest
>>> facing. In theory, this gives you the best of both worlds: QEMU acts
>>> as a first line of defense against a malicious guest while still
>>> getting the performance advantages of vhost-net (zero-copy).
>>
>> Great idea, that sounds very interesting. I'll add it to the wiki. In
>> fact a bit of complexity in vhost was put there in the vague hope to
>> support something like this: virtio rings are not translated through
>> regular memory tables; instead, vhost gets a pointer to the ring
>> address. This allows qemu to act as a man in the middle, verifying
>> the descriptors but not touching the
>>
>> Anyone interested in working on such a project?
>
> It would be an interesting idea if we didn't already have the vhost
> model where we don't need the userspace bounce.

The model is very interesting for QEMU because then we can use vhost as
a backend for other types of network adapters (like vmxnet3 or even
e1000). It also helps for things like fault tolerance where we need to
be able to control packet flow within QEMU.

Regards,

Anthony Liguori

> We already have two sets of host-side ring code in the kernel (vhost
> and vringh, though they're being unified). All an accelerator can
> offer on the tx side is zero copy and direct update of the used ring.
> On rx, userspace could register the buffers and the accelerator could
> fill them and update the used ring. It still needs to deal with merged
> buffers, for example.
>
> You avoid the address translation in the kernel, but I'm not convinced
> that's a key problem.
>
> Cheers,
> Rusty.
Re: [Qemu-devel] [PATCH v2 00/11] qemu: use virtio linux headers in portable code
Rusty Russell <ru...@rustcorp.com.au> writes:
> Anthony Liguori <anth...@codemonkey.ws> writes:
>> The headers say they are BSD licensed... but they include a GPLv2+
>> header. Doesn't make a lot of sense, does it?
>
> It makes perfect sense: you're overthinking it. It just means that
> copying the BSD headers outside Linux is encouraged.

Copying by hand and modifying. This patch series attempts to do it
automatically within QEMU.

> And it's clearly nonsensical to claim the GPL on kernel headers means
> that userspace needs to be GPL. So please ignore this licensing
> red-herring.

You're missing context, I suspect. This series is trying to
automatically copy the headers from Linux. We currently have a manually
copied version.

The headers are not currently well suited for automatic copying because
(1) they use kernel types and (2) they pull in dependencies from
throughout the kernel.

This discussion comes down to whether we want to make it easier to
automatically copy the headers, or do we maintain the status quo and
require manual munging to pull in the headers.

Regards,

Anthony Liguori

> And we'll bikeshed the headers in the standard when we have to :) They
> certainly don't need to be cut & paste into the kernel sources.
>
> Cheers,
> Rusty.
Re: [PATCH RFC] virtio-pci: new config layout: using memory BAR
Michael S. Tsirkin <m...@redhat.com> writes:
> On Wed, May 29, 2013 at 07:52:37AM -0500, Anthony Liguori wrote:
>> 1) C makes no guarantees about structure layout beyond the first
>>    member. Yes, if it's naturally aligned or has a packed attribute,
>>    GCC does the right thing. But this isn't kernel land anymore;
>>    portability matters and there are more compilers than GCC.
>
> You expect a compiler to pad this structure:
>
>   struct foo {
>       uint8_t a;
>       uint8_t b;
>       uint16_t c;
>       uint32_t d;
>   };
>
> I'm guessing any compiler that decides to waste memory in this way
> will quickly get dropped by users and then we won't worry about
> building QEMU with it.

There are other responses in the thread here and I don't really care to
bikeshed on this issue.

>> Well, given that virtio is widely deployed today, I would think the
>> 1.0 standard should strictly reflect what's deployed today, no? Any
>> new config layout would be 2.0 material, right?
>
> Not as it's currently planned. Devices can choose to support a legacy
> layout in addition to the new one, and if you look at the patch you
> will see that that is exactly what it does.

Adding a new BAR most certainly requires bumping the revision ID or
changing the device ID, no? Didn't we run into this problem with the
virtio-win drivers with just the BAR size changing?

>> Re: the new config layout, I don't think we would want to use it for
>> anything but new devices. Forcing a guest driver change
>
> There's no forcing. If you look at the patches closely, you will see
> that we still support the old layout on BAR0.
>
>> is a really big deal and I see no reason to do that unless there's a
>> compelling reason to.
>
> There are many compelling reasons, and they are well-known limitations
> of virtio PCI:
>
> - PCI spec compliance (mandates device operation with I/O memory
>   disabled).

PCI Express spec. We are fully compliant with the PCI spec. And what's
the user-visible advantage of putting an emulated virtio device behind
a PCI-e bus versus a legacy PCI bus?

This is a very good example, because if we have to disable BAR0, then
it's an ABI breaker, plain and simple.

> - support 64 bit addressing

We currently support 44-bit addressing for the ring. While I agree we
need to bump it, there's no immediate problem with 44-bit addressing.

> - add more than 32 feature bits.
> - individually disable queues.
> - sanely support cross-endian systems.
> - support very small (1 PAGE) virtio rings.
> - support a separate page for each vq kick.
> - make each device place config at a flexible offset.

None of these things are holding us back today. I'm not saying we
shouldn't introduce a new device, but adoption of that device will be
slow and realistically will be limited to new devices only. We'll be
supporting both devices for a very, very long time.

Compatibility is the fundamental value that we provide. We need to go
out of our way to make sure that existing guests work and work as well
as possible. Sticking virtio devices behind a PCI-e bus just for the
hell of it isn't a compelling reason to break existing guests.

Regards,

Anthony Liguori

> Addressing any one of these would cause us to add a substantially new
> way to operate virtio devices. And since it's a guest change anyway,
> it seemed like a good time to do the new layout and fix everything in
> one go. And they are needed like yesterday.
>
>> So we're stuck with the 1.0 config layout for a very long time.
>
> Absolutely. This patch lets us support both, which will allow for a
> gradual transition over the next 10 years or so.
>
>>> reason. I suggest that's 2.0 material...
>>>
>>> Cheers,
>>> Rusty.
Re: [Qemu-devel] [PATCH v2 00/11] qemu: use virtio linux headers in portable code
Michael S. Tsirkin m...@redhat.com writes: On Wed, May 29, 2013 at 08:05:29AM -0500, Anthony Liguori wrote: Rusty Russell ru...@rustcorp.com.au writes: Anthony Liguori anth...@codemonkey.ws writes: The headers say they are BSD licensed... but they include a GPLv2+ header. Doesn't make a lot of sense, does it? It makes perfect sense: you're overthinking it. It just means that copying the BSD headers outside Linux is encouraged. Copying by hand and modifying. This patch series attempts to do it automatically within QEMU. And it's clearly nonsensical to claim the GPL on kernel headers means that userspace needs to be GPL. So please ignore this licensing red-herring. You're missing context, I suspect. This series is trying to automatically copy the headers from Linux. We currently have a manually copied version. The headers are not currently well suited for automatic copying because (1) they use kernel types (2) they pull in dependencies from throughout the kernel. This discussion comes down to whether we want to make it easier to automatically copy the headers or do we maintain the status quo and require manual munging to pull in the headers. Regards, Anthony Liguori Okay, what if I 1. add a stub if_ether.h with a couple of macros we want 2. replace the types during copying Would this address all concerns? If Rusty doesn't like the idea of making the headers usable directly, then I don't object to having stubs/scripts to sanitize them in QEMU. Regards, Anthony Liguori And we'll bikeshed the headers in the standard when we have to :) They certainly don't need to be cutpaste into the kernel sources. Cheers, Rusty. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH RFC] virtio-pci: new config layout: using memory BAR
Michael S. Tsirkin m...@redhat.com writes: This adds support for the new config layout, and is designed to work with the new layout code in Rusty's new-layout branch. At the moment all fields are in the same memory BAR (BAR 2). This will be used to test performance and compare memory, I/O and hypercall latency. Compiles but does not work yet. Migration isn't handled yet. It's not clear what the queue_enable/queue_disable fields do; not yet implemented. Gateway for config access with config cycles not yet implemented. Sending out for early review/flames.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 hw/virtio/virtio-pci.c     | 393 +++--
 hw/virtio/virtio-pci.h     |  55 +++
 hw/virtio/virtio.c         |  20 +++
 include/hw/virtio/virtio.h |   4 +
 4 files changed, 458 insertions(+), 14 deletions(-)

diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c
index 752991a..f4db224 100644
--- a/hw/virtio/virtio-pci.c
+++ b/hw/virtio/virtio-pci.c
@@ -259,6 +259,26 @@ static void virtio_pci_stop_ioeventfd(VirtIOPCIProxy *proxy)
     proxy->ioeventfd_started = false;
 }

+static void virtio_pci_set_status(VirtIOPCIProxy *proxy, uint8_t val)
+{
+    VirtIODevice *vdev = proxy->vdev;
+
+    if (!(val & VIRTIO_CONFIG_S_DRIVER_OK)) {
+        virtio_pci_stop_ioeventfd(proxy);
+    }
+
+    virtio_set_status(vdev, val & 0xFF);
+
+    if (val & VIRTIO_CONFIG_S_DRIVER_OK) {
+        virtio_pci_start_ioeventfd(proxy);
+    }
+
+    if (vdev->status == 0) {
+        virtio_reset(proxy->vdev);
+        msix_unuse_all_vectors(&proxy->pci_dev);
+    }
+}
+
 static void virtio_ioport_write(void *opaque, uint32_t addr, uint32_t val)
 {
     VirtIOPCIProxy *proxy = opaque;
@@ -293,20 +313,7 @@ static void virtio_ioport_write(void *opaque, uint32_t addr, uint32_t val)
         }
         break;
     case VIRTIO_PCI_STATUS:
-        if (!(val & VIRTIO_CONFIG_S_DRIVER_OK)) {
-            virtio_pci_stop_ioeventfd(proxy);
-        }
-
-        virtio_set_status(vdev, val & 0xFF);
-
-        if (val & VIRTIO_CONFIG_S_DRIVER_OK) {
-            virtio_pci_start_ioeventfd(proxy);
-        }
-
-        if (vdev->status == 0) {
-            virtio_reset(proxy->vdev);
-            msix_unuse_all_vectors(&proxy->pci_dev);
-        }
+        virtio_pci_set_status(proxy, val);
         /* Linux before 2.6.34 sets the device as OK without enabling
            the PCI device bus master bit. In this case we need to disable
@@ -455,6 +462,226 @@ static void virtio_pci_config_write(void *opaque, hwaddr addr,
     }
 }

+static uint64_t virtio_pci_config_common_read(void *opaque, hwaddr addr,
+                                              unsigned size)
+{
+    VirtIOPCIProxy *proxy = opaque;
+    VirtIODevice *vdev = proxy->vdev;
+
+    uint64_t low = 0xffffffffull;
+
+    switch (addr) {
+    case offsetof(struct virtio_pci_common_cfg, device_feature_select):
+        return proxy->device_feature_select;

Oh dear no... Please use defines like the rest of QEMU. From a QEMU pov, take a look at: https://github.com/aliguori/qemu/commit/587c35c1a3fe90f6af0f97927047ef4d3182a659 And: https://github.com/aliguori/qemu/commit/01ba80a23cf2eb1e15056f82b44b94ec381565cb Which lets virtio-pci be subclassable and then remaps the config space to BAR2. Haven't looked at the proposed new ring layout yet. Regards, Anthony Liguori

+    case offsetof(struct virtio_pci_common_cfg, device_feature):
+        /* TODO: 64-bit features */
+        return proxy->device_feature_select ? 0 : proxy->host_features;
+    case offsetof(struct virtio_pci_common_cfg, guest_feature_select):
+        return proxy->guest_feature_select;
+    case offsetof(struct virtio_pci_common_cfg, guest_feature):
+        /* TODO: 64-bit features */
+        return proxy->guest_feature_select ? 0 : vdev->guest_features;
+    case offsetof(struct virtio_pci_common_cfg, msix_config):
+        return vdev->config_vector;
+    case offsetof(struct virtio_pci_common_cfg, num_queues):
+        /* TODO: more exact limit? */
+        return VIRTIO_PCI_QUEUE_MAX;
+    case offsetof(struct virtio_pci_common_cfg, device_status):
+        return vdev->status;
+
+    /* About a specific virtqueue.
     */
+    case offsetof(struct virtio_pci_common_cfg, queue_select):
+        return vdev->queue_sel;
+    case offsetof(struct virtio_pci_common_cfg, queue_size):
+        return virtio_queue_get_num(vdev, vdev->queue_sel);
+    case offsetof(struct virtio_pci_common_cfg, queue_msix_vector):
+        return virtio_queue_vector(vdev, vdev->queue_sel);
+    case offsetof(struct virtio_pci_common_cfg, queue_enable):
+        /* TODO */
+        return 0;
+    case offsetof(struct virtio_pci_common_cfg, queue_notify_off):
+        return vdev->queue_sel;
+    case offsetof(struct virtio_pci_common_cfg, queue_desc):
+        return virtio_queue_get_desc_addr(vdev, vdev->queue_sel
Re: [PATCH RFC] virtio-pci: new config layout: using memory BAR
Michael S. Tsirkin m...@redhat.com writes: On Tue, May 28, 2013 at 12:15:16PM -0500, Anthony Liguori wrote:

@@ -455,6 +462,226 @@ static void virtio_pci_config_write(void *opaque, hwaddr addr,
     }
 }

+static uint64_t virtio_pci_config_common_read(void *opaque, hwaddr addr,
+                                              unsigned size)
+{
+    VirtIOPCIProxy *proxy = opaque;
+    VirtIODevice *vdev = proxy->vdev;
+
+    uint64_t low = 0xffffffffull;
+
+    switch (addr) {
+    case offsetof(struct virtio_pci_common_cfg, device_feature_select):
+        return proxy->device_feature_select;

Oh dear no... Please use defines like the rest of QEMU. Any good reason not to use offsetof? I see about 138 examples in qemu. There are exactly zero: $ find . -name *.c -exec grep -l case offset {} \; $ Anyway, that's the way Rusty wrote it in the kernel header - I started with defines. If you convince Rusty to switch I can switch too, We have 300+ devices in QEMU that use #defines. We're not using this kind of thing just because you want to copy code from the kernel. https://github.com/aliguori/qemu/commit/587c35c1a3fe90f6af0f97927047ef4d3182a659 And: https://github.com/aliguori/qemu/commit/01ba80a23cf2eb1e15056f82b44b94ec381565cb Which lets virtio-pci be subclassable and then remaps the config space to BAR2. Interesting. Have you got the spec anywhere? Not yet, but working on that. You are saying this is going to conflict because of BAR2 usage, yes? No, this whole thing is flexible. I had to use BAR2 because BAR0 has to be the vram mapping. It also had to be an MMIO bar. The new layout might make it easier to implement a device like this. I shared it mainly because I wanted to show the subclassing idea vs. just tacking an option onto the existing virtio-pci code in QEMU. Regards, Anthony Liguori So let's do this virtio-fb only for the new layout, so we don't need to maintain compatibility. In particular, we are working on making memory BAR access fast for virtio devices in a generic way. At the moment they are measurably slower than PIO on x86.
Haven't looked at the proposed new ring layout yet. Regards, No new ring layout. It's new config layout. -- MST
Re: [Qemu-devel] [PATCH v2 00/11] qemu: use virtio linux headers in portable code
Michael S. Tsirkin m...@redhat.com writes: On Sun, May 26, 2013 at 07:55:25PM -0500, Anthony Liguori wrote: Michael S. Tsirkin m...@redhat.com writes: On Sun, May 26, 2013 at 03:49:53PM -0500, Anthony Liguori wrote: Paolo Bonzini pbonz...@redhat.com writes: Il 26/05/2013 22:02, Michael S. Tsirkin ha scritto: My fault. I should have looked at linux/types.h (actually asm-generic/). Not really, __uX appear in the headers that were posted. Which is a problem because this is a reserved namespace in C99. What I'm saying is - a chance of a conflict is very remote, if it happens it's a build failure so easy to debug. I'm sure that others will complain, :) but you can go ahead. I think we should clean up the virtio headers in the kernel first. Seems like a good thing to do given the standardization effort too. There are lots of headers in uapi that use the C99 int types I found 4: $ grep -l uint include/uapi/linux/*h include/uapi/linux/dm-log-userspace.h include/uapi/linux/fuse.h include/uapi/linux/jffs2.h include/uapi/linux/pps.h include/uapi/linux/rds.h include/uapi/linux/sctp.h That's not really lots. so there doesn't seem to be a reason they couldn't be used in virtio. fuse.h even has a: #ifdef __KERNEL__ #include linux/types.h #else #include stdint.h #endif Which seems like a reasonable thing to do. In kernel, we really want to use things like endian-ness checks (and, we really should have them in userspace too!). So we want __le32 and other kernel goodies like that. stdint.h won't cut it. With the spec being standardized, I think having a stand alone set of headers is a good thing. Sure, that's possible. We'll have to find some way to preserve the endian-ness annotations, I think. And then import them into kernel/qemu with some script, converting to kernel/qemu style and annotations? See below. Licensing is problematic here too. 
If virtio headers depend on other Linux headers, then it really doesn't matter if they are BSD licensed if you need a GPL header (like linux/if_ether.h). Now, we can certainly debate the copyrightability of these defines and what have you but if the headers are meant to be 1) consumed outside the kernel 2) licensed under a different license than the general kernel then depending on kernel goodies is the wrong strategy. Well specifically if_ether.h says GPLv2+ so it's OK for QEMU. Do you mean for some other non GPL app? Ignore QEMU for the moment. The headers say they are BSD licensed... but they include a GPLv2+ header. Doesn't make a lot of sense, does it? Again, I think it's something pretty basic here. Either (1) these are kernel/userspace headers that are meant to be consumed through libc or whatever (2) these are kernel-private headers not to be consumed outside of the kernel (3) these are headers meant to be copied widely and used as a reference implementation of virtio. If (1) or (2), copying them into QEMU via a magic sanitizing script is wrong. We should just keep doing what we're doing. If (3), then we should clean up the virtio headers to be license clean and usable outside of the kernel. We shouldn't try to solve this problem in QEMU (via scripts) if we can just solve it in the kernel. Regards, Anthony Liguori The only other kernel dependency is linux/if_ether.h to define ETH_ALEN. But it seems completely reasonable to define a VIRTIO_NET_MAC_ALEN or something like that. Ugh. Not really reasonable IMO. We also use ETH_P_IP in code, would like to get rid of redefining that too. But we can have our own linux/if_ether.h for non-linux hosts, just with a couple of macros like that, it's not a big deal. See above. Regards, Anthony Liguori This would make the virtio headers completely stand alone and includable in userspace (without violating C99). Perhaps it's even worth moving the headers from uapi/linux to uapi/virtio. Rusty, what do you think? 
Regards, Anthony Liguori Paolo
Re: [Qemu-devel] [PATCH v2 00/11] qemu: use virtio linux headers in portable code
Rusty Russell ru...@rustcorp.com.au writes: Anthony Liguori anth...@codemonkey.ws writes: Paolo Bonzini pbonz...@redhat.com writes: Il 26/05/2013 22:02, Michael S. Tsirkin ha scritto: My fault. I should have looked at linux/types.h (actually asm-generic/). Not really, __uX appear in the headers that were posted. Which is a problem because this is a reserved namespace in C99. Personally, I find it hard to care. What matters is not what the standard has carved out, but whether we have clashes, reserved namespace or no. And that won't happen for these. If someone wants to convert all the kernel headers, I won't NAK it though. virtio headers are special. Linux headers are normally only consumed in the kernel or in a userspace application running on Linux. virtio headers may be used either in a userspace application running on !Linux (we need to support QEMU on Windows) or even in a foreign kernel. linux/types.h is unusable outside of Linux because it pulls in a bunch of other headers. If you look at Michael's patch, he adds his own version of types.h. It's unfortunate for every user of the headers to do this. Regards, Anthony Liguori Perhaps it's even worth moving the headers from uapi/linux to uapi/virtio. Rusty, what do you think? Hmm, #include virtio/virtio_net.h etc would be worthwhile if that also worked on FreeBSD. Bryan CC'd... Cheers, Rusty. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [Qemu-devel] [PATCH v2 00/11] qemu: use virtio linux headers in portable code
Paolo Bonzini pbonz...@redhat.com writes: Il 26/05/2013 22:02, Michael S. Tsirkin ha scritto: My fault. I should have looked at linux/types.h (actually asm-generic/). Not really, __uX appear in the headers that were posted. Which is a problem because this is a reserved namespace in C99. What I'm saying is - the chance of a conflict is very remote; if it happens it's a build failure, so easy to debug. I'm sure that others will complain, :) but you can go ahead. I think we should clean up the virtio headers in the kernel first. Seems like a good thing to do given the standardization effort too. There are lots of headers in uapi that use the C99 int types, so there doesn't seem to be a reason they couldn't be used in virtio. fuse.h even has a: #ifdef __KERNEL__ #include linux/types.h #else #include stdint.h #endif Which seems like a reasonable thing to do. The only other kernel dependency is linux/if_ether.h to define ETH_ALEN. But it seems completely reasonable to define a VIRTIO_NET_MAC_ALEN or something like that. This would make the virtio headers completely stand-alone and includable in userspace (without violating C99). Perhaps it's even worth moving the headers from uapi/linux to uapi/virtio. Rusty, what do you think? Regards, Anthony Liguori Paolo
Re: [Qemu-devel] [PATCH v2 00/11] qemu: use virtio linux headers in portable code
Michael S. Tsirkin m...@redhat.com writes: On Sun, May 26, 2013 at 03:49:53PM -0500, Anthony Liguori wrote: Paolo Bonzini pbonz...@redhat.com writes: Il 26/05/2013 22:02, Michael S. Tsirkin ha scritto: My fault. I should have looked at linux/types.h (actually asm-generic/). Not really, __uX appear in the headers that were posted. Which is a problem because this is a reserved namespace in C99. What I'm saying is - a chance of a conflict is very remote, if it happens it's a build failure so easy to debug. I'm sure that others will complain, :) but you can go ahead. I think we should clean up the virtio headers in the kernel first. Seems like a good thing to do given the standardization effort too. There are lots of headers in uapi that use the C99 int types I found 4: $ grep -l uint include/uapi/linux/*h include/uapi/linux/dm-log-userspace.h include/uapi/linux/fuse.h include/uapi/linux/jffs2.h include/uapi/linux/pps.h include/uapi/linux/rds.h include/uapi/linux/sctp.h That's not really lots. so there doesn't seem to be a reason they couldn't be used in virtio. fuse.h even has a: #ifdef __KERNEL__ #include linux/types.h #else #include stdint.h #endif Which seems like a reasonable thing to do. In kernel, we really want to use things like endian-ness checks (and, we really should have them in userspace too!). So we want __le32 and other kernel goodies like that. stdint.h won't cut it. With the spec being standardized, I think having a stand alone set of headers is a good thing. Licensing is problematic here too. If virtio headers depend on other Linux headers, then it really doesn't matter if they are BSD licensed if you need a GPL header (like linux/if_ether.h). Now, we can certainly debate the copyrightability of these defines and what have you but if the headers are meant to be 1) consumed outside the kernel 2) licensed under a different license than the general kernel then depending on kernel goodies is the wrong strategy. 
The only other kernel dependency is linux/if_ether.h to define ETH_ALEN. But it seems completely reasonable to define a VIRTIO_NET_MAC_ALEN or something like that. Ugh. Not really reasonable IMO. We also use ETH_P_IP in code, and would like to get rid of redefining that too. But we can have our own linux/if_ether.h for non-Linux hosts, just with a couple of macros like that; it's not a big deal. See above. Regards, Anthony Liguori This would make the virtio headers completely stand-alone and includable in userspace (without violating C99). Perhaps it's even worth moving the headers from uapi/linux to uapi/virtio. Rusty, what do you think? Regards, Anthony Liguori Paolo
Re: updated: kvm networking todo wiki
Michael S. Tsirkin m...@redhat.com writes: On Fri, May 24, 2013 at 05:41:11PM +0800, Jason Wang wrote: On 05/23/2013 04:50 PM, Michael S. Tsirkin wrote: Hey guys, I've updated the kvm networking todo wiki with current projects. Will try to keep it up to date more often. Original announcement below. Thanks a lot. I've added the tasks I'm currently working on to the wiki. btw. I notice the virtio-net data plane were missed in the wiki. Is the project still being considered? It might have been interesting several years ago, but now that linux has vhost-net in kernel, the only point seems to be to speed up networking on non-linux hosts. Data plane just means having a dedicated thread for virtqueue processing that doesn't hold qemu_mutex. Of course we're going to do this in QEMU. It's a no brainer. But not as a separate device, just as an improvement to the existing userspace virtio-net. Since non-linux does not have kvm, I doubt virtio is a bottleneck. FWIW, I think what's more interesting is using vhost-net as a networking backend with virtio-net in QEMU being what's guest facing. In theory, this gives you the best of both worlds: QEMU acts as a first line of defense against a malicious guest while still getting the performance advantages of vhost-net (zero-copy). IMO yet another networking backend is a distraction, and confusing to users. In any case, I'd like to see virtio-blk dataplane replace non dataplane first. We don't want two copies of virtio-net in qemu. 100% agreed. Regards, Anthony Liguori I've put up a wiki page with a kvm networking todo list, mainly to avoid effort duplication, but also in the hope to draw attention to what I think we should try addressing in KVM: http://www.linux-kvm.org/page/NetworkingTodo This page could cover all networking related activity in KVM, currently most info is related to virtio-net. 
Note: if there's no developer listed for an item, this just means I don't know of anyone actively working on an issue at the moment, not that no one intends to. I would appreciate it if others working on one of the items on this list would add their names so we can communicate better. If others like this wiki page, please go ahead and add stuff you are working on if any. It would be especially nice to add autotest projects: there is just a short test matrix and a catch-all 'Cover test matrix with autotest', currently. Currently there are some links to Red Hat bugzilla entries, feel free to add links to other bugzillas. Thanks! ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH-v2 1/2] virtio-scsi: create VirtIOSCSICommon
Nicholas A. Bellinger n...@linux-iscsi.org writes: From: Paolo Bonzini pbonz...@redhat.com This patch refactors existing virtio-scsi code into VirtIOSCSICommon in order to allow virtio_scsi_init_common() to be used by both internal virtio_scsi_init() and external vhost-scsi-pci code. Changes in Patch-v2: - Move -get_features() assignment to virtio_scsi_init() instead of virtio_scsi_init_common() Any reason we're not doing this as a QOM base class? Similiar to how the in-kernel PIT/PIC work using a common base class... Regards, Anthony Liguori Signed-off-by: Paolo Bonzini pbonz...@redhat.com Cc: Michael S. Tsirkin m...@redhat.com Cc: Asias He as...@redhat.com Signed-off-by: Nicholas Bellinger n...@linux-iscsi.org --- hw/virtio-scsi.c | 192 +- hw/virtio-scsi.h | 130 -- include/qemu/osdep.h |4 + 3 files changed, 178 insertions(+), 148 deletions(-) diff --git a/hw/virtio-scsi.c b/hw/virtio-scsi.c index 8620712..c59e9c6 100644 --- a/hw/virtio-scsi.c +++ b/hw/virtio-scsi.c @@ -18,118 +18,6 @@ #include hw/scsi.h #include hw/scsi-defs.h -#define VIRTIO_SCSI_VQ_SIZE 128 -#define VIRTIO_SCSI_CDB_SIZE32 -#define VIRTIO_SCSI_SENSE_SIZE 96 -#define VIRTIO_SCSI_MAX_CHANNEL 0 -#define VIRTIO_SCSI_MAX_TARGET 255 -#define VIRTIO_SCSI_MAX_LUN 16383 - -/* Response codes */ -#define VIRTIO_SCSI_S_OK 0 -#define VIRTIO_SCSI_S_OVERRUN 1 -#define VIRTIO_SCSI_S_ABORTED 2 -#define VIRTIO_SCSI_S_BAD_TARGET 3 -#define VIRTIO_SCSI_S_RESET4 -#define VIRTIO_SCSI_S_BUSY 5 -#define VIRTIO_SCSI_S_TRANSPORT_FAILURE6 -#define VIRTIO_SCSI_S_TARGET_FAILURE 7 -#define VIRTIO_SCSI_S_NEXUS_FAILURE8 -#define VIRTIO_SCSI_S_FAILURE 9 -#define VIRTIO_SCSI_S_FUNCTION_SUCCEEDED 10 -#define VIRTIO_SCSI_S_FUNCTION_REJECTED11 -#define VIRTIO_SCSI_S_INCORRECT_LUN12 - -/* Controlq type codes. */ -#define VIRTIO_SCSI_T_TMF 0 -#define VIRTIO_SCSI_T_AN_QUERY 1 -#define VIRTIO_SCSI_T_AN_SUBSCRIBE 2 - -/* Valid TMF subtypes. 
*/ -#define VIRTIO_SCSI_T_TMF_ABORT_TASK 0 -#define VIRTIO_SCSI_T_TMF_ABORT_TASK_SET 1 -#define VIRTIO_SCSI_T_TMF_CLEAR_ACA2 -#define VIRTIO_SCSI_T_TMF_CLEAR_TASK_SET 3 -#define VIRTIO_SCSI_T_TMF_I_T_NEXUS_RESET 4 -#define VIRTIO_SCSI_T_TMF_LOGICAL_UNIT_RESET 5 -#define VIRTIO_SCSI_T_TMF_QUERY_TASK 6 -#define VIRTIO_SCSI_T_TMF_QUERY_TASK_SET 7 - -/* Events. */ -#define VIRTIO_SCSI_T_EVENTS_MISSED0x8000 -#define VIRTIO_SCSI_T_NO_EVENT 0 -#define VIRTIO_SCSI_T_TRANSPORT_RESET 1 -#define VIRTIO_SCSI_T_ASYNC_NOTIFY 2 -#define VIRTIO_SCSI_T_PARAM_CHANGE 3 - -/* Reasons for transport reset event */ -#define VIRTIO_SCSI_EVT_RESET_HARD 0 -#define VIRTIO_SCSI_EVT_RESET_RESCAN 1 -#define VIRTIO_SCSI_EVT_RESET_REMOVED 2 - -/* SCSI command request, followed by data-out */ -typedef struct { -uint8_t lun[8]; /* Logical Unit Number */ -uint64_t tag;/* Command identifier */ -uint8_t task_attr; /* Task attribute */ -uint8_t prio; -uint8_t crn; -uint8_t cdb[]; -} QEMU_PACKED VirtIOSCSICmdReq; - -/* Response, followed by sense data and data-in */ -typedef struct { -uint32_t sense_len; /* Sense data length */ -uint32_t resid; /* Residual bytes in data buffer */ -uint16_t status_qualifier; /* Status qualifier */ -uint8_t status; /* Command completion status */ -uint8_t response;/* Response values */ -uint8_t sense[]; -} QEMU_PACKED VirtIOSCSICmdResp; - -/* Task Management Request */ -typedef struct { -uint32_t type; -uint32_t subtype; -uint8_t lun[8]; -uint64_t tag; -} QEMU_PACKED VirtIOSCSICtrlTMFReq; - -typedef struct { -uint8_t response; -} QEMU_PACKED VirtIOSCSICtrlTMFResp; - -/* Asynchronous notification query/subscription */ -typedef struct { -uint32_t type; -uint8_t lun[8]; -uint32_t event_requested; -} QEMU_PACKED VirtIOSCSICtrlANReq; - -typedef struct { -uint32_t event_actual; -uint8_t response; -} QEMU_PACKED VirtIOSCSICtrlANResp; - -typedef struct { -uint32_t event; -uint8_t lun[8]; -uint32_t reason; -} QEMU_PACKED VirtIOSCSIEvent; - -typedef struct { -uint32_t num_queues; 
-uint32_t seg_max; -uint32_t max_sectors; -uint32_t cmd_per_lun; -uint32_t event_info_size; -uint32_t sense_size; -uint32_t cdb_size; -uint16_t max_channel; -uint16_t max_target; -uint32_t max_lun
Re: [PATCH 0/5] vhost-scsi: Add support for host virtualized target
Nicholas A. Bellinger n...@linux-iscsi.org writes: Hi MST & Co, On Thu, 2013-01-17 at 18:43 +0200, Michael S. Tsirkin wrote: On Fri, Sep 07, 2012 at 06:48:14AM +, Nicholas A. Bellinger wrote: From: Nicholas Bellinger n...@linux-iscsi.org Hello Anthony & Co, This is the fourth installment to add host virtualized target support for the mainline tcm_vhost fabric driver using Linux v3.6-rc into QEMU 1.3.0-rc. The series is available directly from the following git branch: git://git.kernel.org/pub/scm/virt/kvm/nab/qemu-kvm.git vhost-scsi-for-1.3 Note the code is cut against yesterday's QEMU head, and despite the name of the tree is based upon mainline qemu.org git code + has thus far been running overnight with 100K IOPs small block 4k workloads using v3.6-rc2+ based target code with RAMDISK_DR backstores. Other than some minor fuzz between jumping from QEMU 1.2.0 - 1.2.50, this series is functionally identical to what's been posted for vhost-scsi RFC-v3 to qemu-devel. Please consider applying these patches for an initial vhost-scsi merge into QEMU 1.3.0-rc code, or let us know what else you'd like to see addressed for this series in order to merge. Thank you! --nab OK what's the status here? We missed 1.3 but let's try not to miss 1.4? Unfortunately, I've not been able to get back to the conversion requested by Paolo for a standalone vhost-scsi PCI device. Is your git repo above up to date? Perhaps I can find someone to help out.. At this point my hands are still full with iSER-target for-3.9 kernel code over the next weeks. What's the v1.4 feature cut-off looking like at this point..? Hard freeze is on February 1st but 1.5 opens up again on the 15th. So the release windows shouldn't have a major impact on merging...
Regards, Anthony Liguori --nab
Re: [Pv-drivers] [PATCH 0/6] VSOCK for Linux upstreaming
On 11/07/2012 12:58 AM, Gerd Hoffmann wrote: On 11/05/12 19:19, Andy King wrote: Hi David, The big and only question is whether anyone can actually use any of this stuff without your proprietary bits? Do you mean the VMCI calls? The VMCI driver is in the process of being upstreamed into the drivers/misc tree. Greg (cc'd on these patches) is actively reviewing that code and we are addressing feedback. Also, there was some interest from RedHat into using vSockets as a unified interface, routed over a hypervisor-specific transport (virtio or otherwise, although for now VMCI is the only one implemented). Can you outline how this can be done? From a quick look over the code it seems like vsock has a hard dependency on vmci, is that correct? When making vsock a generic, reusable kernel service it should be the other way around: vsock should provide the core implementation and an interface where hypervisor-specific transports (vmci, virtio, xenbus, ...) can register themself. This was already done in a hypervisor neutral way using virtio: http://lists.openwall.net/netdev/2008/12/14/8 The concept was Nacked and that led to the abomination of virtio-serial. If an address family for virtualization is on the table, we should reconsider AF_VMCHANNEL. I'd be thrilled to get rid of virtio-serial... Regards, Anthony Liguori cheers, Gerd ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [Qemu-devel] Using PCI config space to indicate config location
Rusty Russell ru...@rustcorp.com.au writes: Gerd Hoffmann kra...@redhat.com writes: So how about this: (1) Add a vendor-specific PCI capability for new-style virtio. Specifies the PCI BAR used for new-style virtio registers. Guests can use it to figure out whether new-style virtio is supported and to map the correct BAR (which will probably be BAR 1 in most cases). This was closer to the original proposal[1], which I really liked (you can lay out BARs however you want). Anthony thought that vendor capabilities were a PCI-e feature, but it seems they're blessed in PCI 2.3. 2.3 was standardized in 2002. Are we confident that vendor extensions play nice with pre-2.3 OSes like Win2k, WinXP, etc? I still think it's a bad idea to rely on something so new in something as fundamental as virtio-pci unless we have to. Regards, Anthony Liguori So let's return to that proposal, giving something like this:

/* IDs for different capabilities. Must all exist. */
/* FIXME: Do we win from separating ISR, NOTIFY and COMMON? */
/* Common configuration */
#define VIRTIO_PCI_CAP_COMMON_CFG 1
/* Notifications */
#define VIRTIO_PCI_CAP_NOTIFY_CFG 2
/* ISR access */
#define VIRTIO_PCI_CAP_ISR_CFG    3
/* Device specific configuration */
#define VIRTIO_PCI_CAP_DEVICE_CFG 4

/* This is the PCI capability header: */
struct virtio_pci_cap {
    u8 cap_vndr;    /* Generic PCI field: PCI_CAP_ID_VNDR */
    u8 cap_next;    /* Generic PCI field: next ptr. */
    u8 cap_len;     /* Generic PCI field: sizeof(struct virtio_pci_cap). */
    u8 cfg_type;    /* One of the VIRTIO_PCI_CAP_*_CFG. */
    u8 bar;         /* Where to find it. */
    u8 unused;
    __le16 offset;  /* Offset within bar. */
    __le32 length;  /* Length. */
};

This means qemu can point the isr_cfg into the legacy area if it wants. In fact, it can put everything in BAR0 if it wants. Thoughts? Rusty.
Re: [Qemu-devel] Using PCI config space to indicate config location
Rusty Russell ru...@rustcorp.com.au writes: Anthony Liguori aligu...@us.ibm.com writes: We'll never remove legacy so we shouldn't plan on it. There are literally hundreds of thousands of VMs out there with the current virtio drivers installed in them. We'll be supporting them for a very, very long time :-) You will be supporting this for qemu on x86, sure. And PPC. As I think we're still in the growth phase for virtio, I prioritize future spec cleanliness pretty high. But I think you'll be surprised how fast this is deprecated: 1) Bigger queues for block devices (guest-specified ringsize) 2) Smaller rings for openbios (guest-specified alignment) 3) All-mmio mode (powerpc) 4) Whatever network features get numbers > 31. We can do all of these things with incremental change to the existing layout. That's the only way what I'm suggesting is different. You want to reorder all of the fields and have a driver flag day. But I strongly suspect we'll decide we need to do the same exercise again in 4 years when we now need to figure out how to take advantage of transactional memory or some other whiz-bang hardware feature. There are a finite number of BARs but each BAR has an almost infinite size. So extending BARs instead of introducing new ones seems like the conservative approach moving forward. I don't think we gain a lot by moving the ISR into a separate BAR. Splitting up registers like that seems weird to me too. Confused. I proposed the same split as you have, just ISR by itself. I disagree with moving the ISR into a separate BAR. That's what seems weird to me. It's very normal to have a mirrored set of registers that are PIO in one bar and MMIO in a different BAR. If we added an additional constraint that BAR1 was mirrored except for the config space and the MSI section was always there, I think the end result would be nice. IOW: But it won't be the same, because we want all that extra stuff, like more feature bits and queue size alignment.
(Admittedly queues past 16TB aren't a killer feature). To make it concrete: Current:

	struct {
		__le32 host_features;	/* read-only */
		__le32 guest_features;	/* read/write */
		__le32 queue_pfn;	/* read/write */
		__le16 queue_size;	/* read-only */
		__le16 queue_sel;	/* read/write */
		__le16 queue_notify;	/* read/write */
		u8 status;		/* read/write */
		u8 isr;			/* read-only, clear on read */
		/* Optional */
		__le16 msi_config_vector;	/* read/write */
		__le16 msi_queue_vector;	/* read/write */
		/* ... device features */
	};

Proposed:

	struct virtio_pci_cfg {
		/* About the whole device. */
		__le32 device_feature_select;	/* read-write */
		__le32 device_feature;		/* read-only */
		__le32 guest_feature_select;	/* read-write */
		__le32 guest_feature;		/* read-only */
		__le16 msix_config;		/* read-write */
		__u8 device_status;		/* read-write */
		__u8 unused;

		/* About a specific virtqueue. */
		__le16 queue_select;	/* read-write */
		__le16 queue_align;	/* read-write, power of 2. */
		__le16 queue_size;	/* read-write, power of 2. */
		__le16 queue_msix_vector;	/* read-write */
		__le64 queue_address;	/* read-write: 0x == DNE. */
	};

	struct virtio_pci_isr {
		__u8 isr;	/* read-only, clear on read */
	};

What I'm suggesting is:

	struct {
		__le32 host_features;	/* read-only */
		__le32 guest_features;	/* read/write */
		__le32 queue_pfn;	/* read/write */
		__le16 queue_size;	/* read-only */
		__le16 queue_sel;	/* read/write */
		__le16 queue_notify;	/* read/write */
		u8 status;		/* read/write */
		u8 isr;			/* read-only, clear on read */
		__le16 msi_config_vector;	/* read/write */
		__le16 msi_queue_vector;	/* read/write */
		__le32 host_feature_select;	/* read/write */
		__le32 guest_feature_select;	/* read/write */
		__le32 queue_pfn_hi;	/* read/write */
	};

With the additional semantic that the virtio-config space is overlayed on top of the register set in BAR0 unless the VIRTIO_PCI_F_SEPARATE_CONFIG feature is acknowledged. This feature acts as a latch and when set, removes the config space overlay.
If the config space overlays the registers, the offset in BAR0 of the overlay depends on whether MSI is enabled or not in the PCI device. BAR1 is an MMIO mirror of BAR0 except that the config space is never overlayed in BAR1 regardless of VIRTIO_PCI_F_SEPARATE_CONFIG. BAR2 contains the config space. A guest can look at BAR1 and BAR2 to determine whether they exist. We could also enforce LE in the per-device config space.
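To illustrate the select/window idiom used by the host_feature_select and guest_feature_select registers in the layouts above, here is a small simulation of reading a 64-bit feature set through a 32-bit window. The struct is a test double standing in for real register access (the field names follow the proposal, but the code itself is not from the thread):

```c
#include <assert.h>
#include <stdint.h>

/* Test double for the select/window pair: writing host_feature_select
 * picks which 32-bit word of the device's feature set the
 * host_features register reads back. */
struct vp_regs {
    uint64_t host_features64;       /* device-internal feature set */
    uint32_t host_feature_select;   /* which 32-bit word to expose */
};

static uint32_t read_host_features_word(const struct vp_regs *r)
{
    return (uint32_t)(r->host_features64 >> (32 * r->host_feature_select));
}

/* Read all 64 feature bits: select word 0, read, select word 1, read. */
static uint64_t read_host_features(struct vp_regs *r)
{
    r->host_feature_select = 0;
    uint64_t f = read_host_features_word(r);
    r->host_feature_select = 1;
    f |= (uint64_t)read_host_features_word(r) << 32;
    return f;
}
```

This is how "infinite feature bits" stay reachable through a fixed-width register: only the select register grows semantics, not the layout.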
Re: [Qemu-devel] Using PCI config space to indicate config location
Avi Kivity a...@redhat.com writes: On 10/09/2012 05:16 AM, Rusty Russell wrote: Anthony Liguori aligu...@us.ibm.com writes: We'll never remove legacy so we shouldn't plan on it. There are literally hundreds of thousands of VMs out there with the current virtio drivers installed in them. We'll be supporting them for a very, very long time :-) You will be supporting this for qemu on x86, sure. As I think we're still in the growth phase for virtio, I prioritize future spec cleanliness pretty high. If a pure ppc hypervisor were on the table, this might have been worthwhile. As it is the codebase is shared, and the Linux drivers are shared, so cleaning up the spec doesn't help the code. Note that distros have been (perhaps unknowingly) shipping virtio-pci for PPC for some time now. So even though there wasn't a hypervisor that supported virtio-pci, the guests already support it and are out there in the wild. There's a lot of value in maintaining legacy support even for PPC. But I think you'll be surprised how fast this is deprecated: 1) Bigger queues for block devices (guest-specified ringsize) 2) Smaller rings for openbios (guest-specified alignment) 3) All-mmio mode (powerpc) 4) Whatever network features get numbers > 31. I don't think we gain a lot by moving the ISR into a separate BAR. Splitting up registers like that seems weird to me too. Confused. I proposed the same split as you have, just ISR by itself. I believe Anthony objects to having the ISR by itself. What is the motivation for that? Right, BARs are a precious resource not to be spent lightly. Having an entire BAR dedicated to a 1-byte register seems like a waste to me.
Regards, Anthony Liguori -- error compiling committee.c: too many arguments to function
Re: [Qemu-devel] Using PCI config space to indicate config location
Rusty Russell ru...@rustcorp.com.au writes: (Topic updated, cc's trimmed). Anthony Liguori aligu...@us.ibm.com writes: Rusty Russell ru...@rustcorp.com.au writes: 4) The only significant change to the spec is that we use PCI capabilities, so we can have infinite feature bits. (see http://lists.linuxfoundation.org/pipermail/virtualization/2011-December/019198.html) We discussed this on IRC last night. I don't think PCI capabilities are a good mechanism to use... PCI capabilities are there to organize how the PCI config space is allocated to allow vendor extensions to co-exist with future PCI extensions. But we've never used the PCI config space within virtio-pci. We do everything in BAR0. I don't think there's any real advantage of using the config space vs. a BAR for virtio-pci. Note before anyone gets confused; we were talking about using the PCI config space to indicate what BAR(s) the virtio stuff is in. An alternative would be to simply specify a new layout format in BAR1. The arguments for a more flexible format that I know of: 1) virtio-pci has already extended the pci-specific part of the configuration once (for MSI-X), so I don't want to assume it won't happen again. configuration is the wrong word here. The virtio-pci BAR0 layout is:

	0..19	virtio-pci registers
	20+	virtio configuration space

MSI-X needed to add additional virtio-pci registers, so now we have:

	0..19	virtio-pci registers
	if MSI-X:
		20..23	virtio-pci MSI-X registers
		24+	virtio configuration space
	else:
		20+	virtio configuration space

I agree, this stinks. But I think we could solve this in a different way. I think we could just move the virtio configuration space to BAR1 by using a transport feature bit. That then frees up the entire BAR0 for use as virtio-pci registers. We can then always include the virtio-pci MSI-X register space and introduce all new virtio-pci registers as simply being appended. This new feature bit then becomes essentially a virtio configuration latch.
When unacked, virtio configuration hides new registers; when acked, those new registers are exposed. Another option is to simply put new registers after the virtio configuration blob. 2) ISTR an argument about mapping the ISR register separately, for performance, but I can't find a reference to it. I think the rationale is that ISR really needs to be PIO but everything else doesn't. PIO is much faster on x86 because it doesn't require walking page tables or instruction emulation to handle the exit. The argument to move the remaining registers to MMIO is to allow 64-bit accesses to registers which isn't possible with PIO. This maps really nicely to non-PCI transports too. This isn't right. No one else can use the PCI layout. While parts are common, other parts are pci-specific (MSI-X and ISR for example), and yet other parts are specified by PCI elsewhere (eg interrupt numbers). But extending the PCI config space (especially dealing with capability allocation) is pretty gnarly and there isn't an obvious equivalent outside of PCI. That's OK, because general changes should be done with feature bits, and the others all have an infinite number. Being the first, virtio-pci has some unique limitations we'd like to fix. There are very devices that we emulate today that make use of extended PCI device registers outside the platform devices (that have no BARs). This sentence confused me? There is a missing few. There are very few devices... Extending the PCI configuration space is unusual for PCI devices. That was the point. Regards, Anthony Liguori Thanks, Rusty.
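The MSI-X-dependent layout described above is what the legacy Linux driver encodes as the VIRTIO_PCI_CONFIG() offset: the device configuration space starts at BAR0 offset 20 without MSI-X and at 24 with it. A minimal sketch of that calculation (the function name here is illustrative):

```c
#include <assert.h>
#include <stdbool.h>

/* Legacy BAR0 layout from the thread: bytes 0..19 are virtio-pci
 * registers; when MSI-X is enabled, bytes 20..23 hold the two MSI-X
 * vector registers and the device configuration moves to offset 24. */
#define VIRTIO_MSI_CONFIG_VECTOR 20
#define VIRTIO_MSI_QUEUE_VECTOR  22

static unsigned vp_config_offset(bool msix_enabled)
{
    return msix_enabled ? 24 : 20;
}
```

This conditional offset is exactly the wart the transport feature bit / BAR1 proposal is trying to eliminate.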
Re: [Qemu-devel] Using PCI config space to indicate config location
Gerd Hoffmann kra...@redhat.com writes: Hi, But I think we could solve this in a different way. I think we could just move the virtio configuration space to BAR1 by using a transport feature bit. Why hard-code stuff? I think it makes a lot of sense to have a capability similar to msi-x which simply specifies bar and offset of the register sets:

	[root@fedora ~]# lspci -vvs4
	00:04.0 SCSI storage controller: Red Hat, Inc Virtio block device
	[ ... ]
		Region 0: I/O ports at c000 [size=64]
		Region 1: Memory at fc029000 (32-bit) [size=4K]
		Capabilities: [40] MSI-X: Enable+ Count=2 Masked-
			Vector table: BAR=1 offset=
			PBA: BAR=1 offset=0800

MSI-X capability is a standard PCI capability which is why lspci can parse it. So we could have for virtio something like this:

		Capabilities: [??] virtio-regs:
			legacy: BAR=0 offset=0
			virtio-pci: BAR=1 offset=1000
			virtio-cfg: BAR=1 offset=1800

This would be a vendor specific PCI capability so lspci wouldn't automatically know how to parse it. You could just as well teach lspci to parse BAR0 to figure out what features are supported. That then frees up the entire BAR0 for use as virtio-pci registers. We can then always include the virtio-pci MSI-X register space and introduce all new virtio-pci registers as simply being appended. BAR0 needs to stay as-is for compatibility reasons. New devices which don't have to care about old guests don't need to provide a 'legacy' register region. A latch feature bit would allow the format to change without impacting compatibility at all. 2) ISTR an argument about mapping the ISR register separately, for performance, but I can't find a reference to it. I think the rationale is that ISR really needs to be PIO but everything else doesn't. PIO is much faster on x86 because it doesn't require walking page tables or instruction emulation to handle the exit. Is this still a pressing issue? With MSI-X enabled ISR isn't needed, correct?
Which would imply that pretty much only old guests without MSI-X support need this, and we don't need to worry that much when designing something new ... It wasn't that long ago that MSI-X wasn't supported. I think we should continue to keep ISR as PIO as it is a fast path. Regards, Anthony Liguori cheers, Gerd
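For context on why that one byte is a fast path: the legacy ISR register is a single read-to-clear status byte (bit 0 signals a virtqueue interrupt, bit 1 a configuration change), so the interrupt handler needs exactly one cheap PIO read and no separate acknowledge. A toy model of that behaviour (the in-memory stand-in for the register and the macro names are illustrative):

```c
#include <assert.h>
#include <stdint.h>

#define VIRTIO_ISR_QUEUE  0x1  /* bit 0: a virtqueue wants service */
#define VIRTIO_ISR_CONFIG 0x2  /* bit 1: device configuration changed */

/* Stand-in for the one-byte read at BAR0+19: reading the ISR returns
 * the pending bits and clears them as a side effect. */
static uint8_t read_isr(uint8_t *isr_reg)
{
    uint8_t v = *isr_reg;
    *isr_reg = 0;   /* clear on read */
    return v;
}
```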
Re: [Qemu-devel] Using PCI config space to indicate config location
Gerd Hoffmann kra...@redhat.com writes: Hi, So we could have for virtio something like this: Capabilities: [??] virtio-regs: legacy: BAR=0 offset=0 virtio-pci: BAR=1 offset=1000 virtio-cfg: BAR=1 offset=1800 This would be a vendor specific PCI capability so lspci wouldn't automatically know how to parse it. Sure, would need a patch to actually parse+print the cap, /me was just trying to make my point clear in a simple way. 2) ISTR an argument about mapping the ISR register separately, for performance, but I can't find a reference to it. I think the rationale is that ISR really needs to be PIO but everything else doesn't. PIO is much faster on x86 because it doesn't require walking page tables or instruction emulation to handle the exit. Is this still a pressing issue? With MSI-X enabled ISR isn't needed, correct? Which would imply that pretty much only old guests without MSI-X support need this, and we don't need to worry that much when designing something new ... It wasn't that long ago that MSI-X wasn't supported. I think we should continue to keep ISR as PIO as it is a fast path. No problem if we allow to have both legacy layout and new layout at the same time. Guests can continue to use ISR @ BAR0 in PIO space for existing virtio devices, even in case they want to use mmio for other registers - all fine. New virtio devices can support MSI-X from day one and decide to not expose a legacy layout PIO bar. I think having BAR1 be an MMIO mirror of the registers + a BAR2 for virtio configuration space is probably not that bad of a solution. Regards, Anthony Liguori cheers, Gerd
Re: [Qemu-devel] Using PCI config space to indicate config location
Rusty Russell ru...@rustcorp.com.au writes: Anthony Liguori aligu...@us.ibm.com writes: Gerd Hoffmann kra...@redhat.com writes: Hi, So we could have for virtio something like this: Capabilities: [??] virtio-regs: legacy: BAR=0 offset=0 virtio-pci: BAR=1 offset=1000 virtio-cfg: BAR=1 offset=1800 This would be a vendor specific PCI capability so lspci wouldn't automatically know how to parse it. Sure, would need a patch to actually parse+print the cap, /me was just trying to make my point clear in a simple way. 2) ISTR an argument about mapping the ISR register separately, for performance, but I can't find a reference to it. I think the rationale is that ISR really needs to be PIO but everything else doesn't. PIO is much faster on x86 because it doesn't require walking page tables or instruction emulation to handle the exit. Is this still a pressing issue? With MSI-X enabled ISR isn't needed, correct? Which would imply that pretty much only old guests without MSI-X support need this, and we don't need to worry that much when designing something new ... It wasn't that long ago that MSI-X wasn't supported. I think we should continue to keep ISR as PIO as it is a fast path. No problem if we allow to have both legacy layout and new layout at the same time. Guests can continue to use ISR @ BAR0 in PIO space for existing virtio devices, even in case they want to use mmio for other registers - all fine. New virtio devices can support MSI-X from day one and decide to not expose a legacy layout PIO bar. I think having BAR1 be an MMIO mirror of the registers + a BAR2 for virtio configuration space is probably not that bad of a solution. Well, we also want to clean up the registers, so how about: BAR0: legacy, as is. If you access this, don't use the others. BAR1: new format virtio-pci layout. If you use this, don't use BAR0. BAR2: virtio-cfg. If you use this, don't use BAR0. BAR3: ISR. If you use this, don't use BAR0. I prefer the cases exclusive (ie.
use one or the other) as a clear path to remove the legacy layout; and leaving the ISR in BAR0 leaves us with an ugly corner case in future (ISR is BAR0 + 19? WTF?). We'll never remove legacy so we shouldn't plan on it. There are literally hundreds of thousands of VMs out there with the current virtio drivers installed in them. We'll be supporting them for a very, very long time :-) I don't think we gain a lot by moving the ISR into a separate BAR. Splitting up registers like that seems weird to me too. It's very normal to have a mirrored set of registers that are PIO in one bar and MMIO in a different BAR. If we added an additional constraints that BAR1 was mirrored except for the config space and the MSI section was always there, I think the end result would be nice. IOW: BAR0[pio]: virtio-pci registers + optional MSI section + virtio-config BAR1[mmio]: virtio-pci registers + MSI section + future extensions BAR2[mmio]: virtio-config We can continue to do ISR access via BAR0 for performance reasons. As to MMIO vs PIO, the BARs are self-describing, so we should explicitly endorse that and leave it to the devices. The detection is simple: if BAR1 has non-zero length, it's new-style, otherwise legacy. I agree that this is the best way to extend, but I think we should still use a transport feature bit. We want to be able to detect within QEMU whether a guest is using these new features because we need to adjust migration state accordingly. Otherwise we would have to detect reads/writes to the new BARs to maintain whether the extended register state needs to be saved. This gets nasty dealing with things like reset. A feature bit simplifies this all pretty well. Regards, Anthony Liguori Thoughts? Rusty. 
Re: [Qemu-devel] Proposal for virtio standardization.
Rusty Russell ru...@rustcorp.com.au writes: Hi all, I've had several requests for a more formal approach to the virtio draft spec, and (after some soul-searching) I'd like to try that. The proposal is to use OASIS as the standards body, as it's fairly light-weight as these things go. For me this means paperwork and setting up a Working Group and getting the right people involved as Voting members starting with the current contributors; for most of you it just means a new mailing list, though I'll be cross-posting any drafts and major changes here anyway. I believe that a documented standard (aka virtio 1.0) will increase visibility and adoption in areas outside our normal linux/kvm universe. There's been some of that already, but this is the clearest path to accelerate it. Not the easiest path, but I believe that a solid I/O standard is a Good Thing for everyone. Yet I also want to decouple new and experimental development from the standards effort; running code comes first. New feature bits and new device numbers should be reservable without requiring a full spec change. So the essence of my proposal is: 1) I start a Working Group within OASIS where we can aim for virtio spec 1.0. 2) The current spec is textually reordered so the core is clearly bus-independent, with PCI, mmio, etc appendices. 3) Various clarifications, formalizations and cleanups to the spec text, and possibly elimination of old deprecated features. 4) The only significant change to the spec is that we use PCI capabilities, so we can have infinite feature bits. (see http://lists.linuxfoundation.org/pipermail/virtualization/2011-December/019198.html) We discussed this on IRC last night. I don't think PCI capabilites are a good mechanism to use... PCI capabilities are there to organize how the PCI config space is allocated to allow vendor extensions to co-exist with future PCI extensions. But we've never used the PCI config space within virtio-pci. We do everything in BAR0. 
I don't think there's any real advantage of using the config space vs. a BAR for virtio-pci. The nice thing about using a BAR is that the region is MMIO or PIO. This maps really nicely to non-PCI transports too. But extending the PCI config space (especially dealing with capability allocation) is pretty gnarly and there isn't an obvious equivalent outside of PCI. There are very devices that we emulate today that make use of extended PCI device registers outside the platform devices (that have no BARs). Regards, Anthony Liguori 5) Changes to the ring layout and other such things are deferred to a future virtio version; whether this is done within OASIS or externally depends on how well this works for the 1.0 release. Thoughts? Rusty.
Re: [PATCH 0/3] virtio-net: inline header support
Rusty Russell ru...@rustcorp.com.au writes: Michael S. Tsirkin m...@redhat.com writes: Thinking about Sasha's patches, we can reduce ring usage for virtio net small packets dramatically if we put virtio net header inline with the data. This can be done for free in case guest net stack allocated extra head room for the packet, and I don't see why would this have any downsides. I've been wanting to do this for the longest time... but... Even though with my recent patches qemu no longer requires header to be the first s/g element, we need a new feature bit to detect this. A trivial qemu patch will be sent separately. There's a reason I haven't done this. I really, really dislike "my implementation isn't broken" feature bits. We could have an infinite number of them, for each bug in each device. This is a bug in the specification. The QEMU implementation pre-dates the specification. All of the actual implementations of virtio relied on the semantics of s/g elements and still do. What's in the specification really doesn't matter when it doesn't agree with all of the existing implementations. Users use implementations, not specifications. The specification really ought to be changed here. Regards, Anthony Liguori
Re: [PATCH 0/3] virtio-net: inline header support
Rusty Russell ru...@rustcorp.com.au writes: Michael S. Tsirkin m...@redhat.com writes: There's a reason I haven't done this. I really, really dislike "my implementation isn't broken" feature bits. We could have an infinite number of them, for each bug in each device. So my plan was to tie this assumption to the new PCI layout. And have a stress-testing patch like the one below in the kernel (see my virtio-wip branch for stuff like this). Turn it on at boot with virtio_ring.torture on the kernel commandline. BTW, I've fixed lguest, but my kvm here (Ubuntu precise, kvm-qemu 1.0) is too old. Building the latest git now... Cheers, Rusty. Subject: virtio: CONFIG_VIRTIO_DEVICE_TORTURE Virtio devices are not supposed to depend on the framing of the scatter-gather lists, but various implementations did. Safeguard this in future by adding an option to deliberately create perverse descriptors. Signed-off-by: Rusty Russell ru...@rustcorp.com.au "Ignore framing" is really a bad idea. You want backends to enforce reasonable framing because guests shouldn't do silly things with framing. For instance, with virtio-blk, if you want decent performance, you absolutely want to avoid bouncing the data. If you're using O_DIRECT in the host to submit I/O requests, then it's critical that all of the s/g elements are aligned to a sector boundary and sized to a sector boundary. Yes, QEMU can handle it if that's not the case, but it would be insanely stupid for a guest not to do this. This is the sort of thing that ought to be enforced in the specification because a guest cannot perform well if it doesn't follow these rules. A spec isn't terribly useful if the result is guest drivers that are slow. There's very little to gain by not enforcing rules around framing and there's a lot to lose if a guest frames incorrectly. In the rare case where we want to make a framing change, we should use feature bits like Michael is proposing.
In this case, we should simply say that with the feature bit, the vnet header can be in the same element as the data but not allow the header to be spread across multiple elements. Regards, Anthony Liguori

diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index 8d5bddb..930a4ea 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -5,6 +5,15 @@ config VIRTIO
 	  bus, such as CONFIG_VIRTIO_PCI, CONFIG_VIRTIO_MMIO, CONFIG_LGUEST,
 	  CONFIG_RPMSG or CONFIG_S390_GUEST.
 
+config VIRTIO_DEVICE_TORTURE
+	bool "Virtio device torture tests"
+	depends on VIRTIO && DEBUG_KERNEL
+	help
+	  This makes the virtio_ring implementation creatively change
+	  the format of requests to make sure that devices are
+	  properly implemented.  This will make your virtual machine
+	  slow *and* unreliable!  Say N.
+
 menu "Virtio drivers"
 
 config VIRTIO_PCI
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index e639584..8893753 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -124,6 +124,149 @@ struct vring_virtqueue
 
 #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq)
 
+#ifdef CONFIG_VIRTIO_DEVICE_TORTURE
+static bool torture;
+module_param(torture, bool, 0644);
+
+struct torture {
+	unsigned int orig_out, orig_in;
+	void *orig_data;
+	struct scatterlist sg[4];
+	struct scatterlist orig_sg[];
+};
+
+static size_t tot_len(struct scatterlist sg[], unsigned num)
+{
+	size_t len, i;
+
+	for (len = 0, i = 0; i < num; i++)
+		len += sg[i].length;
+
+	return len;
+}
+
+static void copy_sg_data(const struct scatterlist *dst, unsigned dnum,
+			 const struct scatterlist *src, unsigned snum)
+{
+	unsigned len;
+	struct scatterlist s, d;
+
+	s = *src;
+	d = *dst;
+
+	while (snum && dnum) {
+		len = min(s.length, d.length);
+		memcpy(sg_virt(&d), sg_virt(&s), len);
+		d.offset += len;
+		d.length -= len;
+		s.offset += len;
+		s.length -= len;
+		if (!s.length) {
+			BUG_ON(snum == 0);
+			src++;
+			snum--;
+			s = *src;
+		}
+		if (!d.length) {
+			BUG_ON(dnum == 0);
+			dst++;
+			dnum--;
+			d = *dst;
+		}
+	}
+}
+
+static bool torture_replace(struct scatterlist **sg,
+			    unsigned int *out,
+			    unsigned int *in,
+			    void **data,
+			    gfp_t gfp)
+{
+	static size_t seed;
+	struct torture *t;
+	size_t outlen, inlen, ourseed, len1;
+	void *buf;
+
+	if (!torture)
+		return true;
+
+	outlen = tot_len(*sg, *out);
+	inlen = tot_len(*sg
Re: [PATCH 0/3] virtio-net: inline header support
Rusty Russell ru...@rustcorp.com.au writes: Anthony Liguori anth...@codemonkey.ws writes: Rusty Russell ru...@rustcorp.com.au writes: Michael S. Tsirkin m...@redhat.com writes: Thinking about Sasha's patches, we can reduce ring usage for virtio net small packets dramatically if we put virtio net header inline with the data. This can be done for free in case guest net stack allocated extra head room for the packet, and I don't see why would this have any downsides. I've been wanting to do this for the longest time... but... Even though with my recent patches qemu no longer requires header to be the first s/g element, we need a new feature bit to detect this. A trivial qemu patch will be sent separately. There's a reason I haven't done this. I really, really dislike my implemention isn't broken feature bits. We could have an infinite number of them, for each bug in each device. This is a bug in the specification. The QEMU implementation pre-dates the specification. All of the actual implementations of virtio relied on the semantics of s/g elements and still do. lguest fix is pending in my queue. lkvm and qemu are broken; lkvm isn't ever going to be merged, so I'm not sure what its status is? But I'm determined to fix qemu, and hence my torture patch to make sure this doesn't creep in again. There are even more implementations out there and I'd wager they all rely on framing. What's in the specification really doesn't matter when it doesn't agree with all of the existing implementations. Users use implementations, not specifications. The specification really ought to be changed here. I'm sorely tempted, except that we're losing a real optimization because of this :( What optimizations? What Michael is proposing is still achievable with a device feature. Are there other optimizations that can be achieved by changing framing that we can't achieve with feature bits? As I mentioned in another note, bad framing decisions can cause performance issues too... 
The specification has long contained the footnote: "The current qemu device implementations mistakenly insist that the first descriptor cover the header in these cases exactly, so a cautious driver should arrange it so." I seem to recall this being a compromise between you and me. I think I objected strongly to this back when you first wrote the spec and you added this to appease me ;-) Regards, Anthony Liguori I'd like to tie this caveat to the PCI capability change, so this note will move to the appendix with the old PCI layout. Cheers, Rusty.
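To make the framing question concrete, here is a sketch (illustrative types and helper names, not code from the thread) of building the transmit scatter-gather list both ways: the legacy arrangement with the virtio-net header as its own first element, and the proposed inline arrangement where header and payload share one element, which only works when the driver allocated headroom so the two are contiguous in memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Field layout matches the virtio-net header described in the spec. */
struct virtio_net_hdr {
    uint8_t flags, gso_type;
    uint16_t hdr_len, gso_size, csum_start, csum_offset;
};

/* Minimal stand-in for a scatterlist entry. */
struct sg { void *addr; size_t len; };

/* Build the out s/g list for one packet; returns the number of entries
 * used.  With the inline-header feature negotiated, header and payload
 * are described by one entry (this assumes hdr immediately precedes
 * data in one contiguous buffer); otherwise the header is its own
 * first element, as the qemu footnote in the spec requires. */
static int build_xmit_sg(struct sg *sg, struct virtio_net_hdr *hdr,
                         void *data, size_t len, int inline_hdr)
{
    if (inline_hdr) {
        sg[0].addr = hdr;
        sg[0].len  = sizeof(*hdr) + len;
        return 1;
    }
    sg[0].addr = hdr;  sg[0].len = sizeof(*hdr);
    sg[1].addr = data; sg[1].len = len;
    return 2;
}
```

Halving the descriptor count per small packet is the ring-usage saving Michael's patches are after.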
Re: [PATCH 4/5] virtio-scsi: Add start/stop functionality for vhost-scsi
On 09/10/2012 01:24 AM, Michael S. Tsirkin wrote: On Mon, Sep 10, 2012 at 08:16:54AM +0200, Paolo Bonzini wrote: Il 09/09/2012 00:40, Michael S. Tsirkin ha scritto: On Fri, Sep 07, 2012 at 06:00:50PM +0200, Paolo Bonzini wrote: Il 07/09/2012 08:48, Nicholas A. Bellinger ha scritto: Cc: Stefan Hajnoczi stefa...@linux.vnet.ibm.com Cc: Zhi Yong Wu wu...@linux.vnet.ibm.com Cc: Michael S. Tsirkin m...@redhat.com Cc: Paolo Bonzini pbonz...@redhat.com Signed-off-by: Nicholas Bellinger n...@linux-iscsi.org --- hw/virtio-pci.c | 2 ++ hw/virtio-scsi.c | 49 + hw/virtio-scsi.h | 1 + 3 files changed, 52 insertions(+), 0 deletions(-) Please create a completely separate device vhost-scsi-pci instead (or virtio-scsi-tcm-pci, or something like that). It is used completely differently from virtio-scsi-pci, it does not make sense to conflate the two. Ideally the name would say how it is different, not what backend it uses. Any good suggestions? I chose the backend name because, ideally, there would be no other difference. QEMU _could_ implement all the goodies in vhost-scsi (such as reservations or ALUA), it just doesn't do that yet. Paolo Then why do you say "It is used completely differently from virtio-scsi-pci"? Isn't it just a different backend? If yes then it should be a backend option, like it is for virtio-net. I don't mean to bike shed here so don't take this as a nack on making it a backend option, but in retrospect, the way we did vhost-net was a mistake even though I strongly advocated for it to be a backend option. The code to do it is really, really ugly. I think it would have made a lot more sense to just make it a device and then have it not use a netdev backend or any other kind of backend split. For instance: qemu -device vhost-net-pci,tapfd=X I know this breaks the model of separate backends and frontends but since vhost-net absolutely requires a tap fd, I think it's better in the long run to not abuse the netdev backend to prevent user confusion.
Having a dedicated backend type that only has one possible option and can only be used by one device is a bit silly too. So I would be in favor of dropping/squashing 3/5 and radically simplifying how this was exposed to the user. I would just take qemu_vhost_scsi_opts and make them device properties. Regards, Anthony Liguori ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH RESEND 5/5] vhost-blk: Add vhost-blk support
Michael S. Tsirkin <m...@redhat.com> writes: On Thu, Jul 19, 2012 at 08:05:42AM -0500, Anthony Liguori wrote: Of course, the million dollar question is why would using AIO in the kernel be faster than using AIO in userspace? Actually for me a more important question is how does it compare with virtio-blk dataplane? I'm not even asking for a benchmark comparison. It's the same API being called from a kernel thread vs. a userspace thread. Why would there be a 60% performance difference between the two? That doesn't make any sense. There's got to be a better justification for putting this in the kernel than just that we can. I completely understand why Christoph's suggestion of submitting BIOs directly would be faster. There's no way to do that in userspace. Regards, Anthony Liguori -- MST
Re: [PATCH RESEND 5/5] vhost-blk: Add vhost-blk support
= vhost_blk_set_status(blk, req->status, status);
+		if (!ret)
+			vhost_add_used_and_signal(&blk->dev, vq, head, ret);
+		break;
+	case VIRTIO_BLK_T_GET_ID:
+		/* TODO: need a real ID string */
+		ret = snprintf(vq->iov[BLK_HDR + 1].iov_base,
+			       VIRTIO_BLK_ID_BYTES, "VHOST-BLK-DISK");
+		status = ret < 0 ? VIRTIO_BLK_S_IOERR : VIRTIO_BLK_S_OK;
+		ret = vhost_blk_set_status(blk, req->status, status);
+		if (!ret)
+			vhost_add_used_and_signal(&blk->dev, vq, head,
+						  VIRTIO_BLK_ID_BYTES);
+		break;
+	default:
+		pr_warn("Unsupported request type %d\n", hdr->type);
+		vhost_discard_vq_desc(vq, 1);
+		ret = -EFAULT;
+		break;
+	}
There doesn't appear to be any error handling in the event that vhost_blk_io_submit fails. It would appear that you leak the ring queue entry since you never push anything onto the used queue. I think you need to handle io_submit() failing too with EAGAIN. Relying on min nr_events == queue_size seems like a dangerous assumption to me, particularly since queue_size tends to be very large and max_nr_events is a fixed-size global pool. To properly handle EAGAIN, you effectively need to implement flow control and back off reading the virtqueue until you can submit requests again. Of course, the million dollar question is why would using AIO in the kernel be faster than using AIO in userspace? When using eventfd, there is no heavyweight exit on the notify path. It should just be the difference between scheduling a kernel thread vs. scheduling a userspace thread. There's simply no way that that's a 60% difference in performance. Regards, Anthony Liguori
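The back-off policy being asked for here can be sketched in plain userspace C. This is a toy model, not vhost-blk code: `drain_queue`, `queued`, and `capacity` are invented names, and `capacity` stands in for the free slots in the fixed aio event pool that make the real io_submit() return -EAGAIN.

```c
/* Toy model of EAGAIN flow control: hand at most 'capacity' of the
 * 'queued' requests to the (simulated) aio layer and report how many
 * were actually submitted. The caller leaves the remainder on the
 * virtqueue and retries after completions free up event slots, rather
 * than discarding descriptors or busy-looping on io_submit(). */
static int drain_queue(int queued, int capacity)
{
    int submitted = 0;

    while (submitted < queued) {
        if (capacity == 0)  /* io_submit() would return -EAGAIN here */
            break;          /* back off; stop consuming the virtqueue */
        capacity--;
        submitted++;
    }
    return submitted;
}
```

The point of the sketch is only the early break: on EAGAIN the loop stops pulling descriptors instead of dropping them, which is the flow control the review asks for.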
Re: [RFC-v2 0/4] tcm_vhost+cmwq fabric driver code for-3.6
On 07/17/2012 04:50 PM, Nicholas A. Bellinger wrote: On Tue, 2012-07-17 at 13:55 -0500, Anthony Liguori wrote: On 07/17/2012 10:05 AM, Michael S. Tsirkin wrote: On Wed, Jul 11, 2012 at 09:15:00PM +, Nicholas A. Bellinger wrote: SNIP It still seems not 100% clear whether this driver will have major userspace using it. And if not, it would be very hard to support a driver when recent userspace does not use it in the end. I don't think this is a good reason to exclude something from the kernel. However, there are good reasons why this doesn't make sense for something like QEMU--specifically because we have a large number of features in our block layer that tcm_vhost would bypass. I can definitely appreciate your concern here as the QEMU maintainer. But perhaps it makes sense for something like native kvm tool. And if it did go into the kernel, we would certainly support it in QEMU. ... But I do think the kernel should carefully consider whether it wants to support an interface like this. This an extremely complicated ABI with a lot of subtle details around state and compatibility. Are you absolutely confident that you can support a userspace application that expects to get exactly the same response from all possible commands in 20 kernel versions from now? Virtualization requires absolutely precise compatibility in terms of bugs and features. This is probably not something the TCM stack has had to consider yet. We most certainly have thought about long term userspace compatibility with TCM. Our userspace code (that's now available in all major distros) is completely forward-compatible with new fabric modules such as tcm_vhost. No update required. I'm not sure we're talking about the same thing when we say compatibility. I'm not talking about the API. I'm talking about the behavior of the commands that tcm_vhost supports. If you add support for a new command, you need to provide userspace a way to disable this command. 
If you change what gets reported for VPD, you need to provide userspace a way to make VPD look like what it did in a previous version. Basically, you need to be able to make a TCM device behave 100% the same as it did in an older version of the kernel. This is unique to virtualization due to live migration. If you migrate from a 3.6 kernel to a 3.8 kernel, you need to make sure that the 3.8 kernel's TCM device behaves exactly like the 3.6 kernel because the guest that is interacting with it does not realize that live migration happened. Yes, you can add knobs via configfs to control this behavior, but I think the question is, what's the plan for this? BTW, I think this is a good thing to cover in Documentation/vhost/tcm_vhost.txt. I think that's probably the only change that's needed here. Regards, Anthony Liguori Also, by virtue of the fact that we are using configfs + rtslib (python object library) on top, it's very easy to keep any type of compatibility logic around in python code. With rtslib, we are able to hide configfs ABI changes from higher level apps. So far we've had a track record of 100% userspace ABI compatibility in mainline since .38, and I don't intend to merge a patch that breaks this any time soon. But if that ever happens, apps using rtslib are not going to be effected. I think a good idea for 3.6 would be to make it depend on CONFIG_STAGING. Then we don't commit to an ABI. I think this is a good idea. Even if it goes in, a really clear policy would be needed wrt the userspace ABI. While tcm_vhost is probably more useful than vhost_blk, it's a much more complex ABI to maintain. As far as I am concerned, the kernel API (eg: configfs directory layout) as it is now in sys/kernel/config/target/vhost/ is not going to change. It's based on the same drivers/target/target_core_fabric_configfs.c generic layout that we've had since .38. 
The basic functional fabric layout in configfs is identical (with fabric dependent WWPN naming of course) regardless of fabric driver, and by virtue of being generic it means we can add things like fabric dependent attributes + parameters in the future for existing fabrics without breaking userspace. So while I agree the ABI is more complex than vhost-blk, the logic in target_core_fabric_configfs.c is a basic ABI fabric definition that we are enforcing across all fabric modules in mainline for long term compatibility. --nab
Re: [RFC-v2 0/4] tcm_vhost+cmwq fabric driver code for-3.6
On 07/18/2012 10:53 AM, Christoph Hellwig wrote: On Wed, Jul 18, 2012 at 08:42:21AM -0500, Anthony Liguori wrote: If you add support for a new command, you need to provide userspace a way to disable this command. If you change what gets reported for VPD, you need to provide userspace a way to make VPD look like what it did in a previous version. Basically, you need to be able to make a TCM device behave 100% the same as it did in an older version of the kernel. This is unique to virtualization due to live migration. If you migrate from a 3.6 kernel to a 3.8 kernel, you need to make sure that the 3.8 kernel's TCM device behaves exactly like the 3.6 kernel because the guest that is interacting with it does not realize that live migration happened. I don't think these strict live migration rules apply to SCSI targets. Real life storage systems get new features and different behaviour with firmware upgrades all the time, and SCSI initiators deal with that just fine. I don't see any reason to be more picky just because we're virtualized. But would this happen while a system is running live? I agree that in general, SCSI targets don't need this, but I'm pretty sure that if a guest probes for a command, you migrate to an old version, and that command is no longer there, badness will ensue. It's different when you're talking about a reboot happening or a disconnect/reconnect due to firmware upgrade. The OS would naturally be reprobing in this case. Regards, Anthony Liguori
Re: [RFC-v2 0/4] tcm_vhost+cmwq fabric driver code for-3.6
On 07/18/2012 11:47 AM, James Bottomley wrote: On Wed, 2012-07-18 at 11:00 -0500, Anthony Liguori wrote: Of course: Think about the consequences: you want to upgrade one array on your SAN. You definitely don't want to shut down your entire data centre to achieve it. In place upgrades on running SANs have been common in enterprise environments for a while. Would firmware upgrades ever result in major OS visible changes though? Maybe OSes are more robust with SCSI than with other types of buses, but I don't think it's safe to completely ignore the problem. I agree that in general, SCSI targets don't need this, but I'm pretty sure that if a guest probes for a command, you migrate to an old version, and that command is no longer there, badness will ensue. What command are we talking about? Operation of initiators is usually just READ and WRITE. So perhaps we might have inline UNMAP ... but the world wouldn't come to an end even if the latter stopped working. Is that true for all OSes? Linux may handle things gracefully if UNMAP starts throwing errors but that doesn't mean that Windows will. There are other cases where this creates problems. Windows (and some other OSes) fingerprint the hardware profile in order to do license enforcement. If the hardware changes beyond a certain amount, then they refuse to boot. Windows does this with a points system and I do believe that INQUIRY responses from any local disks are included in this tally. Most of the complex SCSI stuff is done at start of day; it's actually only then we'd notice things like changes in INQUIRY strings or mode pages. Failover, which is what you're talking about, requires reinstatement of all the operating parameters of the source/target system, but that's not wholly the responsibility of the storage system ... It's the responsibility of the hypervisor when dealing with live migration. 
Regards, Anthony Liguori James It's different when you're talking about a reboot happening or a disconnect/reconnect due to firmware upgrade. The OS would naturally be reprobing in this case.
Re: [RFC-v2 0/4] tcm_vhost+cmwq fabric driver code for-3.6
On 07/17/2012 10:05 AM, Michael S. Tsirkin wrote: On Wed, Jul 11, 2012 at 09:15:00PM +0000, Nicholas A. Bellinger wrote: From: Nicholas Bellinger <n...@linux-iscsi.org> Hi folks, The following is a RFC-v2 series of tcm_vhost target fabric driver code currently in-flight for-3.6 mainline code. After last week's developments along with the help of some new folks, the changelog v1 -> v2 so far looks like: *) Fix drivers/vhost/test.c to use VHOST_NET_FEATURES in patch #1 (Asias He) *) Fix tv_cmd completion -> release SGL memory leak (nab) *) Fix sparse warnings for static variable usage (Fengguang Wu) *) Fix sparse warnings for min() typing + printk format specs (Fengguang Wu) *) Convert to cmwq submission for I/O dispatch (nab + hch) Also following Paolo's request, a patch for hw/virtio-scsi.c that sets scsi_host->max_target=0 that removes the need for virtio-scsi LLD to hardcode VirtIOSCSIConfig->max_id=1 in order to function with tcm_vhost. Note this series has been pushed into target-pending.git/for-next-merge, and should be getting picked up for tomorrow's linux-next build. Please let us know if you have any concerns and/or additional review feedback. Thank you! It still seems not 100% clear whether this driver will have major userspace using it. And if not, it would be very hard to support a driver when recent userspace does not use it in the end. I don't think this is a good reason to exclude something from the kernel. However, there are good reasons why this doesn't make sense for something like QEMU--specifically because we have a large number of features in our block layer that tcm_vhost would bypass. But perhaps it makes sense for something like native kvm tool. And if it did go into the kernel, we would certainly support it in QEMU. But I do think the kernel should carefully consider whether it wants to support an interface like this. This is an extremely complicated ABI with a lot of subtle details around state and compatibility.
Are you absolutely confident that you can support a userspace application that expects to get exactly the same response from all possible commands in 20 kernel versions from now? Virtualization requires absolutely precise compatibility in terms of bugs and features. This is probably not something the TCM stack has had to consider yet. I think a good idea for 3.6 would be to make it depend on CONFIG_STAGING. Then we don't commit to an ABI. I think this is a good idea. Even if it goes in, a really clear policy would be needed wrt the userspace ABI. While tcm_vhost is probably more useful than vhost_blk, it's a much more complex ABI to maintain. Regards, Anthony Liguori For this, you can add a separate Kconfig and source it from drivers/staging/Kconfig. Maybe it needs to be in a separate directory drivers/vhost/staging/Kconfig. Nicholas Bellinger (2): vhost: Add vhost_scsi specific defines tcm_vhost: Initial merge for vhost level target fabric driver Stefan Hajnoczi (2): vhost: Separate vhost-net features from vhost features vhost: make vhost work queue visible drivers/vhost/Kconfig | 6 + drivers/vhost/Makefile | 1 + drivers/vhost/net.c | 4 +- drivers/vhost/tcm_vhost.c | 1609 + drivers/vhost/tcm_vhost.h | 74 ++ drivers/vhost/test.c | 4 +- drivers/vhost/vhost.c | 5 +- drivers/vhost/vhost.h | 6 +- include/linux/vhost.h | 9 + 9 files changed, 1710 insertions(+), 8 deletions(-) create mode 100644 drivers/vhost/tcm_vhost.c create mode 100644 drivers/vhost/tcm_vhost.h -- 1.7.2.5
Re: [PATCH 0/6] tcm_vhost/virtio-scsi WIP code for-3.6
On 07/04/2012 10:05 AM, Michael S. Tsirkin wrote: On Wed, Jul 04, 2012 at 04:52:00PM +0200, Paolo Bonzini wrote: Il 04/07/2012 16:02, Michael S. Tsirkin ha scritto: On Wed, Jul 04, 2012 at 04:24:00AM +, Nicholas A. Bellinger wrote: From: Nicholas Bellingern...@linux-iscsi.org Hi folks, This series contains patches required to update tcm_vhost- virtio-scsi connected hosts- guests to run on v3.5-rc2 mainline code. This series is available on top of target-pending/auto-next here: git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending.git tcm_vhost This includes the necessary vhost changes from Stefan to to get tcm_vhost functioning, along a virtio-scsi LUN scanning change to address a client bug with tcm_vhost I ran into.. Also, tcm_vhost driver has been merged into a single source + header file that is now living under /drivers/vhost/, along with latest tcm_vhost changes from Zhi's tcm_vhost tree. Here are a couple of screenshots of the code in action using raw IBLOCK backends provided by FusionIO ioDrive Duo: http://linux-iscsi.org/images/Virtio-scsi-tcm-vhost-3.5-rc2-3.png http://linux-iscsi.org/images/Virtio-scsi-tcm-vhost-3.5-rc2-4.png So the next steps on my end will be converting tcm_vhost to submit backend I/O from cmwq context, along with fio benchmark numbers between tcm_vhost/virtio-scsi and virtio-scsi-raw using raw IBLOCK iomemory_vsl flash. OK so this is an RFC, not for merge yet? Patch 6 definitely looks RFCish, but patch 5 should go in anyway. Paolo I was talking about 4/6 first of all. Anyway, it's best to split, not to mix RFCs and fixes. What is the use-case that we're targeting for this? I certainly think it's fine to merge this into the kernel... maybe something will use it. But I'm pretty opposed to taking support for this into QEMU. It's going to create more problems than it solves specifically because I have no idea what problem it actually solves. We cannot avoid doing better SCSI emulation in QEMU. 
That cannot be a long term strategy on our part and vhost-scsi isn't going to solve that problem for us. Regards, Anthony Liguori
Re: [Qemu-devel] [PATCH RFC] virtio-pci: add MMIO property
On 03/19/2012 10:56 AM, Michael S. Tsirkin wrote: Currently virtio-pci is specified so that configuration of the device is done through a PCI IO space (via BAR 0 of the virtual PCI device). However, Linux guests happen to use ioread/iowrite/iomap primitives for access, and these work uniformly across memory/io BARs. While PCI IO accesses are faster than MMIO on x86 kvm, MMIO might be helpful on other systems which don't implement PIO or where PIO is slower than MMIO. Add a property to make it possible to tweak the BAR type. Signed-off-by: Michael S. Tsirkin <m...@redhat.com> This is harmless by default but causes segfaults in memory.c when enabled. Thus an RFC until I figure out what's wrong. Doesn't this violate the virtio-pci spec? Making the same vendor/device ID have different semantics depending on a magic flag in QEMU seems like a pretty bad idea to me. Regards, Anthony Liguori --- hw/virtio-pci.c | 16 ++-- hw/virtio-pci.h | 4 2 files changed, 18 insertions(+), 2 deletions(-)
diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
index 28498ec..6f338d2 100644
--- a/hw/virtio-pci.c
+++ b/hw/virtio-pci.c
@@ -655,6 +655,7 @@ void virtio_init_pci(VirtIOPCIProxy *proxy, VirtIODevice *vdev)
 {
     uint8_t *config;
     uint32_t size;
+    uint8_t bar0_type;
 
     proxy->vdev = vdev;
@@ -684,8 +685,14 @@ void virtio_init_pci(VirtIOPCIProxy *proxy, VirtIODevice *vdev)
     memory_region_init_io(&proxy->bar, &virtio_pci_config_ops, proxy,
                           "virtio-pci", size);
-    pci_register_bar(&proxy->pci_dev, 0, PCI_BASE_ADDRESS_SPACE_IO,
-                     &proxy->bar);
+
+    if (proxy->flags & VIRTIO_PCI_FLAG_USE_MMIO) {
+        bar0_type = PCI_BASE_ADDRESS_SPACE_MEMORY;
+    } else {
+        bar0_type = PCI_BASE_ADDRESS_SPACE_IO;
+    }
+
+    pci_register_bar(&proxy->pci_dev, 0, bar0_type, &proxy->bar);
 
     if (!kvm_has_many_ioeventfds()) {
         proxy->flags &= ~VIRTIO_PCI_FLAG_USE_IOEVENTFD;
@@ -823,6 +830,7 @@ static Property virtio_blk_properties[] = {
     DEFINE_PROP_BIT("ioeventfd", VirtIOPCIProxy, flags, VIRTIO_PCI_FLAG_USE_IOEVENTFD_BIT, true),
     DEFINE_PROP_UINT32("vectors", VirtIOPCIProxy, nvectors, 2),
     DEFINE_VIRTIO_BLK_FEATURES(VirtIOPCIProxy, host_features),
+    DEFINE_PROP_BIT("mmio", VirtIOPCIProxy, flags, VIRTIO_PCI_FLAG_USE_MMIO_BIT, false),
     DEFINE_PROP_END_OF_LIST(),
 };
@@ -856,6 +864,7 @@ static Property virtio_net_properties[] = {
     DEFINE_PROP_UINT32("x-txtimer", VirtIOPCIProxy, net.txtimer, TX_TIMER_INTERVAL),
     DEFINE_PROP_INT32("x-txburst", VirtIOPCIProxy, net.txburst, TX_BURST),
     DEFINE_PROP_STRING("tx", VirtIOPCIProxy, net.tx),
+    DEFINE_PROP_BIT("mmio", VirtIOPCIProxy, flags, VIRTIO_PCI_FLAG_USE_MMIO_BIT, false),
     DEFINE_PROP_END_OF_LIST(),
 };
@@ -888,6 +897,7 @@ static Property virtio_serial_properties[] = {
     DEFINE_PROP_HEX32("class", VirtIOPCIProxy, class_code, 0),
     DEFINE_VIRTIO_COMMON_FEATURES(VirtIOPCIProxy, host_features),
     DEFINE_PROP_UINT32("max_ports", VirtIOPCIProxy, serial.max_virtserial_ports, 31),
+    DEFINE_PROP_BIT("mmio", VirtIOPCIProxy, flags, VIRTIO_PCI_FLAG_USE_MMIO_BIT, false),
     DEFINE_PROP_END_OF_LIST(),
 };
@@ -915,6 +925,7 @@ static TypeInfo virtio_serial_info = {
 static Property virtio_balloon_properties[] = {
     DEFINE_VIRTIO_COMMON_FEATURES(VirtIOPCIProxy, host_features),
+    DEFINE_PROP_BIT("mmio", VirtIOPCIProxy, flags, VIRTIO_PCI_FLAG_USE_MMIO_BIT, false),
     DEFINE_PROP_END_OF_LIST(),
 };
@@ -969,6 +980,7 @@ static int virtio_scsi_exit_pci(PCIDevice *pci_dev)
 static Property virtio_scsi_properties[] = {
     DEFINE_PROP_UINT32("vectors", VirtIOPCIProxy, nvectors, 2),
     DEFINE_VIRTIO_SCSI_PROPERTIES(VirtIOPCIProxy, host_features, scsi),
+    DEFINE_PROP_BIT("mmio", VirtIOPCIProxy, flags, VIRTIO_PCI_FLAG_USE_MMIO_BIT, false),
     DEFINE_PROP_END_OF_LIST(),
 };
diff --git a/hw/virtio-pci.h b/hw/virtio-pci.h
index e560428..e6a8861 100644
--- a/hw/virtio-pci.h
+++ b/hw/virtio-pci.h
@@ -24,6 +24,10 @@
 #define VIRTIO_PCI_FLAG_USE_IOEVENTFD_BIT 1
 #define VIRTIO_PCI_FLAG_USE_IOEVENTFD (1 << VIRTIO_PCI_FLAG_USE_IOEVENTFD_BIT)
 
+/* Some guests don't support port IO. Use MMIO instead. */
+#define VIRTIO_PCI_FLAG_USE_MMIO_BIT 2
+#define VIRTIO_PCI_FLAG_USE_MMIO (1 << VIRTIO_PCI_FLAG_USE_MMIO_BIT)
+
 typedef struct {
     PCIDevice pci_dev;
     VirtIODevice *vdev;
Re: [Qemu-devel] [PATCH RFC] virtio-pci: add MMIO property
On 03/19/2012 06:52 PM, Rusty Russell wrote: On Mon, 19 Mar 2012 17:13:06 -0500, Anthony Liguori <anth...@codemonkey.ws> wrote: Maybe just make this a hidden option like x-mmio? x-violate-the-virtio-spec-to-trick-old-linux-drivers-into-working-on-power? To configure the device, we use the first I/O region of the PCI device. Meh, it does sound a little like we are specifying that it's a PCI I/O bar. Let's resurrect the PCI-v2 idea, which is ready to implement now, and a nice cleanup? Detach it from the change-of-ring-format idea which is turning out to be a tarpit. I think that's a sensible approach. Regards, Anthony Liguori Thanks, Rusty.
Re: [RFC 7/11] virtio_pci: new, capability-aware driver.
On 01/10/2012 06:25 PM, Rusty Russell wrote: On Tue, 10 Jan 2012 19:03:36 +0200, Michael S. Tsirkinm...@redhat.com wrote: On Wed, Dec 21, 2011 at 11:03:25AM +1030, Rusty Russell wrote: Yes. The idea that we can alter fields in the device-specific config area is flawed. There may be cases where it doesn't matter, but as an idea it was holed to begin with. We can reduce probability by doing a double read to check, but there are still cases where it will fail. Okay - want me to propose an interface for that? Had a brief chat with BenH (CC'd). I think we should deprecate writing to the config space. Only balloon does it AFAICT, and I can't quite figure out *why* it has an 'active' field. This solves half the problem, of sync guest writes. For the other half, I suggest a generation counter; odd means inconsistent. The guest can poll. BenH also convinced me we should finally make the config space LE if we're going to change things. Since PCI is the most common transport, guest-endian confuses people. And it sucks for really weird machines I think the more important thing to do is require accesses to integers in the config space to always be aligned and to use the appropriate accessor. Non-integer fields should be restricted to byte access. That limits config space entries to 32-bit but also means that there is no need for a generation counter. It's also easier to deal with endian conversion that way. But it means the backend code ends up being much simpler to write (because it behaves more like a normal PCI device). If we're already making the change, the endianness ought to be a feature bit. We should also change the ring (to a single ring, I think). Ack. Descriptors to 24 bytes long (8 byte cookie, 8 byte addr, 4 byte len, 4 byte flags). We might be able to squeeze it into 20 bytes but that means packing. We should support inline, chained or indirect. Let the other side ack by setting flag, cookie and len (if written). 
Moreover, I think we should make all these changes at once (at least, in the spec). That makes it a big change, and it'll take longer to develop, but makes it easy in the long run to differentiate legacy and modern virtio. Ack. Long live virtio2! :-) Regards, Anthony Liguori Thoughts? Rusty.
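Rusty's 24-byte descriptor proposal above can be written down directly. Only the field sizes come from the thread; the struct name, field names, and field order below are guesses for illustration:

```c
#include <stdint.h>

/* Sketch of the proposed virtio2 descriptor: 8-byte cookie, 8-byte
 * address, 4-byte length, 4-byte flags = 24 bytes with natural
 * alignment and no packing. Names are invented; the thread only
 * fixes the sizes. */
struct virtio2_desc {
    uint64_t cookie; /* opaque driver token, echoed back in the ack */
    uint64_t addr;   /* guest-physical buffer address */
    uint32_t len;    /* buffer length; device may overwrite on ack */
    uint32_t flags;  /* inline / chained / indirect, ack bit, etc. */
};
```

The 20-byte variant mentioned in the thread would mean shrinking a field or packing the struct, which is why 24 bytes is the natural layout.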
Re: [RFC 7/11] virtio_pci: new, capability-aware driver.
On 01/11/2012 09:12 AM, Michael S. Tsirkin wrote: On Wed, Jan 11, 2012 at 07:30:34AM -0600, Anthony Liguori wrote: On 01/10/2012 06:25 PM, Rusty Russell wrote: On Tue, 10 Jan 2012 19:03:36 +0200, Michael S. Tsirkinm...@redhat.com wrote: On Wed, Dec 21, 2011 at 11:03:25AM +1030, Rusty Russell wrote: Yes. The idea that we can alter fields in the device-specific config area is flawed. There may be cases where it doesn't matter, but as an idea it was holed to begin with. We can reduce probability by doing a double read to check, but there are still cases where it will fail. Okay - want me to propose an interface for that? Had a brief chat with BenH (CC'd). I think we should deprecate writing to the config space. Only balloon does it AFAICT, and I can't quite figure out *why* it has an 'active' field. This solves half the problem, of sync guest writes. For the other half, I suggest a generation counter; odd means inconsistent. The guest can poll. BenH also convinced me we should finally make the config space LE if we're going to change things. Since PCI is the most common transport, guest-endian confuses people. And it sucks for really weird machines I think the more important thing to do is require accesses to integers in the config space to always be aligned and to use the appropriate accessor. Non-integer fields should be restricted to byte access. That limits config space entries to 32-bit but also means that there is no need for a generation counter. It's also easier to deal with endian conversion that way. This is similar to what we have now. But it's still buggy: e.g. if guest updates MAC byte by byte, we have no way to know when it's done doing so. This is no different than a normal network card. You have to use a secondary register to trigger an update. Regards, Anthony Liguori But it means the backend code ends up being much simpler to write (because it behaves more like a normal PCI device). 
If we're already making the change, the endianness ought to be a feature bit. We should also change the ring (to a single ring, I think). Ack. Descriptors to 24 bytes long (8 byte cookie, 8 byte addr, 4 byte len, 4 byte flags). We might be able to squeeze it into 20 bytes but that means packing. We should support inline, chained or indirect. Let the other side ack by setting flag, cookie and len (if written). Moreover, I think we should make all these changes at once (at least, in the spec). That makes it a big change, and it'll take longer to develop, but makes it easy in the long run to differentiate legacy and modern virtio. Ack. Long live virtio2! :-) Regards, Anthony Liguori Thoughts? Rusty.
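The generation-counter scheme debated in this thread (odd while the device is rewriting config, even when stable, guest polls) can be modeled in a few lines of userspace C. Everything here is illustrative: the struct, the 6-byte mac field, and all names are invented, and a real implementation would need proper memory barriers around the counter reads.

```c
#include <stdint.h>
#include <string.h>

/* Toy config space with a generation counter in front of it. */
struct toy_config {
    uint32_t generation; /* odd = device mid-update */
    uint8_t mac[6];
};

/* Poll until a full read of the mac is bracketed by the same even
 * generation value, i.e. the snapshot is consistent. */
static int read_mac_stable(const struct toy_config *cfg, uint8_t out[6])
{
    int tries;

    for (tries = 0; tries < 1000; tries++) {
        uint32_t before = cfg->generation;

        if (before & 1)
            continue;           /* odd: device mid-update, poll again */
        memcpy(out, cfg->mac, 6);
        if (cfg->generation == before)
            return 0;           /* counter unchanged: consistent snapshot */
    }
    return -1;                  /* still unstable; caller retries later */
}

/* Minimal self-check: a stable (even-generation) config reads back intact. */
static int toy_selfcheck(void)
{
    struct toy_config cfg = { 2, { 0x52, 0x54, 0x00, 0x12, 0x34, 0x56 } };
    uint8_t mac[6];

    return read_mac_stable(&cfg, mac) == 0 && memcmp(mac, cfg.mac, 6) == 0;
}
```

This is the polling half of Rusty's suggestion; the word-access alternative argued for above avoids the counter entirely by making each field read atomic.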
Re: [RFC 7/11] virtio_pci: new, capability-aware driver.
On 01/11/2012 09:21 AM, Michael S. Tsirkin wrote: On Wed, Jan 11, 2012 at 09:15:49AM -0600, Anthony Liguori wrote: This is similar to what we have now. But it's still buggy: e.g. if guest updates MAC byte by byte, we have no way to know when it's done doing so. This is no different than a normal network card. You have to use a secondary register to trigger an update. Regards, Anthony Liguori Possible but doesn't let us layer nicely to allow unchanged drivers that work with all transports (new pci, old pci, non pci). If we declare config space LE, then we have to touch all drivers. There's no way around it because the virtio API is byte-based, not word based. This is why I'm suggesting making the virtio API (and then the virtio-pci ABI) word based. It gives us the flexibility to make endianness a property of the transport and not a property of the devices. Regards, Anthony Liguori
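Word-based access is what lets the endian conversion live entirely in the transport: one little-endian accessor per word size covers every device field, with no per-driver changes. A minimal sketch (names invented):

```c
#include <stdint.h>

/* Decode an aligned 32-bit little-endian config word independent of
 * host endianness. With byte-granular config access the transport
 * cannot do this on the driver's behalf, because it has no idea where
 * the multi-byte field boundaries are. */
static uint32_t cfg_read_le32(const uint8_t b[4])
{
    return (uint32_t)b[0] |
           ((uint32_t)b[1] << 8) |
           ((uint32_t)b[2] << 16) |
           ((uint32_t)b[3] << 24);
}

/* Helper for the self-check below: four LE wire bytes for 0x12345678. */
static uint32_t demo_le32(void)
{
    const uint8_t wire[4] = { 0x78, 0x56, 0x34, 0x12 };
    return cfg_read_le32(wire);
}
```

Because the shift-and-or decode never depends on host byte order, the same helper works on big-endian guests, which is the point being made about moving endianness into the transport.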
Re: [RFC 7/11] virtio_pci: new, capability-aware driver.
On 01/11/2012 09:45 AM, Michael S. Tsirkin wrote: On Wed, Jan 11, 2012 at 09:28:27AM -0600, Anthony Liguori wrote: On 01/11/2012 09:21 AM, Michael S. Tsirkin wrote: On Wed, Jan 11, 2012 at 09:15:49AM -0600, Anthony Liguori wrote: This is similar to what we have now. But it's still buggy: e.g. if guest updates MAC byte by byte, we have no way to know when it's done doing so. This is no different than a normal network card. You have to use a secondary register to trigger an update. Regards, Anthony Liguori Possible but doesn't let us layer nicely to allow unchanged drivers that work with all transports (new pci, old pci, non pci). If we declare config space LE, then we have to touch all drivers. There's no way around it because the virtio API is byte-based, not word based. Fine but don't we want to be compatible with old hypervisors? We can modify the drivers to work either with a virtio1 or virtio2 transport. If the only difference is that we move to word access instead of byte access for the config space, it's a nop because drivers don't rely on sub-word access today. This is why I'm suggesting making the virtio API (and then the virtio-pci ABI) word based. It gives us the flexibility to make endianness a property of the transport and not a property of the devices. Regards, Anthony Liguori Some fields are 64 bit, this is still tricky to do atomically. What's the objection to using a config VQ? Then we move very far away from something that looks like a PCI device. The problem we're having here is specifically where we've deviated from what a normal PCI device would do. Fixing that by deviating even further seems counter intuitive to me. Regards, Anthony Liguori
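For the 64-bit fields Michael raises, the usual answer under word-sized access is a hi/lo/hi read: re-read the high word until it is stable across the low-word read. This is a sketch against plain memory rather than a real BAR, with invented names, and it assumes the device updates the two halves high-word-last:

```c
#include <stdint.h>

/* Read a 64-bit value exposed as two 32-bit words without returning a
 * torn result: if the high half changed while we read the low half,
 * the mismatch forces a retry. */
static uint64_t read64_hi_lo_hi(const volatile uint32_t *lo,
                                const volatile uint32_t *hi)
{
    uint32_t h0, l, h1;

    do {
        h0 = *hi;
        l = *lo;
        h1 = *hi;
    } while (h0 != h1); /* high word moved: low read may straddle a carry */
    return ((uint64_t)h1 << 32) | l;
}

/* Helper for the self-check below: wraps fixed halves in volatile slots. */
static uint64_t demo_read64(uint32_t lo_val, uint32_t hi_val)
{
    volatile uint32_t lo = lo_val, hi = hi_val;
    return read64_hi_lo_hi(&lo, &hi);
}
```

It trades a possible retry loop for atomicity, which is the "tricky" part being weighed against a config VQ in the exchange above.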
Re: [RFC 7/11] virtio_pci: new, capability-aware driver.
On 01/11/2012 11:08 AM, Michael S. Tsirkin wrote: Not sure what you mean. Using VQ is DMA which is pretty common for PCI. Do you know of a network device that obtains its mac address via a DMA transaction? Regards, Anthony Liguori
Re: [PATCH] virtio-blk: Add stats VQ to collect information about devices
On 08/18/2011 11:29 AM, Sasha Levin wrote: On Thu, 2011-08-18 at 08:10 -0700, Avi Kivity wrote: On 08/17/2011 09:38 PM, Sasha Levin wrote: On Wed, 2011-08-17 at 16:00 -0700, Avi Kivity wrote: On 08/16/2011 12:47 PM, Sasha Levin wrote: This patch adds support for an optional stats vq that works similary to the stats vq provided by virtio-balloon. The purpose of this change is to allow collection of statistics about working virtio-blk devices to easily analyze performance without having to tap into the guest. Why can't you get the same info from the host? i.e. read sectors? Some of the stats you can collect from the host, but some you can't. The ones you can't include all the timing statistics and the internal queue statistics (read/write merges). Surely you can time the actual amount of time the I/O takes? It doesn't account for the virtio round-trip, but does it matter? Why is the merge count important for the host? I assumed that the time the request spends in the virtio layer is (somewhat) significant, specially since that this is something that adds up over time. Merge count can be useful for several testing scenarios (I'll describe the reasoning behind this patch below). The idea behind providing all of the stats on the stats vq (which is basically what you see in '/dev/block/[device]/stats') is to give a consistent snapshot of the state of the device. What can you do with it? I was actually planning on submitting another patch that would add something similar into virtio-net. My plan was to enable collecting statistics regarding memory, network and disk usage in a simple manner without accessing guests. Why not just add an interface that lets you read files from a guest either via a guest agent (like qemu-ga) or a purpose built PV device? That would let you access the guest's full sysfs which seems to be quite a lot more useful long term than adding a bunch of specific interfaces. 
Regards, Anthony Liguori
Re: [PATCH] virtio-pci: add softlinks between virtio and pci
On 01/05/2011 01:17 PM, Michael S. Tsirkin wrote: We sometimes need to map between the virtio device and the given pci device. One such use is an OS installer that gets the boot pci device from the BIOS and needs to find the relevant block device. Since it can't, installation fails. I have no objection to this patch but I'm a tad confused by the description. I assume you mean the installer is querying the boot device via int13 get driver parameters such that it returns the pci address of the device? Or is it querying geometry information and then trying to find the best match block device? If it's the former, I don't really understand the need for a backlink since the PCI address gives you a link to the block device. OTOH, if it's the latter, it would make sense but then your description doesn't really make much sense. At any rate, a better commit message would be helpful in explaining the need for this. Regards, Anthony Liguori Supply softlinks between these to make it possible. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- Gleb, could you please ack that this patch below will be enough to fix the installer issue that you see?
 drivers/virtio/virtio_pci.c | 18 +-
 1 files changed, 17 insertions(+), 1 deletions(-)

diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
index ef8d9d5..06eb2f8 100644
--- a/drivers/virtio/virtio_pci.c
+++ b/drivers/virtio/virtio_pci.c
@@ -25,6 +25,7 @@
 #include <linux/virtio_pci.h>
 #include <linux/highmem.h>
 #include <linux/spinlock.h>
+#include <linux/sysfs.h>

 MODULE_AUTHOR("Anthony Liguori <aligu...@us.ibm.com>");
 MODULE_DESCRIPTION("virtio-pci");
@@ -667,8 +668,21 @@ static int __devinit virtio_pci_probe(struct pci_dev *pci_dev,
 	if (err)
 		goto out_set_drvdata;

-	return 0;
+	err = sysfs_create_link(&pci_dev->dev.kobj, &vp_dev->vdev.dev.kobj,
+				"virtio_device");
+	if (err)
+		goto out_register_device;
+
+	err = sysfs_create_link(&vp_dev->vdev.dev.kobj, &pci_dev->dev.kobj,
+				"bus_device");
+	if (err)
+		goto out_create_link;
+
+	return 0;
+out_create_link:
+	sysfs_remove_link(&pci_dev->dev.kobj, "virtio_device");
+out_register_device:
+	unregister_virtio_device(&vp_dev->vdev);
 out_set_drvdata:
 	pci_set_drvdata(pci_dev, NULL);
 	pci_iounmap(pci_dev, vp_dev->ioaddr);
@@ -685,6 +699,8 @@ static void __devexit virtio_pci_remove(struct pci_dev *pci_dev)
 {
 	struct virtio_pci_device *vp_dev = pci_get_drvdata(pci_dev);

+	sysfs_remove_link(&vp_dev->vdev.dev.kobj, "bus_device");
+	sysfs_remove_link(&pci_dev->dev.kobj, "virtio_device");
 	unregister_virtio_device(&vp_dev->vdev);
 }
Re: [PATCH] virtio-pci: add softlinks between virtio and pci
On 01/05/2011 01:38 PM, Michael S. Tsirkin wrote: On Wed, Jan 05, 2011 at 01:28:34PM -0600, Anthony Liguori wrote: On 01/05/2011 01:17 PM, Michael S. Tsirkin wrote: We sometimes need to map between the virtio device and the given pci device. One such use is an OS installer that gets the boot pci device from the BIOS and needs to find the relevant block device. Since it can't, installation fails. I have no objection to this patch but I'm a tad confused by the description. I assume you mean the installer is querying the boot device via int13 get driver parameters such that it returns the pci address of the device? Or is it querying geometry information and then trying to find the best match block device? I think it's the former. If it's the former, I don't really understand the need for a backlink since the PCI address gives you a link to the block device. It does? How does it? Okay, after some more discussion, I think I better understand what's going on here. The installer needs to figure out the boot device. It does this by querying the BIOS and the BIOS can only give it back a PCI address. Right now, if you follow the PCI sysfs path, you cannot reach the actual virtio device because the PCI device only refers to the bus. This is because virtio has a separate struct device from the PCI struct device. These explicit links establish the relationship. A lot of this would be cleaner if virtio didn't force an independent struct device and could simply make use of an already existing struct device. Regards, Anthony Liguori OTOH, if it's the latter, it would make sense but then your description doesn't really make much sense. At any rate, a better commit message would be helpful in explaining the need for this. Regards, Anthony Liguori Supply softlinks between these to make it possible. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- Gleb, could you please ack that this patch below will be enough to fix the installer issue that you see?
Re: [PATCH] virtio-pci: add softlinks between virtio and pci
On 01/05/2011 02:05 PM, Michael S. Tsirkin wrote: On Wed, Jan 05, 2011 at 01:28:34PM -0600, Anthony Liguori wrote: On 01/05/2011 01:17 PM, Michael S. Tsirkin wrote: We sometimes need to map between the virtio device and the given pci device. One such use is an OS installer that gets the boot pci device from the BIOS and needs to find the relevant block device. Since it can't, installation fails. I have no objection to this patch but I'm a tad confused by the description. I assume you mean the installer is querying the boot device via int13 get driver parameters such that it returns the pci address of the device? Or is it querying geometry information and then trying to find the best match block device? If it's the former, I don't really understand the need for a backlink since the PCI address gives you a link to the block device. OTOH, if it's the latter, it would make sense but then your description doesn't really make much sense. At any rate, a better commit message would be helpful in explaining the need for this. Regards, Anthony Liguori OK just to clarify: we get the pci address from the BIOS and need the virtio device to get at the linux device (e.g. block) in the end. Thus the link from pci to virtio. I also added a backlink since I thought it's handy. Does this answer the questions? Rusty rewrites my commit logs anyway, he has better style :) It helps. The real reason this is needed is because in a normal device, there is only one struct device whereas with virtio-pci, the virtio-pci device has a struct device and then the actual virtio device has another one. There's probably a better way to handle this in sysfs making virtio-pci a proper bus with only a single device as a child or something like that. But the links are probably an easier solution. Regards, Anthony Liguori Supply softlinks between these to make it possible. Signed-off-by: Michael S.
Tsirkin m...@redhat.com --- Gleb, could you please ack that this patch below will be enough to fix the installer issue that you see?
Re: [Qemu-devel] Re: [PATCH] Implement a virtio GPU transport
On 11/12/2010 06:14 AM, Ian Molton wrote: On 10/11/10 17:47, Anthony Liguori wrote: On 11/10/2010 11:22 AM, Ian Molton wrote: Ping ? I think the best way forward is to post patches. I posted links to the git trees. I can post patches, but they are *large*. Do you really want me to post them? Yes, and they have to be split up into something reviewable. To summarize what I was trying to express in the thread, I think this is not the right long term architecture but am not opposed to it as a short term solution. I think having a new virtio device is a bad design choice but am not totally opposed to it. Ok! (I agree that this should be a short term solution :) ) If you want to go for the path of integration, you're going to have to fix all of the coding style issues and make the code fit into QEMU. Dropping a bunch of junk into target-i386/ is not making the code fit into QEMU. I agree. How about hw/gl for the renderer and hw/ for the virtio module? That would be fine. If you post just what you have now in patch form, I can try to provide more concrete advice ignoring the coding style problems. I can post patches, although I don't think LKML would appreciate the volume! I can post them to the qemu list if you do. Yes, qemu is where I was suggesting you post them. Regards, Anthony Liguori -Ian
Re: [Qemu-devel] Re: [PATCH] Implement a virtio GPU transport
On 11/10/2010 11:22 AM, Ian Molton wrote: Ping ? I think the best way forward is to post patches. To summarize what I was trying to express in the thread, I think this is not the right long term architecture but am not opposed to it as a short term solution. I think having a new virtio device is a bad design choice but am not totally opposed to it. My advice is that using virtio-serial + an external tool is probably the least amount of work to get something working and usable with QEMU. If you want to go for the path of integration, you're going to have to fix all of the coding style issues and make the code fit into QEMU. Dropping a bunch of junk into target-i386/ is not making the code fit into QEMU. If you post just what you have now in patch form, I can try to provide more concrete advice ignoring the coding style problems. Regards, Anthony Liguori
Re: [Qemu-devel] Re: [PATCH] Implement a virtio GPU transport
On 11/03/2010 01:03 PM, Ian Molton wrote: The virtio driver enforces the PID field and understands the packet format used. It's better than using serial. It's also just one driver, which doesn't have any special interdependencies and can be extended or got rid of in future if and when better things come along. Why is it better than using virtio-serial? In the very, very short term, I think an external backend to QEMU also makes a lot of sense because that's something that Just Works today. Who's written that? The 2007 patch I've been working on and updating simply fails to work altogether without huge alterations on current qemu. My current patch touches a tiny part of the qemu sources. It works today. But it's not at all mergeable in the current form. If you want to do the work of getting it into a mergeable state (cleaning up the coding style, moving it to hw/, etc.) then I'm willing to consider it. But I don't think a custom virtio transport is the right thing to do here. However, if you want something that Just Works with the least amount of code possible, just split it into a separate process and we can stick it in a contrib/ directory or something. I think we can consider integrating it into QEMU (or at least simplifying the execution of the backend) but integrating into QEMU is going to require an awful lot of the existing code to be rewritten. Keeping it separate has the advantage of allowing something to Just Work as an interim solution as we wait for proper support in Spice. I don't know why you think integrating it into qemu is hard? I've already done it. Adding a file that happens to compile as part of qemu even though it doesn't actually integrate with qemu in any meaningful way is not integrating. That's just build system manipulation. Regards, Anthony Liguori I added one virtio driver and a separate offscreen renderer. It touches the qemu code in *one* place. There should be no need to rewrite anything.
Re: [Qemu-devel] Re: [PATCH] Implement a virtio GPU transport
On 11/01/2010 05:42 AM, Avi Kivity wrote: On 10/28/2010 03:52 PM, Ian Molton wrote: On 28/10/10 15:24, Avi Kivity wrote: The caller is intended to block as the host must perform GL rendering before allowing the guest's process to continue. Why is that? Can't we pipeline the process? No, not really. The guest may call for the scene to be rendered at any time and we have to wait for that to happen before we can return the data to it. Waiting for a response is fine, but can't the guest issue a second batch while waiting for the first? In a threaded application, I think you mean, but all RPCs are dispatched holding a global lock, so even within a threaded application only a single GL call will be executed at a time. The other scenario would be multiple applications trying to use GL but AFAICT, this is not supported in the current model. Regards, Anthony Liguori
Re: [Qemu-devel] Re: [PATCH] Implement a virtio GPU transport
On 11/01/2010 06:53 AM, Alon Levy wrote: The alternative proposal is Spice, which so far no one has mentioned. Right now, Spice seems to be taking the right approach to guest 3d support. While we (speaking as part of the SPICE developers) want to have the same support in our virtual GPU for 3d as we have for 2d, we just don't at this point of time. Yes, but I think the point is that there are two general approaches to supporting 3d that are being proposed. One approach is to add an RPC layer at the OpenGL level which essentially passes through the host OpenGL stack. That's what virtio-gl is. This has existed for quite a while and there are multiple transports for it. It supports serial ports, TCP sockets, a custom ISA extension for x86, and now a custom virtio transport. Another approach would be to have a virtual GPU and to implement GPU-level commands for 3d. I have been repeatedly told that much of the complexity of Spice is absolutely needed for 3d and that that's a major part of the design. GPU-level support for 3d operations has a number of advantages, mainly that it's more reasonably portable to things like Windows since the 3d commands can be a superset of both OpenGL and Direct3d. Also, Spice has an abstraction layer that doesn't simply pass through graphics commands, but translates/sanitizes them first. That's another advantage over OpenGL passthrough. Without a layer to sanitize commands, a guest can do funky things with the host or other guests. I think a Spice-like approach is the best thing long term. In the short term, I think doing the GL marshalling over virtio-serial makes a ton of sense since the kernel driver is already present upstream. It exists exactly for things like this. In the very, very short term, I think an external backend to QEMU also makes a lot of sense because that's something that Just Works today.
I think we can consider integrating it into QEMU (or at least simplifying the execution of the backend) but integrating into QEMU is going to require an awful lot of the existing code to be rewritten. Keeping it separate has the advantage of allowing something to Just Work as an interim solution as we wait for proper support in Spice. Regards, Anthony Liguori I understand others are skeptical, but this seems simple and if it works and you're happy to maintain it I'm happy to let you do it :) Cheers, Rusty.
Re: [Qemu-devel] Re: [PATCH] Implement a virtio GPU transport
On 10/28/2010 09:24 AM, Avi Kivity wrote: On 10/28/2010 01:54 PM, Ian Molton wrote: Well, I like to review an implementation against a spec. True, but then all that would prove is that I can write a spec to match the code. It would also allow us to check that the spec matches the requirements. Those two steps are easier than checking that the code matches the requirements. I'm extremely sceptical of any GL passthrough proposal. There have literally been half a dozen over the years and they never seem to leave proof-of-concept phase. My (limited) understanding is that it's a fundamentally hard problem that no one has adequately solved yet. A specification matters an awful lot less than an explanation of how the problem is being solved in a robust fashion such that it can be reviewed by people with a deeper understanding of the problem space. Regards, Anthony Liguori The code is proof of concept. The kernel bit is pretty simple, but I'd like to get some idea of whether the rest of the code will be accepted given that there's not much point in having any one (or two) of these components exist without the other. I guess some graphics people need to be involved. Better, but still unsatisfying. If the server is busy, the caller would block. I guess it's expected since it's called from ->fsync(). I'm not sure whether that's the best interface, perhaps aio_writev is better. The caller is intended to block as the host must perform GL rendering before allowing the guest's process to continue. Why is that? Can't we pipeline the process? The only real bottleneck is that processes will block trying to submit data if another process is performing rendering, but that will only be solved when the renderer is made multithreaded. The same would happen on a real GPU if it had only one queue too. If you look at the host code, you can see that the data is already buffered per-process, in a pretty sensible way.
If the renderer itself were made a separate thread, then this problem magically disappears (the queuing code on the host is pretty fast). Well, this is out of my area of expertise. I don't like it, but if it's acceptable to the gpu people, okay. In testing, the overhead of this was pretty small anyway. Running a few dozen glxgears and a copy of ioquake3 simultaneously on an intel video card managed the same framerate with the same CPU utilisation, both with the old code and the version I just posted. Contention during rendering just isn't much of an issue.
Re: [PATCHv2] vhost-net: add dhclient work-around from userspace
On 06/29/2010 08:04 AM, Michael S. Tsirkin wrote: On Tue, Jun 29, 2010 at 12:36:47AM -0700, David Miller wrote: From: Michael S. Tsirkin m...@redhat.com Date: Mon, 28 Jun 2010 13:08:07 +0300 Userspace virtio server has the following hack so guests rely on it, and we have to replicate it, too: Use port number to detect incoming IPv4 DHCP response packets, and fill in the checksum for these. The issue we are solving is that on linux guests, some apps that use recvmsg with AF_PACKET sockets, don't know how to handle CHECKSUM_PARTIAL; The interface to return the relevant information was added in 8dc4194474159660d7f37c495e3fc3f10d0db8cc, and older userspace does not use it. One important user of recvmsg with AF_PACKET is dhclient, so we add a work-around just for DHCP. Don't bother applying the hack to IPv6 as userspace virtio does not have a work-around for that - let's hope guests will do the right thing wrt IPv6. Signed-off-by: Michael S. Tsirkin m...@redhat.com Yikes, this is awful too. Nothing in the kernel should be mucking around with procotol packets like this by default. In particular, what the heck does port 67 mean? Locally I can use it for whatever I want for my own purposes, I don't have to follow the conventions for service ports as specified by the IETF. But I can't have the packet checksum state be left alone for port 67 traffic on a box using virtio because you have this hack there. And yes it's broken on machines using the qemu thing, but at least the hack there is restricted to userspace. Yes, and I think it was a mistake to add the hack there. This is what prevented applications from using the new interface in the 3 years since it was first introduced. It's far more complicated than that. dhclient is part of ISC's DHCP package. They do not have a public SCM and instead require you to join their Software Guild to get access to it.
This problem was identified in one distribution and the patch was pushed upstream but because they did not have a public SCM, most other distributions did not see the fix until it appeared in a release. ISC has a pretty long release cycle historically. ISC's had the fix for a long time but there was a 3-year gap in their releases and since their SCM isn't public, users are stuck with the last release. This hack makes sense in QEMU as we have a few hacks like this to fix broken guests. A primary use of virtualization is to run old applications so it makes sense for us to do that. I don't think it makes sense for vhost to do this. These guests are so old that they don't have the requisite features to achieve really high performance anyway. I've always thought making vhost totally transparent was a bad idea and this is one of the reasons. We can do a lot of ugly things in userspace that we shouldn't be doing in the kernel. Regards, Anthony Liguori
Re: [PATCHv2] vhost-net: add dhclient work-around from userspace
On 06/30/2010 05:31 PM, Michael S. Tsirkin wrote: On Wed, Jun 30, 2010 at 05:08:11PM -0500, Anthony Liguori wrote: On 06/29/2010 08:04 AM, Michael S. Tsirkin wrote: On Tue, Jun 29, 2010 at 12:36:47AM -0700, David Miller wrote: From: Michael S. Tsirkinm...@redhat.com Date: Mon, 28 Jun 2010 13:08:07 +0300 Userspace virtio server has the following hack so guests rely on it, and we have to replicate it, too: Use port number to detect incoming IPv4 DHCP response packets, and fill in the checksum for these. The issue we are solving is that on linux guests, some apps that use recvmsg with AF_PACKET sockets, don't know how to handle CHECKSUM_PARTIAL; The interface to return the relevant information was added in 8dc4194474159660d7f37c495e3fc3f10d0db8cc, and older userspace does not use it. One important user of recvmsg with AF_PACKET is dhclient, so we add a work-around just for DHCP. Don't bother applying the hack to IPv6 as userspace virtio does not have a work-around for that - let's hope guests will do the right thing wrt IPv6. Signed-off-by: Michael S. Tsirkinm...@redhat.com Yikes, this is awful too. Nothing in the kernel should be mucking around with procotol packets like this by default. In particular, what the heck does port 67 mean? Locally I can use it for whatever I want for my own purposes, I don't have to follow the conventions for service ports as specified by the IETF. But I can't have the packet checksum state be left alone for port 67 traffic on a box using virtio because you have this hack there. And yes it's broken on machines using the qemu thing, but at least the hack there is restricted to userspace. Yes, and I think it was a mistake to add the hack there. This is what prevented applications from using the new interface in the 3 years since it was first introduced. It's far more complicated than that. dhclient is part of ISC's DHCP package. They do not have a public SCM and instead require you to join their Software Guild to get access to it. 
This problem was identified in one distribution and the patch was pushed upstream but because they did not have a public SCM, most other distributions did not see the fix until it appeared in a release. ISC has a pretty long release cycle historically. ISC's had the fix for a long time but there was a 3-year gap in their releases and since their SCM isn't public, users are stuck with the last release. This hack makes sense in QEMU as we have a few hacks like this to fix broken guests. A primary use of virtualization is to run old applications so it makes sense for us to do that. IMO it was wrong to put it in qemu: originally, if a distro shipped a broken virtio/dhclient combo, it was its own bug to fix. But now that qemu has shipped the work-around for so long, broken guests seemed to work. The guests were broken before qemu implemented this. virtio-net had checksum offload long before it was ever implemented in qemu. Not even lguest implemented it because the interfaces weren't available in tun/tap. I'm not sure how Rusty ever tested it. We only discovered this bug after checksum offload was implemented in tun/tap and we were able to enable it in QEMU. At that point, the guests had shipped and were in the wild. The real problem was that we implemented a feature in a guest driver without having a backend implementation and then shipped the code. Shipping untested code is a recipe for failure. If we had implemented the front-end feature only when a backend implementation was available, we would have caught this, fixed it in the guests, and not be in the situation because there wouldn't be these broken guests. So we *still* see the bug re-surface in new guests. Which guests? Newer versions of dhclient should work as expected. And since they are fairly new, it is interesting to get decent performance from them now. I don't think it makes sense for vhost to do this.
These guests are so old that they don't have the requisite features to achieve really high performance anyway. I've always thought making vhost totally transparent was a bad idea and this is one of the reasons. It does not have to be fully transparent. You can insert your own ring in the middle, and copy descriptors around. And we stop on errors and let userspace handle it. This will come in handy if we get e.g. a virtio bug that we need to work around. I mean from a UI perspective. IOW, if users have to explicitly choose to use vhost-net, then it's okay to force them to use newer guests that support vhost-net. However, if we make it transparent, then it has to support everything that QEMU virtio has ever supported, which is problematic for exactly the reasons you are now encountering. We can do a lot of ugly things in userspace that we shouldn't be doing in the kernel. Regards, Anthony
Re: Hypervisor detection from within a Linux VM
On 06/29/2010 04:25 PM, Chetan Loke wrote: Hello, Requirement: I have the need to support my apps (running on a Linux VM) on different *nix hypervisors (ESX/Xen etc). I need to know on which hypervisor my app is running. I read the CPUID usage thread - http://thread.gmane.org/gmane.comp.emulators.kvm.devel/22643 but to be honest in the end I looked at http://lxr.linux.no/#linux+v2.6.34/arch/x86/kernel/cpu/vmware.c#L88 The vmware_platform() detection code is straightforward. Current hack: As a quick hack we just grep lspci for VMware's pci-ids. Solution: I can write a bare minimal driver, check the cpu-id as VMware's balloon driver does and then emit a proc/sysfs node. The setup packages and the apps can then check for this node-string. I'm currently working on ESX and I am hoping that this thin driver will work. It can be done entirely in userspace. Take a look at virt-what: http://people.redhat.com/~rjones/virt-what/ Question: Q1) Is it possible to get this functionality as part of the stock kernel or is that a bad idea? I suspect there could be other users/apps who would need to know what *nix hypervisor (or a non-virtualized environment) they are running on? Q2) If this is not the right approach then can someone please suggest another approach? It might be reasonable to list the hypervisor signature as a field in /proc/cpuinfo. There's also a /sys/hypervisor where such information could go. Regards, Anthony Liguori Regards Chetan Loke
Re: [RFC] virtio: Support releasing lock during kick
On 06/23/2010 04:24 PM, Stefan Hajnoczi wrote: The virtio block device holds a lock during I/O request processing. Kicking the virtqueue while the lock is held results in long lock hold times and increases contention for the lock. This patch modifies virtqueue_kick() to optionally release a lock while notifying the host. Virtio block is modified to pass in its lock. This allows other vcpus to queue I/O requests during the time spent servicing the virtqueue notify in the host. The virtqueue_kick() function is modified to know about locking because it changes the state of the virtqueue and should execute with the lock held (it would not be correct for virtio block to release the lock before calling virtqueue_kick()). Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com --- I am not yet 100% happy with this patch which aims to reduce guest CPU consumption related to vblk->lock contention. Although this patch reduces wait/hold times it does not affect I/O throughput or guest CPU utilization. More investigation is required to get to the bottom of why guest CPU utilization does not decrease when a lock bottleneck has been removed.

Performance figures:
Host: 2.6.34 upstream kernel, qemu-kvm-0.12.4, if=virtio,cache=none
Guest: 2.6.35-rc3-kvm.git upstream kernel
Storage: 12 disks as striped LVM volume
Benchmark: 4 concurrent dd bs=4k iflag=direct

Lockstat data for &vblk->lock:

name        con-bounces  contentions  waittime-min  waittime-max  waittime-total
unmodified  7097         7108         0.31          956.09        161165.4
patched     11484        11550        0.30          411.80        50245.83

The maximum wait time went down by 544.29 us (-57%) and the total wait time decreased by 69%. This shows that the virtqueue kick is indeed hogging the lock. The patched version actually has higher contention than the unmodified version. I think the reason for this is that each virtqueue kick now includes a short release and reacquire.
This short release gives other vcpus a chance to acquire the lock and progress, hence more contention but overall better wait time numbers. name acq-bounces acquisitions holdtime-min holdtime-max holdtime-total unmodified 10771 5038346 0.00 3271.81 59016905.47 patched 31594 5857813 0.00 219.76 24104915.55 Here we see the full impact of this patch: hold time reduced to 219.76 us (-93%). Again the acquisitions have increased since we're now doing an extra unlock+lock per virtqueue kick. Testing, ideas, and comments appreciated. drivers/block/virtio_blk.c | 2 +- drivers/char/hw_random/virtio-rng.c | 2 +- drivers/char/virtio_console.c | 6 +++--- drivers/net/virtio_net.c | 6 +++--- drivers/virtio/virtio_balloon.c | 6 +++--- drivers/virtio/virtio_ring.c | 13 +++-- include/linux/virtio.h | 3 ++- net/9p/trans_virtio.c | 2 +- 8 files changed, 25 insertions(+), 15 deletions(-) diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c index 258bc2a..de033bf 100644 --- a/drivers/block/virtio_blk.c +++ b/drivers/block/virtio_blk.c @@ -187,7 +187,7 @@ static void do_virtblk_request(struct request_queue *q) } if (issued) - virtqueue_kick(vblk->vq); + virtqueue_kick(vblk->vq, vblk->lock); } Shouldn't it be possible to just drop the lock before invoking virtqueue_kick() and reacquire it afterwards? There's nothing in that virtqueue_kick() path that the lock is protecting AFAICT. Regards, Anthony Liguori
Re: [Qemu-devel] Guest bridge setup variations
Arnd Bergmann wrote: 3) given an fd, treat a vhost-style interface This could mean two things, not sure which one you mean. Either the file descriptor could be the vhost file descriptor, or the socket or tap file descriptor from above, with qemu operating on the vhost interface itself. Either option has its advantages, but I guess we should only implement one of the two to keep it simple. I was thinking the socket/tap descriptor. I believe we should continue supporting the mechanisms we support today. However, for people that invoke qemu directly from the command line, I believe we should provide a mechanism like the tap helper that can be used to call out to a separate program to create these initial file descriptors. We'll have to think about how we can make this integrate well so that the syntax isn't clumsy. Right. I wonder if this helper should integrate with netcf as an abstraction, or if we should rather do something generic. It may also be a good idea to let the helper decide which of the three options you listed to use and pass that back to qemu unless the user overrides it. The decision probably needs to be host specific, depending e.g. on the availability and version of tools (brctl, iproute, maybe tunctl, ...), the respective kernel modules (vhost, macvlan, bridge, tun, ...) and policy (VEPA, vlan, ebtables). Ideally the approach should be generic enough to work on other platforms (BSD, Solaris, Windows, ...). For helpers, I think I'd like to stick with what we currently support, and then allow for a robust way for there to be independent projects that implement their own helpers. For instance, I would love it if netcf had a qemu network plugin helper. There's just too much in the networking space all wrapped up in layers of policy decisions. I think it's time to move it out of qemu. 
One thing I realized the last time we discussed the helper approach is that qemu should not need to know or care about the arguments passed to the helper, otherwise you get all the complexity back in qemu that you're trying to avoid. Maybe for 0.13 we can convert -net socket and -net tap to just pass all their options to the helper and move that code out of qemu, along with introducing the new syntax. Yes, I was thinking the same thing. New syntax may need exploring. Another unrelated issue that I think needs to be addressed in a network code cleanup is adding better support for multi-queue transmit and receive. I've prepared macvtap for that by letting you open the chardev multiple times to get one queue per guest CPU, but that needs to be supported by qemu and virtio-net as well to actually parallelize network operation. Ideally, two guest CPUs should be able to transmit and receive on separate queues of the adapter without ever having to access any shared resources. Multiqueue adds another dimension but I think your approach is pretty much right on the money. Have multiple fds for each queue and we would support a mechanism with helpers to receive multiple fds as appropriate. Regards, Anthony Liguori
Re: [Qemu-devel] Guest bridge setup variations
Arnd Bergmann wrote: As promised, here is my small writeup on which setups I feel are important in the long run for server-type guests. This does not cover -net user, which is really for desktop kinds of applications where you do not want to connect into the guest from another IP address. I can see four separate setups that we may or may not want to support, the main difference being how the forwarding between guests happens: 1. The current setup, with a bridge and tun/tap devices on ports of the bridge. This is what Gerhard's work on access controls is focused on and the only option where the hypervisor actually is in full control of the traffic between guests. CPU utilization should be highest this way, and network management can be a burden, because the controls are done through a Linux, libvirt and/or Director specific interface. Typical bridging. 2. Using macvlan as a bridging mechanism, replacing the bridge and tun/tap entirely. This should offer the best performance on inter-guest communication, both in terms of throughput and CPU utilization, but offer no access control for this traffic at all. Performance of guest-external traffic should be slightly better than bridge/tap. Optimization to typical bridge (no traffic control). 3. Doing the bridging in the NIC using macvlan in passthrough mode. This lowers the CPU utilization further compared to 2, at the expense of limiting throughput by the performance of the PCIe interconnect to the adapter. Whether or not this is a win is workload dependent. Access controls now happen in the NIC. Currently, this is not supported yet, due to lack of device drivers, but it will be an important scenario in the future according to some people. Optimization to typical bridge (hardware accelerated). 4. Using macvlan for actual VEPA on the outbound interface. This is mostly interesting because it makes the network access controls visible in an external switch that is already managed. 
CPU utilization and guest-external throughput should be identical to 3, but inter-guest latency can only be worse because all frames go through the external switch. VEPA. While we go over all of these things one thing is becoming clear to me. We need to get qemu out of the network configuration business. There's too much going on here. What I'd like to see is the following interfaces supported: 1) given an fd, make socket calls to send packets. Could be used with a raw socket, a multicast or tcp socket. 2) given an fd, use tap-style read/write calls to send packets* 3) given an fd, treat a vhost-style interface * need to make all tun ioctls optional based on passed in flags Every backend we have today could be implemented in terms of one of the above three. They really come down to how the fd is created and setup. I believe we should continue supporting the mechanisms we support today. However, for people that invoke qemu directly from the command line, I believe we should provide a mechanism like the tap helper that can be used to call out to a separate program to create these initial file descriptors. We'll have to think about how we can make this integrate well so that the syntax isn't clumsy. Regards, Anthony Liguori
Re: [Qemu-devel] Re: virtio: Add memory statistics reporting to the balloon driver (V2)
Rusty Russell wrote: On Wed, 18 Nov 2009 07:06:29 am Adam Litke wrote: virtio: Add memory statistics reporting to the balloon driver (V2) Changes since V1: - Use a virtqueue instead of the device config space Hi Adam, If Anthony's happy, I'm happy with this approach. Couple of minor points: +static inline void update_stat(struct virtio_balloon *vb, int idx, + unsigned int tag, unsigned long val) +{ + BUG_ON(idx >= VIRTIO_BALLOON_S_NR); + vb->stats[idx].tag = tag; + vb->stats[idx].val = cpu_to_le32(val); +} The little-endian conversion of the balloon driver is a historical mistake (no other driver does this). Let's not extend it to the stats. I think the mistake is that the other drivers don't do that. We cheat in qemu and assume that the guest is always in a fixed endianness but this is not always the case for all architectures. That said, since we make this mistake everywhere, I guess I understand the argument to have consistency and to just admit that we're broken here. But this is where the endianness bits come from. Here you've done one and not the other, which is even worse. (Sparse would have found this, I assume). Yup, that's definitely wrong. +struct virtio_balloon_stat +{ + __le16 tag; + __le32 val; +}; Is 32 bits sufficient? A big machine might get over 4bn faults, and certainly 4 TB of memory isn't that far away. I think making the interface u64 and byte based would be the best solution. Making assumptions about page size across guest and host is another thing we should try to avoid. Regards, Anthony Liguori
Re: [Qemu-devel] Re: virtio: Add memory statistics reporting to the balloon driver
Rusty Russell wrote: On Tue, 10 Nov 2009 03:02:06 am Adam Litke wrote: A simpler approach is to collect memory statistics in the virtio balloon driver and communicate them to the host via the device config space. There are two issues I see with this. First, there's an atomicity problem since you can't tell when the stats are consistent. Actually, config writes always require notification from the guest to the host. This means the host knows when the config space is changed so atomicity isn't a problem. In fact, if it were a problem, then the balloon driver would be fundamentally broken because target and actual are stored in the config space. If you recall, we had this discussion originally wrt the balloon driver :-) Second, polling is ugly. As opposed to? The guest could set a timer and update the values periodically but that's even uglier because then the host cannot determine the update granularity. A stats vq might solve this more cleanly? actual and target are both really just stats. Had we implemented those with a vq, I'd be inclined to agree with you but since they're implemented in the config space, it seems natural to extend the config space with other stats. Regards, Anthony Liguori
Re: [Qemu-devel] Re: virtio: Add memory statistics reporting to the balloon driver
Avi Kivity wrote: On 11/10/2009 04:36 PM, Anthony Liguori wrote: A stats vq might solve this more cleanly? actual and target are both really just stats. Had we implemented those with a vq, I'd be inclined to agree with you but since they're implemented in the config space, it seems natural to extend the config space with other stats. There is in fact a difference; actual and target are very rarely updated, while the stats are updated very often. Using a vq means a constant number of exits per batch instead of one exit per statistic. If the vq is host-driven, it also allows the host to control the update frequency dynamically (i.e. stop polling when there is no memory pressure). I'm not terribly opposed to using a vq for this. I would expect the stat update interval to be rather long (10s probably) but a vq works just as well. -- Regards, Anthony Liguori
Re: virtio: Add memory statistics reporting to the balloon driver
Rusty Russell wrote: On Tue, 10 Nov 2009 03:02:06 am Adam Litke wrote: A simpler approach is to collect memory statistics in the virtio balloon driver and communicate them to the host via the device config space. There are two issues I see with this. First, there's an atomicity problem since you can't tell when the stats are consistent. Second, polling is ugly. A stats vq might solve this more cleanly? This turns out to not work so nicely. You really need bidirectional communication. You need to request that stats be collected and then you need to tell the hypervisor about the stats that were collected. You don't need any real correlation between requests and stat reports either. This really models how target/actual work and I think it suggests that we want to reuse that mechanism for the stats too. Rusty. -- Regards, Anthony Liguori
Re: virtio: Add memory statistics reporting to the balloon driver
Rusty Russell wrote: On Wed, 11 Nov 2009 08:22:42 am Anthony Liguori wrote: Rusty Russell wrote: On Tue, 10 Nov 2009 03:02:06 am Adam Litke wrote: A simpler approach is to collect memory statistics in the virtio balloon driver and communicate them to the host via the device config space. There are two issues I see with this. First, there's an atomicity problem since you can't tell when the stats are consistent. Second, polling is ugly. A stats vq might solve this more cleanly? This turns out to not work so nicely. You really need bidirectional communication. You need to request that stats be collected and then you need to tell the hypervisor about the stats that were collected. You don't need any real correlation between requests and stat reports either. You register an outbuf at initialization time. The host hands it back when it wants you to refill it with stats. That's strangely backwards. Guest sends a stat buffer that's filled out, host acks it when it wants another. That doesn't seem bizarre to you? This really models how target/actual work and I think it suggests that we want to reuse that mechanism for the stats too. Sure, I want to. You want to. It's simple. But the universe is remarkably indifferent to what we want. Is it actually sufficient or are we going to regret our laziness? It's not laziness, it's consistency. How is actual different than free memory or any other stat? Cheers, Rusty. -- Regards, Anthony Liguori
Re: [PATCHv4 1/6] qemu/virtio: move features to an inline function
Michael S. Tsirkin wrote: devices should have the final say over which virtio features they support. E.g. indirect entries may or may not make sense in the context of virtio-console. In particular, for vhost, we do not want to report to guest bits not supported by kernel backend. Move the common bits from virtio-pci to an inline function and let each device call it. No functional changes. This is a layering violation. There are transport specific features and device specific features. The virtio-net device should have no knowledge or nack'ing ability for transport features. If you need to change transport features, it suggests you're modeling things incorrectly and should be supplying an alternative transport implementation. Regards, Anthony Liguori
Re: [PATCHv4 0/6] qemu-kvm: vhost net support
Hi Michael, I'll reserve individual patch review until they're in a mergable state, but I do have some comments about the overall integration architecture. Generally speaking, I think the integration is unnecessarily invasive. It adds things to the virtio infrastructure that shouldn't be there like the irqfd/queuefd bindings. It also sneaks in things like raw backend support which really isn't needed. I think we can do better. Here's what I suggest: The long term goal should be to have a NetDevice interface that looks very much like virtio-net but as an API, not an ABI. Roughly, it would look something like: struct NetDevice { int add_xmit(NetDevice *dev, struct iovec *iov, int iovcnt, void *token); int add_recv(NetDevice *dev, struct iovec *iov, int iovcnt, void *token); void *get_xmit(NetDevice *dev); void *get_recv(NetDevice *dev); void kick(NetDevice *dev); ... }; That gives us a better API for use with virtio-net, e1000, etc. Assuming we had this interface, I think a natural extension would be: int add_ring(NetDevice *dev, void *address); int add_kickfd(NetDevice *dev, int fd); For slot management, it really should happen outside of the NetDevice structure. We'll need a slot notifier mechanism such that we can keep this up to date as things change. vhost-net becomes a NetDevice. It can support things like the e1000 by doing ring translation behind the scenes. virtio-net can be fast pathed in the case that we're using KVM but otherwise, it would also rely on the ring translation. N.B. in the case vhost-net is fast pathed, it requires a different device in QEMU that uses a separate virtio transport. We should reuse as much code as possible obviously. It doesn't make sense to have all of the virtio-pci code and virtio-net code in place when we aren't using it. All this said, I'm *not* suggesting you have to implement all of this to get vhost-net merged.
Rather, I'm suggesting that we should try to structure the current vhost-net implementation to complement this architecture assuming we all agree this is the sane thing to do. That means I would make the following changes to your series: - move vhost-net support to a VLANClientState backend. - do not introduce a raw socket backend - if for some reason you want to back to tap and raw, those should be options to the vhost-net backend. - when fast pathing with vhost-net, we should introduce interfaces to VLANClientState similar to add_ring and add_kickfd. They'll be very specific to vhost-net for now, but that's okay. - sort out the layering of vhost-net within the virtio infrastructure. vhost-net should really be its own qdev device. I don't see very much code reuse happening right now so I don't understand why it's not that way currently. Regards, Anthony Liguori
Re: [Qemu-devel] [PATCH 0/4] megaraid_sas HBA emulation
Christoph Hellwig wrote: On Thu, Oct 29, 2009 at 10:14:19AM -0500, Anthony Liguori wrote: Which patches are those? http://repo.or.cz/w/qemu/kraxel.git?a=commitdiff;h=1ee5ee08e4427c3db7e1322d30cc0e58e5ca48b9 and http://repo.or.cz/w/qemu/kraxel.git?a=commitdiff;h=a6e6178185786c582141f993272e00521d3f125a Doesn't look like they've gone on the list yet. When they do, poke me and I'll make sure to process them quickly. One of my short term goals is to get better at handling easy patches more quickly. Regards, Anthony Liguori
Re: [PATCH] virtio_console: Add support for multiple ports for generic guest and host communication
H. Peter Anvin wrote: On 09/18/2009 10:55 AM, Anthony Liguori wrote: I fail to see how this is at all relevant. This is a virtual machine, we're presenting virtual hardware that behaves like a serial device. Where web servers fit in is completely beyond me. s/virtio_console/virtio_serial/ There is a fairly noticeable difference between a console device and a serial device. However, something that can be extended and exported to a physical serial port is definitely the latter. Indeed. I think part of the confusion here is that virtio_console started as just a console and hence it used hvc. As part of the current reworking, I think it makes sense to rename the driver virtio_serial. Regards, Anthony Liguori -hpa
Re: [PATCH] virtio_console: Add support for multiple ports for generic guest and host communication
Alan Cox wrote: This device is very much a serial port. I don't see any reason not to treat it like one. Here are a few - You don't need POSIX multi-open semantics, hangup and the like We do actually want hangup and a few other of the tty specific ops. The only thing we really don't want is a baud rate. - Seek makes sense on some kinds of fixed attributes I don't think we're dealing with fixed attributes. These are streams. Fundamentally, this is a paravirtual uart. The improvement over a standard uart is that there can be a larger number of ports, ports can have some identification associated with them, and we are not constrained to the emulated hardware interface which doesn't exist on certain platforms (like s390). - TTY has a relatively large memory overhead per device - Sysfs is what everything else uses - Sysfs has some rather complete lifetime management you'll need to redo by hand sysfs doesn't model streaming data which is what this driver provides. - You don't need idiotic games with numbering spaces Abusing tty for this is ridiculous. If the argument is that tty is an awkward interface that should only be used for legacy purposes, then sure, we should just implement a new userspace interface for this. In fact, this is probably supported by the very existence of hvc. On the other hand, this is fundamentally a paravirtual serial device. Since serial devices are exposed via the tty subsystem, it seems like a logical choice. In some ways putting much of it in kernel is ridiculous too as you can do it with a FUSE fs or simply export the info guest-guest using SNMP. This device cannot be implemented as-is in userspace because it depends on DMA which precludes the use of something like uio_pci. We could modify the device to avoid dma if the feeling was that there was no interest in putting this in the kernel. 
Regards, Anthony Liguori
Re: [PATCH] virtio_console: Add support for multiple ports for generic guest and host communication
Amit Shah wrote: Hey Greg, Can you tell me how this could work out -- each console port could have a role string associated with it (obtainable from the invoking qemu process in case of qemu/kvm). Something that I have in mind currently is: $ qemu-kvm ... -virtioconsole role=org/qemu/clipboard and then the guest kernel sees the string, and puts the org/qemu/clipboard in some file in sysfs. Guest userspace should then be able to open and read/write to /dev/virtio_console/org/qemu/clipboard That's probably not what we want. I imagine what we want is: /dev/ttyV0 /dev/ttyV1 /dev/ttyVN And then we want: /sys/class/virtio-console/ttyV0/name -> org.qemu.clipboard Userspace can detect when new virtio-consoles appear via udev events. When it sees a new ttyVN, it can then look in sysfs to discover its name. Regards, Anthony Liguori
Re: [PATCH] virtio_console: Add support for multiple ports for generic guest and host communication
Amit Shah wrote: On (Tue) Sep 15 2009 [07:57:10], Anthony Liguori wrote: Amit Shah wrote: Hey Greg, Can you tell me how this could work out -- each console port could have a role string associated with it (obtainable from the invoking qemu process in case of qemu/kvm). Something that I have in mind currently is: $ qemu-kvm ... -virtioconsole role=org/qemu/clipboard and then the guest kernel sees the string, and puts the org/qemu/clipboard in some file in sysfs. Guest userspace should then be able to open and read/write to /dev/virtio_console/org/qemu/clipboard That's probably not what we want. I imagine what we want is: /dev/ttyV0 /dev/ttyV1 /dev/ttyVN And then we want: /sys/class/virtio-console/ttyV0/name -> org.qemu.clipboard Userspace can detect when new virtio-consoles appear via udev events. When it sees a new ttyVN, it can then look in sysfs to discover its name. OK; but that's kind of roundabout isn't it? An application, instead of watching for the console port it's interested in, has to instead monitor all the ports. If you wanted to use /dev/virtio/org/qemu/clipboard you still have the same problem. You have to use udev or inotify to listen for a new file in a directory. The /dev/ path may look nicer from a high level, but the code ends up being roughly the same for either approach. What I propose has the advantage of looking like existing subsystems. It also avoids encoding device information in the device name. So in effect there has to be one app monitoring for new ports and then that app exec'ing the corresponding app meant for that port. I think if you think through both models, they end up looking the same. Regards, Anthony Liguori
Re: [PATCH] virtio_console: Add support for multiple ports for generic guest and host communication
Gerd Hoffmann wrote: Userspace can detect when new virtio-consoles appear via udev events. When it sees a new ttyVN, it can then look in sysfs to discover its name. No. udev can create symlinks for you, so apps don't have to dig into sysfs. You'll need a rule along the lines (untested): SUBSYSTEM=="virtio", \ DRIVERS=="virtio-console", \ SYMLINK+="vcon/$attr{name}" then udev will create a /dev/vcon/org.qemu.clipboard symlink pointing to /dev/ttyV0. Apps can just open /dev/vcon/$name then. That works equally well. I don't necessarily think that naming is more useful. Regards, Anthony Liguori
Re: pci: is reset incomplete?
Michael S. Tsirkin wrote: On Mon, Sep 14, 2009 at 12:15:29PM -0500, Anthony Liguori wrote: Michael S. Tsirkin wrote: Hi! pci bus reset does not seem to clear pci config registers, such as BAR registers, or memory space enable, of the attached devices: it only clears the interrupt state. This seems wrong, but easy to fix. I don't think most pci devices reset their config space in their reset callbacks. For things like BAR registers, they really must. BARs should be registered via pci_register_bar so you should be able to centralize their reset. class codes are read only registers. Your proposal might be correct for some of these. But PCI registers that are reset, change as a result of guest activity, and reset values are typically specified by guest spec. So I don't think we should let users tweak these. Well, I guess my general point was that it would be good to add more structure to how config space is initialized. I think a natural consequence of that is that it becomes easier to automatically fix the values on reset. Regards, Anthony Liguori
Re: [PATCH] virtio_console: Add support for multiple ports for generic guest and host communication
Amit Shah wrote: Right; there was some discussion about this. A few alternatives were suggested like - udev scripts to create symlinks from ports to function, like: /dev/vcon3 -> /dev/virtio-console/clipboard - Some fqdn-like hierarchy, like /dev/virtio-console/com/redhat/clipboard which again can be created by udev scripts And I dislike all of them. What I'd rather have is these devices exposed as tty's with a sysfs attribute that exposes the name of the device. This is not all that different to how usb serial devices work. Regards, Anthony Liguori
Re: [PATCH] SCSI driver for VMware's virtual HBA - V4.
specifying the LUN. + * - _pad should be 0. + */ + +typedef struct PVSCSICmdDescAbortCmd { + u64 context; + u32 target; + u32 _pad; +} __packed PVSCSICmdDescAbortCmd; + +/* + * Command descriptor for PVSCSI_CMD_SETUP_RINGS -- + * + * Notes: + * - reqRingNumPages and cmpRingNumPages need to be power of two. + * - reqRingNumPages and cmpRingNumPages need to be different from 0, + * - reqRingNumPages and cmpRingNumPages need to be inferior to + * PVSCSI_SETUP_RINGS_MAX_NUM_PAGES. + */ + +#define PVSCSI_SETUP_RINGS_MAX_NUM_PAGES 32 +typedef struct PVSCSICmdDescSetupRings { + u32 reqRingNumPages; + u32 cmpRingNumPages; + u64 ringsStatePPN; + u64 reqRingPPNs[PVSCSI_SETUP_RINGS_MAX_NUM_PAGES]; + u64 cmpRingPPNs[PVSCSI_SETUP_RINGS_MAX_NUM_PAGES]; +} __packed PVSCSICmdDescSetupRings; + +/* + * Command descriptor for PVSCSI_CMD_SETUP_MSG_RING -- + * + * Notes: + * - this command was not supported in the initial revision of the h/w + * interface. Before using it, you need to check that it is supported by + * writing PVSCSI_CMD_SETUP_MSG_RING to the 'command' register, then + * immediately after read the 'command status' register: + * * a value of -1 means that the cmd is NOT supported, + * * a value != -1 means that the cmd IS supported. + * If it's supported the 'command status' register should return: + * sizeof(PVSCSICmdDescSetupMsgRing) / sizeof(u32). + * - this command should be issued _after_ the usual SETUP_RINGS so that the + * RingsState page is already setup. If not, the command is a nop. + * - numPages needs to be a power of two, + * - numPages needs to be different from 0, + * - _pad should be zero. + */ + +#define PVSCSI_SETUP_MSG_RING_MAX_NUM_PAGES 16 + +typedef struct PVSCSICmdDescSetupMsgRing { + u32 numPages; + u32 _pad; + u64 ringPPNs[PVSCSI_SETUP_MSG_RING_MAX_NUM_PAGES]; +} __packed PVSCSICmdDescSetupMsgRing; + +enum PVSCSIMsgType { + PVSCSI_MSG_DEV_ADDED = 0, + PVSCSI_MSG_DEV_REMOVED = 1, + PVSCSI_MSG_LAST = 2, +}; + +/* + * Msg descriptor.
+ * + * sizeof(struct PVSCSIRingMsgDesc) == 128. + * + * - type is of type enum PVSCSIMsgType. + * - the content of args depend on the type of event being delivered. + */ + +typedef struct PVSCSIRingMsgDesc { + u32 type; + u32 args[31]; +} __packed PVSCSIRingMsgDesc; + +typedef struct PVSCSIMsgDescDevStatusChanged { + u32 type; /* PVSCSI_MSG_DEV _ADDED / _REMOVED */ + u32 bus; + u32 target; + u8 lun[8]; + u32 pad[27]; +} __packed PVSCSIMsgDescDevStatusChanged; + +/* + * Rings state. + * + * - the fields: + *. msgProdIdx, + *. msgConsIdx, + *. msgNumEntriesLog2, + * .. are only used once the SETUP_MSG_RING cmd has been issued. + * - '_pad' helps to ensure that the msg related fields are on their own + * cache-line. + */ + +typedef struct PVSCSIRingsState { + u32 reqProdIdx; + u32 reqConsIdx; + u32 reqNumEntriesLog2; + + u32 cmpProdIdx; + u32 cmpConsIdx; + u32 cmpNumEntriesLog2; + + u8 _pad[104]; + + u32 msgProdIdx; + u32 msgConsIdx; + u32 msgNumEntriesLog2; +} __packed PVSCSIRingsState; All of this can be hidden behind a struct virtqueue. You could then introduce a virtio-vmwring that implemented this ABI. You could then separate out the actual scsi logic into a separate virtio-scsi.c driver. There are numerous advantages to this layering. You get to remain ABI compatible with yourself. You can potentially reuse the ring logic in other drivers. Other VMMs can use different ring transports without introducing new drivers. For instance, in KVM, we would use virtio-pci + virtio-scsi instead of using virtio-vmwring + virtio-scsi. The virtio infrastructure has been backported to various old kernels too so there really is no reason not to use it. We have the interfaces in virtio to do notification suppression and MSI so I don't think there's any real limitation that it would impose on you. Regards, Anthony Liguori
Re: [PATCH] SCSI driver for VMware's virtual HBA - V4.
Alok Kataria wrote:

I see your point, but the ring logic or the ABI that we use to communicate between the hypervisor and guest is not shared between our storage and network drivers. As a result, I don't see any benefit of separating out this ring handling mechanism; on the contrary, it might just add some overhead of translating between various layers for our SCSI driver.

But if you separate out the ring logic, it allows the SCSI logic to be shared by other paravirtual device drivers. This is significant and important from a Linux point of view. There is almost nothing VMware-specific about the vast majority of your code. If you split out the SCSI logic, then you will receive help debugging, adding future features, and improving performance from other folks interested in this. In the long term, it will make your life much, much easier by making the driver relevant to a wider audience :-)

Having said that, I would like to add that if, in some future iteration of our paravirtualized drivers, we decide to share this ring mechanism among our various device drivers, this might be an interesting proposition.

I am certainly not the block subsystem maintainer, but I'd hate to see this merged without any attempt at making the code more reusable. If adding the virtio layering is going to somehow hurt performance, break your ABI, or in some way limit you, then that's certainly something to discuss, and those would be valid concerns. That said, I don't think it's a huge change to your current patch and I don't see any obvious problems it would cause.

Regards,

Anthony Liguori
Re: [PATCH] SCSI driver for VMware's virtual HBA - V4.
Christoph Hellwig wrote:

On Wed, Sep 09, 2009 at 05:12:26PM -0500, Anthony Liguori wrote:

Alok Kataria wrote: I see your point, but the ring logic or the ABI that we use to communicate between the hypervisor and guest is not shared between our storage and network drivers. As a result, I don't see any benefit of separating out this ring handling mechanism; on the contrary, it might just add some overhead of translating between various layers for our SCSI driver.

But if you separate out the ring logic, it allows the SCSI logic to be shared by other paravirtual device drivers. This is significant and important from a Linux point of view.

As someone who has been hacking on a virtio-scsi prototype, I don't think it's a good idea. The vmware driver is a horrible design and I don't think it should be merged.

What are the issues with the design compared to how you're approaching virtio-scsi?

Besides being an ugly driver and ABI, we really should not support this kind of closed protocol development.

I don't see how a VMM that doesn't share the source code for its backends or doesn't implement standard ABIs is any different from the hundreds of hardware vendors that behave exactly the same way. We haven't even been successful in getting the Xen folks to present their work on lkml before shipping it to their users. Why would we expect more from VMware if we're willing to merge the Xen stuff?

Regards,

Anthony Liguori
Re: [PATCH] SCSI driver for VMware's virtual HBA - V4.
Jeremy Fitzhardinge wrote:

On 09/09/09 16:34, Anthony Liguori wrote: We haven't even been successful in getting the Xen folks to present their work on lkml before shipping it to their users. Why would we expect more from VMware if we're willing to merge the Xen stuff?

The Xen code may be out of tree, but it has always been readily available from a well-known place. I don't think it's comparable.

Once an ABI is set in stone, there's very little that can be done on lkml to review the ABI in any serious way. VMware has repeatedly done this in the past: ship a product with their own drivers, then submit something to lkml saying that the ABI cannot be changed because it's present in a shipping product. I'll admit, Xen hasn't been as bad as VMware in this regard, but it's certainly not perfect either.

Regards,

Anthony Liguori
Re: Extending virtio_console to support multiple ports
Amit Shah wrote:

No flags, assume it's a streaming protocol and don't assume anything about message sizes. IOW, when you send clipboard data, send size and then the data. QEMU consumes bytes until it reaches size.

Same intent but a different method: I'll have to specify that particular data is the size, and that the data after this special data is the actual data stream.

Sounds like every stream protocol in existence :-)

- A lock has to be introduced to fetch one unused buffer from the list and pass it on to the host. And this lock has to be a spinlock, just because writes can be called from irq context.

I don't see a problem here.

You mean you don't see a problem in using a spinlock vs. not using one?

Right. This isn't a fast path.

Userspace will typically send the entire buffer to be transmitted in one system call. If it's large, the system call will have to be broken into several. This results in multiple guest system calls, each one handled with a spinlock held. Compare this with the entire write handled in one system call in the current method.

Does it matter? This isn't a fast path.

Regards,

Anthony Liguori
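The length-prefixed stream Anthony describes ("send size and then the data; QEMU consumes bytes until it reaches size") can be sketched as follows. All names here are hypothetical; this is a user-space illustration of the framing idea, not code from either driver, and the 4-byte size header is assumed to arrive within a single chunk.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Sender side: one message = 4-byte size, then the payload.
 * Returns total bytes written, or -1 if the output buffer is too small.
 * Native endianness is used for brevity. */
static int clip_encode(uint8_t *out, size_t cap, const uint8_t *data, uint32_t len)
{
	if (cap < sizeof(uint32_t) + len)
		return -1;
	memcpy(out, &len, sizeof(len));
	memcpy(out + sizeof(len), data, len);
	return (int)(sizeof(len) + len);
}

/* Receiver side: consume bytes until 'need' payload bytes have arrived,
 * no matter how many chunks the stream delivers them in. */
struct clip_decoder {
	uint32_t need;		/* expected payload size; 0 = waiting for header */
	uint32_t got;		/* payload bytes received so far */
	uint8_t  buf[4096];
};

/* Feed a chunk of stream bytes. Returns 1 when one complete message is
 * ready in d->buf, 0 when more bytes are needed. */
static int clip_feed(struct clip_decoder *d, const uint8_t *p, size_t n)
{
	size_t i = 0;

	while (i < n) {
		if (d->need == 0) {
			/* Read the size header (assumed contiguous here). */
			memcpy(&d->need, p + i, sizeof(uint32_t));
			i += sizeof(uint32_t);
			d->got = 0;
		} else {
			size_t take = n - i;

			if (take > d->need - d->got)
				take = d->need - d->got;
			memcpy(d->buf + d->got, p + i, take);
			d->got += take;
			i += take;
			if (d->got == d->need)
				return 1;
		}
	}
	return 0;
}
```

With this framing the guest needs no per-buffer "more data pending" flag: the receiver simply accumulates until it has `size` bytes, which is exactly the behavior a single-flush VNC clipboard consumer wants.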
Re: Extending virtio_console to support multiple ports
Amit Shah wrote:

Can you please explain your rationale for being so rigid about merging the two drivers?

Because they do the same thing. I'm not going to constantly rehash this. It's been explained multiple times.

If there are implementation issues within the Linux drivers because of peculiarities of hvc, then hvc needs to be fixed. It has nothing to do with the driver ABI, which is what qemu cares about.

Regards,

Anthony Liguori
Re: Extending virtio_console to support multiple ports
Amit Shah wrote:

On (Mon) Aug 31 2009 [09:21:13], Anthony Liguori wrote: Amit Shah wrote: Can you please explain your rationale for being so rigid about merging the two drivers? Because they do the same thing. I'm not going to constantly rehash this. It's been explained multiple times.

It hardly looks like the same thing each passing day.

That's BS. The very first time you posted, you received the same feedback from both Paul and me. See http://article.gmane.org/gmane.comp.emulators.qemu/44778. That was back in June. You've consistently received the same feedback both on the ML and in private.

We're ending up having to compromise on the performance or functionality or simplicity of the devices just because of this restriction.

This is _not_ a high performance device, and there has so far been no functionality impact.

I don't understand why you keep dragging your feet about this.

It's very simple: if you post a functional set of patches for a converged virtio-console driver, we'll merge it. If you keep arguing about having a separate virtio-serial driver, it's not going to get merged. I don't know how to be more clear than this.

If there are implementation issues within the Linux drivers because of peculiarities of hvc, then hvc needs to be fixed. It has nothing to do with the driver ABI, which is what qemu cares about.

I'd welcome that effort as well. But we all know that's not going to happen anytime soon.

That is not a justification to add a new device in QEMU. If we add a new device every time we encounter a less than ideal interface within a guest, we're going to end up having hundreds of devices.

Regards,

Anthony Liguori
Re: Extending virtio_console to support multiple ports
Amit Shah wrote:

We're ending up having to compromise on the performance or functionality or simplicity of the devices just because of this restriction. This is _not_ a high performance device, and there has so far been no functionality impact. I don't understand why you keep dragging your feet about this. It's very simple: if you post a functional set of patches for a converged virtio-console driver, we'll merge it. If you

I have already posted them and have received no feedback about the patches since. Let me add another request here for you to review them.

But the guest drivers do not have proper locking. Have you posted a new series with that fixed?

keep arguing about having a separate virtio-serial driver, it's not going to get merged. I don't know how to be more clear than this.

I'm not at all arguing for a separate virtio-serial driver. Please note the difference in what I'm asking for: I'm just asking for a good justification for merging the two, since it makes both drivers less simple and also introduces dependencies on code outside our control.

Functionally speaking, both virtio-console and virtio-serial do the same thing. In fact, virtio-console is just a subset of virtio-serial. If there are problems converging the two drivers in Linux, then I suggest you have two separate driver modules in Linux. That would obviously be rejected for Linux, though, because you cannot have two drivers for the same device. Why should qemu have a different policy?

That is not a justification to add a new device in QEMU. If we add a new device every time we encounter a less than ideal interface within a guest, we're going to end up having hundreds of devices.

I just find this argument funny.

I'm finding this discussion not so productive.

Regards,

Anthony Liguori
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
Avi Kivity wrote:

On 08/31/2009 02:42 PM, Xin, Xiaohui wrote: Hi Michael, that's a great job. We are now working on supporting VMDq on KVM, and since the VMDq hardware presents L2 sorting based on MAC addresses and VLAN tags, our target is to implement a zero copy solution using VMDq. We started from the virtio-net architecture. What we want to propose is to use AIO combined with direct I/O:

1) Modify the virtio-net backend service in Qemu to submit aio requests composed from the virtqueue.
2) Modify the TUN/TAP device to support aio operations and map the user space buffer directly into the host kernel.
3) Let a TUN/TAP device bind to a single rx/tx queue from the NIC.
4) Modify the net_dev and skb structures to permit an allocated skb to use a user space directly mapped payload buffer address rather than a kernel allocated one.

As zero copy is also your goal, we are interested in what's on your mind, and would like to collaborate with you if possible.

One way to share the effort is to make vmdq queues available as normal kernel interfaces.

It may be possible to make vmdq appear like an sr-iov capable device from userspace. sr-iov provides the userspace interfaces to allocate interfaces and assign mac addresses. To make it useful, you would have to handle tx multiplexing in the driver, but that would be much easier to consume for kvm.

Regards,

Anthony Liguori
Re: Extending virtio_console to support multiple ports
Amit Shah wrote:

I did think about that as well, but there are problems:

- vnc clients (at least tigervnc) want to receive the entire clipboard in a single flush command. So in the pre-allocated buffers scenario we could run short of the available buffers in some cases. So there will have to be a flag with each buffer that says 'there's more data pending for this particular write', which will have to be passed on to qemu, and qemu will then flush it once it receives all the data.

No flags, assume it's a streaming protocol and don't assume anything about message sizes. IOW, when you send clipboard data, send size and then the data. QEMU consumes bytes until it reaches size.

- A lock has to be introduced to fetch one unused buffer from the list and pass it on to the host. And this lock has to be a spinlock, just because writes can be called from irq context.

I don't see a problem here.

Regards,

Anthony Liguori
Re: Extending virtio_console to support multiple ports
Amit Shah wrote:

On (Tue) Aug 25 2009 [11:47:20], Amit Shah wrote: Hello all, here is a new iteration of the patch series that implements a transport for guest and host communications. The code has been updated to reuse the virtio-console device instead of creating a new virtio-serial device.

And the problem now is that hvc calls the put_chars function with spinlocks held, and we now allocate pages in send_buf(), called from put_chars.

Don't allocate pages in send_buf. There's a fixed number of possible entries on the ring. Preallocate them up front and then you don't need to sleep.

A few solutions:

- Keep things as they are, virtio_console.c remains as it is and virtio_serial.c gets added

Not an option from a QEMU perspective.

Regards,

Anthony Liguori
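The "preallocate them up front" suggestion can be sketched as a fixed pool with an O(1) free list, so a caller in atomic context never has to allocate or sleep. Names and sizes here are illustrative, not from the actual virtio_console patches; a real driver would guard the list with a spinlock and size it to the number of ring entries.

```c
#include <stddef.h>

/* One buffer per possible ring entry, carved out at init time and recycled
 * as the host acks them, instead of allocating a page per put_chars call. */
#define RING_ENTRIES 8
#define BUF_SIZE     256

struct sbuf {
	struct sbuf *next;
	char data[BUF_SIZE];
};

static struct sbuf pool[RING_ENTRIES];
static struct sbuf *free_list;

static void pool_init(void)
{
	free_list = NULL;
	for (int i = 0; i < RING_ENTRIES; i++) {
		pool[i].next = free_list;
		free_list = &pool[i];
	}
}

/* O(1), no allocation: safe to call with spinlocks held in a real driver.
 * Returns NULL when every ring entry is in flight. */
static struct sbuf *buf_get(void)
{
	struct sbuf *b = free_list;

	if (b)
		free_list = b->next;
	return b;
}

/* Called when the host acks a buffer: return it to the free list. */
static void buf_put(struct sbuf *b)
{
	b->next = free_list;
	free_list = b;
}
```

The point of the design is that exhaustion becomes an explicit, non-sleeping condition (a NULL return) rather than an allocation that might sleep under a spinlock.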
Re: vhost net: performance with ping benchmark
Avi Kivity wrote:

I think this is likely going to be needed regardless. I also think the tap compatibility suggestion would simplify the consumption of this in userspace.

What about veth pairs?

Does veth support GSO and checksum offload?

I'd like some time to look at get_state/set_state ioctl()s along with dirty tracking support. It's a much better model for live migration IMHO.

My preference is ring proxying.

Note we'll need ring proxying (or at least event proxying) for non-MSI guests.

I avoided suggesting ring proxying because I didn't want to suggest that merging should be contingent on it.

Regards,

Anthony Liguori
Re: vhost net: performance with ping benchmark
Michael S. Tsirkin wrote:

My preference is ring proxying.

Note we'll need ring proxying (or at least event proxying) for non-MSI guests.

Exactly, that's what I meant earlier. That's enough, isn't it, Anthony?

It is if we have a working implementation that demonstrates the userspace interface is sufficient. Once it goes into the upstream kernel, we need to have backwards compatibility code in QEMU forever to support that kernel version.

Regards,

Anthony Liguori
Re: vhost net: performance with ping benchmark
Michael S. Tsirkin wrote:

On Mon, Aug 24, 2009 at 11:12:41AM +0300, Michael S. Tsirkin wrote: At Rusty's suggestion, I tested vhost base performance with ping. Results below, and they seem to be what you'd expect.

Rusty, any chance you could look at the code? Is it in reasonable shape? I think it makes sense to merge it through you. What do you think?

One comment on file placement: I put the files under a separate vhost directory to avoid confusion with virtio-net, which runs in the guest. Does this sound sane?

Also, can a minimal version (without TSO, tap or any other features) be merged upstream first so that features can be added later? Or do we have to wait until it's more full featured? Finally, can it reasonably make 2.6.32, or do you think it needs more time out of tree?

I think 2.6.32 is pushing it. I think some time is needed to flush out the userspace interface. In particular, I don't think Mark's comments have been adequately addressed.

If a version were merged without GSO support, some mechanism to do feature detection would be needed in the userspace API. I think this is likely going to be needed regardless. I also think the tap compatibility suggestion would simplify the consumption of this in userspace.

I'd like some time to look at get_state/set_state ioctl()s along with dirty tracking support. It's a much better model for live migration IMHO.

I think some more thorough benchmarking would be good too. In particular, netperf/iperf runs would be nice.

Regards,

Anthony Liguori
Re: [Qemu-devel] Re: virtio-serial: An interface for host-guest communication
Amit Shah wrote:

On (Mon) Aug 10 2009 [11:59:31], Anthony Liguori wrote: However, as I've mentioned repeatedly, the reason I won't merge virtio-serial is that it duplicates functionality with virtio-console. If the two are converged, I'm happy to merge it. I'm not opposed to having more functionality.

The guest code sort of ends up looking like this after merging virtio_console into virtio_serial. The diff is against virtio_serial in my git tree, but that should be pretty close to the last submission I made at http://patchwork.kernel.org/patch/39335/ or the tree at git://git.kernel.org/pub/scm/linux/kernel/git/amit/vs-kernel.git

I've merged bits from virtio_console.c into virtio_serial.c. If needed, virtio_serial can be renamed to virtio_console. The VIRTIO_ID also needs to change to that of virtio_console's. Similar changes are needed for userspace.

Since there's support for only one console as of now, I've assigned port #2 as the console port so that we hook into hvc when a port is found at that location.

One issue that crops up for put_chars: a copy of the buffer to be sent has to be made, as we don't wait for the host to ack the buffer before we move on.

The biggest loss so far is Rusty's excellent comments: they will have to be reworked and added for the whole of the new file.

Is this approach acceptable?

I think we want to keep virtio_console.c and definitely continue using the virtio_console id. It looks like you are still creating character devices as opposed to tty devices. Is this just an incremental step, or are you choosing not to do tty devices for a reason (and if so, what's that reason)?

Regards,

Anthony Liguori
Re: [Qemu-devel] Re: virtio-serial: An interface for host-guest communication
Gerd Hoffmann wrote:

On 08/14/09 10:15, Amit Shah wrote: The guest code sort of ends up looking like this after merging virtio_console into virtio_serial.

I think it should better go the other way around: add multichannel support to virtio-console, probably guarded by a feature flag, so old host + new guest and new host + old guest combinations continue to work.

Since there's support for only one console as of now, I've assigned port #2 as the console port so that we hook into hvc when a port is found at that location.

Doesn't sound like this is going to be backward compatible ...

Also, I still think passing a 'protocol' string for each port is a good idea, so you can stick that into a sysfs file for guests to use.

Or drop ports altogether and just use protocol strings...

Regards,

Anthony Liguori
Re: [Qemu-devel] Re: [PATCH] qemu/virtio: move features to an inline function
Michael S. Tsirkin wrote:

Why bother switching to userspace for migration? Can't you just have get/set ioctls for the state?

I have these. But live migration requires dirty page logging. I do not want to implement it in kernel.

Is it really that difficult? I think it would be better to just do that. I wonder, though, if mmu notifiers could be used to make it transparent...

Regards,

Anthony Liguori
Re: [PATCH 2/2] vhost_net: a kernel-level virtio server
Michael S. Tsirkin wrote:

We discussed this before, and I still think this could be directly derived from struct virtqueue, in the same way that vring_virtqueue is derived from struct virtqueue.

I prefer keeping it simple. Much of the abstraction in virtio is due to the fact that it needs to work on top of different hardware emulations: lguest, kvm, possibly others in the future. vhost is always working on real hardware, using eventfd as the interface, so it does not need that.

Actually, vhost may not always be limited to real hardware. We may one day use vhost as the basis of a driver domain. There's quite a lot of interest in this for networking.

At any rate, I'd like to see performance results before we consider trying to reuse virtio code.

Regards,

Anthony Liguori
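The derivation Michael refers to, vring_virtqueue embedding a struct virtqueue and recovering itself via container_of, looks roughly like this in user-space form. The structs are simplified for illustration, not the real kernel definitions:

```c
#include <stddef.h>

/* Same trick as the kernel macro of this name: recover the enclosing
 * struct from a pointer to one of its members. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Generic handle that common code passes around. */
struct virtqueue {
	const char *name;
};

/* Transport-specific type derived from it: the generic part is embedded
 * as the first member, and private ring state hangs off the rest. */
struct vring_virtqueue {
	struct virtqueue vq;	/* handed to transport-agnostic code */
	unsigned int num_free;	/* private: free descriptors in the ring */
};

/* Downcast used inside the transport implementation. */
static struct vring_virtqueue *to_vvq(struct virtqueue *vq)
{
	return container_of(vq, struct vring_virtqueue, vq);
}
```

This is what lets one body of SCSI (or net) logic run over multiple ring transports: callers only ever see `struct virtqueue *`, while each transport keeps its own state in the enclosing structure.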
Re: [PATCH 2/2] vhost_net: a kernel-level virtio server
Arnd Bergmann wrote:

As I pointed out earlier, most code in virtio net is asymmetrical: the guest provides buffers, the host consumes them. Possibly, one could use virtio rings in a symmetrical way, but support of existing guest virtio net means there's almost no shared code.

The trick is to swap the virtqueues instead. virtio-net is actually mostly symmetric, in just the same way that the physical wires on twisted-pair Ethernet are symmetric (I like how that analogy fits). It's already been done between two guests. See http://article.gmane.org/gmane.linux.kernel.virtualization/5423

Regards,

Anthony Liguori