Re: [PATCH V4 1/2] virtio-net: fix the set affinity bug when CPU IDs are not consecutive
On 01/11/2013 01:56 PM, Wanlong Gao wrote: As Michael mentioned, set affinity and select queue will not work very well when CPU IDs are not consecutive, this can happen with hot unplug. Fix this bug by traversal the online CPUs, and create a per cpu variable to find the mapping from CPU to the preferable virtual-queue. Looks like ixgbe is trying to handling a similar issue recently by trying to reuse XPS when the channels and cpus are not equal. I wonder whether we can use the same method. Cc: Rusty Russell ru...@rustcorp.com.au Cc: Michael S. Tsirkin m...@redhat.com Cc: Jason Wang jasow...@redhat.com Cc: Eric Dumazet erdnet...@gmail.com Cc: virtualization@lists.linux-foundation.org Cc: net...@vger.kernel.org Signed-off-by: Wanlong Gao gaowanl...@cn.fujitsu.com --- V3-V4: move vq_index into virtnet_info (Jason) change the mapping value when not setting affinity (Jason) address the comments about select_queue (Rusty) Not Addressed yet: race between select_queue and set_channels need discuss more but not the same problem with this (Rusty) drivers/net/virtio_net.c | 53 ++-- 1 file changed, 42 insertions(+), 11 deletions(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index a6fcf15..ca17a58 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -123,6 +123,9 @@ struct virtnet_info { /* Does the affinity hint is set for virtqueues? */ bool affinity_hint_set; + + /* Per-cpu variable to show the mapping from CPU to virtqueue */ + int __percpu *vq_index; }; struct skb_vnet_hdr { @@ -1016,6 +1019,7 @@ static int virtnet_vlan_rx_kill_vid(struct net_device *dev, u16 vid) static void virtnet_set_affinity(struct virtnet_info *vi, bool set) { int i; + int cpu; /* In multiqueue mode, when the number of cpu is equal to the number of * queue pairs, we let the queue pairs to be private to one cpu by @@ -1029,16 +1033,29 @@ static void virtnet_set_affinity(struct virtnet_info *vi, bool set) return; } - for (i = 0; i vi-max_queue_pairs; i++) { - int cpu = set ? i : -1; - virtqueue_set_affinity(vi-rq[i].vq, cpu); - virtqueue_set_affinity(vi-sq[i].vq, cpu); - } + if (set) { + i = 0; + for_each_online_cpu(cpu) { + virtqueue_set_affinity(vi-rq[i].vq, cpu); + virtqueue_set_affinity(vi-sq[i].vq, cpu); + *per_cpu_ptr(vi-vq_index, cpu) = i; + i++; + } - if (set) vi-affinity_hint_set = true; - else + } else { + for(i = 0; i vi-max_queue_pairs; i++) { + virtqueue_set_affinity(vi-rq[i].vq, -1); + virtqueue_set_affinity(vi-sq[i].vq, -1); + } + + i = 0; + for_each_online_cpu(cpu) + *per_cpu_ptr(vi-vq_index, cpu) = + ++i % vi-curr_queue_pairs; + vi-affinity_hint_set = false; + } } static void virtnet_get_ringparam(struct net_device *dev, @@ -1127,12 +1144,19 @@ static int virtnet_change_mtu(struct net_device *dev, int new_mtu) /* To avoid contending a lock hold by a vcpu who would exit to host, select the * txq based on the processor id. - * TODO: handle cpu hotplug. */ static u16 virtnet_select_queue(struct net_device *dev, struct sk_buff *skb) { - int txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : - smp_processor_id(); + int txq; + struct virtnet_info *vi = netdev_priv(dev); + + if (skb_rx_queue_recorded(skb)) { + txq = skb_get_rx_queue(skb); + } else { + txq = *__this_cpu_ptr(vi-vq_index); + if (txq == -1) + txq = 0; + } while (unlikely(txq = dev-real_num_tx_queues)) txq -= dev-real_num_tx_queues; @@ -1453,6 +1477,10 @@ static int virtnet_probe(struct virtio_device *vdev) if (vi-stats == NULL) goto free; + vi-vq_index = alloc_percpu(int); + if (vi-vq_index == NULL) + goto free_stats; + mutex_init(vi-config_lock); vi-config_enable = true; INIT_WORK(vi-config_work, virtnet_config_changed_work); @@ -1476,7 +1504,7 @@ static int virtnet_probe(struct virtio_device *vdev) /* Allocate/initialize the rx/tx queues, and invoke find_vqs */ err = init_vqs(vi); if (err) - goto free_stats; + goto free_index; netif_set_real_num_tx_queues(dev, 1); netif_set_real_num_rx_queues(dev, 1); @@ -1520,6 +1548,8 @@ free_recv_bufs: free_vqs: cancel_delayed_work_sync(vi-refill); virtnet_del_vqs(vi); +free_index: +
Re: [Xen-devel] [PATCH v3 00/11] xen: Initial kexec/kdump implementation
On Thu, Jan 10, 2013 at 02:19:55PM +, David Vrabel wrote: On 04/01/13 17:01, Daniel Kiper wrote: On Fri, Jan 04, 2013 at 02:38:44PM +, David Vrabel wrote: On 04/01/13 14:22, Daniel Kiper wrote: On Wed, Jan 02, 2013 at 11:26:43AM +, Andrew Cooper wrote: On 27/12/12 18:02, Eric W. Biederman wrote: Andrew Cooperandrew.coop...@citrix.com writes: On 27/12/2012 07:53, Eric W. Biederman wrote: The syscall ABI still has the wrong semantics. Aka totally unmaintainable and umergeable. The concept of domU support is also strange. What does domU support even mean, when the dom0 support is loading a kernel to pick up Xen when Xen falls over. There are two requirements pulling at this patch series, but I agree that we need to clarify them. It probably make sense to split them apart a little even. Thinking about this split, there might be a way to simply it even more. /sbin/kexec can load the Xen crash kernel itself by issuing hypercalls using /dev/xen/privcmd. This would remove the need for the dom0 kernel to distinguish between loading a crash kernel for itself and loading a kernel for Xen. Or is this just a silly idea complicating the matter? This is impossible with current Xen kexec/kdump interface. It should be changed to do that. However, I suppose that Xen community would not be interested in such changes. I don't see why the hypercall ABI cannot be extended with new sub-ops that do the right thing -- the existing ABI is a bit weird. I plan to start prototyping something shortly (hopefully next week) for the Xen kexec case. Wow... As I can this time Xen community is interested in... That is great. I agree that current kexec interface is not ideal. I spent some more time looking at the existing interface and implementation and it really is broken. David, I am happy to help in that process. However, if you wish I could carry it myself. Anyway, it looks that I should hold on with my Linux kexec/kdump patches. I should be able to post some prototype patches for Xen in a few weeks. No guarantees though. That is great. If you need any help drop me a line. My .5 cents: - We should focus on KEXEC_CMD_kexec_load and KEXEC_CMD_kexec_unload; probably we should introduce KEXEC_CMD_kexec_load2 and KEXEC_CMD_kexec_unload2; load should __LOAD__ kernel image and other things into hypervisor memory; Yes, but I don't see how we can easily support both ABIs easily. I'd be in favour of replacing the existing hypercalls and requiring updated kexec tools in dom0 (this isn't that different to requiring the correct libxc in dom0). Why? Just define new strutures for new functions of kexec hypercall. That should suffice. I suppose that allmost all things could be copied from linux/kernel/kexec.c, linux/arch/x86/kernel/{machine_kexec_$(BITS).c,relocate_kernel_$(BITS).c}; I think that KEXEC_CMD_kexec should stay as is, I don't think we want all the junk from Linux inside Xen -- we only want to support the kdump case and do not have to handle returning from the kexec image. I do not want to implement kexec jump or stuff like. However, I think that it is worth use code which could be used. As I know there are lot of stuff which was taken with smaller or bigger changes from Linux Kernel. Why we would like to reinvent the wheel this time? Additionally, we should not drop kexec support. It is main part of kdump. In case of kdump new kernel (and other stuff) is placed in prealocated space in contrary to kexec. That's all. kexec is useful if you would like to quickly (skipping BIOS) switch from Xen to baremetal Linux. If you drop kexec support from Xen then you need alter kexec-tools package in bunch of distros to take into account new Xen behavior. I think that it is not we want to do. - Hmmm... Now I think that we should still use kexec syscall to load image into Xen memory (with new KEXEC_CMD_kexec_load2) because it establishes all things which are needed to call kdump if dom0 crashes; however, I could be wrong... I don't think we need the kexec syscall. The kernel can unconditionally do the crash hypercall, which will return if the kdump kernel isn't loaded and the kernel can fall back to the regular non-kexec panic. No, please do not do that. When you call HYPERVISOR_kexec_op(KEXEC_CMD_kexec) system is completly shutdown. Return form HYPERVISOR_kexec_op(KEXEC_CMD_kexec) would require to restore some kernel functionalities. It maybe impossible in some cases. Additionally, it means that some changes should be made in generic kexec code path. As I know kexec maintainers are very reluctant to make such things. This will allow the kexec syscall to be used only for the domU kexec case. - last but not least, we should think about support for PV guests too. I won't be looking at this. OK. To avoid confusion about the two largely
Re: [Xen-devel] [PATCH v3 00/11] xen: Initial kexec/kdump implementation
On Mon, Jan 07, 2013 at 01:49:44PM +, Ian Campbell wrote: On Mon, 2013-01-07 at 12:34 +, Daniel Kiper wrote: I think that new kexec hypercall function should mimics kexec syscall. We want to have an interface can be used by non-Linux domains (both dom0 and domU) as well though, so please bear this in mind. I agree, but all arguments passed to kexec syscall are quiet generic and they do not impose any limitations. Just look into include/linux/kexec.h. That is why I think that a lot of things could be taken from Linux kexec implementation. Daniel ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH] virtio: suppress kmemleak false positive
On Fri, Jan 11, 2013 at 4:02 PM, Catalin Marinas catalin.mari...@arm.comwrote: On Fri, Jan 11, 2013 at 01:51:43PM +, Alexandru Copot wrote: While doing simple IPv6 tests in KVM virtual machines, (add an IPv6 address to eth0) kmemleak complains about an unreferenced object: unreferenced object 0x88001e804120 (size 32): comm softirq, pid 0, jiffies 4294900928 (age 631.544s) hex dump (first 32 bytes): 28 cb fd 1d 00 00 00 00 0c 00 00 00 01 00 01 00 (... 02 d0 83 1d 00 00 00 00 6e 00 00 00 00 00 00 00 n... backtrace: [815ed721] kmemleak_alloc+0x21/0x50 [8111d0b0] __kmalloc+0xe0/0x160 [81362b7c] virtqueue_add_buf+0x1fc/0x3d0 [8140cbd3] start_xmit+0x153/0x3a0 [8150887e] dev_hard_start_xmit+0x21e/0x470 [815247ce] sch_direct_xmit+0xfe/0x280 [81509014] dev_queue_xmit+0x1f4/0x5d0 [81592e91] ip6_finish_output2+0x101/0x450 [81595ae8] ip6_finish_output+0x98/0x200 [81595ca1] ip6_output+0x51/0x1b0 [815b51af] mld_sendpack+0x19f/0x360 [815b59b4] mld_ifc_timer_expire+0x194/0x290 [8104b794] call_timer_fn+0x74/0xf0 [8104bb1b] run_timer_softirq+0x18b/0x220 [81045f81] __do_softirq+0xe1/0x1c0 [8160a4fc] call_softirq+0x1c/0x30 Seems the function vring_add_indirect stores an array of struct vring_desc by using virt_to_phys and kmemleak doesn't track the pointer. The following patch can fix this. Signed-off-by: Alexandru Copot alex.miha...@gmail.com CC: Daniel Baluta dbal...@ixiacom.com --- drivers/virtio/virtio_ring.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c index ffd7e7d..e0b591b 100644 --- a/drivers/virtio/virtio_ring.c +++ b/drivers/virtio/virtio_ring.c @@ -23,6 +23,7 @@ #include linux/slab.h #include linux/module.h #include linux/hrtimer.h +#include linux/kmemleak.h /* virtio guest is communicating with a virtual device that actually runs on * a host processor. Memory barriers are used to control SMP effects. */ @@ -140,6 +141,7 @@ static int vring_add_indirect(struct vring_virtqueue *vq, desc = kmalloc((out + in) * sizeof(struct vring_desc), gfp); if (!desc) return -ENOMEM; + kmemleak_not_leak(desc); Please add a comment above this call in case people later wonder why this annotation is needed. Thanks Cătălin. So, kememleak cannot handle this kind of pointer aliases? Also, I wonder if phys_to_virt(virt_to_phys(x)) == x holds true all the time. thanks, Daniel. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH] virtio: suppress kmemleak false positive
On Fri, Jan 11, 2013 at 02:32:12PM +, Daniel Baluta wrote: On Fri, Jan 11, 2013 at 4:02 PM, Catalin Marinas catalin.mari...@arm.com wrote: On Fri, Jan 11, 2013 at 01:51:43PM +, Alexandru Copot wrote: --- a/drivers/virtio/virtio_ring.c +++ b/drivers/virtio/virtio_ring.c @@ -23,6 +23,7 @@ #include linux/slab.h #include linux/module.h #include linux/hrtimer.h +#include linux/kmemleak.h /* virtio guest is communicating with a virtual device that actually runs on * a host processor. Memory barriers are used to control SMP effects. */ @@ -140,6 +141,7 @@ static int vring_add_indirect(struct vring_virtqueue *vq, desc = kmalloc((out + in) * sizeof(struct vring_desc), gfp); if (!desc) return -ENOMEM; + kmemleak_not_leak(desc); Please add a comment above this call in case people later wonder why this annotation is needed. So, kememleak cannot handle this kind of pointer aliases? No. Basically it stores the allocated object start/end in an rb-tree and cannot cope with overlapping blocks. If we allow aliases, we could have some overlapping. I think there was a patch in the past to add a separate rb-tree for aliases but I didn't particularly like it because it affected the performance. Also, I wonder if phys_to_virt(virt_to_phys(x)) == x holds true all the time. It should as long as x is a valid kernel virtual address in the logical (linear) mapping. -- Catalin ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [RFCv2 00/12] Introduce host-side virtio queue and CAIF Virtio.
On Fri, Jan 11, 2013 at 12:35 AM, Rusty Russell ru...@rustcorp.com.au wrote: Hi Sjur! OK, the Internet was no help here, how do you pronounce Sjur? I'm guessing shoor rhyming with tour until I know better. Thank you for asking! This is pretty close yes. I usually tell people to pronounce it like sure for simplicity. But Google translate has a perfect Norwegian voice pronouncing my name: http://translate.google.com/#auto/no/sjur (Press the speaker button) While at the subject: Norwegian names can be a laugh: I have people named Odd and Even in my family, and I once had a friend named Aashold. ... I would prefer not to split into pages at this point, but rather provide an iterator or the original list found in the descriptor to the client. ... In other words, boundaries matter? While the sg-in-place hack is close to optimal for TCM_VHOST, neither net not you can use it directly. I'll switch to an iovec (with a similar use-caller-supplied-if-it-fits trick); they're smaller anyway. Great, I think iovec will be a good fit for my CAIF driver. Thanks, Sjur ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [RFC PATCH 0/2] make mac programming for virtio net more robust
On 1/10/2013 11:46 PM, Michael S. Tsirkin wrote: On Fri, Jan 11, 2013 at 12:53:07PM +1030, Rusty Russell wrote: Michael S. Tsirkin m...@redhat.com writes: On Thu, Jan 10, 2013 at 10:45:39PM +0800, ak...@redhat.com wrote: From: Amos Kong ak...@redhat.com Currenly mac is programmed byte by byte. This means that we have an intermediate step where mac is wrong. Second patch introduced a new vq control command to set mac address in one time. As you mention we could alternatively do it without new commands, simply add a feature bit that says that MACs are in the mac table. This would be a much bigger patch, and I'm fine with either way. Rusty what do you think? Hmm, mac filtering and my mac address are not quite the same thing. I don't know if it matters for anyone: does it? The mac address is abused for things like identifying machines, etc. I don't know either. I think net core differentiates between mac and uc_list because linux has to know which mac to use when building up packets, so at some level, I agree it might be useful to identify the machine. BTW netdev/davem should have been copied on this, Amos I think it's a good idea to remember to do it next time you post. If we keep it as a separate concept, Amos' patch seems to make sense. Yes. It also keeps the patch small, I just thought I'd mention the option. Cheers, Rusty. Don't have the entire context here but if you implement the ndo_fdb_dump() probably hooking it up to ndo_dflt_fdb_dump() you could use the 'bridge' tool dump the uc_list. Then use ndo_fdb_add() and ndo_fdb_del() to add and remove entries from the uc_list. We do this today in macvlan and the ixgbe driver when it is in SR-IOV mode and the embedded switch needs to be programmed. fdb is forwarding database its a bit different then mac filtering in that its telling the switch how to forward mac addresses, in ixgbe and macvlan at least we have been overloading it a bit to also stop filtering the mac address. I think this makes sense if you setup forwarding to a port it doesn't make much sense to then drop them. Maybe its not entirely applicable here just thought I would mention it. Thanks, John ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH] virtio: suppress kmemleak false positive
On Fri, Jan 11, 2013 at 01:51:43PM +, Alexandru Copot wrote: While doing simple IPv6 tests in KVM virtual machines, (add an IPv6 address to eth0) kmemleak complains about an unreferenced object: unreferenced object 0x88001e804120 (size 32): comm softirq, pid 0, jiffies 4294900928 (age 631.544s) hex dump (first 32 bytes): 28 cb fd 1d 00 00 00 00 0c 00 00 00 01 00 01 00 (... 02 d0 83 1d 00 00 00 00 6e 00 00 00 00 00 00 00 n... backtrace: [815ed721] kmemleak_alloc+0x21/0x50 [8111d0b0] __kmalloc+0xe0/0x160 [81362b7c] virtqueue_add_buf+0x1fc/0x3d0 [8140cbd3] start_xmit+0x153/0x3a0 [8150887e] dev_hard_start_xmit+0x21e/0x470 [815247ce] sch_direct_xmit+0xfe/0x280 [81509014] dev_queue_xmit+0x1f4/0x5d0 [81592e91] ip6_finish_output2+0x101/0x450 [81595ae8] ip6_finish_output+0x98/0x200 [81595ca1] ip6_output+0x51/0x1b0 [815b51af] mld_sendpack+0x19f/0x360 [815b59b4] mld_ifc_timer_expire+0x194/0x290 [8104b794] call_timer_fn+0x74/0xf0 [8104bb1b] run_timer_softirq+0x18b/0x220 [81045f81] __do_softirq+0xe1/0x1c0 [8160a4fc] call_softirq+0x1c/0x30 Seems the function vring_add_indirect stores an array of struct vring_desc by using virt_to_phys and kmemleak doesn't track the pointer. The following patch can fix this. Signed-off-by: Alexandru Copot alex.miha...@gmail.com CC: Daniel Baluta dbal...@ixiacom.com --- drivers/virtio/virtio_ring.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c index ffd7e7d..e0b591b 100644 --- a/drivers/virtio/virtio_ring.c +++ b/drivers/virtio/virtio_ring.c @@ -23,6 +23,7 @@ #include linux/slab.h #include linux/module.h #include linux/hrtimer.h +#include linux/kmemleak.h /* virtio guest is communicating with a virtual device that actually runs on * a host processor. Memory barriers are used to control SMP effects. */ @@ -140,6 +141,7 @@ static int vring_add_indirect(struct vring_virtqueue *vq, desc = kmalloc((out + in) * sizeof(struct vring_desc), gfp); if (!desc) return -ENOMEM; + kmemleak_not_leak(desc); Please add a comment above this call in case people later wonder why this annotation is needed. -- Catalin ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [RFCv2 00/12] Introduce host-side virtio queue and CAIF Virtio.
On Fri, Jan 11, 2013 at 7:37 AM, Rusty Russell ru...@rustcorp.com.au wrote: virtio_host: host-side implementation of virtio rings (untested!) Getting use of virtio rings correct is tricky, and a recent patch saw an implementation of in-kernel rings (as separate from userspace). How do you see the in-kernel API for this? I would like to see something similar to my previous patches, where we extend the virtqueue API. E.g. something like this: struct virtqueue *vring_new_virtqueueh(unsigned int index, unsigned int num, unsigned int vring_align, struct virtio_device *vdev, bool weak_barriers, void *pages, void (*notify)(struct virtqueue *), void (*callback)(struct virtqueue *), const char *name); int virtqueueh_get_iov(struct virtqueue *vqh, struct vringh_iov *riov, struct vringh_iov *wiov, gfp_t gfp); int virtqueueh_add_iov(struct virtqueue *vqh, struct vringh_iov *riov, struct vringh_iov *wiov); I guess implementation of the host-virtqueue should stay in drivers/virtio? Regards, Sjur ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [Xen-devel] [PATCH v3 00/11] xen: Initial kexec/kdump implementation
On 11/01/13 13:22, Daniel Kiper wrote: On Thu, Jan 10, 2013 at 02:19:55PM +, David Vrabel wrote: On 04/01/13 17:01, Daniel Kiper wrote: My .5 cents: - We should focus on KEXEC_CMD_kexec_load and KEXEC_CMD_kexec_unload; probably we should introduce KEXEC_CMD_kexec_load2 and KEXEC_CMD_kexec_unload2; load should __LOAD__ kernel image and other things into hypervisor memory; Yes, but I don't see how we can easily support both ABIs easily. I'd be in favour of replacing the existing hypercalls and requiring updated kexec tools in dom0 (this isn't that different to requiring the correct libxc in dom0). Why? Just define new strutures for new functions of kexec hypercall. That should suffice. The current hypervisor ABI depends on an internal kernel ABI (i.e., the ABI provided by relocate_kernel). We do not want hypervisor internals to be constrained by having to be compatible with kernel internals. - Hmmm... Now I think that we should still use kexec syscall to load image into Xen memory (with new KEXEC_CMD_kexec_load2) because it establishes all things which are needed to call kdump if dom0 crashes; however, I could be wrong... I don't think we need the kexec syscall. The kernel can unconditionally do the crash hypercall, which will return if the kdump kernel isn't loaded and the kernel can fall back to the regular non-kexec panic. No, please do not do that. When you call HYPERVISOR_kexec_op(KEXEC_CMD_kexec) system is completly shutdown. Return form HYPERVISOR_kexec_op(KEXEC_CMD_kexec) would require to restore some kernel functionalities. It maybe impossible in some cases. Additionally, it means that some changes should be made in generic kexec code path. As I know kexec maintainers are very reluctant to make such things. Huh? There only needs to be a call to a new hypervisor_crash_kexec() function (which would then call the Xen specific crash hypercall) at the very beginning of crash_kexec(). If this returns the normal crash/shutdown path is done (which could even include a guest kexec!). David ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [Xen-devel] [PATCH v3 00/11] xen: Initial kexec/kdump implementation
David Vrabel david.vra...@citrix.com writes: On 11/01/13 13:22, Daniel Kiper wrote: On Thu, Jan 10, 2013 at 02:19:55PM +, David Vrabel wrote: On 04/01/13 17:01, Daniel Kiper wrote: My .5 cents: - We should focus on KEXEC_CMD_kexec_load and KEXEC_CMD_kexec_unload; probably we should introduce KEXEC_CMD_kexec_load2 and KEXEC_CMD_kexec_unload2; load should __LOAD__ kernel image and other things into hypervisor memory; Yes, but I don't see how we can easily support both ABIs easily. I'd be in favour of replacing the existing hypercalls and requiring updated kexec tools in dom0 (this isn't that different to requiring the correct libxc in dom0). Why? Just define new strutures for new functions of kexec hypercall. That should suffice. The current hypervisor ABI depends on an internal kernel ABI (i.e., the ABI provided by relocate_kernel). We do not want hypervisor internals to be constrained by having to be compatible with kernel internals. I think this is violent agreement. A new call with new arguments seems agreed upon. The only question seems to be what happens to the old hypercall. Keeping the current deprecated hypercall with the current ABI and not updating it, or modifying the current hypercall to return the xen equivalant of -ENOSYS seems to be the only question. Certainly /sbin/kexec will only support the new hypercall once the support has merged. No, please do not do that. When you call HYPERVISOR_kexec_op(KEXEC_CMD_kexec) system is completly shutdown. Return form HYPERVISOR_kexec_op(KEXEC_CMD_kexec) would require to restore some kernel functionalities. It maybe impossible in some cases. Additionally, it means that some changes should be made in generic kexec code path. As I know kexec maintainers are very reluctant to make such things. Huh? There only needs to be a call to a new hypervisor_crash_kexec() function (which would then call the Xen specific crash hypercall) at the very beginning of crash_kexec(). If this returns the normal crash/shutdown path is done (which could even include a guest kexec!). Can you imagine what crash_kexec would look like if every architecture would hard code their own little piece in there? The practical issue with changing crash_kexec is that you are hard coding Xen policy just before a jump to a piece of code whose purpose is to implement policy. From a maintenance and code comprehension stand-ponit it is much cleaner to put the hypervisor_crash_kexec() hypercall into the code that is loaded with sys_kexec_load and is branched to by crash_kexec. I would have no problem with hard coding that behavior into /sbin/kexec in the case of Xen dom0. Having any code have different semantics when running under Xen is a maintenance nightmare, and why we are having the conversation years and years after the initial deployment of Xen. A tiny hard coded stub that calls a hypercall should work indefinitely with no one having to do anything. Eric ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [Xen-devel] [PATCH v3 00/11] xen: Initial kexec/kdump implementation
And there is nothing fancy to be done for EFI and SecureBoot? Or is that something that the kernel has to handle on its own (so somehow passing some certificates to somewhere). For EFI, no... other than passing the EFI parameters, which apparently is *not* currently done (David Woodhouse is working on it.) Secure boot is still a work in progress. -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [Xen-devel] [PATCH v3 00/11] xen: Initial kexec/kdump implementation
On Fri, Jan 11, 2013 at 12:26:48PM -0800, H. Peter Anvin wrote: And there is nothing fancy to be done for EFI and SecureBoot? Or is that something that the kernel has to handle on its own (so somehow passing some certificates to somewhere). For EFI, no... other than passing the EFI parameters, which apparently is *not* currently done (David Woodhouse is working on it.) Secure boot is still a work in progress. For secureboot, as a first step in that direction, I just wrote some code to sign elf executable and be able to verify it in kernel upon exec(). I am soon planning to post RFC code (most likely next week). Hopefully we will be able to sign statically signed /sbin/kexec, give it extra capability (upon signature verification) to be able to call sys_exec(). Thanks Vivek ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [Xen-devel] [PATCH v3 00/11] xen: Initial kexec/kdump implementation
On Fri, Jan 11, 2013 at 12:26:56PM -0800, Eric W. Biederman wrote: [..] Recently there is a desire to figure out how to /sbin/kexec support signed kernel images. What will probably happen is to have a specially trusted userspace application perform the verification. Sort of like dom0 for the linux userspace. A few other ideas have been batted around but none that have stuck. [ CC David Howells ] Eric, In a private conversation, David Howells suggested why not pass kernel signature in a segment to kernel and kernel can do the verification. /sbin/kexec signature is verified by kernel at exec() time. Then /sbin/kexec just passes one signature segment (after regular segment) for each segment being loaded. The segments which don't have signature, are passed with section size 0. And signature passing behavior can be controlled by one new kexec flag. That way /sbin/kexec does not have to worry about doing any verification by itself. In fact, I am not sure how it can do the verification when crypto libraries it will need are not signed (assuming they are not statically linked in). What do you think about this idea? Thanks Vivek ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [Xen-devel] [PATCH v3 00/11] xen: Initial kexec/kdump implementation
On 01/11/2013 12:52 PM, Vivek Goyal wrote: Eric, In a private conversation, David Howells suggested why not pass kernel signature in a segment to kernel and kernel can do the verification. /sbin/kexec signature is verified by kernel at exec() time. Then /sbin/kexec just passes one signature segment (after regular segment) for each segment being loaded. The segments which don't have signature, are passed with section size 0. And signature passing behavior can be controlled by one new kexec flag. That way /sbin/kexec does not have to worry about doing any verification by itself. In fact, I am not sure how it can do the verification when crypto libraries it will need are not signed (assuming they are not statically linked in). What do you think about this idea? A signed /sbin/kexec would realistically have to be statically linked, at least in the short term; otherwise the libraries and ld.so would need verification as well. Now, that *might* very well have some real value -- there are certainly users out there who would very much want only binaries signed with specific keys to get run on their system. -hpa ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [Xen-devel] [PATCH v3 00/11] xen: Initial kexec/kdump implementation
On Fri, Jan 11, 2013 at 01:03:41PM -0800, H. Peter Anvin wrote: On 01/11/2013 12:52 PM, Vivek Goyal wrote: Eric, In a private conversation, David Howells suggested why not pass kernel signature in a segment to kernel and kernel can do the verification. /sbin/kexec signature is verified by kernel at exec() time. Then /sbin/kexec just passes one signature segment (after regular segment) for each segment being loaded. The segments which don't have signature, are passed with section size 0. And signature passing behavior can be controlled by one new kexec flag. That way /sbin/kexec does not have to worry about doing any verification by itself. In fact, I am not sure how it can do the verification when crypto libraries it will need are not signed (assuming they are not statically linked in). What do you think about this idea? A signed /sbin/kexec would realistically have to be statically linked, at least in the short term; otherwise the libraries and ld.so would need verification as well. Yes. That's the expectation. Sign only statically linked exeutables which don't do any of dlopen() stuff either. In fact in the patch, I fail the exec() if signed executable has interpreter. Thanks Vivek ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [Xen-devel] [PATCH v3 00/11] xen: Initial kexec/kdump implementation
On 01/11/2013 01:08 PM, Vivek Goyal wrote: A signed /sbin/kexec would realistically have to be statically linked, at least in the short term; otherwise the libraries and ld.so would need verification as well. Yes. That's the expectation. Sign only statically linked exeutables which don't do any of dlopen() stuff either. In fact in the patch, I fail the exec() if signed executable has interpreter. As I said, though (and possibly not for kexec, that depends): in the long term we probably want a way to be able to sign all kinds binaries in the system. -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [RFCv2 00/12] Introduce host-side virtio queue and CAIF Virtio.
Sjur Brændeland sjurb...@gmail.com writes: How do you see the in-kernel API for this? I would like to see something similar to my previous patches, where we extend the virtqueue API. E.g. something like this: struct virtqueue *vring_new_virtqueueh(unsigned int index, unsigned int num, unsigned int vring_align, struct virtio_device *vdev, bool weak_barriers, void *pages, void (*notify)(struct virtqueue *), void (*callback)(struct virtqueue *), const char *name); I was just going to create _kernel variants of all the _user helpers, and let you drive it directly like that. If we get a second in-kernel user, we create wrappers (I'd prefer not to overload struct virtqueue though). Cheers, Rusty. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [RFCv2 00/12] Introduce host-side virtio queue and CAIF Virtio.
Michael S. Tsirkin m...@redhat.com writes: On Fri, Jan 11, 2013 at 09:18:33AM +1030, Rusty Russell wrote: Michael S. Tsirkin m...@redhat.com writes: On Thu, Jan 10, 2013 at 09:00:55PM +1030, Rusty Russell wrote: Not sure why vhost/net doesn't built a packet and feed it in netif_rx_ni(). This is what tun seems to do, and with this code it should be fairly optimal. Because we want to use NAPI. Not quite what I was asking; it was more a question of why we're using a raw socket, when we trivially have a complete skb already which we should be able to feed to Linux like any network packet. Oh for some reason I thought you were talking about virtio. I don't really understand what you are saying here - vhost actually calls out to tun to build and submit the skb. Ah, the fd is tun? Seems a bit indirect; I wonder if there's room for more optimization here... Cheers, Rusty. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization