Re: [PATCH V4 1/2] virtio-net: fix the set affinity bug when CPU IDs are not consecutive

2013-01-11 Thread Jason Wang
On 01/11/2013 01:56 PM, Wanlong Gao wrote:
 As Michael mentioned, set affinity and select queue will not work very
 well when CPU IDs are not consecutive, this can happen with hot unplug.
 Fix this bug by traversal the online CPUs, and create a per cpu variable
 to find the mapping from CPU to the preferable virtual-queue.

Looks like ixgbe is trying to handling a similar issue recently by
trying to reuse XPS when the channels and cpus are not equal. I wonder
whether we can use the same method.

 Cc: Rusty Russell ru...@rustcorp.com.au
 Cc: Michael S. Tsirkin m...@redhat.com
 Cc: Jason Wang jasow...@redhat.com
 Cc: Eric Dumazet erdnet...@gmail.com
 Cc: virtualization@lists.linux-foundation.org
 Cc: net...@vger.kernel.org
 Signed-off-by: Wanlong Gao gaowanl...@cn.fujitsu.com
 ---
 V3-V4:
   move vq_index into virtnet_info (Jason)
   change the mapping value when not setting affinity (Jason)
   address the comments about select_queue (Rusty)
 Not Addressed yet:
   race between select_queue and set_channels need discuss more
   but not the same problem with this (Rusty)

  drivers/net/virtio_net.c | 53 
 ++--
  1 file changed, 42 insertions(+), 11 deletions(-)

 diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
 index a6fcf15..ca17a58 100644
 --- a/drivers/net/virtio_net.c
 +++ b/drivers/net/virtio_net.c
 @@ -123,6 +123,9 @@ struct virtnet_info {
  
   /* Does the affinity hint is set for virtqueues? */
   bool affinity_hint_set;
 +
 + /* Per-cpu variable to show the mapping from CPU to virtqueue */
 + int __percpu *vq_index;
  };
  
  struct skb_vnet_hdr {
 @@ -1016,6 +1019,7 @@ static int virtnet_vlan_rx_kill_vid(struct net_device 
 *dev, u16 vid)
  static void virtnet_set_affinity(struct virtnet_info *vi, bool set)
  {
   int i;
 + int cpu;
  
   /* In multiqueue mode, when the number of cpu is equal to the number of
* queue pairs, we let the queue pairs to be private to one cpu by
 @@ -1029,16 +1033,29 @@ static void virtnet_set_affinity(struct virtnet_info 
 *vi, bool set)
   return;
   }
  
 - for (i = 0; i  vi-max_queue_pairs; i++) {
 - int cpu = set ? i : -1;
 - virtqueue_set_affinity(vi-rq[i].vq, cpu);
 - virtqueue_set_affinity(vi-sq[i].vq, cpu);
 - }
 + if (set) {
 + i = 0;
 + for_each_online_cpu(cpu) {
 + virtqueue_set_affinity(vi-rq[i].vq, cpu);
 + virtqueue_set_affinity(vi-sq[i].vq, cpu);
 + *per_cpu_ptr(vi-vq_index, cpu) = i;
 + i++;
 + }
  
 - if (set)
   vi-affinity_hint_set = true;
 - else
 + } else {
 + for(i = 0; i  vi-max_queue_pairs; i++) {
 + virtqueue_set_affinity(vi-rq[i].vq, -1);
 + virtqueue_set_affinity(vi-sq[i].vq, -1);
 + }
 +
 + i = 0;
 + for_each_online_cpu(cpu)
 + *per_cpu_ptr(vi-vq_index, cpu) =
 + ++i % vi-curr_queue_pairs;
 +
   vi-affinity_hint_set = false;
 + }
  }
  
  static void virtnet_get_ringparam(struct net_device *dev,
 @@ -1127,12 +1144,19 @@ static int virtnet_change_mtu(struct net_device *dev, 
 int new_mtu)
  
  /* To avoid contending a lock hold by a vcpu who would exit to host, select 
 the
   * txq based on the processor id.
 - * TODO: handle cpu hotplug.
   */
  static u16 virtnet_select_queue(struct net_device *dev, struct sk_buff *skb)
  {
 - int txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) :
 -   smp_processor_id();
 + int txq;
 + struct virtnet_info *vi = netdev_priv(dev);
 +
 + if (skb_rx_queue_recorded(skb)) {
 + txq = skb_get_rx_queue(skb);
 + } else {
 + txq = *__this_cpu_ptr(vi-vq_index);
 + if (txq == -1)
 + txq = 0;
 + }
  
   while (unlikely(txq = dev-real_num_tx_queues))
   txq -= dev-real_num_tx_queues;
 @@ -1453,6 +1477,10 @@ static int virtnet_probe(struct virtio_device *vdev)
   if (vi-stats == NULL)
   goto free;
  
 + vi-vq_index = alloc_percpu(int);
 + if (vi-vq_index == NULL)
 + goto free_stats;
 +
   mutex_init(vi-config_lock);
   vi-config_enable = true;
   INIT_WORK(vi-config_work, virtnet_config_changed_work);
 @@ -1476,7 +1504,7 @@ static int virtnet_probe(struct virtio_device *vdev)
   /* Allocate/initialize the rx/tx queues, and invoke find_vqs */
   err = init_vqs(vi);
   if (err)
 - goto free_stats;
 + goto free_index;
  
   netif_set_real_num_tx_queues(dev, 1);
   netif_set_real_num_rx_queues(dev, 1);
 @@ -1520,6 +1548,8 @@ free_recv_bufs:
  free_vqs:
   cancel_delayed_work_sync(vi-refill);
   virtnet_del_vqs(vi);
 +free_index:
 + 

Re: [Xen-devel] [PATCH v3 00/11] xen: Initial kexec/kdump implementation

2013-01-11 Thread Daniel Kiper
On Thu, Jan 10, 2013 at 02:19:55PM +, David Vrabel wrote:
 On 04/01/13 17:01, Daniel Kiper wrote:
  On Fri, Jan 04, 2013 at 02:38:44PM +, David Vrabel wrote:
  On 04/01/13 14:22, Daniel Kiper wrote:
  On Wed, Jan 02, 2013 at 11:26:43AM +, Andrew Cooper wrote:
  On 27/12/12 18:02, Eric W. Biederman wrote:
  Andrew Cooperandrew.coop...@citrix.com  writes:
 
  On 27/12/2012 07:53, Eric W. Biederman wrote:
  The syscall ABI still has the wrong semantics.
 
  Aka totally unmaintainable and umergeable.
 
  The concept of domU support is also strange.  What does domU support 
  even mean, when the dom0 support is loading a kernel to pick up Xen 
  when Xen falls over.
  There are two requirements pulling at this patch series, but I agree
  that we need to clarify them.
  It probably make sense to split them apart a little even.
 
 
 
  Thinking about this split, there might be a way to simply it even more.
 
  /sbin/kexec can load the Xen crash kernel itself by issuing
  hypercalls using /dev/xen/privcmd.  This would remove the need for
  the dom0 kernel to distinguish between loading a crash kernel for
  itself and loading a kernel for Xen.
 
  Or is this just a silly idea complicating the matter?
 
  This is impossible with current Xen kexec/kdump interface.
  It should be changed to do that. However, I suppose that
  Xen community would not be interested in such changes.
 
  I don't see why the hypercall ABI cannot be extended with new sub-ops
  that do the right thing -- the existing ABI is a bit weird.
 
  I plan to start prototyping something shortly (hopefully next week) for
  the Xen kexec case.
 
  Wow... As I can this time Xen community is interested in...
  That is great. I agree that current kexec interface is not ideal.

 I spent some more time looking at the existing interface and
 implementation and it really is broken.

  David, I am happy to help in that process. However, if you wish I could
  carry it myself. Anyway, it looks that I should hold on with my
  Linux kexec/kdump patches.

 I should be able to post some prototype patches for Xen in a few weeks.
  No guarantees though.

That is great. If you need any help drop me a line.

  My .5 cents:
- We should focus on KEXEC_CMD_kexec_load and KEXEC_CMD_kexec_unload;
  probably we should introduce KEXEC_CMD_kexec_load2 and 
  KEXEC_CMD_kexec_unload2;
  load should __LOAD__ kernel image and other things into hypervisor 
  memory;

 Yes, but I don't see how we can easily support both ABIs easily.  I'd be
 in favour of replacing the existing hypercalls and requiring updated
 kexec tools in dom0 (this isn't that different to requiring the correct
 libxc in dom0).

Why? Just define new strutures for new functions of kexec hypercall.
That should suffice.

  I suppose that allmost all things could be copied from 
  linux/kernel/kexec.c,
  
  linux/arch/x86/kernel/{machine_kexec_$(BITS).c,relocate_kernel_$(BITS).c};
  I think that KEXEC_CMD_kexec should stay as is,

 I don't think we want all the junk from Linux inside Xen -- we only want
 to support the kdump case and do not have to handle returning from the
 kexec image.

I do not want to implement kexec jump or stuff like. However, I think that
it is worth use code which could be used. As I know there are lot of stuff
which was taken with smaller or bigger changes from Linux Kernel.
Why we would like to reinvent the wheel this time?

Additionally, we should not drop kexec support. It is main part of kdump.
In case of kdump new kernel (and other stuff) is placed in prealocated
space in contrary to kexec. That's all. kexec is useful if you would like
to quickly (skipping BIOS) switch from Xen to baremetal Linux. If you drop
kexec support from Xen then you need alter kexec-tools package in bunch
of distros to take into account new Xen behavior.
I think that it is not we want to do.

- Hmmm... Now I think that we should still use kexec syscall to load image
  into Xen memory (with new KEXEC_CMD_kexec_load2) because it establishes
  all things which are needed to call kdump if dom0 crashes; however,
  I could be wrong...

 I don't think we need the kexec syscall.  The kernel can unconditionally
 do the crash hypercall, which will return if the kdump kernel isn't
 loaded and the kernel can fall back to the regular non-kexec panic.

No, please do not do that. When you call HYPERVISOR_kexec_op(KEXEC_CMD_kexec)
system is completly shutdown. Return form HYPERVISOR_kexec_op(KEXEC_CMD_kexec)
would require to restore some kernel functionalities. It maybe impossible
in some cases. Additionally, it means that some changes should be made
in generic kexec code path. As I know kexec maintainers are very reluctant
to make such things.

 This will allow the kexec syscall to be used only for the domU kexec case.

- last but not least, we should think about support for PV guests
  too.

 I won't be looking at this.

OK.

 To avoid confusion about the two largely 

Re: [Xen-devel] [PATCH v3 00/11] xen: Initial kexec/kdump implementation

2013-01-11 Thread Daniel Kiper
On Mon, Jan 07, 2013 at 01:49:44PM +, Ian Campbell wrote:
 On Mon, 2013-01-07 at 12:34 +, Daniel Kiper wrote:
  I think that new kexec hypercall function should mimics kexec syscall.
 
 We want to have an interface can be used by non-Linux domains (both dom0
 and domU) as well though, so please bear this in mind.

I agree, but all arguments passed to kexec syscall are quiet generic and they
do not impose any limitations. Just look into include/linux/kexec.h.
That is why I think that a lot of things could be taken from
Linux kexec implementation.

Daniel
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] virtio: suppress kmemleak false positive

2013-01-11 Thread Daniel Baluta
On Fri, Jan 11, 2013 at 4:02 PM, Catalin Marinas catalin.mari...@arm.comwrote:

 On Fri, Jan 11, 2013 at 01:51:43PM +, Alexandru Copot wrote:
  While doing simple IPv6 tests in KVM virtual machines,
  (add an IPv6 address to eth0) kmemleak complains about
  an unreferenced object:
 
  unreferenced object 0x88001e804120 (size 32):
comm softirq, pid 0, jiffies 4294900928 (age 631.544s)
hex dump (first 32 bytes):
  28 cb fd 1d 00 00 00 00 0c 00 00 00 01 00 01 00  (...
  02 d0 83 1d 00 00 00 00 6e 00 00 00 00 00 00 00  n...
backtrace:
  [815ed721] kmemleak_alloc+0x21/0x50
  [8111d0b0] __kmalloc+0xe0/0x160
  [81362b7c] virtqueue_add_buf+0x1fc/0x3d0
  [8140cbd3] start_xmit+0x153/0x3a0
  [8150887e] dev_hard_start_xmit+0x21e/0x470
  [815247ce] sch_direct_xmit+0xfe/0x280
  [81509014] dev_queue_xmit+0x1f4/0x5d0
  [81592e91] ip6_finish_output2+0x101/0x450
  [81595ae8] ip6_finish_output+0x98/0x200
  [81595ca1] ip6_output+0x51/0x1b0
  [815b51af] mld_sendpack+0x19f/0x360
  [815b59b4] mld_ifc_timer_expire+0x194/0x290
  [8104b794] call_timer_fn+0x74/0xf0
  [8104bb1b] run_timer_softirq+0x18b/0x220
  [81045f81] __do_softirq+0xe1/0x1c0
  [8160a4fc] call_softirq+0x1c/0x30
 
  Seems the function vring_add_indirect stores an array of
  struct vring_desc by using virt_to_phys and kmemleak
  doesn't track the pointer.
 
  The following patch can fix this.
 
  Signed-off-by: Alexandru Copot alex.miha...@gmail.com
  CC: Daniel Baluta dbal...@ixiacom.com
  ---
   drivers/virtio/virtio_ring.c | 2 ++
   1 file changed, 2 insertions(+)
 
  diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
  index ffd7e7d..e0b591b 100644
  --- a/drivers/virtio/virtio_ring.c
  +++ b/drivers/virtio/virtio_ring.c
  @@ -23,6 +23,7 @@
   #include linux/slab.h
   #include linux/module.h
   #include linux/hrtimer.h
  +#include linux/kmemleak.h
 
   /* virtio guest is communicating with a virtual device that actually
 runs on
* a host processor.  Memory barriers are used to control SMP effects.
 */
  @@ -140,6 +141,7 @@ static int vring_add_indirect(struct vring_virtqueue
 *vq,
desc = kmalloc((out + in) * sizeof(struct vring_desc), gfp);
if (!desc)
return -ENOMEM;
  + kmemleak_not_leak(desc);

 Please add a comment above this call in case people later wonder why
 this annotation is needed.


Thanks Cătălin.

So, kememleak cannot handle this kind of pointer aliases?
Also, I wonder if phys_to_virt(virt_to_phys(x)) == x holds true all the
time.

thanks,
Daniel.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH] virtio: suppress kmemleak false positive

2013-01-11 Thread Catalin Marinas
On Fri, Jan 11, 2013 at 02:32:12PM +, Daniel Baluta wrote:
 On Fri, Jan 11, 2013 at 4:02 PM, Catalin Marinas catalin.mari...@arm.com
 wrote:
  On Fri, Jan 11, 2013 at 01:51:43PM +, Alexandru Copot wrote:
   --- a/drivers/virtio/virtio_ring.c
   +++ b/drivers/virtio/virtio_ring.c
   @@ -23,6 +23,7 @@
    #include linux/slab.h
    #include linux/module.h
    #include linux/hrtimer.h
   +#include linux/kmemleak.h
  
    /* virtio guest is communicating with a virtual device that actually 
   runs on
     * a host processor.  Memory barriers are used to control SMP effects. */
   @@ -140,6 +141,7 @@ static int vring_add_indirect(struct vring_virtqueue 
   *vq,
         desc = kmalloc((out + in) * sizeof(struct vring_desc), gfp);
         if (!desc)
                 return -ENOMEM;
   +     kmemleak_not_leak(desc);
 
  Please add a comment above this call in case people later wonder why
  this annotation is needed.
 
 So, kememleak cannot handle this kind of pointer aliases?

No. Basically it stores the allocated object start/end in an rb-tree and
cannot cope with overlapping blocks. If we allow aliases, we could have
some overlapping.

I think there was a patch in the past to add a separate rb-tree for
aliases but I didn't particularly like it because it affected the
performance.

 Also, I wonder if phys_to_virt(virt_to_phys(x)) == x holds true all the time.

It should as long as x is a valid kernel virtual address in the logical
(linear) mapping.

-- 
Catalin
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFCv2 00/12] Introduce host-side virtio queue and CAIF Virtio.

2013-01-11 Thread Sjur Brændeland
On Fri, Jan 11, 2013 at 12:35 AM, Rusty Russell ru...@rustcorp.com.au wrote:
 Hi Sjur!

 OK, the Internet was no help here, how do you pronounce Sjur?
 I'm guessing shoor rhyming with tour until I know better.

Thank you for asking! This is pretty close yes.
I usually tell people to pronounce it like sure for simplicity.
But Google translate has a perfect Norwegian voice pronouncing my name:
http://translate.google.com/#auto/no/sjur (Press the speaker button)

While at the subject: Norwegian names can be a laugh: I have people
named Odd and Even in my family, and I once had a friend named
Aashold.

...

 I would prefer not to split into pages at this point, but rather provide an
 iterator or the original list found in the descriptor to the client.
...
 In other words, boundaries matter?

 While the sg-in-place hack is close to optimal for TCM_VHOST, neither
 net not you can use it directly.  I'll switch to an iovec (with a
 similar use-caller-supplied-if-it-fits trick); they're smaller anyway.

Great, I think iovec will be a good fit for my CAIF driver.

Thanks,
Sjur
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC PATCH 0/2] make mac programming for virtio net more robust

2013-01-11 Thread John Fastabend

On 1/10/2013 11:46 PM, Michael S. Tsirkin wrote:

On Fri, Jan 11, 2013 at 12:53:07PM +1030, Rusty Russell wrote:

Michael S. Tsirkin m...@redhat.com writes:


On Thu, Jan 10, 2013 at 10:45:39PM +0800, ak...@redhat.com wrote:

From: Amos Kong ak...@redhat.com

Currenly mac is programmed byte by byte. This means that we
have an intermediate step where mac is wrong.

Second patch introduced a new vq control command to set mac
address in one time.


As you mention we could alternatively do it without
new commands, simply add a feature bit that says that MACs are
in the mac table.
This would be a much bigger patch, and I'm fine with either way.
Rusty what do you think?


Hmm, mac filtering and my mac address are not quite the same thing.  I
don't know if it matters for anyone: does it?
The mac address is abused
for things like identifying machines, etc.


I don't know either. I think net core differentiates between mac and
uc_list because linux has to know which mac to use when building
up packets, so at some level, I agree it might be useful to identify the
machine.

BTW netdev/davem should have been copied on this, Amos I think it's a
good idea to remember to do it next time you post.



If we keep it as a separate concept, Amos' patch seems to make sense.


Yes. It also keeps the patch small, I just thought I'd mention the
option.



Cheers,
Rusty.




Don't have the entire context here but if you implement the
ndo_fdb_dump() probably hooking it up to ndo_dflt_fdb_dump() you could
use the 'bridge' tool dump the uc_list.

Then use ndo_fdb_add() and ndo_fdb_del() to add and remove entries
from the uc_list. We do this today in macvlan and the ixgbe driver when
it is in SR-IOV mode and the embedded switch needs to be programmed.

fdb is forwarding database its a bit different then mac filtering
in that its telling the switch how to forward mac addresses, in
ixgbe and macvlan at least we have been overloading it a bit to also
stop filtering the mac address. I think this makes sense if you setup
forwarding to a port it doesn't make much sense to then drop them.

Maybe its not entirely applicable here just thought I would mention it.

Thanks,
John
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] virtio: suppress kmemleak false positive

2013-01-11 Thread Catalin Marinas
On Fri, Jan 11, 2013 at 01:51:43PM +, Alexandru Copot wrote:
 While doing simple IPv6 tests in KVM virtual machines,
 (add an IPv6 address to eth0) kmemleak complains about
 an unreferenced object:
 
 unreferenced object 0x88001e804120 (size 32):
   comm softirq, pid 0, jiffies 4294900928 (age 631.544s)
   hex dump (first 32 bytes):
 28 cb fd 1d 00 00 00 00 0c 00 00 00 01 00 01 00  (...
 02 d0 83 1d 00 00 00 00 6e 00 00 00 00 00 00 00  n...
   backtrace:
 [815ed721] kmemleak_alloc+0x21/0x50
 [8111d0b0] __kmalloc+0xe0/0x160
 [81362b7c] virtqueue_add_buf+0x1fc/0x3d0
 [8140cbd3] start_xmit+0x153/0x3a0
 [8150887e] dev_hard_start_xmit+0x21e/0x470
 [815247ce] sch_direct_xmit+0xfe/0x280
 [81509014] dev_queue_xmit+0x1f4/0x5d0
 [81592e91] ip6_finish_output2+0x101/0x450
 [81595ae8] ip6_finish_output+0x98/0x200
 [81595ca1] ip6_output+0x51/0x1b0
 [815b51af] mld_sendpack+0x19f/0x360
 [815b59b4] mld_ifc_timer_expire+0x194/0x290
 [8104b794] call_timer_fn+0x74/0xf0
 [8104bb1b] run_timer_softirq+0x18b/0x220
 [81045f81] __do_softirq+0xe1/0x1c0
 [8160a4fc] call_softirq+0x1c/0x30
 
 Seems the function vring_add_indirect stores an array of
 struct vring_desc by using virt_to_phys and kmemleak
 doesn't track the pointer.
 
 The following patch can fix this.
 
 Signed-off-by: Alexandru Copot alex.miha...@gmail.com
 CC: Daniel Baluta dbal...@ixiacom.com
 ---
  drivers/virtio/virtio_ring.c | 2 ++
  1 file changed, 2 insertions(+)
 
 diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
 index ffd7e7d..e0b591b 100644
 --- a/drivers/virtio/virtio_ring.c
 +++ b/drivers/virtio/virtio_ring.c
 @@ -23,6 +23,7 @@
  #include linux/slab.h
  #include linux/module.h
  #include linux/hrtimer.h
 +#include linux/kmemleak.h
  
  /* virtio guest is communicating with a virtual device that actually runs 
 on
   * a host processor.  Memory barriers are used to control SMP effects. */
 @@ -140,6 +141,7 @@ static int vring_add_indirect(struct vring_virtqueue *vq,
   desc = kmalloc((out + in) * sizeof(struct vring_desc), gfp);
   if (!desc)
   return -ENOMEM;
 + kmemleak_not_leak(desc);

Please add a comment above this call in case people later wonder why
this annotation is needed.

-- 
Catalin
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFCv2 00/12] Introduce host-side virtio queue and CAIF Virtio.

2013-01-11 Thread Sjur Brændeland
On Fri, Jan 11, 2013 at 7:37 AM, Rusty Russell ru...@rustcorp.com.au wrote:
virtio_host: host-side implementation of virtio rings (untested!)

Getting use of virtio rings correct is tricky, and a recent patch saw
an implementation of in-kernel rings (as separate from userspace).

How do you see the in-kernel API for this? I would like to see
something similar to my previous patches, where we extend
the virtqueue API. E.g. something like this:

struct virtqueue *vring_new_virtqueueh(unsigned int index,
   unsigned int num,
   unsigned int vring_align,
   struct virtio_device *vdev,
   bool weak_barriers,
   void *pages,
   void (*notify)(struct virtqueue *),
   void (*callback)(struct virtqueue *),
   const char *name);

int virtqueueh_get_iov(struct virtqueue *vqh,
   struct vringh_iov *riov,
   struct vringh_iov *wiov,
   gfp_t gfp);

int virtqueueh_add_iov(struct virtqueue *vqh,
   struct vringh_iov *riov,
   struct vringh_iov *wiov);

I guess implementation of the host-virtqueue should stay in drivers/virtio?

Regards,
Sjur
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] [PATCH v3 00/11] xen: Initial kexec/kdump implementation

2013-01-11 Thread David Vrabel
On 11/01/13 13:22, Daniel Kiper wrote:
 On Thu, Jan 10, 2013 at 02:19:55PM +, David Vrabel wrote:
 On 04/01/13 17:01, Daniel Kiper wrote:
 My .5 cents:
   - We should focus on KEXEC_CMD_kexec_load and KEXEC_CMD_kexec_unload;
 probably we should introduce KEXEC_CMD_kexec_load2 and 
 KEXEC_CMD_kexec_unload2;
 load should __LOAD__ kernel image and other things into hypervisor 
 memory;

 Yes, but I don't see how we can easily support both ABIs easily.  I'd be
 in favour of replacing the existing hypercalls and requiring updated
 kexec tools in dom0 (this isn't that different to requiring the correct
 libxc in dom0).
 
 Why? Just define new strutures for new functions of kexec hypercall.
 That should suffice.

The current hypervisor ABI depends on an internal kernel ABI (i.e., the
ABI provided by relocate_kernel).  We do not want hypervisor internals
to be constrained by having to be compatible with kernel internals.

   - Hmmm... Now I think that we should still use kexec syscall to load image
 into Xen memory (with new KEXEC_CMD_kexec_load2) because it establishes
 all things which are needed to call kdump if dom0 crashes; however,
 I could be wrong...

 I don't think we need the kexec syscall.  The kernel can unconditionally
 do the crash hypercall, which will return if the kdump kernel isn't
 loaded and the kernel can fall back to the regular non-kexec panic.
 
 No, please do not do that. When you call HYPERVISOR_kexec_op(KEXEC_CMD_kexec)
 system is completly shutdown. Return form HYPERVISOR_kexec_op(KEXEC_CMD_kexec)
 would require to restore some kernel functionalities. It maybe impossible
 in some cases. Additionally, it means that some changes should be made
 in generic kexec code path. As I know kexec maintainers are very reluctant
 to make such things.

Huh?  There only needs to be a call to a new hypervisor_crash_kexec()
function (which would then call the Xen specific crash hypercall) at the
very beginning of crash_kexec().  If this returns the normal
crash/shutdown path is done (which could even include a guest kexec!).

David
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] [PATCH v3 00/11] xen: Initial kexec/kdump implementation

2013-01-11 Thread Eric W. Biederman
David Vrabel david.vra...@citrix.com writes:

 On 11/01/13 13:22, Daniel Kiper wrote:
 On Thu, Jan 10, 2013 at 02:19:55PM +, David Vrabel wrote:
 On 04/01/13 17:01, Daniel Kiper wrote:
 My .5 cents:
   - We should focus on KEXEC_CMD_kexec_load and KEXEC_CMD_kexec_unload;
 probably we should introduce KEXEC_CMD_kexec_load2 and 
 KEXEC_CMD_kexec_unload2;
 load should __LOAD__ kernel image and other things into hypervisor 
 memory;

 Yes, but I don't see how we can easily support both ABIs easily.  I'd be
 in favour of replacing the existing hypercalls and requiring updated
 kexec tools in dom0 (this isn't that different to requiring the correct
 libxc in dom0).
 
 Why? Just define new strutures for new functions of kexec hypercall.
 That should suffice.

 The current hypervisor ABI depends on an internal kernel ABI (i.e., the
 ABI provided by relocate_kernel).  We do not want hypervisor internals
 to be constrained by having to be compatible with kernel internals.

I think this is violent agreement.  A new call with new arguments seems
agreed upon.  The only question seems to be what happens to the old
hypercall.  Keeping the current deprecated hypercall with the current
ABI and not updating it, or modifying the current hypercall to return
the xen equivalant of -ENOSYS seems to be the only question.

Certainly /sbin/kexec will only support the new hypercall once the
support has merged.

 No, please do not do that. When you call HYPERVISOR_kexec_op(KEXEC_CMD_kexec)
 system is completly shutdown. Return form 
 HYPERVISOR_kexec_op(KEXEC_CMD_kexec)
 would require to restore some kernel functionalities. It maybe impossible
 in some cases. Additionally, it means that some changes should be made
 in generic kexec code path. As I know kexec maintainers are very reluctant
 to make such things.

 Huh?  There only needs to be a call to a new hypervisor_crash_kexec()
 function (which would then call the Xen specific crash hypercall) at the
 very beginning of crash_kexec().  If this returns the normal
 crash/shutdown path is done (which could even include a guest kexec!).

Can you imagine what crash_kexec would look like if every architecture
would hard code their own little piece in there?

The practical issue with changing crash_kexec is that you are hard
coding Xen policy just before a jump to a piece of code whose purpose
is to implement policy.

From a maintenance and code comprehension stand-ponit it is much cleaner
to put the hypervisor_crash_kexec() hypercall into the code that is
loaded with sys_kexec_load and is branched to by crash_kexec.  I would
have no problem with hard coding that behavior into /sbin/kexec in
the case of Xen dom0.

Having any code have different semantics when running under Xen is a
maintenance nightmare, and why we are having the conversation years and
years after the initial deployment of Xen.  A tiny hard coded stub that
calls a hypercall should work indefinitely with no one having to do
anything.

Eric

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] [PATCH v3 00/11] xen: Initial kexec/kdump implementation

2013-01-11 Thread H. Peter Anvin


And there is nothing fancy to be done for EFI and SecureBoot? Or is
that something that the kernel has to handle on its own (so somehow
passing some certificates to somewhere).



For EFI, no... other than passing the EFI parameters, which apparently 
is *not* currently done (David Woodhouse is working on it.)  Secure boot 
is still a work in progress.



--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] [PATCH v3 00/11] xen: Initial kexec/kdump implementation

2013-01-11 Thread Vivek Goyal
On Fri, Jan 11, 2013 at 12:26:48PM -0800, H. Peter Anvin wrote:
 
 And there is nothing fancy to be done for EFI and SecureBoot? Or is
 that something that the kernel has to handle on its own (so somehow
 passing some certificates to somewhere).
 
 
 For EFI, no... other than passing the EFI parameters, which
 apparently is *not* currently done (David Woodhouse is working on
 it.)  Secure boot is still a work in progress.

For secureboot, as a first step in that direction, I just wrote some code
to sign elf executable and be able to verify it in kernel upon exec(). I
am soon planning to post RFC code (most likely next week).

Hopefully we will be able to sign statically signed /sbin/kexec, give
it extra capability (upon signature verification) to be able to call
sys_exec().

Thanks
Vivek
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] [PATCH v3 00/11] xen: Initial kexec/kdump implementation

2013-01-11 Thread Vivek Goyal
On Fri, Jan 11, 2013 at 12:26:56PM -0800, Eric W. Biederman wrote:

[..]
 Recently there is a desire to figure out how to /sbin/kexec support
 signed kernel images.  What will probably happen is to have a specially
 trusted userspace application perform the verification.  Sort of like
 dom0 for the linux userspace.  A few other ideas have been batted around
 but none that have stuck.

[ CC David Howells ]

Eric,

In a private conversation, David Howells suggested why not pass kernel
signature in a segment to kernel and kernel can do the verification.

/sbin/kexec signature is verified by kernel at exec() time. Then
/sbin/kexec just passes one signature segment (after regular segment) for
each segment being loaded. The segments which don't have signature,
are passed with section size 0. And signature passing behavior can be
controlled by one new kexec flag.

That way /sbin/kexec does not have to worry about doing any verification
by itself. In fact, I am not sure how it can do the verification when
crypto libraries it will need are not signed (assuming they are not
statically linked in).

What do you think about this idea?

Thanks
Vivek
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] [PATCH v3 00/11] xen: Initial kexec/kdump implementation

2013-01-11 Thread H. Peter Anvin
On 01/11/2013 12:52 PM, Vivek Goyal wrote:
 
 Eric,
 
 In a private conversation, David Howells suggested why not pass kernel
 signature in a segment to kernel and kernel can do the verification.
 
 /sbin/kexec signature is verified by kernel at exec() time. Then
 /sbin/kexec just passes one signature segment (after regular segment) for
 each segment being loaded. The segments which don't have signature,
 are passed with section size 0. And signature passing behavior can be
 controlled by one new kexec flag.
 
 That way /sbin/kexec does not have to worry about doing any verification
 by itself. In fact, I am not sure how it can do the verification when
 crypto libraries it will need are not signed (assuming they are not
 statically linked in).
 
 What do you think about this idea?
 

A signed /sbin/kexec would realistically have to be statically linked,
at least in the short term; otherwise the libraries and ld.so would need
verification as well.

Now, that *might* very well have some real value -- there are certainly
users out there who would very much want only binaries signed with
specific keys to get run on their system.

-hpa


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] [PATCH v3 00/11] xen: Initial kexec/kdump implementation

2013-01-11 Thread Vivek Goyal
On Fri, Jan 11, 2013 at 01:03:41PM -0800, H. Peter Anvin wrote:
 On 01/11/2013 12:52 PM, Vivek Goyal wrote:
  
  Eric,
  
  In a private conversation, David Howells suggested why not pass kernel
  signature in a segment to kernel and kernel can do the verification.
  
  /sbin/kexec signature is verified by kernel at exec() time. Then
  /sbin/kexec just passes one signature segment (after regular segment) for
  each segment being loaded. The segments which don't have signature,
  are passed with section size 0. And signature passing behavior can be
  controlled by one new kexec flag.
  
  That way /sbin/kexec does not have to worry about doing any verification
  by itself. In fact, I am not sure how it can do the verification when
  crypto libraries it will need are not signed (assuming they are not
  statically linked in).
  
  What do you think about this idea?
  
 
 A signed /sbin/kexec would realistically have to be statically linked,
 at least in the short term; otherwise the libraries and ld.so would need
 verification as well.

Yes. That's the expectation. Sign only statically linked exeutables which
don't do any of dlopen() stuff either.

In fact in the patch, I fail the exec() if signed executable has
interpreter.

Thanks
Vivek
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] [PATCH v3 00/11] xen: Initial kexec/kdump implementation

2013-01-11 Thread H. Peter Anvin

On 01/11/2013 01:08 PM, Vivek Goyal wrote:


A signed /sbin/kexec would realistically have to be statically linked,
at least in the short term; otherwise the libraries and ld.so would need
verification as well.


Yes. That's the expectation. Sign only statically linked exeutables which
don't do any of dlopen() stuff either.

In fact in the patch, I fail the exec() if signed executable has
interpreter.



As I said, though (and possibly not for kexec, that depends): in the 
long term we probably want a way to be able to sign all kinds binaries 
in the system.


-hpa


--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFCv2 00/12] Introduce host-side virtio queue and CAIF Virtio.

2013-01-11 Thread Rusty Russell
Sjur Brændeland sjurb...@gmail.com writes:
 How do you see the in-kernel API for this? I would like to see
 something similar to my previous patches, where we extend
 the virtqueue API. E.g. something like this:

 struct virtqueue *vring_new_virtqueueh(unsigned int index,
  unsigned int num,
  unsigned int vring_align,
  struct virtio_device *vdev,
  bool weak_barriers,
  void *pages,
  void (*notify)(struct virtqueue *),
  void (*callback)(struct virtqueue *),
  const char *name);

I was just going to create _kernel variants of all the _user helpers,
and let you drive it directly like that.

If we get a second in-kernel user, we create wrappers (I'd prefer not to
overload struct virtqueue though).

Cheers,
Rusty.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [RFCv2 00/12] Introduce host-side virtio queue and CAIF Virtio.

2013-01-11 Thread Rusty Russell
Michael S. Tsirkin m...@redhat.com writes:
 On Fri, Jan 11, 2013 at 09:18:33AM +1030, Rusty Russell wrote:
 Michael S. Tsirkin m...@redhat.com writes:
  On Thu, Jan 10, 2013 at 09:00:55PM +1030, Rusty Russell wrote:
  Not sure why vhost/net doesn't built a packet and feed it in
  netif_rx_ni().  This is what tun seems to do, and with this code it
  should be fairly optimal.
 
  Because we want to use NAPI.
 
 Not quite what I was asking; it was more a question of why we're using a
 raw socket, when we trivially have a complete skb already which we
 should be able to feed to Linux like any network packet.

 Oh for some reason I thought you were talking about virtio.
 I don't really understand what you are saying here - vhost
 actually calls out to tun to build and submit the skb.

Ah, the fd is tun?  Seems a bit indirect; I wonder if there's room for
more optimization here...

Cheers,
Rusty.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization