Re: Interrupt Assignment on host

2009-11-08 Thread Avi Kivity

On 11/05/2009 09:09 PM, Erik Rull wrote:

Can you describe how performance suffers?

Please provide vmstat (for host int/sec) and kvm_stat  -l -f 
'exits|irq_exits' output.





I'm sorry, but my target system has no python installed. Any ideas how 
to do that manually?


Try the attached bash script.  You'll want debugfs mounted on 
/sys/kernel/debug.


--
error compiling committee.c: too many arguments to function

#!/bin/bash

p=/sys/kernel/debug/kvm

old_exits=0
old_irq_exits=0

while :; do
exits=$(cat $p/exits)
irq_exits=$(cat $p/irq_exits)
d_exits=$((exits - old_exits))
d_irq_exits=$((irq_exits - old_irq_exits))
printf %10d %10d\n $d_exits $d_irq_exits
sleep 1
old_exits=$exits
old_irq_exits=$irq_exits
done

Re: CPU change causes hanging of .NET apps

2009-11-08 Thread Avi Kivity

On 11/05/2009 09:03 PM, Erik Rull wrote:

qemu -cpu host,-flag




I tried that value, result is an error:
Unable to find x86 CPU definition


Please use qemu-kvm.git for this test.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: soft lockup after live migration

2009-11-08 Thread Tomasz Chmielewski

Marcelo Tosatti wrote:

Tomasz,

The screenshots seem to indicate a paravirt mmu problem.

Try to patch the x86.c file from kvm kernel module with:

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c



-   case KVM_CAP_PV_MMU:
-   r = !tdp_enabled;
+   case KVM_CAP_PV_MMU:/* obsolete */
+   r = 0;




You'll probaby have to do it manually (this disables pvmmu).


With this, some guests fail to start with kernel panic; some have soft 
lockups all the time. Some don't start at all.


And generally, everything is dead slow.


--
Tomasz Chmielewski
http://wpkg.org
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 02/11] Add handle page fault PV helper.

2009-11-08 Thread Ingo Molnar

* Gleb Natapov g...@redhat.com wrote:

 On Mon, Nov 02, 2009 at 05:29:41PM +0100, Ingo Molnar wrote:
  
  * Gleb Natapov g...@redhat.com wrote:
  
   On Mon, Nov 02, 2009 at 05:12:48PM +0100, Ingo Molnar wrote:

* Gleb Natapov g...@redhat.com wrote:

 On Mon, Nov 02, 2009 at 10:22:14AM +0100, Ingo Molnar wrote:
  
  * Gleb Natapov g...@redhat.com wrote:
  
   diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
   index f4cee90..14707dc 100644
   --- a/arch/x86/mm/fault.c
   +++ b/arch/x86/mm/fault.c
   @@ -952,6 +952,9 @@ do_page_fault(struct pt_regs *regs, unsigned 
   long error_code)
 int write;
 int fault;

   + if (arch_handle_page_fault(regs, error_code))
   + return;
   +
  
  This patch is not acceptable unless it's done cleaner. Currently we 
  already have 3 callbacks in do_page_fault() (kmemcheck, mmiotrace, 
  notifier), and this adds a fourth one. Please consolidate them into 
  a 
  single callback site, this is a hotpath on x86.
  
 This call is patched out by paravirt patching mechanism so overhead 
 should be zero for non paravirt cases. [...]

arch_handle_page_fault() isnt upstream yet - precisely what is the 
instruction sequence injected into do_page_fault() in the patched-out 
case?
   
   It is introduced by the same patch. The instruction inserted is:
xor %rax, %rax
  
  ok.
  
  My observations still stand:
  
 [...] What do you want to achieve by consolidate them into single 
 callback? [...]

Less bloat in a hotpath and a shared callback infrastructure.

 [...] I mean the code will still exist and will have to be executed 
 on 
 every #PF. Is the goal to move them out of line?

The goal is to have a single callback site for all the users - which 
call-site is patched out ideally - on non-paravirt too if needed. Most 
of these callbacks/notifier-chains have are inactive most of the time.

I.e. a very low overhead 'conditional callback' facility, and a single 
one - not just lots of them sprinkled around the code.
  
  looks like a golden opportunity to get this right.
 
 Three existing callbacks are: kmemcheck, mmiotrace, notifier. Two
 of them kmemcheck, mmiotrace are enabled only for debugging, should
 not be performance concern. And notifier call sites (two of them)
 are deliberately, as explained by comment, not at the function entry,
 so can't be unified with others. (And kmemcheck also has two different
 call site BTW)

We want mmiotrace to be generic distro capable so the overhead when the 
hook is not used is of concern.

Ingo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv8 3/3] vhost_net: a kernel-level virtio server

2009-11-08 Thread Michael S. Tsirkin
On Fri, Nov 06, 2009 at 03:29:17PM +1030, Rusty Russell wrote:
 On Thu, 5 Nov 2009 02:27:24 am Michael S. Tsirkin wrote:
  What it is: vhost net is a character device that can be used to reduce
  the number of system calls involved in virtio networking.
 
 Hi Michael,
 
Now everyone else has finally kicked all the tires and it seems to pass,
 I've done a fairly complete review.  Generally, it's really nice; just one
 bug and a few minor suggestions for polishing.

Thanks for the review! I'll add more polishing and repost.
Answers to some questions below.

  +/* Caller must have TX VQ lock */
  +static void tx_poll_stop(struct vhost_net *net)
  +{
  +   if (likely(net-tx_poll_state != VHOST_NET_POLL_STARTED))
  +   return;
 
 likely?  Really?

Hmm ... yes. tx poll stop is called on each packet (as long as we do not
fill up 1/2 backend queue), the first call will stop polling
the rest checks state and does nothing.

This is because we normally do not care when the message has left the
queue in backend device: we tell backend to send it and forget. We only
start polling when backend tx queue fills up.

Makes sense?

  +   for (;;) {
  +   head = vhost_get_vq_desc(net-dev, vq, vq-iov, out, in,
  +NULL, NULL);
 
 Danger!  You need an arg to vhost_get_vq_desc to tell it the max desc size
 you can handle.  Otherwise, it's only limited by ring size, and a malicious
 guest can overflow you here, and below:

In fact, I think this is not a bug.  This happens to work correctly
(even with malicious guests) because vhost_get_vq_desc is hard-coded to
check VHOST_NET_MAX_SG, so in fact no overflow is possible.  I agree
that it's mich nicer to pass iovec size to vhost_get_vq_desc.

 
  +   /* Skip header. TODO: support TSO. */
  +   s = move_iovec_hdr(vq-iov, vq-hdr, hdr_size, out);
 ...
  +
  +   use_mm(net-dev.mm);
  +   mutex_lock(vq-mutex);
  +   vhost_no_notify(vq);
 
 I prefer a name like vhost_disable_notify().

Good idea.

  +   /* OK, now we need to know about added descriptors. */
  +   if (head == vq-num  vhost_notify(vq))
  +   /* They could have slipped one in as we were doing that:
  +* check again. */
  +   continue;
  +   /* Nothing new?  Wait for eventfd to tell us they refilled. */
  +   if (head == vq-num)
  +   break;
  +   /* We don't need to be notified again. */
  +   vhost_no_notify(vq);
 
 Similarly, vhost_enable_notify.  This one is particularly misleading since
 it doesn't actually notify anything!

Good idea.


 
 In particular, this code would be neater as:
 
   if (head == vq-num) {
   if (vhost_enable_notify(vq)) {
   /* Try again, they could have slipped one in. */
   continue;
   }
   /* Nothing more to do. */
   break;
   }
   vhost_disable_notify(vq);
 
 Now, AFAICT vhost_notify()/enable_notify() would be better rewritten to
 return true only when there's more pending.  Saves a loop around here most
 of the time.

OKay, I'll look into this. It kind of annoys me that we would do
get_user for the same value twice: once in vhost_enable_notify and once
in vhost_get_vq_desc.  OTOH, all the loop does is call vhost_get_vq_desc
again.

  Also, the vhost_no_notify/vhost_disable_notify() can be moved
 out of the loop entirely.

I don't think it can, if we enabled notification and then see more
descriptors in queue, we want to disable notification again. But it can
be
  (It could be under an if (unlikely(enabled)), not
 sure if it's worth it).

if (unlikely(vhost_enable_notify(vq))) {
/* Try again, they have slipped one in. */
vhost_disable_notify(vq);
continue;
}

 
  +   len = err;
  +   err = memcpy_toiovec(vq-hdr, (unsigned char *)hdr, hdr_size);
 
 That unsigned char * arg to memcpy_toiovec is annoying.  A patch might be
 nice, separate from this effort.

Sounds good.

  +static int vhost_net_open(struct inode *inode, struct file *f)
  +{
  +   struct vhost_net *n = kzalloc(sizeof *n, GFP_KERNEL);
  +   int r;
  +   if (!n)
  +   return -ENOMEM;
  +   f-private_data = n;
  +   n-vqs[VHOST_NET_VQ_TX].handle_kick = handle_tx_kick;
  +   n-vqs[VHOST_NET_VQ_RX].handle_kick = handle_rx_kick;
 
 I have a personal dislike of calloc for structures.

You mean zalloc?

 In userspace, it's because valgrind can't spot uninitialized fields.
 These days a similar argument applies in the kernel, because we have
 KMEMCHECK now.  If someone adds a field to the struct and forgets to
 initialize it, we can spot it.

OK.

  +static void vhost_net_enable_vq(struct vhost_net *n, int index)
  +{
  +   struct socket *sock = n-vqs[index].private_data;
 
 OK, I can't help but this that presenting the vqs as an 

Re: xp guest, blue screen c0000221 on boot

2009-11-08 Thread Avi Kivity

On 11/04/2009 07:06 PM, Andrew Olney wrote:

Host is ext3

Linux monkamu 2.6.31-14-generic #48-Ubuntu SMP Fri Oct 16 14:04:26 UTC 
2009 i686 GNU/Linux


QEMU PC emulator version 0.11.0 (qemu-kvm-0.11.0), Copyright (c) 
2003-2008 Fabrice Bellard





If you convert a qcow image back to raw on ext3, does it fail?

Similarly, converting it into an lvm volume, works or not?

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 02/11] Add handle page fault PV helper.

2009-11-08 Thread Avi Kivity

On 11/08/2009 01:36 PM, Ingo Molnar wrote:

Three existing callbacks are: kmemcheck, mmiotrace, notifier. Two
of them kmemcheck, mmiotrace are enabled only for debugging, should
not be performance concern. And notifier call sites (two of them)
are deliberately, as explained by comment, not at the function entry,
so can't be unified with others. (And kmemcheck also has two different
call site BTW)
 

We want mmiotrace to be generic distro capable so the overhead when the
hook is not used is of concern.

   


Maybe we should generalize paravirt-ops patching in case if (x) f() is 
deemed too expensive.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Recording interrupted instruction on ioapic_set_irq

2009-11-08 Thread Avi Kivity

On 11/07/2009 10:13 PM, Kurt Kiefer wrote:

Hi guys,

I'm trying to record the instruction pointer at the exact time a guest 
was delivered an IOAPIC interrupt. Take for example a PS2 keyboard 
press. Clearly, when I read IP during the subsequent exit for 
IO_INSTRUCTION I'm just recording the IP of io_read in the handler, 
and not the IP at actual interrupt delivery.


Maybe I'm missing something fundamental. It doesn't look like exits 
for EXTERNAL_INTERRUPT (shouldn't it?) or INTERRUPT_WINDOW correspond 
one-to-one with delivery of these PS2 interrupts.


Just setting request_interrupt_window for these IRQs didn't give me an 
INTERRUPT_WINDOW for each key. I guess since the guest doesn't usually 
have interrupts masked when I press a key means delivery won't wait 
for the window.


Could I record during delivery? I figure I could look at the stack 
during the IO_INSTRUCTION exit and figure out what instruction was 
actually interrupted, but this would be a Linux-specific solution. Any 
other ideas? I think even a simple description of how these interrupts 
are being delivered to the guest would help me out a lot.




This is all available now in 2.6.32 or later with the new trace 
infrastructure.  If you enable ftrace and echo kvm  
/sys/kernel/tracing/set_event, you should get a trace of all interrupt 
injections.  Of course, you need to figure out which vector is 
associated with your irq; you can even have the trace infrastructure 
filter this for you.


Note you can only get the ip at the time the interrupt is delivered, 
rather than the time the irq is asserted on the ioapic pin.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 02/11] Add handle page fault PV helper.

2009-11-08 Thread Ingo Molnar

* Avi Kivity a...@redhat.com wrote:

 On 11/08/2009 01:36 PM, Ingo Molnar wrote:
 Three existing callbacks are: kmemcheck, mmiotrace, notifier. Two
 of them kmemcheck, mmiotrace are enabled only for debugging, should
 not be performance concern. And notifier call sites (two of them)
 are deliberately, as explained by comment, not at the function entry,
 so can't be unified with others. (And kmemcheck also has two different
 call site BTW)
 
  We want mmiotrace to be generic distro capable so the overhead when 
  the hook is not used is of concern.
 
 Maybe we should generalize paravirt-ops patching in case if (x) f() is 
 deemed too expensive.

Yes, that's a nice idea. We have quite a number of 'conditional 
callbacks' in various critical paths that could be made lighter via such 
a technique.

It would also free new callbacks from the 'it increases overhead even if 
unused' criticism and made it easier to add them.

Ingo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 02/11] Add handle page fault PV helper.

2009-11-08 Thread Avi Kivity

On 11/08/2009 02:51 PM, Ingo Molnar wrote:

Maybe we should generalize paravirt-ops patching in case if (x) f() is
deemed too expensive.
 

Yes, that's a nice idea. We have quite a number of 'conditional
callbacks' in various critical paths that could be made lighter via such
a technique.

It would also free new callbacks from the 'it increases overhead even if
unused' criticism and made it easier to add them.
   


We can take the immediate values infrastructure as a first step.  Has 
that been merged?


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 02/11] Add handle page fault PV helper.

2009-11-08 Thread Ingo Molnar

* Avi Kivity a...@redhat.com wrote:

 On 11/08/2009 02:51 PM, Ingo Molnar wrote:
 Maybe we should generalize paravirt-ops patching in case if (x) f() is
 deemed too expensive.
 Yes, that's a nice idea. We have quite a number of 'conditional
 callbacks' in various critical paths that could be made lighter via such
 a technique.
 
  It would also free new callbacks from the 'it increases overhead 
  even if unused' criticism and made it easier to add them.
 
 We can take the immediate values infrastructure as a first step. Has 
 that been merged?

No, there were doubts about whether patching in live instructions like 
that is safe on all CPU types.

Ingo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] qemu-kvm: clear only essential parts of VirtIOBlockReq on object allocation

2009-11-08 Thread Avi Kivity

On 11/04/2009 07:41 PM, Saul Tamari wrote:

This patch reduces the size of memory being cleared on every virtio-blk IO.

On every virtio-blk IO passed to QEMU, virtio_blk_alloc_request()
allocates and clears (with qemu_mallocz()) a VirtIOBlockReq object.
The sizeof(VirtIOBlockReq) equals 41040 bytes on my x86-64 machine.
By moving the 'elem' variable to the end of VirtIOBlockReq and
clearing only upto the address of the 'elem.in_addr' field, the
memset() call now clears only 80 bytes.


Signed-off-by: Saul Tamarist...@gmail.com
   


Use the full email in signoffs.


@@ -79,12 +79,12 @@ static inline void virtio_identify_template(struct
virtio_blk_config *bc)
  typedef struct VirtIOBlockReq
  {
  VirtIOBlock *dev;
-VirtQueueElement elem;
  struct virtio_blk_inhdr *in;
  struct virtio_blk_outhdr *out;
  struct virtio_scsi_inhdr *scsi;
  QEMUIOVector qiov;
  struct VirtIOBlockReq *next;
+VirtQueueElement elem;
  } VirtIOBlockReq;
   


Needs a comment to indicate new members must be before 'elem' if they 
need to be cleared.



  static void virtio_blk_req_complete(VirtIOBlockReq *req, int status)
@@ -139,7 +139,8 @@ static void virtio_blk_flush_complete(void *opaque, int ret)

  static VirtIOBlockReq *virtio_blk_alloc_request(VirtIOBlock *s)
  {
-VirtIOBlockReq *req = qemu_mallocz(sizeof(*req));
+VirtIOBlockReq *req = qemu_malloc(sizeof(*req));
+memset(req, 0, (size_t)(((VirtIOBlockReq*)0)-elem.in_addr[0]));
   


Please use offsetof() instead of open-coding it.

Submit to qemu-de...@nongnu.org since this came from upstream.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 02/11] Add handle page fault PV helper.

2009-11-08 Thread Avi Kivity

On 11/08/2009 03:05 PM, Ingo Molnar wrote:

We can take the immediate values infrastructure as a first step. Has
that been merged?
 

No, there were doubts about whether patching in live instructions like
that is safe on all CPU types.
   


Ah, I remember.  Doesn't the trick of going through a breakpoint 
instruction work?


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 02/11] Add handle page fault PV helper.

2009-11-08 Thread H. Peter Anvin
On 11/08/2009 04:51 AM, Ingo Molnar wrote:
 
 * Avi Kivity a...@redhat.com wrote:
 
 On 11/08/2009 01:36 PM, Ingo Molnar wrote:
 Three existing callbacks are: kmemcheck, mmiotrace, notifier. Two
 of them kmemcheck, mmiotrace are enabled only for debugging, should
 not be performance concern. And notifier call sites (two of them)
 are deliberately, as explained by comment, not at the function entry,
 so can't be unified with others. (And kmemcheck also has two different
 call site BTW)

 We want mmiotrace to be generic distro capable so the overhead when 
 the hook is not used is of concern.

 Maybe we should generalize paravirt-ops patching in case if (x) f() is 
 deemed too expensive.
 
 Yes, that's a nice idea. We have quite a number of 'conditional 
 callbacks' in various critical paths that could be made lighter via such 
 a technique.
 
 It would also free new callbacks from the 'it increases overhead even if 
 unused' criticism and made it easier to add them.
 

There are a number of other things were we permanently bind to a single
instance of something, too.  Optimizing those away would be nice.
Consider memcpy(), where we may want to have different implementations
for different processors.

-hpa


-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 02/11] Add handle page fault PV helper.

2009-11-08 Thread Ingo Molnar

* H. Peter Anvin h...@zytor.com wrote:

 On 11/08/2009 04:51 AM, Ingo Molnar wrote:
  
  * Avi Kivity a...@redhat.com wrote:
  
  On 11/08/2009 01:36 PM, Ingo Molnar wrote:
  Three existing callbacks are: kmemcheck, mmiotrace, notifier. Two
  of them kmemcheck, mmiotrace are enabled only for debugging, should
  not be performance concern. And notifier call sites (two of them)
  are deliberately, as explained by comment, not at the function entry,
  so can't be unified with others. (And kmemcheck also has two different
  call site BTW)
 
  We want mmiotrace to be generic distro capable so the overhead when 
  the hook is not used is of concern.
 
  Maybe we should generalize paravirt-ops patching in case if (x) f() is 
  deemed too expensive.
  
  Yes, that's a nice idea. We have quite a number of 'conditional 
  callbacks' in various critical paths that could be made lighter via such 
  a technique.
  
  It would also free new callbacks from the 'it increases overhead 
  even if unused' criticism and made it easier to add them.
 
 There are a number of other things were we permanently bind to a 
 single instance of something, too.  Optimizing those away would be 
 nice. Consider memcpy(), where we may want to have different 
 implementations for different processors.

yeah.

Ingo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Doubt on KVM-88 vulnerabilities

2009-11-08 Thread Daniel Bareiro

Hi all!

I'm using KVM-88 compiled by myself from the source code provided by the
official site of the project.

Is this version of KVM vulnerable to the mentioned thing in the
DSA-1907-1 [1]? In such case, there is some published patch that can be
applied or some new version that solves this?

Thanks in advance for your reply.

Regards,
Daniel

[1] http://lists.debian.org/debian-security-announce/2009/msg00229.html
-- 
Fingerprint: BFB3 08D6 B4D1 31B2 72B9  29CE 6696 BF1B 14E6 1D37
Powered by Debian GNU/Linux Squeeze - Linux user #188.598


signature.asc
Description: Digital signature


Re: [PATCHv7 3/3] vhost_net: a kernel-level virtio server

2009-11-08 Thread Paul E. McKenney
On Sun, Nov 08, 2009 at 02:39:59PM +1030, Rusty Russell wrote:
 On Sat, 7 Nov 2009 03:00:07 am Paul E. McKenney wrote:
  On Fri, Nov 06, 2009 at 03:31:20PM +1030, Rusty Russell wrote:
   But it's still nasty to use half an API.  If it were a few places I would
   have open-coded it with a comment, or wrapped it.  As it is, I don't think
   that would be a win.
  
  So would it help to have a rcu_read_lock_workqueue() and
  rcu_read_unlock_workqueue() that checked nesting and whether they were
  actually running in the context of a workqueue item?  Or did you have
  something else in mind?  Or am I misjudging the level of sarcasm in
  your reply?  ;-)
 
 You read correctly.  If we get a second user, creating an API makes sense.

Makes sense to me as well.  Which does provide some time to come up with
a primitive designed to answer the question Am I currently executing in
the context of a workqueue item?.  ;-)

Thanx, Paul
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Reserve CPU cores for specific guests?

2009-11-08 Thread Neil Aggarwal
Hello:

I don't think there is a way to do this with KVM, but
I figured I would ask:

I want to be able to offer virtual private servers (VPSs)
to clients.  I am going to use KVM for it.

I would like to offer clients the option to buy either:
1. A VPS which allows CPUs to be overcommitted.
2. A VPS with a dedicated CPU core.

So, for example, if I have a six core opteron, I might
sell:
2 VPSs with a dedicated CPU core
6 VPSs which allow overcommitted CPUs

Since I need one core for the hypervisor, there would
need to be a way to say that it gets a dedicated core
plus the other 2 VPSs get a dedicated core.  That
leaves 3 pooled cores to serve the 6 VPSs that 
are allowed to overcommit.

Is there a way to set up a pooled set of cores
for a given list of VPSs?  

I think I may have to use separate physical machine
for the VPSs with dedicated cores and the ones with
overcommitted ones.

Thanks,
Neil

--
Neil Aggarwal, (281)846-8957, http://www.JAMMConsulting.com
CentOS 5.4 KVM VPS $55/mo, no setup fee, no contract, dedicated 64bit CPU
1GB dedicated RAM, 40GB RAID storage, 500GB/mo premium BW, Zero downtime

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Reserve CPU cores for specific guests?

2009-11-08 Thread Zdenek Kaspar
Neil Aggarwal napsal(a):
 Hello:
 
 I don't think there is a way to do this with KVM, but
 I figured I would ask:
 
 I want to be able to offer virtual private servers (VPSs)
 to clients.  I am going to use KVM for it.
 
 I would like to offer clients the option to buy either:
 1. A VPS which allows CPUs to be overcommitted.
 2. A VPS with a dedicated CPU core.
 
 So, for example, if I have a six core opteron, I might
 sell:
 2 VPSs with a dedicated CPU core
 6 VPSs which allow overcommitted CPUs
 
 Since I need one core for the hypervisor, there would
 need to be a way to say that it gets a dedicated core
 plus the other 2 VPSs get a dedicated core.  That
 leaves 3 pooled cores to serve the 6 VPSs that 
 are allowed to overcommit.
 
 Is there a way to set up a pooled set of cores
 for a given list of VPSs?  
 
 I think I may have to use separate physical machine
 for the VPSs with dedicated cores and the ones with
 overcommitted ones.
 
 Thanks,
   Neil
 
 --
 Neil Aggarwal, (281)846-8957, http://www.JAMMConsulting.com
 CentOS 5.4 KVM VPS $55/mo, no setup fee, no contract, dedicated 64bit CPU
 1GB dedicated RAM, 40GB RAID storage, 500GB/mo premium BW, Zero downtime
 
 --
 To unsubscribe from this list: send the line unsubscribe kvm in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

I think you can achieve that on some simple level DIY with taskset from
util-linux(-ng).

HTH, Z.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[ kvm-Bugs-2894352 ] Ctrl+Alt+f

2009-11-08 Thread SourceForge.net
Bugs item #2894352, was opened at 2009-11-09 03:03
Message generated for change (Tracker Item Submitted) made by pawelwalaszek
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detailatid=893831aid=2894352group_id=180599

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: qemu
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: pjw (pawelwalaszek)
Assigned to: Nobody/Anonymous (nobody)
Summary: Ctrl+Alt+f

Initial Comment:
Ctrl+Alt+f doesn't work in qemu-kvm 0.11.0. I have tested it on Ubuntu 9.10. 
Provisionally there is a workaround. To disable a full screen is Ctrl+Alt+f  
Ctrl+Alt+1 and to enable is Ctrl+Alt+1  Ctrl+Alt+f.

--

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detailatid=893831aid=2894352group_id=180599
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[ kvm-Bugs-2894352 ] Ctrl+Alt+f doesn't work

2009-11-08 Thread SourceForge.net
Bugs item #2894352, was opened at 2009-11-09 03:03
Message generated for change (Settings changed) made by pawelwalaszek
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detailatid=893831aid=2894352group_id=180599

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: qemu
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: pjw (pawelwalaszek)
Assigned to: Nobody/Anonymous (nobody)
Summary: Ctrl+Alt+f doesn't work

Initial Comment:
Ctrl+Alt+f doesn't work in qemu-kvm 0.11.0. I have tested it on Ubuntu 9.10. 
Provisionally there is a workaround. To disable a full screen is Ctrl+Alt+f  
Ctrl+Alt+1 and to enable is Ctrl+Alt+1  Ctrl+Alt+f.

--

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detailatid=893831aid=2894352group_id=180599
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: Reserve CPU cores for specific guests?

2009-11-08 Thread Neil Aggarwal
 I think you can achieve that on some simple level DIY with 
 taskset from
 util-linux(-ng).

That is a good utility to know.  I did not know about that
earlier.  Thanks for the info.

I am wondering one thing though:

I will either need to call taskset when executing the
process or run taskset on a PID after it starts up.

Unless there is a way to tell KVM to call taskset when starting
a guest, I think that is going to be hard to automate since the
guests will get different PID each time they are started.

Any suggestions?

Thanks,
Neil

--
Neil Aggarwal, (281)846-8957, http://www.JAMMConsulting.com
CentOS 5.4 KVM VPS $55/mo, no setup fee, no contract, dedicated 64bit CPU
1GB dedicated RAM, 40GB RAID storage, 500GB/mo premium BW, Zero downtime 

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Reserve CPU cores for specific guests?

2009-11-08 Thread Thomas Fjellstrom
On Sun November 8 2009, Neil Aggarwal wrote:
  I think you can achieve that on some simple level DIY with
  taskset from
  util-linux(-ng).
 
 That is a good utility to know.  I did not know about that
 earlier.  Thanks for the info.
 
 I am wondering one thing though:
 
 I will either need to call taskset when executing the
 process or run taskset on a PID after it starts up.
 
 Unless there is a way to tell KVM to call taskset when starting
 a guest, I think that is going to be hard to automate since the
 guests will get different PID each time they are started.
 
 Any suggestions?


None directly related, but libvirt's kvm support supports pinning a vm to a 
physical cpu. At least it has the option in virt-manager.

 Thanks,
   Neil
 
 --
 Neil Aggarwal, (281)846-8957, http://www.JAMMConsulting.com
 CentOS 5.4 KVM VPS $55/mo, no setup fee, no contract, dedicated 64bit CPU
 1GB dedicated RAM, 40GB RAID storage, 500GB/mo premium BW, Zero downtime
 
 --
 To unsubscribe from this list: send the line unsubscribe kvm in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 


-- 
Thomas Fjellstrom
tfjellst...@shaw.ca
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC] KVM Fault Tolerance: Kemari for KVM

2009-11-08 Thread Fernando Luis Vázquez Cao

Hi all,

It has been a while coming, but we have finally started work on
Kemari's port to KVM. For those not familiar with it, Kemari provides
the basic building block to create a virtualization-based fault
tolerant machine: a virtual machine synchronization mechanism.

Traditional high availability solutions can be classified in two
groups: fault tolerant servers, and software clustering.

Broadly speaking, fault tolerant servers protect us against hardware
failures and, generally, rely on redundant hardware (often
proprietary), and hardware failure detection to trigger fail-over.

On the other hand, software clustering, as its name indicates, takes
care of software failures and usually requires a standby server whose
software configuration for the part we are trying to make fault
tolerant must be identical to that of the active server.

Both solutions may be applied to virtualized environments. Indeed,
the current incarnation of Kemari (Xen-based) brings fault tolerant
server-like capabilities to virtual machines and integration with
existing HA stacks (Heartbeat, RHCS, etc) is under consideration.

After some time in the drawing board we completed the basic design of
Kemari for KVM, so we are sending an RFC at this point to get early
feedback and, hopefully, get things right from the start. Those
already familiar with Kemari and/or fault tolerance may want to skip
the Background and go directly to the design and implementation
bits.

This is a pretty long write-up, but please bear with me.

== Background ==

We started to play around with continuous virtual synchronization
technology about 3 years ago. As development progressed and, most
importantly, we got the first Xen-based working prototypes it became
clear that we needed a proper name for our toy: Kemari.

The goal of Kemari is to provide a fault tolerant platform for
virtualization environments, so that in the event of a hardware
failure the virtual machine fails over from compromised to properly
operating hardware (a physical machine) in a way that is completely
transparent to the guest operating system.

Although hardware based fault tolerant servers and HA servers
(software clustering) have been around for a (long) while, they
typically require specifically designed hardware and/or modifications
to applications. In contrast, by abstracting hardware using
virtualization, Kemari can be used on off-the-shelf hardware and no
application modifications are needed.

After a period of in-house development the first version of Kemari for
Xen was released in Nov 2008 as open source. However, by then it was
already pretty clear that a KVM port would have several
advantages. First, KVM is integrated into the Linux kernel, which
means one gets support for a wide variety of hardware for
free. Second, and in the same vein, KVM can also benefit from Linux'
low latency networking capabilities including RDMA, which is of
paramount importance for a extremely latency-sensitive functionality
like Kemari. Last, but not the least, KVM and its community is growing
rapidly, and there is increasing demand for Kemari-like functionality
for KVM.

Although the basic design principles will remain the same, our plan is
to write Kemari for KVM from scratch, since there does not seem to be
much opportunity for sharing between Xen and KVM.

== Design outline ==

The basic premise of fault tolerant servers is that when things go
awry with the hardware the running system should transparently
continue execution on an alternate physical host. For this to be
possible the state of the fallback host has to be identical to that of
the primary.

Kemari runs paired virtual machines in an active-passive configuration
and achieves whole-system replication by continuously copying the
state of the system (dirty pages and the state of the virtual devices)
from the active node to the passive node. An interesting implication
of this is that during normal operation only the active node is
actually executing code.

Another possible approach is to run a pair of systems in lock-step
(à la VMware FT). Since both the primary and fallback virtual machines
are active keeping them synchronized is a complex task, which usually
involves carefully injecting external events into both virtual
machines so that they result in identical states.

The latter approach is extremely architecture specific and not SMP
friendly. This spurred us to try the design that became Kemari, which
we believe lends itself to further optimizations.

== Implementation ==

The first step is to encapsulate the machine to be protected within a
virtual machine. Then the live migration functionality is leveraged to
keep the virtual machines synchronized.

Whereas during live migration dirty pages can be sent asynchronously
from the primary to the fallback server until the ratio of dirty pages
is low enough to guarantee very short downtimes, when it comes to
fault tolerance solutions whenever a synchronization point is reached
changes 

RE: Reserve CPU cores for specific guests?

2009-11-08 Thread Neil Aggarwal
 None directly related, but libvirt's kvm support supports 
 pinning a vm to a 
 physical cpu. At least it has the option in virt-manager.

That is exactly what I needed.
My KVM host does not have a GUI so I have been using
virsh.  I did not notice that option before.

Thank you,
Neil

--
Neil Aggarwal, (281)846-8957, http://www.JAMMConsulting.com
CentOS 5.4 KVM VPS $55/mo, no setup fee, no contract, dedicated 64bit CPU
1GB dedicated RAM, 40GB RAID storage, 500GB/mo premium BW, Zero downtime 

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[ kvm-Bugs-2847560 ] Guest hang with exhausted IRQ sources error if 8 VFs assigne

2009-11-08 Thread SourceForge.net
Bugs item #2847560, was opened at 2009-08-30 22:58
Message generated for change (Settings changed) made by jiajun
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detailatid=893831aid=2847560group_id=180599

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: kernel
Group: None
Status: Closed
Resolution: Fixed
Priority: 5
Private: No
Submitted By: Jiajun Xu (jiajun)
Assigned to: Nobody/Anonymous (nobody)
Summary: Guest hang with exhausted IRQ sources error if 8 VFs assigne

Initial Comment:
Environment:

Host OS (ia32/ia32e/IA64): ia32e
Guest OS (ia32/ia32e/IA64): ia32e 
Guest OS Type (Linux/Windows): Linux
kvm.git Commit: 779cc54dbccaa3a00d70a9d61d090be5d9ccc903
qemu-kvm Commit: 9e3269181e9bc56feb43bcd4e8ce0b82cd543e65
Host Kernel Version: 2.6.31-rc5
Hardware: NHM-EP


Bug detailed description:
--
If boot a guest with 8 VFs assigned, guest will hang at stage of Starting
udev:. And dmesg will shows that kvm: exhaust allocatable IRQ sources!

If boot a guest with 6 VFs assigned, guest can boot up and VF can work well in
guest.

Dmesg info:

pci-stub :01:10.0: enabling device ( - 0002)
assign device: host bdf = 1:10:0
pci-stub :01:10.1: enabling device ( - 0002)
assign device: host bdf = 1:10:1
pci-stub :01:10.2: enabling device ( - 0002)
assign device: host bdf = 1:10:2
pci-stub :01:10.3: enabling device ( - 0002)
assign device: host bdf = 1:10:3
pci-stub :01:10.4: enabling device ( - 0002)
assign device: host bdf = 1:10:4
pci-stub :01:10.5: enabling device ( - 0002)
assign device: host bdf = 1:10:5
pci-stub :01:10.6: enabling device ( - 0002)
assign device: host bdf = 1:10:6
pci-stub :01:10.7: enabling device ( - 0002)
assign device: host bdf = 1:10:7
pci-stub :01:10.0: irq 91 for MSI/MSI-X
pci-stub :01:10.0: irq 92 for MSI/MSI-X
pci-stub :01:10.0: irq 93 for MSI/MSI-X
pci-stub :01:10.1: irq 97 for MSI/MSI-X
pci-stub :01:10.1: irq 98 for MSI/MSI-X
pci-stub :01:10.1: irq 99 for MSI/MSI-X
pci-stub :01:10.2: irq 100 for MSI/MSI-X
pci-stub :01:10.2: irq 101 for MSI/MSI-X
pci-stub :01:10.2: irq 102 for MSI/MSI-X
pci-stub :01:10.3: irq 103 for MSI/MSI-X
pci-stub :01:10.3: irq 104 for MSI/MSI-X
pci-stub :01:10.3: irq 105 for MSI/MSI-X
pci-stub :01:10.4: irq 106 for MSI/MSI-X
pci-stub :01:10.4: irq 107 for MSI/MSI-X
pci-stub :01:10.4: irq 108 for MSI/MSI-X
pci-stub :01:10.5: irq 112 for MSI/MSI-X
pci-stub :01:10.5: irq 113 for MSI/MSI-X
pci-stub :01:10.5: irq 114 for MSI/MSI-X
pci-stub :01:10.6: irq 115 for MSI/MSI-X
pci-stub :01:10.6: irq 116 for MSI/MSI-X
pci-stub :01:10.6: irq 117 for MSI/MSI-X
kvm: exhaust allocatable IRQ sources!


--

Comment By: Jiajun Xu (jiajun)
Date: 2009-11-08 22:00

Message:
The bug is fixed by commit kvm.git commit,
99c52eb29f6f013b7a9f5c25439091a2fad0fd80:

commit 99c52eb29f6f013b7a9f5c25439091a2fad0fd80
Author: Marcelo Tosatti mtosa...@redhat.com
Date:   Sat Oct 17 22:47:23 2009 -0300

KVM: fix irq_source_id size verification

find_first_zero_bit works with bit numbers, not bytes.

Fixes

https://sourceforge.net/tracker/?func=detailaid=2847560group_id=180599atid=893831

Reported-by: Xu, Jiajun jiajun...@intel.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

--

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detailatid=893831aid=2847560group_id=180599
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv8 3/3] vhost_net: a kernel-level virtio server

2009-11-08 Thread Rusty Russell
On Sun, 8 Nov 2009 10:05:16 pm Michael S. Tsirkin wrote:
 On Fri, Nov 06, 2009 at 03:29:17PM +1030, Rusty Russell wrote:
   +/* Caller must have TX VQ lock */
   +static void tx_poll_stop(struct vhost_net *net)
   +{
   + if (likely(net-tx_poll_state != VHOST_NET_POLL_STARTED))
   + return;
  
  likely?  Really?
 
 Hmm ... yes. tx poll stop is called on each packet (as long as we do not
 fill up 1/2 backend queue), the first call will stop polling
 the rest checks state and does nothing.
 
 This is because we normally do not care when the message has left the
 queue in backend device: we tell backend to send it and forget. We only
 start polling when backend tx queue fills up.

OK, good.

   +static void vhost_net_set_features(struct vhost_net *n, u64 features)
   +{
   + size_t hdr_size = features  (1  VHOST_NET_F_VIRTIO_NET_HDR) ?
   + sizeof(struct virtio_net_hdr) : 0;
   + int i;
   + mutex_lock(n-dev.mutex);
   + n-dev.acked_features = features;
  
  Why is this called acked_features?  Not just features?  I expected
  to see code which exposed these back to userspace, and didn't.
 
 Not sure how do you mean. Userspace sets them, why
 does it want to get them exposed back?

There's something about the 'acked' which rubs me the wrong way.
enabled_features is perhaps a better term than acked_features; acked
seems more a user point-of-view, enabled seems more driver POV?

set_features matches your ioctl names, but it sounds like a fn name :(

It's marginal.  And 'features' is shorter than both.

   + switch (ioctl) {
   + case VHOST_SET_VRING_NUM:
  
  I haven't looked at your userspace implementation, but does a generic
  VHOST_SET_VRING_STATE  VHOST_GET_VRING_STATE with a struct make more
  sense?  It'd be simpler here,
 
 Not by much though, right?
 
  but not sure if it'd be simpler to use?
 
 The problem is with VHOST_SET_VRING_BASE as well. I want it to be
 separate because I want to make it possible to relocate e.g. used ring
 to another address while ring is running. This would be a good debugging
 tool (you look at kernel's used ring, check descriptor, then update
 guest's used ring) and also possibly an extra way to do migration.  And
 it's nicer to have vring size separate as well, because it is
 initialized by host and never changed, right?

Actually, this looks wrong to me:

+   case VHOST_SET_VRING_BASE:
...
+   vq-avail_idx = vq-last_avail_idx = s.num;

The last_avail_idx is part of the state of the driver.  It needs to be saved
and restored over susp/resume.  The only reason it's not in the ring itself
is because I figured the other side doesn't need to see it (which is true, but
missed debugging opportunities as well as man-in-the-middle issues like this
one).  I had a patch which put this field at the end of the ring, I might
resurrect it to avoid this problem.  This is backwards compatible with all
implementations.  See patch at end.

I would drop avail_idx altogether: get_user is basically free, and simplifies
a lot.  As most state is in the ring, all you need is an ioctl to save/restore
the last_avail_idx.

 We could merge DESC, AVAIL, USED, and it will reduce the amount of code
 in userspace. With both base, size and fds separate, it seemed a bit
 more symmetrical to have desc/avail/used separate as well.
 What's your opinion?

Well, DESC, AVAIL, and USED could easily be turned into SET/GET_LAYOUT.

  For future reference, this is *exactly* the kind of thing which would have
  been nice as a followup patch.  Easy to separate, easy to review, not 
  critical
  to the core.
 
 Yes. It's not too late to split it out though: should I do it yet?

Only if you're feeling enthused.  It's lightly reviewed now.

Cheers,
Rusty.

virtio: put last_used and last_avail index into ring itself.

Generally, the other end of the virtio ring doesn't need to see where
you're up to in consuming the ring.  However, to completely understand
what's going on from the outside, this information must be exposed.
For example, if you want to save and restore a virtio_ring, but you're
not the consumer because the kernel is using it directly.

Fortunately, we have room to expand: the ring is always a whole number
of pages and there's hundreds of bytes of padding after the avail ring
and the used ring, whatever the number of descriptors (which must be a
power of 2).

We add a feature bit so the guest can tell the host that it's writing
out the current value there, if it wants to use that.

Signed-off-by: Rusty Russell ru...@rustcorp.com.au
---
 drivers/virtio/virtio_ring.c |   23 +++
 include/linux/virtio_ring.h  |   12 +++-
 2 files changed, 26 insertions(+), 9 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -71,9 +71,6 @@ struct vring_virtqueue
/* Number we've added since last sync. */
unsigned int num_added;
 
-   /* Last 

Re: [PATCHv8 3/3] vhost_net: a kernel-level virtio server

2009-11-08 Thread Michael S. Tsirkin
On Mon, Nov 9, 2009 at 8:17 AM, Rusty Russell ru...@rustcorp.com.au wrote:
   +static void vhost_net_set_features(struct vhost_net *n, u64 features)
   +{
   + size_t hdr_size = features  (1  VHOST_NET_F_VIRTIO_NET_HDR) ?
   +         sizeof(struct virtio_net_hdr) : 0;
   + int i;
   + mutex_lock(n-dev.mutex);
   + n-dev.acked_features = features;
 
  Why is this called acked_features?  Not just features?  I expected
  to see code which exposed these back to userspace, and didn't.

 Not sure how do you mean. Userspace sets them, why
 does it want to get them exposed back?

 There's something about the 'acked' which rubs me the wrong way.
 enabled_features is perhaps a better term than acked_features; acked
 seems more a user point-of-view, enabled seems more driver POV?


Hmm. Are you happy with the ioctl name? If yes I think being consistent
with that is important.


 set_features matches your ioctl names, but it sounds like a fn name :(

 It's marginal.  And 'features' is shorter than both.

I started with this but I was always getting confused whether this
includes all features or just acked features.  I'll go with
enabled_features.



   + switch (ioctl) {
   + case VHOST_SET_VRING_NUM:
 
  I haven't looked at your userspace implementation, but does a generic
  VHOST_SET_VRING_STATE  VHOST_GET_VRING_STATE with a struct make more
  sense?  It'd be simpler here,

 Not by much though, right?

  but not sure if it'd be simpler to use?

 The problem is with VHOST_SET_VRING_BASE as well. I want it to be
 separate because I want to make it possible to relocate e.g. used ring
 to another address while ring is running. This would be a good debugging
 tool (you look at kernel's used ring, check descriptor, then update
 guest's used ring) and also possibly an extra way to do migration.  And
 it's nicer to have vring size separate as well, because it is
 initialized by host and never changed, right?

 Actually, this looks wrong to me:

 +       case VHOST_SET_VRING_BASE:
 ...
 +               vq-avail_idx = vq-last_avail_idx = s.num;

 The last_avail_idx is part of the state of the driver.  It needs to be saved
 and restored over susp/resume.


Exactly. That's what VHOST_GET/SET_VRING_BASE does.  avail_idx is just a
cached value for notify on empty, so what this does is clear the value.
What exactly do you refer to when you say this looks wrong?
This could trigger an extra notification if I ever called
trigger_irq without get first. As I don't, it in fact has no effect.

  The only reason it's not in the ring itself
 is because I figured the other side doesn't need to see it (which is true, but
 missed debugging opportunities as well as man-in-the-middle issues like this
 one).  I had a patch which put this field at the end of the ring, I might
 resurrect it to avoid this problem.  This is backwards compatible with all
 implementations.  See patch at end.

Yes, I remember that patch. There seems to be little point though, at
this stage.



 I would drop avail_idx altogether: get_user is basically free, and simplifies
 a lot.  As most state is in the ring, all you need is an ioctl to save/restore
 the last_avail_idx.


avail_idx is there for notify on empty: I had this thought that it's
better to leave the avail cache line alone when we are triggering
interrupt to avoid bouncing it around if guest is updating it meanwhile
on another CPU, and I think my testing showed that it helped
performance, but could be a mistake.  You don't believe this can help?




 We could merge DESC, AVAIL, USED, and it will reduce the amount of code
 in userspace. With both base, size and fds separate, it seemed a bit
 more symmetrical to have desc/avail/used separate as well.
 What's your opinion?

 Well, DESC, AVAIL, and USED could easily be turned into SET/GET_LAYOUT.


Will do.


  For future reference, this is *exactly* the kind of thing which would have
  been nice as a followup patch.  Easy to separate, easy to review, not 
  critical
  to the core.

 Yes. It's not too late to split it out though: should I do it yet?

 Only if you're feeling enthused.  It's lightly reviewed now.


Not really :) I'll keep this in mind for the future.
Thanks!




 Cheers,
 Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv8 3/3] vhost_net: a kernel-level virtio server

2009-11-08 Thread Michael S. Tsirkin
On Mon, Nov 09, 2009 at 04:47:29PM +1030, Rusty Russell wrote:
 On Sun, 8 Nov 2009 10:05:16 pm Michael S. Tsirkin wrote:
  On Fri, Nov 06, 2009 at 03:29:17PM +1030, Rusty Russell wrote:
+/* Caller must have TX VQ lock */
+static void tx_poll_stop(struct vhost_net *net)
+{
+   if (likely(net-tx_poll_state != VHOST_NET_POLL_STARTED))
+   return;
   
   likely?  Really?
  
  Hmm ... yes. tx poll stop is called on each packet (as long as we do not
  fill up 1/2 backend queue), the first call will stop polling
  the rest checks state and does nothing.
  
  This is because we normally do not care when the message has left the
  queue in backend device: we tell backend to send it and forget. We only
  start polling when backend tx queue fills up.
 
 OK, good.
 
+static void vhost_net_set_features(struct vhost_net *n, u64 features)
+{
+   size_t hdr_size = features  (1  VHOST_NET_F_VIRTIO_NET_HDR) ?
+   sizeof(struct virtio_net_hdr) : 0;
+   int i;
+   mutex_lock(n-dev.mutex);
+   n-dev.acked_features = features;
   
   Why is this called acked_features?  Not just features?  I expected
   to see code which exposed these back to userspace, and didn't.
  
  Not sure how do you mean. Userspace sets them, why
  does it want to get them exposed back?
 
 There's something about the 'acked' which rubs me the wrong way.
 enabled_features is perhaps a better term than acked_features; acked
 seems more a user point-of-view, enabled seems more driver POV?
 
 set_features matches your ioctl names, but it sounds like a fn name :(

Hmm. Are you happy with the ioctl name? If yes I think being consistent
with that is important.

 It's marginal.  And 'features' is shorter than both.

I started with this but I was always getting confused whether this
includes all features or just acked features.  I'll go with
enabled_features.

+   switch (ioctl) {
+   case VHOST_SET_VRING_NUM:
   
   I haven't looked at your userspace implementation, but does a generic
   VHOST_SET_VRING_STATE  VHOST_GET_VRING_STATE with a struct make more
   sense?  It'd be simpler here,
  
  Not by much though, right?
  
   but not sure if it'd be simpler to use?
  
  The problem is with VHOST_SET_VRING_BASE as well. I want it to be
  separate because I want to make it possible to relocate e.g. used ring
  to another address while ring is running. This would be a good debugging
  tool (you look at kernel's used ring, check descriptor, then update
  guest's used ring) and also possibly an extra way to do migration.  And
  it's nicer to have vring size separate as well, because it is
  initialized by host and never changed, right?
 
 Actually, this looks wrong to me:
 
 + case VHOST_SET_VRING_BASE:
 ...
 + vq-avail_idx = vq-last_avail_idx = s.num;
 
 The last_avail_idx is part of the state of the driver.  It needs to be saved
 and restored over susp/resume.

Exactly. That's what VHOST_GET/SET_VRING_BASE does.  avail_idx is just a
cached value for notify on empty, so what this does is clear the value.
What exactly do you refer to when you say this looks wrong?
This could trigger an extra notification if I ever called
trigger_irq without get first. As I don't, it in fact has no effect.

  The only reason it's not in the ring itself
 is because I figured the other side doesn't need to see it (which is true, but
 missed debugging opportunities as well as man-in-the-middle issues like this
 one).  I had a patch which put this field at the end of the ring, I might
 resurrect it to avoid this problem.  This is backwards compatible with all
 implementations.  See patch at end.

Yes, I remember that patch. There seems to be little point though, at
this stage.

 
 I would drop avail_idx altogether: get_user is basically free, and
 simplifies a lot.  As most state is in the ring, all you need is an
 ioctl to save/restore the last_avail_idx.

avail_idx is there for notify on empty: I had this thought that it's
better to leave the avail cache line alone when we are triggering
interrupt to avoid bouncing it around if guest is updating it meanwhile
on another CPU, and I think my testing showed that it helped
performance, but could be a mistake.  You don't believe this can help?

  We could merge DESC, AVAIL, USED, and it will reduce the amount of code
  in userspace. With both base, size and fds separate, it seemed a bit
  more symmetrical to have desc/avail/used separate as well.
  What's your opinion?
 
 Well, DESC, AVAIL, and USED could easily be turned into SET/GET_LAYOUT.

OK, I'll do this.

   For future reference, this is *exactly* the kind of thing which would have
   been nice as a followup patch.  Easy to separate, easy to review, not 
   critical
   to the core.
  
  Yes. It's not too late to split it out though: should I do it yet?
 
 Only if you're feeling enthused.  It's lightly reviewed now.

Not really :) I'll keep this in mind for the