Re: virtio-blk performance regression and qemu-kvm

2012-02-29 Thread Stefan Hajnoczi
On Tue, Feb 28, 2012 at 5:15 PM, Martin Mailand mar...@tuxadero.com wrote:
 Hi Stefan,
 I was bisecting qemu-kvm.git.

qemu-kvm.git regularly merges from qemu.git.  The history of the
qemu-kvm.git repository is not linear because of these periodic merges
from the qemu.git tree.  I think what happened is that git bisect went
down a qemu.git merge, which resulted in you effectively testing
qemu.git instead of qemu-kvm.git!

You can see this by using gitk(1) or git log --graph and searching for
the bad commit (12d4536f7d).  This will show a merge commit
(0b9b128530b999e36f0629dddcbafeda114fb4fb) where these qemu.git
commits were brought into qemu-kvm.git's history.

qemu-kvm.git
|
...
|
* merge commit (0b9b128530b999e36f0629dddcbafeda114fb4fb)
|\
| * bad commit (12d4536f7d911b6d87a766ad7300482ea663cea2)
| |
| ... more qemu.git commits
|
* previous commit, also a merge commit
(4fefc55ab04dd77002750f771e96477b5d2a473f)

Bisect seems to have gone down the qemu.git side of the merge at
0b9b128530b.  Instead we need to go down the qemu-kvm.git side of the
history.

The key thing is that the second git-bisect steps from qemu-kvm.git
into qemu.git history the source tree will be fairly different and
performance between the two is not easy to compare.

I suggest testing both of the qemu-kvm.git merge commits, 0b9b128530b
and 4fefc55ab04d.  My guess is you will find they perform the same,
i.e. the qemu.git commits which were merged did not affect performance
in qemu-kvm.git.  That would be evidence that git-bisect has not
located the real culprit.

Stefan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log()

2012-02-29 Thread Avi Kivity
On 02/29/2012 09:49 AM, Takuya Yoshikawa wrote:
 Avi Kivity a...@redhat.com wrote:

   The key part of the implementation is the use of xchg() operation for
   clearing dirty bits atomically.  Since this allows us to update only
   BITS_PER_LONG pages at once, we need to iterate over the dirty bitmap
   until every dirty bit is cleared again for the next call.
  
  What about using cmpxchg16b?  That should reduce locked ops by a factor
  of 2 (but note it needs 16 bytes alignment).

 I tried cmpxchg16b first: the implementation could not be naturally
 combined with the for loop over the unsigned long array.

   Extra if not zero, alignement check and ... it was ugly
   and I guessed it would be slow.

 Taking it into account that cmpxchg16b needs more cycles than others,
 I think this should be tried carefully with more measurement later.

 How about concentrating on xchg now?

Okay.  But note you don't need the alignment check; simply allocate the
array aligned, and a multiple of 16 bytes, in the first place.


 The implementation is simple and gives us enough improvement for now.
 At least, I want to see whether xchg-based implementation works well
 for one release.

   GET_DIRTY_LOG can be easily tuned to one particular case and
   it is really hard to check whether the implementation works well
   for every important case.  I really want feedback from users
   before adding non-obvious optimization.

 In addition, we should care about the new API.  It is not decided about
 what kind of range can be ordered.  I think restricting the range to be
 long size aligned is natural.  Do you have any plan?

Not really.  But the current changes are all internal and don't affect
the user API.


   Another point to note is that we do not use for_each_set_bit() to check
   which ones in each BITS_PER_LONG pages are actually dirty.  Instead we
   simply use __ffs() and __fls() and pass the range in between the two
   positions found by them to kvm_mmu_write_protect_pt_range().
  
  This seems artificial.

 OK, then I want to pass the bits (unsingned long) as a mask.

 Non-NPT machines may gain some.

   Even though the passed range may include clean pages, it is much faster
   than repeatedly call find_next_bit() due to the locality of dirty pages.
  
  Perhaps this is due to the implementation of find_next_bit()?  would
  using bsf improve things?

 I need to explain what I did in the past.

 Before srcu-less work, I had already noticed the slowness of
 for_each_set_bit() and replaced it with simple for loop like now: the
 improvement was significant.

 Yes, find_next_bit() is for generic use and not at all good when there
 are many consecutive bits set: it cannot assume anything so needs to check
 a lot of cases - we have long size aligned bitmap and bits is already
 known to be non-zero after the first check of the for loop.

 Of course, doing 64 function calls alone should be avoided in our case.
 I also do not want to call kvm_mmu_* for each bit.

 So, above, I proposed just passing bits to kvm_mmu_*: we can check
 each bit i in a register before using rmap[i] if needed.

 __ffs is really fast compared to other APIs.

You could do something like this

for (i = 0; i  bitmap_size_in_longs; ++i) {
mask = bitmap[i];
if (!mask)
   continue;
for (j = __ffs(mask); mask; mask = mask - 1, j = __ffs(mask))
handle i * BITS_PER_LONG + j;
}

This gives you the speed of __ffs() but without working on zero bits.


 One note is that we will lose in cases like bits = 0x..

 2271171.412064.9   138.6  16K -45
 3375866.214743.3   103.0  32K -55
 4408395.610720.067.2  64K -51
 5915336.226538.145.1 128K -44
 8497356.416441.032.4 256K -29

 So the last one will become worse.  For other 4 patterns I am not sure.

 I thought that we should tune to the last case for gaining a lot from
 the locality of WWS.  What do you think about this point?

   - } else {
   - r = -EFAULT;
   - if (clear_user(log-dirty_bitmap, n))
   - goto out;
   + kvm_mmu_write_protect_pt_range(kvm, memslot, start, end);
  
  If indeed the problem is find_next_bit(), then we could hanve
  kvm_mmu_write_protect_slot_masked() which would just take the bitmap as
  a parameter.  This would allow covering just this function with the
  spinlock, not the xchg loop.

 We may see partial display updates if we do not hold the mmu_lock during
 xchg loop: it is possible that pages near the end of the framebuffer alone
 gets updated sometimes - I noticed this problem when I fixed the TLB flush
 issue.

I don't understand why this happens.


 Not a big problem but still maybe-noticeable change, so I think we should
 do it separately with some comments if needed.

Well if it's noticable on the framebuffer it's also noticable with live
migration.  We could do 

Re: Reconciling qemu-kvm and qemu's PIT

2012-02-29 Thread Avi Kivity
On 02/28/2012 11:47 PM, Jan Kiszka wrote:
  
  Looks like isa-pit has zero gpio pins, so it fails when crashing.  I
  must have mismerged it, but where is the gpio pin count set?

 In pit_initfn. Does -no-kvm-irqchip work fine? It's a bit tricky to
 discuss this without seeing your code.


pushed it into qemu-kvm.git usptream-merge

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH-WIP 01/13] xen/arm: use r12 to pass the hypercall number to the hypervisor

2012-02-29 Thread Dave Martin
On Tue, Feb 28, 2012 at 12:28:29PM +, Stefano Stabellini wrote:
 On Tue, 28 Feb 2012, Ian Campbell wrote:
  On Tue, 2012-02-28 at 10:20 +, Dave Martin wrote:
   On Mon, Feb 27, 2012 at 07:33:39PM +, Ian Campbell wrote:
On Mon, 2012-02-27 at 18:03 +, Dave Martin wrote:
  Since we support only ARMv7+ there are T2 and T3 encodings 
  available
  which do allow direct mov of an immediate into R12, but are 32 bit 
  Thumb
  instructions.
  
  Should we use r7 instead to maximise instruction density for Thumb 
  code?
 
 The difference seems trivial when put into context, even if you code a
 special Thumb version of the code to maximise density (the Thumb-2 
 code
 which gets built from assembler in the kernel is very suboptimal in
 size, but there simply isn't a high proportion of asm code in the 
 kernel
 anyway.)  I wouldn't consider the ARM/Thumb differences as an 
 important
 factor when deciding on a register.

OK, that's useful information. thanks.

 One argument for _not_ using r12 for this purpose is that it is then
 harder to put a generic HVC function (analogous to the syscall
 syscall) out-of-line, since r12 could get destroyed by the call.

For an out of line syscall(2) wouldn't the syscall number either be in a
standard C calling convention argument register or on the stack when the
function was called, since it is just a normal argument at that point?
As you point out it cannot be passed in r12 (and could never be, due to
the clobbering).

The syscall function itself would have to move the arguments and syscall
nr etc around before issuing the syscall.

I think the same is true of a similar hypercall(2)

 If you don't think you will ever care about putting HVC out of line
 though, it may not matter.
   
   If you have both inline and out-of-line hypercalls, it's hard to ensure
   that you never have to shuffle the registers in either case.
  
  Agreed.
  
  I think we want to optimise for the inline case since those are the
  majority.
 
 They are not just the majority, all of them are static inline at the
 moment, even on x86 (where the number of hypercalls is much higher).
 
 So yes, we should optimize for the inline case.
 
 
  The only non-inline case is the special privcmd ioctl which is the
  mechanism that allows the Xen toolstack to make hypercalls. It's
  somewhat akin to syscall(2). By the time you get to it you will already
  have done a system call for the ioctl, pulled the arguments from the
  ioctl argument structure etc, plus such hypercalls are not really
  performance critical.
 
 Even the privcmd hypercall (privcmd_call) is a static inline function,
 it is just that at the moment there is only one caller :)
 
 
   Shuffling can be reduced but only at the expense of strange argument
   ordering in some cases when calling from C -- the complexity is probably
   not worth it.  Linux doesn't bother for its own syscalls.
   
   Note that even in assembler, a branch from one section to a label in
   another section may cause r12 to get destroyed, so you will need to be
   careful about how you code the hypervisor trap handler.  However, this
   is not different from coding exception handlers in general, so I don't
   know that it constitutes a conclusive argument on its own.
  
  We are happy to arrange that this doesn't occur on our trap entry paths,
  at least until the guest register state has been saved. Currently the
  hypercall dispatcher is in C and gets r12 from the on-stack saved state.
  We will likely eventually optimise the hypercall path directly in ASM
  and in that case we are happy to take steps to ensure we don't clobber
  r12 before we need it.
 
 Yes, I don't think this should be an issue.

Fair enough.

   My instinctive preference would therefore be for r7 (which also seems to
   be good enough for Linux syscalls) -- but it really depends how many
   arguments you expect to need to support.
  
  Apparently r7 is the frame pointer for gcc in thumb mode which I think
  is a good reason to avoid it.
  
  We currently have some 5 argument hypercalls and there have been
  occasional suggestions for interfaces which use 6 -- although none of
  them have come to reality.
  
 I don't have a very strong opinion on which register we should use, but
 I would like to avoid r7 if it is already actively used by gcc.

But there is no framepointer for Thumb-2 code (?)

 The fact that r12 can be destroyed so easily is actually a good argument
 for using it because it means it is less likely to contain useful data
 that needs to be saved/restored by gcc.

That's a fair point.

Cheers
---Dave
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] kvm: notify host when guest paniced

2012-02-29 Thread Gleb Natapov
On Wed, Feb 29, 2012 at 09:08:52AM +0800, Wen Congyang wrote:
 At 02/28/2012 06:45 PM, Gleb Natapov Wrote:
  On Tue, Feb 28, 2012 at 11:19:47AM +0100, Jan Kiszka wrote:
  On 2012-02-28 10:42, Wen Congyang wrote:
  At 02/28/2012 05:34 PM, Jan Kiszka Wrote:
  On 2012-02-28 09:23, Wen Congyang wrote:
  At 02/27/2012 11:08 PM, Jan Kiszka Wrote:
  On 2012-02-27 04:01, Wen Congyang wrote:
  We can know the guest is paniced when the guest runs on xen.
  But we do not have such feature on kvm. This patch implemnts
  this feature, and the implementation is the same as xen:
  register panic notifier, and call hypercall when the guest
  is paniced.
 
  Signed-off-by: Wen Congyang we...@cn.fujitsu.com
  ---
   arch/x86/kernel/kvm.c|   12 
   arch/x86/kvm/svm.c   |8 ++--
   arch/x86/kvm/vmx.c   |8 ++--
   arch/x86/kvm/x86.c   |   13 +++--
   include/linux/kvm.h  |1 +
   include/linux/kvm_para.h |1 +
   6 files changed, 37 insertions(+), 6 deletions(-)
 
  diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
  index f0c6fd6..b928d1d 100644
  --- a/arch/x86/kernel/kvm.c
  +++ b/arch/x86/kernel/kvm.c
  @@ -331,6 +331,17 @@ static struct notifier_block kvm_pv_reboot_nb = {
.notifier_call = kvm_pv_reboot_notify,
   };
   
  +static int
  +kvm_pv_panic_notify(struct notifier_block *nb, unsigned long code, 
  void *unused)
  +{
  + kvm_hypercall0(KVM_HC_GUEST_PANIC);
  + return NOTIFY_DONE;
  +}
  +
  +static struct notifier_block kvm_pv_panic_nb = {
  + .notifier_call = kvm_pv_panic_notify,
  +};
  +
 
  You should split up host and guest-side changes.
 
   static u64 kvm_steal_clock(int cpu)
   {
u64 steal;
  @@ -417,6 +428,7 @@ void __init kvm_guest_init(void)
   
paravirt_ops_setup();
register_reboot_notifier(kvm_pv_reboot_nb);
  + atomic_notifier_chain_register(panic_notifier_list, 
  kvm_pv_panic_nb);
for (i = 0; i  KVM_TASK_SLEEP_HASHSIZE; i++)
spin_lock_init(async_pf_sleepers[i].lock);
if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF))
  diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
  index 0b7690e..38b4705 100644
  --- a/arch/x86/kvm/svm.c
  +++ b/arch/x86/kvm/svm.c
  @@ -1900,10 +1900,14 @@ static int halt_interception(struct vcpu_svm 
  *svm)
   
   static int vmmcall_interception(struct vcpu_svm *svm)
   {
  + int ret;
  +
svm-next_rip = kvm_rip_read(svm-vcpu) + 3;
skip_emulated_instruction(svm-vcpu);
  - kvm_emulate_hypercall(svm-vcpu);
  - return 1;
  + ret = kvm_emulate_hypercall(svm-vcpu);
  +
  + /* Ignore the error? */
  + return ret == 0 ? 0 : 1;
 
  Why can't kvm_emulate_hypercall return the right value?
 
  kvm_emulate_hypercall() will call kvm_hv_hypercall(), and
  kvm_hv_hypercall() will return 0 when vcpu's CPL  0.
  If vcpu's CPL  0, does kvm need to exit and tell it to
  qemu?
 
  No, there is currently no exit to userspace due to hypercalls, neither
  of HV nor KVM kind.
 
  The point is that the return code of kvm_emulate_hypercall is unused so
  far, so you can easily redefine it to encode continue vs. exit to
  userspace. Once someone has different needs, this could still be
  refactored again.
 
  So, it is OK to change the return value of kvm_hv_hypercall() if vcpu's
  CPL  0?
 
  Yes, change it to encode what vendor modules need to return to their
  callers.
 
  Better introduce new request flag and set it in your hypercall emulation. 
  See
  how triple fault is handled.
 
 triple fault sets KVM_EXIT_SHUTDOWN and exits to userspace. Do you mean 
 introduce
 a new value(like KVM_EXIT_SHUTDOWN)?
 
I mean introduce new request bit (like KVM_REQ_TRIPLE_FAULT) and set it
in your hypercall if exit to userspace is needed instead of changing
return values.

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] kvm: notify host when guest paniced

2012-02-29 Thread Avi Kivity
On 02/29/2012 03:29 AM, Wen Congyang wrote:
 At 02/28/2012 07:23 PM, Avi Kivity Wrote:
  On 02/27/2012 05:01 AM, Wen Congyang wrote:
  We can know the guest is paniced when the guest runs on xen.
  But we do not have such feature on kvm. This patch implemnts
  this feature, and the implementation is the same as xen:
  register panic notifier, and call hypercall when the guest
  is paniced.
  
  What's the motivation for this?  Xen does this is insufficient.

 Another purpose is: management app(for example: libvirt) can do auto
 dump when the guest is crashed. If management app does not do auto
 dump, the guest's user can do dump by hand if he sees the guest is
 paniced.

 I am thinking about another status: dumping. This status tells
 the guest's user that the guest is paniced, and the OS's dump function
 is working.

 These two status can tell the guest's user whether the guest is pancied,
 and what should he do if the guest is paniced.


How about using a virtio-serial channel for this?  You can transfer any
amount of information (including the dump itself).

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] kvm: notify host when guest paniced

2012-02-29 Thread Gleb Natapov
On Wed, Feb 29, 2012 at 11:49:58AM +0200, Avi Kivity wrote:
 On 02/29/2012 03:29 AM, Wen Congyang wrote:
  At 02/28/2012 07:23 PM, Avi Kivity Wrote:
   On 02/27/2012 05:01 AM, Wen Congyang wrote:
   We can know the guest is paniced when the guest runs on xen.
   But we do not have such feature on kvm. This patch implemnts
   this feature, and the implementation is the same as xen:
   register panic notifier, and call hypercall when the guest
   is paniced.
   
   What's the motivation for this?  Xen does this is insufficient.
 
  Another purpose is: management app(for example: libvirt) can do auto
  dump when the guest is crashed. If management app does not do auto
  dump, the guest's user can do dump by hand if he sees the guest is
  paniced.
 
  I am thinking about another status: dumping. This status tells
  the guest's user that the guest is paniced, and the OS's dump function
  is working.
 
  These two status can tell the guest's user whether the guest is pancied,
  and what should he do if the guest is paniced.
 
 
 How about using a virtio-serial channel for this?  You can transfer any
 amount of information (including the dump itself).
 
Isn't it unreliable after the guest panicked? Having special kdump
kernel that transfers dump to a host via virtio-serial channel though
sounds interesting. May be that's what you mean.

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH-WIP 01/13] xen/arm: use r12 to pass the hypercall number to the hypervisor

2012-02-29 Thread Ian Campbell
On Wed, 2012-02-29 at 09:34 +, Dave Martin wrote:
 On Tue, Feb 28, 2012 at 12:28:29PM +, Stefano Stabellini wrote:

  I don't have a very strong opinion on which register we should use, but
  I would like to avoid r7 if it is already actively used by gcc.
 
 But there is no framepointer for Thumb-2 code (?)

Peter Maydell suggested there was:
 r7 is (used by gcc as) the Thumb frame pointer; I don't know if this
 makes it worth avoiding in this context.

Sounds like it might be a gcc-ism, possibly a non-default option?

Anyway, I think r12 will be fine for our purposes so the point is rather
moot.

Ian.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] kvm: notify host when guest paniced

2012-02-29 Thread Daniel P. Berrange
On Wed, Feb 29, 2012 at 11:49:58AM +0200, Avi Kivity wrote:
 On 02/29/2012 03:29 AM, Wen Congyang wrote:
  At 02/28/2012 07:23 PM, Avi Kivity Wrote:
   On 02/27/2012 05:01 AM, Wen Congyang wrote:
   We can know the guest is paniced when the guest runs on xen.
   But we do not have such feature on kvm. This patch implemnts
   this feature, and the implementation is the same as xen:
   register panic notifier, and call hypercall when the guest
   is paniced.
   
   What's the motivation for this?  Xen does this is insufficient.
 
  Another purpose is: management app(for example: libvirt) can do auto
  dump when the guest is crashed. If management app does not do auto
  dump, the guest's user can do dump by hand if he sees the guest is
  paniced.
 
  I am thinking about another status: dumping. This status tells
  the guest's user that the guest is paniced, and the OS's dump function
  is working.
 
  These two status can tell the guest's user whether the guest is pancied,
  and what should he do if the guest is paniced.
 
 
 How about using a virtio-serial channel for this?  You can transfer any
 amount of information (including the dump itself).

When the guest OS has crashed, any dumps will be done from the host
OS using libvirt's core dump mechanism. The guest OS isn't involved
and is likely too dead to be of any use anyway. Likewise it is
quite probably too dead to work a virtio-serial channel or any
similarly complex device. We're really just after the simplest
possible notification that the guest kernel has paniced.

Regards,
Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] kvm: notify host when guest paniced

2012-02-29 Thread Avi Kivity
On 02/29/2012 11:55 AM, Gleb Natapov wrote:
  
  
  How about using a virtio-serial channel for this?  You can transfer any
  amount of information (including the dump itself).
  
 Isn't it unreliable after the guest panicked? 

So is calling hypercalls, or dumping, or writing to the screen.  Of
course calling a hypercall is simpler and so is more reliable.

 Having special kdump
 kernel that transfers dump to a host via virtio-serial channel though
 sounds interesting. May be that's what you mean.

Yes.  The panic, starting dump signal should be initiated by the
panicking kernel though, in case the dump fails.

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] kvm: notify host when guest paniced

2012-02-29 Thread Avi Kivity
On 02/29/2012 11:58 AM, Daniel P. Berrange wrote:
  
  How about using a virtio-serial channel for this?  You can transfer any
  amount of information (including the dump itself).

 When the guest OS has crashed, any dumps will be done from the host
 OS using libvirt's core dump mechanism. The guest OS isn't involved
 and is likely too dead to be of any use anyway. Likewise it is
 quite probably too dead to work a virtio-serial channel or any
 similarly complex device. We're really just after the simplest
 possible notification that the guest kernel has paniced.

If it's alive enough to panic, it's alive enough to kexec its kdump
kernel.  After that it can do anything.

Guest-internal dumps are more useful IMO that host-initiated dumps.  In
a cloud, the host-initiated dump is left on the host, outside the reach
of the guest admin, outside the guest image where all the symbols are,
and sometimes not even on the same host if a live migration occurred. 
It's more useful in small setups, or if the problem is in the
hypervisor, not the guest.

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] kvm: notify host when guest paniced

2012-02-29 Thread Gleb Natapov
On Wed, Feb 29, 2012 at 12:00:41PM +0200, Avi Kivity wrote:
 On 02/29/2012 11:55 AM, Gleb Natapov wrote:
   
   
   How about using a virtio-serial channel for this?  You can transfer any
   amount of information (including the dump itself).
   
  Isn't it unreliable after the guest panicked? 
 
 So is calling hypercalls, or dumping, or writing to the screen.  Of
 course calling a hypercall is simpler and so is more reliable.
 
Yes, crash can be so severe that it is not even detected by a kernel
itself, so not OOPS message even printed. But in most cases if kernel is
functional enough to print OOPS it is functional enough to call single
hypercall instruction.

  Having special kdump
  kernel that transfers dump to a host via virtio-serial channel though
  sounds interesting. May be that's what you mean.
 
 Yes.  The panic, starting dump signal should be initiated by the
 panicking kernel though, in case the dump fails.
 
Then panic hypercall sounds like a reasonable solution.

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] kvm: notify host when guest paniced

2012-02-29 Thread Avi Kivity
On 02/29/2012 12:05 PM, Gleb Natapov wrote:
 On Wed, Feb 29, 2012 at 12:00:41PM +0200, Avi Kivity wrote:
  On 02/29/2012 11:55 AM, Gleb Natapov wrote:


How about using a virtio-serial channel for this?  You can transfer any
amount of information (including the dump itself).

   Isn't it unreliable after the guest panicked? 
  
  So is calling hypercalls, or dumping, or writing to the screen.  Of
  course calling a hypercall is simpler and so is more reliable.
  
 Yes, crash can be so severe that it is not even detected by a kernel
 itself, so not OOPS message even printed. But in most cases if kernel is
 functional enough to print OOPS it is functional enough to call single
 hypercall instruction.

Why not print the oops to virtio-serial?  Or even just a regular serial
port?  That's what bare metal does.

   Having special kdump
   kernel that transfers dump to a host via virtio-serial channel though
   sounds interesting. May be that's what you mean.
  
  Yes.  The panic, starting dump signal should be initiated by the
  panicking kernel though, in case the dump fails.
  
 Then panic hypercall sounds like a reasonable solution.

It is, but I'm trying to see if we can get away with doing nothing.

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log()

2012-02-29 Thread Takuya Yoshikawa
Avi Kivity a...@redhat.com wrote:

 Okay.  But note you don't need the alignment check; simply allocate the
 array aligned, and a multiple of 16 bytes, in the first place.

OKay, then we can do something like:

for each (x = bitmap[i], y = bitmap[i+1])
if (!x  !y)
continue
else if ...
cmpxchg16b or 8b
...

I will try later.

  In addition, we should care about the new API.  It is not decided about
  what kind of range can be ordered.  I think restricting the range to be
  long size aligned is natural.  Do you have any plan?
 
 Not really.  But the current changes are all internal and don't affect
 the user API.

Then I will do my best to improve the performance now!

  __ffs is really fast compared to other APIs.
 
 You could do something like this
 
 for (i = 0; i  bitmap_size_in_longs; ++i) {
 mask = bitmap[i];
 if (!mask)
continue;
 for (j = __ffs(mask); mask; mask = mask - 1, j = __ffs(mask))
 handle i * BITS_PER_LONG + j;
 }
 
 This gives you the speed of __ffs() but without working on zero bits.

Yes, I will do this in v2.

  We may see partial display updates if we do not hold the mmu_lock during
  xchg loop: it is possible that pages near the end of the framebuffer alone
  gets updated sometimes - I noticed this problem when I fixed the TLB flush
  issue.
 
 I don't understand why this happens.

Because only mmu_lock protects the bitmap for VGA.

xchg i = 1
xchg i = 2
...
xchg i = N

We cannot get a complete snapshot without mmu_lock; if the guest faults on
the Nth page during xchg'ing i = 1, 2, ... then the i = N alone will
become newer.

But others will be updated by the next call, so the problem is restricted:
maybe not noticeable.

  Not a big problem but still maybe-noticeable change, so I think we should
  do it separately with some comments if needed.
 
 Well if it's noticable on the framebuffer it's also noticable with live
 migration.  We could do it later, but we need to really understand it first.

About live migration, we do not mind whether the bitmap is a complete snapshot.
In addition, we cannot do anything because the emulator can access the bitmap
without mmu_lock.

What we are doing is calling GET_DIRTY_LOG slot by slot: so already the
result is not a snapshot at all.

In the end, at the last stage, we will stop the guest and get a complete 
snapshot.

  In addition, we do not want to scan the dirty bitmap twice.  Using the
  bits value soon after it is read into a register seems to be the fastest.
 
 Probably.
 
  BTW, I also want to decide the design of the new API at this chance.
 
 Let's wait with that.  We're adding APIs too quickly.  Let's squeeze
 everything we can out of the current APIs.

I agree with you of course.

At the same time, we cannot say anything without actually implementing
sample userspace programs.

So I want to see how much improvement the proposed API can achieve.

I thought this might be a good GSoC project but ...


Takuya
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] kvm: notify host when guest paniced

2012-02-29 Thread Wen Congyang
At 02/29/2012 05:36 PM, Gleb Natapov Wrote:
 On Wed, Feb 29, 2012 at 09:08:52AM +0800, Wen Congyang wrote:
 At 02/28/2012 06:45 PM, Gleb Natapov Wrote:
 On Tue, Feb 28, 2012 at 11:19:47AM +0100, Jan Kiszka wrote:
 On 2012-02-28 10:42, Wen Congyang wrote:
 At 02/28/2012 05:34 PM, Jan Kiszka Wrote:
 On 2012-02-28 09:23, Wen Congyang wrote:
 At 02/27/2012 11:08 PM, Jan Kiszka Wrote:
 On 2012-02-27 04:01, Wen Congyang wrote:
 We can know the guest is paniced when the guest runs on xen.
 But we do not have such feature on kvm. This patch implemnts
 this feature, and the implementation is the same as xen:
 register panic notifier, and call hypercall when the guest
 is paniced.

 Signed-off-by: Wen Congyang we...@cn.fujitsu.com
 ---
  arch/x86/kernel/kvm.c|   12 
  arch/x86/kvm/svm.c   |8 ++--
  arch/x86/kvm/vmx.c   |8 ++--
  arch/x86/kvm/x86.c   |   13 +++--
  include/linux/kvm.h  |1 +
  include/linux/kvm_para.h |1 +
  6 files changed, 37 insertions(+), 6 deletions(-)

 diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
 index f0c6fd6..b928d1d 100644
 --- a/arch/x86/kernel/kvm.c
 +++ b/arch/x86/kernel/kvm.c
 @@ -331,6 +331,17 @@ static struct notifier_block kvm_pv_reboot_nb = {
   .notifier_call = kvm_pv_reboot_notify,
  };
  
 +static int
 +kvm_pv_panic_notify(struct notifier_block *nb, unsigned long code, 
 void *unused)
 +{
 + kvm_hypercall0(KVM_HC_GUEST_PANIC);
 + return NOTIFY_DONE;
 +}
 +
 +static struct notifier_block kvm_pv_panic_nb = {
 + .notifier_call = kvm_pv_panic_notify,
 +};
 +

 You should split up host and guest-side changes.

  static u64 kvm_steal_clock(int cpu)
  {
   u64 steal;
 @@ -417,6 +428,7 @@ void __init kvm_guest_init(void)
  
   paravirt_ops_setup();
   register_reboot_notifier(kvm_pv_reboot_nb);
 + atomic_notifier_chain_register(panic_notifier_list, 
 kvm_pv_panic_nb);
   for (i = 0; i  KVM_TASK_SLEEP_HASHSIZE; i++)
   spin_lock_init(async_pf_sleepers[i].lock);
   if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF))
 diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
 index 0b7690e..38b4705 100644
 --- a/arch/x86/kvm/svm.c
 +++ b/arch/x86/kvm/svm.c
 @@ -1900,10 +1900,14 @@ static int halt_interception(struct vcpu_svm 
 *svm)
  
  static int vmmcall_interception(struct vcpu_svm *svm)
  {
 + int ret;
 +
   svm-next_rip = kvm_rip_read(svm-vcpu) + 3;
   skip_emulated_instruction(svm-vcpu);
 - kvm_emulate_hypercall(svm-vcpu);
 - return 1;
 + ret = kvm_emulate_hypercall(svm-vcpu);
 +
 + /* Ignore the error? */
 + return ret == 0 ? 0 : 1;

 Why can't kvm_emulate_hypercall return the right value?

 kvm_emulate_hypercall() will call kvm_hv_hypercall(), and
 kvm_hv_hypercall() will return 0 when vcpu's CPL  0.
 If vcpu's CPL  0, does kvm need to exit and tell it to
 qemu?

 No, there is currently no exit to userspace due to hypercalls, neither
 of HV nor KVM kind.

 The point is that the return code of kvm_emulate_hypercall is unused so
 far, so you can easily redefine it to encode continue vs. exit to
 userspace. Once someone has different needs, this could still be
 refactored again.

 So, it is OK to change the return value of kvm_hv_hypercall() if vcpu's
 CPL  0?

 Yes, change it to encode what vendor modules need to return to their
 callers.

 Better introduce new request flag and set it in your hypercall emulation. 
 See
 how triple fault is handled.

 triple fault sets KVM_EXIT_SHUTDOWN and exits to userspace. Do you mean 
 introduce
 a new value(like KVM_EXIT_SHUTDOWN)?

 I mean introduce new request bit (like KVM_REQ_TRIPLE_FAULT) and set it
 in your hypercall if exit to userspace is needed instead of changing
 return values.

I donot notice it. Thanks for you suggestion. I will modify my patch.

Thanks
Wen Congyang

 
 --
   Gleb.
 

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] blockdev operations [was: KVM call agenda for Tuesday 28th]

2012-02-29 Thread Kevin Wolf
Am 28.02.2012 17:07, schrieb Eric Blake:
 On 02/28/2012 07:58 AM, Stefan Hajnoczi wrote:
 On Tue, Feb 28, 2012 at 2:47 PM, Paolo Bonzini pbonz...@redhat.com wrote:
 Il 28/02/2012 15:39, Stefan Hajnoczi ha scritto:
 I'm not a fan of transactions or freeze/thaw (if used to atomically
 perform other commands).

 We should not export low-level block device operations so that
 external software can micromanage via QMP.  I don't think this is a
 good idea because it takes the block device offline and possibly
 blocks the VM.  We're reaching a level comparable to an HTTP interface
 for acquiring pthread mutex, doing some operations, and then another
 HTTP request to unlock it.  This is micromanagement it will create
 more problems because we will have to support lots of little API
 functions.

 So you're for extending Jeff's patches to group mirroring etc.?

 That's also my favorite one, assuming we can do it in time for 1.1.

 Yes, that's the approach I like the most.  It's relatively clean and
 leaves us space to develop -blockdev.
 
 Here's the idea I was forming based on today's call:
 
 Jeff's idea of a group operation can be extended to allow multiple
 operations while reusing the framework.  For oVirt, we need the ability
 to open a mirror (by passing the mirror file alongside the name of the
 new external snapshot), as well as reopening a blockdev (to pivot to the
 other side of an already-open mirror).
 
 Is there a way to express a designated union in QMP?  I'm thinking
 something along the lines of having the overall group command take a
 list of operations, where each operation can either be 'create a
 snapshot', 'create a snapshot and mirror', or 'reopen a mirror'.
 
 I'm thinking it might look something like:
 
 { 'enum': 'BlockdevOp',
   'data': [ 'snapshot', 'snapshot-mirror', 'reopen' ] }
 { 'type': 'BlockdevAction',
   'data': {'device': 'str', 'op': 'BlockdevOp',
'file': 'str', '*format': 'str', '*reuse': 'bool',
'*mirror': 'str', '*mirror-format': 'str' } }
 { 'command': 'blkdev-group-action-sync',
   'data': { 'actionlist': [ 'BlockdevAction' ] } }

I think the general approach is good.

Your implementation in QAPI is kind of ugly because you mix the
arguments of all three commands in BlockdevAction (how about
extensibility? And the optional flags aren't the full truth either,
we'd have to add checks in the handlers.). We really want to have some
real union support in QAPI that creates three different C structs.

So something like (I'll reintroduce the bad word transaction, because
it's really what we're doing here, just everything in one command):

{ 'type': 'BlockdevSnapshot',
  'data': {'device': 'str', 'snapshot-file': 'str', '*format': 'str' } }
{ 'union': 'BlockdevAction',
  'types': {
   'snapshot': 'BlockdevSnapshot',
   'mirror': 'BlockdevMirror', ...
   } }
{ 'command': 'blockdev-transaction',
  'data': { 'actionlist': [ 'BlockdevAction' ] } }

On the wire in QMP this would look like:

{ 'execute': 'blockdev-transaction',
  'arguments': {
'actionlist': [
  { 'type': 'snapshot',
'data': { 'device': 'ide-hd0', ... } },
  { 'type': 'mirror',
'data': { ... }
]
  }
}

Now if you look closely, BlockdevSnapshot is exactly what Jeff called
SnapshotDev in his series, which in turn is the definition of the
blockdev-snapshot-sync command. We can reuse all of this even in the API
of a new more generic command.

So my conclusion for now is that we can merge the group snapshots right
now instead of waiting for the mirror design to be completed, and we can
reuse everything in a more complex transaction command later.

Kevin
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Reconciling qemu-kvm and qemu's PIT

2012-02-29 Thread Jan Kiszka
On 2012-02-29 10:29, Avi Kivity wrote:
 On 02/28/2012 11:47 PM, Jan Kiszka wrote:

 Looks like isa-pit has zero gpio pins, so it fails when crashing.  I
 must have mismerged it, but where is the gpio pin count set?

 In pit_initfn. Does -no-kvm-irqchip work fine? It's a bit tricky to
 discuss this without seeing your code.

 
 pushed it into qemu-kvm.git usptream-merge
 

As I assumed: You try to initialize the kvm-pit like the userspace pit,
that cannot work. Don't run the qdev_connect_gpio_out in pit_init, just
like my kvm-pit patches does in their kvm_pit_init.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] kvm: notify host when guest paniced

2012-02-29 Thread Wen Congyang
At 02/29/2012 06:08 PM, Avi Kivity Wrote:
 On 02/29/2012 12:05 PM, Gleb Natapov wrote:
 On Wed, Feb 29, 2012 at 12:00:41PM +0200, Avi Kivity wrote:
 On 02/29/2012 11:55 AM, Gleb Natapov wrote:


 How about using a virtio-serial channel for this?  You can transfer any
 amount of information (including the dump itself).

 Isn't it unreliable after the guest panicked? 

 So is calling hypercalls, or dumping, or writing to the screen.  Of
 course calling a hypercall is simpler and so is more reliable.

 Yes, crash can be so severe that it is not even detected by a kernel
 itself, so not OOPS message even printed. But in most cases if kernel is
 functional enough to print OOPS it is functional enough to call single
 hypercall instruction.
 
 Why not print the oops to virtio-serial?  Or even just a regular serial
 port?  That's what bare metal does.

If virtio-serial's driver has bug or the guest doesn't have such device...

 
 Having special kdump
 kernel that transfers dump to a host via virtio-serial channel though
 sounds interesting. May be that's what you mean.

 Yes.  The panic, starting dump signal should be initiated by the
 panicking kernel though, in case the dump fails.

 Then panic hypercall sounds like a reasonable solution.
 
 It is, but I'm trying to see if we can get away with doing nothing.
 

If we have a reliable way with doing nothing, it is better. But I donot
find such way.

Thanks
Wen Congyang

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] kvm: notify host when guest paniced

2012-02-29 Thread Daniel P. Berrange
On Wed, Feb 29, 2012 at 12:05:32PM +0200, Avi Kivity wrote:
 On 02/29/2012 11:58 AM, Daniel P. Berrange wrote:
   
   How about using a virtio-serial channel for this?  You can transfer any
   amount of information (including the dump itself).
 
  When the guest OS has crashed, any dumps will be done from the host
  OS using libvirt's core dump mechanism. The guest OS isn't involved
  and is likely too dead to be of any use anyway. Likewise it is
  quite probably too dead to work a virtio-serial channel or any
  similarly complex device. We're really just after the simplest
  possible notification that the guest kernel has paniced.
 
 If it's alive enough to panic, it's alive enough to kexec its kdump
 kernel.  After that it can do anything.
 
 Guest-internal dumps are more useful IMO that host-initiated dumps.  In
 a cloud, the host-initiated dump is left on the host, outside the reach
 of the guest admin, outside the guest image where all the symbols are,
 and sometimes not even on the same host if a live migration occurred. 
 It's more useful in small setups, or if the problem is in the
 hypervisor, not the guest.

I don't think guest vs host dumps should be considered mutually exclusive,
they both have pluses+minuses.

Configuring kexec+kdump requires non-negligable guest admin configuration
work before it's usable, and this work is guest OS specific, if it is possible
at all. A permanent panic notifier that's built in the kernel by default
requires zero guest admin config, and can allow host admin to automate
collection of dumps across all their hosts/guests. The KVM hypercall
notification is fairly trivially ported to any OS kernel, by comparison
with a full virtio + virtio-serial impl.

Regards,
Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] kvm: notify host when guest paniced

2012-02-29 Thread Wen Congyang
At 02/29/2012 06:05 PM, Avi Kivity Wrote:
 On 02/29/2012 11:58 AM, Daniel P. Berrange wrote:

 How about using a virtio-serial channel for this?  You can transfer any
 amount of information (including the dump itself).

 When the guest OS has crashed, any dumps will be done from the host
 OS using libvirt's core dump mechanism. The guest OS isn't involved
 and is likely too dead to be of any use anyway. Likewise it is
 quite probably too dead to work a virtio-serial channel or any
 similarly complex device. We're really just after the simplest
 possible notification that the guest kernel has paniced.
 
 If it's alive enough to panic, it's alive enough to kexec its kdump
 kernel.  After that it can do anything.
 
 Guest-internal dumps are more useful IMO that host-initiated dumps.  In

Yes, guest-internal dump is better than host dump. But the user may not
start guest-internal dump or guest-internal dump failed. So we need the
following feature:
1. If the guest-internal dump does not work, the guest's status is 'crashed'.
   And then the user does the host dump.
2. If the guest-internal dump is working, the guest's status should be
   'dumping'. The user see this status and know the guest has paniced, and
the guest-internal dump is working.

Thanks
Wen Congyang

 a cloud, the host-initiated dump is left on the host, outside the reach
 of the guest admin, outside the guest image where all the symbols are,
 and sometimes not even on the same host if a live migration occurred. 
 It's more useful in small setups, or if the problem is in the
 hypervisor, not the guest.
 

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] kvm: notify host when guest paniced

2012-02-29 Thread Avi Kivity
On 02/29/2012 12:17 PM, Wen Congyang wrote:
 
  Yes, crash can be so severe that it is not even detected by a kernel
  itself, so not OOPS message even printed. But in most cases if kernel is
  functional enough to print OOPS it is functional enough to call single
  hypercall instruction.
  
  Why not print the oops to virtio-serial?  Or even just a regular serial
  port?  That's what bare metal does.

 If virtio-serial's driver has bug or the guest doesn't have such device...

We have the same issue with the hypercall; and virtio-serial is
available on many deployed versions.

  
  Having special kdump
  kernel that transfers dump to a host via virtio-serial channel though
  sounds interesting. May be that's what you mean.
 
  Yes.  The panic, starting dump signal should be initiated by the
  panicking kernel though, in case the dump fails.
 
  Then panic hypercall sounds like a reasonable solution.
  
  It is, but I'm trying to see if we can get away with doing nothing.
  

 If we have a reliable way with doing nothing, it is better. But I donot
 find such way.

We won't have a 100% reliable way.  But I think a variant of the driver
that doesn't use interrupts, or just using the ordinary serial driver,
should be reliable enough.

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] kvm: notify host when guest paniced

2012-02-29 Thread Avi Kivity
On 02/29/2012 12:19 PM, Daniel P. Berrange wrote:
 On Wed, Feb 29, 2012 at 12:05:32PM +0200, Avi Kivity wrote:
  On 02/29/2012 11:58 AM, Daniel P. Berrange wrote:

How about using a virtio-serial channel for this?  You can transfer any
amount of information (including the dump itself).
  
   When the guest OS has crashed, any dumps will be done from the host
   OS using libvirt's core dump mechanism. The guest OS isn't involved
   and is likely too dead to be of any use anyway. Likewise it is
   quite probably too dead to work a virtio-serial channel or any
   similarly complex device. We're really just after the simplest
   possible notification that the guest kernel has paniced.
  
  If it's alive enough to panic, it's alive enough to kexec its kdump
  kernel.  After that it can do anything.
  
  Guest-internal dumps are more useful IMO that host-initiated dumps.  In
  a cloud, the host-initiated dump is left on the host, outside the reach
  of the guest admin, outside the guest image where all the symbols are,
  and sometimes not even on the same host if a live migration occurred. 
  It's more useful in small setups, or if the problem is in the
  hypervisor, not the guest.

 I don't think guest vs host dumps should be considered mutually exclusive,
 they both have pluses+minuses.

True.

 Configuring kexec+kdump requires non-negligable guest admin configuration
 work before it's usable, and this work is guest OS specific, if it is possible
 at all. 

I think it's on by default on Windows and requires installing a package
on Linux, which may be part of the default configuration on many distros.

 A permanent panic notifier that's built in the kernel by default
 requires zero guest admin config, and can allow host admin to automate
 collection of dumps across all their hosts/guests. The KVM hypercall
 notification is fairly trivially ported to any OS kernel, by comparison
 with a full virtio + virtio-serial impl.

That's the path of least resistance.  But it's not necessarily the best
path.  We end up with a wide set of disconnected ABIs instead of a
narrow set that is more flexible.

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] kvm: notify host when guest paniced

2012-02-29 Thread Gleb Natapov
On Wed, Feb 29, 2012 at 12:08:19PM +0200, Avi Kivity wrote:
 On 02/29/2012 12:05 PM, Gleb Natapov wrote:
  On Wed, Feb 29, 2012 at 12:00:41PM +0200, Avi Kivity wrote:
   On 02/29/2012 11:55 AM, Gleb Natapov wrote:
 
 
 How about using a virtio-serial channel for this?  You can transfer 
 any
 amount of information (including the dump itself).
 
Isn't it unreliable after the guest panicked? 
   
   So is calling hypercalls, or dumping, or writing to the screen.  Of
   course calling a hypercall is simpler and so is more reliable.
   
  Yes, crash can be so severe that it is not even detected by a kernel
  itself, so not OOPS message even printed. But in most cases if kernel is
  functional enough to print OOPS it is functional enough to call single
  hypercall instruction.
 
 Why not print the oops to virtio-serial?  Or even just a regular serial
 port?  That's what bare metal does.
 
The more interface is complex the more chances it will fail during
panic. Regular serial is likely more reliable than virtio-serial.
Hypercall is likely more reliable than uart. On serial user can
fake panic notification. Can this be problematic?

Having special kdump
kernel that transfers dump to a host via virtio-serial channel though
sounds interesting. May be that's what you mean.
   
   Yes.  The panic, starting dump signal should be initiated by the
   panicking kernel though, in case the dump fails.
   
  Then panic hypercall sounds like a reasonable solution.
 
 It is, but I'm trying to see if we can get away with doing nothing.
 
Fair enough.

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] kvm: notify host when guest paniced

2012-02-29 Thread Avi Kivity
On 02/29/2012 12:31 PM, Wen Congyang wrote:
 At 02/29/2012 06:05 PM, Avi Kivity Wrote:
  On 02/29/2012 11:58 AM, Daniel P. Berrange wrote:
 
  How about using a virtio-serial channel for this?  You can transfer any
  amount of information (including the dump itself).
 
  When the guest OS has crashed, any dumps will be done from the host
  OS using libvirt's core dump mechanism. The guest OS isn't involved
  and is likely too dead to be of any use anyway. Likewise it is
  quite probably too dead to work a virtio-serial channel or any
  similarly complex device. We're really just after the simplest
  possible notification that the guest kernel has paniced.
  
  If it's alive enough to panic, it's alive enough to kexec its kdump
  kernel.  After that it can do anything.
  
  Guest-internal dumps are more useful IMO that host-initiated dumps.  In

 Yes, guest-internal dump is better than host dump. But the user may not
 start guest-internal dump or guest-internal dump failed. So we need the
 following feature:
 1. If the guest-internal dump does not work, the guest's status is 'crashed'.
And then the user does the host dump.
 2. If the guest-internal dump is working, the guest's status should be
'dumping'. The user see this status and know the guest has paniced, and
 the guest-internal dump is working.

I agree.  There is room for host dump, and we do want notification about
what the guest is doing.  The question is whether we should reuse
virtio-serial for guest-host communication in this case.  It's more
complicated, but allows us to avoid touching the hypervisor.

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] kvm: notify host when guest paniced

2012-02-29 Thread Avi Kivity
On 02/29/2012 12:44 PM, Gleb Natapov wrote:

   Yes, crash can be so severe that it is not even detected by a kernel
   itself, so not OOPS message even printed. But in most cases if kernel is
   functional enough to print OOPS it is functional enough to call single
   hypercall instruction.
  
  Why not print the oops to virtio-serial?  Or even just a regular serial
  port?  That's what bare metal does.
  
 The more interface is complex the more chances it will fail during
 panic. Regular serial is likely more reliable than virtio-serial.
 Hypercall is likely more reliable than uart. On serial user can
 fake panic notification. 

The serial device is under control of the kernel, so the user can only
access it if the kernel so allows.

 Can this be problematic?

From the point of view of the guest, yes, it can generate support calls
in fill up the admin's dump quota.  From the point of view of the host,
there's no difference between the guest's admin and a random user.

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] kvm: notify host when guest paniced

2012-02-29 Thread Gleb Natapov
On Wed, Feb 29, 2012 at 12:48:57PM +0200, Avi Kivity wrote:
 On 02/29/2012 12:44 PM, Gleb Natapov wrote:
 
Yes, crash can be so severe that it is not even detected by a kernel
itself, so not OOPS message even printed. But in most cases if kernel is
functional enough to print OOPS it is functional enough to call single
hypercall instruction.
   
   Why not print the oops to virtio-serial?  Or even just a regular serial
   port?  That's what bare metal does.
   
  The more interface is complex the more chances it will fail during
  panic. Regular serial is likely more reliable than virtio-serial.
  Hypercall is likely more reliable than uart. On serial user can
  fake panic notification. 
 
 The serial device is under control of the kernel, so the user can only
 access it if the kernel so allows.
 
Yes, but if we will hijack UART for panic notification VM user will not
be able to use it for anything else. virtio-serial have many channels,
but AFAIK it does not work at early stages of boot process.

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [SeaBIOS] [PATCH RFC] seabios: add OSHP method stub

2012-02-29 Thread Gerd Hoffmann
 diff --git a/src/ssdt-pcihp.dsl b/src/ssdt-pcihp.dsl
 index 442e7a8..3f50169 100644
 --- a/src/ssdt-pcihp.dsl
 +++ b/src/ssdt-pcihp.dsl
 @@ -24,6 +24,7 @@ DefinitionBlock (ssdt-pcihp.aml, SSDT, 0x01, BXPC, 
 BXSSDTPCIHP, 0x1)
 ACPI_EXTRACT_METHOD_STRING aml_ej0_name  \
 Method (_EJ0, 1) { Return(PCEJ(0x##slot)) }  \
 Name (_SUN, 0x##slot)\
 +   Method (OSHP, 1) { Return(0x0) }  \
  }


Linux kernel (kernel-3.2.7-1.fc16.x86_64) complains:

ACPI Warning: For \_SB_.PCI0.S10_.OSHP: Insufficient arguments - needs
1, found 0 (20110623/nspredef-321)

cheers,
  Gerd
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH-WIP 01/13] xen/arm: use r12 to pass the hypercall number to the hypervisor

2012-02-29 Thread Dave Martin
On Wed, Feb 29, 2012 at 09:56:02AM +, Ian Campbell wrote:
 On Wed, 2012-02-29 at 09:34 +, Dave Martin wrote:
  On Tue, Feb 28, 2012 at 12:28:29PM +, Stefano Stabellini wrote:
 
   I don't have a very strong opinion on which register we should use, but
   I would like to avoid r7 if it is already actively used by gcc.
  
  But there is no framepointer for Thumb-2 code (?)
 
 Peter Maydell suggested there was:
  r7 is (used by gcc as) the Thumb frame pointer; I don't know if this
  makes it worth avoiding in this context.
 
 Sounds like it might be a gcc-ism, possibly a non-default option?

I seem to remember discussions about some cruft in gcc related to this.
gcc actually barfs at you if you try to allocate r7 to inline asm
without -fomit-frame-pointer.  That use for r7 really relates to the
legacy ABI, so this may be a bug.

 Anyway, I think r12 will be fine for our purposes so the point is rather
 moot.

Yes, it sounds like it.  If that r7 issue is a gcc bug, this would avoid
it.

If you leave the job of putting the right constant into r12 up to gcc,
it should generate reasonable for you without having to code it
explicitly anyway:

register int hvc_num asm(r12) = 0xDEADBEEF;

asm volatile (
hvc0
:: r (hvc_num)
)

Cheers
---Dave
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log()

2012-02-29 Thread Avi Kivity
On 02/29/2012 12:16 PM, Takuya Yoshikawa wrote:
   We may see partial display updates if we do not hold the mmu_lock during
   xchg loop: it is possible that pages near the end of the framebuffer alone
   gets updated sometimes - I noticed this problem when I fixed the TLB flush
   issue.
  
  I don't understand why this happens.

 Because only mmu_lock protects the bitmap for VGA.

 xchg i = 1
 xchg i = 2
 ...
 xchg i = N

 We cannot get a complete snapshot without mmu_lock; if the guest faults on
 the Nth page during xchg'ing i = 1, 2, ... then the i = N alone will
 become newer.

Ah, so there is no data corruption (missed dirty bits), just the display
is updated inconsistently?

I don't think we can get a consistent snapshot anyway, since the guest
can update the framebuffer while userspace is processing it.


 But others will be updated by the next call, so the problem is restricted:
 maybe not noticeable.

   Not a big problem but still maybe-noticeable change, so I think we should
   do it separately with some comments if needed.
  
  Well if it's noticable on the framebuffer it's also noticable with live
  migration.  We could do it later, but we need to really understand it first.

 About live migration, we do not mind whether the bitmap is a complete 
 snapshot.
 In addition, we cannot do anything because the emulator can access the bitmap
 without mmu_lock.

 What we are doing is calling GET_DIRTY_LOG slot by slot: so already the
 result is not a snapshot at all.

 In the end, at the last stage, we will stop the guest and get a complete 
 snapshot.

Understood.  I don't think we can get a consistent vga snapshot without
stopping the guest, and even then, it depends on how the guest updates
the framebuffer.


   In addition, we do not want to scan the dirty bitmap twice.  Using the
   bits value soon after it is read into a register seems to be the fastest.
  
  Probably.
  
   BTW, I also want to decide the design of the new API at this chance.
  
  Let's wait with that.  We're adding APIs too quickly.  Let's squeeze
  everything we can out of the current APIs.

 I agree with you of course.

 At the same time, we cannot say anything without actually implementing
 sample userspace programs.

 So I want to see how much improvement the proposed API can achieve.

 I thought this might be a good GSoC project but ...

It may be too involved for GSoC, the issues are difficult.

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH-WIP 01/13] xen/arm: use r12 to pass the hypercall number to the hypervisor

2012-02-29 Thread Dave Martin
On Wed, Feb 29, 2012 at 09:56:02AM +, Ian Campbell wrote:
 On Wed, 2012-02-29 at 09:34 +, Dave Martin wrote:
  On Tue, Feb 28, 2012 at 12:28:29PM +, Stefano Stabellini wrote:
 
   I don't have a very strong opinion on which register we should use, but
   I would like to avoid r7 if it is already actively used by gcc.
  
  But there is no framepointer for Thumb-2 code (?)
 
 Peter Maydell suggested there was:
  r7 is (used by gcc as) the Thumb frame pointer; I don't know if this
  makes it worth avoiding in this context.
 
 Sounds like it might be a gcc-ism, possibly a non-default option?
 
 Anyway, I think r12 will be fine for our purposes so the point is rather
 moot.

Just had a chat with some tools guys -- apparently, when passing register
arguments to gcc inline asms there really isn't a guarantee that those
variables will be in the expected registers on entry to the inline asm.

If gcc reorders other function calls or other code around the inline asm
(which it can do, except under certain controlled situations), then
intervening code can clobber any registers in general.

Or, to summarise another way, there is no way to control which register
is used to pass something to an inline asm in general (often we get away
with this, and there are a lot of inline asms in the kernel that assume
it works, but the more you inline the more likely you are to get nasty
surprises).  There is no workaroud, except on some architectures where
special asm constraints allow specific individual registers to be
specified for operands (i386 for example).

If you need a specific register, this means that you must set up that
register explicitly inside the asm if you want a guarantee that the
code will work:

asm volatile (
movw   r12, %[hvc_num]\n\t
...
hvc#0
:: [hvc_num] i (NUMBER) : r12
);

Of course, if you need to set up more than about 5 or 6 registers in
this way, the doubled register footprint means that the compiler will
have to start spilling stuff to the stack.


This is the kind of problem which goes away when out-of-lining the
hvc wrapper behind a C function interface, since the ABI then provides
guarantees about how values are mershaled into and out of that code.


Notwithstanding the above, even if we do make theoretically unsound
(but often true) assumptions about inline asms, ARM will be no worse
than other arches in this respect.


Other than serving as a reminder that inline asm is a deep can of
worms, this doesn't really give us a neat solution...

---Dave
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: virtio-blk performance regression and qemu-kvm

2012-02-29 Thread Martin Mailand

Hi Stefan,
you are right, the performance for the commits 0b9b128530b and 
4fefc55ab04d are both good.

What is the best approach to stay in the qemu-kvm.git history?

-martin

On 29.02.2012 09:38, Stefan Hajnoczi wrote:

I suggest testing both of the qemu-kvm.git merge commits, 0b9b128530b
and 4fefc55ab04d.  My guess is you will find they perform the same,
i.e. the qemu.git commits which were merged did not affect performance
in qemu-kvm.git.  That would be evidence that git-bisect has not
located the real culprit.


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] perf/x86: Fix HO/GO counting with SVM disabled

2012-02-29 Thread Joerg Roedel
On Tue, Feb 28, 2012 at 07:38:34PM +0200, Avi Kivity wrote:
 On 02/28/2012 07:36 PM, David Ahern wrote:

  I was to suggest the reverse: since this patch addesses an AMD bug,
  why not push those functions into perf_event_amd.c and make them
  dependent on CONFIG_CPU_SUP_AMD as well.
 
 It depends on which direction you expect the code to grow.  These hooks
 seem reasonable, so I think they should be generic.

Okay, I am fine with both directions. But since this patch is a fix in
the first place I make it more AMD specific for now. If we need this as
a generic feature it can easily be changed back.


Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3] KVM: Resize kvm_io_range array dynamically

2012-02-29 Thread Amos Kong
kvm_io_bus devices are used for ioevent, pit, pic, ioapic,
coalesced_mmio.

Currently Qemu only emulates one PCI bus, it contains 32 slots,
one slot contains 8 functions, maximum of supported PCI devices:
 1 * 32 * 8 = 256. The maximum of coalesced mmio zone is 100,
each zone has an iobus devices. 300 io_bus devices is not enough.

This patch makes the kvm_io_range array can be resized dynamically.

Changes from v1:
- fix typo: kvm_io_bus_range - kvm_io_range

Changes from v2:
- unregister device only when it exists

Signed-off-by: Amos Kong ak...@redhat.com
Reviewed-by: Jason Wang jasow...@redhat.com
CC: Alex Williamson alex.william...@redhat.com
---
 include/linux/kvm_host.h |3 +--
 virt/kvm/kvm_main.c  |   41 +
 2 files changed, 22 insertions(+), 22 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 355e445..0e6d9d2 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -69,8 +69,7 @@ struct kvm_io_range {
 
 struct kvm_io_bus {
int   dev_count;
-#define NR_IOBUS_DEVS 300
-   struct kvm_io_range range[NR_IOBUS_DEVS];
+   struct kvm_io_range range[];
 };
 
 enum kvm_bus {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e4431ad..1275979 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2389,9 +2389,6 @@ int kvm_io_bus_sort_cmp(const void *p1, const void *p2)
 int kvm_io_bus_insert_dev(struct kvm_io_bus *bus, struct kvm_io_device *dev,
  gpa_t addr, int len)
 {
-   if (bus-dev_count == NR_IOBUS_DEVS)
-   return -ENOSPC;
-
bus-range[bus-dev_count++] = (struct kvm_io_range) {
.addr = addr,
.len = len,
@@ -2491,10 +2488,12 @@ int kvm_io_bus_register_dev(struct kvm *kvm, enum 
kvm_bus bus_idx, gpa_t addr,
struct kvm_io_bus *new_bus, *bus;
 
bus = kvm-buses[bus_idx];
-   if (bus-dev_count  NR_IOBUS_DEVS-1)
-   return -ENOSPC;
 
-   new_bus = kmemdup(bus, sizeof(struct kvm_io_bus), GFP_KERNEL);
+   new_bus = kzalloc(sizeof(*bus) + ((bus-dev_count + 1) *
+ sizeof(struct kvm_io_range)), GFP_KERNEL);
+   if (new_bus)
+   memcpy(new_bus, bus, sizeof(*bus) + (bus-dev_count *
+  sizeof(struct kvm_io_range)));
if (!new_bus)
return -ENOMEM;
kvm_io_bus_insert_dev(new_bus, dev, addr, len);
@@ -2513,26 +2512,28 @@ int kvm_io_bus_unregister_dev(struct kvm *kvm, enum 
kvm_bus bus_idx,
struct kvm_io_bus *new_bus, *bus;
 
bus = kvm-buses[bus_idx];
-
-   new_bus = kmemdup(bus, sizeof(*bus), GFP_KERNEL);
-   if (!new_bus)
-   return -ENOMEM;
-
r = -ENOENT;
-   for (i = 0; i  new_bus-dev_count; i++)
-   if (new_bus-range[i].dev == dev) {
+   for (i = 0; i  bus-dev_count; i++)
+   if (bus-range[i].dev == dev) {
r = 0;
-   new_bus-dev_count--;
-   new_bus-range[i] = new_bus-range[new_bus-dev_count];
-   sort(new_bus-range, new_bus-dev_count,
-sizeof(struct kvm_io_range),
-kvm_io_bus_sort_cmp, NULL);
break;
}
 
-   if (r) {
-   kfree(new_bus);
+   if (r)
return r;
+
+   new_bus = kmemdup(bus, sizeof(*bus) + ((bus-dev_count - 1) *
+ sizeof(struct kvm_io_range)), GFP_KERNEL);
+   if (!new_bus)
+   return -ENOMEM;
+
+   new_bus-dev_count--;
+   /* copy last entry of bus-range to deleted entry spot if
+  deleted entry isn't the last entry of bus-range */
+   if (i != bus-dev_count - 1) {
+   new_bus-range[i] = bus-range[bus-dev_count - 1];
+   sort(new_bus-range, new_bus-dev_count,
+sizeof(struct kvm_io_range), kvm_io_bus_sort_cmp, NULL);
}
 
rcu_assign_pointer(kvm-buses[bus_idx], new_bus);

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Reconciling qemu-kvm and qemu's PIT

2012-02-29 Thread Avi Kivity
On 02/29/2012 12:14 PM, Jan Kiszka wrote:
 On 2012-02-29 10:29, Avi Kivity wrote:
  On 02/28/2012 11:47 PM, Jan Kiszka wrote:
 
  Looks like isa-pit has zero gpio pins, so it fails when crashing.  I
  must have mismerged it, but where is the gpio pin count set?
 
  In pit_initfn. Does -no-kvm-irqchip work fine? It's a bit tricky to
  discuss this without seeing your code.
 
  
  pushed it into qemu-kvm.git usptream-merge
  

 As I assumed: You try to initialize the kvm-pit like the userspace pit,
 that cannot work. Don't run the qdev_connect_gpio_out in pit_init, just
 like my kvm-pit patches does in their kvm_pit_init.


The merge is now in 'next', hopefully not too many regressions.

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: virtio-blk performance regression and qemu-kvm

2012-02-29 Thread Stefan Hajnoczi
On Wed, Feb 29, 2012 at 1:12 PM, Martin Mailand mar...@tuxadero.com wrote:
 Hi Stefan,
 you are right, the performance for the commits 0b9b128530b and 4fefc55ab04d
 are both good.
 What is the best approach to stay in the qemu-kvm.git history?

I didn't know the answer so I asked on #git on freenode:

 charon stefanha: so just tell it that the upstream tip is good
 charon git-bisect assumes you are looking for a single commit C
such that any commit that contains C (in its history) is bad, and any
other commit is good. if you declare up front that upstream does not
contain C, then it won't go looking there
of course if that declaration was wrong, it will give wrong results.

I think there are two approaches:

1. Side-step this issue by bisecting qemu.git instead of qemu-kvm.git.

2. First test qemu-kvm.git origin/upstream-merge and if there is no
performance issue, do: git bisect good origin/upstream-merge.  If
we're luckily this will avoid going down all the qemu.git merge trees
and instead stay focussed on qemu-kvm.git commits.

Stefan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: virtio-blk performance regression and qemu-kvm

2012-02-29 Thread Stefan Hajnoczi
On Wed, Feb 29, 2012 at 1:44 PM, Stefan Hajnoczi stefa...@gmail.com wrote:
 On Wed, Feb 29, 2012 at 1:12 PM, Martin Mailand mar...@tuxadero.com wrote:
 Hi Stefan,
 you are right, the performance for the commits 0b9b128530b and 4fefc55ab04d
 are both good.
 What is the best approach to stay in the qemu-kvm.git history?

 I didn't know the answer so I asked on #git on freenode:

And additional information from Avi:

http://permalink.gmane.org/gmane.comp.emulators.kvm.devel/82851

Hope this helps.

Stefan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-29 Thread Jamal Hadi Salim
On Tue, 2012-02-28 at 20:40 -0800, John Fastabend wrote:

 OK back to this. The last piece is where to put these messages...
 we could take PF_ROUTE:RTM_*NEIGH
 
  PF_ROUTE:RTM_NEWNEIGH - Add a new FDB entry to an offloaded
  switch.
  PF_ROUTE:RTM_DELNEIGH - Delete a FDB entry from an offlaoded
  switch.
  PF_ROUTE:RTM_GETNEIGH - Dumps the embedded FDB table
 

Why RTM_*NEIGH? RTM tends to map to Route/L3 and NEIGH tends to map
to ndisc or ARP both tied to IP address resolution. While both ARP/Ndisc
may play a role in the user space app populating the FDB, i dont think
they are necessary players.
Learning could be via a table entry miss and packet redirect to user
space.
So my suggestion is to use FDB_*ENTRY for names
 
 The neighbor code is using the PF_UNSPEC protocol type so we won't
 collide with these unless someone was using PF_ROUTE and relying on
 falling back to PF_UNSPEC however I couldn't find any programs that
 did this iproute2 certainly doesn't. And the bridge pieces are using
 PF_BRIDGE so no collision there.

They have to be different calls from the calls that talk to the s/ware
bridge. In my opinion, as controversial as this may sound, you need to
be flexible enough that some vendor can replace these calls with
proprietary calls which are more efficient for their hardware. So a
plugin to replace these calls in the user space code would be a 
good idea. Alternatively, you could make that something they do at
the driver level i.e from user space to kernel it is hardware, please
addthistotheFDBtable() call and the implementation of that could be
proprietary to the specific hardware.

[..]

 Also if there are embedded switches with learning capabilities they
 might want to trigger events to user space. In this case having
 a protocol type makes user space a bit easier to manage. I've
 added Lennert so maybe he can comment I think the Marvell chipsets
 might support something along these lines. The SR-IOV chipsets I'm
 aware of _today_ don't do learning. Learning makes the event model
 more plausible.

The other events to consider is aging of hardware entries.

 The other mechanism would be to embed some more attributes into the
 PF_UNSPEC:RTM_XXXLINK msg however I'm thinking that if we want to
 support learning and triggering events then we likely also don't
 want to send these events to every app with RTNLGRP_LINK set.

I think this needs to be a different event message. 
FDB_TABLEMISS? FDB_EXCEPTION?

 Plus there is already a proliferation of LINK attributes and dumping
 the FDB out of this seems a bit much but could be done with some
 bitmasks. Although the current ext_filter_mask u32 doesn't seem to
 be sufficient for events to trigger this.

Dumping the FDB table should be something along the lines of FDB_GET
with the dump flag. It shouldnt tie to the LINK side of things.

cheers,
jamal

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-29 Thread Jamal Hadi Salim
On Tue, 2012-02-28 at 21:14 -0800, John Fastabend wrote:

 Just checked looks like the DSA infrastructure has commands to enable
 STP so guess it is doing learning.

IIRC, Lennert built some of this stuff tied to the kernel.

cheers,
jamal

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] perf/x86: Fix HO/GO counting with SVM disabled

2012-02-29 Thread Joerg Roedel
It turned out that a performance counter on AMD does not
count at all when the GO or HO bit is set in the control
register and SVM is disabled in EFER.

This patch works around this issue by masking out the HO bit
in the performance counter control register when SVM is not
enabled.

The GO bit is not touched because it is only set when the
user wants to count in guest-mode only. So when SVM is
disabled the counter should not run at all and the
not-counting is the intended behaviour.

Cc: Peter Zijlstra a.p.zijls...@chello.nl
Cc: Ingo Molnar mi...@elte.hu
Cc: Avi Kivity a...@redhat.com
Cc: Stephane Eranian eran...@google.com
Cc: David Ahern dsah...@gmail.com
Cc: Gleb Natapov g...@redhat.com
Cc: Robert Richter robert.rich...@amd.com
Cc: sta...@vger.kernel.org # 3.2
Signed-off-by: Joerg Roedel joerg.roe...@amd.com
---
 arch/x86/include/asm/perf_event.h|8 +++
 arch/x86/kernel/cpu/perf_event.h |6 -
 arch/x86/kernel/cpu/perf_event_amd.c |   37 -
 arch/x86/kvm/svm.c   |5 
 4 files changed, 53 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/perf_event.h 
b/arch/x86/include/asm/perf_event.h
index 096c975e..ffad310 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -242,4 +242,12 @@ static inline void perf_get_x86_pmu_capability(struct 
x86_pmu_capability *cap)
 static inline void perf_events_lapic_init(void){ }
 #endif
 
+#if defined(CONFIG_PERF_EVENTS)  defined(CONFIG_CPU_SUP_AMD)
+extern void amd_pmu_enable_virt(void);
+extern void amd_pmu_disable_virt(void);
+#else /* CONFIG_PERF_EVENTS  CONFIG_CPU_SUP_AMD */
+static inline void amd_pmu_enable_virt(void) { }
+static inline void amd_pmu_disable_virt(void) { }
+#endif /* CONFIG_PERF_EVENTS  CONFIG_CPU_SUP_AMD */
+
 #endif /* _ASM_X86_PERF_EVENT_H */
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 8944062..2c581b9 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -148,6 +148,8 @@ struct cpu_hw_events {
 * AMD specific bits
 */
struct amd_nb   *amd_nb;
+   /* Inverted mask of bits to clear in the perf_ctr ctrl registers */
+   u64 perf_ctr_virt_mask;
 
void*kfree_on_online;
 };
@@ -417,9 +419,11 @@ void x86_pmu_disable_all(void);
 static inline void __x86_pmu_enable_event(struct hw_perf_event *hwc,
  u64 enable_mask)
 {
+   u64 disable_mask = __this_cpu_read(cpu_hw_events.perf_ctr_virt_mask);
+
if (hwc-extra_reg.reg)
wrmsrl(hwc-extra_reg.reg, hwc-extra_reg.config);
-   wrmsrl(hwc-config_base, hwc-config | enable_mask);
+   wrmsrl(hwc-config_base, (hwc-config | enable_mask)  ~disable_mask);
 }
 
 void x86_pmu_enable_all(int added);
diff --git a/arch/x86/kernel/cpu/perf_event_amd.c 
b/arch/x86/kernel/cpu/perf_event_amd.c
index 0397b23..67250a5 100644
--- a/arch/x86/kernel/cpu/perf_event_amd.c
+++ b/arch/x86/kernel/cpu/perf_event_amd.c
@@ -1,4 +1,5 @@
 #include linux/perf_event.h
+#include linux/export.h
 #include linux/types.h
 #include linux/init.h
 #include linux/slab.h
@@ -357,7 +358,9 @@ static void amd_pmu_cpu_starting(int cpu)
struct amd_nb *nb;
int i, nb_id;
 
-   if (boot_cpu_data.x86_max_cores  2)
+   cpuc-perf_ctr_virt_mask = AMD_PERFMON_EVENTSEL_HOSTONLY;
+
+   if (boot_cpu_data.x86_max_cores  2 || boot_cpu_data.x86 == 0x15)
return;
 
nb_id = amd_get_nb_id(cpu);
@@ -587,9 +590,9 @@ static __initconst const struct x86_pmu amd_pmu_f15h = {
.put_event_constraints  = amd_put_event_constraints,
 
.cpu_prepare= amd_pmu_cpu_prepare,
-   .cpu_starting   = amd_pmu_cpu_starting,
.cpu_dead   = amd_pmu_cpu_dead,
 #endif
+   .cpu_starting   = amd_pmu_cpu_starting,
 };
 
 __init int amd_pmu_init(void)
@@ -621,3 +624,33 @@ __init int amd_pmu_init(void)
 
return 0;
 }
+
+void amd_pmu_enable_virt(void)
+{
+   struct cpu_hw_events *cpuc = __get_cpu_var(cpu_hw_events);
+
+   cpuc-perf_ctr_virt_mask = 0;
+
+   /* Reload all events */
+   x86_pmu_disable_all();
+   x86_pmu_enable_all(0);
+}
+EXPORT_SYMBOL_GPL(amd_pmu_enable_virt);
+
+void amd_pmu_disable_virt(void)
+{
+   struct cpu_hw_events *cpuc = __get_cpu_var(cpu_hw_events);
+
+   /*
+* We only mask out the Host-only bit so that host-only counting works
+* when SVM is disabled. If someone sets up a guest-only counter when
+* SVM is disabled the Guest-only bits still gets set and the counter
+* will not count anything.
+*/
+   cpuc-perf_ctr_virt_mask = AMD_PERFMON_EVENTSEL_HOSTONLY;
+
+   /* Reload all events */
+   x86_pmu_disable_all();
+   x86_pmu_enable_all(0);
+}
+EXPORT_SYMBOL_GPL(amd_pmu_disable_virt);
diff --git 

Re: [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log()

2012-02-29 Thread Takuya Yoshikawa
Avi Kivity a...@redhat.com wrote:

  We cannot get a complete snapshot without mmu_lock; if the guest faults on
  the Nth page during xchg'ing i = 1, 2, ... then the i = N alone will
  become newer.
 
 Ah, so there is no data corruption (missed dirty bits), just the display
 is updated inconsistently?

Yes, no data corruption and just a matter of ... feeling.

The basic rule I wrote in the comment in this patch should be enough to
not lose dirty bits.

 I don't think we can get a consistent snapshot anyway, since the guest
 can update the framebuffer while userspace is processing it.

Yes, nothing will be broken.  I was just not sure what the API should promise
to the userspace.

To some extent the inconsistency may be felt more than before.

 Understood.  I don't think we can get a consistent vga snapshot without
 stopping the guest, and even then, it depends on how the guest updates
 the framebuffer.

OKay, I will not care about this from now.

  So I want to see how much improvement the proposed API can achieve.
 
  I thought this might be a good GSoC project but ...
 
 It may be too involved for GSoC, the issues are difficult.

I am now checking QEMU code closely - rather readable than before!

Though I can make experimental KVM patches for the student, just changing
QEMU's live migration code seems to be difficult - but not sure now.

Anyway, we should try this before deciding the new API: I mean if I cannot
find anyone who want to try this, no need to be a student, I may do instead
at some point.

Takuya
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] KVM: Resize kvm_io_range array dynamically

2012-02-29 Thread Jan Kiszka
On 2012-02-29 14:30, Amos Kong wrote:
 kvm_io_bus devices are used for ioevent, pit, pic, ioapic,
 coalesced_mmio.
 
 Currently Qemu only emulates one PCI bus, it contains 32 slots,
 one slot contains 8 functions, maximum of supported PCI devices:
  1 * 32 * 8 = 256. The maximum of coalesced mmio zone is 100,
 each zone has an iobus devices. 300 io_bus devices is not enough.
 
 This patch makes the kvm_io_range array can be resized dynamically.

Is there any limit, or can userspace allocate arbitrary amounts of
kernel memory this way?

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Reconciling qemu-kvm and qemu's PIT

2012-02-29 Thread Jan Kiszka
On 2012-02-29 14:45, Avi Kivity wrote:
 On 02/29/2012 12:14 PM, Jan Kiszka wrote:
 On 2012-02-29 10:29, Avi Kivity wrote:
 On 02/28/2012 11:47 PM, Jan Kiszka wrote:

 Looks like isa-pit has zero gpio pins, so it fails when crashing.  I
 must have mismerged it, but where is the gpio pin count set?

 In pit_initfn. Does -no-kvm-irqchip work fine? It's a bit tricky to
 discuss this without seeing your code.


 pushed it into qemu-kvm.git usptream-merge


 As I assumed: You try to initialize the kvm-pit like the userspace pit,
 that cannot work. Don't run the qdev_connect_gpio_out in pit_init, just
 like my kvm-pit patches does in their kvm_pit_init.

 
 The merge is now in 'next', hopefully not too many regressions.

Hope we can quickly finish the switch to the new kvm-pit.

When do you plan to push or rebase uq/master? Then I could check/refresh
my kvm-pit series on top of it.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Reconciling qemu-kvm and qemu's PIT

2012-02-29 Thread Avi Kivity
On 02/29/2012 04:24 PM, Jan Kiszka wrote:
 On 2012-02-29 14:45, Avi Kivity wrote:
  On 02/29/2012 12:14 PM, Jan Kiszka wrote:
  On 2012-02-29 10:29, Avi Kivity wrote:
  On 02/28/2012 11:47 PM, Jan Kiszka wrote:
 
  Looks like isa-pit has zero gpio pins, so it fails when crashing.  I
  must have mismerged it, but where is the gpio pin count set?
 
  In pit_initfn. Does -no-kvm-irqchip work fine? It's a bit tricky to
  discuss this without seeing your code.
 
 
  pushed it into qemu-kvm.git usptream-merge
 
 
  As I assumed: You try to initialize the kvm-pit like the userspace pit,
  that cannot work. Don't run the qdev_connect_gpio_out in pit_init, just
  like my kvm-pit patches does in their kvm_pit_init.
 
  
  The merge is now in 'next', hopefully not too many regressions.

 Hope we can quickly finish the switch to the new kvm-pit.

 When do you plan to push or rebase uq/master? Then I could check/refresh
 my kvm-pit series on top of it.

Real soon... I hoped to do it after the upstream merge, but it doesn't
pass autotest due to a screendump regression.

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH-WIP 01/13] xen/arm: use r12 to pass the hypercall number to the hypervisor

2012-02-29 Thread Ian Campbell
On Wed, 2012-02-29 at 12:58 +, Dave Martin wrote:
 On Wed, Feb 29, 2012 at 09:56:02AM +, Ian Campbell wrote:
  On Wed, 2012-02-29 at 09:34 +, Dave Martin wrote:
   On Tue, Feb 28, 2012 at 12:28:29PM +, Stefano Stabellini wrote:
  
I don't have a very strong opinion on which register we should use, but
I would like to avoid r7 if it is already actively used by gcc.
   
   But there is no framepointer for Thumb-2 code (?)
  
  Peter Maydell suggested there was:
   r7 is (used by gcc as) the Thumb frame pointer; I don't know if this
   makes it worth avoiding in this context.
  
  Sounds like it might be a gcc-ism, possibly a non-default option?
  
  Anyway, I think r12 will be fine for our purposes so the point is rather
  moot.
 
 Just had a chat with some tools guys -- apparently, when passing register
 arguments to gcc inline asms there really isn't a guarantee that those
 variables will be in the expected registers on entry to the inline asm.
 
 If gcc reorders other function calls or other code around the inline asm
 (which it can do, except under certain controlled situations), then
 intervening code can clobber any registers in general.
 
 Or, to summarise another way, there is no way to control which register
 is used to pass something to an inline asm in general (often we get away
 with this, and there are a lot of inline asms in the kernel that assume
 it works, but the more you inline the more likely you are to get nasty
 surprises).  There is no workaroud, except on some architectures where
 special asm constraints allow specific individual registers to be
 specified for operands (i386 for example).

I had assumed I just couldn't find the right syntax. Useful to know that
I couldn't find it because it doesn't exist!

 If you need a specific register, this means that you must set up that
 register explicitly inside the asm if you want a guarantee that the
 code will work:
 
   asm volatile (
   movw   r12, %[hvc_num]\n\t

Is gcc (or gas?) smart enough to optimise this away if it turns out that
%[hvc_num] == r12?

   ...
   hvc#0
   :: [hvc_num] i (NUMBER) : r12
   );
 
 Of course, if you need to set up more than about 5 or 6 registers in
 this way, the doubled register footprint means that the compiler will
 have to start spilling stuff to the stack.
 
 
 This is the kind of problem which goes away when out-of-lining the
 hvc wrapper behind a C function interface, since the ABI then provides
 guarantees about how values are mershaled into and out of that code.

I don't think anything would stop gcc from clobbering an argument
register right on function entry (e..g it might move r0 to r8 and
clobber r0, for whatever reason), so that they are no longer where you
expect them to be when you hit the asm. Unlikely perhaps but no more so
than the other issues you've raised?

Or did you mean out-of-line as in written in a .S file as well as out
of line?

 Notwithstanding the above, even if we do make theoretically unsound
 (but often true) assumptions about inline asms, ARM will be no worse
 than other arches in this respect.

This is true.

 Other than serving as a reminder that inline asm is a deep can of
 worms, this doesn't really give us a neat solution...

How are system calls implemented on the userspace side? I confess I
don't know what the ARM syscall ABI looks like -- is it all registers or
is some of it on the stack? It sounds like the solution ought to be
pretty similar though.

Ian.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH-WIP 01/13] xen/arm: use r12 to pass the hypercall number to the hypervisor

2012-02-29 Thread Stefano Stabellini
On Wed, 29 Feb 2012, Dave Martin wrote:
 On Wed, Feb 29, 2012 at 09:56:02AM +, Ian Campbell wrote:
  On Wed, 2012-02-29 at 09:34 +, Dave Martin wrote:
   On Tue, Feb 28, 2012 at 12:28:29PM +, Stefano Stabellini wrote:
  
I don't have a very strong opinion on which register we should use, but
I would like to avoid r7 if it is already actively used by gcc.
   
   But there is no framepointer for Thumb-2 code (?)
  
  Peter Maydell suggested there was:
   r7 is (used by gcc as) the Thumb frame pointer; I don't know if this
   makes it worth avoiding in this context.
  
  Sounds like it might be a gcc-ism, possibly a non-default option?
  
  Anyway, I think r12 will be fine for our purposes so the point is rather
  moot.
 
 Just had a chat with some tools guys -- apparently, when passing register
 arguments to gcc inline asms there really isn't a guarantee that those
 variables will be in the expected registers on entry to the inline asm.
 
 If gcc reorders other function calls or other code around the inline asm
 (which it can do, except under certain controlled situations), then
 intervening code can clobber any registers in general.
 
 Or, to summarise another way, there is no way to control which register
 is used to pass something to an inline asm in general (often we get away
 with this, and there are a lot of inline asms in the kernel that assume
 it works, but the more you inline the more likely you are to get nasty
 surprises).  There is no workaroud, except on some architectures where
 special asm constraints allow specific individual registers to be
 specified for operands (i386 for example).
 
 If you need a specific register, this means that you must set up that
 register explicitly inside the asm if you want a guarantee that the
 code will work:
 
   asm volatile (
   movw   r12, %[hvc_num]\n\t
   ...
   hvc#0
   :: [hvc_num] i (NUMBER) : r12
   );
 

OK, we can arrange the hypercall code to be like that.
Also with your patch series it would be _hvc because of the .macro,
right?



 This is the kind of problem which goes away when out-of-lining the
 hvc wrapper behind a C function interface, since the ABI then provides
 guarantees about how values are mershaled into and out of that code.

Do you mean implementing the entire HYPERVISOR_example_op in assembly
and calling it from C?
Because I guess that gcc would still be free to mess with the registers
between the C function entry point and any inline assembly code.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4] KVM: Allow host IRQ sharing for assigned PCI 2.3 devices

2012-02-29 Thread Alex Williamson
On Tue, 2012-02-28 at 14:19 +0100, Jan Kiszka wrote:
 PCI 2.3 allows to generically disable IRQ sources at device level. This
 enables us to share legacy IRQs of such devices with other host devices
 when passing them to a guest.
 
 The new IRQ sharing feature introduced here is optional, user space has
 to request it explicitly.

Is this really true?  Looks like it's automatic.

  Moreover, user space can inform us about its
 view of PCI_COMMAND_INTX_DISABLE so that we can avoid unmasking the
 interrupt and signaling it if the guest masked it via the virtualized
 PCI config space.
 
 Signed-off-by: Jan Kiszka jan.kis...@siemens.com
 ---
 
 Changes in v4:
  - Integrated doc changes as proposed by Alex
  - Fixed deassign_host_irq /wrt MSI
  - Fixed kvm_vm_ioctl_set_pci_irq_mask /wrt INTx unmasking of non-2.3
devices
 
  Documentation/virtual/kvm/api.txt |   41 +++
  arch/x86/kvm/x86.c|1 +
  include/linux/kvm.h   |6 +
  include/linux/kvm_host.h  |2 +
  virt/kvm/assigned-dev.c   |  209 +++-
  5 files changed, 230 insertions(+), 29 deletions(-)
[snip]
 +static int kvm_vm_ioctl_set_pci_irq_mask(struct kvm *kvm,
 + struct kvm_assigned_pci_dev *assigned_dev)
 +{
 + int r = 0;
 + struct kvm_assigned_dev_kernel *match;
 +
 + mutex_lock(kvm-lock);
 +
 + match = kvm_find_assigned_dev(kvm-arch.assigned_dev_head,
 +   assigned_dev-assigned_dev_id);
 + if (!match) {
 + r = -ENODEV;
 + goto out;
 + }
 +
 + mutex_lock(match-intx_mask_lock);
 +
 + match-flags = ~KVM_DEV_ASSIGN_MASK_INTX;
 + match-flags |= assigned_dev-flags  KVM_DEV_ASSIGN_MASK_INTX;
 +
 + if (match-irq_requested_type  KVM_DEV_IRQ_GUEST_INTX) {
 + if (assigned_dev-flags  KVM_DEV_ASSIGN_MASK_INTX) {
 + kvm_set_irq(match-kvm, match-irq_source_id,
 + match-guest_irq, 0);
 + /*
 +  * Masking at hardware-level is performed on demand,
 +  * i.e. when an IRQ actually arrives at the host.
 +  */
 + } else if (!(assigned_dev-flags  KVM_DEV_ASSIGN_PCI_2_3)) {
 + /*
 +  * Unmask the IRQ line if required. Unmasking at
 +  * device level will be performed by user space.
 +  */
 + spin_lock_irq(match-intx_lock);
 + if (match-host_irq_disabled) {
 + enable_irq(match-host_irq);
 + match-host_irq_disabled = false;
 + }
 + spin_unlock_irq(match-intx_lock);

This still looks broken.  If we start with a PCI 2.3 device with INTx
disabled, how does this ever kick start to get another interrupt?
Shouldn't we just call kvm_assigned_dev_ack_irq() here and handle both
INTx modes?  Thanks,

Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] KVM: Resize kvm_io_range array dynamically

2012-02-29 Thread Amos Kong
- Original Message -
 On 2012-02-29 14:30, Amos Kong wrote:
  kvm_io_bus devices are used for ioevent, pit, pic, ioapic,
  coalesced_mmio.
  
  Currently Qemu only emulates one PCI bus, it contains 32 slots,
  one slot contains 8 functions, maximum of supported PCI devices:
   1 * 32 * 8 = 256. The maximum of coalesced mmio zone is 100,
  each zone has an iobus devices. 300 io_bus devices is not enough.
  
  This patch makes the kvm_io_range array can be resized dynamically.
 
 Is there any limit, or can userspace allocate arbitrary amounts of
 kernel memory this way?

Hi Jan,

There is a fixed array in linux-2.6/include/linux/kvm_host.h,
we can only register 300 devices.

struct kvm_io_range {
gpa_t addr;   
int len;
struct kvm_io_device *dev;
};

struct kvm_io_bus {
int   dev_count;
#define NR_IOBUS_DEVS 300
struct kvm_io_range range[NR_IOBUS_DEVS];
};
  ^^

Amos.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] KVM: Resize kvm_io_range array dynamically

2012-02-29 Thread Jan Kiszka
On 2012-02-29 16:22, Amos Kong wrote:
 - Original Message -
 On 2012-02-29 14:30, Amos Kong wrote:
 kvm_io_bus devices are used for ioevent, pit, pic, ioapic,
 coalesced_mmio.

 Currently Qemu only emulates one PCI bus, it contains 32 slots,
 one slot contains 8 functions, maximum of supported PCI devices:
  1 * 32 * 8 = 256. The maximum of coalesced mmio zone is 100,
 each zone has an iobus devices. 300 io_bus devices is not enough.

 This patch makes the kvm_io_range array can be resized dynamically.

 Is there any limit, or can userspace allocate arbitrary amounts of
 kernel memory this way?
 
 Hi Jan,
 
 There is a fixed array in linux-2.6/include/linux/kvm_host.h,
 we can only register 300 devices.
 
 struct kvm_io_range {
 gpa_t addr;   
 int len;
 struct kvm_io_device *dev;
 };
 
 struct kvm_io_bus {
 int   dev_count;
 #define NR_IOBUS_DEVS 300
 struct kvm_io_range range[NR_IOBUS_DEVS];
 };
   ^^

Right. But doesn't your patch remove precisely this limit? So what
limits userspace now? To register 300 million devices...?

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4] KVM: Allow host IRQ sharing for assigned PCI 2.3 devices

2012-02-29 Thread Jan Kiszka
On 2012-02-29 16:22, Alex Williamson wrote:
 On Tue, 2012-02-28 at 14:19 +0100, Jan Kiszka wrote:
 PCI 2.3 allows to generically disable IRQ sources at device level. This
 enables us to share legacy IRQs of such devices with other host devices
 when passing them to a guest.

 The new IRQ sharing feature introduced here is optional, user space has
 to request it explicitly.
 
 Is this really true?  Looks like it's automatic.

It is true: no IRQ sharing without userspace setting
KVM_DEV_ASSIGN_PCI_2_3 during KVM_ASSIGN_PCI_DEVICE. My qemu-kvm patches
will allow to control this, but make it default on (reasons for this
will be provided in that context).

 
  Moreover, user space can inform us about its
 view of PCI_COMMAND_INTX_DISABLE so that we can avoid unmasking the
 interrupt and signaling it if the guest masked it via the virtualized
 PCI config space.

 Signed-off-by: Jan Kiszka jan.kis...@siemens.com
 ---

 Changes in v4:
  - Integrated doc changes as proposed by Alex
  - Fixed deassign_host_irq /wrt MSI
  - Fixed kvm_vm_ioctl_set_pci_irq_mask /wrt INTx unmasking of non-2.3
devices

  Documentation/virtual/kvm/api.txt |   41 +++
  arch/x86/kvm/x86.c|1 +
  include/linux/kvm.h   |6 +
  include/linux/kvm_host.h  |2 +
  virt/kvm/assigned-dev.c   |  209 
 +++-
  5 files changed, 230 insertions(+), 29 deletions(-)
 [snip]
 +static int kvm_vm_ioctl_set_pci_irq_mask(struct kvm *kvm,
 +struct kvm_assigned_pci_dev *assigned_dev)
 +{
 +int r = 0;
 +struct kvm_assigned_dev_kernel *match;
 +
 +mutex_lock(kvm-lock);
 +
 +match = kvm_find_assigned_dev(kvm-arch.assigned_dev_head,
 +  assigned_dev-assigned_dev_id);
 +if (!match) {
 +r = -ENODEV;
 +goto out;
 +}
 +
 +mutex_lock(match-intx_mask_lock);
 +
 +match-flags = ~KVM_DEV_ASSIGN_MASK_INTX;
 +match-flags |= assigned_dev-flags  KVM_DEV_ASSIGN_MASK_INTX;
 +
 +if (match-irq_requested_type  KVM_DEV_IRQ_GUEST_INTX) {
 +if (assigned_dev-flags  KVM_DEV_ASSIGN_MASK_INTX) {
 +kvm_set_irq(match-kvm, match-irq_source_id,
 +match-guest_irq, 0);
 +/*
 + * Masking at hardware-level is performed on demand,
 + * i.e. when an IRQ actually arrives at the host.
 + */
 +} else if (!(assigned_dev-flags  KVM_DEV_ASSIGN_PCI_2_3)) {
 +/*
 + * Unmask the IRQ line if required. Unmasking at
 + * device level will be performed by user space.
 + */
 +spin_lock_irq(match-intx_lock);
 +if (match-host_irq_disabled) {
 +enable_irq(match-host_irq);
 +match-host_irq_disabled = false;
 +}
 +spin_unlock_irq(match-intx_lock);
 
 This still looks broken.  If we start with a PCI 2.3 device with INTx
 disabled, how does this ever kick start to get another interrupt?
 Shouldn't we just call kvm_assigned_dev_ack_irq() here and handle both
 INTx modes?  Thanks,

No, that would be wrong. The IRQ must be delivered to the guest first.

And that will happen with 2.3 once userspace unmasks the device INTx
and, thus, triggers another host-side IRQ. Same for non-2.3 devices,
just that this happens on enable_irq.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] kvm: fill in padding to help valgrind

2012-02-29 Thread Michael S. Tsirkin
valgrind warns about padding fields which are passed
to vcpu ioctls uninitialized.
This is not an error in practice because kvm ignored padding.
Since the ioctls in question are off data path and
the cost is zero anyway, initialize padding to 0
to suppress these errors.

Signed-off-by: Michael S. Tsirkin m...@redhat.com

---
 kvm-all.c |2 ++
 target-i386/kvm.c |6 ++
 2 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/kvm-all.c b/kvm-all.c
index c58c77b..3bc0eb3 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -447,6 +447,7 @@ int kvm_coalesce_mmio_region(target_phys_addr_t start, 
ram_addr_t size)
 
 zone.addr = start;
 zone.size = size;
+zone.pad = 0;
 
 ret = kvm_vm_ioctl(s, KVM_REGISTER_COALESCED_MMIO, zone);
 }
@@ -464,6 +465,7 @@ int kvm_uncoalesce_mmio_region(target_phys_addr_t start, 
ram_addr_t size)
 
 zone.addr = start;
 zone.size = size;
+zone.pad = 0;
 
 ret = kvm_vm_ioctl(s, KVM_UNREGISTER_COALESCED_MMIO, zone);
 }
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 981192d..285cf55 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -555,6 +555,7 @@ int kvm_arch_init_vcpu(CPUState *env)
 
 qemu_add_vm_change_state_handler(cpu_update_state, env);
 
+cpuid_data.cpuid.padding = 0;
 r = kvm_vcpu_ioctl(env, KVM_SET_CPUID2, cpuid_data);
 if (r) {
 return r;
@@ -740,6 +741,7 @@ static void set_seg(struct kvm_segment *lhs, const 
SegmentCache *rhs)
 lhs-g = (flags  DESC_G_MASK) != 0;
 lhs-avl = (flags  DESC_AVL_MASK) != 0;
 lhs-unusable = 0;
+lhs-padding = 0;
 }
 
 static void get_seg(SegmentCache *lhs, const struct kvm_segment *rhs)
@@ -919,8 +921,10 @@ static int kvm_put_sregs(CPUState *env)
 
 sregs.idt.limit = env-idt.limit;
 sregs.idt.base = env-idt.base;
+memset(sregs.idt.padding, 0, sizeof sregs.idt.padding);
 sregs.gdt.limit = env-gdt.limit;
 sregs.gdt.base = env-gdt.base;
+memset(sregs.gdt.padding, 0, sizeof sregs.gdt.padding);
 
 sregs.cr0 = env-cr[0];
 sregs.cr2 = env-cr[2];
@@ -1392,6 +1396,7 @@ static int kvm_put_vcpu_events(CPUState *env, int level)
 events.exception.nr = env-exception_injected;
 events.exception.has_error_code = env-has_error_code;
 events.exception.error_code = env-error_code;
+events.exception.pad = 0;
 
 events.interrupt.injected = (env-interrupt_injected = 0);
 events.interrupt.nr = env-interrupt_injected;
@@ -1400,6 +1405,7 @@ static int kvm_put_vcpu_events(CPUState *env, int level)
 events.nmi.injected = env-nmi_injected;
 events.nmi.pending = env-nmi_pending;
 events.nmi.masked = !!(env-hflags2  HF2_NMI_MASK);
+events.nmi.pad = 0;
 
 events.sipi_vector = env-sipi_vector;
 
-- 
1.7.9.111.gf3fb0
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4] KVM: Allow host IRQ sharing for assigned PCI 2.3 devices

2012-02-29 Thread Alex Williamson
On Wed, 2012-02-29 at 16:38 +0100, Jan Kiszka wrote:
 On 2012-02-29 16:22, Alex Williamson wrote:
  On Tue, 2012-02-28 at 14:19 +0100, Jan Kiszka wrote:
  PCI 2.3 allows to generically disable IRQ sources at device level. This
  enables us to share legacy IRQs of such devices with other host devices
  when passing them to a guest.
 
  The new IRQ sharing feature introduced here is optional, user space has
  to request it explicitly.
  
  Is this really true?  Looks like it's automatic.
 
 It is true: no IRQ sharing without userspace setting
 KVM_DEV_ASSIGN_PCI_2_3 during KVM_ASSIGN_PCI_DEVICE. My qemu-kvm patches
 will allow to control this, but make it default on (reasons for this
 will be provided in that context).

Ah right, we only clear it if not available.  Thanks.

  
   Moreover, user space can inform us about its
  view of PCI_COMMAND_INTX_DISABLE so that we can avoid unmasking the
  interrupt and signaling it if the guest masked it via the virtualized
  PCI config space.
 
  Signed-off-by: Jan Kiszka jan.kis...@siemens.com
  ---
 
  Changes in v4:
   - Integrated doc changes as proposed by Alex
   - Fixed deassign_host_irq /wrt MSI
   - Fixed kvm_vm_ioctl_set_pci_irq_mask /wrt INTx unmasking of non-2.3
 devices
 
   Documentation/virtual/kvm/api.txt |   41 +++
   arch/x86/kvm/x86.c|1 +
   include/linux/kvm.h   |6 +
   include/linux/kvm_host.h  |2 +
   virt/kvm/assigned-dev.c   |  209 
  +++-
   5 files changed, 230 insertions(+), 29 deletions(-)
  [snip]
  +static int kvm_vm_ioctl_set_pci_irq_mask(struct kvm *kvm,
  +  struct kvm_assigned_pci_dev *assigned_dev)
  +{
  +  int r = 0;
  +  struct kvm_assigned_dev_kernel *match;
  +
  +  mutex_lock(kvm-lock);
  +
  +  match = kvm_find_assigned_dev(kvm-arch.assigned_dev_head,
  +assigned_dev-assigned_dev_id);
  +  if (!match) {
  +  r = -ENODEV;
  +  goto out;
  +  }
  +
  +  mutex_lock(match-intx_mask_lock);
  +
  +  match-flags = ~KVM_DEV_ASSIGN_MASK_INTX;
  +  match-flags |= assigned_dev-flags  KVM_DEV_ASSIGN_MASK_INTX;
  +
  +  if (match-irq_requested_type  KVM_DEV_IRQ_GUEST_INTX) {
  +  if (assigned_dev-flags  KVM_DEV_ASSIGN_MASK_INTX) {
  +  kvm_set_irq(match-kvm, match-irq_source_id,
  +  match-guest_irq, 0);
  +  /*
  +   * Masking at hardware-level is performed on demand,
  +   * i.e. when an IRQ actually arrives at the host.
  +   */
  +  } else if (!(assigned_dev-flags  KVM_DEV_ASSIGN_PCI_2_3)) {
  +  /*
  +   * Unmask the IRQ line if required. Unmasking at
  +   * device level will be performed by user space.
  +   */
  +  spin_lock_irq(match-intx_lock);
  +  if (match-host_irq_disabled) {
  +  enable_irq(match-host_irq);
  +  match-host_irq_disabled = false;
  +  }
  +  spin_unlock_irq(match-intx_lock);
  
  This still looks broken.  If we start with a PCI 2.3 device with INTx
  disabled, how does this ever kick start to get another interrupt?
  Shouldn't we just call kvm_assigned_dev_ack_irq() here and handle both
  INTx modes?  Thanks,
 
 No, that would be wrong. The IRQ must be delivered to the guest first.
 
 And that will happen with 2.3 once userspace unmasks the device INTx
 and, thus, triggers another host-side IRQ. Same for non-2.3 devices,
 just that this happens on enable_irq.

Ok, got it.  I was tempted to write in the revised description that
userspace could skip the device command register update if INTx disable
is the only change, but because of the way this works, that is clearly
not the case.  Thanks.

Acked-by: Alex Williamson alex.william...@redhat.com


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] KVM: Resize kvm_io_range array dynamically

2012-02-29 Thread Amos Kong
- Original Message -
 On 2012-02-29 16:22, Amos Kong wrote:
  - Original Message -
  On 2012-02-29 14:30, Amos Kong wrote:
  kvm_io_bus devices are used for ioevent, pit, pic, ioapic,
  coalesced_mmio.
 
  Currently Qemu only emulates one PCI bus, it contains 32 slots,
  one slot contains 8 functions, maximum of supported PCI devices:
   1 * 32 * 8 = 256. The maximum of coalesced mmio zone is 100,
  each zone has an iobus devices. 300 io_bus devices is not enough.
 
  This patch makes the kvm_io_range array can be resized
  dynamically.
 
  Is there any limit, or can userspace allocate arbitrary amounts of
  kernel memory this way?
  
  Hi Jan,
  
  There is a fixed array in linux-2.6/include/linux/kvm_host.h,
  we can only register 300 devices.
  
  struct kvm_io_range {
  gpa_t addr;
  int len;
  struct kvm_io_device *dev;
  };
  
  struct kvm_io_bus {
  int   dev_count;
  #define NR_IOBUS_DEVS 300
  struct kvm_io_range range[NR_IOBUS_DEVS];
  };
^^
 
 Right. But doesn't your patch remove precisely this limit? So what
 limits userspace now? To register 300 million devices...?

Hi Jan,

It seems we need to reserve the limitation in kvm_host.h

#define NR_IOBUS_DEVS 600

/* Currently Qemu only emulates one PCI bus, it contains 32 slots,
one slot contains 8 functions. Only 29 slots can be used to
add multiple function devices. Maximum of supported PCI devices:
29 * 8 = 232. Each virtio-blk device needs 1 iobus device,
each virtio-net(vhost) device requires 2 such devices to service
notifications (ioevent) from rx/tx queues. 
The maximum of coalesced mmio zone is 100, each zone has an iobus
devices. ioevent, pit, ioapic take less iobus devices.

So we can set max limitation to 600. */

- check limit when register dev 

virt/kvm/kvm_main.c:

/* Caller must hold slots_lock. */
int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
int len, struct kvm_io_device *dev)
{
struct kvm_io_bus *new_bus, *bus;

bus = kvm-buses[bus_idx];
if (bus-dev_count  NR_IOBUS_DEVS - 1)   // can only register 600 
devices
return -ENOSPC;

Amos.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] perf/x86: Fix HO/GO counting with SVM disabled

2012-02-29 Thread Avi Kivity
On 02/29/2012 03:57 PM, Joerg Roedel wrote:
 It turned out that a performance counter on AMD does not
 count at all when the GO or HO bit is set in the control
 register and SVM is disabled in EFER.

 This patch works around this issue by masking out the HO bit
 in the performance counter control register when SVM is not
 enabled.

 The GO bit is not touched because it is only set when the
 user wants to count in guest-mode only. So when SVM is
 disabled the counter should not run at all and the
 not-counting is the intended behaviour.

 diff --git a/arch/x86/kernel/cpu/perf_event_amd.c 
 b/arch/x86/kernel/cpu/perf_event_amd.c
 index 0397b23..67250a5 100644
 --- a/arch/x86/kernel/cpu/perf_event_amd.c
 +++ b/arch/x86/kernel/cpu/perf_event_amd.c
 @@ -1,4 +1,5 @@
  #include linux/perf_event.h
 +#include linux/export.h
  #include linux/types.h
  #include linux/init.h
  #include linux/slab.h
 @@ -357,7 +358,9 @@ static void amd_pmu_cpu_starting(int cpu)
   struct amd_nb *nb;
   int i, nb_id;
  
 - if (boot_cpu_data.x86_max_cores  2)
 + cpuc-perf_ctr_virt_mask = AMD_PERFMON_EVENTSEL_HOSTONLY;
 +
 + if (boot_cpu_data.x86_max_cores  2 || boot_cpu_data.x86 == 0x15)
   return;

Why this (boot_cpu_data.x86 == 0x15) change?



-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] perf/x86: Fix HO/GO counting with SVM disabled

2012-02-29 Thread Gleb Natapov
On Wed, Feb 29, 2012 at 07:00:09PM +0200, Avi Kivity wrote:
 On 02/29/2012 03:57 PM, Joerg Roedel wrote:
  It turned out that a performance counter on AMD does not
  count at all when the GO or HO bit is set in the control
  register and SVM is disabled in EFER.
 
  This patch works around this issue by masking out the HO bit
  in the performance counter control register when SVM is not
  enabled.
 
  The GO bit is not touched because it is only set when the
  user wants to count in guest-mode only. So when SVM is
  disabled the counter should not run at all and the
  not-counting is the intended behaviour.
 
  diff --git a/arch/x86/kernel/cpu/perf_event_amd.c 
  b/arch/x86/kernel/cpu/perf_event_amd.c
  index 0397b23..67250a5 100644
  --- a/arch/x86/kernel/cpu/perf_event_amd.c
  +++ b/arch/x86/kernel/cpu/perf_event_amd.c
  @@ -1,4 +1,5 @@
   #include linux/perf_event.h
  +#include linux/export.h
   #include linux/types.h
   #include linux/init.h
   #include linux/slab.h
  @@ -357,7 +358,9 @@ static void amd_pmu_cpu_starting(int cpu)
  struct amd_nb *nb;
  int i, nb_id;
   
  -   if (boot_cpu_data.x86_max_cores  2)
  +   cpuc-perf_ctr_virt_mask = AMD_PERFMON_EVENTSEL_HOSTONLY;
  +
  +   if (boot_cpu_data.x86_max_cores  2 || boot_cpu_data.x86 == 0x15)
  return;
 
 Why this (boot_cpu_data.x86 == 0x15) change?
 
I think it is related to .cpu_starting callback now available in
amd_pmu_f15h where previously it wasn't.

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] perf/x86: Fix HO/GO counting with SVM disabled

2012-02-29 Thread Joerg Roedel
On Wed, Feb 29, 2012 at 07:00:09PM +0200, Avi Kivity wrote:
 On 02/29/2012 03:57 PM, Joerg Roedel wrote:
  It turned out that a performance counter on AMD does not
  count at all when the GO or HO bit is set in the control
  register and SVM is disabled in EFER.
 
  This patch works around this issue by masking out the HO bit
  in the performance counter control register when SVM is not
  enabled.
 
  The GO bit is not touched because it is only set when the
  user wants to count in guest-mode only. So when SVM is
  disabled the counter should not run at all and the
  not-counting is the intended behaviour.
 
  diff --git a/arch/x86/kernel/cpu/perf_event_amd.c 
  b/arch/x86/kernel/cpu/perf_event_amd.c
  index 0397b23..67250a5 100644
  --- a/arch/x86/kernel/cpu/perf_event_amd.c
  +++ b/arch/x86/kernel/cpu/perf_event_amd.c
  @@ -1,4 +1,5 @@
   #include linux/perf_event.h
  +#include linux/export.h
   #include linux/types.h
   #include linux/init.h
   #include linux/slab.h
  @@ -357,7 +358,9 @@ static void amd_pmu_cpu_starting(int cpu)
  struct amd_nb *nb;
  int i, nb_id;
   
  -   if (boot_cpu_data.x86_max_cores  2)
  +   cpuc-perf_ctr_virt_mask = AMD_PERFMON_EVENTSEL_HOSTONLY;
  +
  +   if (boot_cpu_data.x86_max_cores  2 || boot_cpu_data.x86 == 0x15)
  return;
 
 Why this (boot_cpu_data.x86 == 0x15) change?

This is because this function did not run on Fam15h before but now it
has to so that cpuc-perf_ctr_virt_mask is initialized. The other stuff
done in this function is setup for northbridge counter. These are not
yet implemented for Fam15h CPUs so this setup must not run on those
CPUs. Therefore the check was added.
Once northbridge counters are implemented for Fam15h this check can go
away again.


Joerg


-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] perf/x86: Fix HO/GO counting with SVM disabled

2012-02-29 Thread Avi Kivity
On 02/29/2012 07:05 PM, Joerg Roedel wrote:
 On Wed, Feb 29, 2012 at 07:00:09PM +0200, Avi Kivity wrote:
  On 02/29/2012 03:57 PM, Joerg Roedel wrote:
   It turned out that a performance counter on AMD does not
   count at all when the GO or HO bit is set in the control
   register and SVM is disabled in EFER.
  
   This patch works around this issue by masking out the HO bit
   in the performance counter control register when SVM is not
   enabled.
  
   The GO bit is not touched because it is only set when the
   user wants to count in guest-mode only. So when SVM is
   disabled the counter should not run at all and the
   not-counting is the intended behaviour.
  
   diff --git a/arch/x86/kernel/cpu/perf_event_amd.c 
   b/arch/x86/kernel/cpu/perf_event_amd.c
   index 0397b23..67250a5 100644
   --- a/arch/x86/kernel/cpu/perf_event_amd.c
   +++ b/arch/x86/kernel/cpu/perf_event_amd.c
   @@ -1,4 +1,5 @@
#include linux/perf_event.h
   +#include linux/export.h
#include linux/types.h
#include linux/init.h
#include linux/slab.h
   @@ -357,7 +358,9 @@ static void amd_pmu_cpu_starting(int cpu)
 struct amd_nb *nb;
 int i, nb_id;

   - if (boot_cpu_data.x86_max_cores  2)
   + cpuc-perf_ctr_virt_mask = AMD_PERFMON_EVENTSEL_HOSTONLY;
   +
   + if (boot_cpu_data.x86_max_cores  2 || boot_cpu_data.x86 == 0x15)
 return;
  
  Why this (boot_cpu_data.x86 == 0x15) change?

 This is because this function did not run on Fam15h before but now it
 has to so that cpuc-perf_ctr_virt_mask is initialized. The other stuff
 done in this function is setup for northbridge counter. These are not
 yet implemented for Fam15h CPUs so this setup must not run on those
 CPUs. Therefore the check was added.
 Once northbridge counters are implemented for Fam15h this check can go
 away again.


Thanks.  Ingo, this had better go through tip.git since most of the
changes are perf related.

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] perf/x86: Fix HO/GO counting with SVM disabled

2012-02-29 Thread Peter Zijlstra
On Wed, 2012-02-29 at 18:05 +0100, Joerg Roedel wrote:
 Once northbridge counters are implemented for Fam15h this check can go
 away again. 

Nope, northbridge on fam15 is completely disjoint from the regular pmu
and should thus be a separate driver.

The only reason the old amd driver has them both is because how they
shared the registers so there is a shared resource to manage.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-29 Thread John Fastabend
On 2/29/2012 5:56 AM, Jamal Hadi Salim wrote:
 On Tue, 2012-02-28 at 20:40 -0800, John Fastabend wrote:
 
 OK back to this. The last piece is where to put these messages...
 we could take PF_ROUTE:RTM_*NEIGH

  PF_ROUTE:RTM_NEWNEIGH - Add a new FDB entry to an offloaded
  switch.
  PF_ROUTE:RTM_DELNEIGH - Delete a FDB entry from an offlaoded
  switch.
  PF_ROUTE:RTM_GETNEIGH - Dumps the embedded FDB table

 
 Why RTM_*NEIGH? RTM tends to map to Route/L3 and NEIGH tends to map
 to ndisc or ARP both tied to IP address resolution. While both ARP/Ndisc
 may play a role in the user space app populating the FDB, i dont think
 they are necessary players.
 Learning could be via a table entry miss and packet redirect to user
 space.
 So my suggestion is to use FDB_*ENTRY for names
  

Well I think NETLINK_ROUTE is the most correct type to use in this
case. Per netlink.h its for routing and device hooks.

#define NETLINK_ROUTE   0   /* Routing/device hook  
*/

And NETLINK_ROUTE msg_types use the RTM_* prefix. The _*NEIGH postfix
were merely a copy from the SW BRIDGE code paths. How about,

PF_BRIDGE:RTM_FDB_NEWENTRY
PF_BRIDGE:RTM_FDB_DELENTRY
PF_BRIDGE:RTM_FDB_GETENTRY

And a new group RTNLGRP_FDB. Also using NETLINK_ROUTE gives the correct
rtnl locking semantics for free.

 The neighbor code is using the PF_UNSPEC protocol type so we won't
 collide with these unless someone was using PF_ROUTE and relying on
 falling back to PF_UNSPEC however I couldn't find any programs that
 did this iproute2 certainly doesn't. And the bridge pieces are using
 PF_BRIDGE so no collision there.
 
 They have to be different calls from the calls that talk to the s/ware
 bridge. In my opinion, as controversial as this may sound, you need to
 be flexible enough that some vendor can replace these calls with
 proprietary calls which are more efficient for their hardware. So a
 plugin to replace these calls in the user space code would be a 
 good idea. Alternatively, you could make that something they do at
 the driver level i.e from user space to kernel it is hardware, please
 addthistotheFDBtable() call and the implementation of that could be
 proprietary to the specific hardware.
 

Agreed. I think adding some ndo_ops for bridging offloads here would
work. For example the DSA infrastructure and/or macvlan devices might
need this. Along the lines of extending this RFC,

[RFC] hardware bridging support for DSA switches
http://patchwork.ozlabs.org/patch/16578/


.John

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] perf/x86: Fix HO/GO counting with SVM disabled

2012-02-29 Thread Peter Zijlstra
On Wed, 2012-02-29 at 14:57 +0100, Joerg Roedel wrote:

 diff --git a/arch/x86/kernel/cpu/perf_event.h 
 b/arch/x86/kernel/cpu/perf_event.h
 index 8944062..2c581b9 100644
 --- a/arch/x86/kernel/cpu/perf_event.h
 +++ b/arch/x86/kernel/cpu/perf_event.h
 @@ -148,6 +148,8 @@ struct cpu_hw_events {
  * AMD specific bits
  */
 struct amd_nb   *amd_nb;
 +   /* Inverted mask of bits to clear in the perf_ctr ctrl registers */
 +   u64 perf_ctr_virt_mask;
  
 void*kfree_on_online;
  };
 @@ -417,9 +419,11 @@ void x86_pmu_disable_all(void);
  static inline void __x86_pmu_enable_event(struct hw_perf_event *hwc,
   u64 enable_mask)
  {
 +   u64 disable_mask = __this_cpu_read(cpu_hw_events.perf_ctr_virt_mask);
 +
 if (hwc-extra_reg.reg)
 wrmsrl(hwc-extra_reg.reg, hwc-extra_reg.config);
 -   wrmsrl(hwc-config_base, hwc-config | enable_mask);
 +   wrmsrl(hwc-config_base, (hwc-config | enable_mask)  ~disable_mask);
  }

Its starting to look like we should kill this helper, the extra_reg muck
is Intel only and the disable_mask is AMD.



Queued it for now though.. 
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: PPC: Don't sync timebase when inside KVM

2012-02-29 Thread Scott Wood
On 02/28/2012 08:16 PM, Alexander Graf wrote:
 When we know that we're running inside of a KVM guest, we don't have to
 worry about synchronizing timebases between different CPUs, since the
 host already took care of that.
 
 This fixes CPU overcommit scenarios where vCPUs could hang forever trying
 to sync each other while not being scheduled.
 
 Reported-by: Stuart Yoder b08...@freescale.com
 Signed-off-by: Alexander Graf ag...@suse.de

This should apply to any hypervisor, not just KVM.  On book3e, Power ISA
says timebase is read-only on virtualized implementations.  My
understanding is that book3s is paravirt-only (guest state is not
considered an implementation of the Power ISA), and it says Writing the
Time Base is privileged, and can be done only in hypervisor state.

Which platforms are you seeing this on?  If it's on Freescale chips,
U-Boot should be doing the sync and Linux should never do it, even in
the absence of a hypervisor.

-Scott

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-29 Thread Stephen Hemminger
On Wed, 29 Feb 2012 09:25:56 -0800
John Fastabend john.r.fastab...@intel.com wrote:

 On 2/29/2012 5:56 AM, Jamal Hadi Salim wrote:
  On Tue, 2012-02-28 at 20:40 -0800, John Fastabend wrote:
  
  OK back to this. The last piece is where to put these messages...
  we could take PF_ROUTE:RTM_*NEIGH
 
   PF_ROUTE:RTM_NEWNEIGH - Add a new FDB entry to an offloaded
   switch.
   PF_ROUTE:RTM_DELNEIGH - Delete a FDB entry from an offlaoded
   switch.
   PF_ROUTE:RTM_GETNEIGH - Dumps the embedded FDB table
 
  
  Why RTM_*NEIGH? RTM tends to map to Route/L3 and NEIGH tends to map
  to ndisc or ARP both tied to IP address resolution. While both ARP/Ndisc
  may play a role in the user space app populating the FDB, i dont think
  they are necessary players.
  Learning could be via a table entry miss and packet redirect to user
  space.
  So my suggestion is to use FDB_*ENTRY for names
   
 
 Well I think NETLINK_ROUTE is the most correct type to use in this
 case. Per netlink.h its for routing and device hooks.
 
 #define NETLINK_ROUTE   0   /* Routing/device hook
   */
 
 And NETLINK_ROUTE msg_types use the RTM_* prefix. The _*NEIGH postfix
 were merely a copy from the SW BRIDGE code paths. How about,
 
 PF_BRIDGE:RTM_FDB_NEWENTRY
 PF_BRIDGE:RTM_FDB_DELENTRY
 PF_BRIDGE:RTM_FDB_GETENTRY
 
 And a new group RTNLGRP_FDB. Also using NETLINK_ROUTE gives the correct
 rtnl locking semantics for free.
 
  The neighbor code is using the PF_UNSPEC protocol type so we won't
  collide with these unless someone was using PF_ROUTE and relying on
  falling back to PF_UNSPEC however I couldn't find any programs that
  did this iproute2 certainly doesn't. And the bridge pieces are using
  PF_BRIDGE so no collision there.
  
  They have to be different calls from the calls that talk to the s/ware
  bridge. In my opinion, as controversial as this may sound, you need to
  be flexible enough that some vendor can replace these calls with
  proprietary calls which are more efficient for their hardware. So a
  plugin to replace these calls in the user space code would be a 
  good idea. Alternatively, you could make that something they do at
  the driver level i.e from user space to kernel it is hardware, please
  addthistotheFDBtable() call and the implementation of that could be
  proprietary to the specific hardware.
  
 
 Agreed. I think adding some ndo_ops for bridging offloads here would
 work. For example the DSA infrastructure and/or macvlan devices might
 need this. Along the lines of extending this RFC,
 
 [RFC] hardware bridging support for DSA switches
 http://patchwork.ozlabs.org/patch/16578/

I want to see a unified API so that user space control applications (RSTP, 
TRILL?)
can use one set of netlink calls for both software bridge and hardware offloaded
bridges.  Does this proposal meet that requirement?

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-29 Thread John Fastabend
On 2/29/2012 9:52 AM, Stephen Hemminger wrote:
 On Wed, 29 Feb 2012 09:25:56 -0800
 John Fastabend john.r.fastab...@intel.com wrote:
 
 On 2/29/2012 5:56 AM, Jamal Hadi Salim wrote:
 On Tue, 2012-02-28 at 20:40 -0800, John Fastabend wrote:

 OK back to this. The last piece is where to put these messages...
 we could take PF_ROUTE:RTM_*NEIGH

  PF_ROUTE:RTM_NEWNEIGH - Add a new FDB entry to an offloaded
  switch.
  PF_ROUTE:RTM_DELNEIGH - Delete a FDB entry from an offlaoded
  switch.
  PF_ROUTE:RTM_GETNEIGH - Dumps the embedded FDB table


 Why RTM_*NEIGH? RTM tends to map to Route/L3 and NEIGH tends to map
 to ndisc or ARP both tied to IP address resolution. While both ARP/Ndisc
 may play a role in the user space app populating the FDB, i dont think
 they are necessary players.
 Learning could be via a table entry miss and packet redirect to user
 space.
 So my suggestion is to use FDB_*ENTRY for names
  

 Well I think NETLINK_ROUTE is the most correct type to use in this
 case. Per netlink.h its for routing and device hooks.

 #define NETLINK_ROUTE   0   /* Routing/device hook   
*/

 And NETLINK_ROUTE msg_types use the RTM_* prefix. The _*NEIGH postfix
 were merely a copy from the SW BRIDGE code paths. How about,

 PF_BRIDGE:RTM_FDB_NEWENTRY
 PF_BRIDGE:RTM_FDB_DELENTRY
 PF_BRIDGE:RTM_FDB_GETENTRY

 And a new group RTNLGRP_FDB. Also using NETLINK_ROUTE gives the correct
 rtnl locking semantics for free.

 The neighbor code is using the PF_UNSPEC protocol type so we won't
 collide with these unless someone was using PF_ROUTE and relying on
 falling back to PF_UNSPEC however I couldn't find any programs that
 did this iproute2 certainly doesn't. And the bridge pieces are using
 PF_BRIDGE so no collision there.

 They have to be different calls from the calls that talk to the s/ware
 bridge. In my opinion, as controversial as this may sound, you need to
 be flexible enough that some vendor can replace these calls with
 proprietary calls which are more efficient for their hardware. So a
 plugin to replace these calls in the user space code would be a 
 good idea. Alternatively, you could make that something they do at
 the driver level i.e from user space to kernel it is hardware, please
 addthistotheFDBtable() call and the implementation of that could be
 proprietary to the specific hardware.


 Agreed. I think adding some ndo_ops for bridging offloads here would
 work. For example the DSA infrastructure and/or macvlan devices might
 need this. Along the lines of extending this RFC,

 [RFC] hardware bridging support for DSA switches
 http://patchwork.ozlabs.org/patch/16578/
 
 I want to see a unified API so that user space control applications (RSTP, 
 TRILL?)
 can use one set of netlink calls for both software bridge and hardware 
 offloaded
 bridges.  Does this proposal meet that requirement?
 

With the patches I sent out last night the same netlink calls are used
for both SW and HW with a flag set in ndm_flags to indicate it is a hardware
entry. The flag is needed when a port has offload support and is also
a slave of a SW bridge. Another option would be to apply the command to both
hardware and software tables. This might be good enough and user space would
not have to make distinctions between HW and SW bridges. Also helps with my
original use case where I want the SW and HW bridge FDBs to be in sync.

In response to Jamal's comment I proposed changing the type to RTM_FDB_XXXENTRY
but the message contents are the same in both cases.

Jamal, so why do They have to be different calls? I'm not so sure anymore...
moving to RTM_FDB_XXXENTRY saved some refactoring in the bridge module but that
is just cosmetic.

Thanks,
John
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: PPC: Don't sync timebase when inside KVM

2012-02-29 Thread Alexander Graf


On 29.02.2012, at 18:50, Scott Wood scottw...@freescale.com wrote:

 On 02/28/2012 08:16 PM, Alexander Graf wrote:
 When we know that we're running inside of a KVM guest, we don't have to
 worry about synchronizing timebases between different CPUs, since the
 host already took care of that.
 
 This fixes CPU overcommit scenarios where vCPUs could hang forever trying
 to sync each other while not being scheduled.
 
 Reported-by: Stuart Yoder b08...@freescale.com
 Signed-off-by: Alexander Graf ag...@suse.de
 
 This should apply to any hypervisor, not just KVM.  

Sure, but do you have a generic function to evaluate that? :)

 On book3e, Power ISA
 says timebase is read-only on virtualized implementations.  My
 understanding is that book3s is paravirt-only (guest state is not
 considered an implementation of the Power ISA), and it says Writing the
 Time Base is privileged, and can be done only in hypervisor state.

For PR non-PAPR KVM, we are non-paravirt, but ignore tb writes iirc.

 
 Which platforms are you seeing this on?  If it's on Freescale chips,
 U-Boot should be doing the sync and Linux should never do it, even in
 the absence of a hypervisor.

This is on e500mc.

Alex

 
 -Scott
 
 --
 To unsubscribe from this list: send the line unsubscribe kvm-ppc in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V2] Quirk for IVB graphics FLR errata

2012-02-29 Thread Jesse Barnes
On Wed, 29 Feb 2012 03:11:24 +
Hao, Xudong xudong@intel.com wrote:

 For IvyBridge Mobile platform, a system hang may occur if a
 FLR(Function Level Reset) is asserted to internal graphics.
 
 This quirk patch is workaround for the IVB FLR errata issue.
 We are disabling the FLR reset handshake between the PCH and CPU
 display, then manually powering down the panel power sequencing and
 resetting the PCH display.
 
 Signed-off-by: Xudong Hao xudong@intel.com
 Signed-off-by: Kay, Allen M allen.m@intel.com
 ---
  drivers/pci/quirks.c |   49
 + 1 files changed, 49
 insertions(+), 0 deletions(-)
 
 diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c index
 6476547..5223b80 100644 --- a/drivers/pci/quirks.c
 +++ b/drivers/pci/quirks.c
 @@ -29,6 +29,11 @@
  #include asm/dma.h /* isa_dma_bridge_buggy */
  #include pci.h
  
 +#include ../gpu/drm/i915/i915_reg.h

Ugg... this is ugly.  Should probably go near the definition of the
quirk at any rate.

 +#include asm/tsc.h
 +/* 10 seconds */
 +#define IGD_OPERATION_TIMEOUT ((cycles_t) tsc_khz*10*1000)
 +

Same here, the asm/tsc.h can go at the top, but the timeout definition
should go near the reset function.

May as well make it a static const cycles_t as well.

 +#define MSG_CTL  0x45010
 +
 +static int reset_ivb_igd(struct pci_dev *dev, int probe) {
 + u8 *mmio_base;
 + u32 val;
 +
 + if (probe)
 + return 0;
 +
 + mmio_base = ioremap_nocache(pci_resource_start(dev, 0),
 +  pci_resource_len(dev, 0));
 + if (!mmio_base)
 + return -ENOMEM;
 +
 + /* Work Around */
 + *((u32 *)(mmio_base + MSG_CTL)) = 0x0002;
 + *((u32 *)(mmio_base + SOUTH_CHICKEN2)) = 0x0005;
 + val = *((u32 *)(mmio_base + PCH_PP_CONTROL))  0xfffe;
 + *((u32 *)(mmio_base + PCH_PP_CONTROL)) = val;

These clobber existing values.  Since we're doing a reset, clobbering
PCH_PP_CONTROL should be ok, but clobbering SOUTH_CHICKEN2 is only ok
if the next driver that loads sets the right bits (if i915 was
previously using the device, it would have set a couple of bits).

But again since it's it a reset that's probably ok, but you should put
a comment indicating as much.

 + do {
 + cycles_t start_time = get_cycles();
 + while (1) {
 + val = *((u32 *)(mmio_base + PCH_PP_STATUS));
 + if (((val  0x8000) == 0)
 +  ((val  0x3000) == 0))
 + break;
 + if (IGD_OPERATION_TIMEOUT  (get_cycles() -
 start_time))
 + break;
 + cpu_relax();
 + }
 + } while (0);
 + *((u32 *)(mmio_base + 0xd0100)) = 0x0002;
 +
 + iounmap(pci_resource_start(dev, 0));
 + return 0;
 +}
 +
  #define PCI_DEVICE_ID_INTEL_82599_SFP_VF   0x10ed
 +#define PCI_DEVICE_ID_INTEL_IVB_M_VGA  0x0156
 +#define PCI_DEVICE_ID_INTEL_IVB_M2_VGA 0x0166
  
  static const struct pci_dev_reset_methods pci_dev_reset_methods[] = {
   { PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82599_SFP_VF,
reset_intel_82599_sfp_virtfn },
 + { PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_IVB_M_VGA,
 + reset_ivb_igd },
 + { PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_IVB_M2_VGA,
 + reset_ivb_igd },
   { PCI_VENDOR_ID_INTEL, PCI_ANY_ID,
   reset_intel_generic_dev },
   { 0 }

Looks ok otherwise, thanks.

Jesse
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: PPC: Don't sync timebase when inside KVM

2012-02-29 Thread Scott Wood
On 02/29/2012 12:28 PM, Alexander Graf wrote:
 
 
 On 29.02.2012, at 18:50, Scott Wood scottw...@freescale.com wrote:
 
 On 02/28/2012 08:16 PM, Alexander Graf wrote:
 When we know that we're running inside of a KVM guest, we don't have to
 worry about synchronizing timebases between different CPUs, since the
 host already took care of that.

 This fixes CPU overcommit scenarios where vCPUs could hang forever trying
 to sync each other while not being scheduled.

 Reported-by: Stuart Yoder b08...@freescale.com
 Signed-off-by: Alexander Graf ag...@suse.de

 This should apply to any hypervisor, not just KVM.  
 
 Sure, but do you have a generic function to evaluate that? :)

The presence of a hypervisor node without testing compatible.  Might not
get them all, but at least it will cover more than just KVM.

 Which platforms are you seeing this on?  If it's on Freescale chips,
 U-Boot should be doing the sync and Linux should never do it, even in
 the absence of a hypervisor.
 
 This is on e500mc.

On e500mc Linux should never by trying to sync the timebase.  If it is,
let's fix that.

-Scott

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RFC: Device isolation groups

2012-02-29 Thread Alex Williamson
On Thu, 2012-02-02 at 12:24 +1100, David Gibson wrote:
 On Wed, Feb 01, 2012 at 01:08:39PM -0700, Alex Williamson wrote:
  On Wed, 2012-02-01 at 15:46 +1100, David Gibson wrote:
   This patch series introduces a new infrastructure to the driver core
   for representing device isolation groups.  That is, groups of
   devices which can be isolated in such a way that the rest of the
   system can be protected from them, even in the presence of userspace
   or a guest OS directly driving the devices.
   
   Isolation will typically be due to an IOMMU which can safely remap DMA
   and interrupts coming from these devices.  We need to represent whole
   groups, rather than individual devices, because there are a number of
   cases where the group can be isolated as a whole, but devices within
   it cannot be safely isolated from each other - this usually occurs
   because the IOMMU cannot reliably distinguish which device in the
   group initiated a transaction.  In other words, isolation groups
   represent the minimum safe granularity for passthrough to guests or
   userspace.
   
   This series provides the core infraustrcture for tracking isolation
   groups, and example implementations initializing the groups
   appropriately for two PCI bridges (which include IOMMUs) found on IBM
   POWER systems.
   
   Actually using the group information is not included here, but David
   Woodhouse has expressed an interest in using a structure like this to
   represent operations in iommu_ops more correctly.
   
   Some tracking of groups is a prerequisite for safe passthrough of
   devices to guests or userspace, such as done by VFIO.  Current VFIO
   patches use the iommu_ops-device_group mechanism for this.  However,
   that mechanism is awkward, because without an in-kernel concrete
   representation of groups, enumerating a group requires traversing
   every device on a given bus type.  It also fails to cover some very
   plausible IOMMU topologies, because its groups cannot span devices on
   multiple bus types.
  
  So far so good, but there's not much meat on the bone yet.
 
 True..

Any update to this series?  It would be great if we could map out the
functionality to the point of breaking down and distributing work... or
determining if the end result has any value add to what VFIO already
does.  Thanks,

Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH V2] Quirk for IVB graphics FLR errata

2012-02-29 Thread Hao, Xudong
 -Original Message-
 From: Jesse Barnes [mailto:jbar...@virtuousgeek.org]
 Sent: Thursday, March 01, 2012 2:32 AM
 To: Hao, Xudong
 Cc: linux-...@vger.kernel.org; linux-ker...@vger.kernel.org;
 kvm@vger.kernel.org; Kay, Allen M; Zhang, Xiantao
 Subject: Re: [PATCH V2] Quirk for IVB graphics FLR errata
 
 On Wed, 29 Feb 2012 03:11:24 +
 Hao, Xudong xudong@intel.com wrote:
 
  For IvyBridge Mobile platform, a system hang may occur if a
  FLR(Function Level Reset) is asserted to internal graphics.
 
  This quirk patch is workaround for the IVB FLR errata issue.
  We are disabling the FLR reset handshake between the PCH and CPU
  display, then manually powering down the panel power sequencing and
  resetting the PCH display.
 
  Signed-off-by: Xudong Hao xudong@intel.com
  Signed-off-by: Kay, Allen M allen.m@intel.com
  ---
   drivers/pci/quirks.c |   49
  + 1 files changed,
 49
  insertions(+), 0 deletions(-)
 
  diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c index
  6476547..5223b80 100644 --- a/drivers/pci/quirks.c
  +++ b/drivers/pci/quirks.c
  @@ -29,6 +29,11 @@
   #include asm/dma.h   /* isa_dma_bridge_buggy */
   #include pci.h
 
  +#include ../gpu/drm/i915/i915_reg.h
 
 Ugg... this is ugly.  Should probably go near the definition of the quirk at 
 any
 rate.
 

OK, I'll move it just above the reset_ivb_igd function.

  +#include asm/tsc.h
  +/* 10 seconds */
  +#define IGD_OPERATION_TIMEOUT ((cycles_t) tsc_khz*10*1000)
  +
 
 Same here, the asm/tsc.h can go at the top, but the timeout definition should
 go near the reset function.
 
 May as well make it a static const cycles_t as well.
 
Will modify it.
 
  +#define MSG_CTL0x45010
  +
  +static int reset_ivb_igd(struct pci_dev *dev, int probe) {
  +   u8 *mmio_base;
  +   u32 val;
  +
  +   if (probe)
  +   return 0;
  +
  +   mmio_base = ioremap_nocache(pci_resource_start(dev, 0),
  +pci_resource_len(dev, 0));
  +   if (!mmio_base)
  +   return -ENOMEM;
  +
  +   /* Work Around */
  +   *((u32 *)(mmio_base + MSG_CTL)) = 0x0002;
  +   *((u32 *)(mmio_base + SOUTH_CHICKEN2)) = 0x0005;
  +   val = *((u32 *)(mmio_base + PCH_PP_CONTROL))  0xfffe;
  +   *((u32 *)(mmio_base + PCH_PP_CONTROL)) = val;
 
 These clobber existing values.  Since we're doing a reset, clobbering
 PCH_PP_CONTROL should be ok, but clobbering SOUTH_CHICKEN2 is only ok if
 the next driver that loads sets the right bits (if i915 was previously using 
 the
 device, it would have set a couple of bits).
 
 But again since it's it a reset that's probably ok, but you should put a 
 comment
 indicating as much.
 

I'll add comment here, thanks Jesse.

  +   do {
  +   cycles_t start_time = get_cycles();
  +   while (1) {
  +   val = *((u32 *)(mmio_base + PCH_PP_STATUS));
  +   if (((val  0x8000) == 0)
  +((val  0x3000) == 0))
  +   break;
  +   if (IGD_OPERATION_TIMEOUT  (get_cycles() -
  start_time))
  +   break;
  +   cpu_relax();
  +   }
  +   } while (0);
  +   *((u32 *)(mmio_base + 0xd0100)) = 0x0002;
  +
  +   iounmap(pci_resource_start(dev, 0));
  +   return 0;
  +}
  +
   #define PCI_DEVICE_ID_INTEL_82599_SFP_VF   0x10ed
  +#define PCI_DEVICE_ID_INTEL_IVB_M_VGA  0x0156
  +#define PCI_DEVICE_ID_INTEL_IVB_M2_VGA 0x0166
 
   static const struct pci_dev_reset_methods pci_dev_reset_methods[] = {
  { PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82599_SFP_VF,
   reset_intel_82599_sfp_virtfn },
  +   { PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_IVB_M_VGA,
  +   reset_ivb_igd },
  +   { PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_IVB_M2_VGA,
  +   reset_ivb_igd },
  { PCI_VENDOR_ID_INTEL, PCI_ANY_ID,
  reset_intel_generic_dev },
  { 0 }
 
 Looks ok otherwise, thanks.
 
 Jesse
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 1/2] powerpc/e500: make load_up_spe a normal fuction

2012-02-29 Thread Olivia Yin
From: Liu Yu yu@freescale.com

So that we can call it when improving SPE switch like book3e did for fp switch.

Signed-off-by: Liu Yu yu@freescale.com
Signed-off-by: Olivia Yin hong-hua@freescale.com
---
v2: add Signed-off-by

 arch/powerpc/kernel/head_fsl_booke.S |   23 ++-
 1 files changed, 6 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/kernel/head_fsl_booke.S 
b/arch/powerpc/kernel/head_fsl_booke.S
index d5d78c4..c96e025 100644
--- a/arch/powerpc/kernel/head_fsl_booke.S
+++ b/arch/powerpc/kernel/head_fsl_booke.S
@@ -539,8 +539,10 @@ interrupt_base:
/* SPE Unavailable */
START_EXCEPTION(SPEUnavailable)
NORMAL_EXCEPTION_PROLOG
-   bne load_up_spe
-   addir3,r1,STACK_FRAME_OVERHEAD
+   beq 1f
+   bl  load_up_spe
+   b   fast_exception_return
+1: addir3,r1,STACK_FRAME_OVERHEAD
EXC_XFER_EE_LITE(0x2010, KernelSPE)
 #else
EXCEPTION(0x2020, SPEUnavailable, unknown_exception, EXC_XFER_EE)
@@ -743,7 +745,7 @@ tlb_write_entry:
 /* Note that the SPE support is closely modeled after the AltiVec
  * support.  Changes to one are likely to be applicable to the
  * other!  */
-load_up_spe:
+_GLOBAL(load_up_spe)
 /*
  * Disable SPE for the task which had SPE previously,
  * and save its SPE registers in its thread_struct.
@@ -791,20 +793,7 @@ load_up_spe:
subir4,r5,THREAD
stw r4,last_task_used_spe@l(r3)
 #endif /* !CONFIG_SMP */
-   /* restore registers and return */
-2: REST_4GPRS(3, r11)
-   lwz r10,_CCR(r11)
-   REST_GPR(1, r11)
-   mtcrr10
-   lwz r10,_LINK(r11)
-   mtlrr10
-   REST_GPR(10, r11)
-   mtspr   SPRN_SRR1,r9
-   mtspr   SPRN_SRR0,r12
-   REST_GPR(9, r11)
-   REST_GPR(12, r11)
-   lwz r11,GPR11(r11)
-   rfi
+   blr
 
 /*
  * SPE unavailable trap from kernel - print a message, but let
-- 
1.6.4


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 2/2] KVM: booke: Improve SPE switch

2012-02-29 Thread Olivia Yin
From: Liu Yu yu@freescale.com

Like book3s did for fp switch,
instead of switch SPE between host and guest,
the patch switch SPE state between qemu and guest.
In this way, we can simulate a host loadup SPE when load guest SPE state,
and let host to decide when to giveup SPE state.
Therefor it cooperates better with host SPE usage,
and so that has some performance benifit in UP host(lazy SPE).

Moreover, since the patch save guest SPE state into linux thread field,
it creates the condition to emulate guest SPE instructions in host,
so that we can avoid injecting SPE exception to guest.

The patch also turns all asm code into C code,
and add SPE stat counts.

Signed-off-by: Liu Yu yu@freescale.com
Signed-off-by: Olivia Yin hong-hua@freescale.com
---
v2: 
Keep shadow MSR[SPE] consistent with 
thread MSR[SPE] in kvmppc_core_vcpu_load

 arch/powerpc/include/asm/kvm_host.h |   11 +-
 arch/powerpc/kernel/asm-offsets.c   |7 
 arch/powerpc/kvm/booke.c|   63 +++
 arch/powerpc/kvm/booke.h|8 +
 arch/powerpc/kvm/booke_interrupts.S |   37 
 arch/powerpc/kvm/e500.c |   13 ---
 arch/powerpc/kvm/timing.c   |5 +++
 arch/powerpc/kvm/timing.h   |   11 ++
 8 files changed, 91 insertions(+), 64 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 1843d5d..6186d08 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -117,6 +117,11 @@ struct kvm_vcpu_stat {
u32 st;
u32 st_slow;
 #endif
+#ifdef CONFIG_SPE
+   u32 spe_unavail;
+   u32 spe_fp_data;
+   u32 spe_fp_round;
+#endif
 };
 
 enum kvm_exit_types {
@@ -147,6 +152,11 @@ enum kvm_exit_types {
FP_UNAVAIL,
DEBUG_EXITS,
TIMEINGUEST,
+#ifdef CONFIG_SPE
+   SPE_UNAVAIL,
+   SPE_FP_DATA,
+   SPE_FP_ROUND,
+#endif
__NUMBER_OF_KVM_EXIT_TYPES
 };
 
@@ -330,7 +340,6 @@ struct kvm_vcpu_arch {
 #ifdef CONFIG_SPE
ulong evr[32];
ulong spefscr;
-   ulong host_spefscr;
u64 acc;
 #endif
 #ifdef CONFIG_ALTIVEC
diff --git a/arch/powerpc/kernel/asm-offsets.c 
b/arch/powerpc/kernel/asm-offsets.c
index 8e0db0b..ff68f71 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -604,13 +604,6 @@ int main(void)
DEFINE(TLBCAM_MAS7, offsetof(struct tlbcam, MAS7));
 #endif
 
-#if defined(CONFIG_KVM)  defined(CONFIG_SPE)
-   DEFINE(VCPU_EVR, offsetof(struct kvm_vcpu, arch.evr[0]));
-   DEFINE(VCPU_ACC, offsetof(struct kvm_vcpu, arch.acc));
-   DEFINE(VCPU_SPEFSCR, offsetof(struct kvm_vcpu, arch.spefscr));
-   DEFINE(VCPU_HOST_SPEFSCR, offsetof(struct kvm_vcpu, arch.host_spefscr));
-#endif
-
 #ifdef CONFIG_KVM_EXIT_TIMING
DEFINE(VCPU_TIMING_EXIT_TBU, offsetof(struct kvm_vcpu,
arch.timing_exit.tv32.tbu));
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index ee9e1ee..f20010b 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -55,6 +55,11 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
{ dec,VCPU_STAT(dec_exits) },
{ ext_intr,   VCPU_STAT(ext_intr_exits) },
{ halt_wakeup, VCPU_STAT(halt_wakeup) },
+#ifdef CONFIG_SPE
+   { spe_unavail, VCPU_STAT(spe_unavail) },
+   { spe_fp_data, VCPU_STAT(spe_fp_data) },
+   { spe_fp_round, VCPU_STAT(spe_fp_round) },
+#endif
{ NULL }
 };
 
@@ -80,11 +85,11 @@ void kvmppc_dump_vcpu(struct kvm_vcpu *vcpu)
 }
 
 #ifdef CONFIG_SPE
-void kvmppc_vcpu_disable_spe(struct kvm_vcpu *vcpu)
+static void kvmppc_vcpu_disable_spe(struct kvm_vcpu *vcpu)
 {
preempt_disable();
-   enable_kernel_spe();
-   kvmppc_save_guest_spe(vcpu);
+   if (current-thread.regs-msr  MSR_SPE)
+   giveup_spe(current);
vcpu-arch.shadow_msr = ~MSR_SPE;
preempt_enable();
 }
@@ -92,8 +97,10 @@ void kvmppc_vcpu_disable_spe(struct kvm_vcpu *vcpu)
 static void kvmppc_vcpu_enable_spe(struct kvm_vcpu *vcpu)
 {
preempt_disable();
-   enable_kernel_spe();
-   kvmppc_load_guest_spe(vcpu);
+   if (!(current-thread.regs-msr  MSR_SPE)) {
+   load_up_spe(NULL);
+   current-thread.regs-msr |= MSR_SPE;
+   }
vcpu-arch.shadow_msr |= MSR_SPE;
preempt_enable();
 }
@@ -104,7 +111,7 @@ static void kvmppc_vcpu_sync_spe(struct kvm_vcpu *vcpu)
if (!(vcpu-arch.shadow_msr  MSR_SPE))
kvmppc_vcpu_enable_spe(vcpu);
} else if (vcpu-arch.shadow_msr  MSR_SPE) {
-   kvmppc_vcpu_disable_spe(vcpu);
+   vcpu-arch.shadow_msr = ~MSR_SPE;
}
 }
 #else
@@ -124,7 +131,8 @@ void kvmppc_set_msr(struct kvm_vcpu *vcpu, u32 new_msr)
vcpu-arch.shared-msr = new_msr;
 
kvmppc_mmu_msr_notify(vcpu, old_msr);
-

Re: [PATCH] kvm: notify host when guest paniced

2012-02-29 Thread Wen Congyang
At 02/29/2012 06:39 PM, Avi Kivity Wrote:
 On 02/29/2012 12:17 PM, Wen Congyang wrote:

 Yes, crash can be so severe that it is not even detected by a kernel
 itself, so not OOPS message even printed. But in most cases if kernel is
 functional enough to print OOPS it is functional enough to call single
 hypercall instruction.

 Why not print the oops to virtio-serial?  Or even just a regular serial
 port?  That's what bare metal does.

 If virtio-serial's driver has bug or the guest doesn't have such device...
 
 We have the same issue with the hypercall; and virtio-serial is
 available on many deployed versions.

How to know whether a guest has virtio-serial?

Thanks
Wen Congyang

 

 Having special kdump
 kernel that transfers dump to a host via virtio-serial channel though
 sounds interesting. May be that's what you mean.

 Yes.  The panic, starting dump signal should be initiated by the
 panicking kernel though, in case the dump fails.

 Then panic hypercall sounds like a reasonable solution.

 It is, but I'm trying to see if we can get away with doing nothing.


 If we have a reliable way with doing nothing, it is better. But I donot
 find such way.
 
 We won't have a 100% reliable way.  But I think a variant of the driver
 that doesn't use interrupts, or just using the ordinary serial driver,
 should be reliable enough.
 

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH V3] Quirk for IVB graphics FLR errata

2012-02-29 Thread Hao, Xudong
For IvyBridge Mobile platform, a system hang may occur if a FLR(Function Level 
Reset) is asserted to internal graphics.

This quirk patch is workaround for the IVB FLR errata issue.
We are disabling the FLR reset handshake between the PCH and CPU display, then 
manually powering down the panel power sequencing and resetting the PCH display.

Signed-off-by: Xudong Hao xudong@intel.com
Signed-off-by: Kay, Allen M allen.m@intel.com
Reviewed-by: Xiantao Zhang xiantao.zh...@intel.com
---
 drivers/pci/quirks.c |   53 ++
 1 files changed, 53 insertions(+), 0 deletions(-)

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c index 6476547..c687218 
100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -27,6 +27,7 @@
 #include linux/pci-aspm.h
 #include linux/ioport.h
 #include asm/dma.h   /* isa_dma_bridge_buggy */
+#include asm/tsc.h
 #include pci.h
 
 /*
@@ -3069,11 +3070,63 @@ static int reset_intel_82599_sfp_virtfn(struct pci_dev 
*dev, int probe)
return 0;
 }
 
+#include ../gpu/drm/i915/i915_reg.h
+#define MSG_CTL0x45010
+static const int op_timeout = 10;  /* set timeout 10 seconds */
+
+static int reset_ivb_igd(struct pci_dev *dev, int probe) {
+   u8 *mmio_base;
+   u32 val;
+   cycles_t cyc_op_timeout = tsc_khz*op_timeout*1000;
+
+   if (probe)
+   return 0;
+
+   mmio_base = ioremap_nocache(pci_resource_start(dev, 0),
+pci_resource_len(dev, 0));
+   if (!mmio_base)
+   return -ENOMEM;
+
+   /* Work Around */
+   *((u32 *)(mmio_base + MSG_CTL)) = 0x0002;
+   /* Clobbering SOUTH_CHICKEN2 register is fine only if the next
+* driver loaded sets the right bits. However, this's a reset and
+* the bits have been set by i915 previously, so we clobber
+* SOUTH_CHICKEN2 register directly here.
+*/
+   *((u32 *)(mmio_base + SOUTH_CHICKEN2)) = 0x0005;
+   val = *((u32 *)(mmio_base + PCH_PP_CONTROL))  0xfffe;
+   *((u32 *)(mmio_base + PCH_PP_CONTROL)) = val;
+   do {
+   cycles_t start_time = get_cycles();
+   while (1) {
+   val = *((u32 *)(mmio_base + PCH_PP_STATUS));
+   if (((val  0x8000) == 0)
+((val  0x3000) == 0))
+   break;
+   if (cyc_op_timeout  (get_cycles() - start_time))
+   break;
+   cpu_relax();
+   }
+   } while (0);
+   *((u32 *)(mmio_base + 0xd0100)) = 0x0002;
+
+   iounmap(pci_resource_start(dev, 0));
+   return 0;
+}
+
 #define PCI_DEVICE_ID_INTEL_82599_SFP_VF   0x10ed
+#define PCI_DEVICE_ID_INTEL_IVB_M_VGA  0x0156
+#define PCI_DEVICE_ID_INTEL_IVB_M2_VGA 0x0166
 
 static const struct pci_dev_reset_methods pci_dev_reset_methods[] = {
{ PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82599_SFP_VF,
 reset_intel_82599_sfp_virtfn },
+   { PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_IVB_M_VGA,
+   reset_ivb_igd },
+   { PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_IVB_M2_VGA,
+   reset_ivb_igd },
{ PCI_VENDOR_ID_INTEL, PCI_ANY_ID,
reset_intel_generic_dev },
{ 0 }
--
1.6.0.rc1


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] KVM: Resize kvm_io_range array dynamically

2012-02-29 Thread Amos Kong

On 01/03/12 00:34, Amos Kong wrote:

- Original Message -

On 2012-02-29 16:22, Amos Kong wrote:

- Original Message -

On 2012-02-29 14:30, Amos Kong wrote:

kvm_io_bus devices are used for ioevent, pit, pic, ioapic,
coalesced_mmio.

Currently Qemu only emulates one PCI bus, it contains 32 slots,
one slot contains 8 functions, maximum of supported PCI devices:
  1 * 32 * 8 = 256. The maximum of coalesced mmio zone is 100,
each zone has an iobus devices. 300 io_bus devices is not enough.

This patch makes the kvm_io_range array can be resized
dynamically.


Is there any limit, or can userspace allocate arbitrary amounts of
kernel memory this way?


Hi Jan,

There is a fixed array in linux-2.6/include/linux/kvm_host.h,
we can only register 300 devices.

struct kvm_io_range {
 gpa_t addr;
 int len;
 struct kvm_io_device *dev;
};

struct kvm_io_bus {
 int   dev_count;
#define NR_IOBUS_DEVS 300
 struct kvm_io_range range[NR_IOBUS_DEVS];
};
   ^^


Right. But doesn't your patch remove precisely this limit? So what
limits userspace now? To register 300 million devices...?


Hi Jan,

It seems we need to reserve the limitation in kvm_host.h

#define NR_IOBUS_DEVS 600

/* Currently Qemu only emulates one PCI bus, it contains 32 slots,
one slot contains 8 functions. Only 29 slots can be used to
add multiple function devices. Maximum of supported PCI devices:
29 * 8 = 232. Each virtio-blk device needs 1 iobus device,
each virtio-net(vhost) device requires 2 such devices to service
notifications (ioevent) from rx/tx queues.
The maximum of coalesced mmio zone is 100, each zone has an iobus
devices. ioevent, pit, ioapic take less iobus devices.

So we can set max limitation to 600. */


One virtio-net(vhost=on) takes two iobus devices,
and it needs three IRQs for MSI/MIS-X.
I started a guest with 232 virtio-net(vhost=on),
guest IRQ 24 to 191 were used for virtio-config/input/output,
and only 56 virtio-nics' MSIX were enabled.
56 virtio-net(vhost=on) registered 56 * 2 = 112 iobus devices.

It's safe to set the limit to 300, right ?



- check limit when register dev 

virt/kvm/kvm_main.c:

/* Caller must hold slots_lock. */
int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
 int len, struct kvm_io_device *dev)
{
 struct kvm_io_bus *new_bus, *bus;

 bus = kvm-buses[bus_idx];
 if (bus-dev_count  NR_IOBUS_DEVS - 1)   // can only register 600 
devices
 return -ENOSPC;

Amos.


--
Amos.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] kvm: notify host when guest paniced

2012-02-29 Thread Wen Congyang
At 02/29/2012 06:39 PM, Avi Kivity Wrote:
 On 02/29/2012 12:17 PM, Wen Congyang wrote:

 Yes, crash can be so severe that it is not even detected by a kernel
 itself, so not OOPS message even printed. But in most cases if kernel is
 functional enough to print OOPS it is functional enough to call single
 hypercall instruction.

 Why not print the oops to virtio-serial?  Or even just a regular serial
 port?  That's what bare metal does.

 If virtio-serial's driver has bug or the guest doesn't have such device...
 
 We have the same issue with the hypercall; and virtio-serial is
 available on many deployed versions.

virtio-serial is available, but it is an optional device. If the guest does
not have this device, the guest cannot tell the host that is is paniced. So
I still prefer to touch the hypervisor.

Thanks
Wen Congyang

 

 Having special kdump
 kernel that transfers dump to a host via virtio-serial channel though
 sounds interesting. May be that's what you mean.

 Yes.  The panic, starting dump signal should be initiated by the
 panicking kernel though, in case the dump fails.

 Then panic hypercall sounds like a reasonable solution.

 It is, but I'm trying to see if we can get away with doing nothing.


 If we have a reliable way with doing nothing, it is better. But I donot
 find such way.
 
 We won't have a 100% reliable way.  But I think a variant of the driver
 that doesn't use interrupts, or just using the ordinary serial driver,
 should be reliable enough.
 

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v4] KVM: Resize kvm_io_range array dynamically

2012-02-29 Thread Amos Kong
kvm_io_bus devices are used for ioevent, pit, pic, ioapic,
coalesced_mmio.

Currently Qemu only emulates one PCI bus, it contains 32 slots,
one slot contains 8 functions, maximum of supported PCI devices:
 1 * 32 * 8 = 256. One virtio-blk takes one iobus device,
one virtio-net(vhost=on) takes two iobus devices.
The maximum of coalesced mmio zone is 100, each zone
has an iobus devices. So 300 io_bus devices are not enough.

This patch makes the kvm_io_range array can be resized dynamically.
Set an upper bounds for kvm_io_range to limit userspace.
1000 is a very large limit and not bloat the typical user.

Changes from v1:
- fix typo: kvm_io_bus_range - kvm_io_range

Changes from v2:
- unregister device only when it exists

Changes from v3:
- set upper bounds to limit userspace

Signed-off-by: Amos Kong ak...@redhat.com
---
 include/linux/kvm_host.h |5 +++--
 virt/kvm/kvm_main.c  |   41 ++---
 2 files changed, 25 insertions(+), 21 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 355e445..24ee2db 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -67,10 +67,11 @@ struct kvm_io_range {
struct kvm_io_device *dev;
 };
 
+#define NR_IOBUS_DEVS 1000
+
 struct kvm_io_bus {
int   dev_count;
-#define NR_IOBUS_DEVS 300
-   struct kvm_io_range range[NR_IOBUS_DEVS];
+   struct kvm_io_range range[];
 };
 
 enum kvm_bus {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e4431ad..1baed68 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2389,9 +2389,6 @@ int kvm_io_bus_sort_cmp(const void *p1, const void *p2)
 int kvm_io_bus_insert_dev(struct kvm_io_bus *bus, struct kvm_io_device *dev,
  gpa_t addr, int len)
 {
-   if (bus-dev_count == NR_IOBUS_DEVS)
-   return -ENOSPC;
-
bus-range[bus-dev_count++] = (struct kvm_io_range) {
.addr = addr,
.len = len,
@@ -2491,10 +2488,14 @@ int kvm_io_bus_register_dev(struct kvm *kvm, enum 
kvm_bus bus_idx, gpa_t addr,
struct kvm_io_bus *new_bus, *bus;
 
bus = kvm-buses[bus_idx];
-   if (bus-dev_count  NR_IOBUS_DEVS-1)
+   if (bus-dev_count  NR_IOBUS_DEVS - 1)
return -ENOSPC;
 
-   new_bus = kmemdup(bus, sizeof(struct kvm_io_bus), GFP_KERNEL);
+   new_bus = kzalloc(sizeof(*bus) + ((bus-dev_count + 1) *
+ sizeof(struct kvm_io_range)), GFP_KERNEL);
+   if (new_bus)
+   memcpy(new_bus, bus, sizeof(*bus) + (bus-dev_count *
+  sizeof(struct kvm_io_range)));
if (!new_bus)
return -ENOMEM;
kvm_io_bus_insert_dev(new_bus, dev, addr, len);
@@ -2513,26 +2514,28 @@ int kvm_io_bus_unregister_dev(struct kvm *kvm, enum 
kvm_bus bus_idx,
struct kvm_io_bus *new_bus, *bus;
 
bus = kvm-buses[bus_idx];
-
-   new_bus = kmemdup(bus, sizeof(*bus), GFP_KERNEL);
-   if (!new_bus)
-   return -ENOMEM;
-
r = -ENOENT;
-   for (i = 0; i  new_bus-dev_count; i++)
-   if (new_bus-range[i].dev == dev) {
+   for (i = 0; i  bus-dev_count; i++)
+   if (bus-range[i].dev == dev) {
r = 0;
-   new_bus-dev_count--;
-   new_bus-range[i] = new_bus-range[new_bus-dev_count];
-   sort(new_bus-range, new_bus-dev_count,
-sizeof(struct kvm_io_range),
-kvm_io_bus_sort_cmp, NULL);
break;
}
 
-   if (r) {
-   kfree(new_bus);
+   if (r)
return r;
+
+   new_bus = kmemdup(bus, sizeof(*bus) + ((bus-dev_count - 1) *
+ sizeof(struct kvm_io_range)), GFP_KERNEL);
+   if (!new_bus)
+   return -ENOMEM;
+
+   new_bus-dev_count--;
+   /* copy last entry of bus-range to deleted entry spot if
+  deleted entry isn't the last entry of bus-range */
+   if (i != bus-dev_count - 1) {
+   new_bus-range[i] = bus-range[bus-dev_count - 1];
+   sort(new_bus-range, new_bus-dev_count,
+sizeof(struct kvm_io_range), kvm_io_bus_sort_cmp, NULL);
}
 
rcu_assign_pointer(kvm-buses[bus_idx], new_bus);

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: PPC: Don't sync timebase when inside KVM

2012-02-29 Thread Scott Wood
On 02/28/2012 08:16 PM, Alexander Graf wrote:
 When we know that we're running inside of a KVM guest, we don't have to
 worry about synchronizing timebases between different CPUs, since the
 host already took care of that.
 
 This fixes CPU overcommit scenarios where vCPUs could hang forever trying
 to sync each other while not being scheduled.
 
 Reported-by: Stuart Yoder b08...@freescale.com
 Signed-off-by: Alexander Graf ag...@suse.de

This should apply to any hypervisor, not just KVM.  On book3e, Power ISA
says timebase is read-only on virtualized implementations.  My
understanding is that book3s is paravirt-only (guest state is not
considered an implementation of the Power ISA), and it says Writing the
Time Base is privileged, and can be done only in hypervisor state.

Which platforms are you seeing this on?  If it's on Freescale chips,
U-Boot should be doing the sync and Linux should never do it, even in
the absence of a hypervisor.

-Scott

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: PPC: Don't sync timebase when inside KVM

2012-02-29 Thread Alexander Graf


On 29.02.2012, at 18:50, Scott Wood scottw...@freescale.com wrote:

 On 02/28/2012 08:16 PM, Alexander Graf wrote:
 When we know that we're running inside of a KVM guest, we don't have to
 worry about synchronizing timebases between different CPUs, since the
 host already took care of that.
 
 This fixes CPU overcommit scenarios where vCPUs could hang forever trying
 to sync each other while not being scheduled.
 
 Reported-by: Stuart Yoder b08...@freescale.com
 Signed-off-by: Alexander Graf ag...@suse.de
 
 This should apply to any hypervisor, not just KVM.  

Sure, but do you have a generic function to evaluate that? :)

 On book3e, Power ISA
 says timebase is read-only on virtualized implementations.  My
 understanding is that book3s is paravirt-only (guest state is not
 considered an implementation of the Power ISA), and it says Writing the
 Time Base is privileged, and can be done only in hypervisor state.

For PR non-PAPR KVM, we are non-paravirt, but ignore tb writes iirc.

 
 Which platforms are you seeing this on?  If it's on Freescale chips,
 U-Boot should be doing the sync and Linux should never do it, even in
 the absence of a hypervisor.

This is on e500mc.

Alex

 
 -Scott
 
 --
 To unsubscribe from this list: send the line unsubscribe kvm-ppc in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: PPC: Don't sync timebase when inside KVM

2012-02-29 Thread Scott Wood
On 02/29/2012 12:28 PM, Alexander Graf wrote:
 
 
 On 29.02.2012, at 18:50, Scott Wood scottw...@freescale.com wrote:
 
 On 02/28/2012 08:16 PM, Alexander Graf wrote:
 When we know that we're running inside of a KVM guest, we don't have to
 worry about synchronizing timebases between different CPUs, since the
 host already took care of that.

 This fixes CPU overcommit scenarios where vCPUs could hang forever trying
 to sync each other while not being scheduled.

 Reported-by: Stuart Yoder b08...@freescale.com
 Signed-off-by: Alexander Graf ag...@suse.de

 This should apply to any hypervisor, not just KVM.  
 
 Sure, but do you have a generic function to evaluate that? :)

The presence of a hypervisor node without testing compatible.  Might not
get them all, but at least it will cover more than just KVM.

 Which platforms are you seeing this on?  If it's on Freescale chips,
 U-Boot should be doing the sync and Linux should never do it, even in
 the absence of a hypervisor.
 
 This is on e500mc.

On e500mc Linux should never by trying to sync the timebase.  If it is,
let's fix that.

-Scott

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 1/2] powerpc/e500: make load_up_spe a normal fuction

2012-02-29 Thread Olivia Yin
From: Liu Yu yu@freescale.com

So that we can call it when improving SPE switch like book3e did for fp switch.

Signed-off-by: Liu Yu yu@freescale.com
Signed-off-by: Olivia Yin hong-hua@freescale.com
---
v2: add Signed-off-by

 arch/powerpc/kernel/head_fsl_booke.S |   23 ++-
 1 files changed, 6 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/kernel/head_fsl_booke.S 
b/arch/powerpc/kernel/head_fsl_booke.S
index d5d78c4..c96e025 100644
--- a/arch/powerpc/kernel/head_fsl_booke.S
+++ b/arch/powerpc/kernel/head_fsl_booke.S
@@ -539,8 +539,10 @@ interrupt_base:
/* SPE Unavailable */
START_EXCEPTION(SPEUnavailable)
NORMAL_EXCEPTION_PROLOG
-   bne load_up_spe
-   addir3,r1,STACK_FRAME_OVERHEAD
+   beq 1f
+   bl  load_up_spe
+   b   fast_exception_return
+1: addir3,r1,STACK_FRAME_OVERHEAD
EXC_XFER_EE_LITE(0x2010, KernelSPE)
 #else
EXCEPTION(0x2020, SPEUnavailable, unknown_exception, EXC_XFER_EE)
@@ -743,7 +745,7 @@ tlb_write_entry:
 /* Note that the SPE support is closely modeled after the AltiVec
  * support.  Changes to one are likely to be applicable to the
  * other!  */
-load_up_spe:
+_GLOBAL(load_up_spe)
 /*
  * Disable SPE for the task which had SPE previously,
  * and save its SPE registers in its thread_struct.
@@ -791,20 +793,7 @@ load_up_spe:
subir4,r5,THREAD
stw r4,last_task_used_spe@l(r3)
 #endif /* !CONFIG_SMP */
-   /* restore registers and return */
-2: REST_4GPRS(3, r11)
-   lwz r10,_CCR(r11)
-   REST_GPR(1, r11)
-   mtcrr10
-   lwz r10,_LINK(r11)
-   mtlrr10
-   REST_GPR(10, r11)
-   mtspr   SPRN_SRR1,r9
-   mtspr   SPRN_SRR0,r12
-   REST_GPR(9, r11)
-   REST_GPR(12, r11)
-   lwz r11,GPR11(r11)
-   rfi
+   blr
 
 /*
  * SPE unavailable trap from kernel - print a message, but let
-- 
1.6.4


--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 2/2] KVM: booke: Improve SPE switch

2012-02-29 Thread Olivia Yin
From: Liu Yu yu@freescale.com

Like book3s did for fp switch,
instead of switch SPE between host and guest,
the patch switch SPE state between qemu and guest.
In this way, we can simulate a host loadup SPE when load guest SPE state,
and let host to decide when to giveup SPE state.
Therefor it cooperates better with host SPE usage,
and so that has some performance benifit in UP host(lazy SPE).

Moreover, since the patch save guest SPE state into linux thread field,
it creates the condition to emulate guest SPE instructions in host,
so that we can avoid injecting SPE exception to guest.

The patch also turns all asm code into C code,
and add SPE stat counts.

Signed-off-by: Liu Yu yu@freescale.com
Signed-off-by: Olivia Yin hong-hua@freescale.com
---
v2: 
Keep shadow MSR[SPE] consistent with 
thread MSR[SPE] in kvmppc_core_vcpu_load

 arch/powerpc/include/asm/kvm_host.h |   11 +-
 arch/powerpc/kernel/asm-offsets.c   |7 
 arch/powerpc/kvm/booke.c|   63 +++
 arch/powerpc/kvm/booke.h|8 +
 arch/powerpc/kvm/booke_interrupts.S |   37 
 arch/powerpc/kvm/e500.c |   13 ---
 arch/powerpc/kvm/timing.c   |5 +++
 arch/powerpc/kvm/timing.h   |   11 ++
 8 files changed, 91 insertions(+), 64 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 1843d5d..6186d08 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -117,6 +117,11 @@ struct kvm_vcpu_stat {
u32 st;
u32 st_slow;
 #endif
+#ifdef CONFIG_SPE
+   u32 spe_unavail;
+   u32 spe_fp_data;
+   u32 spe_fp_round;
+#endif
 };
 
 enum kvm_exit_types {
@@ -147,6 +152,11 @@ enum kvm_exit_types {
FP_UNAVAIL,
DEBUG_EXITS,
TIMEINGUEST,
+#ifdef CONFIG_SPE
+   SPE_UNAVAIL,
+   SPE_FP_DATA,
+   SPE_FP_ROUND,
+#endif
__NUMBER_OF_KVM_EXIT_TYPES
 };
 
@@ -330,7 +340,6 @@ struct kvm_vcpu_arch {
 #ifdef CONFIG_SPE
ulong evr[32];
ulong spefscr;
-   ulong host_spefscr;
u64 acc;
 #endif
 #ifdef CONFIG_ALTIVEC
diff --git a/arch/powerpc/kernel/asm-offsets.c 
b/arch/powerpc/kernel/asm-offsets.c
index 8e0db0b..ff68f71 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -604,13 +604,6 @@ int main(void)
DEFINE(TLBCAM_MAS7, offsetof(struct tlbcam, MAS7));
 #endif
 
-#if defined(CONFIG_KVM)  defined(CONFIG_SPE)
-   DEFINE(VCPU_EVR, offsetof(struct kvm_vcpu, arch.evr[0]));
-   DEFINE(VCPU_ACC, offsetof(struct kvm_vcpu, arch.acc));
-   DEFINE(VCPU_SPEFSCR, offsetof(struct kvm_vcpu, arch.spefscr));
-   DEFINE(VCPU_HOST_SPEFSCR, offsetof(struct kvm_vcpu, arch.host_spefscr));
-#endif
-
 #ifdef CONFIG_KVM_EXIT_TIMING
DEFINE(VCPU_TIMING_EXIT_TBU, offsetof(struct kvm_vcpu,
arch.timing_exit.tv32.tbu));
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index ee9e1ee..f20010b 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -55,6 +55,11 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
{ dec,VCPU_STAT(dec_exits) },
{ ext_intr,   VCPU_STAT(ext_intr_exits) },
{ halt_wakeup, VCPU_STAT(halt_wakeup) },
+#ifdef CONFIG_SPE
+   { spe_unavail, VCPU_STAT(spe_unavail) },
+   { spe_fp_data, VCPU_STAT(spe_fp_data) },
+   { spe_fp_round, VCPU_STAT(spe_fp_round) },
+#endif
{ NULL }
 };
 
@@ -80,11 +85,11 @@ void kvmppc_dump_vcpu(struct kvm_vcpu *vcpu)
 }
 
 #ifdef CONFIG_SPE
-void kvmppc_vcpu_disable_spe(struct kvm_vcpu *vcpu)
+static void kvmppc_vcpu_disable_spe(struct kvm_vcpu *vcpu)
 {
preempt_disable();
-   enable_kernel_spe();
-   kvmppc_save_guest_spe(vcpu);
+   if (current-thread.regs-msr  MSR_SPE)
+   giveup_spe(current);
vcpu-arch.shadow_msr = ~MSR_SPE;
preempt_enable();
 }
@@ -92,8 +97,10 @@ void kvmppc_vcpu_disable_spe(struct kvm_vcpu *vcpu)
 static void kvmppc_vcpu_enable_spe(struct kvm_vcpu *vcpu)
 {
preempt_disable();
-   enable_kernel_spe();
-   kvmppc_load_guest_spe(vcpu);
+   if (!(current-thread.regs-msr  MSR_SPE)) {
+   load_up_spe(NULL);
+   current-thread.regs-msr |= MSR_SPE;
+   }
vcpu-arch.shadow_msr |= MSR_SPE;
preempt_enable();
 }
@@ -104,7 +111,7 @@ static void kvmppc_vcpu_sync_spe(struct kvm_vcpu *vcpu)
if (!(vcpu-arch.shadow_msr  MSR_SPE))
kvmppc_vcpu_enable_spe(vcpu);
} else if (vcpu-arch.shadow_msr  MSR_SPE) {
-   kvmppc_vcpu_disable_spe(vcpu);
+   vcpu-arch.shadow_msr = ~MSR_SPE;
}
 }
 #else
@@ -124,7 +131,8 @@ void kvmppc_set_msr(struct kvm_vcpu *vcpu, u32 new_msr)
vcpu-arch.shared-msr = new_msr;
 
kvmppc_mmu_msr_notify(vcpu, old_msr);
-