Re: Add savevm/loadvm support for MCE

2010-02-26 Thread Jan Kiszka
Huang Ying wrote:
 MCE registers are saved/loaded into/from CPUState in
 kvm_arch_save/load_regs. Because all MCE registers except for
 MCG_STATUS should be preserved, MCE registers are saved before
 kvm_arch_load_regs in kvm_arch_cpu_reset. To simulate the MCG_STATUS
 clearing upon reset, env->mcg_status is set to 0 after saving.

That should be solved differently on top of [1]: Write back
MSR_MCG_STATUS on KVM_PUT_RESET_STATE, write all MCE MSRs on
KVM_PUT_FULL_STATE. Then you can also unfold kvm_load/save_mce_regs to
avoid duplicating its infrastructure (becomes even more obvious when
looking at kvm_get/put_msrs in upstream).
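
A rough sketch of that restructuring (editor's illustration, assuming upstream's KVM_PUT_RESET_STATE/KVM_PUT_FULL_STATE levels get threaded into the MSR writeback; the kvm_put_msrs() shown here is a placeholder, not the actual qemu-kvm code):

static int kvm_put_msrs(CPUState *env, int level)
{
    struct kvm_msr_entry msrs[100];
    int n = 0, i;

    /* ... the regular MSRs are set up here ... */

#ifdef KVM_CAP_MCE
    if (env->mcg_cap) {
        /* MCG_STATUS is volatile guest state, write it back on reset too */
        if (level >= KVM_PUT_RESET_STATE)
            set_msr_entry(&msrs[n++], MSR_MCG_STATUS, env->mcg_status);
        /* the other MCE MSRs survive a reset, only write them when the
         * full register state is synchronized */
        if (level == KVM_PUT_FULL_STATE) {
            set_msr_entry(&msrs[n++], MSR_MCG_CTL, env->mcg_ctl);
            for (i = 0; i < (env->mcg_cap & 0xff) * 4; i++)
                set_msr_entry(&msrs[n++], MSR_MC0_CTL + i,
                              env->mce_banks[i]);
        }
    }
#endif

    return kvm_set_msrs(env, msrs, n);
}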

 
 Signed-off-by: Huang Ying ying.hu...@intel.com
 
 ---
  qemu-kvm-x86.c |   54 ++
  1 file changed, 54 insertions(+)
 
 --- a/qemu-kvm-x86.c
 +++ b/qemu-kvm-x86.c
 @@ -803,6 +803,27 @@ static void get_seg(SegmentCache *lhs, c
   | (rhs->avl * DESC_AVL_MASK);
  }
  
 +static void kvm_load_mce_regs(CPUState *env)
 +{
 +#ifdef KVM_CAP_MCE
 +struct kvm_msr_entry msrs[100];
 +int rc, n, i;
 +
 +if (!env->mcg_cap)
 + return;
 +
 +n = 0;
 +set_msr_entry(&msrs[n++], MSR_MCG_STATUS, env->mcg_status);
 +set_msr_entry(&msrs[n++], MSR_MCG_CTL, env->mcg_ctl);
 +for (i = 0; i < (env->mcg_cap & 0xff) * 4; i++)
 +set_msr_entry(&msrs[n++], MSR_MC0_CTL + i, env->mce_banks[i]);
 +
 +rc = kvm_set_msrs(env, msrs, n);
 +if (rc == -1)
 +perror("kvm_set_msrs FAILED");
 +#endif
 +}
 +
  void kvm_arch_load_regs(CPUState *env)
  {
  struct kvm_regs regs;
 @@ -922,6 +943,8 @@ void kvm_arch_load_regs(CPUState *env)
  if (rc == -1)
 perror("kvm_set_msrs FAILED");
  
 +kvm_load_mce_regs(env);
 +
  /*
   * Kernels before 2.6.33 (which correlates with !kvm_has_vcpu_events())
   * overwrote flags.TF injected via SET_GUEST_DEBUG while updating GP 
 regs.
 @@ -991,6 +1014,33 @@ void kvm_arch_load_mpstate(CPUState *env
  #endif
  }
  
 +static void kvm_save_mce_regs(CPUState *env)
 +{
 +#ifdef KVM_CAP_MCE
 +struct kvm_msr_entry msrs[100];
 +int rc, n, i;
 +
 +if (!env->mcg_cap)
 + return;
 +
 +msrs[0].index = MSR_MCG_STATUS;
 +msrs[1].index = MSR_MCG_CTL;
 +n = (env->mcg_cap & 0xff) * 4;
 +for (i = 0; i < n; i++)
 +msrs[2 + i].index = MSR_MC0_CTL + i;
 +
 +rc = kvm_get_msrs(env, msrs, n + 2);
 +if (rc == -1)
 +perror("kvm_set_msrs FAILED");
 +else {
 +env->mcg_status = msrs[0].data;
 +env->mcg_ctl = msrs[1].data;
 +for (i = 0; i < n; i++)
 +env->mce_banks[i] = msrs[2 + i].data;
 +}
 +#endif
 +}
 +
  void kvm_arch_save_regs(CPUState *env)
  {
  struct kvm_regs regs;
 @@ -1148,6 +1198,7 @@ void kvm_arch_save_regs(CPUState *env)
  }
  }
  kvm_arch_save_mpstate(env);
 +kvm_save_mce_regs(env);
  }
  
  static void do_cpuid_ent(struct kvm_cpuid_entry2 *e, uint32_t function,
 @@ -1385,6 +1436,9 @@ void kvm_arch_push_nmi(void *opaque)
  void kvm_arch_cpu_reset(CPUState *env)
  {
  kvm_arch_reset_vcpu(env);
 +/* MCE registers except MCG_STATUS should be unchanged across reset */
 +kvm_save_mce_regs(env);
 +env->mcg_status = 0;
  kvm_arch_load_regs(env);
  kvm_put_vcpu_events(env);
  if (!cpu_is_bsp(env)) {
 
 

Jan

[1] http://thread.gmane.org/gmane.comp.emulators.kvm.devel/47411

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


Re: KVM PMU virtualization

2010-02-26 Thread Ingo Molnar

* Zhang, Yanmin yanmin_zh...@linux.intel.com wrote:

 On Thu, 2010-02-25 at 17:26 +0100, Ingo Molnar wrote:
  * Jan Kiszka jan.kis...@siemens.com wrote:
  
   Jes Sorensen wrote:
Hi,

It looks like several of us have been looking at how to use the PMU
for virtualization. Rather than continuing to have discussions in
smaller groups, I think it is a good idea we move it to the mailing
lists to see what we can share and avoid duplicate efforts.

There are really two separate things to handle:

1) Add support to perf to allow it to monitor a KVM guest from the
   host.

2) Allow guests access to the PMU (or an emulated PMU), making it
   possible to run perf on applications running within the guest.

I know some of you have been looking at 1) and I am currently working
on 2). I have been looking at various approaches, including whether it
is feasible to share the PMU between the host and multiple guests. For
now I am going to focus on allowing one guest to take control of the
PMU, then later hopefully adding support for multiplexing it between
multiple guests.
   
   Given that perf can apply the PMU to individual host tasks, I don't see 
   fundamental problems multiplexing it between individual guests (which can 
   then internally multiplex it again).
  
  In terms of how to expose it to guests, a 'soft PMU' might be a usable 
  approach. Although to Linux guests you could expose much more 
functionality and a non-PMU-limited number of instrumentation events, via 
  a more intelligent interface.
  
  But note that in terms of handling it on the host side the PMU approach is 
  not acceptable: instead it should map to proper perf_events, not try to 
  muck with the PMU itself.
 
 
  That, besides integrating properly with perf usage on the host, will also 
  allow interesting 'PMU' features on guests: you could set up the host side 
  to trace block IO requests (or VM exits) for example, and expose that as 
  'PMC
  #0' on the guest side.

 So virtualization becomes non-transparent to guest os? I know virtio is an 
 optimization on guest side.

The 'soft PMU' is transparent. The 'count IO events' kind of feature could be 
transparent too: you could re-configure (on the host) a given 'hardware' event 
to really count some software event.

That would make it compatible with whatever guest side tooling (without having 
to change that tooling) - while still allowing interesting new things to be 
measured.
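
As a rough host-side illustration of that remapping (not code from this thread: guest_pmc and GUEST_FAKE_IO_EVENT are made-up names, and the perf_event_create_kernel_counter() signature has varied across kernel versions):

#include <linux/perf_event.h>
#include <linux/sched.h>
#include <linux/err.h>

/* made-up marker value a guest might write to request the remapped event */
#define GUEST_FAKE_IO_EVENT	0x4242

struct guest_pmc {
	struct perf_event *event;	/* host perf_event backing this guest counter */
	u64 eventsel;			/* last value the guest wrote to its control MSR */
};

static int guest_pmc_program(struct guest_pmc *pmc, u64 eventsel)
{
	struct perf_event_attr attr = {
		.type	= PERF_TYPE_RAW,
		.size	= sizeof(attr),
		.config	= eventsel & 0xffff,	/* event select + unit mask */
		.pinned	= 1,
	};
	struct perf_event *event;

	/* re-configure a 'hardware' event to really count a software event */
	if (eventsel == GUEST_FAKE_IO_EVENT) {
		attr.type   = PERF_TYPE_SOFTWARE;
		attr.config = PERF_COUNT_SW_CONTEXT_SWITCHES;
	}

	event = perf_event_create_kernel_counter(&attr, -1, current, NULL, pmc);
	if (IS_ERR(event))
		return PTR_ERR(event);

	if (pmc->event)
		perf_event_release_kernel(pmc->event);

	pmc->event    = event;
	pmc->eventsel = eventsel;
	return 0;
}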

Thanks,

Ingo


Re: KVM PMU virtualization

2010-02-26 Thread Joerg Roedel
On Fri, Feb 26, 2010 at 10:55:17AM +0800, Zhang, Yanmin wrote:
 On Thu, 2010-02-25 at 18:34 +0100, Joerg Roedel wrote:
  On Thu, Feb 25, 2010 at 04:04:28PM +0100, Jes Sorensen wrote:
  
   1) Add support to perf to allow it to monitor a KVM guest from the
  host.
  
  This shouldn't be a big problem. The PMU of AMD Fam10 processors can be
  configured to count only when in guest mode. Perf needs to be aware of
  that and fetch the rip from a different place when monitoring a guest.

 The idea is we want to measure both host and guest at the same time, and
 compare all the hot functions fairly.

So you want to measure while the guest vcpu is running and the vmexit
path of that vcpu (including qemu userspace part) together? The
challenge here is to find out if a performance event originated in guest
mode or in host mode.
But we can check for that in the nmi-protected part of the vmexit path.
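
One simple way to do that attribution, sketched here for illustration (the names are made up; newer kernels handle this through perf's guest-info callbacks), is to mark the window around VMRUN with a per-cpu flag that the PMI handler can test:

#include <linux/percpu.h>
#include <linux/types.h>

static DEFINE_PER_CPU(int, vcpu_in_guest_mode);

/* called right before VMRUN/VMLAUNCH */
static inline void mark_guest_entry(void)
{
	this_cpu_write(vcpu_in_guest_mode, 1);
}

/* called in the nmi-protected part of the vmexit path */
static inline void mark_guest_exit(void)
{
	this_cpu_write(vcpu_in_guest_mode, 0);
}

/* used by the PMI handler to decide whether the sample hit guest code */
static inline bool event_originated_in_guest(void)
{
	return this_cpu_read(vcpu_in_guest_mode) != 0;
}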

Joerg



Re: Enhance perf to support KVM

2010-02-26 Thread Ingo Molnar

* Zhang, Yanmin yanmin_zh...@linux.intel.com wrote:

 2) We couldn't get guest os kernel/user stack data in an easy way, so we 
 might not support callchain feature of tool perf. A work around is KVM 
 copies kernel stack data out, so we could at least support guest os kernel 
 callchain.

If the guest is Linux, KVM can get all the info we need.

While the PMU event itself might trigger in an NMI (where we cannot access 
most of KVM's data structures safely), for this specific case of KVM 
instrumentation we can delay the processing to a more appropriate time - in 
fact we can do it in the KVM thread itself.

We can do that because we just triggered a VM exit, so the VM state is for all 
purposes frozen (as far as this virtual CPU goes).

Which gives us plenty of time and opportunity to piggy back to the KVM 
thread, look up the guest stack, process/fill the MMU cache as we walk the 
guest page tables, etc. etc.

It would need some minimal callback facility towards KVM, triggered by a perf 
event PMI.
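
Such a facility could be as small as a callback struct that KVM registers with perf (an illustrative sketch; the kernel later grew a similar perf_guest_info_callbacks interface, but the names below are not the real API):

struct perf_guest_cbs {
	int (*is_in_guest)(void);		/* did the PMI interrupt a running vcpu? */
	unsigned long (*get_guest_ip)(void);	/* guest RIP saved by the vmexit path */
	void (*defer_guest_unwind)(void *data);	/* walk the guest stack later, from the
						 * KVM thread rather than from the NMI */
};

int perf_register_guest_cbs(struct perf_guest_cbs *cbs);
void perf_unregister_guest_cbs(struct perf_guest_cbs *cbs);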

One additional step needed is to get symbol information from the guest, and to 
integrate it into the symbol cache on the host side in ~/.debug. We already 
support cross-arch symbols and 'perf archive', so the basic facilities are 
there for that. So you can profile on 32-bit PA-RISC and type 'perf report' on 
64-bit x86 and get all the right info.

For this to work across a guest, a gateway is needed towards the guest. 
There's several ways to achieve this. The most practical would be two steps:

 - a user-space facility to access guest images/libraries. (say via ssh, or 
   just a plain TCP port) This would be useful for general 'remote profiling' 
   sessions as well, so it's not KVM specific - it would be useful for remote 
   debugging.

 - The guest /proc/kallsyms (and vmlinux) could be accessed via that channel 
   as well.

(Note that this is purely for guest symbol space access - all the profiling 
data itself comes via the host kernel.)

In theory we could build some sort of 'symbol server' facility into the 
kernel, which could be enabled in guest kernels too - but i suspect existing, 
user-space transports go most of the way already. (the only disadvantage of 
existing transports is that they all have to be configured, enabled and made 
user-accessible, which is one of the few weak points of KVM in general.)

Thanks,

Ingo


Re: KVM PMU virtualization

2010-02-26 Thread Ingo Molnar

* Joerg Roedel j...@8bytes.org wrote:

 On Fri, Feb 26, 2010 at 10:55:17AM +0800, Zhang, Yanmin wrote:
  On Thu, 2010-02-25 at 18:34 +0100, Joerg Roedel wrote:
   On Thu, Feb 25, 2010 at 04:04:28PM +0100, Jes Sorensen wrote:
   
1) Add support to perf to allow it to monitor a KVM guest from the
   host.
   
   This shouldn't be a big problem. The PMU of AMD Fam10 processors can be
   configured to count only when in guest mode. Perf needs to be aware of
   that and fetch the rip from a different place when monitoring a guest.
 
  The idea is we want to measure both host and guest at the same time, and
  compare all the hot functions fairly.
 
 So you want to measure while the guest vcpu is running and the vmexit
 path of that vcpu (including qemu userspace part) together? The
 challenge here is to find out if a performance event originated in guest
 mode or in host mode.
 But we can check for that in the nmi-protected part of the vmexit path.

As far as instrumentation goes, virtualization is simply another 'PID 
dimension' of measurement.

Today we can isolate system performance measurements/events to the following 
domains:

 - per system
 - per cpu
 - per task

( Note that PowerPC already supports certain sorts of 'hypervisor/kernel/user' 
  domain separation, and we have some ABI details for all that but it's by no 
  means complete. Anton is using the PowerPC bits AFAIK, so it already works 
  to a certain degree. )

When extending measurements to KVM, we want two things:

 - user friendliness: instead of having to check 'ps' and figure out which 
   Qemu thread is the KVM thread we want to profile, just give a convenience
   namespace to access guest profiling info. -G ought to map to the first
   currently running KVM guest it can find. (which would match like 90% of the
   cases) - etc. No ifs and when. If 'perf kvm top' doesnt show something 
   useful by default the whole effort is for naught.

 - Extend core facilities and enable the following measurement dimensions:

 host-kernel-space
 host-user-space
 guest-kernel-space
 guest-user-space

   on a per guest basis. We want to be able to measure just what the guest 
   does, and we want to be able to measure just what the host does.

   Some of this the hardware helps us with (say only measuring host kernel 
   events is possible), some has to be done by fiddling with event 
   enable/disable at vm-exit / vm-entry time.

My suggestion, as always, would be to start very simple and very minimal:

Enable 'perf kvm top' to show guest overhead. Use the exact same kernel image 
both as a host and as guest (for testing), to not have to deal with the symbol 
space transport problem initially. Enable 'perf kvm record' to only record 
guest events by default. Etc.

This alone will be a quite useful result already - and gives a basis for 
further work. No need to spend months to do the big grand design straight 
away, all of this can be done gradually and in the order of usefulness - and 
you'll always have something that actually works (and helps your other KVM 
projects) along the way.

[ And, as so often, once you walk that path, that grand scheme you are 
  thinking about right now might easily become last year's really bad idea ;-) ]

So please start walking the path and experience the challenges first-hand.

Thanks,

Ingo


Re: KVM PMU virtualization

2010-02-26 Thread Avi Kivity

On 02/26/2010 10:42 AM, Ingo Molnar wrote:

* Joerg Roedel j...@8bytes.org wrote:

   

I personally don't like a self-defined event-set as the only solution
because that would probably only work with linux and perf. [...]
 

The 'soft-PMU' i suggested is transparent on the guest side - if you want to
enable non-Linux and legacy-Linux.

It's basically a PMU interface provided to the guest by catching the right MSR
accesses, implemented via perf_event_create_kernel_counter()/etc. on the host
side.
   


That only works if the software interface is 100% lossless - we can 
recreate every single hardware configuration through the API.  Is this 
the case?



Note that the 'soft PMU' still sucks from a design POV as there's no generic
hw interface to the PMU. So there would have to be a 'soft AMD' and a 'soft
Intel' PMU driver at minimum.
   


Right, this will severely limit migration domains to hosts of the same 
vendor and processor generation.  There is a  middle ground, though, 
Intel has recently moved to define an architectural pmu which is not 
model specific.  I don't know if AMD adopted it.  We could offer both 
options - native host capabilities, with a loss of compatibility, and 
the architectural pmu, with loss of model specific counters.



Far cleaner would be to expose it via hypercalls to guest OSs that are
interested in instrumentation.


It's also slower - you can give the guest direct access to the various 
counters so no exits are taken when reading the counters (though perhaps 
many tools are only interested in the interrupts, not the counter values).
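
For illustration, reading a passed-through counter from inside the guest needs nothing more than RDPMC (counter index 0 is just an example; user-space access additionally requires CR4.PCE to be set):

#include <stdint.h>

static inline uint64_t read_pmc(uint32_t counter)
{
	uint32_t lo, hi;

	/* no VM exit is taken if the counters are passed through to the guest */
	asm volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(counter));
	return ((uint64_t)hi << 32) | lo;
}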



That way it could also transparently integrate
with tracing, probes, etc. It would also be wiser to first concentrate on
improving Linux-Linux guest/host combos before gutting the design just to
fit Windows into the picture ...
   


gutting the design?

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [PATCH 1/5] KVM: SVM: Move msrpm offset calculation to seperate function

2010-02-26 Thread Avi Kivity

On 02/25/2010 07:15 PM, Joerg Roedel wrote:

The algorithm to find the offset in the msrpm for a given
msr is needed at other places too. Move that logic to its
own function.

  #define MAX_INST_SIZE 15

@@ -417,23 +439,22 @@ err_1:
  static void set_msr_interception(u32 *msrpm, unsigned msr,
 int read, int write)
  {
-   int i;
+   u8 bit_read, bit_write;
+   unsigned long tmp;
+   u32 offset;

-   for (i = 0; i < NUM_MSR_MAPS; i++) {
-   if (msr >= msrpm_ranges[i] &&
-   msr < msrpm_ranges[i] + MSRS_IN_RANGE) {
-   u32 msr_offset = (i * MSRS_IN_RANGE + msr -
- msrpm_ranges[i]) * 2;
-
-   u32 *base = msrpm + (msr_offset / 32);
-   u32 msr_shift = msr_offset % 32;
-   u32 mask = ((write) ? 0 : 2) | ((read) ? 0 : 1);
-   *base = (*base & ~(0x3 << msr_shift)) |
-   (mask << msr_shift);
-   return;
-   }
-   }
-   BUG();
+   offset    = svm_msrpm_offset(msr);
+   bit_read  = 2 * (msr & 0x0f);
+   bit_write = 2 * (msr & 0x0f) + 1;
+
+   BUG_ON(offset == MSR_INVALID);
+
+   tmp = msrpm[offset];
+
+   read  ? clear_bit(bit_read, &tmp) : set_bit(bit_read, &tmp);
+   write ? clear_bit(bit_write, &tmp) : set_bit(bit_write, &tmp);
+
+   msrpm[offset] = tmp;
  }
   


This can fault - set_bit() accesses an unsigned long, which can be 8 
bytes, while offset can point into the last u32 of msrpm.  So this needs 
either to revert to u32 shift/mask ops or msrpm be changed to a ulong 
array (actually better, since bitmaps in general are defined as arrays 
of ulongs).


btw, the op-level ternary expression is terrible, relying solely on 
*_bit()'s side effects.  Please convert to an ordinary if.


btw2, use __set_bit(); an atomic operation is not needed.
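
Taken together, the function could end up looking roughly like this (an illustrative sketch, not the patch that was eventually merged): the bit twiddling stays inside a local unsigned long so it can never run past the end of the u32 msrpm array, the ternaries become plain ifs, and the non-atomic __set_bit()/__clear_bit() variants are used.

static void set_msr_interception(u32 *msrpm, unsigned msr, int read, int write)
{
	unsigned long tmp;
	u8 bit_read, bit_write;
	u32 offset;

	offset    = svm_msrpm_offset(msr);
	bit_read  = 2 * (msr & 0x0f);
	bit_write = 2 * (msr & 0x0f) + 1;

	BUG_ON(offset == MSR_INVALID);

	tmp = msrpm[offset];

	if (read)
		__clear_bit(bit_read, &tmp);	/* reads are allowed, not intercepted */
	else
		__set_bit(bit_read, &tmp);

	if (write)
		__clear_bit(bit_write, &tmp);
	else
		__set_bit(bit_write, &tmp);

	msrpm[offset] = tmp;
}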

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [PATCH 1/5] KVM: SVM: Move msrpm offset calculation to seperate function

2010-02-26 Thread Joerg Roedel
On Fri, Feb 26, 2010 at 12:20:10PM +0200, Avi Kivity wrote:
 On 02/25/2010 07:15 PM, Joerg Roedel wrote:
 The algorithm to find the offset in the msrpm for a given
 msr is needed at other places too. Move that logic to its
 own function.
 
   #define MAX_INST_SIZE 15
 
 @@ -417,23 +439,22 @@ err_1:
   static void set_msr_interception(u32 *msrpm, unsigned msr,
   int read, int write)
   {
 -int i;
 +u8 bit_read, bit_write;
 +unsigned long tmp;
 +u32 offset;
 
 -for (i = 0; i < NUM_MSR_MAPS; i++) {
 -if (msr >= msrpm_ranges[i] &&
 -msr < msrpm_ranges[i] + MSRS_IN_RANGE) {
 -u32 msr_offset = (i * MSRS_IN_RANGE + msr -
 -  msrpm_ranges[i]) * 2;
 -
 -u32 *base = msrpm + (msr_offset / 32);
 -u32 msr_shift = msr_offset % 32;
 -u32 mask = ((write) ? 0 : 2) | ((read) ? 0 : 1);
 -*base = (*base & ~(0x3 << msr_shift)) |
 -(mask << msr_shift);
 -return;
 -}
 -}
 -BUG();
 +offset    = svm_msrpm_offset(msr);
 +bit_read  = 2 * (msr & 0x0f);
 +bit_write = 2 * (msr & 0x0f) + 1;
 +
 +BUG_ON(offset == MSR_INVALID);
 +
 +tmp = msrpm[offset];
 +
 +read  ? clear_bit(bit_read, &tmp) : set_bit(bit_read, &tmp);
 +write ? clear_bit(bit_write, &tmp) : set_bit(bit_write, &tmp);
 +
 +msrpm[offset] = tmp;
   }
 
 This can fault - set_bit() accesses an unsigned long, which can be 8
 bytes, while offset can point into the last u32 of msrpm.  So this
 needs either to revert to u32 shift/mask ops or msrpm be changed to
 a ulong array (actually better, since bitmaps in general are defined
 as arrays of ulongs).

Ah true, I will fix that. Thanks.

 btw, the op-level ternary expression is terrible, relying solely on
 *_bit()'s side effects.  Please convert to an ordinary if.
 
 btw2, use __set_bit(); an atomic operation is not needed.

Right, will switch to __set_bit and __clear_bit.

Joerg




Re: [PATCH 2/5] KVM: SVM: Optimize nested svm msrpm merging

2010-02-26 Thread Avi Kivity

On 02/25/2010 07:15 PM, Joerg Roedel wrote:

This patch optimizes the way the msrpm of the host and the
guest are merged. The old code merged the 2 msrpm pages
completely. This code needed to touch 24kb of memory for that
operation. The optimized variant this patch introduces
merges only the parts where the host msrpm may contain zero
bits. This reduces the amount of memory which is touched to
48 bytes.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
---
  arch/x86/kvm/svm.c |   67 +---
  1 files changed, 58 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index d8d4e35..d15e0ea 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -92,6 +92,9 @@ struct nested_state {

  };

+#define MSRPM_OFFSETS  16
+static u32 msrpm_offsets[MSRPM_OFFSETS] __read_mostly;
+
  struct vcpu_svm {
struct kvm_vcpu vcpu;
struct vmcb *vmcb;
@@ -436,6 +439,34 @@ err_1:

  }

+static void add_msr_offset(u32 offset)
+{
+   u32 old;
+   int i;
+
+again:
+   for (i = 0; i < MSRPM_OFFSETS; ++i) {
+   old = msrpm_offsets[i];
+
+   if (old == offset)
+   return;
+
+   if (old != MSR_INVALID)
+   continue;
+
+   if (cmpxchg(&msrpm_offsets[i], old, offset) != old)
+   goto again;
+
+   return;
+   }
+
+   /*
+* If this BUG triggers the msrpm_offsets table has an overflow. Just
+* increase MSRPM_OFFSETS in this case.
+*/
+   BUG();
+}
   


Why all this atomic cleverness?  The possible offsets are all determined 
statically.  Even if you do them dynamically (makes sense when 
considering pmu passthrough), it's per-vcpu and therefore single 
threaded (just move msrpm_offsets into vcpu context).
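
Without concurrent writers the loop indeed collapses to a plain linear insert, roughly like this (a sketch of the suggested simplification, not the final patch):

static void add_msr_offset(u32 offset)
{
	int i;

	for (i = 0; i < MSRPM_OFFSETS; ++i) {
		if (msrpm_offsets[i] == offset)
			return;				/* already recorded */

		if (msrpm_offsets[i] == MSR_INVALID) {
			msrpm_offsets[i] = offset;	/* first free slot */
			return;
		}
	}

	/* msrpm_offsets table overflow - increase MSRPM_OFFSETS */
	BUG();
}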



@@ -1846,20 +1882,33 @@ static int nested_svm_vmexit(struct vcpu_svm *svm)

  static bool nested_svm_vmrun_msrpm(struct vcpu_svm *svm)
  {
-   u32 *nested_msrpm;
-   struct page *page;
+   /*
+* This function merges the msr permission bitmaps of kvm and the
 +* nested vmcb. It is optimized in that it only merges the parts where
+* the kvm msr permission bitmap may contain zero bits
+*/
   


A comment that describes the entire function can be moved above the 
function, freeing a whole tab stop for contents.



--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [PATCH 3/5] KVM: SVM: Use svm_msrpm_offset in nested_svm_exit_handled_msr

2010-02-26 Thread Avi Kivity

On 02/25/2010 07:15 PM, Joerg Roedel wrote:

There is a generic function now to calculate msrpm offsets.
Use that function in nested_svm_exit_handled_msr() and remove
the duplicate logic.

   


Hm, if the function would also calculate the mask, then it would be 
useful for set_msr_interception() as well.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [PATCH 4/5] KVM: SVM: Add correct handling of nested iopm

2010-02-26 Thread Avi Kivity

On 02/25/2010 07:15 PM, Joerg Roedel wrote:

This patch adds the correct handling of the nested io
permission bitmap. Old behavior was to not lookup the port
in the iopm but only reinject an io intercept to the guest.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
---
  arch/x86/kvm/svm.c |   25 +
  1 files changed, 25 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index bb75a44..3859e2c 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -78,6 +78,7 @@ struct nested_state {

/* gpa pointers to the real vectors */
u64 vmcb_msrpm;
+   u64 vmcb_iopm;

/* A VMEXIT is required but not yet emulated */
bool exit_required;
@@ -1603,6 +1604,26 @@ static void nested_svm_unmap(struct page *page)
kvm_release_page_dirty(page);
  }

+static int nested_svm_intercept_ioio(struct vcpu_svm *svm)
+{
+   unsigned port;
+   u8 val, bit;
+   u64 gpa;
+
+   if (!(svm->nested.intercept & (1ULL << INTERCEPT_IOIO_PROT)))
+   return NESTED_EXIT_HOST;
+
+   port = svm->vmcb->control.exit_info_1 >> 16;
+   gpa  = svm->nested.vmcb_iopm + (port / 8);
+   bit  = port % 8;
+   val  = 0;
+
+   if (kvm_read_guest(svm->vcpu.kvm, gpa, &val, 1))
+   val &= (1 << bit);
+
+   return val ? NESTED_EXIT_DONE : NESTED_EXIT_HOST;
+}
+
   


A kvm_{test,set,clear}_guest_bit() would be useful, we have several 
users already (not a requirement for this patchset).
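
Such a helper might look roughly like this (illustrative only, not an existing KVM export): read one byte of guest memory and test a bit in it, as the iopm lookup above does by hand.

static int kvm_test_guest_bit(struct kvm *kvm, gpa_t gpa, unsigned int nr)
{
	u8 byte;
	int r;

	r = kvm_read_guest(kvm, gpa + nr / 8, &byte, 1);
	if (r < 0)
		return r;			/* propagate the read error */

	return (byte >> (nr % 8)) & 1;
}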


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: Enhance perf to support KVM

2010-02-26 Thread Ingo Molnar

* Avi Kivity a...@redhat.com wrote:

 On 02/26/2010 11:01 AM, Ingo Molnar wrote:
 * Zhang, Yanmin yanmin_zh...@linux.intel.com wrote:
 
 2) We couldn't get guest os kernel/user stack data in an easy way, so we
 might not support callchain feature of tool perf. A work around is KVM
 copies kernel stack data out, so we could at least support guest os kernel
 callchain.
 If the guest is Linux, KVM can get all the info we need.
 
 While the PMU event itself might trigger in an NMI (where we cannot access
 most of KVM's data structures safely), for this specific case of KVM
 instrumentation we can delay the processing to a more appropriate time - in
 fact we can do it in the KVM thread itself.
 
 The nmi will be a synchronous event: it happens in guest context,
 and we program the hardware to intercept nmis, so we just get an
 exit telling us that an nmi has happened.
 
 (would also be interesting to allow the guest to process the nmi
 directly in some scenarios, though that would require that there be
 no nmi sources on the host).
 
 We can do that because we just triggered a VM exit, so the VM state is for 
 all
 purposes frozen (as far as this virtual CPU goes).
 
 Yes.
 
 Which gives us plenty of time and opportunity to piggy back to the KVM
 thread, look up the guest stack, process/fill the MMU cache as we walk the
 guest page tables, etc. etc.
 
 It would need some minimal callback facility towards KVM, triggered by a perf
 event PMI.
 
 Since the event is synchronous and kvm is aware of it we don't need
 a callback; kvm can call directly into perf with all the
 information.

Yes - it's still a callback in the abstract sense. Much of it already exists.

 One additional step needed is to get symbol information from the guest, and 
 to
 integrate it into the symbol cache on the host side in ~/.debug. We already
 support cross-arch symbols and 'perf archive', so the basic facilities are
 there for that. So you can profile on 32-bit PA-RISC and type 'perf report' 
 on
 64-bit x86 and get all the right info.
 
 For this to work across a guest, a gateway is needed towards the guest.
 There's several ways to achieve this. The most practical would be two steps:
 
   - a user-space facility to access guest images/libraries. (say via ssh, or
 just a plain TCP port) This would be useful for general 'remote 
  profiling'
 sessions as well, so it's not KVM specific - it would be useful for 
  remote
 debugging.
 
   - The guest /proc/kallsyms (and vmlinux) could be accessed via that channel
 as well.
 
 (Note that this is purely for guest symbol space access - all the profiling
 data itself comes via the host kernel.)
 
 In theory we could build some sort of 'symbol server' facility into the
 kernel, which could be enabled in guest kernels too - but i suspect existing,
 user-space transports go most of the way already.
 
 There is also vmchannel aka virtio-serial, a guest-to-host communication 
 channel.

Basically what is needed is plain filesystem access - properly privileged. So 
doing this via a vmchannel would be nice, but for the symbol extraction it 
would be a glorified NFS server in essence.

Do you have (or plan) any turn-key 'access to all files of the guest' kind of 
guest-transparent facility that could be used for such purposes? That would 
have various advantages over a traditional explicit file server approach:

 - it would not contaminate the guest port space

 - no guest side configuration needed (the various oprofile remote daemons 
   always sucked as they needed extra setup)

 - it might even be used with a guest that does no networking

 - if done fully in the kernel it could be done with a fully 'unaware' guest, 
etc.

Thanks,

Ingo


Re: KVM PMU virtualization

2010-02-26 Thread Joerg Roedel
On Fri, Feb 26, 2010 at 11:46:34AM +0200, Avi Kivity wrote:
 On 02/26/2010 10:42 AM, Ingo Molnar wrote:
 Note that the 'soft PMU' still sucks from a design POV as there's no generic
 hw interface to the PMU. So there would have to be a 'soft AMD' and a 'soft
 Intel' PMU driver at minimum.


 Right, this will severely limit migration domains to hosts of the same  
 vendor and processor generation.  There is a  middle ground, though,  
 Intel has recently moved to define an architectural pmu which is not  
 model specific.  I don't know if AMD adopted it.  We could offer both  
 options - native host capabilities, with a loss of compatibility, and  
 the architectural pmu, with loss of model specific counters.

I only had a quick look yet on the architectural pmu from intel but it
looks like it can be emulated for a guest on amd using existing
features.

Joerg



Re: KVM PMU virtualization

2010-02-26 Thread Joerg Roedel
On Fri, Feb 26, 2010 at 10:17:32AM +0100, Ingo Molnar wrote:
 My suggestion, as always, would be to start very simple and very minimal:
 
 Enable 'perf kvm top' to show guest overhead. Use the exact same kernel image 
 both as a host and as guest (for testing), to not have to deal with the 
 symbol 
 space transport problem initially. Enable 'perf kvm record' to only record 
 guest events by default. Etc.
 
 This alone will be a quite useful result already - and gives a basis for 
 further work. No need to spend months to do the big grand design straight 
 away, all of this can be done gradually and in the order of usefulness - and 
 you'll always have something that actually works (and helps your other KVM 
 projects) along the way.
 
 [ And, as so often, once you walk that path, that grand scheme you are 
   thinking about right now might easily become last year's really bad idea 
 ;-) ]
 
 So please start walking the path and experience the challenges first-hand.

That sounds like a good approach for the 'measure-guest-from-host'
problem. It is also not very hard to implement. Where does perf fetch
the rip of the nmi from, stack only or is this configurable?

Joerg



Re: KVM PMU virtualization

2010-02-26 Thread Ingo Molnar

* Avi Kivity a...@redhat.com wrote:

 Right, this will severely limit migration domains to hosts of the same 
 vendor and processor generation.  There is a middle ground, though, Intel 
 has recently moved to define an architectural pmu which is not model 
 specific.  I don't know if AMD adopted it. [...]

Nope. It's architectural the following way: Intel wont change it with future 
CPU models, outside of the definitions of the hw-ABI. PMUs were model specific 
prior to that time.

I'd say there's near zero chance the MSR spaces will unify. All the 'advanced' 
PMU features are wildly incompatible, and the gap is increasing not 
decreasing.

  Far cleaner would be to expose it via hypercalls to guest OSs that are 
  interested in instrumentation.
 
 It's also slower - you can give the guest direct access to the various 
 counters so no exits are taken when reading the counters (though perhaps 
 many tools are only interested in the interrupts, not the counter values).

Direct access to counters is not something that is a big issue. [ Given that i 
sometimes can see KVM redraw the screen of a guest OS real-time i doubt this 
is the biggest of performance challenges right now ;-) ]

By far the biggest instrumentation issue is:

 - availability
 - usability
 - flexibility

Exposing the raw hw is a step backwards in many regards. The same way we dont 
want to expose chipsets to the guest to allow them to do RAS. The same way we 
dont want to expose most raw PCI devices to guest in general, but have all 
these virt driver abstractions.

  That way it could also transparently integrate with tracing, probes, etc. 
  It would also be wiser to first concentrate on improving Linux-Linux 
  guest/host combos before gutting the design just to fit Windows into the 
  picture ...
 
 gutting the design?

Yes, gutting the design of a sane instrumentation API and moving it back 10-20 
years by squeezing it through non-standardized and incompatible PMU drivers.

When it comes to design my main interest is the Linux-Linux combo.

Ingo


Re: KVM PMU virtualization

2010-02-26 Thread Ingo Molnar

* Joerg Roedel j...@8bytes.org wrote:

 On Fri, Feb 26, 2010 at 11:46:34AM +0200, Avi Kivity wrote:
  On 02/26/2010 10:42 AM, Ingo Molnar wrote:
  Note that the 'soft PMU' still sucks from a design POV as there's no 
  generic
  hw interface to the PMU. So there would have to be a 'soft AMD' and a 'soft
  Intel' PMU driver at minimum.
 
 
  Right, this will severely limit migration domains to hosts of the same  
  vendor and processor generation.  There is a  middle ground, though,  
  Intel has recently moved to define an architectural pmu which is not  
  model specific.  I don't know if AMD adopted it.  We could offer both  
  options - native host capabilities, with a loss of compatibility, and  
  the architectural pmu, with loss of model specific counters.
 
 I only had a quick look yet on the architectural pmu from intel but it looks 
 like it can be emulated for a guest on amd using existing features.

AMD CPUs dont have enough events for that, they cannot do the 3 fixed events 
in addition to the 2 generic ones.

Nor do you really want to standardize on KVM guests on returning 
'GenuineIntel' in CPUID, so that the various guest side OSs use the Intel PMU 
drivers, right?

Ingo


Re: Enhance perf to support KVM

2010-02-26 Thread Avi Kivity

On 02/26/2010 12:35 PM, Ingo Molnar wrote:



One additional step needed is to get symbol information from the guest, and to
integrate it into the symbol cache on the host side in ~/.debug. We already
support cross-arch symbols and 'perf archive', so the basic facilities are
there for that. So you can profile on 32-bit PA-RISC and type 'perf report' on
64-bit x86 and get all the right info.

For this to work across a guest, a gateway is needed towards the guest.
There's several ways to achieve this. The most practical would be two steps:

  - a user-space facility to access guest images/libraries. (say via ssh, or
just a plain TCP port) This would be useful for general 'remote profiling'
sessions as well, so it's not KVM specific - it would be useful for remote
debugging.

  - The guest /proc/kallsyms (and vmlinux) could be accessed via that channel
as well.

(Note that this is purely for guest symbol space access - all the profiling
data itself comes via the host kernel.)

In theory we could build some sort of 'symbol server' facility into the
kernel, which could be enabled in guest kernels too - but i suspect existing,
user-space transports go most of the way already.
   

There is also vmchannel aka virtio-serial, a guest-to-host communication
channel.
 

Basically what is needed is plain filesystem access - properly privileged. So
doing this via a vmchannel would be nice, but for the symbol extraction it
would be a glorified NFS server in essence.
   


Well, we could run an nfs server over vmchannel, or over a private 
network interface.



Do you have (or plan) any turn-key 'access to all files of the guest' kind of
guest-transparent facility that could be used for such purposes?


Not really.  The guest and host admins are usually different people, who 
may, being admins, even actively hate each other.  The guest admin would 
probably regard it as a security hole.  It's probably useful for the 
single-host scenario, and of course for developers.


I guess sshfs can fill this role, with one command it gives you secure 
access to all guest files, provided you have the proper credentials.



That would
have various advantages over a traditional explicit file server approach:

  - it would not contaminate the guest port space

  - no guest side configuration needed (the various oprofile remote daemons
always sucked as they needed extra setup)

  - it might even be used with a guest that does no networking

  - if done fully in the kernel it could be done with a fully 'unaware' guest, 
etc.
   


Seems sshfs fulfils the first two.  For the latter, we could do a 
vmchannelfs, but it seems quite a bit of work, and would require fairly 
new guest kernels, whereas sshfs would work out of the box on 10 year 
old guests and can be easily made to work on Windows.


Somewhat related, see libguestfs/guestfish, though that provides offline 
access only.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: KVM PMU virtualization

2010-02-26 Thread Avi Kivity

On 02/26/2010 12:46 PM, Ingo Molnar wrote:



Right, this will severely limit migration domains to hosts of the same
vendor and processor generation.  There is a  middle ground, though,
Intel has recently moved to define an architectural pmu which is not
model specific.  I don't know if AMD adopted it.  We could offer both
options - native host capabilities, with a loss of compatibility, and
the architectural pmu, with loss of model specific counters.
   

I only had a quick look yet on the architectural pmu from intel but it looks
like it can be emulated for a guest on amd using existing features.
 

AMD CPUs dont have enough events for that, they cannot do the 3 fixed events
in addition to the 2 generic ones.

Nor do you really want to standardize on KVM guests on returning
'GenuineIntel' in CPUID, so that the various guest side OSs use the Intel PMU
drivers, right?

   


No - that would only work if AMD also adopted the architectural pmu.

Note virtualization clusters are typically split into 'migration pools' 
consisting of hosts with similar processor features, so that you can 
expose those features and yet live migrate guests at will.  It's likely 
that all hosts have the same pmu anyway, so the only downside is that we 
now have to expose the host's processor family and model.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: KVM PMU virtualization

2010-02-26 Thread Ingo Molnar

* Joerg Roedel j...@8bytes.org wrote:

 On Fri, Feb 26, 2010 at 10:17:32AM +0100, Ingo Molnar wrote:
  My suggestion, as always, would be to start very simple and very minimal:
  
  Enable 'perf kvm top' to show guest overhead. Use the exact same kernel 
  image 
  both as a host and as guest (for testing), to not have to deal with the 
  symbol 
  space transport problem initially. Enable 'perf kvm record' to only record 
  guest events by default. Etc.
  
  This alone will be a quite useful result already - and gives a basis for 
  further work. No need to spend months to do the big grand design straight 
  away, all of this can be done gradually and in the order of usefulness - 
  and 
  you'll always have something that actually works (and helps your other KVM 
  projects) along the way.
  
  [ And, as so often, once you walk that path, that grand scheme you are 
thinking about right now might easily become last year's really bad idea 
  ;-) ]
  
  So please start walking the path and experience the challenges first-hand.
 
 That sounds like a good approach for the 'measure-guest-from-host'
 problem. It is also not very hard to implement. Where does perf fetch
 the rip of the nmi from, stack only or is this configurable?

The host semantics are that it takes the stack from the regs, and with 
call-graph recording (perf record -g) it will walk down the exception stack, 
irq stack, kernel stack, and user-space stack as well. (up to the point the 
pages are present - it stops on a non-present page. An app that is being 
profiled has its stack present so it's not an issue in practice.)

I'd suggest to leave out call graph sampling initially, and just get 'perf kvm 
top' to work with guest RIPs, simply sampled from the VM exit state.

See arch/x86/kernel/cpu/perf_event.c:

static void
perf_callchain_kernel(struct pt_regs *regs, struct perf_callchain_entry *entry)
{
callchain_store(entry, PERF_CONTEXT_KERNEL);
callchain_store(entry, regs->ip);

dump_trace(NULL, regs, NULL, regs->bp, &backtrace_ops, entry);
}

If you have easy access to the VM state from NMI context right there then just 
hack in the guest RIP and you should have some prototype that samples the 
guest. (assuming you use the same kernel image for both the host and the guest)

This would be the easiest way to prototype it all.
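
As an illustration of that prototype hack (kvm_sample_is_guest() and kvm_get_guest_ip() stand in for whatever hook exposes the per-cpu VM-exit state; they are not existing functions):

static void
perf_callchain_kernel(struct pt_regs *regs, struct perf_callchain_entry *entry)
{
	unsigned long ip = regs->ip;

	/* if the PMI hit while a vcpu was in guest mode, report the guest RIP
	 * captured at VM exit instead of the host interrupt frame's RIP */
	if (kvm_sample_is_guest())
		ip = kvm_get_guest_ip();

	callchain_store(entry, PERF_CONTEXT_KERNEL);
	callchain_store(entry, ip);

	/* guest call-graph walking is left out initially, as suggested above */
	dump_trace(NULL, regs, NULL, regs->bp, &backtrace_ops, entry);
}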

Ingo


Re: KVM PMU virtualization

2010-02-26 Thread Jes Sorensen

On 02/25/10 17:26, Ingo Molnar wrote:

Given that perf can apply the PMU to individual host tasks, I don't see
fundamental problems multiplexing it between individual guests (which can
then internally multiplex it again).


In terms of how to expose it to guests, a 'soft PMU' might be a usable
approach. Although to Linux guests you could expose much more functionality
and a non-PMU-limited number of instrumentation events, via a more
intelligent interface.

But note that in terms of handling it on the host side the PMU approach is not
acceptable: instead it should map to proper perf_events, not try to muck with
the PMU itself.


I am not keen on emulating the PMU, if we do that we end up having to
emulate a large number of MSR accesses, which is really costly. It makes
a lot more sense to give the guest direct access to the PMU. The problem
here is how to manage it without too much overhead.

Cheers,
Jes



Re: KVM PMU virtualization

2010-02-26 Thread Joerg Roedel
On Fri, Feb 26, 2010 at 11:46:59AM +0100, Ingo Molnar wrote:
 
 * Joerg Roedel j...@8bytes.org wrote:
 
  On Fri, Feb 26, 2010 at 11:46:34AM +0200, Avi Kivity wrote:
   On 02/26/2010 10:42 AM, Ingo Molnar wrote:
   Note that the 'soft PMU' still sucks from a design POV as there's no 
   generic
   hw interface to the PMU. So there would have to be a 'soft AMD' and a 
   'soft
   Intel' PMU driver at minimum.
  
  
   Right, this will severely limit migration domains to hosts of the same  
   vendor and processor generation.  There is a  middle ground, though,  
   Intel has recently moved to define an architectural pmu which is not  
   model specific.  I don't know if AMD adopted it.  We could offer both  
   options - native host capabilities, with a loss of compatibility, and  
   the architectural pmu, with loss of model specific counters.
  
  I only had a quick look yet on the architectural pmu from intel but it 
  looks 
  like it can be emulated for a guest on amd using existing features.
 
 AMD CPUs dont have enough events for that, they cannot do the 3 fixed events 
 in addition to the 2 generic ones.

Good point. Maybe we can emulate that with some counter round-robin
usage if the guest really uses all 5 counters.

 Nor do you really want to standardize on KVM guests on returning 
 'GenuineIntel' in CPUID, so that the various guest side OSs use the Intel PMU 
 drivers, right?

Isn't there a cpuid bit indicating the availability of architectural
perfmon?

Joerg



Re: KVM PMU virtualization

2010-02-26 Thread Avi Kivity

On 02/26/2010 12:44 PM, Ingo Molnar wrote:

Far cleaner would be to expose it via hypercalls to guest OSs that are
interested in instrumentation.
   

It's also slower - you can give the guest direct access to the various
counters so no exits are taken when reading the counters (though perhaps
many tools are only interested in the interrupts, not the counter values).
 

Direct access to counters is not something that is a big issue. [ Given that i
sometimes can see KVM redraw the screen of a guest OS real-time i doubt this
is the biggest of performance challenges right now ;-) ]
   


Outside 4-bit vga mode, this shouldn't happen.  Can you describe your 
scenario?



By far the biggest instrumentation issue is:

  - availability
  - usability
  - flexibility

Exposing the raw hw is a step backwards in many regards.


In a way, virtualization as a whole is a step backwards.  We take the 
nice filesystem/timer/network/scheduler APIs, and expose them as raw 
hardware.  The pmu isn't any different.




The same way we dont
want to expose chipsets to the guest to allow them to do RAS. The same way we
dont want to expose most raw PCI devices to guest in general, but have all
these virt driver abstractions.
   


Whenever we have a choice, we expose raw hardware (usually emulated, but 
in some cases real).  Raw hardware has the huge advantage of being 
already supported.  Write a software abstraction, and you get to (a) 
write and maintain the spec (b) write drivers for all guests (c) mumble 
something to users of OSes to which you haven't ported your driver (d) 
explain to users that they need to install those drivers.


For networking and block, it is simply impossible to obtain good 
performance without introducing a new interface, but for other stuff, 
that may not be the case.



That way it could also transparently integrate with tracing, probes, etc.
It would also be wiser to first concentrate on improving Linux-Linux
guest/host combos before gutting the design just to fit Windows into the
picture ...
   

gutting the design?
 

Yes, gutting the design of a sane instrumentation API and moving it back 10-20
years by squeezing it through non-standardized and incompatible PMU drivers.
   


Any new interface will be incompatible to all the exiting guests out 
there; and unlike networking, you can't retrofit a pmu interface to an 
existing guest.



When it comes to design my main interest is the Linux-Linux combo.
   


My main interest is the OSes that users actually install, and those are 
Windows and non-bleeding-edge Linux.


Look at guests as you do at userspace: you don't want to inflict changes 
upon them.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: Enhance perf to support KVM

2010-02-26 Thread Ingo Molnar

* Avi Kivity a...@redhat.com wrote:

  Do you have (or plan) any turn-key 'access to all files of the guest' kind 
  of guest-transparent facility that could be used for such purposes?
 
 Not really.  The guest and host admins are usually different people, who 
 may, being admins, even actively hate each other.  The guest admin would 
 probably regard it as a security hole.  It's probably useful for the 
 single-host scenario, and of course for developers.

Sounds like an exceedingly silly argument to me - the host admin is the king 
in any case.

Your argument boils down to: 'dont offer transparent, turn-key solutions 
because some might object to the functionality they offer for all the wrong 
reasons'. Which does not withstand elementary scrutiny.

This is a basic usability issue, and affects many parts of the KVM universe.

Really, it's by far the most fubar-ed notion of KVM. You are pushing _way_ too 
much to user-space into different modules and maintenance domains, and 
user-space forks those bits, fragments, diverts, delays and messes up basic 
features in the usual fashion.

The result is a basic out-of-box virtualization experience that sucks even 
these days.

Nobody is really 'in charge' of how KVM gets delivered to the user. You 
isolated the fun kernel part for you and pushed out the boring bits to 
user-space. So if mundane things like mouse integration sucks 'hey that's a 
user-space tooling problem', if file integration sucks then 'hey, that's an 
admin problem', if it cannot be used over the network 'hey, that's an Xorg 
problem', etc. etc.

You basically have given up control over the quality of KVM by pushing so many 
aspects of it to user-space and letting it rot there.

Sure the design looks somewhat cleaner on paper, but if the end result is not 
helped by it then over-modularization sure can hurt ...

( Note that i dont mind user-space tooling per se, as long as it sits together 
  with the kernel bits and gets developed, packaged and given to the user in 
  the same domain. )

And that's a key conceptual area were tools/perf/ differs: it's an integrated, 
turn-key solution that you can really rely on. We take responsibility for the 
full thing, no ifs and when. And if you cannot rely on your instrumentation 
tooling as a single unit you cannot use it, simple as that. (that is a key 
mistake Oprofile made a decade ago too btw.)

So i can see some upcoming culture friction with standing KVM principles there 
;-)

Ingo


Re: list_add corruption?

2010-02-26 Thread Avi Kivity

On 02/26/2010 06:57 AM, Zachary Amsden wrote:
Anyone seeing list_add corruption running qemu-kvm with -smp 2 on 
Intel hardware?


Debugging some local changes, which don't appear related.  Running 
module from latest git on F12.




Can you post a trace?  Which list appears to be involved?

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: KVM PMU virtualization

2010-02-26 Thread Jes Sorensen

On 02/26/10 12:06, Joerg Roedel wrote:

Isn't there a cpuid bit indicating the availability of architectural
perfmon?


Nope, the perfmon flag is a fake Linux flag, set based on the contents
of cpuid 0x0a

Jes



Re: KVM PMU virtualization

2010-02-26 Thread Ingo Molnar

* Joerg Roedel j...@8bytes.org wrote:

 On Fri, Feb 26, 2010 at 11:46:59AM +0100, Ingo Molnar wrote:
  
  * Joerg Roedel j...@8bytes.org wrote:
  
   On Fri, Feb 26, 2010 at 11:46:34AM +0200, Avi Kivity wrote:
On 02/26/2010 10:42 AM, Ingo Molnar wrote:
Note that the 'soft PMU' still sucks from a design POV as there's no 
generic
hw interface to the PMU. So there would have to be a 'soft AMD' and a 
'soft
Intel' PMU driver at minimum.
   
   
Right, this will severely limit migration domains to hosts of the same  
vendor and processor generation.  There is a  middle ground, though,  
Intel has recently moved to define an architectural pmu which is not  
model specific.  I don't know if AMD adopted it.  We could offer both  
options - native host capabilities, with a loss of compatibility, and  
the architectural pmu, with loss of model specific counters.
   
   I only had a quick look yet on the architectural pmu from intel but it 
   looks 
   like it can be emulated for a guest on amd using existing features.
  
  AMD CPUs dont have enough events for that, they cannot do the 3 fixed 
  events 
  in addition to the 2 generic ones.
 
 Good point. Maybe we can emulate that with some counter round-robin
 usage if the guest really uses all 5 counters.
 
  Nor do you really want to standardize on KVM guests on returning 
  'GenuineIntel' in CPUID, so that the various guest side OSs use the Intel 
  PMU 
  drivers, right?
 
 Isn't there a cpuid bit indicating the availability of architectural 
 perfmon?

there is, but can you rely on all guest OSs keying off their PMU drivers based 
purely on the CPUID bit and not on any other CPUID aspects?

Guest OSs like ... Linux v2.6.33:

void __init init_hw_perf_events(void)
{
int err;

pr_info("Performance Events: ");

switch (boot_cpu_data.x86_vendor) {
case X86_VENDOR_INTEL:
err = intel_pmu_init();
break;
case X86_VENDOR_AMD:
err = amd_pmu_init();
break;
default:

Really, if you want to emulate a single Intel PMU driver model you need to 
pretend that you are an Intel CPU, throughout. This cannot be had both ways.

Ingo


Re: KVM PMU virtualization

2010-02-26 Thread Ingo Molnar

* Jes Sorensen jes.soren...@redhat.com wrote:

 On 02/26/10 12:06, Joerg Roedel wrote:

  Isn't there a cpuid bit indicating the availability of architectural 
  perfmon?
 
 Nope, the perfmon flag is a fake Linux flag, set based on the contents of 
 cpuid 0x0a

There is a way to query the CPU for 'architectural perfmon' though, via CPUID 
alone - that is how we set the X86_FEATURE_ARCH_PERFMON shortcut. The logic 
is:

if (c->cpuid_level > 9) {
unsigned eax = cpuid_eax(10);
/* Check for version and the number of counters */
if ((eax & 0xff) && (((eax >> 8) & 0xff) > 1))
set_cpu_cap(c, X86_FEATURE_ARCH_PERFMON);
}

But emulating that doesnt solve the problem: as OSs generally dont key their 
PMU drivers off the relatively new 'architectural perfmon' CPUID detail, but 
based on much higher level CPUID attributes. (like Intel/AMD)

Ingo


Re: KVM PMU virtualization

2010-02-26 Thread Jes Sorensen

On 02/26/10 11:44, Ingo Molnar wrote:

Direct access to counters is not something that is a big issue. [ Given that i
sometimes can see KVM redraw the screen of a guest OS real-time i doubt this
is the biggest of performance challenges right now ;-) ]

By far the biggest instrumentation issue is:

  - availability
  - usability
  - flexibility

Exposing the raw hw is a step backwards in many regards. The same way we dont
want to expose chipsets to the guest to allow them to do RAS. The same way we
dont want to expose most raw PCI devices to guest in general, but have all
these virt driver abstractions.


I have to say I disagree on that. When you run perfmon on a system, it
is normally to measure a specific application. You want to see accurate
numbers for cache misses, mul instructions or whatever else is selected.
Emulating the PMU rather than using the real one, makes the numbers far
less useful. The most useful way to provide PMU support in a guest is
to expose the real PMU and let the guest OS program it.

We can do this in a reasonable way today, if we allow to take the PMU
away from the host, and only let guests access it when it's in use.
Hopefully Intel and AMD will come up with proper hw PMU virtualization
support that allows us to do it 100% guest and host at some point.

Cheers,
Jes



Re: KVM PMU virtualization

2010-02-26 Thread Jes Sorensen

On 02/26/10 12:24, Ingo Molnar wrote:

There is a way to query the CPU for 'architectural perfmon' though, via CPUID
alone - that is how we set the X86_FEATURE_ARCH_PERFMON shortcut. The logic
is:

 	if (c->cpuid_level > 9) {
 		unsigned eax = cpuid_eax(10);
 		/* Check for version and the number of counters */
 		if ((eax & 0xff) && (((eax >> 8) & 0xff) > 1))
 			set_cpu_cap(c, X86_FEATURE_ARCH_PERFMON);
 	}

But emulating that doesnt solve the problem: as OSs generally dont key their
PMU drivers off the relatively new 'architectural perfmon' CPUID detail, but
based on much higher level CPUID attributes. (like Intel/AMD)


Right, there is far more to it than just the arch-perfmon feature. They
still need to query cpuid 0x0a for counter size, number of counters and
stuff like that.

Cheers,
Jes


Re: KVM PMU virtualization

2010-02-26 Thread Ingo Molnar

* Avi Kivity a...@redhat.com wrote:

 On 02/26/2010 12:44 PM, Ingo Molnar wrote:
 Far cleaner would be to expose it via hypercalls to guest OSs that are
 interested in instrumentation.
 It's also slower - you can give the guest direct access to the various
 counters so no exits are taken when reading the counters (though perhaps
 many tools are only interested in the interrupts, not the counter values).
 Direct access to counters is not something that is a big issue. [ Given that 
 i
 sometimes can see KVM redraw the screen of a guest OS real-time i doubt this
 is the biggest of performance challenges right now ;-) ]
 
 Outside 4-bit vga mode, this shouldn't happen.  Can you describe
 your scenario?
 
 By far the biggest instrumentation issue is:
 
   - availability
   - usability
   - flexibility
 
 Exposing the raw hw is a step backwards in many regards.
 
 In a way, virtualization as a whole is a step backwards.  We take the nice 
 filesystem/timer/network/scheduler APIs, and expose them as raw hardware.  
 The pmu isn't any different.

Uhm, it's obviously very different. A fake NE2000 will work on both Intel and 
AMD CPUs. Same for a fake PIT. PMU drivers are fundamentally per CPU vendor 
though.

So there's no generic hardware to emulate.

Ingo


Re: KVM PMU virtualization

2010-02-26 Thread Ingo Molnar

* Jes Sorensen jes.soren...@redhat.com wrote:

 On 02/26/10 11:44, Ingo Molnar wrote:
 Direct access to counters is not something that is a big issue. [ Given that 
 i
 sometimes can see KVM redraw the screen of a guest OS real-time i doubt this
 is the biggest of performance challenges right now ;-) ]
 
 By far the biggest instrumentation issue is:
 
   - availability
   - usability
   - flexibility
 
 Exposing the raw hw is a step backwards in many regards. The same way we dont
 want to expose chipsets to the guest to allow them to do RAS. The same way we
 dont want to expose most raw PCI devices to guest in general, but have all
 these virt driver abstractions.
 
 I have to say I disagree on that. When you run perfmon on a system, it is 
 normally to measure a specific application. You want to see accurate numbers 
 for cache misses, mul instructions or whatever else is selected.

You can still get those. You can even enable RDPMC access and avoid VM exits.

What you _cannot_ do is to 'steal' the PMU and just give it to the guest.

 Emulating the PMU rather than using the real one, makes the numbers far less 
 useful. The most useful way to provide PMU support in a guest is to expose 
 the real PMU and let the guest OS program it.

Firstly, an emulated PMU was only the second-tier option i suggested. By far 
the best approach is native API to the host regarding performance events and 
good guest side integration.

Secondly, the PMU cannot be 'given' to the guest in the general case. Those 
are privileged registers. They can expose sensitive host execution details, 
etc. etc. So if you emulate a PMU you have to exit out of most PMU accesses 
anyway for a secure solution. (RDPMC can still be supported, but in close 
cooperation with the host)
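
For illustration only, guest-side reads via RDPMC need nothing more than the
instruction itself once the host grants access (CR4.PCE set) and has programmed
the counter; a minimal sketch, not code from this thread:

#include <stdint.h>

/* Read general-purpose counter 'idx' with RDPMC.  Assumes the host has
 * left CR4.PCE set for the guest and the counter is already programmed. */
static inline uint64_t read_pmc(uint32_t idx)
{
	uint32_t lo, hi;

	__asm__ volatile("rdpmc" : "=a" (lo), "=d" (hi) : "c" (idx));
	return ((uint64_t)hi << 32) | lo;
}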

 We can do this in a reasonable way today, if we allow to take the PMU away 
 from the host, and only let guests access it when it's in use. [...]

You get my sure-fire NAK for that kind of crap though. Interfering with the 
host PMU and stealing it, is not a technical approach that has acceptable 
quality.

You need to integrate it properly so that host PMU functionality still works 
fine. (Within hardware constraints)

Ingo


Re: Enhance perf to support KVM

2010-02-26 Thread Avi Kivity

On 02/26/2010 01:17 PM, Ingo Molnar wrote:

* Avi Kivitya...@redhat.com  wrote:

   

Do you have (or plan) any turn-key 'access to all files of the guest' kind
of guest-transparent facility that could be used for such purposes?
   

Not really.  The guest and host admins are usually different people, who
may, being admins, even actively hate each other.  The guest admin would
probably regard it as a security hole.  It's probably useful for the
single-host scenario, and of course for developers.
 

Sounds like an exceedingly silly argument to me - the host admin is the king
in any case.

Your argument boils down to: 'dont offer transparent, turn-key solutions
because some might object to the functionality they offer for all the wrong
reasons'. Which does not withstand elementary scrutiny.
   


Again, the host admin and the guest admin are different people.  What 
would the host admin do with guest files?  Why would the guest admin 
want to run any code that exposes their files?




This is a basic usability issue, and affects many parts of the KVM universe.

Really, it's by far the most fubar-ed notion of KVM. You are pushing _way_ too
much to user-space into different modules and maintenance domains, and
user-space forks those bits, fragments, diverts, delays and messes up basic
features in the usual fashion.

The result is a basic out-of-box virtualization experience that sucks even
these days.
   




Nobody is really 'in charge' of how KVM gets delivered to the user. You
isolated the fun kernel part for you and pushed out the boring bits to
user-space. So if mundane things like mouse integration sucks 'hey that's a
user-space tooling problem', if file integration sucks then 'hey, that's an
admin problem', if it cannot be used over the network 'hey, that's an Xorg
problem', etc. etc.
   


What would you have me do?  Push 200K lines of device emulation code 
into the kernel?  Write an X client, toolkit, and display in the kernel 
so that mouse integration works out of the box when you install Linux 
2.6.653?


As to nobody is in charge, that's really insulting to the people who 
are in charge of the userspace components.  Perhaps the problems that we 
see are not the same problems that you see.  It might be that direct 
access to guest files from the host is only a pressing problem for you, 
but nobody else.  If there are features that you miss, post patches, if 
you will deign to code for lowly user space.



You basically have given up control over the quality of KVM by pushing so many
aspects of it to user-space and letting it rot there.
   


That's wrong on so many levels.  First, nothing is rotting in userspace, 
qemu is evolving faster than kvm is.  If I pushed it into the kernel 
then development pace would be much slower (since kernel development is 
harder), quality would be lower (less infrastructure, any bug is a host 
crash or security issue), and I personally would be totally swamped.



Sure the design looks somewhat cleaner on paper, but if the end result is not
helped by it then over-modularization sure can hurt ...
   


Run 'rpm -qa' one of these days.  Modern software is modular, that's the 
only way to manage it.



( Note that i dont mind user-space tooling per se, as long as it sits together
   with the kernel bits and gets developed, packaged and given to the user in
   the same domain. )
   


Call me when glibc, the X servers and clients, and everything else qemu 
now uses is developed, packaged, and given to the user in the same domain.



And that's a key conceptual area were tools/perf/ differs: it's an integrated,
turn-key solution that you can really rely on. We take responsibility for the
full thing, no ifs and when. And if you cannot rely on your instrumentation
tooling as a single unit you cannot use it, simple as that. (that is a key
mistake Oprofile made a decade ago too btw.)
   


perf is a tool written by developers for developers.  kvm is written for 
users (most of them hidden behind management interfaces).  There's no 
point at all in shipping it as part of the kernel, users don't install 
and use kernels, they install and use distributions.



So i can see some upcoming culture friction with standing KVM principles there
;-)
   


No friction at all - I don't think any kvm developer agrees with you 
(but if anyone does please speak up).


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: Enhance perf to support KVM

2010-02-26 Thread Peter Zijlstra
On Fri, 2010-02-26 at 12:47 +0200, Avi Kivity wrote:
 Not really.  The guest and host admins are usually different people, who 
 may, being admins, even actively hate each other.  The guest admin would 
 probably regard it as a security hole.  It's probably useful for the 
 single-host scenario, and of course for developers. 

LOL, let me be the malicious host admin, then you can be the guest,
there is no way you can protect yourself. If you don't trust the host,
don't use it.

All your IO flows through the host, all your sekrit keys are in memory,
there is no security.



Re: KVM PMU virtualization

2010-02-26 Thread Avi Kivity

On 02/26/2010 01:26 PM, Ingo Molnar wrote:



By far the biggest instrumentation issue is:

  - availability
  - usability
  - flexibility

Exposing the raw hw is a step backwards in many regards.
   

In a way, virtualization as a whole is a step backwards.  We take the nice
filesystem/timer/network/scheduler APIs, and expose them as raw hardware.
The pmu isn't any different.
 

Uhm, it's obviously very different. A fake NE2000 will work on both Intel and
AMD CPUs. Same for a fake PIT. PMU drivers are fundamentally per CPU vendor
though.

So there's no generic hardware to emulate.
   


That's true, and it reduces the usability of the feature (you have to 
restrict your migration pools or not expose the pmu), but the general 
points still stand.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: KVM PMU virtualization

2010-02-26 Thread Avi Kivity

On 02/26/2010 01:42 PM, Ingo Molnar wrote:

* Jes Sorensenjes.soren...@redhat.com  wrote:

   

On 02/26/10 11:44, Ingo Molnar wrote:
 

Direct access to counters is not something that is a big issue. [ Given that i
sometimes can see KVM redraw the screen of a guest OS real-time i doubt this
is the biggest of performance challenges right now ;-) ]

By far the biggest instrumentation issue is:

  - availability
  - usability
  - flexibility

Exposing the raw hw is a step backwards in many regards. The same way we dont
want to expose chipsets to the guest to allow them to do RAS. The same way we
dont want to expose most raw PCI devices to guest in general, but have all
these virt driver abstractions.
   

I have to say I disagree on that. When you run perfmon on a system, it is
normally to measure a specific application. You want to see accurate numbers
for cache misses, mul instructions or whatever else is selected.
 

You can still get those. You can even enable RDPMC access and avoid VM exits.

What you _cannot_ do is to 'steal' the PMU and just give it to the guest.
   


Agreed - if both the host and guest want the pmu, the host wins.  This 
is what we do with debug registers - if both the host and guest contend 
for them, the host wins.



Emulating the PMU rather than using the real one, makes the numbers far less
useful. The most useful way to provide PMU support in a guest is to expose
the real PMU and let the guest OS program it.
 

Firstly, an emulated PMU was only the second-tier option i suggested. By far
the best approach is native API to the host regarding performance events and
good guest side integration.
   


A native API to the host will lock out 100% of the install base now, and 
a large section of any future install base.



Secondly, the PMU cannot be 'given' to the guest in the general case. Those
are privileged registers. They can expose sensitive host execution details,
etc. etc. So if you emulate a PMU you have to exit out of most PMU accesses
anyway for a secure solution. (RDPMC can still be supported, but in close
cooperation with the host)
   


No, stop and restart the counters on every exit/entry, so the guest 
doesn't observe any host data.
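
In rough terms that stop/restart could look like the sketch below; the
global-control MSR is the architectural one from the Intel SDM / msr-index.h,
and nothing here is taken from an actual KVM patch:

#include <asm/msr.h>		/* rdmsrl()/wrmsrl() */
#include <asm/msr-index.h>	/* MSR_CORE_PERF_GLOBAL_CTRL (0x38f) */

static u64 host_global_ctrl;

/* Switch the counters over to the guest's enable mask before entry ... */
static void pmu_guest_enter(u64 guest_global_ctrl)
{
	rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, host_global_ctrl);
	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, guest_global_ctrl);
}

/* ... and restore the host mask on exit, so host activity never shows up
 * in guest-owned counters. */
static void pmu_guest_exit(void)
{
	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, host_global_ctrl);
}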



We can do this in a reasonable way today, if we allow to take the PMU away
from the host, and only let guests access it when it's in use. [...]
 

You get my sure-fire NAK for that kind of crap though. Interfering with the
host PMU and stealing it, is not a technical approach that has acceptable
quality.

   


It would be the other way round - the host would steal the pmu from the 
guest.  Later we can try to time-slice and extrapolate, though that's 
not going to be easy.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: Enhance perf to support KVM

2010-02-26 Thread Avi Kivity

On 02/26/2010 01:48 PM, Peter Zijlstra wrote:

On Fri, 2010-02-26 at 12:47 +0200, Avi Kivity wrote:
   

Not really.  The guest and host admins are usually different people, who
may, being admins, even actively hate each other.  The guest admin would
probably regard it as a security hole.  It's probably useful for the
single-host scenario, and of course for developers.
 

LOL, let me be the malicious host admin, then you can be the guest,
there is no way you can protect yourself. If you don't trust the host,
don't use it.

All your IO flows through the host, all your sekrit keys are in memory,
there is no security.
   


That's true.  But guest admins are going to be unhappy about a file 
server serving their data to the host all the same.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: KVM PMU virtualization

2010-02-26 Thread Ingo Molnar

* Avi Kivity a...@redhat.com wrote:

 A native API to the host will lock out 100% of the install base now, and a 
 large section of any future install base.

... which is why i suggested the soft-PMU approach.

And note that _any_ solution we offer locks out 100% of the installed base 
right now, as no solution is in the kernel yet. The only question is what kind 
of upgrade effort is needed for users to make use of the feature.

Ingo


Re: KVM PMU virtualization

2010-02-26 Thread Avi Kivity

On 02/26/2010 02:07 PM, Ingo Molnar wrote:

* Avi Kivitya...@redhat.com  wrote:

   

A native API to the host will lock out 100% of the install base now, and a
large section of any future install base.
 

... which is why i suggested the soft-PMU approach.
   


Not sure I understand it completely.

Do you mean to take the model specific host pmu events, and expose them 
to the guest via trap'n'emulate?  In that case we may as well assign the 
host pmu to the guest if the host isn't using it, and avoid the traps.


Do you mean to choose some older pmu and emulate it using whatever pmu 
model the host has?  I haven't checked, but aren't there mutually 
exclusive events in every model pair?  The closest thing would be the 
architectural pmu thing.


Or do you mean to define a new, kvm-specific pmu model and feed it off 
the host pmu?  In this case all the guests will need to be taught about 
it, which raises the compatibility problem.



And note that _any_ solution we offer locks out 100% of the installed base
right now, as no solution is in the kernel yet. The only question is what kind
of upgrade effort is needed for users to make use of the feature.
   


I meant the guest installed base.  Hosts can be upgraded transparently 
to the guests (not even a shutdown/reboot).


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [PATCH 2/5] KVM: SVM: Optimize nested svm msrpm merging

2010-02-26 Thread Alexander Graf

On 26.02.2010, at 13:25, Joerg Roedel wrote:

 On Fri, Feb 26, 2010 at 12:28:24PM +0200, Avi Kivity wrote:
 +static void add_msr_offset(u32 offset)
 +{
 +	u32 old;
 +	int i;
 +
 +again:
 +	for (i = 0; i < MSRPM_OFFSETS; ++i) {
 +		old = msrpm_offsets[i];
 +
 +		if (old == offset)
 +			return;
 +
 +		if (old != MSR_INVALID)
 +			continue;
 +
 +		if (cmpxchg(&msrpm_offsets[i], old, offset) != old)
 +			goto again;
 +
 +		return;
 +	}
 +
 +	/*
 +	 * If this BUG triggers the msrpm_offsets table has an overflow. Just
 +	 * increase MSRPM_OFFSETS in this case.
 +	 */
 +	BUG();
 +}
 
 Why all this atomic cleverness?  The possible offsets are all
 determined statically.  Even if you do them dynamically (makes sense
 when considering pmu passthrough), it's per-vcpu and therefore
 single threaded (just move msrpm_offsets into vcpu context).
 
 The msr_offset table is the same for all guests. It doesn't make sense
 to keep it per vcpu because it will currently look the same for all
  vcpus. For standard guests this array contains 3 entries. It is marked
 with __read_mostly for the same reason.

I'm still not convinced on this way of doing things. If it's static, make it 
static. If it's dynamic, make it dynamic. Dynamically generating a static list 
just sounds plain wrong to me.

Alex


Re: [PATCH 1/8] use eventfd for iothread

2010-02-26 Thread Pierre Riteau
When this was merged in qemu-kvm/master (commit 
6249f61a891b6b003531ca4e459c3a553faa82bc) it removed Avi's compile fix when 
!CONFIG_EVENTFD (db311e8619d310bd7729637b702581d3d8565049).
So current master fails to build:
  CC    osdep.o
cc1: warnings being treated as errors
osdep.c: In function 'qemu_eventfd':
osdep.c:296: error: unused variable 'ret'
make: *** [osdep.o] Error 1
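
One simple way to avoid the warning is to scope 'ret' inside the #ifdef so it
only exists when it is actually used; a sketch of such a fix, not necessarily
what was eventually committed:

int qemu_eventfd(int fds[2])
{
#ifdef CONFIG_EVENTFD
    int ret = eventfd(0, 0);    /* declared only where it is used */

    if (ret >= 0) {
        fds[0] = ret;
        qemu_set_cloexec(ret);
        if ((fds[1] = dup(ret)) == -1) {
            close(ret);
            return -1;
        }
        qemu_set_cloexec(fds[1]);
        return 0;
    }
    if (errno != ENOSYS) {
        return -1;
    }
#endif

    return qemu_pipe(fds);
}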

On 22 Feb 2010, at 22:26, Marcelo Tosatti wrote:

 From: Paolo Bonzini pbonz...@redhat.com
 
 Signed-off-by: Paolo Bonzini pbonz...@redhat.com
 Signed-off-by: Avi Kivity a...@redhat.com
 ---
 osdep.c   |   32 
 qemu-common.h |1 +
 vl.c  |9 +
 3 files changed, 38 insertions(+), 4 deletions(-)
 
 diff --git a/osdep.c b/osdep.c
 index 9059f01..9e4b17b 100644
 --- a/osdep.c
 +++ b/osdep.c
 @@ -37,6 +37,10 @@
 #include <sys/statvfs.h>
 #endif
 
 +#ifdef CONFIG_EVENTFD
 +#include <sys/eventfd.h>
 +#endif
 +
 #ifdef _WIN32
 #include <windows.h>
 #elif defined(CONFIG_BSD)
 @@ -281,6 +285,34 @@ ssize_t qemu_write_full(int fd, const void *buf, size_t 
 count)
 
 #ifndef _WIN32
 /*
 + * Creates an eventfd that looks like a pipe and has EFD_CLOEXEC set.
 + */
 +int qemu_eventfd(int fds[2])
 +{
 +int ret;
 +
 +#ifdef CONFIG_EVENTFD
 +ret = eventfd(0, 0);
 +if (ret >= 0) {
 +fds[0] = ret;
 +qemu_set_cloexec(ret);
 +if ((fds[1] = dup(ret)) == -1) {
 +close(ret);
 +return -1;
 +}
 +qemu_set_cloexec(fds[1]);
 +return 0;
 +}
 +
 +if (errno != ENOSYS) {
 +return -1;
 +}
 +#endif
 +
 +return qemu_pipe(fds);
 +}
 +
 +/*
  * Creates a pipe with FD_CLOEXEC set on both file descriptors
  */
 int qemu_pipe(int pipefd[2])
 diff --git a/qemu-common.h b/qemu-common.h
 index b09f717..c941006 100644
 --- a/qemu-common.h
 +++ b/qemu-common.h
 @@ -170,6 +170,7 @@ ssize_t qemu_write_full(int fd, const void *buf, size_t 
 count)
 void qemu_set_cloexec(int fd);
 
 #ifndef _WIN32
 +int qemu_eventfd(int pipefd[2]);
 int qemu_pipe(int pipefd[2]);
 #endif
 
 diff --git a/vl.c b/vl.c
 index 98918ac..1957018 100644
 --- a/vl.c
 +++ b/vl.c
 @@ -3211,14 +3211,15 @@ static int io_thread_fd = -1;
 
 static void qemu_event_increment(void)
 {
 -static const char byte = 0;
 +/* Write 8 bytes to be compatible with eventfd.  */
 +static uint64_t val = 1;
 ssize_t ret;
 
 if (io_thread_fd == -1)
 return;
 
 do {
 -ret = write(io_thread_fd, &byte, sizeof(byte));
 +ret = write(io_thread_fd, &val, sizeof(val));
 } while (ret < 0 && errno == EINTR);
 
 /* EAGAIN is fine, a read must be pending.  */
 @@ -3235,7 +3236,7 @@ static void qemu_event_read(void *opaque)
 ssize_t len;
 char buffer[512];
 
 -/* Drain the notify pipe */
 +/* Drain the notify pipe.  For eventfd, only 8 bytes will be read.  */
 do {
 len = read(fd, buffer, sizeof(buffer));
 } while ((len == -1 && errno == EINTR) || len == sizeof(buffer));
 @@ -3246,7 +3247,7 @@ static int qemu_event_init(void)
 int err;
 int fds[2];
 
 -err = qemu_pipe(fds);
 +err = qemu_eventfd(fds);
 if (err == -1)
 return -errno;
 
 -- 
 1.6.6
 

-- 
Pierre Riteau -- PhD student, Myriads team, IRISA, Rennes, France
http://perso.univ-rennes1.fr/pierre.riteau/



Re: KVM PMU virtualization

2010-02-26 Thread Ingo Molnar

* Avi Kivity a...@redhat.com wrote:

 On 02/26/2010 02:07 PM, Ingo Molnar wrote:
 * Avi Kivitya...@redhat.com  wrote:
 
 A native API to the host will lock out 100% of the install base now, and a
 large section of any future install base.
 ... which is why i suggested the soft-PMU approach.
 
 Not sure I understand it completely.
 
 Do you mean to take the model specific host pmu events, and expose them to 
 the guest via trap'n'emulate?  In that case we may as well assign the host 
 pmu to the guest if the host isn't using it, and avoid the traps.

You are making the incorrect assumption that the emulated PMU uses up all host 
PMU resources ...

 Do you mean to choose some older pmu and emulate it using whatever pmu model 
 the host has?  I haven't checked, but aren't there mutually exclusive events 
 in every model pair?  The closest thing would be the architectural pmu 
 thing.

Yes, something like Core2 with 2 generic events.

That would leave 2 extra generic events on Nehalem and better. (which is 
really the target CPU type for any new feature we are talking about right now. 
Plus performance analysis tends to skew towards more modern CPU types as 
well.)

Plus the emulation can be smart about it and only use up a given number. Most 
guest OSs dont use the full PMU - they use a single counter.

Ideally for Linux-Linux there would be a PMU paravirt driver that allocates 
events on an as-needed basis.
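
Purely to illustrate the idea, a guest-side paravirt allocation could look
roughly like this; the hypercall number and the structure are invented for the
sketch and do not exist in KVM:

#include <linux/types.h>
#include <asm/kvm_para.h>

#define KVM_HC_PERF_OPEN	42	/* invented hypercall number */

struct pv_perf_event {
	u64 config;		/* event selector, perf_event_attr style */
	u64 sample_period;
};

/* Guest side: ask the host to allocate one event and hand back a handle,
 * instead of the guest programming PMU MSRs directly. */
static long pv_perf_open(struct pv_perf_event *ev)
{
	return kvm_hypercall1(KVM_HC_PERF_OPEN, __pa(ev));
}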

 Or do you mean to define a new, kvm-specific pmu model and feed it off the 
 host pmu?  In this case all the guests will need to be taught about it, 
 which raises the compatibility problem.

  And note that _any_ solution we offer locks out 100% of the installed base 
  right now, as no solution is in the kernel yet. The only question is what 
  kind of upgrade effort is needed for users to make use of the feature.
 
 I meant the guest installed base.  Hosts can be upgraded transparently to 
 the guests (not even a shutdown/reboot).

The irony: this time guest-transparent solutions that need no configuration 
are good? ;-)

The very same argument holds for the file server thing: a guest transparent 
solution is easier wrt. the upgrade path.

Ingo


Re: [PATCH 2/5] KVM: SVM: Optimize nested svm msrpm merging

2010-02-26 Thread Joerg Roedel
On Fri, Feb 26, 2010 at 12:28:24PM +0200, Avi Kivity wrote:
 +static void add_msr_offset(u32 offset)
 +{
 +	u32 old;
 +	int i;
 +
 +again:
 +	for (i = 0; i < MSRPM_OFFSETS; ++i) {
 +		old = msrpm_offsets[i];
 +
 +		if (old == offset)
 +			return;
 +
 +		if (old != MSR_INVALID)
 +			continue;
 +
 +		if (cmpxchg(&msrpm_offsets[i], old, offset) != old)
 +			goto again;
 +
 +		return;
 +	}
 +
 +	/*
 +	 * If this BUG triggers the msrpm_offsets table has an overflow. Just
 +	 * increase MSRPM_OFFSETS in this case.
 +	 */
 +	BUG();
 +}
 
 Why all this atomic cleverness?  The possible offsets are all
 determined statically.  Even if you do them dynamically (makes sense
 when considering pmu passthrough), it's per-vcpu and therefore
 single threaded (just move msrpm_offsets into vcpu context).

The msr_offset table is the same for all guests. It doesn't make sense
to keep it per vcpu because it will currently look the same for all
vcpus. For standard guests this array contains 3 entries. It is marked
with __read_mostly for the same reason.

 @@ -1846,20 +1882,33 @@ static int nested_svm_vmexit(struct vcpu_svm *svm)
 
   static bool nested_svm_vmrun_msrpm(struct vcpu_svm *svm)
   {
 -u32 *nested_msrpm;
 -struct page *page;
 +/*
 + * This function merges the msr permission bitmaps of kvm and the
  + * nested vmcb. It is optimized in that it only merges the parts where
 + * the kvm msr permission bitmap may contain zero bits
 + */
 
 A comment that describes the entire function can be moved above the
 function, freeing a whole tab stop for contents.

Ok, will move it out of the function.

Joerg




Re: Enhance perf to support KVM

2010-02-26 Thread Ingo Molnar

* Avi Kivity a...@redhat.com wrote:

  You basically have given up control over the quality of KVM by pushing so 
  many aspects of it to user-space and letting it rot there.
 
 That's wrong on so many levels.  First, nothing is rotting in userspace, 
 qemu is evolving faster than kvm is.  If I pushed it into the kernel then 
 development pace would be much slower (since kernel development is harder), 
 quality would be lower (less infrastructure, any bug is a host crash or 
 security issue), and I personally would be totally swamped.

That was not what i suggested tho. tools/kvm/ would work plenty fine.

As i said:

  [...] You are pushing _way_ too much to user-space into different modules 
  and maintenance domains, [...]
 
  ( Note that i dont mind user-space tooling per se, as long as it sits 
  together
with the kernel bits and gets developed, packaged and given to the user 
in the same domain. ) [...]


  Sure the design looks somewhat cleaner on paper, but if the end result is 
  not helped by it then over-modularization sure can hurt ...
 
 Run 'rpm -qa' one of these days.  Modern software is modular, that's the 
 only way to manage it.

Of course rpm -qa shows cases where modularization works. But my point was 
over-modularization, which due to the KVM/qemu split we all suffer from.

Modularizing along the wrong interface is worse than not modularizing 
something that could be. So when designing software you generally want to err 
on the side of _under_-modularizing. It's always very easy to split stuff up, 
when there's a really strong technical argument for it. It's very hard to pull 
the broken pieces back together though once they are in different domains of 
maintenance - as then it's usually social integration that has to happen, 
which is always harder than a technical split-up.

Ingo


Re: KVM PMU virtualization

2010-02-26 Thread Jes Sorensen

On 02/26/10 12:42, Ingo Molnar wrote:


* Jes Sorensenjes.soren...@redhat.com  wrote:

I have to say I disagree on that. When you run perfmon on a system, it is
normally to measure a specific application. You want to see accurate numbers
for cache misses, mul instructions or whatever else is selected.


You can still get those. You can even enable RDPMC access and avoid VM exits.

What you _cannot_ do is to 'steal' the PMU and just give it to the guest.


Well you cannot steal the PMU without collaborating with perf_event.c,
but thats quite feasible. Sharing the PMU between the guest and the host
is very costly and guarantees incorrect results in the host. Unless you
completely emulate the PMU by faking it and then allocating PMU counters
one by one at the host level. However that means trapping a lot of MSR
access.


Firstly, an emulated PMU was only the second-tier option i suggested. By far
the best approach is native API to the host regarding performance events and
good guest side integration.

Secondly, the PMU cannot be 'given' to the guest in the general case. Those
are privileged registers. They can expose sensitive host execution details,
etc. etc. So if you emulate a PMU you have to exit out of most PMU accesses
anyway for a secure solution. (RDPMC can still be supported, but in close
cooperation with the host)


There is nothing secret in the host PMU, and it's easy to clear out the
counters before passing them off to the guest.


We can do this in a reasonable way today, if we allow to take the PMU away
from the host, and only let guests access it when it's in use. [...]


You get my sure-fire NAK for that kind of crap though. Interfering with the
host PMU and stealing it, is not a technical approach that has acceptable
quality.


Having an allocation scheme and sharing it with the host, is a perfectly
legitimate and very clean way to do it. Once it's given to the guest,
the host knows not to touch it until it's been released again.


You need to integrate it properly so that host PMU functionality still works
fine. (Within hardware constraints)


Well with the hardware currently available, there is no such thing as
clean sharing between the host and the guest. It cannot be done without
messing up the host measurements, which effectively renders measuring at
the host side useless while a guest is allowed access to the PMU.

Jes


Re: Enhance perf to support KVM

2010-02-26 Thread Avi Kivity

On 02/26/2010 02:46 PM, Ingo Molnar wrote:

* Avi Kivitya...@redhat.com  wrote:

   

You basically have given up control over the quality of KVM by pushing so
many aspects of it to user-space and letting it rot there.
   

That's wrong on so many levels.  First, nothing is rotting in userspace,
qemu is evolving faster than kvm is.  If I pushed it into the kernel then
development pace would be much slower (since kernel development is harder),
quality would be lower (less infrastructure, any bug is a host crash or
security issue), and I personally would be totally swamped.
 

That was not what i suggested tho. tools/kvm/ would work plenty fine.
   


I'll wait until we have tools/libc and tools/X.  After all, they affect 
a lot more people and are concerned with a lot more kernel/user 
interfaces than kvm.



As i said:

   

[...] You are pushing _way_ too much to user-space into different modules
and maintenance domains, [...]

( Note that i dont mind user-space tooling per se, as long as it sits together
   with the kernel bits and gets developed, packaged and given to the user
   in the same domain. ) [...]
   


   

Sure the design looks somewhat cleaner on paper, but if the end result is
not helped by it then over-modularization sure can hurt ...
   

Run 'rpm -qa' one of these days.  Modern software is modular, that's the
only way to manage it.
 

Of course rpm -qa shows cases where modularization works. But my point was
over-modularization, which due to the KVM/qemu split we all suffer from.
   


You're the only one who suffers from it.  Everyone else is happy with 
adding features in the modules that implements them, be it kvm, qemu, 
libvirt, or virt-manager (to name one tool stack out of several).



Modularizing along the wrong interface is worse than not modularizing
something that could be. So when designing software you generally want to err
on the side of _under_-modularizing. It's always very easy to split stuff up,
when there's a really strong technical argument for it. It's very hard to pull
the broken pieces back together though once they are in different domains of
maintenance - as then it's usually social integration that has to happen,
which is always harder than a technical split-up.
   


As it happens, the kvm and qemu development community has a large 
overlap.  Many developers read both lists, contribute to both projects, 
and participate on the same weekly call.  While we had difficulties 
pushing patches to qemu in the past, that's behind us, and qemu is now 
accepting patches at a much higher rate than kvm.


Technically, it is obvious that the userspace and kernel components are 
separate projects.  All that remains is the social divide.  Since 
everyone (except you) is mostly happy, I see no reason to change.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: KVM PMU virtualization

2010-02-26 Thread Jes Sorensen

On 02/26/10 13:20, Avi Kivity wrote:

On 02/26/2010 02:07 PM, Ingo Molnar wrote:

... which is why i suggested the soft-PMU approach.


Not sure I understand it completely.

Do you mean to take the model specific host pmu events, and expose them
to the guest via trap'n'emulate? In that case we may as well assign the
host pmu to the guest if the host isn't using it, and avoid the traps.

Do you mean to choose some older pmu and emulate it using whatever pmu
model the host has? I haven't checked, but aren't there mutually
exclusive events in every model pair? The closest thing would be the
architectural pmu thing.


You cannot do this, as you say there is no guarantee that there are no
overlaps, and the current host may have different counter sizes too,
which makes emulating it even more costly.

The cpuid bits basically tell you which version of the counters is
available, how many counters there are, the word size of the counters, and
I believe there are bits also stating which optional features are
available to be counted.
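
Roughly, those are the fields of CPUID leaf 0x0a (layout per the Intel SDM); a
small user-space decoder, shown only as an illustration:

#include <stdio.h>
#include <cpuid.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid(0x0a, &eax, &ebx, &ecx, &edx))
		return 1;

	printf("version id:       %u\n", eax & 0xff);
	printf("gp counters:      %u\n", (eax >> 8) & 0xff);
	printf("gp counter width: %u bits\n", (eax >> 16) & 0xff);
	printf("fixed counters:   %u\n", edx & 0x1f);
	printf("fixed ctr width:  %u bits\n", (edx >> 5) & 0xff);
	return 0;
}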


Or do you mean to define a new, kvm-specific pmu model and feed it off
the host pmu? In this case all the guests will need to be taught about
it, which raises the compatibility problem.


Cannot be done in a reasonable manner due to the above.

The key to all of this is that guests OSes, including that other OS,
should be able to use the performance counters without needing special
paravirt drivers or other OS modifications. If we start requiring that
kind of stuff, the whole point of having the feature goes down the
toilet.

Cheers,
Jes


Re: [PATCH 2/5] KVM: SVM: Optimize nested svm msrpm merging

2010-02-26 Thread Joerg Roedel
On Fri, Feb 26, 2010 at 01:28:29PM +0100, Alexander Graf wrote:
 
 On 26.02.2010, at 13:25, Joerg Roedel wrote:
 
  On Fri, Feb 26, 2010 at 12:28:24PM +0200, Avi Kivity wrote:
  +static void add_msr_offset(u32 offset)
  +{
  +	u32 old;
  +	int i;
  +
  +again:
  +	for (i = 0; i < MSRPM_OFFSETS; ++i) {
  +		old = msrpm_offsets[i];
  +
  +		if (old == offset)
  +			return;
  +
  +		if (old != MSR_INVALID)
  +			continue;
  +
  +		if (cmpxchg(&msrpm_offsets[i], old, offset) != old)
  +			goto again;
  +
  +		return;
  +	}
  +
  +	/*
  +	 * If this BUG triggers the msrpm_offsets table has an overflow. Just
  +	 * increase MSRPM_OFFSETS in this case.
  +	 */
  +	BUG();
  +}
  
  Why all this atomic cleverness?  The possible offsets are all
  determined statically.  Even if you do them dynamically (makes sense
  when considering pmu passthrough), it's per-vcpu and therefore
  single threaded (just move msrpm_offsets into vcpu context).
  
  The msr_offset table is the same for all guests. It doesn't make sense
  to keep it per vcpu because it will currently look the same for all
   vcpus. For standard guests this array contains 3 entries. It is marked
  with __read_mostly for the same reason.
 
 I'm still not convinced on this way of doing things. If it's static,
 make it static. If it's dynamic, make it dynamic. Dynamically
 generating a static list just sounds plain wrong to me.

Stop. I had a static list in the first version of the patch. This list
was fine except the fact that a developer needs to remember to update
this list if the list of non-intercepted msrs is expanded. The whole
reason for a dynamically built list is to take the task of maintaining
the list away from the developer and remove a possible source of hard to
find bugs. This is what the current approach does.

Joerg




Re: KVM PMU virtualization

2010-02-26 Thread Avi Kivity

On 02/26/2010 02:38 PM, Ingo Molnar wrote:

* Avi Kivitya...@redhat.com  wrote:

   

On 02/26/2010 02:07 PM, Ingo Molnar wrote:
 

* Avi Kivitya...@redhat.com   wrote:

   

A native API to the host will lock out 100% of the install base now, and a
large section of any future install base.
 

... which is why i suggested the soft-PMU approach.
   

Not sure I understand it completely.

Do you mean to take the model specific host pmu events, and expose them to
the guest via trap'n'emulate?  In that case we may as well assign the host
pmu to the guest if the host isn't using it, and avoid the traps.
 

You are making the incorrect assumption that the emulated PMU uses up all host
PMU resources ...
   


Well, in the general case, it may?  If it doesn't, the host may use 
them.  We do a similar thing with debug breakpoints.


Sharing the pmu will mean trapping control msr writes at least, though.


Do you mean to choose some older pmu and emulate it using whatever pmu model
the host has?  I haven't checked, but aren't there mutually exclusive events
in every model pair?  The closest thing would be the architectural pmu
thing.
 

Yes, something like Core2 with 2 generic events.

That would leave 2 extra generic events on Nehalem and better. (which is
really the target CPU type for any new feature we are talking about right now.
Plus performance analysis tends to skew towards more modern CPU types as
well.)
   


Can you emulate the Core 2 pmu on, say, a P4?  Those P4s have very 
different instruction caches so I imagine the events are very different 
as well.


Agree about favouring modern processors.


Plus the emulation can be smart about it and only use up a given number. Most
guest OSs dont use the full PMU - they use a single counter.
   


But you have to expose all of the counters, no?  Unless you go with a 
kvm-specific pmu as described below.



Ideally for Linux-Linux there would be a PMU paravirt driver that allocates
events on an as-needed basis.
   


Or we could watch the control register and see how the guest programs 
it, provided it doesn't do that a lot.



Or do you mean to define a new, kvm-specific pmu model and feed it off the
host pmu?  In this case all the guests will need to be taught about it,
which raises the compatibility problem.

 

And note that _any_ solution we offer locks out 100% of the installed base
right now, as no solution is in the kernel yet. The only question is what
kind of upgrade effort is needed for users to make use of the feature.
   

I meant the guest installed base.  Hosts can be upgraded transparently to
the guests (not even a shutdown/reboot).
 

The irony: this time guest-transparent solutions that need no configuration
are good? ;-)

The very same argument holds for the file server thing: a guest transparent
solution is easier wrt. the upgrade path.
   


If we add pmu support, guests can begin to use if immediately.  If we 
add the file server support, guests need to install drivers before they 
can use it, while guest admins have no motivation to do so (it helps the 
host, not the guest).


Is something wrong with just using sshfs?  Seems a lot less hassle to me.

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: KVM PMU virtualization

2010-02-26 Thread Ingo Molnar

* Jes Sorensen jes.soren...@redhat.com wrote:

 On 02/26/10 12:42, Ingo Molnar wrote:
 
 * Jes Sorensenjes.soren...@redhat.com  wrote:
 
  I have to say I disagree on that. When you run perfmon on a system, it is 
  normally to measure a specific application. You want to see accurate 
  numbers for cache misses, mul instructions or whatever else is selected.
 
  You can still get those. You can even enable RDPMC access and avoid VM 
  exits.
 
  What you _cannot_ do is to 'steal' the PMU and just give it to the guest.
 
 Well you cannot steal the PMU without collaborating with perf_event.c, but 
 thats quite feasible. Sharing the PMU between the guest and the host is very 
 costly and guarantees incorrect results in the host. Unless you completely 
 emulate the PMU by faking it and then allocating PMU counters one by one at 
 the host level. However that means trapping a lot of MSR access.

It's not that many MSR accesses.

 Firstly, an emulated PMU was only the second-tier option i suggested. By far
 the best approach is native API to the host regarding performance events and
 good guest side integration.
 
 Secondly, the PMU cannot be 'given' to the guest in the general case. Those
 are privileged registers. They can expose sensitive host execution details,
 etc. etc. So if you emulate a PMU you have to exit out of most PMU accesses
 anyway for a secure solution. (RDPMC can still be supported, but in close
 cooperation with the host)
 
 There is nothing secret in the host PMU, and it's easy to clear out the 
 counters before passing them off to the guest.

That's wrong. On some CPUs the host PMU can be used to say sample aspects of 
another CPU, allowing statistical attacks to recover crypto keys. It can be 
used to sample memory access patterns of another node.

There's a good reason PMU configuration registers are privileged and there's 
good value in only giving a certain sub-set to less privileged entities by 
default.

 We can do this in a reasonable way today, if we allow to take the PMU away
 from the host, and only let guests access it when it's in use. [...]
 
 You get my sure-fire NAK for that kind of crap though. Interfering with the
 host PMU and stealing it, is not a technical approach that has acceptable
 quality.
 
 Having an allocation scheme and sharing it with the host, is a perfectly 
 legitimate and very clean way to do it. Once it's given to the guest, the 
 host knows not to touch it until it's been released again.

'Full PMU' is not the granularity i find acceptable though: please do what i 
suggested, event granularity allocation and scheduling.

We are rehashing the whole 'perfmon versus perf events/counters' design 
arguments again here really.

  You need to integrate it properly so that host PMU functionality still 
  works fine. (Within hardware constraints)
 
 Well with the hardware currently available, there is no such thing as clean 
 sharing between the host and the guest. It cannot be done without messing up 
 the host measurements, which effectively renders measuring at the host side 
 useless while a guest is allowed access to the PMU.

That's precisely my point: the guest should obviously not get raw access to 
the PMU. (except where it might matter to performance, such as RDPMC)

Ingo


Re: [PATCH 2/5] KVM: SVM: Optimize nested svm msrpm merging

2010-02-26 Thread Alexander Graf

On 26.02.2010, at 14:04, Joerg Roedel wrote:

 On Fri, Feb 26, 2010 at 01:28:29PM +0100, Alexander Graf wrote:
 
 On 26.02.2010, at 13:25, Joerg Roedel wrote:
 
 On Fri, Feb 26, 2010 at 12:28:24PM +0200, Avi Kivity wrote:
 +static void add_msr_offset(u32 offset)
 +{
 +	u32 old;
 +	int i;
 +
 +again:
 +	for (i = 0; i < MSRPM_OFFSETS; ++i) {
 +		old = msrpm_offsets[i];
 +
 +		if (old == offset)
 +			return;
 +
 +		if (old != MSR_INVALID)
 +			continue;
 +
 +		if (cmpxchg(&msrpm_offsets[i], old, offset) != old)
 +			goto again;
 +
 +		return;
 +	}
 +
 +	/*
 +	 * If this BUG triggers the msrpm_offsets table has an overflow. Just
 +	 * increase MSRPM_OFFSETS in this case.
 +	 */
 +	BUG();
 +}
 
 Why all this atomic cleverness?  The possible offsets are all
 determined statically.  Even if you do them dynamically (makes sense
 when considering pmu passthrough), it's per-vcpu and therefore
 single threaded (just move msrpm_offsets into vcpu context).
 
 The msr_offset table is the same for all guests. It doesn't make sense
 to keep it per vcpu because it will currently look the same for all
 vcpus. For standard guests this array contains 3 entrys. It is marked
 with __read_mostly for the same reason.
 
 I'm still not convinced on this way of doing things. If it's static,
 make it static. If it's dynamic, make it dynamic. Dynamically
 generating a static list just sounds plain wrong to me.
 
 Stop. I had a static list in the first version of the patch. This list
 was fine except the fact that a developer needs to remember to update
 this list if the list of non-intercepted msrs is expanded. The whole
 reason for a dynamically built list is to take the task of maintaining
 the list away from the developer and remove a possible source of hard to
 find bugs. This is what the current approach does.

I was more thinking of replacing the function calls with a list of MSRs. You 
can then take that list on module init, generate the MSR bitmap once and be 
good.

Later you can use the same list for the nested bitmap.

Alex


Re: [PATCH 2/5] KVM: SVM: Optimize nested svm msrpm merging

2010-02-26 Thread Avi Kivity

On 02/26/2010 03:04 PM, Joerg Roedel wrote:



I'm still not convinced on this way of doing things. If it's static,
make it static. If it's dynamic, make it dynamic. Dynamically
generating a static list just sounds plain wrong to me.
 

Stop. I had a static list in the first version of the patch. This list
was fine except the fact that a developer needs to remember to update
this list if the list of non-intercepted msrs is expanded. The whole
reason for a dynamically built list is to take the task of maintaining
the list away from the developer and remove a possible source of hard to
find bugs. This is what the current approach does.
   


The problem was the two lists.  If you had a

static struct svm_direct_access_msrs {
	u32 index;
	bool longmode_only;
} direct_access_msrs[] = {
	...
};

You could generate

static unsigned *msrpm_offsets_longmode, *msrpm_offsets_legacy;

as well as the original bitmaps at module init, no?
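
A sketch of what that module-init generation could look like, reusing the names
from the sketch above; svm_msrpm_offset() and the MSR_INVALID byte-fill are
assumptions, not code from the actual patch:

static u32 msrpm_offsets[MSRPM_OFFSETS] __read_mostly;

static void __init init_msrpm_offsets(void)
{
	int i;

	/* assumes MSR_INVALID is all-ones, so a byte fill works */
	memset(msrpm_offsets, 0xff, sizeof(msrpm_offsets));

	for (i = 0; i < ARRAY_SIZE(direct_access_msrs); i++)
		add_msr_offset(svm_msrpm_offset(direct_access_msrs[i].index));
}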

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [PATCH 2/5] KVM: SVM: Optimize nested svm msrpm merging

2010-02-26 Thread Joerg Roedel
On Fri, Feb 26, 2010 at 02:26:32PM +0100, Alexander Graf wrote:
 
 On 26.02.2010, at 14:21, Joerg Roedel wrote:
 
  On Fri, Feb 26, 2010 at 03:10:13PM +0200, Avi Kivity wrote:
  On 02/26/2010 03:04 PM, Joerg Roedel wrote:
  
  I'm still not convinced on this way of doing things. If it's static,
  make it static. If it's dynamic, make it dynamic. Dynamically
  generating a static list just sounds plain wrong to me.
  Stop. I had a static list in the first version of the patch. This list
  was fine except the fact that a developer needs to remember to update
  this list if the list of non-intercepted msrs is expanded. The whole
  reason for a dynamically built list is to take the task of maintaining
  the list away from the developer and remove a possible source of hard to
  find bugs. This is what the current approach does.
  
  The problem was the two lists.  If you had a
  
   static struct svm_direct_access_msrs {
   	u32 index;
   	bool longmode_only;
   } direct_access_msrs[] = {
   	...
   };
  
  You could generate
  
  static unsigned *msrpm_offsets_longmode, *msrpm_offsets_legacy;
  
  as well as the original bitmaps at module init, no?
  
   True for the msrs the guest always has access to. But for the lbr-msrs
   the intercept bits may change at runtime. So an additional flag is
  required to indicate if the bits should be cleared initially.
 
 So the msrpm bitmap changes dynamically for each vcpu? Great, make it
 fully dynamic then, changing the vcpu-arch.msrpm only from within its
 vcpu context. No need for atomic ops.

The msrpm_offsets table is global. But I think I will follow Avi's
suggestions and create a static direct_access_msrs list and generate the
msrpm_offsets at module_init. This solves the problem of two independent
lists too.

Joerg




Re: KVM PMU virtualization

2010-02-26 Thread Avi Kivity

On 02/26/2010 03:06 PM, Ingo Molnar wrote:



Firstly, an emulated PMU was only the second-tier option i suggested. By far
the best approach is native API to the host regarding performance events and
good guest side integration.

Secondly, the PMU cannot be 'given' to the guest in the general case. Those
are privileged registers. They can expose sensitive host execution details,
etc. etc. So if you emulate a PMU you have to exit out of most PMU accesses
anyway for a secure solution. (RDPMC can still be supported, but in close
cooperation with the host)
   

There is nothing secret in the host PMU, and it's easy to clear out the
counters before passing them off to the guest.
 

That's wrong. On some CPUs the host PMU can be used to say sample aspects of
another CPU, allowing statistical attacks to recover crypto keys. It can be
used to sample memory access patterns of another node.

There's a good reason PMU configuration registers are privileged and there's
good value in only giving a certain sub-set to less privileged entities by
default.
   


Even if there were no security considerations, if the guest can observe 
host data in the pmu, it means the pmu is inaccurate.  We should expose 
guest data only in the guest pmu.  That's not difficult to do, you stop 
the pmu on exit and swap the counters on context switches.



Having an allocation scheme and sharing it with the host, is a perfectly
legitimate and very clean way to do it. Once it's given to the guest, the
host knows not to touch it until it's been released again.
 

'Full PMU' is not the granularity i find acceptable though: please do what i
suggested, event granularity allocation and scheduling.

We are rehashing the whole 'perfmon versus perf events/counters' design
arguments again here really.
   


Scheduling at event granularity would be a good thing.  However we need 
to be able to handle the guest using the full pmu.


Note that scheduling is only needed if both the guest and host want the 
pmu at the same time - and that should be a rare case and not the one to 
optimize for.



You need to integrate it properly so that host PMU functionality still
works fine. (Within hardware constraints)
   

Well with the hardware currently available, there is no such thing as clean
sharing between the host and the guest. It cannot be done without messing up
the host measurements, which effectively renders measuring at the host side
useless while a guest is allowed access to the PMU.
 

That's precisely my point: the guest should obviously not get raw access to
the PMU. (except where it might matter to performance, such as RDPMC)
   


That's doable if all counters are steerable.  IIRC some counters are 
fixed function, but I'm not certain about that.
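
(For reference, architectural perfmon v2 does have three fixed-function
counters; each counts one hard-wired event, so only the enable/ring bits can be
chosen, not the event itself. A small sketch using the MSR names from the Intel
SDM / msr-index.h, enabling fixed counter 0, which counts instructions retired:)

#include <asm/msr.h>
#include <asm/msr-index.h>	/* MSR_CORE_PERF_FIXED_CTR_CTRL, ..._GLOBAL_CTRL */

static void enable_fixed_ctr0(void)
{
	u64 v;

	/* enable fixed counter 0 for ring 0 and ring 3 */
	rdmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, v);
	wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, v | 0x3);

	/* set the global enable bit for fixed counter 0 (bit 32) */
	rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, v);
	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, v | (1ULL << 32));
}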


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: KVM PMU virtualization

2010-02-26 Thread Jes Sorensen

On 02/26/10 14:06, Ingo Molnar wrote:


* Jes Sorensenjes.soren...@redhat.com  wrote:

Well you cannot steal the PMU without collaborating with perf_event.c, but
thats quite feasible. Sharing the PMU between the guest and the host is very
costly and guarantees incorrect results in the host. Unless you completely
emulate the PMU by faking it and then allocating PMU counters one by one at
the host level. However that means trapping a lot of MSR access.


It's not that many MSR accesses.


Well it's more than enough to double the number of MSRs KVM has to track
on switches.


There is nothing secret in the host PMU, and it's easy to clear out the
counters before passing them off to the guest.


That's wrong. On some CPUs the host PMU can be used to say sample aspects of
another CPU, allowing statistical attacks to recover crypto keys. It can be
used to sample memory access patterns of another node.

There's a good reason PMU configuration registers are privileged and there's
good value in only giving a certain sub-set to less privileged entities by
default.


If a PMU can really count stuff on another CPU, then we shouldn't allow
PMU access to any application at all. It's more than just a KVM guest vs
a KVM guest issue then, but also a thread to thread issue.

My idea was obviously not to expose host timings to a guest. Save the
counters when a guest exits, and reload them when it's restarted. Not
just when switching to another task, but also when entering KVM, to
avoid the guest seeing overhead spent within KVM.


Having an allocation scheme and sharing it with the host, is a perfectly
legitimate and very clean way to do it. Once it's given to the guest, the
host knows not to touch it until it's been released again.


'Full PMU' is not the granularity i find acceptable though: please do what i
suggested, event granularity allocation and scheduling.


As I wrote earlier, at that level we have to do it all emulated. In
this case, providing any of this to a guest seems to be a waste of time
since the interface will cost way too much in trapping back and forth
and you have contention with the very limited resources in the PMU with
just 5 counters to pick from on Core2.

The guest PMU will think it's running on top of real hardware, and
scaling/estimating numbers like the perf_event.c code does today,
except that it will be using already scaled and estimated numbers for
its calculations. Application users will have little use for this.


Well with the hardware currently available, there is no such thing as clean
sharing between the host and the guest. It cannot be done without messing up
the host measurements, which effectively renders measuring at the host side
useless while a guest is allowed access to the PMU.


That's precisely my point: the guest should obviously not get raw access to
the PMU. (except where it might matter to performance, such as RDPMC)


Well either you allow access to the PMU or you don't. If you allow
direct access to the PMU counters, but not the control registers, you
have to specify the counter sizes to match that of the host, making it
impossible to really emulate core2 on a non core2 architecture etc.

Jes


Re: KVM PMU virtualization

2010-02-26 Thread Ingo Molnar

* Avi Kivity a...@redhat.com wrote:

 Or do you mean to define a new, kvm-specific pmu model and feed it off the 
 host pmu?  In this case all the guests will need to be taught about it, 
 which raises the compatibility problem.

You are missing two big things wrt. compatibility here:

 1) The first upgrade overhead is a one-time overhead only.

 2) Once a Linux guest has upgraded, it will work in the future, with _any_ 
future CPU - _without_ having to upgrade the guest!

Dont you see the advantage of that? You can instrument an old system on new 
hardware, without having to upgrade that guest for the new CPU support.

With the 'steal the PMU' messy approach the guest OS has to be upgraded to the 
new CPU type all the time. Ad infinitum.

Ingo


Re: KVM PMU virtualization

2010-02-26 Thread Avi Kivity

On 02/26/2010 03:27 PM, Ingo Molnar wrote:


For Linux-Linux the sanest, tier-1 approach would be to map sys_perf_open()
on the guest side over to the host, transparently, via a paravirt driver.
   


Let us for the purpose of this discussion assume that we are also 
interested in supporting Windows and older Linux.  Paravirt 
optimizations can be added after we have the basic functionality, if 
they prove necessary.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: KVM PMU virtualization

2010-02-26 Thread Jes Sorensen

On 02/26/10 14:30, Avi Kivity wrote:

On 02/26/2010 03:06 PM, Ingo Molnar wrote:

That's precisely my point: the guest should obviously not get raw
access to
the PMU. (except where it might matter to performance, such as RDPMC)


That's doable if all counters are steerable. IIRC some counters are
fixed function, but I'm not certain about that.


I am not an expert, but from what I learned from Peter, there are
constraints on some of the counters. Ie. certain types of events can
only be counted on certain counters, which limits the already very
limited number of counters even further.

Cheers,
Jes


Re: KVM PMU virtualization

2010-02-26 Thread Peter Zijlstra
On Fri, 2010-02-26 at 13:51 +0200, Avi Kivity wrote:

 It would be the other way round - the host would steal the pmu from the 
 guest.  Later we can try to time-slice and extrapolate, though that's 
 not going to be easy. 

Right, so perf already does the time slicing and interpolating thing, so
a soft-pmu gets that for free.
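
For reference, this is what that interpolation looks like from the consumer
side: the kernel reports how long the event was enabled vs. actually running
on the PMU, and the raw count is scaled accordingly. A minimal userspace
sketch:

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdint.h>
#include <string.h>
#include <stdio.h>

/* note: __NR_perf_event_open may need defining by hand on older glibc */
int main(void)
{
    struct perf_event_attr attr;
    uint64_t buf[3];    /* value, time_enabled, time_running */
    int fd;

    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.read_format = PERF_FORMAT_TOTAL_TIME_ENABLED |
                       PERF_FORMAT_TOTAL_TIME_RUNNING;

    fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0)
        return 1;

    /* ... run the workload to be measured ... */

    if (read(fd, buf, sizeof(buf)) == sizeof(buf) && buf[2])
        /* scale: raw count * time_enabled / time_running */
        printf("estimated cycles: %llu\n",
               (unsigned long long)(buf[0] * buf[1] / buf[2]));
    close(fd);
    return 0;
}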

Anyway, this discussion seems somewhat in a stale-mate position.

The KVM folks basically demand a full PMU MSR shadow with PMI
passthrough so that their $legacy shit works without modification. 

My question with that is how $legacy muck can ever know how the current
PMU works, you can't even properly emulate a core2 pmu on a nehalem
because intel keeps messing with the event codes for every new model.

So basically, for this to work the guest can't run legacy stuff
anyway, but needs to run very up-to-date software, so we might as well
create a soft-pmu/paravirt interface now and have all up-to-date
software support that for the next generation.
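
To make the shape of such an interface concrete - everything below is
hypothetical, with made-up hypercall numbers and structures, just one
possible guest-side sketch of a soft-pmu:

#include <linux/types.h>
#include <asm/kvm_para.h>
#include <asm/page.h>

/* hypothetical ABI: the guest hands the host a perf_event_attr-like
 * request and gets back an opaque handle plus already-scaled counts */
#define PV_PMU_HC_OPEN   0x70000001u
#define PV_PMU_HC_READ   0x70000002u
#define PV_PMU_HC_CLOSE  0x70000003u

struct pv_pmu_request {
    u32 type;       /* mirrors perf_event_attr.type */
    u32 pad;
    u64 config;     /* mirrors perf_event_attr.config */
    u64 count;      /* host writes back an already-scaled count */
};

static long pv_pmu_open(struct pv_pmu_request *req)
{
    /* pass the guest-physical address of the request block */
    return kvm_hypercall1(PV_PMU_HC_OPEN, __pa(req));
}

static u64 pv_pmu_read(long handle, struct pv_pmu_request *req)
{
    kvm_hypercall2(PV_PMU_HC_READ, handle, __pa(req));
    return req->count;
}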

Furthermore, when KVM doesn't virtualize the physical system topology,
some PMU features cannot even be sanely used from a vcpu.

So while currently a root user can already tie up all of the pmu using
perf, simply using that to hand the full pmu off to the guest still
leaves lots of issues.



Re: KVM PMU virtualization

2010-02-26 Thread Jes Sorensen

On 02/26/10 14:18, Ingo Molnar wrote:


* Avi Kivitya...@redhat.com  wrote:


Can you emulate the Core 2 pmu on, say, a P4? [...]


How about the Pentium? Or the i486?

As long as there's perf events support, the CPU can be supported in a soft
PMU. You can even cross-map exotic hw events if need to be - but most of the
tooling (in just about any OS) uses just a handful of core events ...


This is only possible if all future CPU perfmon events are guaranteed
to be a superset of previous versions. Otherwise you end up emulating
events and providing randomly generated numbers back.

The perfmon revision and size we present to a guest have to match the
current host.

Jes


Re: KVM PMU virtualization

2010-02-26 Thread Jes Sorensen

On 02/26/10 14:31, Ingo Molnar wrote:

You are missing two big things wrt. compatibility here:

  1) The first upgrade overhead is a one-time overhead only.

  2) Once a Linux guest has upgraded, it will work in the future, with _any_
 future CPU - _without_ having to upgrade the guest!

Dont you see the advantage of that? You can instrument an old system on new
hardware, without having to upgrade that guest for the new CPU support.


That would only work if you are guaranteed to be able to emulate old
hardware on new hardware. Not going to be feasible, so then we are in a
real mess.


With the 'steal the PMU' messy approach the guest OS has to be upgraded to the
new CPU type all the time. Ad infinitum.


The way the Perfmon architecture is specified by Intel, that is what we
are stuck with. It's not going to be possible via software emulation to
count cache misses, unless you run it in a micro architecture emulator.

Jes


Re: KVM PMU virtualization

2010-02-26 Thread Avi Kivity

On 02/26/2010 03:31 PM, Ingo Molnar wrote:

* Avi Kivitya...@redhat.com  wrote:

   

Or do you mean to define a new, kvm-specific pmu model and feed it off the
host pmu?  In this case all the guests will need to be taught about it,
which raises the compatibility problem.
 

You are missing two big things wrt. compatibility here:

  1) The first upgrade overhead is a one-time overhead only.
   


May be one too many, for certain guests.  Of course it may be argued 
that if the guest wants performance monitoring that much, they will upgrade.


Certainly guests that we don't port won't be able to use this.  I doubt 
we'll be able to make Windows work with this - the only performance tool 
I'm familiar with on Windows is Intel's VTune, and that's proprietary.



  2) Once a Linux guest has upgraded, it will work in the future, with _any_
 future CPU - _without_ having to upgrade the guest!

Dont you see the advantage of that? You can instrument an old system on new
hardware, without having to upgrade that guest for the new CPU support.
   


That also works for the architectural pmu, of course that's Intel only.  
And there you don't need to upgrade the guest even once.


The arch pmu seems nicely done - there's a bit for every counter that 
can be enabled and disabled at will, and the number of counters is also 
determined from cpuid.
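
For the curious, all of that comes out of a single cpuid leaf (0xa). A
trivial sketch of the discovery, with the field layout as I read the SDM
(so double-check before relying on it):

#include <stdio.h>
#include <cpuid.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(0xa, &eax, &ebx, &ecx, &edx) || !(eax & 0xff)) {
        printf("no architectural PMU\n");
        return 1;
    }
    printf("arch PMU version:       %u\n", eax & 0xff);
    printf("GP counters:            %u\n", (eax >> 8) & 0xff);
    printf("GP counter width:       %u bits\n", (eax >> 16) & 0xff);
    printf("unavailable event mask: 0x%x\n", ebx);
    if ((eax & 0xff) > 1)
        printf("fixed counters:         %u\n", edx & 0x1f);
    return 0;
}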


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: KVM PMU virtualization

2010-02-26 Thread Ingo Molnar

* Avi Kivity a...@redhat.com wrote:

 On 02/26/2010 03:06 PM, Ingo Molnar wrote:
 
 Firstly, an emulated PMU was only the second-tier option i suggested. By 
 far
 the best approach is native API to the host regarding performance events 
 and
 good guest side integration.
 
 Secondly, the PMU cannot be 'given' to the guest in the general case. Those
 are privileged registers. They can expose sensitive host execution details,
 etc. etc. So if you emulate a PMU you have to exit out of most PMU accesses
 anyway for a secure solution. (RDPMC can still be supported, but in close
 cooperation with the host)
 There is nothing secret in the host PMU, and it's easy to clear out the
 counters before passing them off to the guest.
 That's wrong. On some CPUs the host PMU can be used to say sample aspects of
 another CPU, allowing statistical attacks to recover crypto keys. It can be
 used to sample memory access patterns of another node.
 
 There's a good reason PMU configuration registers are privileged and there's
 good value in only giving a certain sub-set to less privileged entities by
 default.
 
 Even if there were no security considerations, if the guest can observe host 
 data in the pmu, it means the pmu is inaccurate.  We should expose guest 
 data only in the guest pmu.  That's not difficult to do, you stop the pmu on 
 exit and swap the counters on context switches.

Again you are making an incorrect assumption: that information leakage via the 
PMU only occurs while the host is running on that CPU. It does not - the PMU 
can leak general system details _while the guest is running_.

So for this and for the many other reasons we dont want to give a raw PMU to 
guests:

 - A paravirt event driver is more compatible and more transparent in the long 
   run: it allows hardware upgrade and upgraded PMU functionality (for Linux) 
   without having to upgrade the guest OS. Via that a guest OS could even be
   live-migrated to a different PMU, without noticing anything about it.

   In contrast, a 'stolen', 'raw' PMU directly programmed by the guest OS 
   always assumes the guest OS is upgraded to the host. Also, 'raw' PMU state 
   cannot be live-migrated. (save/restore doesnt help)

 - It's far cleaner on the host side as well: more granular, per event usage
   is possible. The guest can use portion of the PMU (managed by the host), 
   and the host can use a portion too.

   In contrast, a 'stolen', 'raw' PMU directly programmed by the guest OS
   precludes the host OS from running some different piece of instrumentation
   at the same time.

 - It's more secure: the host can have a finegrained policy about what kinds of
   events it exposes to the guest. It might chose to only expose software 
   events for example.

   In contrast, a 'stolen', 'raw' PMU directly programmed by the guest OS is
   an all-or-nothing policy affair: either you fully allow the guest (and live
   with whatever consequences the piece of hardware that takes up a fair chunk
   on the CPU die causes), or you allow none of it.

 - A proper paravirt event driver gives more features as well: it can exposes 
   host software events and tracepoints, probes - not restricting itself to 
   the 'hardware PMU' abstraction.

 - There's proper event scheduling and event allocation. Time-slicing, etc.


The thing is, we made quite similar arguments in the past, during the perfmon 
vs. perfcounters discussions. There's really a big advantage to proper 
abstractions, both on the host and on the guest side.
 
Ingo


Re: KVM PMU virtualization

2010-02-26 Thread Avi Kivity

On 02/26/2010 03:28 PM, Peter Zijlstra wrote:

On Fri, 2010-02-26 at 13:51 +0200, Avi Kivity wrote:

   

It would be the other way round - the host would steal the pmu from the
guest.  Later we can try to time-slice and extrapolate, though that's
not going to be easy.
 

Right, so perf already does the time slicing and interpolating thing, so
a soft-pmu gets that for free.
   


True.


Anyway, this discussion seems somewhat in a stale-mate position.

The KVM folks basically demand a full PMU MSR shadow with PMI
passthrough so that their $legacy shit works without modification.

My question with that is how $legacy muck can ever know how the current
PMU works, you can't even properly emulate a core2 pmu on a nehalem
because intel keeps messing with the event codes for every new model.
   


Right, this is pretty bad.  For Windows it's probably acceptable to 
upgrade your performance tools (since that's separate from the OS).  In 
Linux it is integrated into the kernel, and it's fairly unacceptable to 
demand a kernel upgrade when your host is upgraded underneath you.



So basically for this to work means the guest can't run legacy stuff
anyway, but needs to run very up-to-date software, so we might as well
create a soft-pmu/paravirt interface now and have all up-to-date
software support that for the next generation.
   


Still that leaves us with no Windows / non-Linux solution.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: KVM PMU virtualization

2010-02-26 Thread Jes Sorensen

On 02/26/10 14:28, Peter Zijlstra wrote:

On Fri, 2010-02-26 at 13:51 +0200, Avi Kivity wrote:


It would be the other way round - the host would steal the pmu from the
guest.  Later we can try to time-slice and extrapolate, though that's
not going to be easy.


Right, so perf already does the time slicing and interpolating thing, so
a soft-pmu gets that for free.


What I don't like here is that without rewriting the guest OS, there
will be two layers of time-slicing and extrapolation. That is going to
make the reported numbers close to useless.


Anyway, this discussion seems somewhat in a stale-mate position.

The KVM folks basically demand a full PMU MSR shadow with PMI
passthrough so that their $legacy shit works without modification.

My question with that is how $legacy muck can ever know how the current
PMU works, you can't even properly emulate a core2 pmu on a nehalem
because intel keeps messing with the event codes for every new model.

So basically for this to work means the guest can't run legacy stuff
anyway, but needs to run very up-to-date software, so we might as well
create a soft-pmu/paravirt interface now and have all up-to-date
software support that for the next generation.


That is the problem. Today there is a large install base out there of
core2 users who wish to measure their stuff on the hardware they have.
The same will be true for Nehalem-based stuff, when whatever replaces
Nehalem comes out and makes that incompatible.

Since we are unable to emulate Core2 on Nehalem, and almost certainly
will be unable to emulate Nehalem on it's successor, we are stuck with
this.

A para-virt interface is a nice idea, but since we cannot emulate an
old CPU properly it still means there isn't much we can do as we're
stuck with the same limitations. I simply don't see the value of
introducing a para-virt interface for this.


Furthermore, when KVM doesn't virtualize the physical system topology,
some PMU features cannot even be sanely used from a vcpu.


That is definitely an issue, and there is nothing we can really do about
that. Having two guests running in parallel under KVM means that they
are going to see more cache misses than they would if they ran barebone
on the hardware.

However even with all of this, we have to keep in mind who is going to
use the performance monitoring in a guest. It is going to be application
writers, mostly people writing analytical/scientific applications. They
rarely have control over the OS they are running on, but are given
systems and told to work on what they are given. Driver upgrades and
things like that don't come quickly. However they also tend to
understand limitations like these and will be able to still benefit from
perf on a system like that.


So while currently a root user can already tie up all of the pmu using
perf, simply using that to hand the full pmu off to the guest still
leaves lots of issues.


Well isn't that the case with the current setup anyway? If enough user
apps start requesting PMU resources, the hw is going to run out of
counters very quickly anyway.

The real issue here IMHO is whether or not is it possible to use a PMU
to count anything on different CPU? If that is really possible, sharing
the PMU is not an option :(

All that said, what we really want is for Intel+AMD to come up with
proper hw PMU virtualization support that makes it easy to rotate the
full PMU in and out for a guest. Then this whole discussion will become
a non issue.

Cheers,
Jes


Re: KVM PMU virtualization

2010-02-26 Thread Avi Kivity

On 02/26/2010 03:44 PM, Ingo Molnar wrote:

* Avi Kivitya...@redhat.com  wrote:

   

On 02/26/2010 03:06 PM, Ingo Molnar wrote:
 
   

Firstly, an emulated PMU was only the second-tier option i suggested. By far
the best approach is native API to the host regarding performance events and
good guest side integration.

Secondly, the PMU cannot be 'given' to the guest in the general case. Those
are privileged registers. They can expose sensitive host execution details,
etc. etc. So if you emulate a PMU you have to exit out of most PMU accesses
anyway for a secure solution. (RDPMC can still be supported, but in close
cooperation with the host)
   

There is nothing secret in the host PMU, and it's easy to clear out the
counters before passing them off to the guest.
 

That's wrong. On some CPUs the host PMU can be used to say sample aspects of
another CPU, allowing statistical attacks to recover crypto keys. It can be
used to sample memory access patterns of another node.

There's a good reason PMU configuration registers are privileged and there's
good value in only giving a certain sub-set to less privileged entities by
default.
   

Even if there were no security considerations, if the guest can observe host
data in the pmu, it means the pmu is inaccurate.  We should expose guest
data only in the guest pmu.  That's not difficult to do, you stop the pmu on
exit and swap the counters on context switches.
 

Again you are making an incorrect assumption: that information leakage via the
PMU only occurs while the host is running on that CPU. It does not - the PMU
can leak general system details _while the guest is running_.
   


You mean like bus transactions on a multicore?  Well, we're already 
exposed to cache timing attacks.



So for this and for the many other reasons we dont want to give a raw PMU to
guests:

  - A paravirt event driver is more compatible and more transparent in the long
run: it allows hardware upgrade and upgraded PMU functionality (for Linux)
without having to upgrade the guest OS. Via that a guest OS could even be
live-migrated to a different PMU, without noticing anything about it.
   


What about Windows?


In contrast, a 'stolen', 'raw' PMU directly programmed by the guest OS
always assumes the guest OS is upgraded to the host. Also, 'raw' PMU state
cannot be live-migrated. (save/restore doesnt help)
   


Why not?  So long as the source and destination are compatible?


  - It's far cleaner on the host side as well: more granular, per event usage
is possible. The guest can use portion of the PMU (managed by the host),
and the host can use a portion too.

In contrast, a 'stolen', 'raw' PMU directly programmed by the guest OS
precludes the host OS from running some different piece of instrumentation
at the same time.
   


Right, time slicing is something we want.


  - It's more secure: the host can have a finegrained policy about what kinds of
events it exposes to the guest. It might chose to only expose software
events for example.

In contrast, a 'stolen', 'raw' PMU directly programmed by the guest OS is
an all-or-nothing policy affair: either you fully allow the guest (and live
with whatever consequences the piece of hardware that takes up a fair chunk
on the CPU die causes), or you allow none of it.
   


No, we can hide insecure events with a full pmu.  Trap the control 
register and don't pass it on to the hardware.
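
A rough sketch of what that trap could do (hypothetical per-vcpu structure
and helper names; the whitelist shown is just a handful of architectural
events, a real policy would be configurable):

#include <linux/types.h>
#include <asm/msr.h>

#define EVTSEL_EVENT(v)     ((v) & 0xffULL)     /* event select field */
#define EVTSEL_ENABLE       (1ULL << 22)        /* EN bit */

struct vcpu_pmu {                               /* hypothetical per-vcpu state */
    u64 shadow_evtsel[4];
};

static bool event_allowed_for_guest(u64 evtsel)
{
    switch (EVTSEL_EVENT(evtsel)) {
    case 0x3c:      /* unhalted core cycles */
    case 0xc0:      /* instructions retired */
    case 0xc4:      /* branches retired */
    case 0xc5:      /* branch misses retired */
        return true;
    default:        /* everything else stays hidden from the guest */
        return false;
    }
}

/* wrmsr intercept for a guest write to IA32_PERFEVTSELx */
static void guest_evtsel_write(struct vcpu_pmu *pmu, int idx, u64 data)
{
    pmu->shadow_evtsel[idx] = data;     /* guest reads see its own value */
    if (!event_allowed_for_guest(data))
        data &= ~EVTSEL_ENABLE;         /* ...but the hardware never counts it */
    wrmsrl(MSR_P6_EVNTSEL0 + idx, data);
}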



  - A proper paravirt event driver gives more features as well: it can exposes
host software events and tracepoints, probes - not restricting itself to
the 'hardware PMU' abstraction.
   


But it is limited to whatever the host stack supports.  At least that's 
our control, but things like PEBS will take a ton of work.



  - There's proper event scheduling and event allocation. Time-slicing, etc.


The thing is, we made quite similar arguments in the past, during the perfmon
vs. perfcounters discussions. There's really a big advantage to proper
abstractions, both on the host and on the guest side.
   


We only control half of the equation.  That's very different compared to 
tools/perf.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: KVM PMU virtualization

2010-02-26 Thread Avi Kivity

On 02/26/2010 03:37 PM, Jes Sorensen wrote:

On 02/26/10 14:31, Ingo Molnar wrote:

You are missing two big things wrt. compatibility here:

  1) The first upgrade overhead is a one-time overhead only.

  2) Once a Linux guest has upgraded, it will work in the future, 
with _any_

 future CPU - _without_ having to upgrade the guest!

Dont you see the advantage of that? You can instrument an old system 
on new

hardware, without having to upgrade that guest for the new CPU support.


That would only work if you are guaranteed to be able to emulate old
hardware on new hardware. Not going to be feasible, so then we are in a
real mess.



That actually works on the Intel-only architectural pmu.  I'm beginning 
to like it more and more.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: Enhance perf to support KVM

2010-02-26 Thread Jes Sorensen

On 02/26/10 14:16, Ingo Molnar wrote:


* Avi Kivitya...@redhat.com  wrote:


That was not what i suggested tho. tools/kvm/ would work plenty fine.


I'll wait until we have tools/libc and tools/X.  After all, they affect a
lot more people and are concerned with a lot more kernel/user interfaces
than kvm.


So your answer can be summed up as: 'we wont do what makes sense technically
because others suck even more' ?


Well in this discussion what makes sense technically differs depending
on who you ask.

I will argue that emulating the MSR access doesn't make sense
technically, because there is no fixed specification we can rely on:
the spec seems to change randomly with every CPU family release from
Intel. In addition, the overhead makes the resulting numbers less
interesting, if useful at all.

Jes


Re: [PATCH 2/5] KVM: SVM: Optimize nested svm msrpm merging

2010-02-26 Thread Avi Kivity

On 02/26/2010 03:30 PM, Joerg Roedel wrote:



So the msrpm bitmap changes dynamically for each vcpu? Great, make it
fully dynamic then, changing the vcpu-arch.msrpm only from within its
vcpu context. No need for atomic ops.
 

The msrpm_offsets table is global. But I think I will follow Avi's
suggestions and create a static direct_access_msrs list and generate the
msrpm_offsets at module_init. This solves the problem of two independent
lists too.

   


But with LBR virt, maybe a fully dynamic approach is better.  Just have 
static lists for updating the msrpm and offset table dynamically.
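
For concreteness, the static-list variant mentioned above could look roughly
like this (sketch only: the MSRPM range layout is from my reading of the AMD
manual and the list entries are illustrative):

#include <linux/types.h>
#include <linux/kernel.h>
#include <linux/string.h>
#include <linux/init.h>

static const u32 direct_access_msrs[] = {
    0x00000174,     /* MSR_IA32_SYSENTER_CS - illustrative entries only */
    0xc0000081,     /* MSR_STAR */
    0xc0000084,     /* MSR_SYSCALL_MASK */
};

#define MSRPM_OFFSETS   16
static u32 msrpm_offsets[MSRPM_OFFSETS];

/* byte offset of an MSR's 2 permission bits: 2 bits per MSR, three
 * 0x2000-MSR ranges at byte offsets 0x0, 0x800 and 0x1000 */
static u32 msrpm_offset(u32 msr)
{
    if (msr <= 0x1fff)
        return msr / 4;
    if (msr >= 0xc0000000 && msr <= 0xc0001fff)
        return 0x800 + (msr - 0xc0000000) / 4;
    if (msr >= 0xc0010000 && msr <= 0xc0011fff)
        return 0x1000 + (msr - 0xc0010000) / 4;
    return ~0U;     /* not covered by the bitmap */
}

static void __init init_msrpm_offsets(void)
{
    int i, j;

    memset(msrpm_offsets, 0xff, sizeof(msrpm_offsets));
    for (i = 0; i < ARRAY_SIZE(direct_access_msrs); i++) {
        /* offsets are stored as u32 indices into the bitmap */
        u32 offset = msrpm_offset(direct_access_msrs[i]) / 4;

        for (j = 0; j < MSRPM_OFFSETS; j++) {
            if (msrpm_offsets[j] == offset)
                break;                  /* already listed */
            if (msrpm_offsets[j] == 0xffffffff) {
                msrpm_offsets[j] = offset;
                break;
            }
        }
    }
}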


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: KVM PMU virtualization

2010-02-26 Thread Ingo Molnar

* Avi Kivity a...@redhat.com wrote:

 On 02/26/2010 03:31 PM, Ingo Molnar wrote:
 * Avi Kivitya...@redhat.com  wrote:
 
 Or do you mean to define a new, kvm-specific pmu model and feed it off the
 host pmu?  In this case all the guests will need to be taught about it,
 which raises the compatibility problem.
 You are missing two big things wrt. compatibility here:
 
   1) The first upgrade overhead is a one-time overhead only.
 
 May be one too many, for certain guests.  Of course it may be argued
 that if the guest wants performance monitoring that much, they will
 upgrade.

Yes, that can certainly be argued.

Note another logical inconsistency: you are assuming reluctance to upgrade for 
a set of users who are doing _performance analysis_.

In fact those types of users are amongst the most upgrade-happy. Often they'll 
run modern hardware and modern software. Most of the time they are developers 
themselves who try to make sure their stuff works on the latest & greatest 
hardware _and_ software.

So people running P4's trying to tune their stuff under Red Hat Linux 9 and 
trying to use the PMU under KVM is not really a concern rooted overly deeply in 
reality.

 Certainly guests that we don't port won't be able to use this.  I doubt 
 we'll be able to make Windows work with this - the only performance tool I'm 
 familiar with on Windows is Intel's VTune, and that's proprietary.

Dont you see the extreme irony of your wish to limit Linux kernel design 
decisions and features based on ... Windows and other proprietary software?

   2) Once a Linux guest has upgraded, it will work in the future, with _any_
  future CPU - _without_ having to upgrade the guest!
 
 Dont you see the advantage of that? You can instrument an old system on new
 hardware, without having to upgrade that guest for the new CPU support.
 
 That also works for the architectural pmu, of course that's Intel
 only.  And there you don't need to upgrade the guest even once.

Besides being Intel only, it only exposes a limited sub-set of hw events. (far 
fewer than the generic ones offered by perf events)

Ingo


Re: Enhance perf to support KVM

2010-02-26 Thread Avi Kivity

On 02/26/2010 03:16 PM, Ingo Molnar wrote:

* Avi Kivitya...@redhat.com  wrote:

   

That was not what i suggested tho. tools/kvm/ would work plenty fine.
   

I'll wait until we have tools/libc and tools/X.  After all, they affect a
lot more people and are concerned with a lot more kernel/user interfaces
than kvm.
 

So your answer can be summed up as: 'we wont do what makes sense technically
because others suck even more' ?
   


I can sum up this remark of yours as 'whenever you disagree with me, I will 
rephrase your words to make you look like an idiot'.


If you believe I'm an idiot, there's no need to have this (or any) 
conversation.  If not, please refrain from this type of verbal gymnastics.



And it's not just the kernel-user interface (which btw., for the case of X
is far narrower than what KVM currently has to Qemu).

The issue is a basic question of software design: does kvm-qemu really make as
much sense without the kernel component as with it? The answer is: it will
borderline-work with CPU emulation (and i'm sure there are people making use
of it that way), but 90%+ of the userbase uses it with KVM and vice versa. It
is really a single logical component as far as maintenance goes, and
tools/kvm/ would make quite a bit of sense.
   


There are two separate questions.  Is there room for a kvm-only 
userspace component?  I believe so, but throwing away the momentum 
behind qemu would be foolish.


Does it make sense for such a component to live in linux.git?  IMO, no, 
and certainly a lot less than libc and X.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: KVM PMU virtualization

2010-02-26 Thread Jes Sorensen

On 02/26/10 14:27, Ingo Molnar wrote:


* Jes Sorensenjes.soren...@redhat.com  wrote:

You certainly cannot emulate the Core2 on a P4. The Core2 is Perfmon v2,
whereas Nehalem and Atom are v3 if I remember correctly. [...]


Of course you can emulate a good portion of it, as long as there's perf
support on the host side for P4.


Actually P4 is pretty uninteresting in this discussion due to the lack
of VMX support, it's the same issue for Nehalem vs Core2. The problem
is the same though, we cannot tell the guest that yes P4 has this
event, but no, we are going to feed you bogus data.


If the guest programs a cachemiss event, you program a cachemiss perf event on
the host and feed its values to the emulated MSR state. You _dont_ program the
raw PMU on the host side - just use the API i outlined to get struct
perf_event.

The emulation wont be perfect: not all events will count and not all events
will be available in a P4 (and some Core2 events might not even make sense in
a P4), but that is reality as well: often documented events dont count, and
often non-documented events count.

What matters to 99.9% of people who actually use this stuff is a few core sets
of events - which are available in P4s and in Core2 as well. Cycles,
instructions, branches, maybe cache-misses. Sometimes FPU stuff.
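
In code, the scheme quoted above might look roughly like this on the host
side - the in-kernel perf API names below are approximate and have changed
between kernel versions, so treat it as pseudocode for the idea:

#include <linux/perf_event.h>
#include <linux/sched.h>
#include <linux/string.h>

/* back one emulated guest counter with a generic host perf event instead
 * of touching the raw PMU */
static struct perf_event *back_guest_cachemiss_counter(void)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.size   = sizeof(attr);
    attr.type   = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CACHE_MISSES; /* decoded from the guest's MSR write */

    /* count in the context of the vcpu thread, on whatever cpu it runs on */
    return perf_event_create_kernel_counter(&attr, -1, current->pid, NULL);
}

static u64 emulated_counter_read(struct perf_event *event)
{
    u64 enabled, running;

    /* an emulated rdmsr of the counter just reports the perf event's count */
    return perf_event_read_value(event, &enabled, &running);
}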


I really do not like to make guesses about how people use this stuff.
The things you and I look for as kernel hackers are often very different
than application authors look for and use. That is one thing I learned
from being exposed to strange Fortran programmers at SGI.

It makes me very uncomfortable telling a guest OS that we offer features
X, Y, Z and then start lying, feeding back numbers that do not match what
was requested, and there is no way to tell the guest that.


For Linux-Linux the sanest, tier-1 approach would be to map sys_perf_open()
on the guest side over to the host, transparently, via a paravirt driver.


Paravirt is a nice optimization, but is and will always be an
optimization. Fact of the matter is that the bulk of usage of
virtualization is for running distributions with slow kernel
upgrade rates, like SLES and RHEL, and other proprietary operating
systems which we have no control over. Para-virt will do little good for
either of these groups.

Cheers,
Jes


PCI hotplug broken?

2010-02-26 Thread Alexander Graf
Hi list,

While trying to upgrade some internal infrastructure to qemu-kvm-0.12 I 
stumbled across this really weird problem that I see with current qemu-kvm git 
too:

I start qemu-kvm using:

./qemu-system-x86_64 -L ../pc-bios/ -m 512 -net nic,model=virtio -net 
tap,ifname=tap0,script=/bin/true -snapshot sles11.qcow2 -vnc :0 -monitor stdio

The system boots up just fine, networking works.

On the qemu monitor I then issue:

(qemu) pci_add auto storage file=/tmp/image.raw,if=virtio

after which I get a fully functional virtio block device, but the network stops 
sending/receiving packets.

The same thing with qemu-kvm-0.10 works just fine. Has anyone seen this before?


Alex


Re: KVM PMU virtualization

2010-02-26 Thread Avi Kivity

On 02/26/2010 04:07 PM, Jes Sorensen wrote:

On 02/26/10 14:27, Ingo Molnar wrote:


* Jes Sorensenjes.soren...@redhat.com  wrote:
You certainly cannot emulate the Core2 on a P4. The Core2 is Perfmon 
v2,

whereas Nehalem and Atom are v3 if I remember correctly. [...]


Of course you can emulate a good portion of it, as long as there's perf
support on the host side for P4.


Actually P4 is pretty uninteresting in this discussion due to the lack
of VMX support, it's the same issue for Nehalem vs Core2. The problem
is the same though, we cannot tell the guest that yes P4 has this
event, but no, we are going to feed you bogus data.


The Pentium D which is a P4 derivative has vmx support.  However it is 
so slow I'm fine with ignoring it for this feature.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: KVM PMU virtualization

2010-02-26 Thread Ingo Molnar

* Avi Kivity a...@redhat.com wrote:

 On 02/26/2010 03:44 PM, Ingo Molnar wrote:
 * Avi Kivitya...@redhat.com  wrote:
 
 On 02/26/2010 03:06 PM, Ingo Molnar wrote:
 Firstly, an emulated PMU was only the second-tier option i suggested. By 
 far
 the best approach is native API to the host regarding performance events 
 and
 good guest side integration.
 
 Secondly, the PMU cannot be 'given' to the guest in the general case. 
 Those
 are privileged registers. They can expose sensitive host execution 
 details,
 etc. etc. So if you emulate a PMU you have to exit out of most PMU 
 accesses
 anyway for a secure solution. (RDPMC can still be supported, but in close
 cooperation with the host)
 There is nothing secret in the host PMU, and it's easy to clear out the
 counters before passing them off to the guest.
 That's wrong. On some CPUs the host PMU can be used to say sample aspects 
 of
 another CPU, allowing statistical attacks to recover crypto keys. It can be
 used to sample memory access patterns of another node.
 
 There's a good reason PMU configuration registers are privileged and 
 there's
 good value in only giving a certain sub-set to less privileged entities by
 default.
 Even if there were no security considerations, if the guest can observe host
 data in the pmu, it means the pmu is inaccurate.  We should expose guest
 data only in the guest pmu.  That's not difficult to do, you stop the pmu on
 exit and swap the counters on context switches.
 Again you are making an incorrect assumption: that information leakage via 
 the
 PMU only occurs while the host is running on that CPU. It does not - the PMU
 can leak general system details _while the guest is running_.
 
 You mean like bus transactions on a multicore?  Well, we're already
 exposed to cache timing attacks.

If you give a full PMU to a guest it's a whole different dimension and quality 
of information. Literally hundreds of different events about all sorts of 
aspects of the CPU and the hardware in general.

 So for this and for the many other reasons we dont want to give a raw PMU to
 guests:
 
   - A paravirt event driver is more compatible and more transparent in the 
  long
 run: it allows hardware upgrade and upgraded PMU functionality (for 
  Linux)
 without having to upgrade the guest OS. Via that a guest OS could even be
 live-migrated to a different PMU, without noticing anything about it.
 
 What about Windows?

What is your question? Why should i limit Linux kernel design decisions based 
on any aspect of Windows? You might want to support it, but _please_ dont let 
the design be dictated by it ...

 In contrast, a 'stolen', 'raw' PMU directly programmed by the guest OS
 always assumes the guest OS is upgraded to the host. Also, 'raw' PMU 
  state
 cannot be live-migrated. (save/restore doesnt help)
 
 Why not?  So long as the source and destination are compatible?

'As long as it works' is certainly a good enough filter for quality ;-)

   - It's far cleaner on the host side as well: more granular, per event usage
 is possible. The guest can use portion of the PMU (managed by the host),
 and the host can use a portion too.
 
 In contrast, a 'stolen', 'raw' PMU directly programmed by the guest OS
 precludes the host OS from running some different piece of 
  instrumentation
 at the same time.
 
 Right, time slicing is something we want.
 
   - It's more secure: the host can have a finegrained policy about what 
  kinds of
 events it exposes to the guest. It might chose to only expose software
 events for example.
 
 In contrast, a 'stolen', 'raw' PMU directly programmed by the guest OS is
 an all-or-nothing policy affair: either you fully allow the guest (and 
  live
 with whatever consequences the piece of hardware that takes up a fair 
  chunk
 on the CPU die causes), or you allow none of it.
 
 No, we can hide insecure events with a full pmu.  Trap the control register 
 and don't pass it on to the hardware.

So you basically concede partial emulation ...

   - A proper paravirt event driver gives more features as well: it can 
  exposes
 host software events and tracepoints, probes - not restricting itself to
 the 'hardware PMU' abstraction.
 
 But it is limited to whatever the host stack supports.  At least
 that's our control, but things like PEBS will take a ton of work.

PEBS support is being implemented for perf, as a transparent feature. So once 
it's available, PEBS support will magically improve the quality of guest OS 
samples, if a paravirt driver approach is used and if sys_perf_event_open() is 
taught about that driver. Without any other change needed on the guest side.

   - There's proper event scheduling and event allocation. Time-slicing, etc.
 
 
  The thing is, we made quite similar arguments in the past, during the 
  perfmon vs. perfcounters discussions. There's really a big advantage to 
  proper abstractions, both on the host 

Re: KVM PMU virtualization

2010-02-26 Thread Avi Kivity

On 02/26/2010 04:01 PM, Ingo Molnar wrote:

* Avi Kivitya...@redhat.com  wrote:

   

On 02/26/2010 03:31 PM, Ingo Molnar wrote:
 

* Avi Kivitya...@redhat.com   wrote:

   

Or do you mean to define a new, kvm-specific pmu model and feed it off the
host pmu?  In this case all the guests will need to be taught about it,
which raises the compatibility problem.
 

You are missing two big things wrt. compatibility here:

  1) The first upgrade overhead is a one-time overhead only.
   

May be one too many, for certain guests.  Of course it may be argued
that if the guest wants performance monitoring that much, they will
upgrade.
 

Yes, that can certainly be argued.

Note another logical inconsistency: you are assuming reluctance to upgrade for
a set of users who are doing _performance analysis_.

In fact those types of users are amongst the most upgrade-happy. Often they'll
run modern hardware and modern software. Most of the time they are developers
themselves who try to make sure their stuff works on the latest & greatest
hardware _and_ software.
   


I wouldn't go as far, but I agree there is less resistance to change 
here.  A Windows user certainly ought to be willing to install a new 
VTune release, and a RHEL user can be convinced to upgrade from (say) 
5.4 to 5.6 with new backported paravirt pmu support.


I wouldn't like to force them to upgrade to 2.6.3x though.  Many of 
those users will be developers of in-house applications who are trying 
to understand their applications under production loads.



Certainly guests that we don't port won't be able to use this.  I doubt
we'll be able to make Windows work with this - the only performance tool I'm
familiar with on Windows is Intel's VTune, and that's proprietary.
 

Dont you see the extreme irony of your wish to limit Linux kernel design
decisions and features based on ... Windows and other proprietary software?
   


Not at all.  Virtualization is a hardware compatibility game.  To see 
what happens if you don't play it, see Xen.  Eventually they too 
implemented hardware support even though the pv approach is so wonderful.


If we go the pv route, we'll limit the usefulness of Linux in this 
scenario to a subset of guests.  Users will simply walk away and choose 
a hypervisor whose authors have less interest in irony and more in 
providing the features they want.


A pv approach can come after we have a baseline that is useful to all users.


  2) Once a Linux guest has upgraded, it will work in the future, with _any_
 future CPU - _without_ having to upgrade the guest!

Dont you see the advantage of that? You can instrument an old system on new
hardware, without having to upgrade that guest for the new CPU support.
   

That also works for the architectural pmu, of course that's Intel
only.  And there you don't need to upgrade the guest even once.
 

Besides being Intel only, it only exposes a limited sub-set of hw events. (far
fewer than the generic ones offered by perf events)

   


Things aren't mutually exclusive.  Offer the arch pmu for maximum future 
compatibility (Intel only, alas), the full pmu for maximum features, and 
the pv pmu for flexibility.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: Enhance perf to support KVM

2010-02-26 Thread Ingo Molnar

* Avi Kivity a...@redhat.com wrote:

 On 02/26/2010 03:16 PM, Ingo Molnar wrote:
 * Avi Kivitya...@redhat.com  wrote:
 
  That was not what i suggested tho. tools/kvm/ would work plenty fine.
 
  I'll wait until we have tools/libc and tools/X.  After all, they affect a 
  lot more people and are concerned with a lot more kernel/user interfaces 
  than kvm.
 
  So your answer can be summed up as: 'we wont do what makes sense 
  technically because others suck even more' ?
 
 I can sum up this remark of yours as 'whenever you disagree with me, I will 
 rephrase your words to make you look like an idiot'.

Two points:

1)

You can try to ridicule me if you want, but do you actually claim that my 
summary is inaccurate?

I do claim it's a substantially accurate summary: you said you will (quote:) 
wait with tools/kvm/ until we have tools/libc and tools/X.

I do think tools/X and tools/libc would make quite a bit of sense - this is 
one of the better design aspects of FreeBSD et al. It's a mistake that it's 
not being done.

2)

I used a question mark (the sentence was not a statement of fact), and you 
have no obligation to agree with the summary i provided.

Ingo


Re: KVM PMU virtualization

2010-02-26 Thread Peter Zijlstra
On Fri, 2010-02-26 at 15:55 +0200, Avi Kivity wrote:
 
 That actually works on the Intel-only architectural pmu.  I'm beginning 
 to like it more and more. 

Only for the arch defined events, all _7_ of them.



Re: KVM PMU virtualization

2010-02-26 Thread Ingo Molnar

* Avi Kivity a...@redhat.com wrote:

  Certainly guests that we don't port won't be able to use this.  I doubt
  we'll be able to make Windows work with this - the only performance tool 
  I'm
  familiar with on Windows is Intel's VTune, and that's proprietary.
 
  Dont you see the extreme irony of your wish to limit Linux kernel design 
  decisions and features based on ... Windows and other proprietary 
  software?
 
 Not at all.  Virtualization is a hardware compatibility game.  To see what 
 happens if you don't play it, see Xen.  Eventually they too implemented 
 hardware support even though the pv approach is so wonderful.

That's not quite equivalent though.

KVM used to be the clean, integrate-code-with-Linux virtualization approach, 
designed specifically for CPUs that can be virtualized properly. (VMX support 
first, then SVM, etc.)

KVM virtualized ages-old concepts with relatively straightforward hardware 
ABIs: x86 execution, IRQ abstractions, device abstractions, etc.

Now you are in essence turning that all around:

 - the PMU is by no means properly virtualized nor really virtualizable by 
   direct access. There's no virtual PMU that ticks independently of the host 
   PMU.

 - the PMU hardware itself is not a well standardized piece of hardware. It's 
   very vendor dependent and very limiting.

So to some degree you are playing the role of Xen in this specific affair. You 
are pushing for something that shouldnt be done in that form. You want to 
interfere with the host PMU by going via the fast & easy short-term hack to 
just let the guest OS have the PMU, without any regard to how this impacts 
long-term feasible solutions.

I.e. you are a bit like the guy who would have told Linus in 1994:

  Dude, why dont you use the Windows APIs? It's far more compatible and 
   that's the only way you could run any serious apps. Besides, it requires 
   no upgrade. Admittedly it's a bit messy and 16-bit but hey, that's our 
   installed base after all. 

Ingo


Re: KVM PMU virtualization

2010-02-26 Thread Peter Zijlstra
On Fri, 2010-02-26 at 14:51 +0100, Jes Sorensen wrote:
 
  Furthermore, when KVM doesn't virtualize the physical system topology,
  some PMU features cannot even be sanely used from a vcpu.
 
 That is definitely an issue, and there is nothing we can really do about
 that. Having two guests running in parallel under KVM means that they
 are going to see more cache misses than they would if they ran barebone
 on the hardware.
 
 However even with all of this, we have to keep in mind who is going to
 use the performance monitoring in a guest. It is going to be application
 writers, mostly people writing analytical/scientific applications. They
 rarely have control over the OS they are running on, but are given
 systems and told to work on what they are given. Driver upgrades and
 things like that don't come quickly. However they also tend to
 understand limitations like these and will be able to still benefit from
 perf on a system like that.

What I meant was things like memory-controller-bound counters, Intel
uncore and AMD northbridge: without knowing what node the vcpu got
scheduled to, there is no way the guest can program that raw hardware in a
meaningful way. The AMD NB case is particularly interesting in that you could
choose not to offer the Intel uncore MSRs, but the AMD NB events are shadowed
over the generic PMCs, so you have no way to filter those out.

Same goes for stuff like the Intel ANY flag, LBR filter control and
similar muck: a vcpu can't make use of those things in a meaningful
manner.

Also, Intel debug-store things require a host linear address, which again is
not something a vcpu can easily provide (that might be worked around
with an MSR trap, but it still limits you to 1-page data sizes, a
limitation not all software will respect).

 All that said, what we really want is for Intel+AMD to come up with
 proper hw PMU virtualization support that makes it easy to rotate the
 full PMU in and out for a guest. Then this whole discussion will become
 a non issue.

As it stands there simply are a number of PMU features that defy being
virtualized, simply because the virt stuff doesn't do system topology.
So even if they were to support a virtualized pmu, it would likely be a
different beast than the native hardware is, and it will be several
hardware models in the future, coming up with a paravirt interface and
getting !linux hosts to adapt and !linux guests to use is probably as
'easy'.



Re: KVM PMU virtualization

2010-02-26 Thread Peter Zijlstra
On Fri, 2010-02-26 at 15:30 +0200, Avi Kivity wrote:
 
 Even if there were no security considerations, if the guest can observe 
 host data in the pmu, it means the pmu is inaccurate.  We should expose 
 guest data only in the guest pmu.  That's not difficult to do, you stop 
 the pmu on exit and swap the counters on context switches. 

That's not enough; memory-node-wide counters are impossible to isolate
like that, and the same goes for core-wide (ANY flag) counters.





Re: KVM PMU virtualization

2010-02-26 Thread Peter Zijlstra
On Fri, 2010-02-26 at 15:30 +0200, Avi Kivity wrote:
 
 Scheduling at event granularity would be a good thing.  However we need 
 to be able to handle the guest using the full pmu. 

Does the full PMU include things like LBR, PEBS and uncore? In that
case, there is no way you're going to get that properly and securely
virtualized by using raw access.



Re: PCI hotplug broken?

2010-02-26 Thread Alexander Graf

On 26.02.2010, at 15:12, Alexander Graf wrote:

 Hi list,
 
 While trying to upgrade some internal infrastructure to qemu-kvm-0.12 I 
 stumbled across this really weird problem that I see with current qemu-kvm 
 git too:
 
 I start qemu-kvm using:
 
 ./qemu-system-x86_64 -L ../pc-bios/ -m 512 -net nic,model=virtio -net 
 tap,ifname=tap0,script=/bin/true -snapshot sles11.qcow2 -vnc :0 -monitor stdio
 
 The system boots up just fine, networking works.
 
 On the qemu monitor I then issue:
 
 (qemu) pci_add auto storage file=/tmp/image.raw,if=virtio
 
 after which I get a fully functional virtio block device, but the network 
 stops sending/receiving packets.
 
 The same thing with qemu-kvm-0.10 works just fine. Has anyone seen this 
 before?

Same thing happens when hotplug only:

pci_add auto nic model=virtio,vlan=0

- network works

pci_add auto storage file=/tmp/image.raw,if=virtio

- network stops working

pci_add auto nic model=virtio,vlan=0

- network works again on the new device


Alex


Re: KVM PMU virtualization

2010-02-26 Thread Avi Kivity

On 02/26/2010 04:12 PM, Ingo Molnar wrote:


Again you are making an incorrect assumption: that information leakage via the
PMU only occurs while the host is running on that CPU. It does not - the PMU
can leak general system details _while the guest is running_.
   

You mean like bus transactions on a multicore?  Well, we're already
exposed to cache timing attacks.
 

If you give a full PMU to a guest it's a whole different dimension and quality
of information. Literally hundreds of different events about all sorts of
aspects of the CPU and the hardware in general.
   


Well, we filter out the bad events then.


So for this and for the many other reasons we dont want to give a raw PMU to
guests:

  - A paravirt event driver is more compatible and more transparent in the long
run: it allows hardware upgrade and upgraded PMU functionality (for Linux)
without having to upgrade the guest OS. Via that a guest OS could even be
live-migrated to a different PMU, without noticing anything about it.
   

What about Windows?
 

What is your question? Why should i limit Linux kernel design decisions based
on any aspect of Windows? You might want to support it, but _please_ dont let
the design be dictated by it ...
   


In our case the quality of implementation is judged by how well we 
support workloads that users run, and that means we have to support 
Windows well.  And that more or less means we can't have a pv-only pmu.


Which part of this do you disagree with?


In contrast, a 'stolen', 'raw' PMU directly programmed by the guest OS
always assumes the guest OS is upgraded to the host. Also, 'raw' PMU state
cannot be live-migrated. (save/restore doesnt help)
   

Why not?  So long as the source and destination are compatible?
 

'As long as it works' is certainly a good enough filter for quality ;-)
   


We already have this.  If you expose sse4.2 to the guest, you can't 
migrate to a host which doesn't support it.  If you expose a Nehalem pmu 
to the guest, you can't migrate to a host which doesn't support it.  Users and 
tools already understand this.


It's true that the pmu case is more difficult since you can't migrate 
forwards as well as backwards, but that's life.



No, we can hide insecure events with a full pmu.  Trap the control register
and don't pass it on to the hardware.
 

So you basically concede partial emulation ...
   


Yes.  Still appears to follow the spec to the guest, though.  And with 
the option of full emulation for those who need it and sign on the 
dotted line.



  - There's proper event scheduling and event allocation. Time-slicing, etc.


The thing is, we made quite similar arguments in the past, during the
perfmon vs. perfcounters discussions. There's really a big advantage to
proper abstractions, both on the host and on the guest side.
   

We only control half of the equation.  That's very different compared to
tools/perf.
 

You mean Windows?

For heaven's sake, why dont you think like Linus thought 20 years ago. To the
hell with Windows suckiness and lets make sure our stuff works well.


In our case, making our stuff work well means making sure guests of the 
user's choice run well.  Not ours.  Currently users mostly choose 
Windows and Linux, so we have to make them both work.


(btw, the analogy would be, 'To hell with Unix suckiness, let's make 
sure our stuff works well'; where Linux reimplemented the Unix APIs, 
ensuring source compatibility with applications, kvm reimplements the 
hardware interface, ensuring binary compatibility with guests).



  Then the
users will come, developers will come, and people will profile Linux under
Linux and maybe the tools will be so good that they'll profile under Linux
using Wine just to be able to use those good tools...
   


If we don't support Windows well, users will walk away, followed by 
starving developers.



If you gut Linux capabilities like that to accommodate for the suckiness of
Windows, without giving a technological edge to Linux, and then we are bound
to fail in the long run ...
   


I'm all for abusing the tight relationship between Linux-as-a-host and 
Linux-as-a-guest to gain an advantage for both.  One fruitful area would 
be asynchronous page faults, which has the potential to increase memory 
overcommit, for example.  But first of all we need to make sure that 
there is a baseline of support for all commonly used guests.


I think of it this way: once kvm deployment becomes widespread, 
Linux-as-a-guest gains an advantage.  But in order for kvm deployment to 
become widespread, it needs excellent support for all guests users 
actually use.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: KVM PMU virtualization

2010-02-26 Thread Avi Kivity

On 02/26/2010 04:27 PM, Peter Zijlstra wrote:

On Fri, 2010-02-26 at 15:55 +0200, Avi Kivity wrote:
   

That actually works on the Intel-only architectural pmu.  I'm beginning
to like it more and more.
 

Only for the arch defined events, all _7_ of them.
   


That's 7 more than what we support now, and 7 more than what we can 
guarantee without it.
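
For reference, those seven are the events enumerated by CPUID leaf 0xA; a
small stand-alone sketch of reading that leaf (field layout per the Intel
SDM, the program itself is purely illustrative):

#include <stdio.h>
#include <cpuid.h>

static const char *arch_events[] = {
	"unhalted core cycles",
	"instructions retired",
	"unhalted reference cycles",
	"LLC references",
	"LLC misses",
	"branch instructions retired",
	"branch mispredicts retired",
};

int main(void)
{
	unsigned int eax, ebx, ecx, edx, i, vec_len;

	if (!__get_cpuid(0xa, &eax, &ebx, &ecx, &edx))
		return 1;

	/* under a hypervisor that hides the PMU this simply reports version 0 */
	printf("arch perfmon version: %u\n", eax & 0xff);
	printf("GP counters: %u, %u bits wide\n",
	       (eax >> 8) & 0xff, (eax >> 16) & 0xff);

	/* EBX is an "event not available" mask: a clear bit means supported */
	vec_len = (eax >> 24) & 0xff;
	for (i = 0; i < 7 && i < vec_len; i++)
		printf("  %-28s %s\n", arch_events[i],
		       (ebx & (1u << i)) ? "not available" : "available");
	return 0;
}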


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: Enhance perf to support KVM

2010-02-26 Thread Avi Kivity

On 02/26/2010 04:23 PM, Ingo Molnar wrote:

* Avi Kivity <a...@redhat.com> wrote:

   

On 02/26/2010 03:16 PM, Ingo Molnar wrote:
 

* Avi Kivity <a...@redhat.com> wrote:

   

That was not what i suggested tho. tools/kvm/ would work plenty fine.

   

I'll wait until we have tools/libc and tools/X.  After all, they affect a
lot more people and are concerned with a lot more kernel/user interfaces
than kvm.
 

So your answer can be summed up as: 'we wont do what makes sense
technically because others suck even more' ?
   

I can sum up your this remark as 'whenever you disagree with me, I will
rephrase your words to make you look like an idiot'.
 

Two points:

1)

You can try to ridicule me if you want,


I'd much prefer it if if no ridiculing was employed on either side.


  but do you actually claim that my
summary is inaccurate?

I do claim it's a substantially accurate summary: you said you will (quote:)
wait with tools/kvm/ until we have tools/libc and tools/X.

I do think tools/X and tools/libc would make quite a bit of sense - this is
one of the better design aspects of FreeBSD et al. It's a mistake that it's
not being done.
   


There are arguments for libc to be developed in linux-2.6.git, and 
arguments against.  The fact is that they are not, so presumably the 
arguments against plus inertia outweigh the arguments for.


The same logic holds for kvm, except that there are less arguments for 
development in linux-2.6.git.  Only a small part of qemu is actually 
concerned with kvm; most of it is mucking around with X, emulating old 
devices, emulating instruction sets (irrelevant for tools/kvm) and doing 
boring managementy stuff.


Do we really want to add several hundred thousand lines to Linux, only
a few thousand or so of which talk to the kernel?



2)

I used a question mark (the sentence was not a statement of fact), and you
have no obligation to agree with the summary i provided.

   


Thanks.  I hope you don't agree with it either.

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: KVM PMU virtualization

2010-02-26 Thread Peter Zijlstra
On Fri, 2010-02-26 at 16:54 +0200, Avi Kivity wrote:
 On 02/26/2010 04:27 PM, Peter Zijlstra wrote:
  On Fri, 2010-02-26 at 15:55 +0200, Avi Kivity wrote:
 
  That actually works on the Intel-only architectural pmu.  I'm beginning
  to like it more and more.
   
  Only for the arch defined events, all _7_ of them.
 
 
 That's 7 more than what we support now, and 7 more than what we can 
 guarantee without it.

Again, what windows software uses only those 7? Does it pay to only have
access to those 7 or does it limit the usability to exactly the same
subset a paravirt interface would?



Re: KVM PMU virtualization

2010-02-26 Thread Avi Kivity

On 02/26/2010 05:08 PM, Peter Zijlstra wrote:

That's 7 more than what we support now, and 7 more than what we can
guarantee without it.
 

Again, what windows software uses only those 7? Does it pay to only have
access to those 7 or does it limit the usability to exactly the same
subset a paravirt interface would?
   


Good question.  Would be interesting to try out VTune with the non-arch 
pmu masked out.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: KVM PMU virtualization

2010-02-26 Thread Peter Zijlstra
On Fri, 2010-02-26 at 16:53 +0200, Avi Kivity wrote:
  If you give a full PMU to a guest it's a whole different dimension and 
  quality
  of information. Literally hundreds of different events about all sorts of
  aspects of the CPU and the hardware in general.
 
 
 Well, we filter out the bad events then. 

Which requires trapping the MSR access, at which point a soft-PMU is
almost there, right?



Re: KVM PMU virtualization

2010-02-26 Thread Peter Zijlstra
On Fri, 2010-02-26 at 17:11 +0200, Avi Kivity wrote:
 On 02/26/2010 05:08 PM, Peter Zijlstra wrote:
  That's 7 more than what we support now, and 7 more than what we can
  guarantee without it.
   
  Again, what windows software uses only those 7? Does it pay to only have
  access to those 7 or does it limit the usability to exactly the same
  subset a paravirt interface would?
 
 
 Good question.  Would be interesting to try out VTune with the non-arch 
 pmu masked out.

From what I understood VTune uses PEBS+LBR, although I suppose they have
simple PMU modes too, never actually seen the software.



Re: [ANNOUNCE] kvm-kmod-2.6.33

2010-02-26 Thread Ingmar Schraub
Hello Jan,

I can compile kvm-kmod-2.6.32.9 under Ubuntu 9.1 64-Bit, but 'make
install' fails with

ing...@nexoc:~/KVM/kvm-kmod-2.6.32.9$ sudo make install
[sudo] password for ingmar:
mkdir -p ///usr/local/include/kvm-kmod/asm/
install -m 644 usr/include/asm-x86/{kvm,kvm_para}.h
///usr/local/include/kvm-kmod/asm/
install: cannot stat `usr/include/asm-x86/{kvm,kvm_para}.h': No such
file or directory
make: *** [install-hdr] Error 1

Before I used kvm-kmod-2.6.32.3 which installs just fine:

ing...@nexoc:~/KVM/kvm-kmod-2.6.32.3$ sudo make install
mkdir -p ///lib/modules/2.6.31-19-generic/extra
cp x86/*.ko ///lib/modules/2.6.31-19-generic/extra
for i in ///lib/modules/2.6.31-19-generic/kernel/drivers/kvm/*.ko \
 ///lib/modules/2.6.31-19-generic/kernel/arch/x86/kvm/*.ko; do \
if [ -f $i ]; then mv $i $i.orig; fi; \
done
/sbin/depmod -a 2.6.31-19-generic -b /
install -m 644 -D scripts/65-kvm.rules //etc/udev/rules.d/65-kvm.rules
install -m 644 -D usr/include/asm-x86/kvm.h
///usr/local/include/kvm-kmod/asm/kvm.h
install -m 644 -D usr/include/linux/kvm.h
///usr/local/include/kvm-kmod/linux/kvm.h
sed 's|PREFIX|/usr/local|; s/VERSION/kvm-kmod-2.6.32.3/' kvm-kmod.pc > .tmp.kvm-kmod.pc
install -m 644 -D .tmp.kvm-kmod.pc ///usr/local/lib/pkgconfig/kvm-kmod.pc

Any idea what could be wrong?

Regards,

Ingmar

Jan Kiszka wrote:
 Now that 2.6.33 is out, time to release the corresponding kvm-kmod
 package as well. Not much has happened since 2.6.33-rc6, though.
 
 KVM changes since kvm-kmod-2.6.33-rc6:
  - PIT: control word is write-only
(fixes side-effects of spurious reads)
  - kvmclock: count total_sleep_time when updating guest clock
(requires >= 2.6.32.9 as host, falls back to unfixed version
otherwise)
 
 kvm-kmod changes:
  - warn about kvmclock issues across host suspend/resume
  - detect host kernel extra version to make use of fixes in stable
series
 
 See [1] for the delta to 2.6.32.
 
 I also released kvm-kmod-2.6.32.9 with basically the same changes. That
 may be the last release based on that kernel, but nothing is set in
 stone yet (specifically as we already maintain kvm-kmod-2.6.32
 internally for a customer).
 
 Jan
 
 [1] 
 https://sourceforge.net/projects/kvm/files/kvm-kmod/2.6.33-rc6/changelog/view


Re: [ANNOUNCE] kvm-kmod-2.6.33

2010-02-26 Thread Jan Kiszka
Ingmar Schraub wrote:
 Hello Jan,
 
 I can compile kvm-kmod-2.6.32.9 under Ubuntu 9.1 64-Bit, but 'make
 install' fails with
 
 ing...@nexoc:~/KVM/kvm-kmod-2.6.32.9$ sudo make install
 [sudo] password for ingmar:
 mkdir -p ///usr/local/include/kvm-kmod/asm/
 install -m 644 usr/include/asm-x86/{kvm,kvm_para}.h
 ///usr/local/include/kvm-kmod/asm/
 install: cannot stat `usr/include/asm-x86/{kvm,kvm_para}.h': No such
 file or directory
 make: *** [install-hdr] Error 1
 
 Before I used kvm-kmod-2.6.32.3 which installs just fine:
 
 ing...@nexoc:~/KVM/kvm-kmod-2.6.32.3$ sudo make install
 mkdir -p ///lib/modules/2.6.31-19-generic/extra
 cp x86/*.ko ///lib/modules/2.6.31-19-generic/extra
 for i in ///lib/modules/2.6.31-19-generic/kernel/drivers/kvm/*.ko \
///lib/modules/2.6.31-19-generic/kernel/arch/x86/kvm/*.ko; do \
   if [ -f $i ]; then mv $i $i.orig; fi; \
   done
 /sbin/depmod -a 2.6.31-19-generic -b /
 install -m 644 -D scripts/65-kvm.rules //etc/udev/rules.d/65-kvm.rules
 install -m 644 -D usr/include/asm-x86/kvm.h
 ///usr/local/include/kvm-kmod/asm/kvm.h
 install -m 644 -D usr/include/linux/kvm.h
 ///usr/local/include/kvm-kmod/linux/kvm.h
 sed 's|PREFIX|/usr/local|; s/VERSION/kvm-kmod-2.6.32.3/' kvm-kmod.pc > .tmp.kvm-kmod.pc
 install -m 644 -D .tmp.kvm-kmod.pc ///usr/local/lib/pkgconfig/kvm-kmod.pc
 
 Any idea what could be wrong?
 

Likely bash'ism of mine (what's your shell?). This should fix it:

diff --git a/Makefile b/Makefile
index 94dde5c..c031701 100644
--- a/Makefile
+++ b/Makefile
@@ -62,9 +62,9 @@ KVM_KMOD_VERSION = $(strip $(if $(wildcard KVM_VERSION), \
 
 install-hdr:
mkdir -p $(DESTDIR)/$(HEADERDIR)/asm/
-	install -m 644 usr/include/asm-$(ARCH_DIR)/{kvm,kvm_para}.h $(DESTDIR)/$(HEADERDIR)/asm/
+	install -m 644 usr/include/asm-$(ARCH_DIR)/*.h $(DESTDIR)/$(HEADERDIR)/asm/
 	mkdir -p $(DESTDIR)/$(HEADERDIR)/linux/
-	install -m 644 usr/include/linux/{kvm,kvm_para}.h $(DESTDIR)/$(HEADERDIR)/linux/
+	install -m 644 usr/include/linux/*.h $(DESTDIR)/$(HEADERDIR)/linux/
 	sed 's|PREFIX|$(PREFIX)|; s/VERSION/$(KVM_KMOD_VERSION)/' kvm-kmod.pc > $(tmppc)
 	install -m 644 -D $(tmppc) $(DESTDIR)/$(PKGCONFIGDIR)/kvm-kmod.pc
 

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


Offline for a week

2010-02-26 Thread Avi Kivity
I will be on vacation and offline, pmu threads included, for a week. 
Marcelo will handle all kvm issues as usual.


--
Do not meddle in the internals of kernels, for they are subtle and quick 
to panic.




Re: KVM PMU virtualization

2010-02-26 Thread Peter Zijlstra
On Fri, 2010-02-26 at 17:11 +0200, Avi Kivity wrote:
 On 02/26/2010 05:08 PM, Peter Zijlstra wrote:
  That's 7 more than what we support now, and 7 more than what we can
  guarantee without it.
   
  Again, what windows software uses only those 7? Does it pay to only have
  access to those 7 or does it limit the usability to exactly the same
  subset a paravirt interface would?
 
 
 Good question.  Would be interesting to try out VTune with the non-arch 
 pmu masked out.

Also, the ANY bit is part of the intel arch pmu, but you still have to
mask it out.
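
A tiny sketch of that masking step (the helper name is invented; the
ANY/AnyThread bit really is bit 21 of IA32_PERFEVTSELx per the SDM):

#include <stdint.h>

#define EVTSEL_ANY_THREAD	(1ULL << 21)	/* count both SMT siblings */

/* Strip the ANY bit from whatever the guest writes, so a guest counter
 * can never observe events of the sibling hyperthread. */
static inline uint64_t mask_any_bit(uint64_t guest_evtsel)
{
	return guest_evtsel & ~EVTSEL_ANY_THREAD;
}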

BTW, just wondering, why would a developer be running VTune in a guest
anyway? I'd think that a developer that windows oriented would simply
run windows on his desktop and VTune there.



Re: KVM PMU virtualization

2010-02-26 Thread Avi Kivity

On 02/26/2010 04:37 PM, Ingo Molnar wrote:

* Avi Kivity <a...@redhat.com> wrote:

   

Certainly guests that we don't port won't be able to use this.  I doubt
we'll be able to make Windows work with this - the only performance tool I'm
familiar with on Windows is Intel's VTune, and that's proprietary.
 

Dont you see the extreme irony of your wish to limit Linux kernel design
decisions and features based on ... Windows and other proprietary
software?
   

Not at all.  Virtualization is a hardware compatibility game.  To see what
happens if you don't play it, see Xen.  Eventually they too implemented
hardware support even though the pv approach is so wonderful.
 

That's not quite equivalent though.

KVM used to be the clean, integrate-code-with-Linux virtualization approach,
designed specifically for CPUs that can be virtualized properly. (VMX support
first, then SVM, etc.)

KVM virtualized ages-old concepts with relatively straightforward hardware
ABIs: x86 execution, IRQ abstractions, device abstractions, etc.

Now you are in essence turning that all around:

  - the PMU is by no means properly virtualized nor really virtualizable by
direct access. There's no virtual PMU that ticks independently of the host
PMU.
   


There's no guest debug registers that can be programmed independently of 
the host debug registers, but we manage somehow.  It's not perfect, but 
better than nothing.
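
A rough sketch of how that debug-register analogy works (illustrative
only, not the actual KVM code; the "physical" registers are modelled as an
array here since moves to/from DR0-DR7 are privileged instructions):

#include <stdint.h>

struct dbg_regs {
	uint64_t dr[4];		/* DR0-DR3: breakpoint addresses */
	uint64_t dr6;		/* status  */
	uint64_t dr7;		/* control */
};

static struct dbg_regs host_dbg, guest_dbg;

/* stand-ins for the real mov-to/from-DR instructions */
static uint64_t phys_dr[8];
static void write_dr(int n, uint64_t val) { phys_dr[n] = val; }
static uint64_t read_dr(int n) { return phys_dr[n]; }

/* before entering the guest: save host state, arm the guest's registers */
static void load_guest_debug_regs(void)
{
	int i;

	for (i = 0; i < 4; i++) {
		host_dbg.dr[i] = read_dr(i);
		write_dr(i, guest_dbg.dr[i]);
	}
	host_dbg.dr6 = read_dr(6);
	host_dbg.dr7 = read_dr(7);
	write_dr(6, guest_dbg.dr6);
	write_dr(7, guest_dbg.dr7);	/* DR7 last: this arms the breakpoints */
}

/* after a vmexit: save what the guest programmed, restore the host state */
static void restore_host_debug_regs(void)
{
	int i;

	write_dr(7, 0);			/* disarm before swapping */
	guest_dbg.dr6 = read_dr(6);
	for (i = 0; i < 4; i++) {
		guest_dbg.dr[i] = read_dr(i);
		write_dr(i, host_dbg.dr[i]);
	}
	write_dr(6, host_dbg.dr6);
	write_dr(7, host_dbg.dr7);
}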


For the common case of host-only or guest-only monitoring, things will 
work, perhaps without socketwide counters in security concious 
environments.  When both are used at the same time, something will have 
to give.



  - the PMU hardware itself is not a well standardized piece of hardware. It's
very vendor dependent and very limiting.
   


That's life.  If we force standardization by having a soft pmu, we'll be 
very limited as well.  If we don't, we reduce hardware independence 
which is a strong point of virtualization.  Clearly we need to make a 
trade-off here.


In favour of hardware dependence is that tools and users are already 
used to it.  There is also the architectural pmu that can provide a 
limited form of hardware independence.


Going pv trades off hardware dependence for software dependence.  
Suddenly only guests that you have control over can use the pmu.



So to some degree you are playing the role of Xen in this specific affair. You
are pushing for something that shouldnt be done in that form. You want to
interfere with the host PMU by going via the fast & easy short-term hack to
just let the guest OS have the PMU, without any regard to how this impacts
long-term feasible solutions.
   


Maybe.  And maybe the vendors will improve virtualization support for 
the pmu, rendering the pv approach obsolete on new hardware.



I.e. you are a bit like the guy who would have told Linus in 1994:

   Dude, why dont you use the Windows APIs? It's far more compatible and
that's the only way you could run any serious apps. Besides, it requires
no upgrade. Admittedly it's a bit messy and 16-bit but hey, that's our
installed base after all. 
   


Hey, maybe we'd have significant desktop market share if he'd done this 
(though a replay of the wine history is much more likely).


But what are you suggesting?  That we make Windows a second class 
guest?  Most users run a mix of workloads, that will not go down well 
with them.  The choice is between first-class Windows support vs 
becoming a hobby hypervisor.


Let's make a kernel/user analogy again.  Would you be in favour of 
GPL-only-ing new syscalls, to give open source applications an edge over 
proprietary apps (technically known as crap among some)?


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: KVM PMU virtualization

2010-02-26 Thread Avi Kivity

On 02/26/2010 05:55 PM, Peter Zijlstra wrote:

BTW, just wondering, why would a developer be running VTune in a guest
anyway? I'd think that a developer that windows oriented would simply
run windows on his desktop and VTune there.
   


Cloud.

You have an app running somewhere on a cloud, internally or externally 
(you may not even know).  It's running a production workload and it 
isn't doing well.  You can't reproduce it on your desktop (works for 
me, now go away).  So you rdesktop to your guest and monitor it.


You can't run anything on the host - you don't have access to it, you 
don't know who admins it (it's a program anyway), the host doesn't 
even exist, the guest moves around whenever the cloud feels like it.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: KVM PMU virtualization

2010-02-26 Thread Avi Kivity

On 02/26/2010 06:03 PM, Avi Kivity wrote:

Note, I'll be away for a week, so will not be responsive for a while

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: repeatable hang with loop mount and heavy IO in guest

2010-02-26 Thread Antoine Martin



  1   0   0  98   0   1|   0 0 |  66B  354B|   0 0 |  3011
  1   1   0  98   0   0|   0 0 |  66B  354B|   0 0 |  2911
From that point onwards, nothing will happen.
The host has disk IO to spare... So what is it waiting for??

Moved to an AMD64 host. No effect.
Disabled swap before running the test. No effect.
Moved the guest to a fully up-to-date FC12 server 
(2.6.31.6-145.fc12.x86_64), no effect.
I have narrowed it down to the guest's filesystem used for backing the 
disk image which is loop mounted: although it was not completely full 
(and had enough inodes), freeing some space on it prevents the system 
from misbehaving.


FYI: the disk image was clean and was fscked before each test. kvm had 
been updated to 0.12.3
The weird thing is that the same filesystem works fine (no system hang) 
if used directly from the host, it is only misbehaving via kvm...


So I am not dismissing the possibility that kvm may be at least partly 
to blame, or that it is exposing a filesystem bug (race?) not normally 
encountered.
(I have backed up the full 32GB virtual disk in case someone suggests 
further investigation)


Cheers
Antoine


Re: Enhance perf to support KVM

2010-02-26 Thread Avi Kivity

On 02/26/2010 01:17 PM, Ingo Molnar wrote:

Nobody is really 'in charge' of how KVM gets delivered to the user. You
isolated the fun kernel part for you and pushed out the boring bits to
user-space. So if mundane things like mouse integration sucks 'hey that's a
user-space tooling problem', if file integration sucks then 'hey, that's an
admin problem', if it cannot be used over the network 'hey, that's an Xorg
problem', etc. etc.
   


btw, mouse integration works with -usbdevice tablet and recent Fedoras, 
'it was an X.org driver problem'.


Really, I don't understand your problems.

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: vlan disable TSO of virtio in KVM

2010-02-26 Thread Sridhar Samudrala
On Fri, 2010-02-26 at 10:51 +0800, David V. Cloud wrote:
 Hi,
 
 I read some kernel source. My basic understanding is that, in
 net/8021q/vlan_dev.c, vlan_dev_init(), the dev->features of the
 vconfig-created interface is defined to be
 dev->features |= real_dev->features & real_dev->vlan_features;
 
 However, in drivers/net/virtio_net.c, vlan_features is never set (I
 will assume it is 0). So, dev->features will be 0 for the
 ethX.vid interface.
 
 I verified it using ethtool -k on each KVM guest.
 # ethtool -k eth0
 shows that rx/tx csum, sg, tso, gso are on
 # ethtool -k eth0.3003
 all offloading features are off.
 
 I think that is why TSO is never enabled when running large-packet
 traffic between two vlan interfaces on different KVM guests.
 
 I also took a look at VMware's pv implementation,
 drivers/net/vmxnet3/vmxnet3_drv.c; they enable dev->vlan_features
 when probing, via
   netdev->vlan_features = netdev->features;
 
 
 I was wondering why vlan_features is not set in virtio_net. Is it
 a bug, or is it due to some constraint?
 Could anyone explain that?

I saw the same issue some time back and submitted a couple of patches
to address it, but they were not accepted as the fix was not done in the
right place. Not sure if we can do this right without updating
virtio_net_hdr with vlan-specific info.
http://thread.gmane.org/gmane.linux.network/150197/focus=150838
http://thread.gmane.org/gmane.linux.network/150198/focus=150837
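
For illustration, a rough sketch of the vmxnet3-style change being
discussed (placement in virtnet_probe() and the exact offload bits are
assumptions; this is not one of the patches above and may well be
insufficient for virtio, since the host side also has to cope with
VLAN-tagged GSO frames):

#include <linux/netdevice.h>

static void virtnet_set_vlan_features_sketch(struct net_device *dev)
{
	/* dev->features has already been filled in from the negotiated
	 * host features (VIRTIO_NET_F_CSUM, VIRTIO_NET_F_HOST_TSO4, ...);
	 * propagate the offloads to 802.1q sub-devices created on top. */
	dev->vlan_features = dev->features &
			     (NETIF_F_SG | NETIF_F_HW_CSUM |
			      NETIF_F_TSO | NETIF_F_TSO6);
}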

Thanks
Sridhar

 
 Thanks,
 -D
 
 
 
 On Thu, Feb 25, 2010 at 6:30 PM, David V. Cloud <david.v.cl...@gmail.com> wrote:
  Hi all,
  I have been deploying two KVMs on my Debian testing box. Two KVMs each
  use one tap device connecting to the host.
 
  When I
  did netperf with a large packet size from KVM2 (tap1) to KVM1 (tap0)
  using ethX on them, I could verify that TSO did happen by
   # tcpdump -nt -i tap1
  I can see messages like,
  IP 192.168.101.2.39994 > 192.168.101.1.41316: Flags [P.], seq
  7912865:7918657, ack 0, win 92, options [nop,nop,TS val 874151 ecr
  874803], length 5792
 
  So, according to the 'length', the skb didn't get segmented.
 
  However, when I
  (1) set up VLANs using vconfig on KVM2 and KVM1, getting two new interfaces,
  eth1.3003 and eth0.3003, on the two machines, and
  (2) ran netperf between the two new interfaces, TSO no longer showed up,
   # tcpdump -nt -i tap1
  I only got,
  vlan 3003, p 0, IP 10.214.10.2.42324 > 10.214.10.1.56460: Flags [P.],
  seq 2127976:2129424, ack 1, win 92, options [nop,nop,TS val 926034 ecr
  926686], length 1448
 
  So, all the large packets get segmented in virtio (is that right?)
 
  My KVM command line options are,
  kvm -hda $IMG -m 768 -net nic,model=virtio,macaddr=52:54:00:12:34:56
  -net tap,ifname=$TAP,script=no
 
 
 
  My question is whether this is the expected behavior. Can VLAN tagging
  coexist with TSO in the virtio_net driver?
  If this is not the desired result, any hint for fixing the problem?
 
  Thanks
  -D
 



