Re: qemu-kvm: Inconsistent vgabios reference

2012-01-16 Thread Gerd Hoffmann

  Hi,

 Now, I've a question: what seabios implementation of extboot we're
 talking about?

seabios boots natively from virtio now, so for that use case extboot
isn't needed any more.

  -option-rom which is impossible to select as a first
 boot device?

'qemu-kvm -option-rom romfile=/root/roms/8xx_64.rom,bootindex=1' will do.

HTH,
  Gerd

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC][PATCH] KVM: perf: a smart tool to analyse kvm events

2012-01-16 Thread Xiao Guangrong
This tool is much like xenoprof (if I remember correctly) and analyzes kvm
events. Currently, it supports vmexit/mmio/ioport events.

Usage:
- to trace kvm events:
# ./perf kvm-events record

- show the result
# ./perf kvm-events report

Sample output:
# ./perf kvm-events report
  Warning: Error: expected type 5 but read 4
  Warning: Error: expected type 5 but read 0
  Warning: unknown op '}'


Analyze events for all VCPUs:

             VM-EXIT    Samples  Samples%     Time%    Avg time

         APIC_ACCESS     438107    44.89%     6.20%     17.91us
  EXTERNAL_INTERRUPT     219226    22.46%     8.01%     46.20us
      IO_INSTRUCTION     122651    12.57%     1.88%     19.44us
       EPT_VIOLATION      83110     8.52%     1.36%     20.75us
   PENDING_INTERRUPT      37055     3.80%     0.16%      5.38us
               CPUID      32718     3.35%     0.08%      3.15us
       EXCEPTION_NMI      23601     2.42%     0.17%      8.87us
                 HLT      15424     1.58%    82.12%   6735.06us
           CR_ACCESS       4089     0.42%     0.02%      6.08us

Total Samples:975981, Total events handled time:126502464.88us.
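
The Samples%, Time% and Avg time columns follow directly from the raw counts. A minimal sketch in C, assuming each row keeps a sample count and an accumulated handling time in microseconds; the struct and function names are hypothetical and are not taken from the patch:

```c
#include <assert.h>

/* Hypothetical per-event accumulator; the tool's own structure may differ. */
struct event_stats {
	unsigned long samples;	/* number of begin/end pairs seen */
	double time_us;		/* total handling time in microseconds */
};

/* Share of all samples contributed by this event, in percent. */
static double samples_percent(const struct event_stats *e,
			      unsigned long total_samples)
{
	assert(total_samples > 0);
	return 100.0 * (double)e->samples / (double)total_samples;
}

/* Share of the total handling time spent in this event, in percent. */
static double time_percent(const struct event_stats *e, double total_time_us)
{
	assert(total_time_us > 0.0);
	return 100.0 * e->time_us / total_time_us;
}

/* Average handling time per sample; 0 when nothing was recorded. */
static double avg_time_us(const struct event_stats *e)
{
	return e->samples ? e->time_us / (double)e->samples : 0.0;
}
```

For the HLT row, 15424 of 975981 samples is about 1.58%, which matches the report.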

The default event to be analysed is vmexit; use --event to select another one.
For example, to analyse mmio events:
# ./perf kvm-events report --event mmio
  Warning: Error: expected type 5 but read 4
  Warning: Error: expected type 5 but read 0
  Warning: unknown op '}'


Analyze events for all VCPUs:

         MMIO Access    Samples  Samples%     Time%    Avg time

        0xfee00380:W     196589    64.95%    70.01%      3.83us
        0xfee00310:W      35356    11.68%     6.48%      1.97us
        0xfee00300:W      35356    11.68%    16.37%      4.97us
        0xfee00300:R      35356    11.68%     7.14%      2.17us

Total Samples:302657, Total events handled time:1074746.01us.

Use --vcpu to restrict the report to a single vcpu:
[root@localhost perf]# ./perf kvm-events report --event mmio --vcpu 1
  Warning: Error: expected type 5 but read 4
  Warning: Error: expected type 5 but read 0
  Warning: unknown op '}'


Analyze events for VCPU 1:

         MMIO Access    Samples  Samples%     Time%    Avg time

        0xfee00380:W      58041    71.20%    74.90%      3.70us
        0xfee00310:W       7826     9.60%     5.28%      1.93us
        0xfee00300:W       7826     9.60%    13.82%      5.06us
        0xfee00300:R       7826     9.60%     6.01%      2.20us

Total Samples:81519, Total events handled time:286577.81us.

'--key' selects the sort order. Possible values are sample (the default, sort
by number of samples) and time (sort by time%):
# ./perf kvm-events report --key time
  Warning: Error: expected type 5 but read 4
  Warning: Error: expected type 5 but read 0
  Warning: unknown op '}'


Analyze events for all VCPUs:

             VM-EXIT    Samples  Samples%     Time%    Avg time

                 HLT      15424     1.58%    82.12%   6735.06us
  EXTERNAL_INTERRUPT     219226    22.46%     8.01%     46.20us
         APIC_ACCESS     438107    44.89%     6.20%     17.91us
      IO_INSTRUCTION     122651    12.57%     1.88%     19.44us
       EPT_VIOLATION      83110     8.52%     1.36%     20.75us
       EXCEPTION_NMI      23601     2.42%     0.17%      8.87us
   PENDING_INTERRUPT      37055     3.80%     0.16%      5.38us
               CPUID      32718     3.35%     0.08%      3.15us
           CR_ACCESS       4089     0.42%     0.02%      6.08us

Total Samples:975981, Total events handled time:126502464.88us.

I hope you will like it; any comments are welcome! :)



[PATCH 1/3] KVM: trace mmio read event properly

2012-01-16 Thread Xiao Guangrong
The current code emits KVM_TRACE_MMIO_READ only once an mmio read has
completed. Instead, trace the time at which the read event occurs; together
with the later patch, this lets us know how long mmio read emulation takes.

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/x86.c |    6 ++----
 1 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c95ca2d..0d41cfc 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3432,7 +3432,6 @@ static int vcpu_mmio_read(struct kvm_vcpu *vcpu, gpa_t addr, int len, void *v)
 		      !kvm_iodevice_read(&vcpu->arch.apic->dev, addr, n, v))
 		    && kvm_io_bus_read(vcpu->kvm, KVM_MMIO_BUS, addr, n, v))
 			break;
-		trace_kvm_mmio(KVM_TRACE_MMIO_READ, n, addr, *(u64 *)v);
 		handled += n;
 		addr += n;
 		len -= n;
@@ -3658,8 +3657,6 @@ static int read_prepare(struct kvm_vcpu *vcpu, void *val, int bytes)
 {
 	if (vcpu->mmio_read_completed) {
 		memcpy(val, vcpu->mmio_data, bytes);
-		trace_kvm_mmio(KVM_TRACE_MMIO_READ, bytes,
-			       vcpu->mmio_phys_addr, *(u64 *)val);
 		vcpu->mmio_read_completed = 0;
 		return 1;
 	}
@@ -3681,7 +3678,6 @@ static int write_emulate(struct kvm_vcpu *vcpu, gpa_t gpa,

 static int write_mmio(struct kvm_vcpu *vcpu, gpa_t gpa, int bytes, void *val)
 {
-	trace_kvm_mmio(KVM_TRACE_MMIO_WRITE, bytes, gpa, *(u64 *)val);
 	return vcpu_mmio_write(vcpu, gpa, bytes, val);
 }

@@ -3744,6 +3740,8 @@ mmio:
 	/*
 	 * Is this MMIO handled locally?
 	 */
+	trace_kvm_mmio(write ? KVM_TRACE_MMIO_WRITE : KVM_TRACE_MMIO_READ,
+			bytes, gpa, *(u64 *)val);
 	handled = ops->read_write_mmio(vcpu, gpa, bytes, val);
 	if (handled == bytes)
 		return X86EMUL_CONTINUE;
-- 
1.7.7.5



[PATCH 2/3] KVM: improve trace events of vmexit/mmio/ioport

2012-01-16 Thread Xiao Guangrong
- trace vcpu_id for these events
- add kvm_mmio_done to trace the time when mmio/ioport emulation is completed

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/trace.h   |   33 ++---
 arch/x86/kvm/x86.c |   19 +--
 include/trace/events/kvm.h |   32 +---
 3 files changed, 64 insertions(+), 20 deletions(-)

diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index 911d264..e556458 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -91,11 +91,12 @@ TRACE_EVENT(kvm_hv_hypercall,
  * Tracepoint for PIO.
  */
 TRACE_EVENT(kvm_pio,
-	TP_PROTO(unsigned int rw, unsigned int port, unsigned int size,
-		 unsigned int count),
-	TP_ARGS(rw, port, size, count),
+	TP_PROTO(unsigned int vcpu_id, unsigned int rw, unsigned int port,
+		 unsigned int size, unsigned int count),
+	TP_ARGS(vcpu_id, rw, port, size, count),

 	TP_STRUCT__entry(
+		__field(	unsigned int,	vcpu_id		)
 		__field(	unsigned int,	rw		)
 		__field(	unsigned int,	port		)
 		__field(	unsigned int,	size		)
@@ -103,17 +104,33 @@ TRACE_EVENT(kvm_pio,
 	),

 	TP_fast_assign(
+		__entry->vcpu_id	= vcpu_id;
 		__entry->rw		= rw;
 		__entry->port		= port;
 		__entry->size		= size;
 		__entry->count		= count;
 	),

-	TP_printk("pio_%s at 0x%x size %d count %d",
-		  __entry->rw ? "write" : "read",
+	TP_printk("vcpu %u pio_%s at 0x%x size %d count %d",
+		  __entry->vcpu_id, __entry->rw ? "write" : "read",
 		  __entry->port, __entry->size, __entry->count)
 );

+TRACE_EVENT(kvm_pio_done,
+	TP_PROTO(unsigned int vcpu_id),
+	TP_ARGS(vcpu_id),
+
+	TP_STRUCT__entry(
+		__field(	unsigned int,	vcpu_id		)
+	),
+
+	TP_fast_assign(
+		__entry->vcpu_id	= vcpu_id;
+	),
+
+	TP_printk("vcpu %u", __entry->vcpu_id)
+);
+
 /*
  * Tracepoint for cpuid.
  */
@@ -280,6 +297,7 @@ TRACE_EVENT(kvm_exit,
 	TP_ARGS(exit_reason, vcpu, isa),

 	TP_STRUCT__entry(
+		__field(	unsigned int,	vcpu_id		)
 		__field(	unsigned int,	exit_reason	)
 		__field(	unsigned long,	guest_rip	)
 		__field(	u32,		isa		)
@@ -288,6 +306,7 @@ TRACE_EVENT(kvm_exit,
 	),

 	TP_fast_assign(
+		__entry->vcpu_id	= vcpu->vcpu_id;
 		__entry->exit_reason	= exit_reason;
 		__entry->guest_rip	= kvm_rip_read(vcpu);
 		__entry->isa		= isa;
@@ -295,8 +314,8 @@ TRACE_EVENT(kvm_exit,
 					   &__entry->info2);
 	),

-	TP_printk("reason %s rip 0x%lx info %llx %llx",
-		 (__entry->isa == KVM_ISA_VMX) ?
+	TP_printk("vcpu %u reason %s rip 0x%lx info %llx %llx",
+		 __entry->vcpu_id, (__entry->isa == KVM_ISA_VMX) ?
 		 __print_symbolic(__entry->exit_reason, VMX_EXIT_REASONS) :
 		 __print_symbolic(__entry->exit_reason, SVM_EXIT_REASONS),
 		 __entry->guest_rip, __entry->info1, __entry->info2)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0d41cfc..cf54478 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3684,7 +3684,8 @@ static int write_mmio(struct kvm_vcpu *vcpu, gpa_t gpa, int bytes, void *val)
 static int read_exit_mmio(struct kvm_vcpu *vcpu, gpa_t gpa,
 			  void *val, int bytes)
 {
-	trace_kvm_mmio(KVM_TRACE_MMIO_READ_UNSATISFIED, bytes, gpa, 0);
+	trace_kvm_mmio(vcpu->vcpu_id, KVM_TRACE_MMIO_READ_UNSATISFIED,
+		       bytes, gpa, 0);
 	return X86EMUL_IO_NEEDED;
 }

@@ -3740,11 +3741,14 @@ mmio:
 	/*
 	 * Is this MMIO handled locally?
 	 */
-	trace_kvm_mmio(write ? KVM_TRACE_MMIO_WRITE : KVM_TRACE_MMIO_READ,
-			bytes, gpa, *(u64 *)val);
+	trace_kvm_mmio(vcpu->vcpu_id,
+		       write ? KVM_TRACE_MMIO_WRITE : KVM_TRACE_MMIO_READ,
+		       bytes, gpa, *(u64 *)val);
 	handled = ops->read_write_mmio(vcpu, gpa, bytes, val);
-	if (handled == bytes)
+	if (handled == bytes) {
+		trace_kvm_mmio_done(vcpu->vcpu_id);
 		return X86EMUL_CONTINUE;
+	}

 	gpa += handled;
 	bytes -= handled;
@@ -3902,7 +3906,7 @@ static int emulator_pio_in_out(struct kvm_vcpu *vcpu, int size,
 			       unsigned short port, void *val,
 			       unsigned int count, bool in)
 {
-	trace_kvm_pio(!in, port, size, count);
+   

[PATCH 3/3] KVM: perf: kvm events analysis tool

2012-01-16 Thread Xiao Guangrong
Add 'perf kvm-events' support to analyze kvm vmexit/mmio/ioport smartly

Usage:
perf kvm-events record
perf kvm-events report

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 tools/perf/Documentation/perf-kvm-events.txt |   54 ++
 tools/perf/Makefile  |1 +
 tools/perf/builtin-kvm-events.c  |  860 ++
 tools/perf/builtin.h |1 +
 tools/perf/perf.c|1 +
 5 files changed, 917 insertions(+), 0 deletions(-)
 create mode 100644 tools/perf/Documentation/perf-kvm-events.txt
 create mode 100644 tools/perf/builtin-kvm-events.c

diff --git a/tools/perf/Documentation/perf-kvm-events.txt b/tools/perf/Documentation/perf-kvm-events.txt
new file mode 100644
index 0000000..73bcb82
--- /dev/null
+++ b/tools/perf/Documentation/perf-kvm-events.txt
@@ -0,0 +1,54 @@
+perf-kvm-events(1)
+==================
+
+NAME
+----
+perf-kvm-events - Analyze kvm events
+
+SYNOPSIS
+--------
+[verse]
+'perf kvm-events' {record|report}
+
+DESCRIPTION
+-----------
+You can analyze some crucial events and statistics with this
+'perf kvm-events' command.
+
+  'perf kvm-events record <command>' records kvm events
+  between start and end <command>. And this command
+  produces the file "perf.data" which contains tracing
+  results of kvm events.
+
+  'perf kvm-events report' reports statistical data.
+
+COMMON OPTIONS
+--------------
+
+-i::
+--input=<file>::
+Input file name. (default: perf.data unless stdin is a fifo)
+
+-v::
+--verbose::
+Be more verbose (show symbol address, etc).
+
+-D::
+--dump-raw-trace::
+Dump raw trace in ASCII.
+
+REPORT OPTIONS
+--------------
+--vcpu=<value>::
+   analyze events which occures on this vcpu
+
+--events=<value>::
+   events to be analyzed. Possible values: vmexit, mmio, ioport.
+-k::
+--key=<value>::
+Sorting key. Possible values: sample(default, sort by samples number),
+time(sort by time%).
+
+SEE ALSO
+--------
+linkperf:perf[1]
diff --git a/tools/perf/Makefile b/tools/perf/Makefile
index ac86d67..ee43451 100644
--- a/tools/perf/Makefile
+++ b/tools/perf/Makefile
@@ -382,6 +382,7 @@ BUILTIN_OBJS += $(OUTPUT)builtin-probe.o
 BUILTIN_OBJS += $(OUTPUT)builtin-kmem.o
 BUILTIN_OBJS += $(OUTPUT)builtin-lock.o
 BUILTIN_OBJS += $(OUTPUT)builtin-kvm.o
+BUILTIN_OBJS += $(OUTPUT)builtin-kvm-events.o
 BUILTIN_OBJS += $(OUTPUT)builtin-test.o
 BUILTIN_OBJS += $(OUTPUT)builtin-inject.o

diff --git a/tools/perf/builtin-kvm-events.c b/tools/perf/builtin-kvm-events.c
new file mode 100644
index 0000000..55dc680
--- /dev/null
+++ b/tools/perf/builtin-kvm-events.c
@@ -0,0 +1,860 @@
+#include "builtin.h"
+#include "perf.h"
+#include "util/util.h"
+#include "util/cache.h"
+#include "util/symbol.h"
+#include "util/thread.h"
+#include "util/header.h"
+#include "util/parse-options.h"
+#include "util/trace-event.h"
+#include "util/debug.h"
+#include "util/session.h"
+#include "util/tool.h"
+
+#include <linux/hash.h>
+
+/*
+ * Todo: improve the print format of kvm_exit to let it is easier
+ * parsed by perf, then we can get the exit reason from print
+ * format directly.
+ */
+#define EXIT_REASON_EXCEPTION_NMI       0
+#define EXIT_REASON_EXTERNAL_INTERRUPT  1
+#define EXIT_REASON_TRIPLE_FAULT        2
+
+#define EXIT_REASON_PENDING_INTERRUPT   7
+#define EXIT_REASON_NMI_WINDOW          8
+#define EXIT_REASON_TASK_SWITCH         9
+#define EXIT_REASON_CPUID               10
+#define EXIT_REASON_HLT                 12
+#define EXIT_REASON_INVD                13
+#define EXIT_REASON_INVLPG              14
+#define EXIT_REASON_RDPMC               15
+#define EXIT_REASON_RDTSC               16
+#define EXIT_REASON_VMCALL              18
+#define EXIT_REASON_VMCLEAR             19
+#define EXIT_REASON_VMLAUNCH            20
+#define EXIT_REASON_VMPTRLD             21
+#define EXIT_REASON_VMPTRST             22
+#define EXIT_REASON_VMREAD              23
+#define EXIT_REASON_VMRESUME            24
+#define EXIT_REASON_VMWRITE             25
+#define EXIT_REASON_VMOFF               26
+#define EXIT_REASON_VMON                27
+#define EXIT_REASON_CR_ACCESS           28
+#define EXIT_REASON_DR_ACCESS           29
+#define EXIT_REASON_IO_INSTRUCTION      30
+#define EXIT_REASON_MSR_READ            31
+#define EXIT_REASON_MSR_WRITE           32
+#define EXIT_REASON_INVALID_STATE       33
+#define EXIT_REASON_MWAIT_INSTRUCTION   36
+#define EXIT_REASON_MONITOR_INSTRUCTION 39
+#define EXIT_REASON_PAUSE_INSTRUCTION   40
+#define EXIT_REASON_MCE_DURING_VMENTRY  41
+#define EXIT_REASON_TPR_BELOW_THRESHOLD 43
+#define EXIT_REASON_APIC_ACCESS         44
+#define EXIT_REASON_EPT_VIOLATION       48
+#define EXIT_REASON_EPT_MISCONFIG       49
+#define EXIT_REASON_WBINVD              54
+#define EXIT_REASON_XSETBV              55
+
+#define SVM_EXIT_READ_CR0  0x000
+#define SVM_EXIT_READ_CR3  0x003
+#define SVM_EXIT_READ_CR4  0x004
+#define SVM_EXIT_READ_CR8  0x008
+#define 

Re: [PATCH 2/3] KVM: improve trace events of vmexit/mmio/ioport

2012-01-16 Thread Avi Kivity
On 01/16/2012 11:32 AM, Xiao Guangrong wrote:
 - trace vcpu_id for these events

We can infer the vcpu id from the kvm_entry tracepoints, no?

 - add kvm_mmio_done to trace the time when mmio/ioport emulation is completed

ditto?


Relying on the existing tracepoints will make the tool work on older
kernels.


-- 
error compiling committee.c: too many arguments to function



Re: [PATCH 3/3] KVM: perf: kvm events analysis tool

2012-01-16 Thread Avi Kivity
On 01/16/2012 11:32 AM, Xiao Guangrong wrote:
 Add 'perf kvm-events' support to analyze kvm vmexit/mmio/ioport smartly

 Usage:
   perf kvm-events record

Why not 'perf record -e kvm'?

   perf kvm-events report



 +static const char *get_exit_reason(long isa, u64 exit_code)
 +{
 + int table_size = ARRAY_SIZE(svm_exit_reasons);
 + struct exit_reasons_table *table = svm_exit_reasons;
 +
 +
 + if (isa == 1) {
 + table = vmx_exit_reasons;
 + table_size = ARRAY_SIZE(vmx_exit_reasons);
 + }
 +
 + while (table_size--) {
 + if (table->exit_code == exit_code)
 + return table->reason;

... table[exit_code] ...

 + table++;
 + }
 +
 + die("unkonw kvm exit code:%ld on %s\n", exit_code, isa == 1 ?
 + "VMX" : "SVM");

unknown
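
The `table[exit_code]` suggestion above could look roughly like the following. This is a sketch, not the patch's code: it assumes a table of names indexed directly by exit code, and the table is deliberately partial. The names `vmx_exit_names` and `exit_name` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* A few VMX exit codes from the patch; unlisted slots stay NULL. */
static const char *const vmx_exit_names[] = {
	[0]  = "EXCEPTION_NMI",
	[1]  = "EXTERNAL_INTERRUPT",
	[10] = "CPUID",
	[12] = "HLT",
	[30] = "IO_INSTRUCTION",
	[44] = "APIC_ACCESS",
	[48] = "EPT_VIOLATION",
};

/* O(1) lookup; returns NULL for codes past the table or for gaps in it. */
static const char *exit_name(const char *const *table, size_t size,
			     unsigned long code)
{
	assert(table != NULL);
	if (code >= size)
		return NULL;
	return table[code];
}
```

The trade-off is table size: SVM exit codes are sparse (SVM_EXIT_NPF is 0x400), so a directly indexed SVM table would be mostly empty slots.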

 +
 +struct kvm_events_ops {
 + bool (*is_begain_event)(struct event *event, void *data);

begin

 + bool (*is_end_event)(struct event *event);
 + struct event_key (*get_key)(struct event *event, void *data);
 + void (*decode_key)(struct event_key *key, char decode[20]);
 + const char *name;
 +};
 +
 +
 +static struct event_key exit_event_get_key(struct event *event, void *data)
 +{
 + struct event_key key;
 +
 + key.key = raw_field_value(event, "exit_reason", data);
 + key.info = raw_field_value(event, "isa", data);

isa is not available on all kernel versions; need to fall back to
/proc/cpuid detection.

 + return key;
 +}
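
One way to sketch the fallback mentioned above, assuming it means inspecting the CPU feature flags as they appear on the "flags" line of /proc/cpuinfo ("vmx" for Intel, "svm" for AMD). The helper names are hypothetical, and reading the file itself is left out so the logic can be checked on a plain string:

```c
#include <assert.h>
#include <string.h>

#define ISA_UNKNOWN 0
#define ISA_VMX     1	/* the quoted patch treats isa == 1 as VMX */
#define ISA_SVM     2

/* Return nonzero if `flag` occurs as a whole whitespace-separated word. */
static int has_flag(const char *flags, const char *flag)
{
	size_t len = strlen(flag);
	const char *p = flags;

	assert(len > 0);
	while ((p = strstr(p, flag)) != NULL) {
		int starts = (p == flags) || p[-1] == ' ' || p[-1] == '\t';
		int ends = p[len] == '\0' || p[len] == ' ' || p[len] == '\n';

		if (starts && ends)
			return 1;
		p++;	/* keep searching past a partial match such as "svm_lock" */
	}
	return 0;
}

/* Map a cpuinfo-style flags string to an ISA constant. */
static int isa_from_cpu_flags(const char *flags)
{
	if (has_flag(flags, "vmx"))
		return ISA_VMX;
	if (has_flag(flags, "svm"))
		return ISA_SVM;
	return ISA_UNKNOWN;
}
```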


-- 
error compiling committee.c: too many arguments to function



Re: [PATCH 3/3] KVM: perf: kvm events analysis tool

2012-01-16 Thread Stefan Hajnoczi
On Mon, Jan 16, 2012 at 9:32 AM, Xiao Guangrong
xiaoguangr...@linux.vnet.ibm.com wrote:
 +DESCRIPTION
 +---
 +You can analyze some crucial events and statistics with this
 +'perf kvm-events' command.

This line is very general and does not explain which events/statistics
can be collected or how you can use that information.  I suggest
making this description more specific.  Explain that this subcommand
observes kvm.ko tracepoints and annotates/decodes them with
additional information (this is why I would use this command and not
raw perf record -e kvm:\*).

 +       { SVM_EXIT_MONITOR,                     "monitor" }, \
 +       { SVM_EXIT_MWAIT,                       "mwait" }, \
 +       { SVM_EXIT_XSETBV,                      "xsetbv" }, \
 +       { SVM_EXIT_NPF,                         "npf" }

All this copy-paste could be avoided by sharing this stuff with the
arch/x86/kvm/ code.

 +static void exit_event_decode_key(struct event_key *key, char decode[20])
 +{
 +       const char *exit_reason = get_exit_reason(key->info, key->key);
 +
 +       memset(decode, 0, 20);
 +       strncpy(decode, exit_reason, 20);

This is a bad pattern to follow when using strncpy(3) because if there
was a strlen(exit_reason) == 20 string then decode[] would not be
NUL-terminated.  Right now it's safe but it's better to just use
strlcpy() and drop the memset(3).
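
A minimal sketch of a bounded copy that always NUL-terminates. It uses snprintf(3) rather than strlcpy() only because strlcpy() is not available in every userspace C library; inside the perf tree the suggested strlcpy() would serve the same purpose. The helper name `copy_bounded` is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

#define DECODE_LEN 20	/* matches the char decode[20] in the quoted patch */

/* Copy src into dst, truncating if needed; dst is always NUL-terminated. */
static void copy_bounded(char *dst, size_t dst_size, const char *src)
{
	assert(dst_size > 0);
	snprintf(dst, dst_size, "%s", src);
}
```

With a source string of 20 or more characters, the result holds the first 19 characters followed by the terminator, so no separate memset(3) is needed.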

 +}
 +
 +static struct kvm_events_ops exit_events = {
 +       .is_begain_event = exit_event_begain,
 +       .is_end_event = exit_event_end,
 +       .get_key = exit_event_get_key,
 +       .decode_key = exit_event_decode_key,
 +       .name = "VM-EXIT"
 +};
 +
 +#define KVM_TRACE_MMIO_READ_UNSATISFIED 0
 +#define KVM_TRACE_MMIO_READ 1
 +#define KVM_TRACE_MMIO_WRITE 2
 +static bool mmio_event_begain(struct event *event, void *data)
 +{
 +       if (!strcmp(event->name, "kvm_mmio")) {
 +               long type = raw_field_value(event, "type", data);
 +
 +               if (type != KVM_TRACE_MMIO_READ_UNSATISFIED)
 +                       return true;
 +       };
 +
 +       return false;
 +}
 +
 +static bool mmio_event_end(struct event *event)
 +{
 +       return !strcmp(event->name, "kvm_mmio_done");
 +}
 +
 +static struct event_key mmio_event_get_key(struct event *event, void *data)
 +{
 +       struct event_key key;
 +
 +       key.key = raw_field_value(event, "gpa", data);
 +       key.info = raw_field_value(event, "type", data);
 +
 +       return key;
 +}
 +
 +static void mmio_event_decode_key(struct event_key *key, char decode[20])
 +{
 +       memset(decode, 0, 20);
 +       sprintf(decode, "%#lx:%s", key->key,
 +               key->info == KVM_TRACE_MMIO_READ ? "R" : "W");

Please drop the memset and use snprintf(3) instead of sprintf(3).  It
places the NUL-terminator and ensures you don't exceed the buffer
size.

Same pattern below.
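
The snprintf(3) variant could be sketched as follows. The function takes the buffer size explicitly and reports truncation; that signature is an assumption made for illustration and differs from the patch's `decode_key` callback.

```c
#include <assert.h>
#include <stdio.h>

#define KVM_TRACE_MMIO_READ 1	/* value taken from the quoted patch */

/* Format "<gpa>:<R|W>" into decode; returns nonzero if it was truncated. */
static int mmio_decode_key(char *decode, size_t size,
			   unsigned long gpa, long type)
{
	int n;

	assert(size > 0);
	n = snprintf(decode, size, "%#lx:%s", gpa,
		     type == KVM_TRACE_MMIO_READ ? "R" : "W");
	return n < 0 || (size_t)n >= size;
}
```

snprintf(3) writes the terminator itself and never writes past `size` bytes, so the memset(3) becomes unnecessary.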


Re: [RFC][PATCH] KVM: perf: a smart tool to analyse kvm events

2012-01-16 Thread Avi Kivity
On 01/16/2012 11:30 AM, Xiao Guangrong wrote:
 This tool is very like xenoprof(if i remember correctly), and traces kvm events
 smartly. currently, it supports vmexit/mmio/ioport events.

 Usage:
 - to trace kvm events:
 # ./perf kvm-events record

 - show the result
 # ./perf kvm-events report

 Some output are as follow:
 # ./perf kvm-events report
   Warning: Error: expected type 5 but read 4
   Warning: Error: expected type 5 but read 0
   Warning: unknown op '}'


 Analyze events for all VCPUs:

              VM-EXIT    Samples  Samples%     Time%    Avg time

          APIC_ACCESS     438107    44.89%     6.20%     17.91us
   EXTERNAL_INTERRUPT     219226    22.46%     8.01%     46.20us
       IO_INSTRUCTION     122651    12.57%     1.88%     19.44us
        EPT_VIOLATION      83110     8.52%     1.36%     20.75us
    PENDING_INTERRUPT      37055     3.80%     0.16%      5.38us
                CPUID      32718     3.35%     0.08%      3.15us
        EXCEPTION_NMI      23601     2.42%     0.17%      8.87us
                  HLT      15424     1.58%    82.12%   6735.06us
            CR_ACCESS       4089     0.42%     0.02%      6.08us

 Total Samples:975981, Total events handled time:126502464.88us.

Nice!  If we can have a live version as well, this can replace kvm_stat.

The average numbers are really high.  Like a factor of 3x-4x off.  Would
be good to print the standard deviation and see why.  Maybe it's due to
the tracing overhead.

 The default event to be analysed is vmexit, we can use --event to specify it,
 for example, if we want to trace mmio event:
 # ./perf kvm-events report --event mmio
   Warning: Error: expected type 5 but read 4
   Warning: Error: expected type 5 but read 0
   Warning: unknown op '}'


 Analyze events for all VCPUs:

          MMIO Access    Samples  Samples%     Time%    Avg time

         0xfee00380:W     196589    64.95%    70.01%      3.83us
         0xfee00310:W      35356    11.68%     6.48%      1.97us
         0xfee00300:W      35356    11.68%    16.37%      4.97us
         0xfee00300:R      35356    11.68%     7.14%      2.17us

These are more reasonable (though still high - 5us for an ICR write?)


 Total Samples:302657, Total events handled time:1074746.01us.

 We can use --vcpu to specify which vcpu is traced:
 [root@localhost perf]# ./perf kvm-events report --event mmio --vcpu 1
   Warning: Error: expected type 5 but read 4
   Warning: Error: expected type 5 but read 0
   Warning: unknown op '}'


 Analyze events for VCPU 1:

          MMIO Access    Samples  Samples%     Time%    Avg time

         0xfee00380:W      58041    71.20%    74.90%      3.70us
         0xfee00310:W       7826     9.60%     5.28%      1.93us
         0xfee00300:W       7826     9.60%    13.82%      5.06us
         0xfee00300:R       7826     9.60%     6.01%      2.20us

 Total Samples:81519, Total events handled time:286577.81us.

 And, '--key' is used to sort the result, the possible value sample (default,
 the result is sorted by samples number), time(the result is sorted by time%):
 # ./perf kvm-events report --key time
   Warning: Error: expected type 5 but read 4
   Warning: Error: expected type 5 but read 0
   Warning: unknown op '}'


 Analyze events for all VCPUs:

              VM-EXIT    Samples  Samples%     Time%    Avg time

                  HLT      15424     1.58%    82.12%   6735.06us
   EXTERNAL_INTERRUPT     219226    22.46%     8.01%     46.20us
          APIC_ACCESS     438107    44.89%     6.20%     17.91us
       IO_INSTRUCTION     122651    12.57%     1.88%     19.44us
        EPT_VIOLATION      83110     8.52%     1.36%     20.75us
        EXCEPTION_NMI      23601     2.42%     0.17%      8.87us
    PENDING_INTERRUPT      37055     3.80%     0.16%      5.38us
                CPUID      32718     3.35%     0.08%      3.15us
            CR_ACCESS       4089     0.42%     0.02%      6.08us

 Total Samples:975981, Total events handled time:126502464.88us.

 I hope guys will like it and any comments are welcome! :)

I think it's great!  A live version would be a nice addition too.

Please copy the perf userspace maintainers to get more detailed review
in the next version.

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH 1/3] KVM: trace mmio read event properly

2012-01-16 Thread Avi Kivity
On 01/16/2012 11:31 AM, Xiao Guangrong wrote:
 In current code, we use KVM_TRACE_MMIO_READ to trace mmio read event which
 only can be completed immediately, instead of it, we trace the time when
 read event occur, then cooperate with then later patch, we can know the time
 of mmio read emulation

 @@ -3744,6 +3740,8 @@ mmio:
   /*
* Is this MMIO handled locally?
*/
 + trace_kvm_mmio(write ? KVM_TRACE_MMIO_WRITE : KVM_TRACE_MMIO_READ,

It's better to push the conditional to the trace event itself, so it's
only evaluated if tracing is enabled.

 + bytes, gpa, *(u64 *)val);

We get the wrong value for reads here, no?

Can't we leave the code as is, and infer the start of the event from the
last kvm_exit trace?
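
The point about evaluating the conditional only when tracing is enabled can be illustrated in userspace. This is a stand-in, not the kernel tracepoint machinery: `tracing_enabled`, `record`, `pick_type` and `TRACE_MMIO` are hypothetical names, and the counter exists only to show when the argument expression runs.

```c
#include <assert.h>

static int tracing_enabled;		/* stand-in for the tracepoint state */
static unsigned long last_traced;	/* last value handed to record() */
static int evaluations;			/* how often the argument was computed */

/* Models the "write ? WRITE : READ" expression from the quoted hunk. */
static unsigned long pick_type(int write)
{
	evaluations++;
	return write ? 2UL : 1UL;	/* KVM_TRACE_MMIO_WRITE : KVM_TRACE_MMIO_READ */
}

static void record(unsigned long type)
{
	assert(tracing_enabled);
	last_traced = type;
}

/* The argument is expanded inside the if, so it only runs when enabled. */
#define TRACE_MMIO(type_expr)			\
	do {					\
		if (tracing_enabled)		\
			record(type_expr);	\
	} while (0)
```

In the kernel, the equivalent would be to pass `write` itself to the tracepoint and select the type inside the event definition.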

   handled = ops->read_write_mmio(vcpu, gpa, bytes, val);
   if (handled == bytes)
   return X86EMUL_CONTINUE;


-- 
error compiling committee.c: too many arguments to function



Re: [PATCH] KVM: SVM: comment nested paging and virtualization module parameters

2012-01-16 Thread Marcelo Tosatti
On Fri, Jan 13, 2012 at 02:51:56PM +0100, Davidlohr Bueso wrote:
 From: Davidlohr Bueso d...@gnu.org
 
 Also use true instead of 1 for enabling by default.
 
 Signed-off-by: Davidlohr Bueso d...@gnu.org
 ---
  arch/x86/kvm/svm.c |    6 ++++--
  1 files changed, 4 insertions(+), 2 deletions(-)
 
 diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
 index 5fa553b..3c9b0dc 100644
 --- a/arch/x86/kvm/svm.c
 +++ b/arch/x86/kvm/svm.c
 @@ -176,11 +176,13 @@ static bool npt_enabled = true;
  #else
  static bool npt_enabled;
  #endif
 -static int npt = 1;
  
 +/* disable nested paging (virtualized MMU) for all guests */
 +static int npt = true;
  module_param(npt, int, S_IRUGO);

Should be "allow nested paging"...

 -static int nested = 1;
 +/* allow nested virtualization in KVM/SVM */
 +static int nested = true;
  module_param(nested, int, S_IRUGO);
  
  static void svm_flush_tlb(struct kvm_vcpu *vcpu);



Re: [PULL 00/52] ppc patch queue 2012-01-13

2012-01-16 Thread Marcelo Tosatti
On Fri, Jan 13, 2012 at 03:31:03PM +0100, Alexander Graf wrote:
 Hi Avi,
 
 This is my current patch queue for ppc. Please pull.
 
 Alex
 
 
 The following changes since commit 188fc33198ddb1469562d40de33bcc29e7e2ed5f:
   Christian Borntraeger (1):
 kvm-s390: provide access guest registers via kvm_run
 
 are available in the git repository at:
 
   git://github.com/agraf/linux-2.6.git for-upstream

Pulled, thanks.



Re: [PATCH 0/2] KVM guest-kernel panics double fault

2012-01-16 Thread Marcelo Tosatti
On Sun, Jan 15, 2012 at 08:44:50PM +0100, Stephan Bärwolf wrote:
 Thank you for applying this, Marcelo.
 
 I fear we (or me after I agreed) did some mistake by erasing the additional
 cpuid 0x80000001 checks.
 In contradiction to only AMD it MUST also apply on Intel-CPUs.
 
 Documentation
 http://www.intel.com/content/dam/doc/manual/64-ia-32-architectures-software-developer-manual-325462.pdf
 Vol. 2A 3-207 (PDF-page 811) first block of table.
 (in addition AMD's documentation
 http://support.amd.com/us/Processor_TechDocs/APM_V3_24594.pdf
 page 376 (PDF-page 408) table Exceptions on the bottom)
 
 Not all CPUs might have a syscall op at all (even in longmode) - they 
 informing about that
 via cpuid (But MSR_EFER may be still set).
 (You can force it externally in qemu-kvm-emulation via -cpu host,-syscall 
 ...)
 So an (guest) operating-system might not install *STAR-registers and crash 
 again on such vcpus, right?

No, because if the operating system does not install the STAR MSRs, it
will not set the SCE bit in MSR_EFER (and your patch handles that
situation).



Re: [PATCH] KVM: Don't mistreat edge-triggered INIT IPI as INIT de-assert. (LAPIC)

2012-01-16 Thread Marcelo Tosatti
On Fri, Jan 13, 2012 at 12:46:19PM +0100, Julian Stecklina wrote:
 Am Freitag, den 13.01.2012, 08:52 -0200 schrieb Marcelo Tosatti:
  On Thu, Jan 12, 2012 at 06:07:51PM +0100, Julian Stecklina wrote:
   Am Freitag, den 23.12.2011, 08:40 -0200 schrieb Marcelo Tosatti:
On Mon, Dec 19, 2011 at 02:14:27AM +0100, Julian Stecklina wrote:
 If the guest programs an IPI with level=0 (de-assert) and trig_mode=0 
 (edge),
 it is erroneously treated as INIT de-assert and ignored, but to quote 
 the
 spec: For this delivery mode [INIT de-assert], the level flag must 
 be set to
 0 and trigger mode flag to 1.

Yes, the implementation ignores INIT de-assert. Quoting the spec:

(INIT Level De-assert) (Not supported in the Pentium 4 and Intel Xeon
processors.)

Your patch below is not improving the implementation to be closer to the
spec: it'll trigger the INIT state initialization with trig_mode == 0
(which is not in accordance with your spec quote above).
   
   After reading the spec again and consulting with the guy who wrote the
   code triggering this, it seems the whole if (level) in the code path
   below is superfluous. 
  
  No. Look at whats inside if (level): the mp_state assignment is the
  internal implementation of delivers an INIT request to the target
  processor.
  
  According to the spec, the INIT level de-assert 
  
  Sends a synchronization message to all the local APICs in the system
  to set their arbitration IDs (stored in their Arb ID registers) to the
  values of their APIC IDs (see Section 10.7, “System and APIC Bus
  Arbitration”).
  
  So if you remove the if (level) check, INIT de-assert will be emulated
  as INIT!
 
 Newer processors don't support INIT level de-assert and will interpret
 this as INIT. Without the if (level) check, KVM would behave in the
 same way, thus not breaking code that actually runs on real processors.
 
 For processors that still supported INIT level de-assert: If you look
 into older specs (243192), you read:
 
 101 (INIT) ... INIT is treated as an edge triggered interrupt even if
 programmed otherwise.
 
 101 (INIT Level De-assert) The trigger mode must also be set to 1 and
 level mode to 0.
 
 This means that if you don't set trigger mode to 1, you will get an INIT
 instead of INIT level de-assert. This is where the current code in KVM
 is wrong. So with my original patch, KVM would behave like the old spec
 mandates (check for trigger mode). With the if (level) check removed,
 it would behave like recent processors. Either way, the current code is
 bogus.
 
 Regards, Julian

Yes, the original patch is fine. Please resend it.
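
The rule argued for in this thread can be written down as a small classifier. This is a sketch of the reasoning, not the patch itself; the enum and function names are hypothetical. It follows the spec text quoted above: an INIT IPI counts as "INIT level de-assert" only when level is 0 and trigger mode is 1, and every other combination is a plain INIT.

```c
#include <assert.h>

enum init_kind { INIT_REQUEST, INIT_DEASSERT };

/* level and trig_mode are the single-bit ICR fields (0 or 1). */
static enum init_kind classify_init_ipi(int level, int trig_mode)
{
	assert((level == 0 || level == 1) &&
	       (trig_mode == 0 || trig_mode == 1));
	if (level == 0 && trig_mode == 1)
		return INIT_DEASSERT;	/* may be ignored by the emulation */
	return INIT_REQUEST;		/* includes level=0, trig_mode=0 (edge) */
}
```

The case that started the thread is level=0 with trig_mode=0: under this rule it is delivered as an INIT instead of being dropped.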



Re: [PATCH 0/2] KVM guest-kernel panics double fault

2012-01-16 Thread Stephan Bärwolf
Okay, thanks, good to hear a second opinion.

Then everything is ok, have a nice day.

regards Stephan

On 01/16/12 10:58, Marcelo Tosatti wrote:
 On Sun, Jan 15, 2012 at 08:44:50PM +0100, Stephan Bärwolf wrote:
 Thank you for applying this, Marcelo.

 I fear we (or me after I agreed) did some mistake by erasing the additional
 cpuid 0x80000001 checks.
 In contradiction to only AMD it MUST also apply on Intel-CPUs.

 Documentation
 http://www.intel.com/content/dam/doc/manual/64-ia-32-architectures-software-developer-manual-325462.pdf
 Vol. 2A 3-207 (PDF-page 811) first block of table.
 (in addition AMD's documentation
 http://support.amd.com/us/Processor_TechDocs/APM_V3_24594.pdf
 page 376 (PDF-page 408) table Exceptions on the bottom)

 Not all CPUs might have a syscall op at all (even in longmode) - they 
 informing about that
 via cpuid (But MSR_EFER may be still set).
 (You can force it externally in qemu-kvm-emulation via -cpu host,-syscall 
 ...)
 So an (guest) operating-system might not install *STAR-registers and crash 
 again on such vcpus, right?
 No because if the operating system does not install the STAR MSRs, it
 will not set SCE bit in MSR_EFER (and your patch handles that 
 situation).




-- 
Dipl.-Inf. Stephan Bärwolf
Ilmenau University of Technology, Integrated Communication Systems Group
Phone: +49 (0)3677 69 4130
Email: stephan.baerw...@tu-ilmenau.de,  
Web: http://www.tu-ilmenau.de/iks
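
Marcelo's point in the quoted reply can be illustrated with a small C sketch. This is not the KVM emulator code; the helper name and the SKETCH_ prefix are hypothetical. The only fact relied on is that EFER.SCE is bit 0 of the EFER MSR: a guest that never installs the STAR MSRs is expected to leave that bit clear, so an emulated SYSCALL can be refused (#UD) on that basis.

```c
#include <stdbool.h>
#include <stdint.h>

/* EFER.SCE (System Call Extensions) is bit 0 of the EFER MSR. */
#define SKETCH_EFER_SCE (1ULL << 0)

/*
 * Hypothetical helper: returns true when an emulated SYSCALL may proceed,
 * false when it should raise #UD because the guest never enabled SCE.
 */
bool sketch_syscall_allowed(uint64_t guest_efer)
{
    return (guest_efer & SKETCH_EFER_SCE) != 0;
}
```

A guest with only LME and LMA set (0x500) would be rejected, while any EFER value with bit 0 set would pass.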



Re: [PATCH 2/3] acpi_piix4: Add stub functions for CPU eject callback

2012-01-16 Thread Vasilis Liaskovitis
On Sun, Jan 15, 2012 at 02:38:52PM +0200, Avi Kivity wrote:
 On 01/13/2012 01:11 PM, Vasilis Liaskovitis wrote:
  Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com
  ---
   hw/acpi_piix4.c |   15 +++
   1 files changed, 15 insertions(+), 0 deletions(-)
 
  diff --git a/hw/acpi_piix4.c b/hw/acpi_piix4.c
  index d5743b6..8bf30dd 100644
  --- a/hw/acpi_piix4.c
  +++ b/hw/acpi_piix4.c
  @@ -37,6 +37,7 @@
   
   #define GPE_BASE 0xafe0
   #define PROC_BASE 0xaf00
  +#define PROC_EJ_BASE 0xaf20
 
 
 We're adding stuff to piix4 which was never there.  At a minimum this
 needs to be documented.  Also needs to be -M pc-1.1 and later only.

Where should this be documented? PCI/ACPI hotplug addresses are documented in
docs/specs/acpi_pci_hotplug.txt but for CPU hotplug documentation (i.e.
for the existing PROC_BASE) I don't see relevant documentation. I will
create a docs/specs/acpi_cpu_hotplug.txt if that sounds reasonable.

For pc-1.1, a new QEMUmachine type will be needed I assume. Should a check be
made against the machine version in the piix4 code? any relevant examples? 

thanks,

- Vasilis



Re: [PATCH 1/3][Seabios] Add bitmap for cpu _EJ0 callback

2012-01-16 Thread Vasilis Liaskovitis
On Fri, Jan 13, 2012 at 07:27:01PM -0500, Kevin O'Connor wrote:
 On Fri, Jan 13, 2012 at 12:11:30PM +0100, Vasilis Liaskovitis wrote:
  
  Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com
 
 The SeaBIOS change is okay with me, but the qemu/kvm change needs to
 be accepted first.
 
 [...]
   Method (CPEJ, 2, NotSerialized) {
   // _EJ0 method - eject callback
  +Store(ShiftLeft(1, Arg0), PRE)
   Sleep(200)
   }
 
 Is the Sleep() still needed?

I believe it's unnecessary. I'll test without it and resend.
thanks,

- Vasilis



Re: [PATCH] vhost-net: add module alias (v2.1)

2012-01-16 Thread Alan Cox
  ACKs, NACKs?  What is happening here?
 
 I would like an Ack from Alan Cox who switched vhost-net
 to a dynamic minor in the first place, in commit
 79907d89c397b8bc2e05b347ec94e928ea919d33.

Sorry dev...@lanana.org isn't yet back from the kernel hack incident.

I don't read netdev so someone needs to summarise the issue and send me
a copy of the patch to look at.

Alan


Re: [PATCH 2/3] acpi_piix4: Add stub functions for CPU eject callback

2012-01-16 Thread Avi Kivity
On 01/16/2012 01:32 PM, Vasilis Liaskovitis wrote:
 On Sun, Jan 15, 2012 at 02:38:52PM +0200, Avi Kivity wrote:
  On 01/13/2012 01:11 PM, Vasilis Liaskovitis wrote:
   Signed-off-by: Vasilis Liaskovitis vasilis.liaskovi...@profitbricks.com
   ---
hw/acpi_piix4.c |   15 +++
1 files changed, 15 insertions(+), 0 deletions(-)
  
   diff --git a/hw/acpi_piix4.c b/hw/acpi_piix4.c
   index d5743b6..8bf30dd 100644
   --- a/hw/acpi_piix4.c
   +++ b/hw/acpi_piix4.c
   @@ -37,6 +37,7 @@

#define GPE_BASE 0xafe0
#define PROC_BASE 0xaf00
   +#define PROC_EJ_BASE 0xaf20
  
  
  We're adding stuff to piix4 which was never there.  At a minimum this
  needs to be documented.  Also needs to be -M pc-1.1 and later only.

 Where should this be documented? PCI/ACPI hotplug addresses are documented in
 docs/specs/acpi_pci_hotplug.txt 

A pleasant surprise

 but for CPU hotplug documentation (i.e.
 for the existing PROC_BASE) I don't see relevant documentation. I will
 create a docs/specs/acpi_cpu_hotplug.txt if that sounds reasonable.

I suggest renaming it to acpi_hotplug.txt, so it covers both cases.

 For pc-1.1, a new QEMUmachine type will be needed I assume. Should a check be
 made against the machine version in the piix4 code? any relevant examples? 


The standard practice is to set a property.  See for example
pc_machine_v0_14 in hw/pc_piix.c, it autosets properties for devices
(erroneously called drivers in the code).

btw, I notice the I/O ports are write only and don't remember their
state.  I can't think offhand if there's anything bad about it (in fact
not having state makes live migration more robust), but perhaps someone
else will.


-- 
error compiling committee.c: too many arguments to function



Re: [PATCH] vhost-net: add module alias

2012-01-16 Thread Avi Kivity
On 01/11/2012 06:54 AM, Stephen Hemminger wrote:
 By adding a module alias, programs (or users) won't have to explicitly
 call modprobe. Vhost-net will always be available if built into the kernel.
 It does require assigning a permanent minor number for depmod to work.
 Choose one next to TUN since this driver is related to it.

Statically allocated numbers have to go through lanana, no?

This increases the security exposure and the kernel footprint for hosts
that don't want vhost-net.

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH v2] kvm: remove dependence on delay-accounting

2012-01-16 Thread Marcelo Tosatti
On Sat, Jan 14, 2012 at 08:30:51PM +0400, Konstantin Khlebnikov wrote:
 KVM selects delay-accounting only to get sched-info for steal-time accounting.
 Meanwhile delay-accounting can be disabled by boot option. This is ridiculous.
 
 This patch adds internal boolean option CONFIG_TASK_SCHED_INFO to enable only
 task->sched_info and its collecting inside scheduler.
 
 v2:
 * stupid misprint fixed
 
 Signed-off-by: Konstantin Khlebnikov khlebni...@openvz.org
 ---
  arch/x86/kvm/Kconfig  |5 +
  include/linux/sched.h |6 ++
  init/Kconfig  |7 +++
  kernel/sched/core.c   |2 +-
  kernel/sched/stats.h  |4 ++--
  lib/Kconfig.debug |1 +
  6 files changed, 14 insertions(+), 11 deletions(-)
 
 diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
 index 1a7fe86..e3952e8 100644
 --- a/arch/x86/kvm/Kconfig
 +++ b/arch/x86/kvm/Kconfig
 @@ -22,8 +22,6 @@ config KVM
   depends on HAVE_KVM
   # for device assignment:
   depends on PCI
 - # for TASKSTATS/TASK_DELAY_ACCT:
 - depends on NET
   select PREEMPT_NOTIFIERS
   select MMU_NOTIFIER
   select ANON_INODES
 @@ -33,8 +31,7 @@ config KVM
   select KVM_ASYNC_PF
   select USER_RETURN_NOTIFIER
   select KVM_MMIO
 - select TASKSTATS
 - select TASK_DELAY_ACCT
 + select TASK_SCHED_INFO
   select PERF_EVENTS
   ---help---
 Support hosting fully virtualized guest machines using hardware
 diff --git a/include/linux/sched.h b/include/linux/sched.h
 index 868cb83..dd5bf78 100644
 --- a/include/linux/sched.h
 +++ b/include/linux/sched.h
 @@ -734,7 +734,6 @@ extern struct user_struct root_user;
  
  struct backing_dev_info;
  
 -#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
  struct sched_info {
   /* cumulative counters */
   unsigned long pcount; /* # of times run on this cpu */
 @@ -744,7 +743,6 @@ struct sched_info {
   unsigned long long last_arrival,/* when we last ran on a cpu */
  last_queued; /* when we were last queued to run */
  };
 -#endif /* defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT) */
  
  #ifdef CONFIG_TASK_DELAY_ACCT
  struct task_delay_info {
 @@ -782,7 +780,7 @@ struct task_delay_info {
  
  static inline int sched_info_on(void)
  {
 -#ifdef CONFIG_SCHEDSTATS
 +#if IS_ENABLED(CONFIG_SCHEDSTATS) || IS_ENABLED(CONFIG_KVM)
   return 1;

CONFIG_TASK_SCHED_INFO?



[PATCH] KVM: Don't mistreat edge-triggered INIT IPI as INIT de-assert. (LAPIC)

2012-01-16 Thread js
From: Julian Stecklina j...@alien8.de

If the guest programs an IPI with level=0 (de-assert) and trig_mode=0 (edge),
it is erroneously treated as INIT de-assert and ignored, but to quote the
spec: "For this delivery mode [INIT de-assert], the level flag must be set to
0 and trigger mode flag to 1."

Signed-off-by: Julian Stecklina j...@alien8.de
---
 arch/x86/kvm/lapic.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index cfdc6e0..3ee1d83 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -433,7 +433,7 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode,
break;
 
case APIC_DM_INIT:
-   if (level) {
+   if (!trig_mode || level) {
result = 1;
vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
kvm_make_request(KVM_REQ_EVENT, vcpu);
-- 
1.7.8.3
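
The condition in the patch above can be read as a small truth table. The following C sketch restates it outside the kernel; the function names are hypothetical and only the two flags named in the commit message are modelled. It assumes, as the quoted spec text says, that INIT de-assert is exactly level=0 with trigger mode=1.

```c
#include <stdbool.h>

/* INIT level de-assert: level flag 0 and trigger mode flag 1. */
bool sketch_is_init_deassert(bool level, bool trig_mode)
{
    return !level && trig_mode;
}

/* Mirrors the patched condition `!trig_mode || level`. */
bool sketch_init_is_delivered(bool level, bool trig_mode)
{
    return !trig_mode || level;
}
```

With this reading, an edge-triggered IPI with level=0 is delivered as a real INIT, which is the case the old `if (level)` check dropped.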



[PATCH 1/2] KVM: x86 emulator: add 8-bit memory operands

2012-01-16 Thread Avi Kivity
Useful for MOVSX/MOVZX.

Signed-off-by: Avi Kivity a...@redhat.com
---
 arch/x86/kvm/emulate.c |5 +
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 05a562b..92a45dd 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -57,6 +57,7 @@
 #define OpDS  23ull  /* DS */
 #define OpFS  24ull  /* FS */
 #define OpGS  25ull  /* GS */
+#define OpMem8 26ull  /* 8-bit zero extended memory operand */
 
 #define OpBits 5  /* Width of operand field */
 #define OpMask ((1ull << OpBits) - 1)
@@ -101,6 +102,7 @@
 #define SrcAcc  (OpAcc << SrcShift)
 #define SrcImmU16   (OpImmU16 << SrcShift)
 #define SrcDX   (OpDX << SrcShift)
+#define SrcMem8 (OpMem8 << SrcShift)
 #define SrcMask (OpMask << SrcShift)
 #define BitOp   (1<<11)
 #define MemAbs  (1<<12)  /* Memory operand is absolute displacement */
@@ -3605,6 +3607,9 @@ static int decode_operand(struct x86_emulate_ctxt *ctxt, struct operand *op,
case OpImm:
rc = decode_imm(ctxt, op, imm_size(ctxt), true);
break;
+   case OpMem8:
+   ctxt->memop.bytes = 1;
+   goto mem_common;
case OpMem16:
ctxt->memop.bytes = 2;
goto mem_common;
-- 
1.7.7.1
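
To make the operand-field encoding in this patch easier to follow, here is a standalone C sketch of how a 5-bit operand type is packed into and extracted from a 64-bit flags word. The SK_ names are hypothetical, and the value of SK_SrcShift is an assumption, since the hunk does not show where SrcShift is defined.

```c
#include <stdint.h>

#define SK_OpMem8   26ull                      /* value taken from the patch */
#define SK_OpBits   5                          /* width of operand field */
#define SK_OpMask   ((1ull << SK_OpBits) - 1)
#define SK_SrcShift 6                          /* assumed; not shown in the hunk */
#define SK_SrcMem8  (SK_OpMem8 << SK_SrcShift)
#define SK_SrcMask  (SK_OpMask << SK_SrcShift)

/* Extract the source operand type from a packed decode-flags word. */
uint64_t sk_src_operand_type(uint64_t flags)
{
    return (flags & SK_SrcMask) >> SK_SrcShift;
}
```

Since the field is 5 bits wide, operand type numbers up to 31 fit, so 26 for the new 8-bit memory operand is within range.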



[PATCH 2/2] KVM: x86 emulator: Remove byte-sized MOVSX/MOVZX hack

2012-01-16 Thread Avi Kivity
Currently we treat MOVSX/MOVZX with a byte source as a byte instruction,
and change the destination operand size with a hack.  Change it to be
a word instruction, so the destination receives its natural size, and
change the source to be SrcMem8.

Signed-off-by: Avi Kivity a...@redhat.com
---
 arch/x86/kvm/emulate.c |   13 +
 1 files changed, 5 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 92a45dd..1b4edb3 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -860,8 +860,7 @@ static void write_sse_reg(struct x86_emulate_ctxt *ctxt, sse128_t *data,
 }
 
 static void decode_register_operand(struct x86_emulate_ctxt *ctxt,
-   struct operand *op,
-   int inhibit_bytereg)
+   struct operand *op)
 {
unsigned reg = ctxt->modrm_reg;
int highbyte_regs = ctxt->rex_prefix == 0;
@@ -878,7 +877,7 @@ static void decode_register_operand(struct x86_emulate_ctxt *ctxt,
}
 
op->type = OP_REG;
-   if ((ctxt->d & ByteOp) && !inhibit_bytereg) {
+   if (ctxt->d & ByteOp) {
op->addr.reg = decode_register(reg, ctxt->regs, highbyte_regs);
op->bytes = 1;
} else {
@@ -3465,13 +3464,13 @@ static int check_perm_out(struct x86_emulate_ctxt *ctxt)
I(DstMem | SrcReg | ModRM | BitOp | Lock, em_btr),
I(DstReg | SrcMemFAddr | ModRM | Src2FS, em_lseg),
I(DstReg | SrcMemFAddr | ModRM | Src2GS, em_lseg),
-   D(ByteOp | DstReg | SrcMem | ModRM | Mov), D(DstReg | SrcMem16 | ModRM | Mov),
+   D(DstReg | SrcMem8 | ModRM | Mov), D(DstReg | SrcMem16 | ModRM | Mov),
/* 0xB8 - 0xBF */
N, N,
G(BitOp, group8),
I(DstMem | SrcReg | ModRM | BitOp | Lock | PageTable, em_btc),
I(DstReg | SrcMem | ModRM, em_bsf), I(DstReg | SrcMem | ModRM, em_bsr),
-   D(ByteOp | DstReg | SrcMem | ModRM | Mov), D(DstReg | SrcMem16 | ModRM | Mov),
+   D(DstReg | SrcMem8 | ModRM | Mov), D(DstReg | SrcMem16 | ModRM | Mov),
/* 0xC0 - 0xCF */
D2bv(DstMem | SrcReg | ModRM | Lock),
N, D(DstMem | SrcReg | ModRM | Mov),
@@ -3553,9 +3552,7 @@ static int decode_operand(struct x86_emulate_ctxt *ctxt, struct operand *op,
 
switch (d) {
case OpReg:
-   decode_register_operand(ctxt, op,
-op == &ctxt->dst &&
-ctxt->twobyte && (ctxt->b == 0xb6 || ctxt->b == 0xb7));
+   decode_register_operand(ctxt, op);
break;
case OpImmUByte:
rc = decode_imm(ctxt, op, 1, false);
-- 
1.7.7.1



[PATCH 0/2] Remove hack from movsx/movzx decoding

2012-01-16 Thread Avi Kivity
movsx/movzx destination operands currently have a hack for the operand size.
Add OpMem8 and use it to remove the hack.

I'll wait with this until Nadav's more direct fix is in.

Avi Kivity (2):
  KVM: x86 emulator: add 8-bit memory operands
  KVM: x86 emulator: Remove byte-sized MOVSX/MOVZX hack

 arch/x86/kvm/emulate.c |   18 ++
 1 files changed, 10 insertions(+), 8 deletions(-)

-- 
1.7.7.1
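
For readers less familiar with the two instructions, the following C sketch shows what a byte-source MOVZX and MOVSX compute when the destination is 32 bits wide. It is an illustration of the instruction semantics only, not of the emulator's decode path, and the function names are hypothetical.

```c
#include <stdint.h>

/* MOVZX with a byte source: the upper destination bits are cleared. */
uint32_t sk_movzx8(uint8_t src)
{
    return (uint32_t)src;
}

/* MOVSX with a byte source: bit 7 of the source is replicated upward. */
uint32_t sk_movsx8(uint8_t src)
{
    return (src & 0x80u) ? (0xFFFFFF00u | src) : (uint32_t)src;
}
```

The source is always one byte while the destination keeps its natural size, which is the split the SrcMem8 operand type expresses.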



Re: [PATCH v3] KVM: Move gfn_to_memslot() to kvm_host.h

2012-01-16 Thread Alexander Graf

On 13.01.2012, at 07:09, Paul Mackerras wrote:

 This moves __gfn_to_memslot() and search_memslots() from kvm_main.c to
 kvm_host.h to reduce the code duplication caused by the need for
 non-modular code in arch/powerpc/kvm/book3s_hv_rm_mmu.c to call
 gfn_to_memslot() in real mode.
 
 Rather than putting gfn_to_memslot() itself in a header, which would
 lead to increased code size, this puts __gfn_to_memslot() in a header.
 Then, the non-modular uses of gfn_to_memslot() are changed to call
 __gfn_to_memslot() instead.  This way there is only one place in the
 source code that needs to be changed should the gfn_to_memslot()
 implementation need to be modified.
 
 On powerpc, the Book3S HV style of KVM has code that is called from
 real mode which needs to call gfn_to_memslot() and thus needs this.
 (Module code is allocated in the vmalloc region, which can't be
 accessed in real mode.)
 
 With this, we can remove builtin_gfn_to_memslot() from book3s_hv_rm_mmu.c.
 
 Signed-off-by: Paul Mackerras pau...@samba.org

Avi?


Alex
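
As a rough picture of what moving the lookup into a header means, here is a self-contained C sketch of a `static inline` memslot search that both modular and built-in code could include. The struct layout, the names, and the linear scan are assumptions made for illustration; the real __gfn_to_memslot() and search_memslots() are not shown in this thread.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t sk_gfn_t;

/* Hypothetical, simplified memslot: a contiguous range of guest frames. */
struct sk_memslot {
    sk_gfn_t base_gfn;
    uint64_t npages;
};

/* Return the slot containing gfn, or NULL if no slot covers it. */
static inline const struct sk_memslot *
sk_search_memslots(const struct sk_memslot *slots, size_t n, sk_gfn_t gfn)
{
    for (size_t i = 0; i < n; i++) {
        if (gfn >= slots[i].base_gfn &&
            gfn < slots[i].base_gfn + slots[i].npages)
            return &slots[i];
    }
    return NULL;
}
```

Because the function is inline in a header, code that cannot call into module text (such as real-mode handlers) gets its own copy without the source being duplicated.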



Re: [PATCH v3] KVM: Move gfn_to_memslot() to kvm_host.h

2012-01-16 Thread Avi Kivity
On 01/16/2012 03:18 PM, Alexander Graf wrote:
 Avi?

ACK!

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH v3] KVM: Move gfn_to_memslot() to kvm_host.h

2012-01-16 Thread Alexander Graf

On 13.01.2012, at 07:09, Paul Mackerras wrote:

 This moves __gfn_to_memslot() and search_memslots() from kvm_main.c to
 kvm_host.h to reduce the code duplication caused by the need for
 non-modular code in arch/powerpc/kvm/book3s_hv_rm_mmu.c to call
 gfn_to_memslot() in real mode.
 
 Rather than putting gfn_to_memslot() itself in a header, which would
 lead to increased code size, this puts __gfn_to_memslot() in a header.
 Then, the non-modular uses of gfn_to_memslot() are changed to call
 __gfn_to_memslot() instead.  This way there is only one place in the
 source code that needs to be changed should the gfn_to_memslot()
 implementation need to be modified.
 
 On powerpc, the Book3S HV style of KVM has code that is called from
 real mode which needs to call gfn_to_memslot() and thus needs this.
 (Module code is allocated in the vmalloc region, which can't be
 accessed in real mode.)
 
 With this, we can remove builtin_gfn_to_memslot() from book3s_hv_rm_mmu.c.

Which tree is this against? I got this diff between your patch and the patch 
when applied on my tree:

-@@ -97,7 +78,7 @@ static void remove_revmap_chain(struct kvm *kvm, long pte_index,
-   rev = real_vmalloc_addr(&kvm->arch.revmap[pte_index]);
-   ptel = rev->guest_rpte;
+@@ -99,7 +80,7 @@ static void remove_revmap_chain(struct kvm *kvm, long pte_index,
+   rcbits = hpte_r & (HPTE_R_R | HPTE_R_C);
+   ptel = rev->guest_rpte |= rcbits;

Since this is completely unrelated to the actual change, I'll apply the patch 
either way. It'd just be interesting to know.


Alex



Re: [PATCH v2] kvm: remove dependence on delay-accounting

2012-01-16 Thread Peter Zijlstra
On Sat, 2012-01-14 at 20:30 +0400, Konstantin Khlebnikov wrote:
 KVM selects delay-accounting only to get sched-info for steal-time accounting.
 Meanwhile delay-accounting can be disabled by boot option. This is ridiculous.
 
 This patch adds internal boolean option CONFIG_TASK_SCHED_INFO to enable only
 task->sched_info and its collecting inside scheduler.

Urgh, more stupid config knobs, we should be removing them, not adding
moar.

 diff --git a/include/linux/sched.h b/include/linux/sched.h
 index 868cb83..dd5bf78 100644
 --- a/include/linux/sched.h
 +++ b/include/linux/sched.h
 @@ -734,7 +734,6 @@ extern struct user_struct root_user;
  
  struct backing_dev_info;
  
 -#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
  struct sched_info {
   /* cumulative counters */
   unsigned long pcount; /* # of times run on this cpu */
 @@ -744,7 +743,6 @@ struct sched_info {
   unsigned long long last_arrival,/* when we last ran on a cpu */
  last_queued; /* when we were last queued to run */
  };
 -#endif /* defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT) */

Not having that structure helps with compile errors.

  #ifdef CONFIG_TASK_DELAY_ACCT
  struct task_delay_info {
 @@ -782,7 +780,7 @@ struct task_delay_info {
  
  static inline int sched_info_on(void)
  {
 -#ifdef CONFIG_SCHEDSTATS
 +#if IS_ENABLED(CONFIG_SCHEDSTATS) || IS_ENABLED(CONFIG_KVM)

WTF is IS_ENABLED and why do you use it?


Not much like this stuff.


Re: [PATCH v2] kvm: remove dependence on delay-accounting

2012-01-16 Thread Konstantin Khlebnikov

Marcelo Tosatti wrote:

On Sat, Jan 14, 2012 at 08:30:51PM +0400, Konstantin Khlebnikov wrote:

KVM selects delay-accounting only to get sched-info for steal-time accounting.
Meanwhile delay-accounting can be disabled by boot option. This is ridiculous.

This patch adds internal boolean option CONFIG_TASK_SCHED_INFO to enable only
task->sched_info and its collecting inside scheduler.



cut


  static inline int sched_info_on(void)
  {
-#ifdef CONFIG_SCHEDSTATS
+#if IS_ENABLED(CONFIG_SCHEDSTATS) || IS_ENABLED(CONFIG_KVM)
return 1;


CONFIG_TASK_SCHED_INFO?



It makes it equal to constant 1, because all its callers are
under #ifdef CONFIG_TASK_SCHED_INFO =)

its current code:

static inline int sched_info_on(void)
{
#ifdef CONFIG_SCHEDSTATS
return 1;
#elif defined(CONFIG_TASK_DELAY_ACCT)
extern int delayacct_on;
return delayacct_on;
#else
return 0;
#endif
}

CONFIG_SCHEDSTATS == debug option in lib/Kconfig.debug for /proc/schedstat
CONFIG_TASK_DELAY_ACCT == weird netlink-based statistics collecting tool

Thus, it may be better to remove this function, because it can return 0
only if delay-accounting is compiled in but disabled by boot option
(delayacct_on = 1 by default since 2.6.18)

patch follows...
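
The behaviour of the current sched_info_on() shown above can be modelled as an ordinary function, which makes the cases easier to enumerate. In this sketch the two compile-time options become boolean parameters and delayacct_on becomes an int parameter; the function name is hypothetical and nothing here is kernel code.

```c
#include <stdbool.h>

/*
 * Model of the #ifdef ladder in sched_info_on():
 *   schedstats      stands for CONFIG_SCHEDSTATS
 *   task_delay_acct stands for CONFIG_TASK_DELAY_ACCT
 *   delayacct_on    stands for the runtime/boot-time switch
 */
int sk_sched_info_on(bool schedstats, bool task_delay_acct, int delayacct_on)
{
    if (schedstats)
        return 1;
    if (task_delay_acct)
        return delayacct_on;
    return 0;
}
```

When at least one of the two options is compiled in, the only way to get 0 is delay-accounting built in and switched off, which matches the argument made in the message.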



[PATCH] sched: remove task-sched-info dynamic disabler

2012-01-16 Thread Konstantin Khlebnikov
Currently per-task sched-info statistics can be disabled if delay-accounting is
disabled by boot option. There is no reason for it: sched-info is built into
task-struct, and collecting it does not add any extra atomic operations.
In other combinations it is either not compiled or cannot be disabled.

This patch removes sched_info_on() and fixes all its users.

Signed-off-by: Konstantin Khlebnikov khlebni...@openvz.org
---
 arch/x86/kvm/cpuid.c  |6 ++
 arch/x86/kvm/x86.c|4 
 include/linux/sched.h |   12 
 kernel/sched/core.c   |3 +--
 kernel/sched/stats.h  |   18 +-
 5 files changed, 8 insertions(+), 35 deletions(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 89b02bf..12870d8 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -408,10 +408,8 @@ static int do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 (1 << KVM_FEATURE_NOP_IO_DELAY) |
 (1 << KVM_FEATURE_CLOCKSOURCE2) |
 (1 << KVM_FEATURE_ASYNC_PF) |
-(1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT);
-
-   if (sched_info_on())
-   entry->eax |= (1 << KVM_FEATURE_STEAL_TIME);
+(1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT) |
+(1 << KVM_FEATURE_STEAL_TIME);
 
entry->ebx = 0;
entry->ecx = 0;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 14d6cad..a60645c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1579,10 +1579,6 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
return 1;
break;
case MSR_KVM_STEAL_TIME:
-
-   if (unlikely(!sched_info_on()))
-   return 1;
-
if (data & KVM_STEAL_RESERVED_MASK)
return 1;
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index dd5bf78..eb8842e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -778,18 +778,6 @@ struct task_delay_info {
 };
 #endif /* CONFIG_TASK_DELAY_ACCT */
 
-static inline int sched_info_on(void)
-{
-#if IS_ENABLED(CONFIG_SCHEDSTATS) || IS_ENABLED(CONFIG_KVM)
-   return 1;
-#elif defined(CONFIG_TASK_DELAY_ACCT)
-   extern int delayacct_on;
-   return delayacct_on;
-#else
-   return 0;
-#endif
-}
-
 enum cpu_idle_type {
CPU_IDLE,
CPU_NOT_IDLE,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0c59dcb..73acf77 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1759,8 +1759,7 @@ void sched_fork(struct task_struct *p)
raw_spin_unlock_irqrestore(&p->pi_lock, flags);
 
 #ifdef CONFIG_TASK_SCHED_INFO
-   if (likely(sched_info_on()))
-   memset(&p->sched_info, 0, sizeof(p->sched_info));
+   memset(&p->sched_info, 0, sizeof(p->sched_info));
 #endif
 #if defined(CONFIG_SMP)
p->on_cpu = 0;
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index 2322b86..e6363ed 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -63,9 +63,8 @@ static inline void sched_info_dequeued(struct task_struct *t)
 {
unsigned long long now = task_rq(t)->clock, delta = 0;
 
-   if (unlikely(sched_info_on()))
-   if (t->sched_info.last_queued)
-   delta = now - t->sched_info.last_queued;
+   if (t->sched_info.last_queued)
+   delta = now - t->sched_info.last_queued;
sched_info_reset_dequeued(t);
t->sched_info.run_delay += delta;
 
@@ -98,9 +97,8 @@ static void sched_info_arrive(struct task_struct *t)
  */
 static inline void sched_info_queued(struct task_struct *t)
 {
-   if (unlikely(sched_info_on()))
-   if (!t->sched_info.last_queued)
-   t->sched_info.last_queued = task_rq(t)->clock;
+   if (!t->sched_info.last_queued)
+   t->sched_info.last_queued = task_rq(t)->clock;
 }
 
 /*
@@ -127,7 +125,7 @@ static inline void sched_info_depart(struct task_struct *t)
  * the idle task.)  We are only called when prev != next.
  */
 static inline void
-__sched_info_switch(struct task_struct *prev, struct task_struct *next)
+sched_info_switch(struct task_struct *prev, struct task_struct *next)
 {
struct rq *rq = task_rq(prev);
 
@@ -142,12 +140,6 @@ __sched_info_switch(struct task_struct *prev, struct task_struct *next)
if (next != rq->idle)
sched_info_arrive(next);
 }
-static inline void
-sched_info_switch(struct task_struct *prev, struct task_struct *next)
-{
-   if (unlikely(sched_info_on()))
-   __sched_info_switch(prev, next);
-}
 #else
 #define sched_info_queued(t)   do { } while (0)
 #define sched_info_reset_dequeued(t)   do { } while (0)


Offline for a bit

2012-01-16 Thread Avi Kivity
I'll be offline until next Monday (inclusive).  Please send kvm patches
to Marcelo, as usual.  Urgent or non-core memory API fixes can go to
Anthony, I'll review core memory patches, if any, when I return.

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH v2] kvm: remove dependence on delay-accounting

2012-01-16 Thread Konstantin Khlebnikov

Peter Zijlstra wrote:

On Sat, 2012-01-14 at 20:30 +0400, Konstantin Khlebnikov wrote:

KVM selects delay-accounting only to get sched-info for steal-time accounting.
Meanwhile delay-accounting can be disabled by boot option. This is ridiculous.

This patch adds internal boolean option CONFIG_TASK_SCHED_INFO to enable only
task->sched_info and its collecting inside scheduler.


Urgh, more stupid config knobs, we should be removing them, not adding
moar.


Unfortunately, removing task-delay-accounting is not an option =)




diff --git a/include/linux/sched.h b/include/linux/sched.h
index 868cb83..dd5bf78 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -734,7 +734,6 @@ extern struct user_struct root_user;

  struct backing_dev_info;

-#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
  struct sched_info {
/* cumulative counters */
unsigned long pcount; /* # of times run on this cpu */
@@ -744,7 +743,6 @@ struct sched_info {
unsigned long long last_arrival,/* when we last ran on a cpu */
   last_queued; /* when we were last queued to run */
  };
-#endif /* defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT) */


Not having that structure helps with compile errors.


I don't think so, this only helps to catch useless declarations and definitions
in code and inside other structures.




  #ifdef CONFIG_TASK_DELAY_ACCT
  struct task_delay_info {
@@ -782,7 +780,7 @@ struct task_delay_info {

  static inline int sched_info_on(void)
  {
-#ifdef CONFIG_SCHEDSTATS
+#if IS_ENABLED(CONFIG_SCHEDSTATS) || IS_ENABLED(CONFIG_KVM)


WTF is IS_ENABLED and why do you use it?


IS_ENABLED(smth) == (defined(smth) || defined(smth_MODULE))
but using it for CONFIG_SCHEDSTATS is overkill, indeed.




Not much like this stuff.
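
The definition given in the reply above, (defined(smth) || defined(smth_MODULE)), cannot be written literally as a function-like macro, because `defined` is only reliable directly inside `#if`. One way to get the same effect with plain preprocessor tricks is sketched below. All SK_ and CONFIG_DEMO_ names are hypothetical, and this is not claimed to be how the kernel implemented IS_ENABLED at the time of this thread; it assumes options are defined to 1 when enabled and left undefined otherwise.

```c
/* Pretend configuration: one built-in option, one modular, one disabled. */
#define CONFIG_DEMO_BUILTIN 1
#define CONFIG_DEMO_MOD_MODULE 1
/* CONFIG_DEMO_OFF is deliberately left undefined. */

/* If the option expands to 1, this placeholder injects an extra argument. */
#define SK_ARG_PLACEHOLDER_1 0,
#define sk_take_second(ignored, val, ...) val
#define sk_is_defined_3(arg1_or_junk) sk_take_second(arg1_or_junk 1, 0, 0)
#define sk_is_defined_2(value) sk_is_defined_3(SK_ARG_PLACEHOLDER_##value)
#define sk_is_defined(x) sk_is_defined_2(x)

/* 1 if the option is built in (=y) or built as a module (=m), else 0. */
#define SK_IS_ENABLED(option) \
    (sk_is_defined(option) || sk_is_defined(option##_MODULE))

int sk_demo_builtin_enabled(void) { return SK_IS_ENABLED(CONFIG_DEMO_BUILTIN); }
int sk_demo_mod_enabled(void)     { return SK_IS_ENABLED(CONFIG_DEMO_MOD); }
int sk_demo_off_enabled(void)     { return SK_IS_ENABLED(CONFIG_DEMO_OFF); }
```

This also shows why the macro matters for CONFIG_KVM (which can be a module) and is overkill for a bool-only option such as CONFIG_SCHEDSTATS.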




Re: [RFC PATCH] emulator: Fix task switch into/out of VM86

2012-01-16 Thread Kevin Wolf
Am 10.01.2012 18:51, schrieb Joerg Roedel:
 On Tue, Jan 10, 2012 at 01:30:47PM +0200, Gleb Natapov wrote:
 On Tue, Jan 10, 2012 at 12:25:18PM +0100, Kevin Wolf wrote:
 Did that now, and it looks like exit_int_info is always 0 during the
 task switch intercept for a task gate in the IDT. So special-casing VM86
 won't help, I'm afraid.

 Joerg, do you know how we can check that a task switch was caused by an
 exception triggering a task gate if exit_int_info is not provided during
 the intercept (short of decoding the instruction that caused the intercept to
 see if it's a call to a gate, which is racy)?
 
 Hmm, haven't found a solution yet. But I will check further and let you
 know if I find out something.

Jörg, any news on this?

Kevin


Re: [PATCH] vhost-net: add module alias (v2.1)

2012-01-16 Thread Stephen Hemminger
On Mon, 16 Jan 2012 12:26:45 +
Alan Cox a...@linux.intel.com wrote:

   ACKs, NACKs?  What is happening here?
  
  I would like an Ack from Alan Cox who switched vhost-net
  to a dynamic minor in the first place, in commit
  79907d89c397b8bc2e05b347ec94e928ea919d33.
 
 Sorry dev...@lanana.org isn't yet back from the kernel hack incident.
 
 I don't read netdev so someone needs to summarise the issue and send me
 a copy of the patch to look at.
 
 Alan

Subject: vhost-net: add module alias (v2.1)

By adding some module aliases, programs (or users) won't have to explicitly
call modprobe. Vhost-net will always be available if built into the kernel.
It does require assigning a permanent minor number for depmod to work.

Also:
  - use C99 style initialization.
  - add missing entry in documentation for loop-control

Signed-off-by: Stephen Hemminger shemmin...@vyatta.com

---
2.1 - add missing documentation for loop control as well

 Documentation/devices.txt  |3 +++
 drivers/vhost/net.c|8 +---
 include/linux/miscdevice.h |1 +
 3 files changed, 9 insertions(+), 3 deletions(-)

--- a/drivers/vhost/net.c   2012-01-12 14:14:25.681815487 -0800
+++ b/drivers/vhost/net.c   2012-01-12 18:09:56.810680816 -0800
@@ -856,9 +856,9 @@ static const struct file_operations vhos
 };
 
 static struct miscdevice vhost_net_misc = {
-   MISC_DYNAMIC_MINOR,
-   "vhost-net",
-   &vhost_net_fops,
+   .minor = VHOST_NET_MINOR,
+   .name = "vhost-net",
+   .fops = &vhost_net_fops,
 };
 
 static int vhost_net_init(void)
@@ -879,3 +879,5 @@ MODULE_VERSION("0.0.1");
 MODULE_LICENSE("GPL v2");
 MODULE_AUTHOR("Michael S. Tsirkin");
 MODULE_DESCRIPTION("Host kernel accelerator for virtio net");
+MODULE_ALIAS_MISCDEV(VHOST_NET_MINOR);
+MODULE_ALIAS("devname:vhost-net");
--- a/include/linux/miscdevice.h2012-01-12 14:14:25.725815981 -0800
+++ b/include/linux/miscdevice.h2012-01-12 18:09:56.810680816 -0800
@@ -42,6 +42,7 @@
 #define AUTOFS_MINOR   235
 #define MAPPER_CTRL_MINOR  236
 #define LOOP_CTRL_MINOR237
+#define VHOST_NET_MINOR238
 #define MISC_DYNAMIC_MINOR 255
 
 struct device;
--- a/Documentation/devices.txt 2012-01-12 14:14:25.701815712 -0800
+++ b/Documentation/devices.txt 2012-01-12 18:09:56.814680860 -0800
@@ -447,6 +447,9 @@ Your cooperation is appreciated.
234 = /dev/btrfs-controlBtrfs control device
235 = /dev/autofs   Autofs control device
236 = /dev/mapper/control   Device-Mapper control device
+   237 = /dev/loop-control Loopback control device
+   238 = /dev/vhost-netHost kernel accelerator for virtio net
+
240-254 Reserved for local use
255 Reserved for MISC_DYNAMIC_MINOR
 
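
The "C99 style initialization" mentioned in the changelog can be shown in isolation. The sketch below uses a cut-down, hypothetical stand-in for struct miscdevice with only the three fields touched by the patch, and a dummy object in place of the real file_operations; the minor number 238 is the one proposed in the patch.

```c
/* Hypothetical stand-in with only the fields the patch initializes. */
struct sk_miscdevice {
    int minor;
    const char *name;
    const void *fops;
};

#define SK_VHOST_NET_MINOR 238   /* value proposed in the patch */

/* Placeholder for the real file_operations table. */
const int sk_dummy_fops = 0;

/* Designated initializers name each field, so field order no longer matters. */
const struct sk_miscdevice sk_vhost_net_misc = {
    .minor = SK_VHOST_NET_MINOR,
    .name  = "vhost-net",
    .fops  = &sk_dummy_fops,
};
```

Compared with the positional form being removed, this stays correct if fields are later added to or reordered in the structure.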


[PATCH v7 10/18] ioapic: Drop post-load irr initialization

2012-01-16 Thread Jan Kiszka
As all devices undergo a reset prior to vmload, and the reset value of
irr is 0, we do not need to do this clearing for older vmstates
explicitly. Dropping this redundant code will also make KVM integration
a bit simpler.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 hw/ioapic.c |   12 
 1 files changed, 0 insertions(+), 12 deletions(-)

diff --git a/hw/ioapic.c b/hw/ioapic.c
index 27b07c6..0743af6 100644
--- a/hw/ioapic.c
+++ b/hw/ioapic.c
@@ -278,21 +278,9 @@ ioapic_mem_write(void *opaque, target_phys_addr_t addr, uint64_t val,
     }
 }
 
-static int ioapic_post_load(void *opaque, int version_id)
-{
-    IOAPICState *s = opaque;
-
-    if (version_id == 1) {
-        /* set sane value */
-        s->irr = 0;
-    }
-    return 0;
-}
-
 static const VMStateDescription vmstate_ioapic = {
     .name = "ioapic",
     .version_id = 3,
-    .post_load = ioapic_post_load,
     .minimum_version_id = 1,
     .minimum_version_id_old = 1,
     .fields = (VMStateField[]) {
-- 
1.7.3.4



[PATCH v7 12/18] memory: Introduce memory_region_init_reservation

2012-01-16 Thread Jan Kiszka
Introduce a memory region type that can reserve I/O space. Such regions
are useful for modeling I/O that is only handled outside of QEMU, i.e.
in the context of an accelerator like KVM.

Any access to such a region from QEMU is a bug, but could theoretically
be triggered by guest code (DMA to reserved region). So only warn
about such events once, then ignore them.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 memory.c |   36 
 memory.h |   16 
 2 files changed, 52 insertions(+), 0 deletions(-)

diff --git a/memory.c b/memory.c
index a7e615a..30eb1b9 100644
--- a/memory.c
+++ b/memory.c
@@ -1049,6 +1049,42 @@ void memory_region_init_rom_device(MemoryRegion *mr,
     mr->backend_registered = true;
 }
 
+static uint64_t invalid_read(void *opaque, target_phys_addr_t addr,
+                             unsigned size)
+{
+    MemoryRegion *mr = opaque;
+
+    if (!mr->warning_printed) {
+        fprintf(stderr, "Invalid read from memory region %s\n", mr->name);
+        mr->warning_printed = true;
+    }
+    return -1U;
+}
+
+static void invalid_write(void *opaque, target_phys_addr_t addr, uint64_t data,
+                          unsigned size)
+{
+    MemoryRegion *mr = opaque;
+
+    if (!mr->warning_printed) {
+        fprintf(stderr, "Invalid write to memory region %s\n", mr->name);
+        mr->warning_printed = true;
+    }
+}
+
+static const MemoryRegionOps reservation_ops = {
+    .read = invalid_read,
+    .write = invalid_write,
+    .endianness = DEVICE_NATIVE_ENDIAN,
+};
+
+void memory_region_init_reservation(MemoryRegion *mr,
+                                    const char *name,
+                                    uint64_t size)
+{
+    memory_region_init_io(mr, &reservation_ops, mr, name, size);
+}
+
 void memory_region_destroy(MemoryRegion *mr)
 {
     assert(QTAILQ_EMPTY(&mr->subregions));
diff --git a/memory.h b/memory.h
index fe643ff..4b5645f 100644
--- a/memory.h
+++ b/memory.h
@@ -124,6 +124,7 @@ struct MemoryRegion {
     bool readable;
     bool readonly; /* For RAM regions */
     bool enabled;
+    bool warning_printed; /* For reservations */
     MemoryRegion *alias;
     target_phys_addr_t alias_offset;
     unsigned priority;
@@ -251,6 +252,21 @@ void memory_region_init_rom_device(MemoryRegion *mr,
                                    uint64_t size);
 
 /**
+ * memory_region_init_reservation: Initialize a memory region that reserves
+ *                                 I/O space.
+ *
+ * A reservation region primarily serves debugging purposes.  It claims I/O
+ * space that is not supposed to be handled by QEMU itself.  Any access via
+ * the memory API triggers a one-time warning and is otherwise ignored.
+ *
+ * @mr: the #MemoryRegion to be initialized
+ * @name: used for debugging; not visible to the user or ABI
+ * @size: size of the region.
+ */
+void memory_region_init_reservation(MemoryRegion *mr,
+                                    const char *name,
+                                    uint64_t size);
+/**
  * memory_region_destroy: Destroy a memory region and reclaim all resources.
  *
  * @mr: the region to be destroyed.  May not currently be a subregion
-- 
1.7.3.4
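
For readers who want the warn-once behaviour of the reservation region in
isolation, here is a minimal standalone sketch. It is not QEMU code:
ReservedRegion, reserved_read, reserved_write and the warnings counter are
hypothetical names made up for the illustration. Only the shape follows the
patch above: one flag per region, a single message on the first stray
access, -1U returned for reads, writes dropped.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for a reserved region; not QEMU's MemoryRegion. */
typedef struct ReservedRegion {
    const char *name;
    bool warning_printed;   /* set after the first stray access */
    unsigned warnings;      /* illustration only: counts emitted warnings */
} ReservedRegion;

/* Emit at most one warning per region, however often it is accessed. */
static void reserved_warn_once(ReservedRegion *r, const char *what)
{
    if (!r->warning_printed) {
        fprintf(stderr, "Invalid %s memory region %s\n", what, r->name);
        r->warning_printed = true;
        r->warnings++;
    }
}

/* Reads return -1U, i.e. all ones in the low 32 bits. */
static uint64_t reserved_read(ReservedRegion *r)
{
    reserved_warn_once(r, "read from");
    return -1U;
}

/* Writes are dropped after the (possible) warning. */
static void reserved_write(ReservedRegion *r, uint64_t data)
{
    (void)data;
    reserved_warn_once(r, "write to");
}
```

Because the flag is shared between reads and writes, a guest that hammers
the region produces exactly one line on stderr.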



[PATCH v7 00/18] uq/master: Introduce basic irqchip support

2012-01-16 Thread Jan Kiszka
Changes in v7:
- introduce {apic,pic,ioapic}_qdev_register and use
  {APIC,PIC,IOAPIC}CommonInfo to move more code into the common modules
- clean up forgotten fragments of backend/frontend approach
- rephrased potentially misleading title of last patch ;)

CC: Lai Jiangshan la...@cn.fujitsu.com

Jan Kiszka (18):
  msi: Generalize msix_supported to msi_supported
  kvm: Move kvmclock into hw/kvm folder
  apic: Stop timer on reset
  apic: Inject external NMI events via LINT1
  apic: Introduce apic_report_irq_delivered
  apic: Factor out base class for KVM reuse
  apic: Open-code timer save/restore
  i8259: Completely privatize PicState
  i8259: Factor out base class for KVM reuse
  ioapic: Drop post-load irr initialization
  ioapic: Factor out base class for KVM reuse
  memory: Introduce memory_region_init_reservation
  kvm: Introduce core services for in-kernel irqchip support
  kvm: x86: Establish IRQ0 override control
  kvm: x86: Add user space part for in-kernel APIC
  kvm: x86: Add user space part for in-kernel i8259
  kvm: x86: Add user space part for in-kernel IOAPIC
  kvm: Activate in-kernel irqchip support

 Makefile.objs  |2 +-
 Makefile.target|6 +-
 configure  |1 +
 cpus.c |6 +-
 hw/apic.c  |  356 ++--
 hw/apic.h  |1 +
 hw/apic_common.c   |  302 ++
 hw/apic_internal.h |  115 +
 hw/i8259.c |  163 --
 hw/i8259_common.c  |  147 +
 hw/i8259_internal.h|   76 +
 hw/ioapic.c|  142 ++--
 hw/ioapic_common.c |  104 
 hw/ioapic_internal.h   |   97 +++
 hw/kvm/apic.c  |  138 
 hw/{kvmclock.c => kvm/clock.c} |4 +-
 hw/{kvmclock.h => kvm/clock.h} |0
 hw/kvm/i8259.c |  128 ++
 hw/kvm/ioapic.c|  114 +
 hw/msi.c   |8 +
 hw/msi.h   |2 +
 hw/msix.c  |9 +-
 hw/msix.h  |2 -
 hw/pc.c|   20 ++-
 hw/pc.h|8 +-
 hw/pc_piix.c   |   67 +++-
 kvm-all.c  |  154 +
 kvm-stub.c |5 +
 kvm.h  |   14 ++
 memory.c   |   36 
 memory.h   |   16 ++
 qemu-config.c  |4 +
 qemu-options.hx|5 +-
 sysemu.h   |1 -
 target-i386/kvm.c  |   49 ++
 trace-events   |2 +-
 vl.c   |1 -
 37 files changed, 1714 insertions(+), 591 deletions(-)
 create mode 100644 hw/apic_common.c
 create mode 100644 hw/apic_internal.h
 create mode 100644 hw/i8259_common.c
 create mode 100644 hw/i8259_internal.h
 create mode 100644 hw/ioapic_common.c
 create mode 100644 hw/ioapic_internal.h
 create mode 100644 hw/kvm/apic.c
 rename hw/{kvmclock.c => kvm/clock.c} (98%)
 rename hw/{kvmclock.h => kvm/clock.h} (100%)
 create mode 100644 hw/kvm/i8259.c
 create mode 100644 hw/kvm/ioapic.c

-- 
1.7.3.4



[PATCH v7 08/18] i8259: Completely privatize PicState

2012-01-16 Thread Jan Kiszka
Use DeviceState instead of PicState in the public i8259 API. This is
cleaner and allows the PIC data structures to be reorganized for KVM reuse.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 hw/i8259.c |   17 +++--
 hw/pc.h|7 +++
 2 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/hw/i8259.c b/hw/i8259.c
index 7331e0e..cfaa35c 100644
--- a/hw/i8259.c
+++ b/hw/i8259.c
@@ -40,6 +40,8 @@
 //#define DEBUG_IRQ_LATENCY
 //#define DEBUG_IRQ_COUNT
 
+typedef struct PicState PicState;
+
 struct PicState {
     ISADevice dev;
     uint8_t last_irr; /* edge detection */
@@ -76,7 +78,7 @@ static uint64_t irq_count[16];
 #ifdef DEBUG_IRQ_LATENCY
 static int64_t irq_time[16];
 #endif
-PicState *isa_pic;
+DeviceState *isa_pic;
 static PicState *slave_pic;
 
 /* return the highest priority found in mask (highest = smallest
@@ -206,8 +208,9 @@ static void pic_intack(PicState *s, int irq)
     pic_update_irq(s);
 }
 
-int pic_read_irq(PicState *s)
+int pic_read_irq(DeviceState *d)
 {
+    PicState *s = DO_UPCAST(PicState, dev.qdev, d);
     int irq, irq2, intno;
 
     irq = pic_get_irq(s);
@@ -269,7 +272,7 @@ static void pic_init_reset(PicState *s)
 
 static void pic_reset(DeviceState *dev)
 {
-    PicState *s = container_of(dev, PicState, dev.qdev);
+    PicState *s = DO_UPCAST(PicState, dev.qdev, dev);
 
     pic_init_reset(s);
     s->elcr = 0;
@@ -399,8 +402,10 @@ static uint64_t pic_ioport_read(void *opaque, target_phys_addr_t addr,
     return ret;
 }
 
-int pic_get_output(PicState *s)
+int pic_get_output(DeviceState *d)
 {
+    PicState *s = DO_UPCAST(PicState, dev.qdev, d);
+
     return (pic_get_irq(s) >= 0);
 }
 
@@ -491,7 +496,7 @@ void pic_info(Monitor *mon)
         return;
     }
     for (i = 0; i < 2; i++) {
-        s = i == 0 ? isa_pic : slave_pic;
+        s = i == 0 ? DO_UPCAST(PicState, dev.qdev, isa_pic) : slave_pic;
         monitor_printf(mon, "pic%d: irr=%02x imr=%02x isr=%02x hprio=%d "
                        "irq_base=%02x rr_sel=%d elcr=%02x fnm=%d\n",
                        i, s->irr, s->imr, s->isr, s->priority_add,
@@ -538,7 +543,7 @@ qemu_irq *i8259_init(ISABus *bus, qemu_irq parent_irq)
         irq_set[i] = qdev_get_gpio_in(&dev->qdev, i);
     }
 
-    isa_pic = DO_UPCAST(PicState, dev, dev);
+    isa_pic = &dev->qdev;
 
     dev = isa_create(bus, "isa-i8259");
     qdev_prop_set_uint32(&dev->qdev, "iobase", 0xa0);
diff --git a/hw/pc.h b/hw/pc.h
index 13e41f1..ece069a 100644
--- a/hw/pc.h
+++ b/hw/pc.h
@@ -62,11 +62,10 @@ bool parallel_mm_init(MemoryRegion *address_space,
 
 /* i8259.c */
 
-typedef struct PicState PicState;
-extern PicState *isa_pic;
+extern DeviceState *isa_pic;
 qemu_irq *i8259_init(ISABus *bus, qemu_irq parent_irq);
-int pic_read_irq(PicState *s);
-int pic_get_output(PicState *s);
+int pic_read_irq(DeviceState *d);
+int pic_get_output(DeviceState *d);
 void pic_info(Monitor *mon);
 void irq_info(Monitor *mon);
 
-- 
1.7.3.4
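
The idea of handing out only the generic device pointer and recovering the
private state inside the implementation can be shown with a small
standalone sketch. The types and helpers below (Device, Pic,
pic_from_device, pic_has_pending) are hypothetical and much simpler than
QEMU's DeviceState and DO_UPCAST; the sketch only assumes that the upcast
amounts to subtracting the offset of the embedded member.

```c
#include <stddef.h>

/* Hypothetical miniature of the pattern; these are not QEMU's types. */
typedef struct Device {
    int id;
} Device;

typedef struct Pic {
    Device dev;        /* embedded generic part, handed out to callers */
    unsigned irr;      /* private state, invisible outside this file */
} Pic;

/* Recover the containing Pic from a pointer to its embedded Device. */
static Pic *pic_from_device(Device *d)
{
    return (Pic *)((char *)d - offsetof(Pic, dev));
}

/* The public API takes the generic handle only. */
static int pic_has_pending(Device *d)
{
    Pic *s = pic_from_device(d);

    return s->irr != 0;
}
```

Callers never see struct Pic, so its layout can change (or be split into a
common base and model-specific parts) without touching the public header.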



[PATCH v7 17/18] kvm: x86: Add user space part for in-kernel IOAPIC

2012-01-16 Thread Jan Kiszka
This introduces the KVM-accelerated IOAPIC model 'kvm-ioapic' and
extends the IRQ routing setup by the 0->2 redirection when needed.

The kvm-ioapic model has a property that defines its GSI base for
injecting interrupts into the kernel model. This will allow
disentangling PIC and IOAPIC pins for chipsets that support more
sophisticated IRQ routes than the PIIX3. So far the base is kept at 0,
i.e. PIC and IOAPIC share pins 0..15.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 Makefile.target |2 +-
 hw/kvm/ioapic.c |  114 +++
 hw/pc_piix.c|   15 +++-
 3 files changed, 129 insertions(+), 2 deletions(-)
 create mode 100644 hw/kvm/ioapic.c

diff --git a/Makefile.target b/Makefile.target
index f49f96e..c82671b 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -228,7 +228,7 @@ obj-i386-y += vmport.o
 obj-i386-y += device-hotplug.o pci-hotplug.o smbios.o wdt_ib700.o
 obj-i386-y += debugcon.o multiboot.o
 obj-i386-y += pc_piix.o
-obj-i386-$(CONFIG_KVM) += kvm/clock.o kvm/apic.o kvm/i8259.o
+obj-i386-$(CONFIG_KVM) += kvm/clock.o kvm/apic.o kvm/i8259.o kvm/ioapic.o
 obj-i386-$(CONFIG_SPICE) += qxl.o qxl-logger.o qxl-render.o
 
 # shared objects
diff --git a/hw/kvm/ioapic.c b/hw/kvm/ioapic.c
new file mode 100644
index 000..10ffdd4
--- /dev/null
+++ b/hw/kvm/ioapic.c
@@ -0,0 +1,114 @@
+/*
+ * KVM in-kernel IOAPIC support
+ *
+ * Copyright (c) 2011 Siemens AG
+ *
+ * Authors:
+ *  Jan Kiszka  jan.kis...@siemens.com
+ *
+ * This work is licensed under the terms of the GNU GPL version 2.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "hw/pc.h"
+#include "hw/ioapic_internal.h"
+#include "hw/apic_internal.h"
+#include "kvm.h"
+
+typedef struct KVMIOAPICState KVMIOAPICState;
+
+struct KVMIOAPICState {
+    IOAPICCommonState ioapic;
+    uint32_t kvm_gsi_base;
+};
+
+static void kvm_ioapic_get(IOAPICCommonState *s)
+{
+    struct kvm_irqchip chip;
+    struct kvm_ioapic_state *kioapic;
+    int ret, i;
+
+    chip.chip_id = KVM_IRQCHIP_IOAPIC;
+    ret = kvm_vm_ioctl(kvm_state, KVM_GET_IRQCHIP, &chip);
+    if (ret < 0) {
+        fprintf(stderr, "KVM_GET_IRQCHIP failed: %s\n", strerror(ret));
+        abort();
+    }
+
+    kioapic = &chip.chip.ioapic;
+
+    s->id = kioapic->id;
+    s->ioregsel = kioapic->ioregsel;
+    s->irr = kioapic->irr;
+    for (i = 0; i < IOAPIC_NUM_PINS; i++) {
+        s->ioredtbl[i] = kioapic->redirtbl[i].bits;
+    }
+}
+
+static void kvm_ioapic_put(IOAPICCommonState *s)
+{
+    struct kvm_irqchip chip;
+    struct kvm_ioapic_state *kioapic;
+    int ret, i;
+
+    chip.chip_id = KVM_IRQCHIP_IOAPIC;
+    kioapic = &chip.chip.ioapic;
+
+    kioapic->id = s->id;
+    kioapic->ioregsel = s->ioregsel;
+    kioapic->base_address = s->busdev.mmio[0].addr;
+    kioapic->irr = s->irr;
+    for (i = 0; i < IOAPIC_NUM_PINS; i++) {
+        kioapic->redirtbl[i].bits = s->ioredtbl[i];
+    }
+
+    ret = kvm_vm_ioctl(kvm_state, KVM_SET_IRQCHIP, &chip);
+    if (ret < 0) {
+        fprintf(stderr, "KVM_GET_IRQCHIP failed: %s\n", strerror(ret));
+        abort();
+    }
+}
+
+static void kvm_ioapic_reset(DeviceState *dev)
+{
+    IOAPICCommonState *s = DO_UPCAST(IOAPICCommonState, busdev.qdev, dev);
+
+    ioapic_reset_common(dev);
+    kvm_ioapic_put(s);
+}
+
+static void kvm_ioapic_set_irq(void *opaque, int irq, int level)
+{
+    KVMIOAPICState *s = opaque;
+    int delivered;
+
+    delivered = kvm_irqchip_set_irq(kvm_state, s->kvm_gsi_base + irq, level);
+    apic_report_irq_delivered(delivered);
+}
+
+static void kvm_ioapic_init(IOAPICCommonState *s, int instance_no)
+{
+    memory_region_init_reservation(&s->io_memory, "kvm-ioapic", 0x1000);
+
+    qdev_init_gpio_in(&s->busdev.qdev, kvm_ioapic_set_irq, IOAPIC_NUM_PINS);
+}
+
+static IOAPICCommonInfo kvm_ioapic_info = {
+    .busdev.qdev.name  = "kvm-ioapic",
+    .busdev.qdev.size = sizeof(KVMIOAPICState),
+    .busdev.qdev.reset = kvm_ioapic_reset,
+    .busdev.qdev.props = (Property[]) {
+        DEFINE_PROP_UINT32("gsi_base", KVMIOAPICState, kvm_gsi_base, 0),
+        DEFINE_PROP_END_OF_LIST()
+    },
+    .init      = kvm_ioapic_init,
+    .pre_save  = kvm_ioapic_get,
+    .post_load = kvm_ioapic_put,
+};
+
+static void kvm_ioapic_register_device(void)
+{
+    ioapic_qdev_register(&kvm_ioapic_info);
+}
+
+device_init(kvm_ioapic_register_device)
diff --git a/hw/pc_piix.c b/hw/pc_piix.c
index 868f4a5..fcc374f 100644
--- a/hw/pc_piix.c
+++ b/hw/pc_piix.c
@@ -68,6 +68,15 @@ static void kvm_piix3_setup_irq_routing(bool pci_enabled)
     for (i = 8; i < 16; ++i) {
         kvm_irqchip_add_route(s, i, KVM_IRQCHIP_PIC_SLAVE, i - 8);
     }
+    if (pci_enabled) {
+        for (i = 0; i < 24; ++i) {
+            if (i == 0) {
+                kvm_irqchip_add_route(s, i, KVM_IRQCHIP_IOAPIC, 2);
+            } else if (i != 2) {
+                kvm_irqchip_add_route(s, i, KVM_IRQCHIP_IOAPIC, i);
+            }
+        }
+    }
 ret = 
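
(The archived diff is cut off at this point.) The GSI to IOAPIC pin mapping
that the loop above sets up can be written as a small pure function, which
makes the IRQ0 override easy to check. This is only a sketch:
ioapic_pin_for_gsi and IOAPIC_PINS are hypothetical names, and the value 24
is taken from the loop bound in the hunk above.

```c
#define IOAPIC_PINS 24   /* loop bound used in the hunk above */

/*
 * Return the IOAPIC pin a GSI is routed to, or -1 if the GSI gets no
 * IOAPIC route.  GSI 0 is redirected to pin 2 and GSI 2 gets no route
 * of its own; every other GSI maps 1:1.
 */
static int ioapic_pin_for_gsi(int gsi)
{
    if (gsi < 0 || gsi >= IOAPIC_PINS) {
        return -1;
    }
    if (gsi == 0) {
        return 2;
    }
    if (gsi == 2) {
        return -1;
    }
    return gsi;
}
```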

[PATCH v7 03/18] apic: Stop timer on reset

2012-01-16 Thread Jan Kiszka
All LVTs are masked on reset, so the timer becomes ineffective. Letting
it tick nevertheless is harmless, but will at least create a spurious
trace event.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 hw/apic.c |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/hw/apic.c b/hw/apic.c
index 9d0f460..4b97b17 100644
--- a/hw/apic.c
+++ b/hw/apic.c
@@ -528,6 +528,8 @@ void apic_init_reset(DeviceState *d)
     s->initial_count_load_time = 0;
     s->next_time = 0;
     s->wait_for_sipi = 1;
+
+    qemu_del_timer(s->timer);
 }
 
 static void apic_startup(APICState *s, int vector_num)
-- 
1.7.3.4



[PATCH v7 13/18] kvm: Introduce core services for in-kernel irqchip support

2012-01-16 Thread Jan Kiszka
Add the basic infrastructure to activate in-kernel irqchip support, inject
interrupts into these models, and maintain IRQ routes.

Routing is optional and depends on the host arch supporting
KVM_CAP_IRQ_ROUTING. When it's not available on x86, we lose the HPET as
we can't route GSI0 to IOAPIC pin 2.

In-kernel irqchip support will eventually be controlled by the machine
property 'kernel_irqchip', but this is not yet wired up.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 kvm-all.c |  149 +
 kvm.h |8 +++
 target-i386/kvm.c |   11 
 3 files changed, 168 insertions(+), 0 deletions(-)

diff --git a/kvm-all.c b/kvm-all.c
index ac048bc..e91bb46 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -77,6 +77,13 @@ struct KVMState
     int pit_in_kernel;
     int xsave, xcrs;
     int many_ioeventfds;
+    int irqchip_inject_ioctl;
+#ifdef KVM_CAP_IRQ_ROUTING
+    struct kvm_irq_routing *irq_routes;
+    int nr_allocated_irq_routes;
+    uint32_t *used_gsi_bitmap;
+    unsigned int max_gsi;
+#endif
 };
 
 KVMState *kvm_state;
@@ -693,6 +700,138 @@ static void kvm_handle_interrupt(CPUState *env, int mask)
     }
 }
 
+int kvm_irqchip_set_irq(KVMState *s, int irq, int level)
+{
+    struct kvm_irq_level event;
+    int ret;
+
+    assert(s->irqchip_in_kernel);
+
+    event.level = level;
+    event.irq = irq;
+    ret = kvm_vm_ioctl(s, s->irqchip_inject_ioctl, &event);
+    if (ret < 0) {
+        perror("kvm_set_irqchip_line");
+        abort();
+    }
+
+    return (s->irqchip_inject_ioctl == KVM_IRQ_LINE) ? 1 : event.status;
+}
+
+#ifdef KVM_CAP_IRQ_ROUTING
+static void set_gsi(KVMState *s, unsigned int gsi)
+{
+    assert(gsi < s->max_gsi);
+
+    s->used_gsi_bitmap[gsi / 32] |= 1U << (gsi % 32);
+}
+
+static void kvm_init_irq_routing(KVMState *s)
+{
+    int gsi_count;
+
+    gsi_count = kvm_check_extension(s, KVM_CAP_IRQ_ROUTING);
+    if (gsi_count > 0) {
+        unsigned int gsi_bits, i;
+
+        /* Round up so we can search ints using ffs */
+        gsi_bits = (gsi_count + 31) / 32;
+        s->used_gsi_bitmap = g_malloc0(gsi_bits / 8);
+        s->max_gsi = gsi_bits;
+
+        /* Mark any over-allocated bits as already in use */
+        for (i = gsi_count; i < gsi_bits; i++) {
+            set_gsi(s, i);
+        }
+    }
+
+    s->irq_routes = g_malloc0(sizeof(*s->irq_routes));
+    s->nr_allocated_irq_routes = 0;
+
+    kvm_arch_init_irq_routing(s);
+}
+
+static void kvm_add_routing_entry(KVMState *s,
+                                  struct kvm_irq_routing_entry *entry)
+{
+    struct kvm_irq_routing_entry *new;
+    int n, size;
+
+    if (s->irq_routes->nr == s->nr_allocated_irq_routes) {
+        n = s->nr_allocated_irq_routes * 2;
+        if (n < 64) {
+            n = 64;
+        }
+        size = sizeof(struct kvm_irq_routing);
+        size += n * sizeof(*new);
+        s->irq_routes = g_realloc(s->irq_routes, size);
+        s->nr_allocated_irq_routes = n;
+    }
+    n = s->irq_routes->nr++;
+    new = &s->irq_routes->entries[n];
+    memset(new, 0, sizeof(*new));
+    new->gsi = entry->gsi;
+    new->type = entry->type;
+    new->flags = entry->flags;
+    new->u = entry->u;
+
+    set_gsi(s, entry->gsi);
+}
+
+void kvm_irqchip_add_route(KVMState *s, int irq, int irqchip, int pin)
+{
+    struct kvm_irq_routing_entry e;
+
+    e.gsi = irq;
+    e.type = KVM_IRQ_ROUTING_IRQCHIP;
+    e.flags = 0;
+    e.u.irqchip.irqchip = irqchip;
+    e.u.irqchip.pin = pin;
+    kvm_add_routing_entry(s, &e);
+}
+
+int kvm_irqchip_commit_routes(KVMState *s)
+{
+    s->irq_routes->flags = 0;
+    return kvm_vm_ioctl(s, KVM_SET_GSI_ROUTING, s->irq_routes);
+}
+
+#else /* !KVM_CAP_IRQ_ROUTING */
+
+static void kvm_init_irq_routing(KVMState *s)
+{
+}
+#endif /* !KVM_CAP_IRQ_ROUTING */
+
+static int kvm_irqchip_create(KVMState *s)
+{
+    QemuOptsList *list = qemu_find_opts("machine");
+    int ret;
+
+    if (QTAILQ_EMPTY(&list->head) ||
+        !qemu_opt_get_bool(QTAILQ_FIRST(&list->head),
+                           "kernel_irqchip", false) ||
+        !kvm_check_extension(s, KVM_CAP_IRQCHIP)) {
+        return 0;
+    }
+
+    ret = kvm_vm_ioctl(s, KVM_CREATE_IRQCHIP);
+    if (ret < 0) {
+        fprintf(stderr, "Create kernel irqchip failed\n");
+        return ret;
+    }
+
+    s->irqchip_inject_ioctl = KVM_IRQ_LINE;
+    if (kvm_check_extension(s, KVM_CAP_IRQ_INJECT_STATUS)) {
+        s->irqchip_inject_ioctl = KVM_IRQ_LINE_STATUS;
+    }
+    s->irqchip_in_kernel = 1;
+
+    kvm_init_irq_routing(s);
+
+    return 0;
+}
+
 int kvm_init(void)
 {
     static const char upgrade_note[] =
@@ -788,6 +927,11 @@ int kvm_init(void)
         goto err;
     }
 
+    ret = kvm_irqchip_create(s);
+    if (ret < 0) {
+        goto err;
+    }
+
     kvm_state = s;
     cpu_register_phys_memory_client(&kvm_cpu_phys_memory_client);
 
@@ -1123,6 +1267,11 @@ int kvm_has_many_ioeventfds(void)
     return kvm_state->many_ioeventfds;
 }
 }
 
+int kvm_has_gsi_routing(void)
+{
+return 
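
(The archived diff is cut off at this point.) The used-GSI bookkeeping can
be tried out on its own with the sketch below. It is not the patch's code:
GsiMap and the gsi_map_* helpers are hypothetical, calloc stands in for
g_malloc0, and the sketch rounds the bit count up to a whole number of
32-bit words, which is an assumption about what the "Round up" comment
intends rather than a transcription of the arithmetic in the hunk.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical bitmap of GSIs that already have a route. */
typedef struct GsiMap {
    uint32_t *words;
    unsigned int max_gsi;   /* number of bits, a multiple of 32 */
} GsiMap;

static void gsi_map_set(GsiMap *m, unsigned int gsi)
{
    assert(gsi < m->max_gsi);
    m->words[gsi / 32] |= 1U << (gsi % 32);
}

static int gsi_map_test(const GsiMap *m, unsigned int gsi)
{
    assert(gsi < m->max_gsi);
    return (m->words[gsi / 32] >> (gsi % 32)) & 1;
}

/* Returns 0 on success, -1 on a zero count or allocation failure. */
static int gsi_map_init(GsiMap *m, unsigned int gsi_count)
{
    unsigned int bits, i;

    if (gsi_count == 0) {
        return -1;
    }
    bits = (gsi_count + 31) / 32 * 32;   /* round up to whole words */
    m->words = calloc(bits / 32, sizeof(uint32_t));
    if (!m->words) {
        return -1;
    }
    m->max_gsi = bits;

    /* Padding bits beyond gsi_count are marked busy so they are never used. */
    for (i = gsi_count; i < bits; i++) {
        gsi_map_set(m, i);
    }
    return 0;
}
```

With 40 GSIs the map holds 64 bits; bits 40..63 start out set, so a later
search for a free GSI only ever finds numbers the kernel can accept.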

[PATCH v7 04/18] apic: Inject external NMI events via LINT1

2012-01-16 Thread Jan Kiszka
On real hardware, NMI button events are injected via the LINT1 line of
the APICs. E.g. kdump expects this wiring and gets upset if the per-APIC
LINT1 mask is not respected, i.e. if NMIs are injected to VCPUs that
should not receive them. Change the APIC emulation code to reflect this.

Based on qemu-kvm patch by Lai Jiangshan.

CC: Lai Jiangshan la...@cn.fujitsu.com
Reported-by: Kenji Kaneshige kaneshige.ke...@jp.fujitsu.com
Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 cpus.c|6 +-
 hw/apic.c |7 +++
 hw/apic.h |1 +
 3 files changed, 13 insertions(+), 1 deletions(-)

diff --git a/cpus.c b/cpus.c
index b421a71..857f96f 100644
--- a/cpus.c
+++ b/cpus.c
@@ -1225,7 +1225,11 @@ void qmp_inject_nmi(Error **errp)
     CPUState *env;
 
     for (env = first_cpu; env != NULL; env = env->next_cpu) {
-        cpu_interrupt(env, CPU_INTERRUPT_NMI);
+        if (!env->apic_state) {
+            cpu_interrupt(env, CPU_INTERRUPT_NMI);
+        } else {
+            apic_deliver_nmi(env->apic_state);
+        }
     }
 #else
     error_set(errp, QERR_UNSUPPORTED);
diff --git a/hw/apic.c b/hw/apic.c
index 4b97b17..b9d733c 100644
--- a/hw/apic.c
+++ b/hw/apic.c
@@ -205,6 +205,13 @@ void apic_deliver_pic_intr(DeviceState *d, int level)
     }
 }
 
+void apic_deliver_nmi(DeviceState *d)
+{
+    APICState *s = DO_UPCAST(APICState, busdev.qdev, d);
+
+    apic_local_deliver(s, APIC_LVT_LINT1);
+}
+
 #define foreach_apic(apic, deliver_bitmask, code) \
 {\
     int __i, __j, __mask;\
diff --git a/hw/apic.h b/hw/apic.h
index a5c910f..a62d83b 100644
--- a/hw/apic.h
+++ b/hw/apic.h
@@ -8,6 +8,7 @@ void apic_deliver_irq(uint8_t dest, uint8_t dest_mode, uint8_t delivery_mode,
                       uint8_t vector_num, uint8_t trigger_mode);
 int apic_accept_pic_intr(DeviceState *s);
 void apic_deliver_pic_intr(DeviceState *s, int level);
+void apic_deliver_nmi(DeviceState *d);
 int apic_get_interrupt(DeviceState *s);
 void apic_reset_irq_delivered(void);
 int apic_get_irq_delivered(void);
-- 
1.7.3.4



[PATCH v7 15/18] kvm: x86: Add user space part for in-kernel APIC

2012-01-16 Thread Jan Kiszka
This introduces the alternative APIC device which makes use of KVM's
in-kernel device model. External NMI injection via LINT1 is emulated by
checking the current state of the in-kernel APIC, only injecting an NMI
into the VCPU if LINT1 is unmasked and configured to DM_NMI.

MSI is not yet supported, so we disable this when the in-kernel model is
in use.

CC: Lai Jiangshan la...@cn.fujitsu.com
Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 Makefile.target   |2 +-
 hw/kvm/apic.c |  138 +
 hw/pc.c   |   15 --
 kvm.h |4 ++
 target-i386/kvm.c |   38 +++
 5 files changed, 191 insertions(+), 6 deletions(-)
 create mode 100644 hw/kvm/apic.c

diff --git a/Makefile.target b/Makefile.target
index 4fa91d3..5e5b5d1 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -228,7 +228,7 @@ obj-i386-y += vmport.o
 obj-i386-y += device-hotplug.o pci-hotplug.o smbios.o wdt_ib700.o
 obj-i386-y += debugcon.o multiboot.o
 obj-i386-y += pc_piix.o
-obj-i386-$(CONFIG_KVM) += kvm/clock.o
+obj-i386-$(CONFIG_KVM) += kvm/clock.o kvm/apic.o
 obj-i386-$(CONFIG_SPICE) += qxl.o qxl-logger.o qxl-render.o
 
 # shared objects
diff --git a/hw/kvm/apic.c b/hw/kvm/apic.c
new file mode 100644
index 000..6300695
--- /dev/null
+++ b/hw/kvm/apic.c
@@ -0,0 +1,138 @@
+/*
+ * KVM in-kernel APIC support
+ *
+ * Copyright (c) 2011 Siemens AG
+ *
+ * Authors:
+ *  Jan Kiszka  jan.kis...@siemens.com
+ *
+ * This work is licensed under the terms of the GNU GPL version 2.
+ * See the COPYING file in the top-level directory.
+ */
+#include "hw/apic_internal.h"
+#include "kvm.h"
+
+static inline void kvm_apic_set_reg(struct kvm_lapic_state *kapic,
+                                    int reg_id, uint32_t val)
+{
+    *((uint32_t *)(kapic->regs + (reg_id << 4))) = val;
+}
+
+static inline uint32_t kvm_apic_get_reg(struct kvm_lapic_state *kapic,
+                                        int reg_id)
+{
+    return *((uint32_t *)(kapic->regs + (reg_id << 4)));
+}
+
+void kvm_put_apic_state(DeviceState *d, struct kvm_lapic_state *kapic)
+{
+    APICCommonState *s = DO_UPCAST(APICCommonState, busdev.qdev, d);
+    int i;
+
+    memset(kapic, 0, sizeof(kapic));
+    kvm_apic_set_reg(kapic, 0x2, s->id << 24);
+    kvm_apic_set_reg(kapic, 0x8, s->tpr);
+    kvm_apic_set_reg(kapic, 0xd, s->log_dest << 24);
+    kvm_apic_set_reg(kapic, 0xe, s->dest_mode << 28 | 0x0fff);
+    kvm_apic_set_reg(kapic, 0xf, s->spurious_vec);
+    for (i = 0; i < 8; i++) {
+        kvm_apic_set_reg(kapic, 0x10 + i, s->isr[i]);
+        kvm_apic_set_reg(kapic, 0x18 + i, s->tmr[i]);
+        kvm_apic_set_reg(kapic, 0x20 + i, s->irr[i]);
+    }
+    kvm_apic_set_reg(kapic, 0x28, s->esr);
+    kvm_apic_set_reg(kapic, 0x30, s->icr[0]);
+    kvm_apic_set_reg(kapic, 0x31, s->icr[1]);
+    for (i = 0; i < APIC_LVT_NB; i++) {
+        kvm_apic_set_reg(kapic, 0x32 + i, s->lvt[i]);
+    }
+    kvm_apic_set_reg(kapic, 0x38, s->initial_count);
+    kvm_apic_set_reg(kapic, 0x3e, s->divide_conf);
+}
+
+void kvm_get_apic_state(DeviceState *d, struct kvm_lapic_state *kapic)
+{
+    APICCommonState *s = DO_UPCAST(APICCommonState, busdev.qdev, d);
+    int i, v;
+
+    s->id = kvm_apic_get_reg(kapic, 0x2) >> 24;
+    s->tpr = kvm_apic_get_reg(kapic, 0x8);
+    s->arb_id = kvm_apic_get_reg(kapic, 0x9);
+    s->log_dest = kvm_apic_get_reg(kapic, 0xd) >> 24;
+    s->dest_mode = kvm_apic_get_reg(kapic, 0xe) >> 28;
+    s->spurious_vec = kvm_apic_get_reg(kapic, 0xf);
+    for (i = 0; i < 8; i++) {
+        s->isr[i] = kvm_apic_get_reg(kapic, 0x10 + i);
+        s->tmr[i] = kvm_apic_get_reg(kapic, 0x18 + i);
+        s->irr[i] = kvm_apic_get_reg(kapic, 0x20 + i);
+    }
+    s->esr = kvm_apic_get_reg(kapic, 0x28);
+    s->icr[0] = kvm_apic_get_reg(kapic, 0x30);
+    s->icr[1] = kvm_apic_get_reg(kapic, 0x31);
+    for (i = 0; i < APIC_LVT_NB; i++) {
+        s->lvt[i] = kvm_apic_get_reg(kapic, 0x32 + i);
+    }
+    s->initial_count = kvm_apic_get_reg(kapic, 0x38);
+    s->divide_conf = kvm_apic_get_reg(kapic, 0x3e);
+
+    v = (s->divide_conf & 3) | ((s->divide_conf >> 1) & 4);
+    s->count_shift = (v + 1) & 7;
+
+    s->initial_count_load_time = qemu_get_clock_ns(vm_clock);
+    apic_next_timer(s, s->initial_count_load_time);
+}
+
+static void kvm_apic_set_base(APICCommonState *s, uint64_t val)
+{
+    s->apicbase = val;
+}
+
+static void kvm_apic_set_tpr(APICCommonState *s, uint8_t val)
+{
+    s->tpr = (val & 0x0f) << 4;
+}
+
+static void do_inject_external_nmi(void *data)
+{
+    APICCommonState *s = data;
+    CPUState *env = s->cpu_env;
+    uint32_t lvt;
+    int ret;
+
+    cpu_synchronize_state(env);
+
+    lvt = s->lvt[APIC_LVT_LINT1];
+    if (!(lvt & APIC_LVT_MASKED) && ((lvt >> 8) & 7) == APIC_DM_NMI) {
+        ret = kvm_vcpu_ioctl(env, KVM_NMI);
+        if (ret < 0) {
+            fprintf(stderr, "KVM: injection failed, NMI lost (%s)\n",
+                    strerror(-ret));
+        }
+    }
+}
+
+static void kvm_apic_external_nmi(APICCommonState *s)
+{
+
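
(The archived diff is cut off at this point.) The LINT1 test described in
the commit message, i.e. deliver the NMI only if the entry is unmasked and
its delivery mode is NMI, can be isolated as a pure predicate. The sketch
below is standalone: LVT_MASKED, DM_NMI and lint1_accepts_nmi are names
local to this example, with values following the local APIC LVT layout
(mask in bit 16, delivery mode in bits 8..10, NMI encoded as 4).

```c
#include <stdint.h>

#define LVT_MASKED   (1u << 16)   /* LVT mask bit */
#define DM_NMI       4u           /* delivery mode field value for NMI */

/* Nonzero if an external NMI should reach the VCPU for this LINT1 entry. */
static int lint1_accepts_nmi(uint32_t lvt)
{
    return !(lvt & LVT_MASKED) && ((lvt >> 8) & 7) == DM_NMI;
}
```

A caller would evaluate this on the LINT1 LVT value read back from the
in-kernel APIC and only then issue the NMI ioctl for that VCPU.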

[PATCH v7 02/18] kvm: Move kvmclock into hw/kvm folder

2012-01-16 Thread Jan Kiszka
More KVM-specific devices will come, so let's start with moving the
kvmclock into a dedicated folder.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 Makefile.target|4 ++--
 configure  |1 +
 hw/{kvmclock.c => kvm/clock.c} |4 ++--
 hw/{kvmclock.h => kvm/clock.h} |0
 hw/pc_piix.c   |2 +-
 5 files changed, 6 insertions(+), 5 deletions(-)
 rename hw/{kvmclock.c => kvm/clock.c} (98%)
 rename hw/{kvmclock.h => kvm/clock.h} (100%)

diff --git a/Makefile.target b/Makefile.target
index 3261383..974cd72 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -228,7 +228,7 @@ obj-i386-y += vmport.o
 obj-i386-y += device-hotplug.o pci-hotplug.o smbios.o wdt_ib700.o
 obj-i386-y += debugcon.o multiboot.o
 obj-i386-y += pc_piix.o
-obj-i386-$(CONFIG_KVM) += kvmclock.o
+obj-i386-$(CONFIG_KVM) += kvm/clock.o
 obj-i386-$(CONFIG_SPICE) += qxl.o qxl-logger.o qxl-render.o
 
 # shared objects
@@ -418,7 +418,7 @@ qmp-commands-old.h: $(SRC_PATH)/qmp-commands.hx
 
 clean:
    rm -f *.o *.a *~ $(PROGS) nwfpe/*.o fpu/*.o
-   rm -f *.d */*.d tcg/*.o ide/*.o 9pfs/*.o
+   rm -f *.d */*.d tcg/*.o ide/*.o 9pfs/*.o kvm/*.o
    rm -f hmp-commands.h qmp-commands-old.h gdbstub-xml.c
 ifdef CONFIG_TRACE_SYSTEMTAP
    rm -f *.stp
diff --git a/configure b/configure
index 640e815..11edff6 100755
--- a/configure
+++ b/configure
@@ -3384,6 +3384,7 @@ mkdir -p $target_dir/fpu
 mkdir -p $target_dir/tcg
 mkdir -p $target_dir/ide
 mkdir -p $target_dir/9pfs
+mkdir -p $target_dir/kvm
 if test "$target" = "arm-linux-user" -o "$target" = "armeb-linux-user" -o "$target" = "arm-bsd-user" -o "$target" = "armeb-bsd-user" ; then
   mkdir -p $target_dir/nwfpe
 fi
diff --git a/hw/kvmclock.c b/hw/kvm/clock.c
similarity index 98%
rename from hw/kvmclock.c
rename to hw/kvm/clock.c
index 5388bc4..5983271 100644
--- a/hw/kvmclock.c
+++ b/hw/kvm/clock.c
@@ -13,9 +13,9 @@
 
 #include "qemu-common.h"
 #include "sysemu.h"
-#include "sysbus.h"
 #include "kvm.h"
-#include "kvmclock.h"
+#include "hw/sysbus.h"
+#include "hw/kvm/clock.h"
 
 #include <linux/kvm.h>
 #include <linux/kvm_para.h>
diff --git a/hw/kvmclock.h b/hw/kvm/clock.h
similarity index 100%
rename from hw/kvmclock.h
rename to hw/kvm/clock.h
diff --git a/hw/pc_piix.c b/hw/pc_piix.c
index b70431f..f44f00e 100644
--- a/hw/pc_piix.c
+++ b/hw/pc_piix.c
@@ -34,7 +34,7 @@
 #include "boards.h"
 #include "ide.h"
 #include "kvm.h"
-#include "kvmclock.h"
+#include "kvm/clock.h"
 #include "sysemu.h"
 #include "sysbus.h"
 #include "arch_init.h"
-- 
1.7.3.4



[PATCH v7 11/18] ioapic: Factor out base class for KVM reuse

2012-01-16 Thread Jan Kiszka
Split up the IOAPIC analogously to APIC and i8259. KVM will share the
IOAPICCommonState, the vmstate, reset logic and certain init parts with
the user space model.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 Makefile.target  |2 +-
 hw/ioapic.c  |  130 ++---
 hw/ioapic_common.c   |  104 
 hw/ioapic_internal.h |   97 +
 4 files changed, 218 insertions(+), 115 deletions(-)
 create mode 100644 hw/ioapic_common.c
 create mode 100644 hw/ioapic_internal.h

diff --git a/Makefile.target b/Makefile.target
index a8acece..4fa91d3 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -223,7 +223,7 @@ obj-$(CONFIG_IVSHMEM) += ivshmem.o
 # Hardware support
 obj-i386-y += vga.o
 obj-i386-y += mc146818rtc.o pc.o
-obj-i386-y += cirrus_vga.o sga.o apic_common.o apic.o ioapic.o piix_pci.o
+obj-i386-y += cirrus_vga.o sga.o apic_common.o apic.o ioapic_common.o ioapic.o piix_pci.o
 obj-i386-y += vmport.o
 obj-i386-y += device-hotplug.o pci-hotplug.o smbios.o wdt_ib700.o
 obj-i386-y += debugcon.o multiboot.o
diff --git a/hw/ioapic.c b/hw/ioapic.c
index 0743af6..0c8be50 100644
--- a/hw/ioapic.c
+++ b/hw/ioapic.c
@@ -24,9 +24,7 @@
 #include pc.h
 #include apic.h
 #include ioapic.h
-#include qemu-timer.h
-#include host-utils.h
-#include sysbus.h
+#include ioapic_internal.h
 
 //#define DEBUG_IOAPIC
 
@@ -37,65 +35,9 @@
 #define DPRINTF(fmt, ...)
 #endif
 
-#define MAX_IOAPICS 1
+static IOAPICCommonState *ioapics[MAX_IOAPICS];
 
-#define IOAPIC_VERSION                  0x11
-
-#define IOAPIC_LVT_DEST_SHIFT           56
-#define IOAPIC_LVT_MASKED_SHIFT         16
-#define IOAPIC_LVT_TRIGGER_MODE_SHIFT   15
-#define IOAPIC_LVT_REMOTE_IRR_SHIFT     14
-#define IOAPIC_LVT_POLARITY_SHIFT       13
-#define IOAPIC_LVT_DELIV_STATUS_SHIFT   12
-#define IOAPIC_LVT_DEST_MODE_SHIFT      11
-#define IOAPIC_LVT_DELIV_MODE_SHIFT     8
-
-#define IOAPIC_LVT_MASKED               (1 << IOAPIC_LVT_MASKED_SHIFT)
-#define IOAPIC_LVT_REMOTE_IRR           (1 << IOAPIC_LVT_REMOTE_IRR_SHIFT)
-
-#define IOAPIC_TRIGGER_EDGE             0
-#define IOAPIC_TRIGGER_LEVEL            1
-
-/*io{apic,sapic} delivery mode*/
-#define IOAPIC_DM_FIXED                 0x0
-#define IOAPIC_DM_LOWEST_PRIORITY       0x1
-#define IOAPIC_DM_PMI                   0x2
-#define IOAPIC_DM_NMI                   0x4
-#define IOAPIC_DM_INIT                  0x5
-#define IOAPIC_DM_SIPI                  0x6
-#define IOAPIC_DM_EXTINT                0x7
-#define IOAPIC_DM_MASK                  0x7
-
-#define IOAPIC_VECTOR_MASK              0xff
-
-#define IOAPIC_IOREGSEL                 0x00
-#define IOAPIC_IOWIN                    0x10
-
-#define IOAPIC_REG_ID                   0x00
-#define IOAPIC_REG_VER                  0x01
-#define IOAPIC_REG_ARB                  0x02
-#define IOAPIC_REG_REDTBL_BASE          0x10
-#define IOAPIC_ID                       0x00
-
-#define IOAPIC_ID_SHIFT                 24
-#define IOAPIC_ID_MASK                  0xf
-
-#define IOAPIC_VER_ENTRIES_SHIFT        16
-
-typedef struct IOAPICState IOAPICState;
-
-struct IOAPICState {
-    SysBusDevice busdev;
-    MemoryRegion io_memory;
-    uint8_t id;
-    uint8_t ioregsel;
-    uint32_t irr;
-    uint64_t ioredtbl[IOAPIC_NUM_PINS];
-};
-
-static IOAPICState *ioapics[MAX_IOAPICS];
-
-static void ioapic_service(IOAPICState *s)
+static void ioapic_service(IOAPICCommonState *s)
 {
 uint8_t i;
 uint8_t trig_mode;
@@ -135,7 +77,7 @@ static void ioapic_service(IOAPICState *s)
 
 static void ioapic_set_irq(void *opaque, int vector, int level)
 {
-IOAPICState *s = opaque;
+IOAPICCommonState *s = opaque;
 
 /* ISA IRQs map to GSI 1-1 except for IRQ0 which maps
  * to GSI 2.  GSI maps to ioapic 1-1.  This is not
@@ -174,7 +116,7 @@ static void ioapic_set_irq(void *opaque, int vector, int level)
 
 void ioapic_eoi_broadcast(int vector)
 {
-IOAPICState *s;
+IOAPICCommonState *s;
 uint64_t entry;
 int i, n;
 
@@ -199,7 +141,7 @@ void ioapic_eoi_broadcast(int vector)
 static uint64_t
 ioapic_mem_read(void *opaque, target_phys_addr_t addr, unsigned int size)
 {
-IOAPICState *s = opaque;
+IOAPICCommonState *s = opaque;
 int index;
 uint32_t val = 0;
 
@@ -242,7 +184,7 @@ static void
 ioapic_mem_write(void *opaque, target_phys_addr_t addr, uint64_t val,
  unsigned int size)
 {
-IOAPICState *s = opaque;
+IOAPICCommonState *s = opaque;
 int index;
 
     switch (addr & 0xff) {
@@ -278,71 +220,31 @@ ioapic_mem_write(void *opaque, target_phys_addr_t addr, uint64_t val,
 }
 }
 
-static const VMStateDescription vmstate_ioapic = {
-    .name = "ioapic",
-    .version_id = 3,
-    .minimum_version_id = 1,
-    .minimum_version_id_old = 1,
-    .fields = (VMStateField[]) {
-        VMSTATE_UINT8(id, IOAPICState),
-        VMSTATE_UINT8(ioregsel, IOAPICState),
-

[PATCH v7 16/18] kvm: x86: Add user space part for in-kernel i8259

2012-01-16 Thread Jan Kiszka
Introduce the alternative 'kvm-i8259' device model that exploits KVM
in-kernel acceleration.

The PIIX3 initialization code is furthermore extended by KVM specific
IRQ route setup. GSI injection differs in KVM mode from the user space
model. As we can dispatch ISA-range IRQs to both IOAPIC and PIC inside
the kernel, we do not need to inject them separately. This is reflected
by a KVM-specific GSI handler.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 Makefile.target |2 +-
 hw/kvm/i8259.c  |  128 +++
 hw/pc.h |1 +
 hw/pc_piix.c|   50 --
 4 files changed, 176 insertions(+), 5 deletions(-)
 create mode 100644 hw/kvm/i8259.c

diff --git a/Makefile.target b/Makefile.target
index 5e5b5d1..f49f96e 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -228,7 +228,7 @@ obj-i386-y += vmport.o
 obj-i386-y += device-hotplug.o pci-hotplug.o smbios.o wdt_ib700.o
 obj-i386-y += debugcon.o multiboot.o
 obj-i386-y += pc_piix.o
-obj-i386-$(CONFIG_KVM) += kvm/clock.o kvm/apic.o
+obj-i386-$(CONFIG_KVM) += kvm/clock.o kvm/apic.o kvm/i8259.o
 obj-i386-$(CONFIG_SPICE) += qxl.o qxl-logger.o qxl-render.o
 
 # shared objects
diff --git a/hw/kvm/i8259.c b/hw/kvm/i8259.c
new file mode 100644
index 000..3bfcd1f
--- /dev/null
+++ b/hw/kvm/i8259.c
@@ -0,0 +1,128 @@
+/*
+ * KVM in-kernel PIC (i8259) support
+ *
+ * Copyright (c) 2011 Siemens AG
+ *
+ * Authors:
+ *  Jan Kiszka  jan.kis...@siemens.com
+ *
+ * This work is licensed under the terms of the GNU GPL version 2.
+ * See the COPYING file in the top-level directory.
+ */
+#include "hw/i8259_internal.h"
+#include "hw/apic_internal.h"
+#include "kvm.h"
+
+static void kvm_pic_get(PICCommonState *s)
+{
+    struct kvm_irqchip chip;
+    struct kvm_pic_state *kpic;
+    int ret;
+
+    chip.chip_id = s->master ? KVM_IRQCHIP_PIC_MASTER : KVM_IRQCHIP_PIC_SLAVE;
+    ret = kvm_vm_ioctl(kvm_state, KVM_GET_IRQCHIP, &chip);
+    if (ret < 0) {
+        fprintf(stderr, "KVM_GET_IRQCHIP failed: %s\n", strerror(ret));
+        abort();
+    }
+
+    kpic = &chip.chip.pic;
+
+    s->last_irr = kpic->last_irr;
+    s->irr = kpic->irr;
+    s->imr = kpic->imr;
+    s->isr = kpic->isr;
+    s->priority_add = kpic->priority_add;
+    s->irq_base = kpic->irq_base;
+    s->read_reg_select = kpic->read_reg_select;
+    s->poll = kpic->poll;
+    s->special_mask = kpic->special_mask;
+    s->init_state = kpic->init_state;
+    s->auto_eoi = kpic->auto_eoi;
+    s->rotate_on_auto_eoi = kpic->rotate_on_auto_eoi;
+    s->special_fully_nested_mode = kpic->special_fully_nested_mode;
+    s->init4 = kpic->init4;
+    s->elcr = kpic->elcr;
+    s->elcr_mask = kpic->elcr_mask;
+}
+
+static void kvm_pic_put(PICCommonState *s)
+{
+    struct kvm_irqchip chip;
+    struct kvm_pic_state *kpic;
+    int ret;
+
+    chip.chip_id = s->master ? KVM_IRQCHIP_PIC_MASTER : KVM_IRQCHIP_PIC_SLAVE;
+
+    kpic = &chip.chip.pic;
+
+    kpic->last_irr = s->last_irr;
+    kpic->irr = s->irr;
+    kpic->imr = s->imr;
+    kpic->isr = s->isr;
+    kpic->priority_add = s->priority_add;
+    kpic->irq_base = s->irq_base;
+    kpic->read_reg_select = s->read_reg_select;
+    kpic->poll = s->poll;
+    kpic->special_mask = s->special_mask;
+    kpic->init_state = s->init_state;
+    kpic->auto_eoi = s->auto_eoi;
+    kpic->rotate_on_auto_eoi = s->rotate_on_auto_eoi;
+    kpic->special_fully_nested_mode = s->special_fully_nested_mode;
+    kpic->init4 = s->init4;
+    kpic->elcr = s->elcr;
+    kpic->elcr_mask = s->elcr_mask;
+
+    ret = kvm_vm_ioctl(kvm_state, KVM_SET_IRQCHIP, &chip);
+    if (ret < 0) {
+        fprintf(stderr, "KVM_GET_IRQCHIP failed: %s\n", strerror(ret));
+        abort();
+    }
+}
+
+static void kvm_pic_reset(DeviceState *dev)
+{
+    PICCommonState *s = container_of(dev, PICCommonState, dev.qdev);
+
+    pic_reset_common(s);
+    s->elcr = 0;
+
+    kvm_pic_put(s);
+}
+
+static void kvm_pic_set_irq(void *opaque, int irq, int level)
+{
+    int delivered;
+
+    delivered = kvm_irqchip_set_irq(kvm_state, irq, level);
+    apic_report_irq_delivered(delivered);
+}
+
+static void kvm_pic_init(PICCommonState *s)
+{
+    memory_region_init_reservation(&s->base_io, "kvm-pic", 2);
+    memory_region_init_reservation(&s->elcr_io, "kvm-elcr", 1);
+}
+
+qemu_irq *kvm_i8259_init(ISABus *bus)
+{
+    i8259_init_chip("kvm-i8259", bus, true);
+    i8259_init_chip("kvm-i8259", bus, false);
+
+    return qemu_allocate_irqs(kvm_pic_set_irq, NULL, ISA_NUM_IRQS);
+}
+
+static PICCommonInfo kvm_i8259_info = {
+    .isadev.qdev.name  = "kvm-i8259",
+    .isadev.qdev.reset = kvm_pic_reset,
+    .init       = kvm_pic_init,
+    .pre_save   = kvm_pic_get,
+    .post_load  = kvm_pic_put,
+};
+
+static void kvm_pic_register(void)
+{
+    pic_qdev_register(&kvm_i8259_info);
+}
+
+device_init(kvm_pic_register)
diff --git a/hw/pc.h b/hw/pc.h
index ece069a..5e913db 100644
--- a/hw/pc.h
+++ b/hw/pc.h
@@ -64,6 +64,7 @@ bool parallel_mm_init(MemoryRegion *address_space,
 
 extern DeviceState *isa_pic;
 qemu_irq 

[PATCH v7 18/18] kvm: Activate in-kernel irqchip support

2012-01-16 Thread Jan Kiszka
Make the basic in-kernel irqchip support selectable via
-machine ...,kernel_irqchip=on. Leave it off by default until it can
fully replace user space models.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 qemu-config.c   |4 
 qemu-options.hx |5 -
 2 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/qemu-config.c b/qemu-config.c
index 18f3020..bd93d6f 100644
--- a/qemu-config.c
+++ b/qemu-config.c
@@ -518,6 +518,10 @@ static QemuOptsList qemu_machine_opts = {
             .name = "accel",
             .type = QEMU_OPT_STRING,
             .help = "accelerator list",
+        }, {
+            .name = "kernel_irqchip",
+            .type = QEMU_OPT_BOOL,
+            .help = "use KVM in-kernel irqchip",
         },
         { /* End of list */ }
     },
diff --git a/qemu-options.hx b/qemu-options.hx
index a60191f..f293a50 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -31,7 +31,8 @@ DEF("machine", HAS_ARG, QEMU_OPTION_machine, \
     "-machine [type=]name[,prop[=value][,...]]\n"
     "                selects emulated machine (-machine ? for list)\n"
     "                property accel=accel1[:accel2[:...]] selects accelerator\n"
-    "                supported accelerators are kvm, xen, tcg (default: tcg)\n",
+    "                supported accelerators are kvm, xen, tcg (default: tcg)\n"
+    "                kernel_irqchip=on|off controls accelerated irqchip support\n",
     QEMU_ARCH_ALL)
 STEXI
 @item -machine [type=]@var{name}[,prop=@var{value}[,...]]
@@ -44,6 +45,8 @@ This is used to enable an accelerator. Depending on the target architecture,
 kvm, xen, or tcg can be available. By default, tcg is used. If there is more
 than one accelerator specified, the next one is used if the previous one fails
 to initialize.
+@item kernel_irqchip=on|off
+Enables in-kernel irqchip support for the chosen accelerator when available.
 @end table
 ETEXI
 
-- 
1.7.3.4
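
For readers who want to see the on|off semantics in isolation, here is a
minimal standalone sketch of looking up a boolean property in a -machine
style option string. machine_opt_bool() is a hypothetical helper written for
this note, not QEMU's QemuOpts code, and the "absent means default" rule is
only assumed to mirror the "off by default" behaviour described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper, not QEMU's QemuOpts API: look up a boolean
 * "name=on|off" property in a comma-separated option string. Returns
 * default_value if the property is absent or has any other value. */
static bool machine_opt_bool(const char *opts, const char *name,
                             bool default_value)
{
    size_t name_len;
    const char *p = opts;

    assert(opts != NULL && name != NULL);
    name_len = strlen(name);

    while (p != NULL && *p != '\0') {
        const char *end = strchr(p, ',');
        size_t len = end ? (size_t)(end - p) : strlen(p);

        /* Match "name=" at the start of this comma-separated token. */
        if (len > name_len && strncmp(p, name, name_len) == 0 &&
            p[name_len] == '=') {
            const char *val = p + name_len + 1;
            size_t val_len = len - name_len - 1;

            if (val_len == 2 && strncmp(val, "on", 2) == 0) {
                return true;
            }
            if (val_len == 3 && strncmp(val, "off", 3) == 0) {
                return false;
            }
            return default_value;
        }
        p = end ? end + 1 : NULL;
    }
    return default_value;
}
```

With this sketch, machine_opt_bool("pc,accel=kvm,kernel_irqchip=on",
"kernel_irqchip", false) yields true, while the same call without the
property falls back to false.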



[PATCH v7 06/18] apic: Factor out base class for KVM reuse

2012-01-16 Thread Jan Kiszka
The KVM in-kernel APIC model will reuse parts of the user space model
while providing the same frontend view to guest and most management
interfaces.

Factor out an APIC base class to encapsulate those parts that will be
shared by user space and KVM model. This class offers callback hooks for
init, base/tpr setting, and the external NMI delivery that will be
set via APICCommonInfo structure and implemented specifically in the
subclasses.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 Makefile.target|2 +-
 hw/apic.c  |  338 +++-
 hw/apic.h  |1 -
 hw/apic_common.c   |  252 ++
 hw/apic_internal.h |  112 +
 5 files changed, 406 insertions(+), 299 deletions(-)
 create mode 100644 hw/apic_common.c
 create mode 100644 hw/apic_internal.h

diff --git a/Makefile.target b/Makefile.target
index 974cd72..a8acece 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -223,7 +223,7 @@ obj-$(CONFIG_IVSHMEM) += ivshmem.o
 # Hardware support
 obj-i386-y += vga.o
 obj-i386-y += mc146818rtc.o pc.o
-obj-i386-y += cirrus_vga.o sga.o apic.o ioapic.o piix_pci.o
+obj-i386-y += cirrus_vga.o sga.o apic_common.o apic.o ioapic.o piix_pci.o
 obj-i386-y += vmport.o
 obj-i386-y += device-hotplug.o pci-hotplug.o smbios.o wdt_ib700.o
 obj-i386-y += debugcon.o multiboot.o
diff --git a/hw/apic.c b/hw/apic.c
index bec493b..387a469 100644
--- a/hw/apic.c
+++ b/hw/apic.c
@@ -16,53 +16,13 @@
  * You should have received a copy of the GNU Lesser General Public
  * License along with this library; if not, see http://www.gnu.org/licenses/
  */
-#include "hw.h"
+#include "apic_internal.h"
 #include "apic.h"
 #include "ioapic.h"
-#include "qemu-timer.h"
 #include "host-utils.h"
-#include "sysbus.h"
 #include "trace.h"
 #include "pc.h"
 
-/* APIC Local Vector Table */
-#define APIC_LVT_TIMER   0
-#define APIC_LVT_THERMAL 1
-#define APIC_LVT_PERFORM 2
-#define APIC_LVT_LINT0   3
-#define APIC_LVT_LINT1   4
-#define APIC_LVT_ERROR   5
-#define APIC_LVT_NB      6
-
-/* APIC delivery modes */
-#define APIC_DM_FIXED  0
-#define APIC_DM_LOWPRI 1
-#define APIC_DM_SMI    2
-#define APIC_DM_NMI    4
-#define APIC_DM_INIT   5
-#define APIC_DM_SIPI   6
-#define APIC_DM_EXTINT 7
-
-/* APIC destination mode */
-#define APIC_DESTMODE_FLAT 0xf
-#define APIC_DESTMODE_CLUSTER  1
-
-#define APIC_TRIGGER_EDGE  0
-#define APIC_TRIGGER_LEVEL 1
-
-#define APIC_LVT_TIMER_PERIODIC (1<<17)
-#define APIC_LVT_MASKED         (1<<16)
-#define APIC_LVT_LEVEL_TRIGGER  (1<<15)
-#define APIC_LVT_REMOTE_IRR     (1<<14)
-#define APIC_INPUT_POLARITY     (1<<13)
-#define APIC_SEND_PENDING       (1<<12)
-
-#define ESR_ILLEGAL_ADDRESS (1 << 7)
-
-#define APIC_SV_DIRECTED_IO (1<<12)
-#define APIC_SV_ENABLE      (1<<8)
-
-#define MAX_APICS 255
 #define MAX_APIC_WORDS 8
 
 /* Intel APIC constants: from include/asm/msidef.h */
@@ -75,43 +35,10 @@
 #define MSI_ADDR_DEST_ID_SHIFT 12
 #defineMSI_ADDR_DEST_ID_MASK   0x000
 
-#define MSI_ADDR_SIZE   0x10
-
-typedef struct APICState APICState;
-
-struct APICState {
-SysBusDevice busdev;
-MemoryRegion io_memory;
-void *cpu_env;
-uint32_t apicbase;
-uint8_t id;
-uint8_t arb_id;
-uint8_t tpr;
-uint32_t spurious_vec;
-uint8_t log_dest;
-uint8_t dest_mode;
-uint32_t isr[8];  /* in service register */
-uint32_t tmr[8];  /* trigger mode register */
-uint32_t irr[8]; /* interrupt request register */
-uint32_t lvt[APIC_LVT_NB];
-uint32_t esr; /* error register */
-uint32_t icr[2];
-
-uint32_t divide_conf;
-int count_shift;
-uint32_t initial_count;
-int64_t initial_count_load_time, next_time;
-uint32_t idx;
-QEMUTimer *timer;
-int sipi_vector;
-int wait_for_sipi;
-};
-
-static APICState *local_apics[MAX_APICS + 1];
-static int apic_irq_delivered;
+static APICCommonState *local_apics[MAX_APICS + 1];
 
-static void apic_set_irq(APICState *s, int vector_num, int trigger_mode);
-static void apic_update_irq(APICState *s);
+static void apic_set_irq(APICCommonState *s, int vector_num, int trigger_mode);
+static void apic_update_irq(APICCommonState *s);
 static void apic_get_delivery_bitmask(uint32_t *deliver_bitmask,
   uint8_t dest, uint8_t dest_mode);
 
@@ -151,7 +78,7 @@ static inline int get_bit(uint32_t *tab, int index)
     return !!(tab[i] & mask);
 }
 
-static void apic_local_deliver(APICState *s, int vector)
+static void apic_local_deliver(APICCommonState *s, int vector)
 {
     uint32_t lvt = s->lvt[vector];
 int trigger_mode;
@@ -185,7 +112,7 @@ static void apic_local_deliver(APICState *s, int vector)
 
 void apic_deliver_pic_intr(DeviceState *d, int level)
 {
-APICState *s = DO_UPCAST(APICState, busdev.qdev, d);
+APICCommonState *s = 

[PATCH v7 07/18] apic: Open-code timer save/restore

2012-01-16 Thread Jan Kiszka
To enable migration between accelerated and non-accelerated APIC models,
we will need to handle the timer saving and restoring specially and can
no longer rely on the automatics of VMSTATE_TIMER. Specifically, the
accelerated model will not start any QEMUTimer.

This patch therefore factors out the generic bits into apic_next_timer
and uses a post-load callback to implement model-specific logic.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 hw/apic.c  |   30 +++-
 hw/apic_common.c   |   54 ++-
 hw/apic_internal.h |3 ++
 3 files changed, 67 insertions(+), 20 deletions(-)

diff --git a/hw/apic.c b/hw/apic.c
index 387a469..e59c964 100644
--- a/hw/apic.c
+++ b/hw/apic.c
@@ -521,25 +521,9 @@ static uint32_t apic_get_current_count(APICCommonState *s)
 
 static void apic_timer_update(APICCommonState *s, int64_t current_time)
 {
-    int64_t next_time, d;
-
-    if (!(s->lvt[APIC_LVT_TIMER] & APIC_LVT_MASKED)) {
-        d = (current_time - s->initial_count_load_time) >>
-            s->count_shift;
-        if (s->lvt[APIC_LVT_TIMER] & APIC_LVT_TIMER_PERIODIC) {
-            if (!s->initial_count)
-                goto no_timer;
-            d = ((d / ((uint64_t)s->initial_count + 1)) + 1) * ((uint64_t)s->initial_count + 1);
-        } else {
-            if (d >= s->initial_count)
-                goto no_timer;
-            d = (uint64_t)s->initial_count + 1;
-        }
-        next_time = s->initial_count_load_time + (d << s->count_shift);
-        qemu_mod_timer(s->timer, next_time);
-        s->next_time = next_time;
+    if (apic_next_timer(s, current_time)) {
+        qemu_mod_timer(s->timer, s->next_time);
     } else {
-    no_timer:
         qemu_del_timer(s->timer);
     }
 }
@@ -753,6 +737,15 @@ static void apic_mem_writel(void *opaque, target_phys_addr_t addr, uint32_t val)
     }
 }
 
+static void apic_post_load(APICCommonState *s)
+{
+    if (s->timer_expiry != -1) {
+        qemu_mod_timer(s->timer, s->timer_expiry);
+    } else {
+        qemu_del_timer(s->timer);
+    }
+}
+
 static const MemoryRegionOps apic_io_ops = {
 .old_mmio = {
 .read = { apic_mem_readb, apic_mem_readw, apic_mem_readl, },
@@ -776,6 +769,7 @@ static APICCommonInfo apic_info = {
 .set_base = apic_set_base,
 .set_tpr = apic_set_tpr,
 .external_nmi = apic_external_nmi,
+.post_load = apic_post_load,
 };
 
 static void apic_register_devices(void)
diff --git a/hw/apic_common.c b/hw/apic_common.c
index eef977f..e05369c 100644
--- a/hw/apic_common.c
+++ b/hw/apic_common.c
@@ -93,6 +93,39 @@ void apic_deliver_nmi(DeviceState *d)
     info->external_nmi(s);
 }
 
+bool apic_next_timer(APICCommonState *s, int64_t current_time)
+{
+    int64_t d;
+
+    /* We need to store the timer state separately to support APIC
+     * implementations that maintain a non-QEMU timer, e.g. inside the
+     * host kernel. This open-coded state allows us to migrate between
+     * both models. */
+    s->timer_expiry = -1;
+
+    if (s->lvt[APIC_LVT_TIMER] & APIC_LVT_MASKED) {
+        return false;
+    }
+
+    d = (current_time - s->initial_count_load_time) >> s->count_shift;
+
+    if (s->lvt[APIC_LVT_TIMER] & APIC_LVT_TIMER_PERIODIC) {
+        if (!s->initial_count) {
+            return false;
+        }
+        d = ((d / ((uint64_t)s->initial_count + 1)) + 1) *
+            ((uint64_t)s->initial_count + 1);
+    } else {
+        if (d >= s->initial_count) {
+            return false;
+        }
+        d = (uint64_t)s->initial_count + 1;
+    }
+    s->next_time = s->initial_count_load_time + (d << s->count_shift);
+    s->timer_expiry = s->next_time;
+    return true;
+}
+
 void apic_init_reset(DeviceState *d)
 {
 APICCommonState *s = DO_UPCAST(APICCommonState, busdev.qdev, d);
@@ -120,7 +153,10 @@ void apic_init_reset(DeviceState *d)
     s->next_time = 0;
     s->wait_for_sipi = 1;
 
-    qemu_del_timer(s->timer);
+    if (s->timer) {
+        qemu_del_timer(s->timer);
+    }
+    s->timer_expiry = -1;
 }
 
 static void apic_reset_common(DeviceState *d)
@@ -203,12 +239,25 @@ static int apic_init_common(SysBusDevice *dev)
 return 0;
 }
 
+static int apic_dispatch_post_load(void *opaque, int version_id)
+{
+    APICCommonState *s = opaque;
+    APICCommonInfo *info =
+        DO_UPCAST(APICCommonInfo, busdev.qdev, s->busdev.qdev.info);
+
+    if (info->post_load) {
+        info->post_load(s);
+    }
+    return 0;
+}
+
 static const VMStateDescription vmstate_apic_common = {
     .name = "apic",
 .version_id = 3,
 .minimum_version_id = 3,
 .minimum_version_id_old = 1,
 .load_state_old = apic_load_old,
+.post_load = apic_dispatch_post_load,
 .fields = (VMStateField[]) {
 VMSTATE_UINT32(apicbase, APICCommonState),
 VMSTATE_UINT8(id, APICCommonState),
@@ -228,7 +277,8 @@ static const VMStateDescription vmstate_apic_common = {
 VMSTATE_UINT32(initial_count, APICCommonState),
 VMSTATE_INT64(initial_count_load_time, 
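
A worked example may help when reviewing the expiry arithmetic factored out
into apic_next_timer above. The following is only a standalone sketch:
struct timer_state and next_timer() are hypothetical stand-ins for the
timer-related fields of APICCommonState and for apic_next_timer, with the LVT
mask and periodic bits reduced to two booleans.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the timer-related fields of APICCommonState. */
struct timer_state {
    bool masked;                     /* LVT timer entry is masked */
    bool periodic;                   /* periodic rather than one-shot */
    uint32_t initial_count;
    int count_shift;                 /* divider expressed as a shift */
    int64_t initial_count_load_time;
    int64_t timer_expiry;            /* -1 when no timer is pending */
};

/* Compute the next expiry; returns false if no timer needs to be armed. */
static bool next_timer(struct timer_state *s, int64_t current_time)
{
    uint64_t period = (uint64_t)s->initial_count + 1;
    int64_t d;

    assert(s->count_shift >= 0 && s->count_shift < 32);
    s->timer_expiry = -1;
    if (s->masked) {
        return false;
    }

    /* Elapsed time since the count was loaded, in timer ticks. */
    d = (current_time - s->initial_count_load_time) >> s->count_shift;
    if (s->periodic) {
        if (!s->initial_count) {
            return false;
        }
        /* Round up to the start of the next full period. */
        d = (int64_t)((((uint64_t)d / period) + 1) * period);
    } else {
        if (d >= (int64_t)s->initial_count) {
            return false;            /* one-shot timer already expired */
        }
        d = (int64_t)period;
    }
    s->timer_expiry = s->initial_count_load_time + (d << s->count_shift);
    return true;
}
```

For instance, with load time 1000, shift 1 and initial count 9, a periodic
timer queried at time 1050 has seen 25 ticks, so the next period boundary is
tick 30 and the expiry is 1000 + (30 << 1) = 1060.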

[PATCH v7 09/18] i8259: Factor out base class for KVM reuse

2012-01-16 Thread Jan Kiszka
Analogously to the APIC, we will reuse some parts of the user space
i8259 model for KVM. The base class provides a common device state, the
vmstate, the property list, a reset core and some shared init bits.

This also introduces a common helper to instantiate a single i8259 chip
from the cascade-creating i8259_init function.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 Makefile.objs   |2 +-
 hw/i8259.c  |  158 ++-
 hw/i8259_common.c   |  147 +++
 hw/i8259_internal.h |   76 
 4 files changed, 254 insertions(+), 129 deletions(-)
 create mode 100644 hw/i8259_common.c
 create mode 100644 hw/i8259_internal.h

diff --git a/Makefile.objs b/Makefile.objs
index f753d83..39c994b 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -220,7 +220,7 @@ hw-obj-$(CONFIG_APPLESMC) += applesmc.o
 hw-obj-$(CONFIG_SMARTCARD) += usb-ccid.o ccid-card-passthru.o
 hw-obj-$(CONFIG_SMARTCARD_NSS) += ccid-card-emulated.o
 hw-obj-$(CONFIG_USB_REDIR) += usb-redir.o
-hw-obj-$(CONFIG_I8259) += i8259.o
+hw-obj-$(CONFIG_I8259) += i8259_common.o i8259.o
 
 # PPC devices
 hw-obj-$(CONFIG_PREP_PCI) += prep_pci.o
diff --git a/hw/i8259.c b/hw/i8259.c
index cfaa35c..3005ce2 100644
--- a/hw/i8259.c
+++ b/hw/i8259.c
@@ -26,6 +26,7 @@
 #include "isa.h"
 #include "monitor.h"
 #include "qemu-timer.h"
+#include "i8259_internal.h"
 
 /* debug PIC */
 //#define DEBUG_PIC
@@ -40,35 +41,6 @@
 //#define DEBUG_IRQ_LATENCY
 //#define DEBUG_IRQ_COUNT
 
-typedef struct PicState PicState;
-
-struct PicState {
-ISADevice dev;
-uint8_t last_irr; /* edge detection */
-uint8_t irr; /* interrupt request register */
-uint8_t imr; /* interrupt mask register */
-uint8_t isr; /* interrupt service register */
-uint8_t priority_add; /* highest irq priority */
-uint8_t irq_base;
-uint8_t read_reg_select;
-uint8_t poll;
-uint8_t special_mask;
-uint8_t init_state;
-uint8_t auto_eoi;
-uint8_t rotate_on_auto_eoi;
-uint8_t special_fully_nested_mode;
-uint8_t init4; /* true if 4 byte init */
-uint8_t single_mode; /* true if slave pic is not initialized */
-uint8_t elcr; /* PIIX edge/trigger selection*/
-uint8_t elcr_mask;
-qemu_irq int_out[1];
-uint32_t master; /* reflects /SP input pin */
-uint32_t iobase;
-uint32_t elcr_addr;
-MemoryRegion base_io;
-MemoryRegion elcr_io;
-};
-
 #if defined(DEBUG_PIC) || defined(DEBUG_IRQ_COUNT)
 static int irq_level[16];
 #endif
@@ -79,11 +51,11 @@ static uint64_t irq_count[16];
 static int64_t irq_time[16];
 #endif
 DeviceState *isa_pic;
-static PicState *slave_pic;
+static PICCommonState *slave_pic;
 
 /* return the highest priority found in mask (highest = smallest
number). Return 8 if no irq */
-static int get_priority(PicState *s, int mask)
+static int get_priority(PICCommonState *s, int mask)
 {
 int priority;
 
@@ -98,7 +70,7 @@ static int get_priority(PicState *s, int mask)
 }
 
 /* return the pic wanted interrupt. return -1 if none */
-static int pic_get_irq(PicState *s)
+static int pic_get_irq(PICCommonState *s)
 {
 int mask, cur_priority, priority;
 
@@ -127,7 +99,7 @@ static int pic_get_irq(PicState *s)
 }
 
 /* Update INT output. Must be called every time the output may have changed. */
-static void pic_update_irq(PicState *s)
+static void pic_update_irq(PICCommonState *s)
 {
 int irq;
 
@@ -144,7 +116,7 @@ static void pic_update_irq(PicState *s)
 /* set irq level. If an edge is detected, then the IRR is set to 1 */
 static void pic_set_irq(void *opaque, int irq, int level)
 {
-PicState *s = opaque;
+PICCommonState *s = opaque;
     int mask = 1 << irq;
 
 #if defined(DEBUG_PIC) || defined(DEBUG_IRQ_COUNT) || \
@@ -192,7 +164,7 @@ static void pic_set_irq(void *opaque, int irq, int level)
 }
 
 /* acknowledge interrupt 'irq' */
-static void pic_intack(PicState *s, int irq)
+static void pic_intack(PICCommonState *s, int irq)
 {
     if (s->auto_eoi) {
         if (s->rotate_on_auto_eoi) {
@@ -210,7 +182,7 @@ static void pic_intack(PicState *s, int irq)
 
 int pic_read_irq(DeviceState *d)
 {
-PicState *s = DO_UPCAST(PicState, dev.qdev, d);
+PICCommonState *s = DO_UPCAST(PICCommonState, dev.qdev, d);
 int irq, irq2, intno;
 
 irq = pic_get_irq(s);
@@ -249,30 +221,15 @@ int pic_read_irq(DeviceState *d)
 return intno;
 }
 
-static void pic_init_reset(PicState *s)
+static void pic_init_reset(PICCommonState *s)
 {
-    s->last_irr = 0;
-    s->irr = 0;
-    s->imr = 0;
-    s->isr = 0;
-    s->priority_add = 0;
-    s->irq_base = 0;
-    s->read_reg_select = 0;
-    s->poll = 0;
-    s->special_mask = 0;
-    s->init_state = 0;
-    s->auto_eoi = 0;
-    s->rotate_on_auto_eoi = 0;
-    s->special_fully_nested_mode = 0;
-    s->init4 = 0;
-    s->single_mode = 0;
-    /* Note: ELCR is not reset */
+    pic_reset_common(s);
     pic_update_irq(s);
 }
 
 static void pic_reset(DeviceState 

[PATCH v7 01/18] msi: Generalize msix_supported to msi_supported

2012-01-16 Thread Jan Kiszka
Rename msix_supported to msi_supported and control MSI and MSI-X
activation this way. That was likely the original intention for this
flag, but MSI support came after MSI-X.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 hw/msi.c  |8 
 hw/msi.h  |2 ++
 hw/msix.c |9 -
 hw/msix.h |2 --
 hw/pc.c   |4 ++--
 5 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/hw/msi.c b/hw/msi.c
index f214fcf..5d6ceb6 100644
--- a/hw/msi.c
+++ b/hw/msi.c
@@ -36,6 +36,9 @@
 
 #define PCI_MSI_VECTORS_MAX 32
 
+/* Flag for interrupt controller to declare MSI/MSI-X support */
+bool msi_supported;
+
 /* If we get rid of cap allocator, we won't need this. */
 static inline uint8_t msi_cap_sizeof(uint16_t flags)
 {
@@ -116,6 +119,11 @@ int msi_init(struct PCIDevice *dev, uint8_t offset,
 uint16_t flags;
 uint8_t cap_size;
 int config_offset;
+
+if (!msi_supported) {
+return -ENOTSUP;
+}
+
 MSI_DEV_PRINTF(dev,
                    "init offset: 0x%"PRIx8" vector: %"PRId8
                    " 64bit %d mask %d\n",
diff --git a/hw/msi.h b/hw/msi.h
index 5766018..3040bb0 100644
--- a/hw/msi.h
+++ b/hw/msi.h
@@ -24,6 +24,8 @@
 #include "qemu-common.h"
 #include "pci.h"
 
+extern bool msi_supported;
+
 bool msi_enabled(const PCIDevice *dev);
 int msi_init(struct PCIDevice *dev, uint8_t offset,
  unsigned int nr_vectors, bool msi64bit, bool msi_per_vector_mask);
diff --git a/hw/msix.c b/hw/msix.c
index 149eed2..107d4e5 100644
--- a/hw/msix.c
+++ b/hw/msix.c
@@ -12,6 +12,7 @@
  */
 
 #include "hw.h"
+#include "msi.h"
 #include "msix.h"
 #include "pci.h"
 #include "range.h"
@@ -32,9 +33,6 @@
 #define MSIX_MAX_ENTRIES 32
 
 
-/* Flag for interrupt controller to declare MSI-X support */
-int msix_supported;
-
 /* Add MSI-X capability to the config space for the device. */
 /* Given a bar and its size, add MSI-X table on top of it
  * and fill MSI-X capability in the config space.
@@ -235,10 +233,11 @@ int msix_init(struct PCIDevice *dev, unsigned short nentries,
   unsigned bar_nr, unsigned bar_size)
 {
 int ret;
+
 /* Nothing to do if MSI is not supported by interrupt controller */
-if (!msix_supported)
+if (!msi_supported) {
 return -ENOTSUP;
-
+}
     if (nentries > MSIX_MAX_ENTRIES)
 return -EINVAL;
 
diff --git a/hw/msix.h b/hw/msix.h
index 7e04336..5aba22b 100644
--- a/hw/msix.h
+++ b/hw/msix.h
@@ -29,6 +29,4 @@ void msix_notify(PCIDevice *dev, unsigned vector);
 
 void msix_reset(PCIDevice *dev);
 
-extern int msix_supported;
-
 #endif
diff --git a/hw/pc.c b/hw/pc.c
index f51afa8..3f69e09 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -36,7 +36,7 @@
 #include "elf.h"
 #include "multiboot.h"
 #include "mc146818rtc.h"
-#include "msix.h"
+#include "msi.h"
 #include "sysbus.h"
 #include "sysemu.h"
 #include "blockdev.h"
@@ -896,7 +896,7 @@ static DeviceState *apic_init(void *env, uint8_t apic_id)
 apic_mapped = 1;
 }
 
-msix_supported = 1;
+msi_supported = true;
 
 return dev;
 }
-- 
1.7.3.4
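
To make the effect of the shared flag concrete, here is a small standalone
sketch of the gating pattern: one boolean, set by the interrupt controller
model, guards both the MSI and the MSI-X init paths. All names carrying a
_sketch suffix are hypothetical; they only mirror the shape of msi_init and
msix_init in the patch and none of the real capability setup is modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define MSIX_MAX_ENTRIES_SKETCH 32

/* Stand-in for msi_supported: set once by the interrupt controller model. */
static bool msi_supported_sketch;

static int msi_init_sketch(void)
{
    if (!msi_supported_sketch) {
        return -ENOTSUP;             /* no controller has declared support */
    }
    return 0;
}

static int msix_init_sketch(unsigned short nentries)
{
    if (!msi_supported_sketch) {
        return -ENOTSUP;             /* same flag now gates MSI-X as well */
    }
    if (nentries > MSIX_MAX_ENTRIES_SKETCH) {
        return -EINVAL;
    }
    return 0;
}

/* Walk through the before/after behaviour around controller init. */
static void msi_gate_demo(void)
{
    msi_supported_sketch = false;
    assert(msi_init_sketch() == -ENOTSUP);
    assert(msix_init_sketch(4) == -ENOTSUP);

    msi_supported_sketch = true;     /* what apic_init does in the patch */
    assert(msi_init_sketch() == 0);
    assert(msix_init_sketch(4) == 0);
}
```

The point of the single flag is that a device model does not need to know
which interrupt controller is present; it only sees -ENOTSUP from either
init call until the controller has opted in.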



[PATCH v7 14/18] kvm: x86: Establish IRQ0 override control

2012-01-16 Thread Jan Kiszka
KVM is forced to disable the IRQ0 override when we run with in-kernel
irqchip but without IRQ routing support of the kernel. Set the fwcfg
value correspondingly. This aligns us with qemu-kvm.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 hw/pc.c|3 ++-
 kvm-all.c  |5 +
 kvm-stub.c |5 +
 kvm.h  |2 ++
 sysemu.h   |1 -
 vl.c   |1 -
 6 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/hw/pc.c b/hw/pc.c
index 3f69e09..9519079 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -39,6 +39,7 @@
 #include "msi.h"
 #include "sysbus.h"
 #include "sysemu.h"
+#include "kvm.h"
 #include "blockdev.h"
 #include "ui/qemu-spice.h"
 #include "memory.h"
@@ -609,7 +610,7 @@ static void *bochs_bios_init(void)
 fw_cfg_add_i64(fw_cfg, FW_CFG_RAM_SIZE, (uint64_t)ram_size);
 fw_cfg_add_bytes(fw_cfg, FW_CFG_ACPI_TABLES, (uint8_t *)acpi_tables,
  acpi_tables_len);
-    fw_cfg_add_bytes(fw_cfg, FW_CFG_IRQ0_OVERRIDE, &irq0override, 1);
+    fw_cfg_add_i32(fw_cfg, FW_CFG_IRQ0_OVERRIDE, kvm_allows_irq0_override());
 
     smbios_table = smbios_get_table(&smbios_len);
 if (smbios_table)
diff --git a/kvm-all.c b/kvm-all.c
index e91bb46..b06ed7d 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -1272,6 +1272,11 @@ int kvm_has_gsi_routing(void)
 return kvm_check_extension(kvm_state, KVM_CAP_IRQ_ROUTING);
 }
 
+int kvm_allows_irq0_override(void)
+{
+return !kvm_enabled() || !kvm_irqchip_in_kernel() || kvm_has_gsi_routing();
+}
+
 void kvm_setup_guest_memory(void *start, size_t size)
 {
 if (!kvm_has_sync_mmu()) {
diff --git a/kvm-stub.c b/kvm-stub.c
index 06064b9..6c2b06b 100644
--- a/kvm-stub.c
+++ b/kvm-stub.c
@@ -78,6 +78,11 @@ int kvm_has_many_ioeventfds(void)
 return 0;
 }
 
+int kvm_allows_irq0_override(void)
+{
+return 1;
+}
+
 void kvm_setup_guest_memory(void *start, size_t size)
 {
 }
diff --git a/kvm.h b/kvm.h
index 0d6c453..a3c87af 100644
--- a/kvm.h
+++ b/kvm.h
@@ -53,6 +53,8 @@ int kvm_has_xcrs(void);
 int kvm_has_many_ioeventfds(void);
 int kvm_has_gsi_routing(void);
 
+int kvm_allows_irq0_override(void);
+
 #ifdef NEED_CPU_H
 int kvm_init_vcpu(CPUState *env);
 
diff --git a/sysemu.h b/sysemu.h
index 3806901..6abded1 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -102,7 +102,6 @@ extern int vga_interface_type;
 extern int graphic_width;
 extern int graphic_height;
 extern int graphic_depth;
-extern uint8_t irq0override;
 extern DisplayType display_type;
 extern const char *keyboard_layout;
 extern int win2k_install_hack;
diff --git a/vl.c b/vl.c
index d925424..f77e687 100644
--- a/vl.c
+++ b/vl.c
@@ -218,7 +218,6 @@ int no_reboot = 0;
 int no_shutdown = 0;
 int cursor_hide = 1;
 int graphic_rotate = 0;
-uint8_t irq0override = 1;
 const char *watchdog;
 QEMUOptionRom option_rom[MAX_OPTION_ROMS];
 int nb_option_roms;
-- 
1.7.3.4
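
The condition in kvm_allows_irq0_override is easier to review as a truth
table. The sketch below restates it as a pure function over three booleans;
allows_irq0_override() and its parameter names are hypothetical stand-ins
for kvm_enabled(), kvm_irqchip_in_kernel() and kvm_has_gsi_routing().

```c
#include <assert.h>
#include <stdbool.h>

/* The override is only disallowed when KVM is in use with an in-kernel
 * irqchip and the kernel lacks IRQ routing support. */
static bool allows_irq0_override(bool kvm_on, bool irqchip_in_kernel,
                                 bool has_gsi_routing)
{
    return !kvm_on || !irqchip_in_kernel || has_gsi_routing;
}

/* Enumerate the interesting rows of the truth table. */
static void check_irq0_override_table(void)
{
    assert(allows_irq0_override(false, false, false));  /* TCG or Xen */
    assert(allows_irq0_override(true, false, false));   /* user space irqchip */
    assert(allows_irq0_override(true, true, true));     /* routing available */
    assert(!allows_irq0_override(true, true, false));   /* the one "no" case */
}
```

Only the last row clears the fw_cfg value, which matches the commit message:
in-kernel irqchip without IRQ routing support.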



[PATCH v7 05/18] apic: Introduce apic_report_irq_delivered

2012-01-16 Thread Jan Kiszka
The in-kernel i8259 and IOAPIC backends for KVM will need this, so
encapsulate the shared bits.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 hw/apic.c|   11 ---
 hw/apic.h|1 +
 trace-events |2 +-
 3 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/hw/apic.c b/hw/apic.c
index b9d733c..bec493b 100644
--- a/hw/apic.c
+++ b/hw/apic.c
@@ -413,6 +413,13 @@ static void apic_update_irq(APICState *s)
 }
 }
 
+void apic_report_irq_delivered(int delivered)
+{
+apic_irq_delivered += delivered;
+
+trace_apic_report_irq_delivered(apic_irq_delivered);
+}
+
 void apic_reset_irq_delivered(void)
 {
 trace_apic_reset_irq_delivered(apic_irq_delivered);
@@ -429,9 +436,7 @@ int apic_get_irq_delivered(void)
 
 static void apic_set_irq(APICState *s, int vector_num, int trigger_mode)
 {
-    apic_irq_delivered += !get_bit(s->irr, vector_num);
-
-    trace_apic_set_irq(apic_irq_delivered);
+    apic_report_irq_delivered(!get_bit(s->irr, vector_num));
 
     set_bit(s->irr, vector_num);
 if (trigger_mode)
diff --git a/hw/apic.h b/hw/apic.h
index a62d83b..8173d8a 100644
--- a/hw/apic.h
+++ b/hw/apic.h
@@ -10,6 +10,7 @@ int apic_accept_pic_intr(DeviceState *s);
 void apic_deliver_pic_intr(DeviceState *s, int level);
 void apic_deliver_nmi(DeviceState *d);
 int apic_get_interrupt(DeviceState *s);
+void apic_report_irq_delivered(int delivered);
 void apic_reset_irq_delivered(void);
 int apic_get_irq_delivered(void);
 void cpu_set_apic_base(DeviceState *s, uint64_t val);
diff --git a/trace-events b/trace-events
index 514849a..c101216 100644
--- a/trace-events
+++ b/trace-events
@@ -95,9 +95,9 @@ cpu_get_apic_base(uint64_t val) %016PRIx64
 apic_mem_readl(uint64_t addr, uint32_t val)  "%"PRIx64" = %08x"
 apic_mem_writel(uint64_t addr, uint32_t val) "%"PRIx64" = %08x"
 # coalescing
+apic_report_irq_delivered(int apic_irq_delivered) "coalescing %d"
 apic_reset_irq_delivered(int apic_irq_delivered) "old coalescing %d"
 apic_get_irq_delivered(int apic_irq_delivered) "returning coalescing %d"
-apic_set_irq(int apic_irq_delivered) "coalescing %d"
 
 # hw/cs4231.c
 cs4231_mem_readl_dreg(uint32_t reg, uint32_t ret) read dreg %d: 0x%02x
-- 
1.7.3.4
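
As a rough illustration of what the delivery counter tracks, the sketch
below models the report/reset/get trio around a 256-bit IRR. It is a
simplified standalone model, not the hw/apic.c code: tracing is omitted, the
IRR is a single global, and every name is a hypothetical stand-in.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t irr[8];        /* 256-bit interrupt request register */
static int irq_delivered;      /* stand-in for apic_irq_delivered */

static int get_bit(const uint32_t *tab, int index)
{
    return !!(tab[index >> 5] & (1u << (index & 0x1f)));
}

static void set_bit(uint32_t *tab, int index)
{
    tab[index >> 5] |= 1u << (index & 0x1f);
}

/* Any interrupt source can report whether its injection was delivered. */
static void report_irq_delivered(int delivered)
{
    irq_delivered += delivered;
}

static void reset_irq_delivered(void)
{
    irq_delivered = 0;
}

static int get_irq_delivered(void)
{
    return irq_delivered;
}

/* An injection counts as delivered only if the vector was not already
 * pending; a second injection of the same vector is coalesced. */
static void set_irq(int vector)
{
    assert(vector >= 0 && vector < 256);
    report_irq_delivered(!get_bit(irr, vector));
    set_bit(irr, vector);
}
```

A caller would reset the counter, inject, and then read it back: injecting
the same vector twice without servicing it leaves the count at 1, which is
how a coalesced tick can be detected.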



Re: [RFC PATCH] emulator: Fix task switch into/out of VM86

2012-01-16 Thread Joerg Roedel
On Mon, Jan 16, 2012 at 04:37:33PM +0100, Kevin Wolf wrote:
 On 10.01.2012 18:51, Joerg Roedel wrote:
  Joerg, do you know how we can check that a task switch was caused by an
  exception triggering a task gate if exit_int_info is not provided during
  intercept (short of decoding the instruction that caused the intercept to
  see if it's a call to a gate, which is racy)?
  
  Hmm, haven't found a solution yet. But I will check further and let you
  know if I find out something.
 
 Jörg, any news on this?

Not yet. We are currently trying to figure out which privilege checks
the software has to do and which ones are done by the hardware before a
task-switch intercept is thrown.


Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632



[PATCH] pci-assign: Fix multifunction support

2012-01-16 Thread Alex Williamson
The core PCI code sets the multifunction bit in the header before
calling the device initfn.  For device assignment, we're blasting
that value with the actual hardware value, so nobody sees the
additional functions if the device isn't physically multifunction.
Switch the HEADER_TYPE to a fully emulated field (all read-only
anyway) and add setting and clearing of the multifunction bit to
match qemu directive.

Signed-off-by: Alex Williamson alex.william...@redhat.com
---

 hw/device-assignment.c |8 +++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index 2a9e66d..7f4a5ec 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -540,6 +540,13 @@ again:
 fprintf(stderr, "%s: read failed, errno = %d\n", __func__, errno);
 }
 
+/* Restore or clear multifunction, this is always controlled by qemu */
+if (pci_dev->dev.cap_present & QEMU_PCI_CAP_MULTIFUNCTION) {
+pci_dev->dev.config[PCI_HEADER_TYPE] |= PCI_HEADER_TYPE_MULTI_FUNCTION;
+} else {
+pci_dev->dev.config[PCI_HEADER_TYPE] &= ~PCI_HEADER_TYPE_MULTI_FUNCTION;
+}
+
 /* Clear host resource mapping info.  If we choose not to register a
  * BAR, such as might be the case with the option ROM, we can get
  * confusing, unwritable, residual addresses from the host here. */
@@ -1575,7 +1582,6 @@ static int assigned_initfn(struct PCIDevice *pci_dev)
 assigned_dev_direct_config_read(dev, PCI_CLASS_PROG, 3);
 assigned_dev_direct_config_read(dev, PCI_CACHE_LINE_SIZE, 1);
 assigned_dev_direct_config_read(dev, PCI_LATENCY_TIMER, 1);
-assigned_dev_direct_config_read(dev, PCI_HEADER_TYPE, 1);
 assigned_dev_direct_config_read(dev, PCI_BIST, 1);
 assigned_dev_direct_config_read(dev, PCI_CARDBUS_CIS, 4);
 assigned_dev_direct_config_read(dev, PCI_SUBSYSTEM_VENDOR_ID, 2);
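
The effect of the first hunk can be tried outside QEMU. The following is a minimal standalone sketch, not the QEMU code: `set_multifunction` is a hypothetical helper, and the two constants are redefined locally with the values from the PCI configuration header layout (header type at offset 0x0e, multifunction flag in bit 7).

```c
#include <stdbool.h>
#include <stdint.h>

/* Values from the PCI configuration header layout, redefined here so the
 * sketch builds without QEMU headers. */
#define PCI_HEADER_TYPE                0x0e
#define PCI_HEADER_TYPE_MULTI_FUNCTION 0x80

/* Hypothetical helper: make bit 7 of the emulated header-type byte follow
 * the guest configuration instead of whatever the host hardware reported.
 * The low 7 bits (the header layout) are left untouched. */
static void set_multifunction(uint8_t *config, bool multifunction)
{
    if (multifunction) {
        config[PCI_HEADER_TYPE] |= PCI_HEADER_TYPE_MULTI_FUNCTION;
    } else {
        config[PCI_HEADER_TYPE] &= (uint8_t)~PCI_HEADER_TYPE_MULTI_FUNCTION;
    }
}
```

Because the byte is overwritten from the emulated side on every reset, the value read from the physical device no longer decides whether the guest scans functions 1 to 7.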



[18/48] KVM: Device assignment permission checks

2012-01-16 Thread Greg KH
3.1-stable review patch.  If anyone has any objections, please let me know.

--


From: Alex Williamson alex.william...@redhat.com

(cherry picked from commit 3d27e23b17010c668db311140b1770c78fb9)

Only allow KVM device assignment to attach to devices which:

 - Are not bridges
 - Have BAR resources (assume others are special devices)
 - The user has permissions to use

Assigning a bridge is a configuration error, it's not supported, and
typically doesn't result in the behavior the user is expecting anyway.
Devices without BAR resources are typically chipset components that
also don't have host drivers.  We don't want users to hold such devices
captive or cause system problems by fencing them off into an iommu
domain.  We determine permission to use by testing whether the user
has access to the PCI sysfs resource files.  By default a normal user
will not have access to these files, so it provides a good indication
that an administration agent has granted the user access to the device.
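
A management tool can run a similar test from user space before it attempts the assignment. The sketch below is only a user-space analogue of the check described above, not the kernel code: both helper names are hypothetical, and it assumes the device is reachable through the usual /sys/bus/pci/devices/<address>/resourceN files.

```c
#include <stdio.h>
#include <unistd.h>

/* Build "/sys/bus/pci/devices/<bdf>/resource<bar>".
 * Returns 0 on success, -1 if the buffer is too small. */
static int bar_resource_path(char *buf, size_t len, const char *bdf, int bar)
{
    int n = snprintf(buf, len, "/sys/bus/pci/devices/%s/resource%d", bdf, bar);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Returns 1 if the calling user may read and write the BAR resource file,
 * 0 otherwise (including when the file does not exist). */
static int user_can_use_bar(const char *bdf, int bar)
{
    char path[256];

    if (bar_resource_path(path, sizeof(path), bdf, bar) != 0)
        return 0;
    return access(path, R_OK | W_OK) == 0;
}
```

A caller would loop over BARs 0 to 5 and skip the ones that are not present. The kernel side additionally rejects devices that have no BAR at all.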

[Yang Bai: add missing #include]
[avi: fix comment style]

Signed-off-by: Alex Williamson alex.william...@redhat.com
Signed-off-by: Yang Bai hamo...@gmail.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
Signed-off-by: Greg Kroah-Hartman gre...@suse.de
---
 Documentation/virtual/kvm/api.txt |4 ++
 virt/kvm/assigned-dev.c   |   75 ++
 2 files changed, 79 insertions(+)

--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1134,6 +1134,10 @@ following flags are specified:
 The KVM_DEV_ASSIGN_ENABLE_IOMMU flag is a mandatory option to ensure
 isolation of the device.  Usages not specifying this flag are deprecated.
 
+Only PCI header type 0 devices with PCI BAR resources are supported by
+device assignment.  The user requesting this ioctl must have read/write
+access to the PCI sysfs resource files associated with the device.
+
 4.49 KVM_DEASSIGN_PCI_DEVICE
 
 Capability: KVM_CAP_DEVICE_DEASSIGNMENT
--- a/virt/kvm/assigned-dev.c
+++ b/virt/kvm/assigned-dev.c
@@ -17,6 +17,8 @@
 #include <linux/pci.h>
 #include <linux/interrupt.h>
 #include <linux/slab.h>
+#include <linux/namei.h>
+#include <linux/fs.h>
 #include "irq.h"
 
 static struct kvm_assigned_dev_kernel *kvm_find_assigned_dev(struct list_head *head,
@@ -474,12 +476,73 @@ out:
return r;
 }
 
+/*
+ * We want to test whether the caller has been granted permissions to
+ * use this device.  To be able to configure and control the device,
+ * the user needs access to PCI configuration space and BAR resources.
+ * These are accessed through PCI sysfs.  PCI config space is often
+ * passed to the process calling this ioctl via file descriptor, so we
+ * can't rely on access to that file.  We can check for permissions
+ * on each of the BAR resource files, which is a pretty clear
+ * indicator that the user has been granted access to the device.
+ */
+static int probe_sysfs_permissions(struct pci_dev *dev)
+{
+#ifdef CONFIG_SYSFS
+   int i;
+   bool bar_found = false;
+
+   for (i = PCI_STD_RESOURCES; i <= PCI_STD_RESOURCE_END; i++) {
+   char *kpath, *syspath;
+   struct path path;
+   struct inode *inode;
+   int r;
+
+   if (!pci_resource_len(dev, i))
+   continue;
+
+   kpath = kobject_get_path(&dev->dev.kobj, GFP_KERNEL);
+   if (!kpath)
+   return -ENOMEM;
+
+   /* Per sysfs-rules, sysfs is always at /sys */
+   syspath = kasprintf(GFP_KERNEL, "/sys%s/resource%d", kpath, i);
+   kfree(kpath);
+   if (!syspath)
+   return -ENOMEM;
+
+   r = kern_path(syspath, LOOKUP_FOLLOW, &path);
+   kfree(syspath);
+   if (r)
+   return r;
+
+   inode = path.dentry->d_inode;
+
+   r = inode_permission(inode, MAY_READ | MAY_WRITE | MAY_ACCESS);
+   path_put(&path);
+   if (r)
+   return r;
+
+   bar_found = true;
+   }
+
+   /* If no resources, probably something special */
+   if (!bar_found)
+   return -EPERM;
+
+   return 0;
+#else
+   return -EINVAL; /* No way to control the device without sysfs */
+#endif
+}
+
 static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
  struct kvm_assigned_pci_dev *assigned_dev)
 {
int r = 0, idx;
struct kvm_assigned_dev_kernel *match;
struct pci_dev *dev;
+   u8 header_type;
 
   if (!(assigned_dev->flags & KVM_DEV_ASSIGN_ENABLE_IOMMU))
return -EINVAL;
@@ -510,6 +573,18 @@ static int kvm_vm_ioctl_assign_device(st
r = -EINVAL;
goto out_free;
}
+
+   /* Don't allow bridges to be assigned */
+   pci_read_config_byte(dev, PCI_HEADER_TYPE, &header_type);
+   if 

[17/48] KVM: Remove ability to assign a device without iommu support

2012-01-16 Thread Greg KH
3.1-stable review patch.  If anyone has any objections, please let me know.

--


From: Alex Williamson alex.william...@redhat.com

(cherry picked from commit 423873736b78f549fbfa2f715f2e4de7e6c5e1e9)

This option has no users and it exposes a security hole that we
can allow devices to be assigned without iommu protection.  Make
KVM_DEV_ASSIGN_ENABLE_IOMMU a mandatory option.

Signed-off-by: Alex Williamson alex.william...@redhat.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
Signed-off-by: Greg Kroah-Hartman gre...@suse.de
---
 Documentation/virtual/kvm/api.txt |3 +++
 virt/kvm/assigned-dev.c   |   18 +-
 2 files changed, 12 insertions(+), 9 deletions(-)

--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1131,6 +1131,9 @@ following flags are specified:
 /* Depends on KVM_CAP_IOMMU */
 #define KVM_DEV_ASSIGN_ENABLE_IOMMU (1 << 0)
 
+The KVM_DEV_ASSIGN_ENABLE_IOMMU flag is a mandatory option to ensure
+isolation of the device.  Usages not specifying this flag are deprecated.
+
 4.49 KVM_DEASSIGN_PCI_DEVICE
 
 Capability: KVM_CAP_DEVICE_DEASSIGNMENT
--- a/virt/kvm/assigned-dev.c
+++ b/virt/kvm/assigned-dev.c
@@ -481,6 +481,9 @@ static int kvm_vm_ioctl_assign_device(st
struct kvm_assigned_dev_kernel *match;
struct pci_dev *dev;
 
+   if (!(assigned_dev->flags & KVM_DEV_ASSIGN_ENABLE_IOMMU))
+   return -EINVAL;
+
   mutex_lock(&kvm->lock);
   idx = srcu_read_lock(&kvm->srcu);
 
@@ -538,16 +541,14 @@ static int kvm_vm_ioctl_assign_device(st
 
   list_add(&match->list, &kvm->arch.assigned_dev_head);
 
-   if (assigned_dev->flags & KVM_DEV_ASSIGN_ENABLE_IOMMU) {
-   if (!kvm->arch.iommu_domain) {
-   r = kvm_iommu_map_guest(kvm);
-   if (r)
-   goto out_list_del;
-   }
-   r = kvm_assign_device(kvm, match);
+   if (!kvm->arch.iommu_domain) {
+   r = kvm_iommu_map_guest(kvm);
if (r)
goto out_list_del;
}
+   r = kvm_assign_device(kvm, match);
+   if (r)
+   goto out_list_del;
 
 out:
   srcu_read_unlock(&kvm->srcu, idx);
@@ -587,8 +588,7 @@ static int kvm_vm_ioctl_deassign_device(
goto out;
}
 
-   if (match->flags & KVM_DEV_ASSIGN_ENABLE_IOMMU)
-   kvm_deassign_device(kvm, match);
+   kvm_deassign_device(kvm, match);
 
kvm_free_assigned_device(kvm, match);
 




[15/48] KVM guest: prevent tracing recursion with kvmclock

2012-01-16 Thread Greg KH
3.1-stable review patch.  If anyone has any objections, please let me know.

--

From: Avi Kivity a...@redhat.com

(cherry picked from commit 95ef1e52922cf75b1ea2eae54ef886f2cc47eecb)

Prevent tracing of preempt_disable() in get_cpu_var() in
kvm_clock_read(). When CONFIG_DEBUG_PREEMPT is enabled,
preempt_disable/enable() are traced and this causes the function_graph
tracer to go into an infinite recursion. By open coding the
preempt_disable() around the get_cpu_var(), we can use the notrace
version which prevents preempt_disable/enable() from being traced and
prevents the recursion.

Based on a similar patch for Xen from Jeremy Fitzhardinge.
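
The recursion can be modelled in plain user-space C. This is a toy simulation under stated assumptions, not kernel code: every name is hypothetical, the "tracer" is a function that timestamps each event by reading the clock, and MAX_DEPTH caps a nesting that would otherwise be unbounded.

```c
#include <stdbool.h>

#define MAX_DEPTH 8            /* cap so the simulation terminates */

static bool use_notrace;       /* which preempt_disable variant the clock uses */
static int depth, deepest;     /* current and deepest tracer nesting */

static unsigned long clock_read(void);

/* Simulated tracer callback: it timestamps each event by reading the clock. */
static void trace_event(void)
{
    if (depth >= MAX_DEPTH)
        return;
    depth++;
    if (depth > deepest)
        deepest = depth;
    (void)clock_read();
    depth--;
}

static void sim_preempt_disable(void)         { trace_event(); }
static void sim_preempt_disable_notrace(void) { /* invisible to the tracer */ }

/* Simulated clock read: takes the traced or untraced path. */
static unsigned long clock_read(void)
{
    if (use_notrace)
        sim_preempt_disable_notrace();
    else
        sim_preempt_disable();
    return 42;                 /* dummy timestamp */
}

/* Run one clock read and report how deep the tracer nested. */
static int deepest_nesting(bool notrace)
{
    use_notrace = notrace;
    depth = deepest = 0;
    (void)clock_read();
    return deepest;
}
```

With the traced variant the nesting runs into the cap. With the notrace variant the tracer is never entered from the clock path, which is the property the patch relies on.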

Tested-by: Gleb Natapov g...@redhat.com
Acked-by: Steven Rostedt rost...@goodmis.org
Signed-off-by: Avi Kivity a...@redhat.com
Signed-off-by: Greg Kroah-Hartman gre...@suse.de
---
 arch/x86/kernel/kvmclock.c |5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -74,9 +74,10 @@ static cycle_t kvm_clock_read(void)
struct pvclock_vcpu_time_info *src;
cycle_t ret;
 
-   src = &get_cpu_var(hv_clock);
+   preempt_disable_notrace();
+   src = &__get_cpu_var(hv_clock);
ret = pvclock_clocksource_read(src);
-   put_cpu_var(hv_clock);
+   preempt_enable_notrace();
return ret;
 }
 




[16/48] KVM: x86: Prevent starting PIT timers in the absence of irqchip support

2012-01-16 Thread Greg KH
3.1-stable review patch.  If anyone has any objections, please let me know.

--


From: Jan Kiszka jan.kis...@siemens.com

(cherry picked from commit 0924ab2cfa98b1ece26c033d696651fd62896c69)

User space may create the PIT and forget about setting up the irqchips.
In that case, firing PIT IRQs will crash the host:

BUG: unable to handle kernel NULL pointer dereference at 0128
IP: [a10f6280] kvm_set_irq+0x30/0x170 [kvm]
...
Call Trace:
 [a11228c1] pit_do_work+0x51/0xd0 [kvm]
 [81071431] process_one_work+0x111/0x4d0
 [81071bb2] worker_thread+0x152/0x340
 [81075c8e] kthread+0x7e/0x90
 [815a4474] kernel_thread_helper+0x4/0x10

Prevent this by checking the irqchip mode before starting a timer. We
can't deny creating the PIT if the irqchips aren't set up yet as
current user land expects this order to work.
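
The shape of the fix, reduced to a standalone sketch: the structures and function names below are hypothetical stand-ins, not the KVM ones. Creation is always allowed, but the timer is only armed when there is an interrupt controller to deliver to.

```c
#include <stdbool.h>
#include <stddef.h>

struct irqchip {
    int pending;               /* interrupts delivered so far */
};

struct vm {
    struct irqchip *irqchip;   /* NULL until user space sets it up */
    bool timer_armed;
};

static bool irqchip_present(const struct vm *vm)
{
    return vm->irqchip != NULL;
}

/* Arm the timer only when its interrupts have somewhere to go. */
static void start_timer(struct vm *vm)
{
    if (!irqchip_present(vm))
        return;
    vm->timer_armed = true;
}

/* Timer callback: dereferences the irqchip, so it must never run
 * for a VM that has none. */
static void timer_fired(struct vm *vm)
{
    vm->irqchip->pending++;
}
```

Checking at arm time rather than at creation time keeps the existing user-space ordering (PIT first, irqchips later) working.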

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
Signed-off-by: Greg Kroah-Hartman gre...@suse.de
---
 arch/x86/kvm/i8254.c |   10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

--- a/arch/x86/kvm/i8254.c
+++ b/arch/x86/kvm/i8254.c
@@ -338,11 +338,15 @@ static enum hrtimer_restart pit_timer_fn
return HRTIMER_NORESTART;
 }
 
-static void create_pit_timer(struct kvm_kpit_state *ps, u32 val, int is_period)
+static void create_pit_timer(struct kvm *kvm, u32 val, int is_period)
 {
+   struct kvm_kpit_state *ps = &kvm->arch.vpit->pit_state;
   struct kvm_timer *pt = &ps->pit_timer;
s64 interval;
 
+   if (!irqchip_in_kernel(kvm))
+   return;
+
interval = muldiv64(val, NSEC_PER_SEC, KVM_PIT_FREQ);
 
   pr_debug("create pit timer, interval is %llu nsec\n", interval);
@@ -394,13 +398,13 @@ static void pit_load_count(struct kvm *k
 /* FIXME: enhance mode 4 precision */
case 4:
   if (!(ps->flags & KVM_PIT_FLAGS_HPET_LEGACY)) {
-   create_pit_timer(ps, val, 0);
+   create_pit_timer(kvm, val, 0);
}
break;
case 2:
case 3:
   if (!(ps->flags & KVM_PIT_FLAGS_HPET_LEGACY)){
-   create_pit_timer(ps, val, 1);
+   create_pit_timer(kvm, val, 1);
}
break;
default:




Re: [Qemu-devel] [PATCH 2/2] KVM: Use -cpu best as default on x86

2012-01-16 Thread Ryan Harper
* Alexander Graf ag...@suse.de [2012-01-08 17:53]:
 When running QEMU without -cpu parameter, the user usually wants a sane
 default. So far, we're using the qemu64/qemu32 CPU type, which basically
 means the maximum TCG can emulate.

It also means we allow the maximum possible number of migration targets.
Have you given any thought to migration with -cpu best?

 
 That's a really good default when using TCG, but when running with KVM
 we much rather want a default saying the maximum performance I can get.
 
 Fortunately we just added an option that gives us the best performance
 while still staying safe on the testability side of things: -cpu best.
 So all we need to do is make -cpu best the default when the user doesn't
 define any.
 
 This fixes a lot of subtle breakage in the GNU toolchain (libgmp) which
 hiccups on QEMU's non-existent CPU models.
 
 This patch also adds a new pc-1.1 machine type to keep backwards compatible
 with older versions of QEMU.
 
 Signed-off-by: Alexander Graf ag...@suse.de
 ---
  hw/pc_piix.c |   42 ++
  1 files changed, 34 insertions(+), 8 deletions(-)
 
 diff --git a/hw/pc_piix.c b/hw/pc_piix.c
 index 00f525e..3d78ccb 100644
 --- a/hw/pc_piix.c
 +++ b/hw/pc_piix.c
 @@ -79,7 +79,8 @@ static void pc_init1(MemoryRegion *system_memory,
   const char *initrd_filename,
   const char *cpu_model,
   int pci_enabled,
 - int kvmclock_enabled)
 + int kvmclock_enabled,
 + int may_cpu_best)
  {
  int i;
  ram_addr_t below_4g_mem_size, above_4g_mem_size;
 @@ -102,6 +103,9 @@ static void pc_init1(MemoryRegion *system_memory,
  MemoryRegion *rom_memory;
  DeviceState *dev;
 
 +if (!cpu_model && kvm_enabled() && may_cpu_best) {
 +cpu_model = "best";
 +}
  pc_cpus_init(cpu_model);
 
  if (kvmclock_enabled) {
 @@ -263,7 +267,21 @@ static void pc_init_pci(ram_addr_t ram_size,
   get_system_io(),
   ram_size, boot_device,
   kernel_filename, kernel_cmdline,
 - initrd_filename, cpu_model, 1, 1);
 + initrd_filename, cpu_model, 1, 1, 1);
 +}
 +
 +static void pc_init_pci_oldcpu(ram_addr_t ram_size,
 +   const char *boot_device,
 +   const char *kernel_filename,
 +   const char *kernel_cmdline,
 +   const char *initrd_filename,
 +   const char *cpu_model)
 +{
 +pc_init1(get_system_memory(),
 + get_system_io(),
 + ram_size, boot_device,
 + kernel_filename, kernel_cmdline,
 + initrd_filename, cpu_model, 1, 1, 0);
  }
 
  static void pc_init_pci_no_kvmclock(ram_addr_t ram_size,
 @@ -277,7 +295,7 @@ static void pc_init_pci_no_kvmclock(ram_addr_t ram_size,
   get_system_io(),
   ram_size, boot_device,
   kernel_filename, kernel_cmdline,
 - initrd_filename, cpu_model, 1, 0);
 + initrd_filename, cpu_model, 1, 0, 0);
  }
 
  static void pc_init_isa(ram_addr_t ram_size,
 @@ -293,7 +311,7 @@ static void pc_init_isa(ram_addr_t ram_size,
   get_system_io(),
   ram_size, boot_device,
   kernel_filename, kernel_cmdline,
 - initrd_filename, cpu_model, 0, 1);
 + initrd_filename, cpu_model, 0, 1, 0);
  }
 
  #ifdef CONFIG_XEN
 @@ -314,8 +332,8 @@ static void pc_xen_hvm_init(ram_addr_t ram_size,
  }
  #endif
 
 -static QEMUMachine pc_machine_v1_0 = {
 -.name = "pc-1.0",
 +static QEMUMachine pc_machine_v1_1 = {
 +.name = "pc-1.1",
  .alias = "pc",
  .desc = "Standard PC",
  .init = pc_init_pci,
 @@ -323,17 +341,24 @@ static QEMUMachine pc_machine_v1_0 = {
  .is_default = 1,
  };
 
 +static QEMUMachine pc_machine_v1_0 = {
 +.name = "pc-1.0",
 +.desc = "Standard PC",
 +.init = pc_init_pci_oldcpu,
 +.max_cpus = 255,
 +};
 +
  static QEMUMachine pc_machine_v0_15 = {
  .name = "pc-0.15",
  .desc = "Standard PC",
 -.init = pc_init_pci,
 +.init = pc_init_pci_oldcpu,
  .max_cpus = 255,
  };
 
  static QEMUMachine pc_machine_v0_14 = {
  .name = "pc-0.14",
  .desc = "Standard PC",
 -.init = pc_init_pci,
 +.init = pc_init_pci_oldcpu,
  .max_cpus = 255,
  .compat_props = (GlobalProperty[]) {
  {
 @@ -612,6 +637,7 @@ static QEMUMachine xenfv_machine = {
 
  static void pc_machine_init(void)
  {
 +qemu_register_machine(&pc_machine_v1_1);
  qemu_register_machine(&pc_machine_v1_0);
  qemu_register_machine(&pc_machine_v0_15);
  qemu_register_machine(&pc_machine_v0_14);
 -- 
 1.6.0.2
 

-- 
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
ry...@us.ibm.com


Re: [Qemu-devel] [PATCH 1/2] KVM: Add new -cpu best

2012-01-16 Thread Anthony Liguori

On 01/08/2012 05:52 PM, Alexander Graf wrote:

During discussions on whether to make -cpu host the default in SLE, I found
myself disagreeing to the thought, because it potentially opens a big can
of worms for potential bugs. But if I already am so opposed to it for SLE, how
can it possibly be reasonable to default to -cpu host in upstream QEMU? And
what would a sane default look like?



What are the arguments against -cpu host?

Regards,

Anthony Liguori


Re: [Qemu-devel] [PATCH 2/2] KVM: Use -cpu best as default on x86

2012-01-16 Thread Alexander Graf

On 16.01.2012, at 20:30, Ryan Harper wrote:

 * Alexander Graf ag...@suse.de [2012-01-08 17:53]:
 When running QEMU without -cpu parameter, the user usually wants a sane
 default. So far, we're using the qemu64/qemu32 CPU type, which basically
 means the maximum TCG can emulate.
 
 it also means we all maximum possible migration targets.  Have you
 given any thought to migration with -cpu best? 

If you have the same boxes in your cluster, migration just works. If you don't,
you usually use a specific CPU model that is the lowest common denominator
between your boxes either way.
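
What a toolstack does for a mixed cluster can be sketched as an intersection of per-host feature sets. This is only an illustration: `common_features` is a hypothetical helper, and it assumes each host's features have already been packed into one 64-bit mask.

```c
#include <stddef.h>
#include <stdint.h>

/* Return the set of feature bits that every host in the list offers.
 * An empty host list yields an empty feature set. */
static uint64_t common_features(const uint64_t *host_features, size_t n)
{
    uint64_t common = ~(uint64_t)0;

    for (size_t i = 0; i < n; i++)
        common &= host_features[i];
    return n ? common : 0;
}
```

A guest that is only shown the resulting mask can move between any of the hosts, at the cost of not using features that only some of them have.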

The current kvm64 type is broken. Libgmp just abort()s when we pass it in. So 
anything is better than what we do today on AMD hosts :).


Alex



Re: [Qemu-devel] [PATCH 1/2] KVM: Add new -cpu best

2012-01-16 Thread Alexander Graf

On 16.01.2012, at 20:33, Anthony Liguori wrote:

 On 01/08/2012 05:52 PM, Alexander Graf wrote:
 During discussions on whether to make -cpu host the default in SLE, I found
 myself disagreeing to the thought, because it potentially opens a big can
 of worms for potential bugs. But if I already am so opposed to it for SLE, 
 how
 can it possibly be reasonable to default to -cpu host in upstream QEMU? And
 what would a sane default look like?
 
 
 What are the arguments against -cpu host?

It's hard to test. New CPUs have new features and we have a hard time
catching up. With -cpu best we only select from a pool of known-good CPU
types. If you want to check that everything works, go to a box that has the
maximum available features, go through all -cpu options that users could run
into and you're good. With -cpu host you can't really test (unless you own
every CPU in existence).

We expose CPUID information that doesn't exist that way in the real world.

A small example from today's code.

There are a bunch of CPUID leaves. On Nehalem, one of them is a list of possible
C-States to go into. With -cpu host we sync feature bits, CPU name, CPU family 
and some other bits of information, but not the C-State information. So we end 
up with a CPU inside the guest that looks and feels like a Nehalem CPU, but 
doesn't expose any C-State information.

Linux now boots, goes in, checks that it's running on Nehalem, sets the 
powersave mechanism to the respective model and fills an internal callback 
table with the C-State information with a loop that ends without any action, 
since we expose 0 C-State bits. When the guest now calls the idle callback, it 
dereferences that table, which contains a NULL pointer, oops.
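
The failure mode is easy to reproduce in miniature. The following is a hypothetical reduction, not the Linux idle driver: a callback table is filled from a C-state bitmask, and with a mask of 0 the fill loop does nothing, so the first slot stays NULL. The `idle` function shows the defensive check that avoids calling through it.

```c
#include <stddef.h>

#define MAX_CSTATES 4

typedef int (*idle_fn)(void);

struct idle_table {
    idle_fn enter[MAX_CSTATES];
    int count;
};

/* Stand-in for a real C-state entry routine. */
static int enter_cstate(void)
{
    return 1;
}

/* Add one callback per advertised C-state bit. With cstate_mask == 0
 * the second loop adds nothing and every slot remains NULL. */
static void fill_idle_table(struct idle_table *t, unsigned cstate_mask)
{
    t->count = 0;
    for (int i = 0; i < MAX_CSTATES; i++)
        t->enter[i] = NULL;
    for (int i = 0; i < MAX_CSTATES; i++) {
        if (cstate_mask & (1u << i))
            t->enter[t->count++] = enter_cstate;
    }
}

/* Defensive idle entry: returns -1 instead of calling through NULL. */
static int idle(const struct idle_table *t)
{
    if (t->count == 0 || t->enter[0] == NULL)
        return -1;
    return t->enter[0]();
}
```

Without the check in `idle`, the zero-mask case is exactly the NULL dereference described above.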

That is just one example from current Linux. Another one would be my
development AMD box which, when it came out, wasn't on the market yet, so
guests would just refuse to boot at all, since they'd just say the CPUID is
unknown.

Overall, I used to be a big fan of -cpu host, but it's a maintainability 
nightmare. It can be great for testing stuff, so we should definitely keep it 
around. But after thinking about it again, I don't think it should be the 
default. The default should be something safe.


Alex



Re: [Qemu-devel] [PATCH 2/2] KVM: Use -cpu best as default on x86

2012-01-16 Thread Ryan Harper
* Alexander Graf ag...@suse.de [2012-01-16 13:37]:
 
 On 16.01.2012, at 20:30, Ryan Harper wrote:
 
  * Alexander Graf ag...@suse.de [2012-01-08 17:53]:
  When running QEMU without -cpu parameter, the user usually wants a sane
  default. So far, we're using the qemu64/qemu32 CPU type, which basically
  means the maximum TCG can emulate.
  
  it also means we all maximum possible migration targets.  Have you
  given any thought to migration with -cpu best? 
 
 If you have the same boxes in your cluster, migration just works. If
 you don't, you usually use a specific CPU model that is the least
 dominator between your boxes either way.

Sure, but the idea behind -cpu best is to not have to figure that out;
you had suggested that the qemu64/qemu32 were just related to TCG, and
what I'm suggesting is that it's also the most compatible w.r.t
migration.  

it sounds like if migration is a requirement, then -cpu best probably
isn't something that would be used.  I suppose I'm OK with that, or at
least I don't have a better suggestion on how to carefully push up the
capabilities without at some point breaking migration.

 
 The current kvm64 type is broken. Libgmp just abort()s when we pass it
 in. So anything is better than what we do today on AMD hosts :).

I wonder if it breaks with Cyrix CPUs... other tools tend to do runtime
detection (mplayer).


 
 
 Alex

-- 
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
ry...@us.ibm.com



Re: [Qemu-devel] [PATCH 2/2] KVM: Use -cpu best as default on x86

2012-01-16 Thread Ryan Harper
* Alexander Graf ag...@suse.de [2012-01-16 13:52]:
 
 On 16.01.2012, at 20:46, Ryan Harper wrote:
 
  * Alexander Graf ag...@suse.de [2012-01-16 13:37]:
  
  On 16.01.2012, at 20:30, Ryan Harper wrote:
  
  * Alexander Graf ag...@suse.de [2012-01-08 17:53]:
  When running QEMU without -cpu parameter, the user usually wants a sane
  default. So far, we're using the qemu64/qemu32 CPU type, which basically
  means the maximum TCG can emulate.
  
  it also means we all maximum possible migration targets.  Have you
  given any thought to migration with -cpu best? 
  
  If you have the same boxes in your cluster, migration just works. If
  you don't, you usually use a specific CPU model that is the least
  dominator between your boxes either way.
  
  Sure, but the idea behind -cpu best is to not have to figure that out;
  you had suggested that the qemu64/qemu32 were just related to TCG, and
  what I'm suggesting is that it's also the most compatible w.r.t
  migration.  
 
 Then the most compatible wrt migration is -cpu kvm64 / kvm32.
 
  it sounds like if migration is a requirement, then -cpu best probably
  isn't something that would be used.  I suppose I'm OK with that, or at
  least I don't have a better suggestion on how to carefully push up the
  capabilities without at some point breaking migration.
 
 Yes, if you're interested in migration, then you're almost guaranteed to have
 a toolstack on top that has knowledge of your whole cluster and can do the
 lowest common denominator detection over all of your nodes. On the QEMU level
 we don't know anything about other machines.
 
  
  
  The current kvm64 type is broken. Libgmp just abort()s when we pass it
  in. So anything is better than what we do today on AMD hosts :).
  
  I wonder if it breaks with Cyris cpus... other tools tend to do runtime
  detection (mplayer).
 
 It probably does :). But then again those don't do KVM, do they?

not following; mplayer issues SSE2, 3 and 4 instructions to see what
works to figure out how to optimize; it doesn't care if the cpu is
called QEMU64 or Cyrix or AMD.  I'm not saying that we can't do better
than qemu64 w.r.t best cpu to select by default, but there are plenty of
applications that want to optimize their code based on what's available,
but this is done via code execution instead of string comparison.
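
One common form of runtime detection reads the CPUID feature bits rather than the CPU name. The sketch below only decodes register values that the caller has already fetched from CPUID leaf 1 (with GCC or clang, for example, through `__get_cpuid` from `<cpuid.h>`); the struct and function names are hypothetical, the bit positions are the architectural ones.

```c
#include <stdbool.h>
#include <stdint.h>

struct sse_support {
    bool sse2, sse3, sse4_1, sse4_2;
};

/* Decode SSE-related feature bits from CPUID leaf 1.
 * EDX bit 26: SSE2, ECX bit 0: SSE3, ECX bit 19: SSE4.1, ECX bit 20: SSE4.2. */
static struct sse_support decode_sse(uint32_t ecx, uint32_t edx)
{
    struct sse_support s;

    s.sse2   = (edx >> 26) & 1u;
    s.sse3   = (ecx >> 0)  & 1u;
    s.sse4_1 = (ecx >> 19) & 1u;
    s.sse4_2 = (ecx >> 20) & 1u;
    return s;
}
```

Code written this way does not depend on the vendor string or the family/model numbers.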

 
 
 Alex
 

-- 
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
ry...@us.ibm.com



Re: [Qemu-devel] [PATCH 2/2] KVM: Use -cpu best as default on x86

2012-01-16 Thread Alexander Graf

On 16.01.2012, at 21:13, Ryan Harper wrote:

 * Alexander Graf ag...@suse.de [2012-01-16 13:52]:
 
 On 16.01.2012, at 20:46, Ryan Harper wrote:
 
 * Alexander Graf ag...@suse.de [2012-01-16 13:37]:
 
 On 16.01.2012, at 20:30, Ryan Harper wrote:
 
 * Alexander Graf ag...@suse.de [2012-01-08 17:53]:
 When running QEMU without -cpu parameter, the user usually wants a sane
 default. So far, we're using the qemu64/qemu32 CPU type, which basically
 means the maximum TCG can emulate.
 
 it also means we all maximum possible migration targets.  Have you
 given any thought to migration with -cpu best? 
 
 If you have the same boxes in your cluster, migration just works. If
 you don't, you usually use a specific CPU model that is the least
 dominator between your boxes either way.
 
 Sure, but the idea behind -cpu best is to not have to figure that out;
 you had suggested that the qemu64/qemu32 were just related to TCG, and
 what I'm suggesting is that it's also the most compatible w.r.t
 migration.  
 
 The, the most compatible wrt migration is -cpu kvm64 / kvm32.
 
 it sounds like if migration is a requirement, then -cpu best probably
 isn't something that would be used.  I suppose I'm OK with that, or at
 least I don't have a better suggestion on how to carefully push up the
 capabilities without at some point breaking migration.
 
 Yes, if you're interested in migration, then you're almost guaranteed to 
 have a toolstack on top that has knowledge of your whole cluster and can do 
 the least dominator detection over all of your nodes. On the QEMU level we 
 don't know anything about other machines.
 
 
 
 The current kvm64 type is broken. Libgmp just abort()s when we pass it
 in. So anything is better than what we do today on AMD hosts :).
 
 I wonder if it breaks with Cyris cpus... other tools tend to do runtime
 detection (mplayer).
 
 It probably does :). But then again those don't do KVM, do they?
 
 not following; mplayer issues SSE2, 3 and 4 instructions to see what
 works to figure out how to optimize; it doesn't care if the cpu is
 called QEMU64 or Cyrus or AMD.  I'm not saying that we can't do better
 than qemu64 w.r.t best cpu to select by default, but there are plenty of
 applications that want to optimize their code based on what's available,
 but this is done via code execution instead of string comparision.

The problem with -cpu kvm64 is that we choose a family/model that doesn't exist 
in the real world, and then glue AuthenticAMD or GenuineIntel in the vendor 
string. Libgmp checks for existing CPUs, finds that this CPU doesn't match any 
real world IDs and abort()s.

The problem is that there is not a single CPU on this planet in silicon that 
has the same model+family numbers, but exists in AuthenticAMD _and_ 
GenuineIntel flavors. We need to pass the host vendor in though, because the 
guest uses it to detect if it should execute SYSCALL or SYSENTER, because Intel
and AMD screwed up heavily on that one.
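
For reference, this is roughly the information a library has to work with when it matches against known CPUs. The sketch decodes values the caller has already read from CPUID leaf 0 (vendor string in EBX, EDX, ECX) and leaf 1 (signature in EAX). The helper names are hypothetical, and the extended-model rule follows the Intel convention (applied for base family 6 and 15).

```c
#include <stdint.h>

/* Assemble the 12-character vendor string from CPUID leaf 0.
 * Register order is EBX, EDX, ECX, lowest byte first. */
static void cpuid_vendor(uint32_t ebx, uint32_t edx, uint32_t ecx, char out[13])
{
    const uint32_t regs[3] = { ebx, edx, ecx };

    for (int r = 0; r < 3; r++)
        for (int b = 0; b < 4; b++)
            out[r * 4 + b] = (char)((regs[r] >> (8 * b)) & 0xff);
    out[12] = '\0';
}

struct cpu_sig {
    unsigned family, model, stepping;
};

/* Decode the CPUID leaf 1 EAX signature into family/model/stepping. */
static struct cpu_sig decode_signature(uint32_t eax)
{
    struct cpu_sig s;
    unsigned base_family = (eax >> 8) & 0xf;

    s.stepping = eax & 0xf;
    s.family = base_family;
    if (base_family == 0xf)
        s.family += (eax >> 20) & 0xff;          /* extended family */
    s.model = (eax >> 4) & 0xf;
    if (base_family == 0x6 || base_family == 0xf)
        s.model |= ((eax >> 16) & 0xf) << 4;     /* extended model */
    return s;
}
```

A consumer that keeps a table of known (vendor, family, model) triples will not find a synthetic combination in it, whichever vendor string is glued on.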

It's not about feature flags which is what mplayer uses. Those are fine :).

Alex



Re: [Qemu-devel] [PATCH 2/2] KVM: Use -cpu best as default on x86

2012-01-16 Thread Ryan Harper
* Alexander Graf ag...@suse.de [2012-01-16 14:52]:
 
 On 16.01.2012, at 21:13, Ryan Harper wrote:
 
  * Alexander Graf ag...@suse.de [2012-01-16 13:52]:
  
  On 16.01.2012, at 20:46, Ryan Harper wrote:
  
  * Alexander Graf ag...@suse.de [2012-01-16 13:37]:
  
  On 16.01.2012, at 20:30, Ryan Harper wrote:
  
  * Alexander Graf ag...@suse.de [2012-01-08 17:53]:
  When running QEMU without -cpu parameter, the user usually wants a sane
  default. So far, we're using the qemu64/qemu32 CPU type, which 
  basically
  means the maximum TCG can emulate.
  
  it also means we all maximum possible migration targets.  Have you
  given any thought to migration with -cpu best? 
  
  If you have the same boxes in your cluster, migration just works. If
  you don't, you usually use a specific CPU model that is the least
  dominator between your boxes either way.
  
  Sure, but the idea behind -cpu best is to not have to figure that out;
  you had suggested that the qemu64/qemu32 were just related to TCG, and
  what I'm suggesting is that it's also the most compatible w.r.t
  migration.  
  
  The, the most compatible wrt migration is -cpu kvm64 / kvm32.
  
  it sounds like if migration is a requirement, then -cpu best probably
  isn't something that would be used.  I suppose I'm OK with that, or at
  least I don't have a better suggestion on how to carefully push up the
  capabilities without at some point breaking migration.
  
  Yes, if you're interested in migration, then you're almost guaranteed to 
  have a toolstack on top that has knowledge of your whole cluster and can 
  do the least dominator detection over all of your nodes. On the QEMU level 
  we don't know anything about other machines.
  
  
  
  The current kvm64 type is broken. Libgmp just abort()s when we pass it
  in. So anything is better than what we do today on AMD hosts :).
  
  I wonder if it breaks with Cyris cpus... other tools tend to do runtime
  detection (mplayer).
  
  It probably does :). But then again those don't do KVM, do they?
  
  not following; mplayer issues SSE2, 3 and 4 instructions to see what
  works to figure out how to optimize; it doesn't care if the cpu is
  called QEMU64 or Cyrus or AMD.  I'm not saying that we can't do better
  than qemu64 w.r.t best cpu to select by default, but there are plenty of
  applications that want to optimize their code based on what's available,
  but this is done via code execution instead of string comparision.
 
 The problem with -cpu kvm64 is that we choose a family/model that
 doesn't exist in the real world, and then glue AuthenticAMD or
 GenuineIntel in the vendor string. Libgmp checks for existing CPUs,
 finds that this CPU doesn't match any real world IDs and abort()s.
 
 The problem is that there is not a single CPU on this planet in
 silicon that has the same model+family numbers, but exists in
 AuthenticAMD _and_ GenuineIntel flavors. We need to pass the host
 vendor in though, because the guest uses it to detect if it should
 execute SYSCALL or SYSENTER, because intel and amd screwed up heavily
 on that one.

I forgot about this one.  =(


-- 
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
ry...@us.ibm.com



Re: [RFC][PATCH] KVM: perf: a smart tool to analyse kvm events

2012-01-16 Thread David Ahern


On 01/16/2012 02:30 AM, Xiao Guangrong wrote:
 This tool is much like xenoprof (if I remember correctly) and traces kvm events
 smartly. Currently, it supports vmexit/mmio/ioport events.
 
 Usage:
 - to trace kvm events:
 # ./perf kvm-events record
 
 - show the result
 # ./perf kvm-events report
 
 Some sample output follows:
 # ./perf kvm-events report
   Warning: Error: expected type 5 but read 4
   Warning: Error: expected type 5 but read 0
   Warning: unknown op '}'

Integrating the trace-cmd plugins into perf will remedy the above errors:
https://lkml.org/lkml/2011/8/16/352

Unfortunately, that effort is stalled at the moment.

 
 
 Analyze events for all VCPUs:
 
 VM-EXIT              Samples  Samples%   Time%    Avg time
 
 APIC_ACCESS           438107    44.89%   6.20%     17.91us
 EXTERNAL_INTERRUPT    219226    22.46%   8.01%     46.20us
 IO_INSTRUCTION        122651    12.57%   1.88%     19.44us
 EPT_VIOLATION          83110     8.52%   1.36%     20.75us
 PENDING_INTERRUPT      37055     3.80%   0.16%      5.38us
 CPUID                  32718     3.35%   0.08%      3.15us
 EXCEPTION_NMI          23601     2.42%   0.17%      8.87us
 HLT                    15424     1.58%  82.12%   6735.06us
 CR_ACCESS               4089     0.42%   0.02%      6.08us
 
 Total Samples:975981, Total events handled time:126502464.88us.

Have you thought about dumping a time history -- something similar to
what perf-script can do with dumping events but adding in kvm-specific
analysis like what you are doing in these examples?

David


 
 The default event to be analysed is vmexit, we can use --event to specify it,
 for example, if we want to trace mmio event:
 # ./perf kvm-events report --event mmio
   Warning: Error: expected type 5 but read 4
   Warning: Error: expected type 5 but read 0
   Warning: unknown op '}'
 
 
 Analyze events for all VCPUs:
 
 MMIO Access          Samples  Samples%   Time%    Avg time
 
 0xfee00380:W          196589    64.95%  70.01%      3.83us
 0xfee00310:W           35356    11.68%   6.48%      1.97us
 0xfee00300:W           35356    11.68%  16.37%      4.97us
 0xfee00300:R           35356    11.68%   7.14%      2.17us
 
 Total Samples:302657, Total events handled time:1074746.01us.
 
 We can use --vcpu to specify which vcpu is traced:
 [root@localhost perf]# ./perf kvm-events report --event mmio --vcpu 1
   Warning: Error: expected type 5 but read 4
   Warning: Error: expected type 5 but read 0
   Warning: unknown op '}'
 
 
 Analyze events for VCPU 1:
 
 MMIO Access          Samples  Samples%   Time%    Avg time
 
 0xfee00380:W           58041    71.20%  74.90%      3.70us
 0xfee00310:W            7826     9.60%   5.28%      1.93us
 0xfee00300:W            7826     9.60%  13.82%      5.06us
 0xfee00300:R            7826     9.60%   6.01%      2.20us
 
 Total Samples:81519, Total events handled time:286577.81us.
 
 '--key' selects the sort order. Possible values are sample (the default, which
 sorts by sample count) and time (which sorts by time%):
 # ./perf kvm-events report --key time
   Warning: Error: expected type 5 but read 4
   Warning: Error: expected type 5 but read 0
   Warning: unknown op '}'
 
 
 Analyze events for all VCPUs:
 
 VM-EXIT              Samples  Samples%   Time%    Avg time
 
 HLT                    15424     1.58%  82.12%   6735.06us
 EXTERNAL_INTERRUPT    219226    22.46%   8.01%     46.20us
 APIC_ACCESS           438107    44.89%   6.20%     17.91us
 IO_INSTRUCTION        122651    12.57%   1.88%     19.44us
 EPT_VIOLATION          83110     8.52%   1.36%     20.75us
 EXCEPTION_NMI          23601     2.42%   0.17%      8.87us
 PENDING_INTERRUPT      37055     3.80%   0.16%      5.38us
 CPUID                  32718     3.35%   0.08%      3.15us
 CR_ACCESS               4089     0.42%   0.02%      6.08us
 
 Total Samples:975981, Total events handled time:126502464.88us.
 
 I hope guys will like it and any comments are welcome! :)
 


Re: [PATCH] vhost-net: add module alias (v2.1)

2012-01-16 Thread David Miller
From: Stephen Hemminger shemmin...@vyatta.com
Date: Mon, 16 Jan 2012 07:52:36 -0800

 On Mon, 16 Jan 2012 12:26:45 +
 Alan Cox a...@linux.intel.com wrote:
 
   ACKs, NACKs?  What is happening here?
  
  I would like an Ack from Alan Cox who switched vhost-net
  to a dynamic minor in the first place, in commit
  79907d89c397b8bc2e05b347ec94e928ea919d33.
 
 Sorry dev...@lanana.org isn't yet back from the kernel hack incident.
 
 I don't read netdev so someone needs to summarise the issue and send me
 a copy of the patch to look at.
 
 Alan
 
 Subject: vhost-net: add module alias (v2.1)
 
 By adding some module aliases, programs (or users) won't have to explicitly
 call modprobe. Vhost-net will always be available if built into the kernel.
 It does require assigning a permanent minor number for depmod to work.
 
 Also:
   - use C99 style initialization.
   - add missing entry in documentation for loop-control
 
 Signed-off-by: Stephen Hemminger shemmin...@vyatta.com

I already applied your first patch, so you need to give me something
relative to apply on top of your original one.

And it also shows that you're really not generating these patches
against current 'net', otherwise you'd have noticed your other patch
already there.


Re: [PATCH] vhost-net: add module alias (v2.1)

2012-01-16 Thread Alan Cox
 Also:
   - use C99 style initialization.
   - add missing entry in documentation for loop-control
 
 Signed-off-by: Stephen Hemminger shemmin...@vyatta.com

For the device allocation

Acked-by: Alan Cox devi...@lanana.org


Re: [PATCH 2/3] KVM: improve trace events of vmexit/mmio/ioport

2012-01-16 Thread Xiao Guangrong
On 01/16/2012 05:38 PM, Avi Kivity wrote:

 On 01/16/2012 11:32 AM, Xiao Guangrong wrote:
 - trace vcpu_id for these events
 
 We can infer the vcpu id from the kvm_entry tracepoints, no?
 


Thanks for your review, Avi!

Hmm, I think it is hard to do, since the vcpu thread can be scheduled at any
time. One example is as follows:

CPU 0

kvm_entry vcpu 0
..
kvm_entry vcpu 1
..
event1 occurs
..
event2 occurs

It is hard to know which kvm_entry each event belongs to.
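
To make the ambiguity concrete, here is a minimal Python sketch (not part of
the patch) of the obvious heuristic: attribute each event to the vcpu named by
the most recent kvm_entry seen on the same physical CPU. The tuple layout and
the function name are assumptions made purely for illustration.

```python
def attribute_by_last_entry(trace):
    """Guess each event's vcpu from the latest kvm_entry on the same CPU.

    trace is a list of (cpu, event_name, vcpu_id_or_None) tuples; only
    kvm_entry carries a vcpu id, mirroring the existing tracepoints.
    """
    last_vcpu = {}   # physical cpu -> vcpu id of the latest kvm_entry
    guesses = []
    for cpu, name, vcpu in trace:
        if name == "kvm_entry":
            last_vcpu[cpu] = vcpu
        else:
            guesses.append((name, last_vcpu.get(cpu)))
    return guesses


# The interleaving from the example above, all on CPU 0.
TRACE = [
    (0, "kvm_entry", 0),
    (0, "kvm_entry", 1),
    (0, "event1", None),
    (0, "event2", None),
]

print(attribute_by_last_entry(TRACE))
```

The heuristic assigns both events to vcpu 1, but nothing in the trace confirms
that guess; recording vcpu_id in the events themselves removes the guesswork.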

 - add kvm_mmio_done to trace the time when mmio/ioport emulation is completed
 
 ditto?
 


I think it is OK to get the event end time by using kvm_entry.

 
 Relying on the existing tracepoints will make the tool work on older
 kernels.
 


We can drop all the new events, but unfortunately the information in the original
tracepoints is not enough: at least vcpu_id needs to be traced in these events
so that each event can be matched to its vcpu. Yes?



Re: [PATCH 3/3] KVM: perf: kvm events analysis tool

2012-01-16 Thread Xiao Guangrong
On 01/16/2012 06:04 PM, Avi Kivity wrote:

 On 01/16/2012 11:32 AM, Xiao Guangrong wrote:
 Add 'perf kvm-events' support to analyze kvm vmexit/mmio/ioport smartly

 Usage:
  perf kvm-events record
 
 Why not 'perf record -e kvm'?


It works, many perf tools have this style, like:
perf lock record,
perf sched record,
perf kmem record,

I think the reason is to enable only the tracepoints we need, which avoids
unnecessary overhead and makes the result more accurate.

 
  perf kvm-events report
 
 
 
 +static const char *get_exit_reason(long isa, u64 exit_code)
 +{
 +	int table_size = ARRAY_SIZE(svm_exit_reasons);
 +	struct exit_reasons_table *table = svm_exit_reasons;
 +
 +
 +	if (isa == 1) {
 +		table = vmx_exit_reasons;
 +		table_size = ARRAY_SIZE(vmx_exit_reasons);
 +	}
 +
 +	while (table_size--) {
 +		if (table->exit_code == exit_code)
 +			return table->reason;
 
 ... table[exit_code] ...
 


Actually, this array is not indexed by exit_code, it means:
table[exit_code].exit_code != exit_code.
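
A small Python sketch of the difference, assuming a sparse table of
(exit_code, reason) pairs. The handful of VMX codes below is only an
illustrative subset, and the function is a stand-in for the C loop in the
patch, not the patch itself.

```python
# Sparse (exit_code, reason) pairs: the position in the list is NOT the code.
VMX_EXIT_REASONS = [
    (0, "EXCEPTION_NMI"),
    (1, "EXTERNAL_INTERRUPT"),
    (10, "CPUID"),
    (12, "HLT"),
    (30, "IO_INSTRUCTION"),
    (48, "EPT_VIOLATION"),
]


def get_exit_reason(table, exit_code):
    """Linear search, because the table cannot be indexed by exit_code."""
    for code, reason in table:
        if code == exit_code:
            return reason
    raise KeyError("unknown kvm exit code: %d" % exit_code)


print(get_exit_reason(VMX_EXIT_REASONS, 12))
```

Indexing the list with 12 would fail here because the list has only six
entries; a table indexed directly by exit code would need one slot per
possible code, including all the unused ones.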

 +		table++;
 +	}
 +
 +	die("unkonw kvm exit code:%ld on %s\n", exit_code, isa == 1 ?
 +	    "VMX" : "SVM");
 
 unknown
 


..

 +
 +struct kvm_events_ops {
 +	bool (*is_begain_event)(struct event *event, void *data);
 
 begin
 


Sorry for my carelessness.

 +	bool (*is_end_event)(struct event *event);
 +	struct event_key (*get_key)(struct event *event, void *data);
 +	void (*decode_key)(struct event_key *key, char decode[20]);
 +	const char *name;
 +};
 +
 +
 +static struct event_key exit_event_get_key(struct event *event, void *data)
 +{
 +	struct event_key key;
 +
 +	key.key = raw_field_value(event, "exit_reason", data);
 +	key.info = raw_field_value(event, "isa", data);
 
 isa is not available on all kernel versions; need to fall back to
 /proc/cpuid detection.
 


Got it, will do it in the next version. Thanks Avi!



Re: [RFC][PATCH] KVM: perf: a smart tool to analyse kvm events

2012-01-16 Thread Xiao Guangrong
On 01/16/2012 06:11 PM, Avi Kivity wrote:


 Total Samples:975981, Total events handled time:126502464.88us.
 
 Nice!  If we can have a live version as well, this can replace kvm_stat.
 
 The average numbers are really high.  Like a factor of 3x-4x off.  Would
 be good to print the standard deviation and see why.  Maybe it's due to
 the tracing overhead.
 


It is a good suggestion, i will print stddev in the next version.
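
One way the standard deviation could be accumulated without storing every
sample is Welford's online algorithm. The sketch below is an assumption about
how such a statistic might be computed, not code from the tool; the class name
and the sample values are hypothetical.

```python
import math


class RunningStats(object):
    """Online mean / sample standard deviation (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0   # running sum of squared deviations from the mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def stddev(self):
        # Sample standard deviation; undefined for fewer than two samples.
        if self.n < 2:
            return 0.0
        return math.sqrt(self.m2 / (self.n - 1))


# Hypothetical per-exit handling times in microseconds.
stats = RunningStats()
for t in [3.1, 2.9, 3.3, 40.0, 3.0]:
    stats.add(t)
print("avg %.2fus stddev %.2fus" % (stats.mean, stats.stddev()))
```

A single slow exit among many fast ones inflates the average noticeably, and a
large stddev next to the average makes that visible in the report.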

 The default event to be analysed is vmexit, we can use --event to specify it,
 for example, if we want to trace mmio event:
 # ./perf kvm-events report --event mmio
   Warning: Error: expected type 5 but read 4
   Warning: Error: expected type 5 but read 0
   Warning: unknown op '}'


 Analyze events for all VCPUs:

 MMIO Access          Samples  Samples%   Time%    Avg time

 0xfee00380:W          196589    64.95%  70.01%      3.83us
 0xfee00310:W           35356    11.68%   6.48%      1.97us
 0xfee00300:W           35356    11.68%  16.37%      4.97us
 0xfee00300:R           35356    11.68%   7.14%      2.17us
 
 These are more reasonable (though still high - 5us for an ICR write?)
 


Hmm, maybe i need look into it...

 Total Samples:975981, Total events handled time:126502464.88us.

 I hope guys will like it and any comments are welcome! :)
 
 I think it's great!  A live version would be a nice addition too.
 
 Please copy the perf userspace maintainers to get more detailed review
 in the next version.
 


Okay, Thanks!



Re: [PATCH 3/3] KVM: perf: kvm events analysis tool

2012-01-16 Thread Xiao Guangrong
On 01/16/2012 06:08 PM, Stefan Hajnoczi wrote:

 On Mon, Jan 16, 2012 at 9:32 AM, Xiao Guangrong
 xiaoguangr...@linux.vnet.ibm.com wrote:
 +DESCRIPTION
 +---
 +You can analyze some crucial events and statistics with this
 +'perf kvm-events' command.
 
 This line is very general and does not explain which events/statistics
 can be collected or how you can use that information.  I suggest
 making this description more specific.  Explain that this subcommand
 observes kvm.ko tracepoints and annotates/decodes them with
 additional information (this is why I would use this command and not
 raw perf record -e kvm:\*).
 


Okay.

 +       { SVM_EXIT_MONITOR,                     "monitor" }, \
 +       { SVM_EXIT_MWAIT,                       "mwait" }, \
 +       { SVM_EXIT_XSETBV,                      "xsetbv" }, \
 +       { SVM_EXIT_NPF,                         "npf" }
 
 All this copy-paste could be avoided by sharing this stuff with the
 arch/x86/kvm/ code.
 


I will try to combine them in the next version.

 +static void exit_event_decode_key(struct event_key *key, char decode[20])
 +{
 +       const char *exit_reason = get_exit_reason(key->info, key->key);
 +
 +       memset(decode, 0, 20);
 +       strncpy(decode, exit_reason, 20);
 
 This is a bad pattern to follow when using strncpy(3) because if there
 was a strlen(exit_reason) == 20 string then decode[] would not be
 NUL-terminated.  Right now it's safe but it's better to just use
 strlcpy() and drop the memset(3).
 


Good point.

 +static void mmio_event_decode_key(struct event_key *key, char decode[20])
 +{
 +       memset(decode, 0, 20);
 +       sprintf(decode, "%#lx:%s", key->key,
 +               key->info == KVM_TRACE_MMIO_READ ? "R" : "W");
 
 Please drop the memset and use snprintf(3) instead of sprintf(3).  It
 places the NUL-terminator and ensures you don't exceed the buffer
 size.
 
 Same pattern below.
 


Will do, thanks Stefan! :)





Re: [RFC][PATCH] KVM: perf: a smart tool to analyse kvm events

2012-01-16 Thread Xiao Guangrong
On 01/17/2012 06:53 AM, David Ahern wrote:

 
 
 On 01/16/2012 02:30 AM, Xiao Guangrong wrote:
 This tool is much like xenoprof (if I remember correctly) and traces kvm events
 smartly. Currently, it supports vmexit/mmio/ioport events.

 Usage:
 - to trace kvm events:
 # ./perf kvm-events record

 - show the result
 # ./perf kvm-events report

 Some sample output follows:
 # ./perf kvm-events report
   Warning: Error: expected type 5 but read 4
   Warning: Error: expected type 5 but read 0
   Warning: unknown op '}'
 
 Integrating the trace-cmd plugins into perf will remedy the above errors:
 https://lkml.org/lkml/2011/8/16/352
 


Yes, it is great!

 Unfortunately, that effort is stalled at the moment.
 


 Analyze events for all VCPUs:

 VM-EXIT              Samples  Samples%   Time%    Avg time

 APIC_ACCESS           438107    44.89%   6.20%     17.91us
 EXTERNAL_INTERRUPT    219226    22.46%   8.01%     46.20us
 IO_INSTRUCTION        122651    12.57%   1.88%     19.44us
 EPT_VIOLATION          83110     8.52%   1.36%     20.75us
 PENDING_INTERRUPT      37055     3.80%   0.16%      5.38us
 CPUID                  32718     3.35%   0.08%      3.15us
 EXCEPTION_NMI          23601     2.42%   0.17%      8.87us
 HLT                    15424     1.58%  82.12%   6735.06us
 CR_ACCESS               4089     0.42%   0.02%      6.08us

 Total Samples:975981, Total events handled time:126502464.88us.
 
 Have you thought about dumping a time history -- something similar to
 what perf-script can do with dumping events but adding in kvm-specific
 analysis like what you are doing in these examples?
 


I will look into it and put it on my todo list if it is possible.
Thanks, David!



[PATCH 0/3] Add migration function and test for libvirt.

2012-01-16 Thread tangchen
Hi~

1. There is no migrate() function in class VM in libvirt_vm.py.
2. There are no tests for libvirt in client/tests/libvirt.

So, I would like to add some tests for libvirt.

Here are three patches:

1. Add a migrate() function for class VM in libvirt_vm.py, which 
   encapsulates virsh migrate command.
2. Add a tests directory in client/tests/libvirt/, and a test 
   for virsh migrate command.
3. Add configuration for this test.
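
As a rough illustration of what "encapsulating the virsh migrate command"
amounts to, here is a simplified Python sketch that builds the command line
from a params dict. It is not the code from patch 1/3: the function name, the
list-of-arguments return type and the example host name are assumptions, and
only a few of the options handled by the patch are covered.

```python
def build_migrate_command(params):
    """Build a 'virsh migrate' argument list from a dict of test params."""
    domain = params.get("domain")
    desturi = params.get("desturi")
    # The domain name and the destination URI are mandatory.
    if not domain or not desturi:
        raise ValueError("both 'domain' and 'desturi' are required")

    cmd = ["virsh", "migrate"]
    if params.get("live") == "yes":
        cmd.append("--live")
    if params.get("verbose") == "yes":
        cmd.append("--verbose")
    timeout = int(params.get("timeout", 0))
    if timeout > 0:
        cmd += ["--timeout", str(timeout)]
    cmd += ["--domain", domain, "--desturi", desturi]
    return cmd


# Hypothetical values; dest.example stands in for the destination host.
example = {
    "domain": "vm1",
    "desturi": "qemu+ssh://dest.example/system",
    "live": "yes",
    "timeout": "30",
}
print(" ".join(build_migrate_command(example)))
```

Returning a list rather than one concatenated string is a design choice of
this sketch; it avoids quoting problems if the result is handed to a process
launcher that accepts an argument vector.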

-- 
Best Regards,
Tang chen



Re: [Autotest] [PATCH 3/3] Add migration function and test for libvirt.

2012-01-16 Thread tangchen

This adds configuration for virsh migrate test.

Signed-off-by: Tang Chen tangc...@cn.fujitsu.com
---
 client/virt/subtests.cfg.sample |   13 +
 1 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/client/virt/subtests.cfg.sample b/client/virt/subtests.cfg.sample
index 843de30..97e62e9 100644
--- a/client/virt/subtests.cfg.sample
+++ b/client/virt/subtests.cfg.sample
@@ -323,6 +323,19 @@ variants:
         create_image_stg = yes
         image_size_stg = 10M
 
+    - virsh_migrate: install setup image_copy unattended_install.cdrom
+        type = virsh_migrate
+        live = yes
+        verbose = yes
+        desturi = qemu+ssh://<Destination Host IP>/system
+        destuser = <Destination Host Username>
+        destpwd = <Destination Host Password>
+        destip = <Destination Host IP>
+        destprompt = #
+        timeout = 30
+        variants:
+            - live:
+
     - migrate: install setup image_copy unattended_install.cdrom
         type = migration
         migration_test_command = help
-- 
1.7.3.1


-- 
Best Regards,
Tang chen


Re: [Autotest] [PATCH 1/3] Add migration function and test for libvirt.

2012-01-16 Thread tangchen

This patch adds a migrate() function for libvirt, which is an encapsulation of
the virsh migrate command.

Signed-off-by: Tang Chen tangc...@cn.fujitsu.com
---
 client/virt/libvirt_vm.py |  212 +
 1 files changed, 212 insertions(+), 0 deletions(-)

diff --git a/client/virt/libvirt_vm.py b/client/virt/libvirt_vm.py
index c825661..773e170 100644
--- a/client/virt/libvirt_vm.py
+++ b/client/virt/libvirt_vm.py
@@ -293,6 +293,36 @@ def virsh_domain_exists(name, uri = ""):
         logging.warning("VM %s does not exist", name)
         return False
 
+def virsh_migrate(name, migrate_cmd, params = None, uri = ""):
+    """
+    Migrate a guest to another host.
+
+    @param name: VM name
+    @param migrate_cmd: Migrate command to be executed
+    @param params: A dict containing VM params
+    """
+    uri_arg = ""
+    if uri:
+        uri_arg = "-c " + uri
+    destpwd = params.get("destpwd")
+    destuser = params.get("destuser")
+    try:
+        cmd = "virsh %s %s" % (uri_arg, migrate_cmd)
+        session = aexpect.ShellSession(cmd, linesep="\n")
+        virt_utils._remote_login(session, destuser, destpwd, "", timeout=60)
+        return True
+    # If migration succeeds, migrate command will terminate the session
+    # automatically. So we have to catch the LoginProcessTerminatedError.
+    except virt_utils.LoginProcessTerminatedError, e:
+        logging.info("%s", e)
+        session.close()
+        if str(e).find("status: 0") >= 0:
+            return True
+        else:
+            return False
+    except error.CmdError:
+        logging.error("Migrating VM %s failed", name)
+        return False
 
 class VM(virt_vm.BaseVM):
 
@@ -978,6 +1008,188 @@ class VM(virt_vm.BaseVM):
 fcntl.lockf(lockfile, fcntl.LOCK_UN)
 lockfile.close()
 
+    def __make_migrate_command(self, name=None, params=None):
+        """
+        Generate a migrate command line.
+
+        @param name: The name of the object
+        @param params: A dict containing VM params
+
+        @note: The params dict should contain:
+               live -- yes/no
+               method -- direct/p2p/p2p_tunnelled
+               persistent -- yes/no
+               undefinesource -- yes/no
+               suspend -- yes/no
+               copy-storage -- all/inc
+               change-protection -- yes/no
+               verbose -- /yes/no
+               domain -- VM name
+               desturi -- Destination URI
+               migrateuri -- Migration URI
+               dname -- VM name on destination
+               timeout -- timeout > 0
+               xml -- Path to xml file to be used on destination
+        """
+        def add_live():
+            return " --live"
+
+        def add_method(method):
+            if method == "direct":
+                return " --direct"
+            elif method == "p2p":
+                return " --p2p"
+            # Cannot use --tunnelled without --p2p.
+            elif method == "p2p_tunnelled":
+                return " --p2p --tunnelled"
+            else:
+                logging.warning("Unknown migrate method, using default.")
+                return ""
+
+        def add_persistent():
+            return " --persistent"
+
+        def add_undefinesource():
+            return " --undefinesource"
+
+        def add_suspend():
+            return " --suspend"
+
+        def add_copy_storage(copy_storage_mode):
+            if copy_storage_mode == "all":
+                return " --copy-storage-all"
+            elif copy_storage_mode == "inc":
+                return " --copy-storage-inc"
+            else:
+                logging.warning("Unknown copy storage mode, using default.")
+                return ""
+
+        def add_change_protection():
+            return " --change-protection"
+
+        def add_verbose():
+            return " --verbose"
+
+        # Domain name must be specified.
+        def add_domain(domain_name):
+            if virsh_domain_exists(domain_name):
+                return " --domain %s" % domain_name
+            else:
+                raise virt_vm.VMMigrateError("Wrong domain name.")
+
+        # Destination uri must be specified.
+        def add_desturi(desturi):
+            if desturi:
+                return " --desturi %s" % desturi
+            else:
+                raise virt_vm.VMMigrateError("Wrong destination uri.")
+
+        def add_migrateuri(migrateuri):
+            if migrateuri:
+                return " --migrateuri %s" % migrateuri
+            else:
+                return ""
+
+        def add_dname(dname):
+            if dname:
+                return " --dname %s" % dname
+            else:
+                return ""
+
+        def add_timeout(timeout):
+            if int(timeout) > 0:
+                return " --timeout %s" % timeout
+            else:
+                logging.warning("Invalid timeout value. Ignoring it.")
+                return ""
+
+def add_xml(xml_file):
+if os.path.isfile(xml_file):
+return  

Re: [Autotest] [PATCH 0/3] Add migration function and test for libvirt.

2012-01-16 Thread tangchen

This patch tests the virsh migrate command with --live parameter.

Signed-off-by: Tang Chen tangc...@cn.fujitsu.com
---
 client/tests/libvirt/tests/virsh_migrate.py |   33 +++
 1 files changed, 33 insertions(+), 0 deletions(-)
 create mode 100644 client/tests/libvirt/tests/virsh_migrate.py

diff --git a/client/tests/libvirt/tests/virsh_migrate.py b/client/tests/libvirt/tests/virsh_migrate.py
new file mode 100644
index 000..5576e78
--- /dev/null
+++ b/client/tests/libvirt/tests/virsh_migrate.py
@@ -0,0 +1,33 @@
+import re, os, logging, commands, shutil
+from autotest_lib.client.common_lib import utils, error
+from autotest_lib.client.virt import virt_vm, virt_utils, virt_env_process
+
+def run_virsh_migrate(test, params, env):
+    """
+    Test the migrate command with parameter --live.
+    """
+
+    vm_name = params.get("main_vm")
+    vm = env.get_vm(params["main_vm"])
+    vm.verify_alive()
+
+    destuser = params.get("destuser")
+    destpwd = params.get("destpwd")
+    destip = params.get("destip")
+    destprompt = params.get("destprompt")
+
+    # Migrate the guest.
+    ret = vm.migrate()
+    if ret == False:
+        raise error.TestFail("Migration of %s failed." % vm_name)
+
+    session = virt_utils.remote_login("ssh", destip, 22, destuser, destpwd, destprompt)
+    status, output = session.cmd_status_output("virsh domstate %s" % vm_name)
+    logging.info("Output of virsh domstate %s: %s" % (vm_name, output))
+
+    if status == 0 and output.find("running") >= 0:
+        logging.info("Running guest %s is found on destination." % vm_name)
+        session.cmd("virsh destroy %s" % vm_name)
+    else:
+        raise error.TestFail("Destination has no running guest named %s." % vm_name)
+
-- 
1.7.3.1


-- 
Best Regards,
Tang chen


Re: [RFC][PATCH] KVM: perf: a smart tool to analyse kvm events

2012-01-16 Thread David Ahern

On 01/16/2012 07:41 PM, Xiao Guangrong wrote:
 Have you thought about dumping a time history -- something similar to
 what perf-script can do with dumping events but adding in kvm-specific
 analysis like what you are doing in these examples?

 
 
 I will look into it and put it to my todo list if it is possible.
 Thanks, David!
 

I've played around with ways to do it as time (and motivation) allowed.
Attached is one example using perf with the trace-cmd plugin plus a
patch on perf-script to dump time between events:

perf record -e kvm:* -fo /tmp/perf.data -p 2540 -- sleep 1
perf script -i /tmp/perf.data

The output of perf-script is in the attached file. The 5th column is the
dt between successive events which is mainly a convenience.

A perf-kvm-events type command would allow more customization in the
output -- like correlating specific events and computing total time
between exit and entry accounting for HLT reasons -- as well as various
statistical dumps (average, stddev, max/min, histograms).
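
A rough Python sketch of that kind of correlation, pairing each kvm_exit with
the next kvm_entry and summing the time per exit reason. The tuple layout and
function name are assumptions; the sample timestamps are taken from the
thread 2542 lines of the attached trace, converted to microseconds.

```python
from collections import defaultdict


def exit_handling_times(events):
    """Return {reason: (count, total_us)} for exit -> next entry intervals.

    events is a time-ordered list of (timestamp_us, event_name, reason)
    tuples for a single vcpu thread; reason is None for kvm_entry.
    """
    totals = defaultdict(lambda: [0, 0])
    pending = None   # (timestamp, reason) of the exit awaiting re-entry
    for ts_us, name, reason in events:
        if name == "kvm_exit":
            pending = (ts_us, reason)
        elif name == "kvm_entry" and pending is not None:
            start, why = pending
            totals[why][0] += 1
            totals[why][1] += ts_us - start
            pending = None
    return dict((why, tuple(v)) for why, v in totals.items())


SAMPLE = [
    (20757662430, "kvm_entry", None),
    (20757662432, "kvm_exit", "IO_INSTRUCTION"),
    (20757662442, "kvm_entry", None),
    (20757662444, "kvm_exit", "HLT"),
    (20757666506, "kvm_entry", None),
    (20757666512, "kvm_exit", "IO_INSTRUCTION"),
    (20757666528, "kvm_entry", None),
]

for why, (count, total_us) in sorted(exit_handling_times(SAMPLE).items()):
    print("%-16s %d exits, %d us total" % (why, count, total_us))
```

In this sketch the HLT interval is mostly time the guest spent idle rather
than time spent handling the exit, so a real tool would likely want to report
it separately.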

David


qemu-kvm  2542 [001] 20757.662426  0.000001 kvm_cr: cr_write 3 = 0x6f3000
qemu-kvm  2542 [001] 20757.662430  0.000004 kvm_entry: vcpu 0
qemu-kvm  2542 [001] 20757.662432  0.000002 kvm_exit: reason IO_INSTRUCTION rip 0x806d8dbc info b008000b 0

qemu-kvm  2542 [001] 20757.662434  0.000002 kvm_emulate_insn: 0:806d8dbc: ed
qemu-kvm  2542 [001] 20757.662435  0.000001 kvm_pio: pio_read at 0xb008 size 4 count 1
qemu-kvm  2542 [001] 20757.662436  0.000001 kvm_userspace_exit: reason KVM_EXIT_IO (2)
qemu-kvm  2542 [001] 20757.662442  0.000006 kvm_entry: vcpu 0
qemu-kvm  2542 [001] 20757.662444  0.000002 kvm_exit: reason HLT rip 0xf770fd3d info 0 0

qemu-kvm  2540 [000] 20757.666479  0.004287 kvm_set_irq: gsi 9 level 1 source 0
qemu-kvm  2540 [000] 20757.666481  0.000002 kvm_pic_set_irq: chip 1 pin 1 (edge|masked)
qemu-kvm  2540 [000] 20757.666482  0.000001 kvm_apic_accept_irq: apicid 0 vec 177 (LowPrio|level)
qemu-kvm  2540 [000] 20757.666485  0.000003 kvm_ioapic_set_irq: pin 9 dst 1 vec=177 (LowPrio|logical|level)

qemu-kvm  2542 [001] 20757.666505  0.004061 kvm_inj_virq: irq 177
qemu-kvm  2542 [001] 20757.666506  0.000001 kvm_entry: vcpu 0
qemu-kvm  2542 [001] 20757.666512  0.000006 kvm_exit: reason IO_INSTRUCTION rip 0x806d88ca info b009 0

qemu-kvm  2542 [001] 20757.666516  0.000004 kvm_emulate_insn: 0:806d88ca: 66 ed
qemu-kvm  2542 [001] 20757.666517  0.000001 kvm_pio: pio_read at 0xb000 size 2 count 1
qemu-kvm  2542 [001] 20757.666519  0.000002 kvm_userspace_exit: reason KVM_EXIT_IO (2)
qemu-kvm  2542 [001] 20757.666528  0.000009 kvm_entry: vcpu 0
qemu-kvm  2542 [001] 20757.666531  0.000003 kvm_exit: reason IO_INSTRUCTION rip 0x806d88be info afe8 0

qemu-kvm  2542 [001] 20757.666534  0.000003 kvm_emulate_insn: 0:806d88be: ec
qemu-kvm  2542 [001] 20757.666535  0.000001 kvm_pio: pio_read at 0xafe0 size 1 count 1
qemu-kvm  2542 [001] 20757.666537  0.000002 kvm_userspace_exit: reason KVM_EXIT_IO (2)
qemu-kvm  2542 [001] 20757.666544  0.000007 kvm_entry: vcpu 0
qemu-kvm  2542 [001] 20757.666547  0.000003 kvm_exit: reason IO_INSTRUCTION rip 0x806d88be info afe10008 0

qemu-kvm  2542 [001] 20757.666550  0.000003 kvm_emulate_insn: 0:806d88be: ec
qemu-kvm  2542 [001] 20757.666551  0.000001 kvm_pio: pio_read at 0xafe1 size 1 count 1
qemu-kvm  2542 [001] 20757.666552  0.000001 kvm_userspace_exit: reason KVM_EXIT_IO (2)
qemu-kvm  2542 [001] 20757.666558  0.000006 kvm_entry: vcpu 0
qemu-kvm  2542 [001] 20757.666562  0.000004 kvm_exit: reason IO_INSTRUCTION rip 0x806d8934 info b001 0



Re: [PATCH v3] KVM: Move gfn_to_memslot() to kvm_host.h

2012-01-16 Thread Paul Mackerras
On Mon, Jan 16, 2012 at 02:30:41PM +0100, Alexander Graf wrote:

 Which tree is this against? I got this diff between your patch and the patch 
 when applied on my tree:

It's against your tree (previously) plus my first 13-patch series, but
not the second series of 5 patches, which is presumably why you got the
difference.  Sounds like you got it applied OK; if not let me know and
I'll rebase against your current tree.

Paul.


Host processes priority over guests

2012-01-16 Thread Romain LE DISEZ

Hello all,

we are currently evaluating a migration from Xen to KVM.

We host our VMs on a SAN, access to LUNs is done by the iSCSI protocol, 
all multipathed.


With Xen, we faced a problem when VMs were consuming a lot of CPU and a 
lot of I/O. In this situation, the Dom0 did not get enough CPU time to 
process the I/O requests through the iSCSI processes and to run the 
iscsid daemon. So, the Dom0 was always losing the iSCSI connection to 
the SAN because of timeout (node.conn[0].timeo.noop_out_timeout).


We solved it by dedicating a core to the Dom0, as recommended in the Xen 
FAQ.



Can this happen with KVM? Why can (or can't) it happen? If it could happen,
what is the recommended solution (renice, cgroups, ...)?


Thanks for your help.

--
Romain LE DISEZ


Re: [PULL 00/52] ppc patch queue 2012-01-13

2012-01-16 Thread Marcelo Tosatti
On Fri, Jan 13, 2012 at 03:31:03PM +0100, Alexander Graf wrote:
 Hi Avi,
 
 This is my current patch queue for ppc. Please pull.
 
 Alex
 
 
 The following changes since commit 188fc33198ddb1469562d40de33bcc29e7e2ed5f:
   Christian Borntraeger (1):
 kvm-s390: provide access guest registers via kvm_run
 
 are available in the git repository at:
 
   git://github.com/agraf/linux-2.6.git for-upstream

Pulled, thanks.

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [KVM PATCH 2/2] KVM: PPC: Book3S HV: Report stolen time to guest through dispatch trace log

2012-01-16 Thread Alexander Graf

On 20.12.2011, at 11:37, Paul Mackerras wrote:

 This adds code to measure stolen time per virtual core in units of
 timebase ticks, and to report the stolen time to the guest using the
 dispatch trace log (DTL).  The guest can register an area of memory
 for the DTL for a given vcpu.  The DTL is a ring buffer where KVM
 fills in one entry every time it enters the guest for that vcpu.
 
 Stolen time is measured as time when the virtual core is not running,
 either because the vcore is not runnable (e.g. some of its vcpus are
 executing elsewhere in the kernel or in userspace), or when the vcpu
 thread that is running the vcore is preempted.  This includes time
 when all the vcpus are idle (i.e. have executed the H_CEDE hypercall),
 which is OK because the guest accounts stolen time while idle as idle
 time.
 
 Each vcpu keeps a record of how much stolen time has been reported to
 the guest for that vcpu so far.  When we are about to enter the guest,
 we create a new DTL entry (if the guest vcpu has a DTL) and report the
 difference between total stolen time for the vcore and stolen time
 reported so far for the vcpu as the enqueue to dispatch time in the
 DTL entry.
 
 Signed-off-by: Paul Mackerras pau...@samba.org
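
The accounting described in the last paragraph can be sketched in a few lines
of Python. This is only a model of the bookkeeping under stated assumptions:
the class and attribute names are hypothetical, a plain list stands in for the
DTL ring buffer, and the real code works in timebase ticks inside the kernel.

```python
class VCore(object):
    """Total stolen time accumulated for one virtual core, in ticks."""

    def __init__(self):
        self.stolen_tb = 0


class VCpu(object):
    def __init__(self, vcore):
        self.vcore = vcore
        self.stolen_reported = 0   # ticks already reported to the guest
        self.dtl = []              # stand-in for the guest's DTL buffer

    def enter_guest(self):
        # Report only what this vcpu has not been told about yet.
        delta = self.vcore.stolen_tb - self.stolen_reported
        self.stolen_reported += delta
        self.dtl.append(delta)     # "enqueue to dispatch time" of the entry
        return delta


vcore = VCore()
a, b = VCpu(vcore), VCpu(vcore)
vcore.stolen_tb = 100
print(a.enter_guest())
vcore.stolen_tb = 150
print(b.enter_guest())
print(a.enter_guest())
```

Because each vcpu tracks its own reported total, two vcpus sharing a vcore
each eventually see the full stolen time of that vcore, which matches the
per-vcore measurement described above.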

This patch makes sense and looks good to me :)


Alex



Re: [PATCH v3] KVM: Move gfn_to_memslot() to kvm_host.h

2012-01-16 Thread Alexander Graf

On 13.01.2012, at 07:09, Paul Mackerras wrote:

 This moves __gfn_to_memslot() and search_memslots() from kvm_main.c to
 kvm_host.h to reduce the code duplication caused by the need for
 non-modular code in arch/powerpc/kvm/book3s_hv_rm_mmu.c to call
 gfn_to_memslot() in real mode.
 
 Rather than putting gfn_to_memslot() itself in a header, which would
 lead to increased code size, this puts __gfn_to_memslot() in a header.
 Then, the non-modular uses of gfn_to_memslot() are changed to call
 __gfn_to_memslot() instead.  This way there is only one place in the
 source code that needs to be changed should the gfn_to_memslot()
 implementation need to be modified.
 
 On powerpc, the Book3S HV style of KVM has code that is called from
 real mode which needs to call gfn_to_memslot() and thus needs this.
 (Module code is allocated in the vmalloc region, which can't be
 accessed in real mode.)
 
 With this, we can remove builtin_gfn_to_memslot() from book3s_hv_rm_mmu.c.
 
 Signed-off-by: Paul Mackerras pau...@samba.org

Avi?


Alex



Re: [PATCH v3] KVM: Move gfn_to_memslot() to kvm_host.h

2012-01-16 Thread Avi Kivity
On 01/16/2012 03:18 PM, Alexander Graf wrote:
 Avi?

ACK!

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH v3] KVM: Move gfn_to_memslot() to kvm_host.h

2012-01-16 Thread Alexander Graf

On 13.01.2012, at 07:09, Paul Mackerras wrote:

 This moves __gfn_to_memslot() and search_memslots() from kvm_main.c to
 kvm_host.h to reduce the code duplication caused by the need for
 non-modular code in arch/powerpc/kvm/book3s_hv_rm_mmu.c to call
 gfn_to_memslot() in real mode.
 
 Rather than putting gfn_to_memslot() itself in a header, which would
 lead to increased code size, this puts __gfn_to_memslot() in a header.
 Then, the non-modular uses of gfn_to_memslot() are changed to call
 __gfn_to_memslot() instead.  This way there is only one place in the
 source code that needs to be changed should the gfn_to_memslot()
 implementation need to be modified.
 
 On powerpc, the Book3S HV style of KVM has code that is called from
 real mode which needs to call gfn_to_memslot() and thus needs this.
 (Module code is allocated in the vmalloc region, which can't be
 accessed in real mode.)
 
 With this, we can remove builtin_gfn_to_memslot() from book3s_hv_rm_mmu.c.

Which tree is this against? I got this diff between your patch and the patch 
when applied on my tree:

-@@ -97,7 +78,7 @@ static void remove_revmap_chain(struct kvm *kvm, long pte_index,
-   rev = real_vmalloc_addr(&kvm->arch.revmap[pte_index]);
-   ptel = rev->guest_rpte;
+@@ -99,7 +80,7 @@ static void remove_revmap_chain(struct kvm *kvm, long pte_index,
+   rcbits = hpte_r & (HPTE_R_R | HPTE_R_C);
+   ptel = rev->guest_rpte |= rcbits;

Since this is completely unrelated to the actual change, I'll apply the patch 
either way. It'd just be interesting to know.


Alex



Re: [RFC PATCH 1/2] KVM: PPC: Book3S HV: Make virtual processor area registration more robust

2012-01-16 Thread Paul Mackerras
On Mon, Jan 16, 2012 at 02:04:29PM +0100, Alexander Graf wrote:
 
 On 20.12.2011, at 11:22, Paul Mackerras wrote:

  @@ -152,6 +152,8 @@ static unsigned long do_h_register_vpa(struct kvm_vcpu *vcpu,
  flags &= 7;
  if (flags == 0 || flags == 4)
 
 This could probably use a new variable name. Also, what do 0 and 4 mean? 
 Constant defines would be nice here.

Those constants are defined in PAPR as being a subfunction code
indicating what sort of area and whether it is to be registered or
unregistered.  I'll make up some names for them.
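
A sketch of what such names could look like. The names are hypothetical, and the bit layout (low two bits select the area, bit 2 means unregister) is an assumption that happens to be consistent with 0 and 4 being rejected; neither is confirmed in this thread.

```c
/* Hypothetical names; the thread does not say which names will be chosen. */
#define VPA_FUNC_MASK        7UL  /* the subfunction is a 3-bit field */
#define VPA_FUNC_AREA_MASK   3UL  /* assumed: low two bits select the area */
#define VPA_FUNC_UNREGISTER  4UL  /* assumed: bit 2 set means "unregister" */

/* Returns non-zero when the subfunction names no area (codes 0 and 4). */
static int vpa_subfunc_is_invalid(unsigned long flags)
{
	unsigned long subfunc = flags & VPA_FUNC_MASK;

	return (subfunc & VPA_FUNC_AREA_MASK) == 0;
}

/* Returns non-zero when the subfunction asks for an unregistration. */
static int vpa_subfunc_is_unregister(unsigned long flags)
{
	return ((flags & VPA_FUNC_MASK) & VPA_FUNC_UNREGISTER) != 0;
}
```

With helpers like these, the `flags == 0 || flags == 4` test becomes a single named check.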

 [pasted from real source]
  va = kvmppc_pin_guest_page(kvm, vpa, nb);
 
 Here you're pinning the page, setting va to that temporarily available 
 address.

Well, it's not just temporarily available, it's available until we
unpin it, since we increment the page count, which inhibits migration.

  len = *(unsigned int *)(va + 4);
 
 va + 4 isn't really descriptive. Is this a defined struct? Why not actually 
 define one which you can just read data from? Or at least make this a define 
 too. Reading random numbers in code is barely readable.

It's not really a struct, at least not one that is used for anything
else.  PAPR defines that the length of the buffer has to be placed in
the second 32-bit word at registration time.
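
Short of defining a struct, a named offset and a small accessor would already remove the bare `va + 4`. A minimal sketch, with a hypothetical macro and function name, reading the word in host byte order as the plain dereference in the patch does:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical name for the offset the patch writes as "va + 4". */
#define AREA_LEN_OFFSET 4  /* second 32-bit word of the buffer */

/*
 * Read the length word from a registered area in host byte order.
 * memcpy is used instead of a pointer cast so that the sketch has no
 * alignment or aliasing problems in portable C.
 */
static uint32_t area_length(const void *va)
{
	uint32_t len;

	memcpy(&len, (const unsigned char *)va + AREA_LEN_OFFSET, sizeof(len));
	return len;
}
```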

 
  +   free_va = va;
 
 Now free_va is the temporarily available address.
...
  +   free_va = tvcpu->arch.next_vpa;
  +   tvcpu->arch.next_vpa = va;
 
 Now you're setting next_vpa to this temporarily available address? But 
 next_vpa will be used after va is getting free'd, no? Or is that why you have 
 free_va?

Yes; here we are freeing any previously-set value of next_vpa.  The
idea of free_va is that it is initially set to va so that we correctly
unpin va if any error occurs.  But if there is no error, va gets put
into next_vpa and we free anything that was previously in next_vpa
instead.
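
Reduced to its essentials, the free_va idiom looks something like the sketch below. The struct, the `valid` flag, and the counting unpin hook are all hypothetical stand-ins so that the control flow can be shown in isolation; the real code pins and unpins guest pages.

```c
#include <stddef.h>

/* Hypothetical stand-in for the per-vcpu state; only next_vpa matters. */
struct vcpu_arch_sketch {
	void *next_vpa;
};

/* Hypothetical unpin hook that records calls so the effect is observable. */
static int unpin_calls;
static void *last_unpinned;

static void unpin_sketch(void *va)
{
	unpin_calls++;
	last_unpinned = va;
}

/*
 * va is a freshly pinned mapping.  'valid' stands in for whatever checks
 * the real code performs.  Returns 0 on success, -1 on error.
 */
static int register_area_sketch(struct vcpu_arch_sketch *arch, void *va,
				int valid)
{
	void *free_va = va;  /* default: release the new mapping on error */
	int err = -1;

	if (valid) {
		/* Success: keep va, release whatever was registered before. */
		free_va = arch->next_vpa;
		arch->next_vpa = va;
		err = 0;
	}

	if (free_va)
		unpin_sketch(free_va);
	return err;
}
```

There is a single unpin site at the end of the function, and free_va always names the one mapping that is no longer needed, whichever path was taken.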

 
 Wouldn't it be easier to just map it every time we actually use it and only 
 shove the GPA around? We could basically save ourselves a lot of the logic 
 here.

There are fields in the VPA that we really want to be able to access
from real mode, for instance the fields that indicate whether we need
to save the FPR and/or VR values.  As far as the DTL is concerned, we
could in fact use copy_to_user to access it, so it doesn't strictly
need to be pinned.  We don't currently use the slb_shadow buffer, but
if we did we would need to access it from real mode, since we would be
reading it in order to set up guest SLB entries.

The other thing is that the VPA registration/unregistration is only
done a few times in the life of the guest, whereas we use the VPAs
constantly while the guest is running.  So it is more efficient to do
more of the work at registration time to make it quicker to access the
VPAs.

I'll send revised patches.  There's a small change I want to make to
patch 2 to avoid putting a very large stolen time value in the first
entry that gets put in after the DTL is registered, which can happen
currently if the DTL gets registered some time after the guest started
running.

Paul.


Re: [PATCH v3] KVM: Move gfn_to_memslot() to kvm_host.h

2012-01-16 Thread Paul Mackerras
On Mon, Jan 16, 2012 at 02:30:41PM +0100, Alexander Graf wrote:

 Which tree is this against? I got this diff between your patch and the patch 
 when applied on my tree:

It's against your tree (previously) plus my first 13-patch series, but
not the second series of 5 patches, which is presumably why you got the
difference.  Sounds like you got it applied OK; if not let me know and
I'll rebase against your current tree.

Paul.