Re: [PATCHv3 4/4] acpi: automatically generated ssdt proc

2011-10-05 Thread Michael S. Tsirkin
On Tue, Oct 04, 2011 at 10:52:33PM -0400, Kevin O'Connor wrote:
 On Tue, Oct 04, 2011 at 03:26:19PM +0200, Michael S. Tsirkin wrote:
  Get rid of manually cut and pasted ssdt_proc,
  use ssdt compiled by iasl and offsets extracted
  by acpi_extract instead.
 
 Thanks - I like the idea of auto-generating the offsets.
 
 [...]
  +#define AmlCode static ssdp_proc_aml
  +#include ssdt-proc.hex
  +#undef AmlCode
 
 Side note - since you're post-processing the acpi data, it would be
 nice to update the name in the hex file too.

That would mean we will output hex as well, ignore the
hex produced by iasl. Sure, I can do that.
Another benefit would be that we can skip the ssdt
preamble generated by iasl in the hex we output.

  +/* 0x5B 0x83 ProcessorOp PkgLength NameString ProcID */
  +#define SD_OFFSET_CPUHEX (*ssdt_proc_name - *ssdt_proc_start + 2)
  +#define SD_OFFSET_CPUID1 (*ssdt_proc_name - *ssdt_proc_start + 4)
  +#define SD_OFFSET_CPUID2 (*ssdt_proc_id - *ssdt_proc_start)
  +#define SD_SIZEOF (*ssdt_proc_end - *ssdt_proc_start)
  +#define SD_PROC (ssdp_proc_aml + *ssdt_proc_start)
 [...]
   DefinitionBlock (ssdt-proc.aml, SSDT, 0x01, BXPC, BXSSDT, 0x1)
  -/*  v-- DO NOT EDIT --v */
   {
  +ACPI_EXTRACT_PROCESSOR_START ssdt_proc_start
  +ACPI_EXTRACT_PROCESSOR_END ssdt_proc_end
  +ACPI_EXTRACT_PROCESSOR_STRING ssdt_proc_name
   Processor (CPAA, 0xAA, 0xb010, 0x06) {
 
 Since the acpi.c code needs to know the processor object format
 anyway, what about making a generic ACPI_EXTRACT indicator that
 exports the location, size, and parameter location in one go.
 Something like:
 
 ACPI_EXTRACT ssdt_proc_obj
 Processor (CPAA, 0xAA, 0xb010, 0x06) {
 

I considered this, sure. We could parse AML to figure out
what is the object type we are trying to look up.

However I decided sanity-checking that we get the right type of object
in AML is better. This way if iasl output format breaks
we will have a better chance to detect that.
Makes sense?

Also this can not be as generic as it seems: each type of
object in AML bytecode is encoded slightly differently.
So we would still have types of objects we support
and types of object we don't.


 which would produce something like:
 
 static struct aml_object ssdt_proc_obj = {.addr=0x24, .size=0x40, 
 .param=0x28};

What is the param offset here?

I really think multiple arrays are better so we don't waste memory
on fields that we don't need and alignment.
I also dislike hardcoding field names. With a struct,
if you want to know where did 'param' come from you must
read python. My way, all names come from DSL source.


 As for the other parts of this patch series - I'm still leary of
 changing the DSDT dynamically.

Hmm, not sure I understand why. Could you explain pls?

  I'd be curious to see if we can add
 the following to ssdt-proc.dsl:
 
 ACPI_EXTRACT hotplug_obj
 Device (SL00) {
 ACPI_EXTRACT_NAME_DWORD_CONST hotplog_id
 Name (ID, 0xAABBCCDD)
 Name (_ADR, ID)
 Method (_EJ0, 1) { Return(PCEJ(ID)) }
 Name (_SUN, ID)
 }
 
 and then just memcpy the hotplug_obj N number of times into the ssdt
 for each available slot.  (This would be on top of the DSDT
 simplification patch series that I posted previously.)

This assumes all devices are the same. But unfortunately this will not
work for other devices such as VGA.

 
 -Kevin
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] qemu-kvm: Remove extboot support

2011-10-05 Thread Jan Kiszka
From: Jan Kiszka jan.kis...@siemens.com

Except for booting from SCSI via the LSI controller, SeaBIOS has native
support for mass storage interfaces now. And SCSI can be worked around
via [1] - or someone finally adds the necessary bits to SeaBIOS if there
is a real need.

[1] http://thread.gmane.org/gmane.comp.emulators.qemu/78467

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
Acked-by: Anthony Liguori aligu...@us.ibm.com
---
 .gitignore  |1 -
 Makefile|1 -
 Makefile.target |1 -
 blockdev.c  |   13 -
 blockdev.h  |2 -
 hw/extboot.c|  123 
 hw/pc.c |   22 --
 hw/pc.h |4 -
 pc-bios/optionrom/Makefile  |2 +-
 pc-bios/optionrom/extboot.S |  691 ---
 qemu-config.c   |4 -
 qemu-options.hx |4 +-
 12 files changed, 2 insertions(+), 866 deletions(-)
 delete mode 100644 hw/extboot.c
 delete mode 100644 pc-bios/optionrom/extboot.S

diff --git a/.gitignore b/.gitignore
index 625f28e..ea2bd8a 100644
--- a/.gitignore
+++ b/.gitignore
@@ -66,7 +66,6 @@ pc-bios/vgabios-pq/status
 pc-bios/optionrom/linuxboot.bin
 pc-bios/optionrom/multiboot.bin
 pc-bios/optionrom/multiboot.raw
-pc-bios/optionrom/extboot.bin
 pc-bios/optionrom/vapic.bin
 .stgit-*
 cscope.*
diff --git a/Makefile b/Makefile
index 936f130..f7f33a6 100644
--- a/Makefile
+++ b/Makefile
@@ -252,7 +252,6 @@ mpc8544ds.dtb \
 multiboot.bin linuxboot.bin \
 s390-zipl.rom \
 spapr-rtas.bin slof.bin
-BLOBS += extboot.bin
 BLOBS += vapic.bin
 else
 BLOBS=
diff --git a/Makefile.target b/Makefile.target
index 324b4f1..f84d8cb 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -229,7 +229,6 @@ obj-i386-y += mc146818rtc.o i8259.o pc.o
 obj-i386-y += cirrus_vga.o sga.o apic.o ioapic.o piix_pci.o
 obj-i386-y += vmport.o
 obj-i386-y += device-hotplug.o pci-hotplug.o smbios.o wdt_ib700.o
-obj-i386-y += extboot.o
 obj-i386-y += debugcon.o multiboot.o
 obj-i386-y += pc_piix.o
 obj-i386-$(CONFIG_KVM) += kvmclock.o
diff --git a/blockdev.c b/blockdev.c
index da74171..0827bf7 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -16,8 +16,6 @@
 #include sysemu.h
 #include block_int.h
 
-DriveInfo *extboot_drive = NULL;
-
 static QTAILQ_HEAD(drivelist, DriveInfo) drives = 
QTAILQ_HEAD_INITIALIZER(drives);
 
 static const char *const if_name[IF_COUNT] = {
@@ -237,7 +235,6 @@ DriveInfo *drive_init(QemuOpts *opts, int default_to_scsi)
 int on_read_error, on_write_error;
 const char *devaddr;
 DriveInfo *dinfo;
-int is_extboot = 0;
 int snapshot = 0;
 int ret;
 
@@ -356,12 +353,6 @@ DriveInfo *drive_init(QemuOpts *opts, int default_to_scsi)
 }
 }
 
-is_extboot = qemu_opt_get_bool(opts, boot, 0);
-if (is_extboot  extboot_drive) {
-fprintf(stderr, qemu: two bootable drives specified\n);
-return NULL;
-}
-
 on_write_error = BLOCK_ERR_STOP_ENOSPC;
 if ((buf = qemu_opt_get(opts, werror)) != NULL) {
 if (type != IF_IDE  type != IF_SCSI  type != IF_VIRTIO  type != 
IF_NONE) {
@@ -467,10 +458,6 @@ DriveInfo *drive_init(QemuOpts *opts, int default_to_scsi)
 strncpy(dinfo-serial, serial, sizeof(dinfo-serial) - 1);
 QTAILQ_INSERT_TAIL(drives, dinfo, next);
 
-if (is_extboot) {
-extboot_drive = dinfo;
-}
-
 bdrv_set_on_error(dinfo-bdrv, on_read_error, on_write_error);
 
 switch(type) {
diff --git a/blockdev.h b/blockdev.h
index 0a5144c..3587786 100644
--- a/blockdev.h
+++ b/blockdev.h
@@ -66,6 +66,4 @@ int do_drive_del(Monitor *mon, const QDict *qdict, QObject 
**ret_data);
 int do_snapshot_blkdev(Monitor *mon, const QDict *qdict, QObject **ret_data);
 int do_block_resize(Monitor *mon, const QDict *qdict, QObject **ret_data);
 
-extern DriveInfo *extboot_drive;
-
 #endif
diff --git a/hw/extboot.c b/hw/extboot.c
deleted file mode 100644
index d517834..000
--- a/hw/extboot.c
+++ /dev/null
@@ -1,123 +0,0 @@
-/*
- * Extended boot option ROM support.
- *
- * Copyright IBM, Corp. 2007
- *
- * Authors:
- *  Anthony Liguori   aligu...@us.ibm.com
- *
- * This work is licensed under the terms of the GNU GPL, version 2.  See
- * the COPYING file in the top-level directory.
- *
- */
-
-#include hw.h
-#include pc.h
-#include isa.h
-#include block.h
-
-/* Extended Boot ROM suport */
-
-union extboot_cmd
-{
-uint16_t type;
-struct {
-   uint16_t type;
-   uint16_t cylinders;
-   uint16_t heads;
-   uint16_t sectors;
-   uint64_t nb_sectors;
-} query_geometry;
-struct {
-   uint16_t type;
-   uint16_t nb_sectors;
-   uint16_t segment;
-   uint16_t offset;
-   uint64_t sector;
-} xfer;
-};
-
-static void get_translated_chs(BlockDriverState *bs, int *c, int *h, int *s)
-{
-bdrv_get_geometry_hint(bs, c, h, s);
-
-if (*c = 1024) {
-   *c = 0;
-   *h = 0;
-} else if (*c = 2048) {
-   *c 

Re: [Qemu-devel] [PATCH 2/2] LAPIC: make lapic support cpu hotplug

2011-10-05 Thread Jan Kiszka
On 2011-10-05 12:26, liu ping fan wrote:
   And make the creation of apic as part of cpu initialization, so
 apic's state has been ready, before setting kvm_apic.

 There is no kvm-apic upstream yet, so it's hard to judge why we need
 this here. If we do, this has to be a separate patch. But I seriously
 doubt we need it (my hack worked without it, and that was not because of
 its hack nature).

 Sorry, I did not explain it clearly. What I mean is that “env-apic_state”
 must be prepared
 before qemu_kvm_cpu_thread_fn() - ... - kvm_put_sregs(), where we get
 apic_base by
 “ sregs.apic_base = cpu_get_apic_base(env-apic_state);”
 and then call “kvm_vcpu_ioctl(env, KVM_SET_SREGS, sregs);” which will
 finally affect the
 kvm_apic structure in kernel.
 
 But as current code, in pc_new_cpu(), we call apic_init() to initialize
 apic_state, after cpu_init(),
 so we can not guarantee the order of apic_state initializaion and the
 setting to kernel.
 
 Because LAPIC is part of x86 chip, I want to move it into cpu_x86_init(),
 and ensure apic_init()
 called before thread “qemu_kvm_cpu_thread_fn()” creation.

The LAPIC is part of the CPU, the classic APIC was a dedicated chip.

For various reasons, a safer approach for creating a new CPU is to stop
the machine, add the new device models, run cpu_synchronize_post_init on
that new cpu (looks like you missed that) and then resume everything.
See
http://git.kiszka.org/?p=qemu-kvm.git;a=commitdiff;h=be8f21c6b54eac82f7add7ee9d4ecf9cb8ebb320

...
 diff --git a/hw/icc_bus.c b/hw/icc_bus.c
 new file mode 100644
 index 000..360ca2a
 --- /dev/null
 +++ b/hw/icc_bus.c
 @@ -0,0 +1,62 @@
 +/*
 +*/
 +#define ICC_BUS_PLUG
 +#ifdef ICC_BUS_PLUG
 +#include icc_bus.h
 +
 +
 +
 +struct icc_bus_info icc_info = {
 +.qinfo.name = icc,
 +.qinfo.size = sizeof(struct icc_bus),
 +.qinfo.props = (Property[]) {
 +DEFINE_PROP_END_OF_LIST(),
 +}
 +
 +};
 +
 +
 +static const VMStateDescription vmstate_icc_bus = {
 +.name = icc_bus,
 +.version_id = 1,
 +.minimum_version_id = 1,
 +.minimum_version_id_old = 1,
 +.pre_save = NULL,
 +.post_load = NULL,
 +};
 +
 +struct icc_bus *g_iccbus;
 +
 +struct icc_bus *icc_init_bus(DeviceState *parent, const char *name)
 +{
 +struct icc_bus *bus;
 +
 +bus = FROM_QBUS(icc_bus, qbus_create(icc_info.qinfo, parent,
 name));
 +bus-qbus.allow_hotplug = 1; /* Yes, we can */
 +bus-qbus.name = icc;
 +vmstate_register(NULL, -1, vmstate_icc_bus, bus);

 The chipset is the owner of this bus and instantiates it. So it also
 provides a vmstate. You can drop this unneeded one here (it's created
 via an obsolete API anyway).

 
 No familiar with Qemu bus emulation, keep on learning :) . But what I
 thought is,
 the x86-ICC bus is not the same as bus like PCI.
 For a PCI bus, it lies behind a host bridge, but ICC is shared by all x86
 processors in SMP system,
 so there is not a outstanding owner.  And I right?

ICC is also attached to the chipset (due to the IOAPIC). So it looks
reasonable to me to let the chipset do the lifecycle management as well.
It is the fixed point, CPUs may come and go.

Jan



signature.asc
Description: OpenPGP digital signature


[PATCH v2 5/9] perf, tools: Do guest-only counting in perf-kvm by default

2011-10-05 Thread Gleb Natapov
From: Joerg Roedel joerg.roe...@amd.com

Make use of exclude_guest and exlude_host in perf-kvm to do
only guest-only counting by default.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
Signed-off-by: Gleb Natapov g...@redhat.com
---
 tools/perf/builtin-kvm.c   |3 ++-
 tools/perf/util/event.c|8 
 tools/perf/util/event.h|2 ++
 tools/perf/util/evlist.c   |5 -
 tools/perf/util/parse-events.c |1 +
 5 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/tools/perf/builtin-kvm.c b/tools/perf/builtin-kvm.c
index 032324a..9b05afa 100644
--- a/tools/perf/builtin-kvm.c
+++ b/tools/perf/builtin-kvm.c
@@ -107,7 +107,8 @@ static int __cmd_buildid_list(int argc, const char **argv)
 
 int cmd_kvm(int argc, const char **argv, const char *prefix __used)
 {
-   perf_host = perf_guest = 0;
+   perf_host  = 0;
+   perf_guest = 1;
 
argc = parse_options(argc, argv, kvm_options, kvm_usage,
PARSE_OPT_STOP_AT_NON_OPTION);
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index 437f8ca..31a6d7f 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -44,6 +44,14 @@ static struct perf_sample synth_sample = {
.period= 1,
 };
 
+void event_attr_init(struct perf_event_attr *attr)
+{
+   if (!perf_host)
+   attr-exclude_host  = 1;
+   if (!perf_guest)
+   attr-exclude_guest = 1;
+}
+
 static pid_t perf_event__synthesize_comm(union perf_event *event, pid_t pid,
 int full, perf_event__handler_t 
process,
 struct perf_session *session)
diff --git a/tools/perf/util/event.h b/tools/perf/util/event.h
index 357a85b..e5f101e 100644
--- a/tools/perf/util/event.h
+++ b/tools/perf/util/event.h
@@ -150,6 +150,8 @@ typedef int (*perf_event__handler_t)(union perf_event 
*event,
 struct perf_sample *sample,
  struct perf_session *session);
 
+void event_attr_init(struct perf_event_attr *attr);
+
 int perf_event__synthesize_thread_map(struct thread_map *threads,
  perf_event__handler_t process,
  struct perf_session *session);
diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index 72e9f48..4773fbe 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -82,8 +82,11 @@ int perf_evlist__add_default(struct perf_evlist *evlist)
.type = PERF_TYPE_HARDWARE,
.config = PERF_COUNT_HW_CPU_CYCLES,
};
-   struct perf_evsel *evsel = perf_evsel__new(attr, 0);
+   struct perf_evsel *evsel;
+
+   event_attr_init(attr);
 
+   evsel = perf_evsel__new(attr, 0);
if (evsel == NULL)
goto error;
 
diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index 3b00775..620ba98 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -848,6 +848,7 @@ int parse_events(struct perf_evlist *evlist , const char 
*str, int unset __used)
for (;;) {
ostr = str;
memset(attr, 0, sizeof(attr));
+   event_attr_init(attr);
ret = parse_event_symbols(evlist, str, attr);
if (ret == EVT_FAILED)
return -1;
-- 
1.7.5.3

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 1/9] perf, core: Introduce attrs to count in either host or guest mode

2011-10-05 Thread Gleb Natapov
From: Joerg Roedel joerg.roe...@amd.com

The two new attributes exclude_guest and exclude_host can
bes used by user-space to tell the kernel to setup
performance counter to either only count while the CPU is in
guest or in host mode.
An additional check is also introduced to make sure
user-space does not try to exclude guest and host mode from
counting.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
Signed-off-by: Gleb Natapov g...@redhat.com
---
 include/linux/perf_event.h |5 -
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index c816075..1e9ebe5 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -220,7 +220,10 @@ struct perf_event_attr {
mmap_data  :  1, /* non-exec mmap data*/
sample_id_all  :  1, /* sample_type all events 
*/
 
-   __reserved_1   : 45;
+   exclude_host   :  1, /* don't count in host   */
+   exclude_guest  :  1, /* don't count in guest  */
+
+   __reserved_1   : 43;
 
union {
__u32   wakeup_events;/* wakeup every n events */
-- 
1.7.5.3

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 3/9] perf, tools: Add support for guest/host-only profiling

2011-10-05 Thread Gleb Natapov
From: Joerg Roedel joerg.roe...@amd.com

To restrict a counter to either host or guest mode this
patch introduces two new event modifiers: G and H.
With G the counter is configured in guest-only mode and with
H in host-only mode.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
Signed-off-by: Gleb Natapov g...@redhat.com
---
 tools/perf/util/parse-events.c |   14 --
 1 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index 928918b..3b00775 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -735,8 +735,8 @@ static int
 parse_event_modifier(const char **strp, struct perf_event_attr *attr)
 {
const char *str = *strp;
-   int exclude = 0;
-   int eu = 0, ek = 0, eh = 0, precise = 0;
+   int exclude = 0, exclude_GH = 0;
+   int eu = 0, ek = 0, eh = 0, eH = 0, eG = 0, precise = 0;
 
if (!*str)
return 0;
@@ -760,6 +760,14 @@ parse_event_modifier(const char **strp, struct 
perf_event_attr *attr)
if (!exclude)
exclude = eu = ek = eh = 1;
eh = 0;
+   } else if (*str == 'G') {
+   if (!exclude_GH)
+   exclude_GH = eG = eH = 1;
+   eG = 0;
+   } else if (*str == 'H') {
+   if (!exclude_GH)
+   exclude_GH = eG = eH = 1;
+   eH = 0;
} else if (*str == 'p') {
precise++;
} else
@@ -776,6 +784,8 @@ parse_event_modifier(const char **strp, struct 
perf_event_attr *attr)
attr-exclude_kernel = ek;
attr-exclude_hv = eh;
attr-precise_ip = precise;
+   attr-exclude_host   = eH;
+   attr-exclude_guest  = eG;
 
return 0;
 }
-- 
1.7.5.3

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 6/9] perf, intel: Use GO/HO bits in perf-ctr

2011-10-05 Thread Gleb Natapov
Intel does not have guest/host-only bit in perf counters like AMD
does.  To support GO/HO bits KVM needs to switch EVENTSELn values
(or PERF_GLOBAL_CTRL if available) at a guest entry. If a counter is
configured to count only in a guest mode it stays disabled in a host,
but VMX is configured to switch it to enabled value during guest entry.

This patch adds GO/HO tracking to Intel perf code and provides interface
for KVM to get a list of MSRs that need to be switched on a guest entry.

Only cpus with architectural PMU (v1 or later) are supported with this
patch.  To my knowledge there is not p6 models with VMX but without
architectural PMU and p4 with VMX are rare and the interface is general
enough to support them if need arise.

Signed-off-by: Gleb Natapov g...@redhat.com
---
 arch/x86/include/asm/perf_event.h  |   12 
 arch/x86/kernel/cpu/perf_event.c   |   14 +
 arch/x86/kernel/cpu/perf_event_intel.c |   90 ++-
 3 files changed, 113 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/perf_event.h 
b/arch/x86/include/asm/perf_event.h
index ce2bfb3..e47cb61 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -162,7 +162,19 @@ extern unsigned long perf_misc_flags(struct pt_regs *regs);
);  \
 }
 
+struct perf_guest_switch_msr {
+   unsigned msr;
+   u64 host, guest;
+};
+
+extern struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr);
 #else
+static inline perf_guest_switch_msr *perf_guest_get_msrs(int *nr)
+{
+   *nr = 0;
+   return NULL;
+}
+
 static inline void perf_events_lapic_init(void){ }
 #endif
 
diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index cfa62ec..531f5c6 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -119,6 +119,10 @@ struct cpu_hw_events {
struct perf_branch_stacklbr_stack;
struct perf_branch_entrylbr_entries[MAX_LBR_ENTRIES];
 
+   /* Intel host/guest exclude bits */
+   u64 intel_ctrl_guest_mask;
+   u64 intel_ctrl_host_mask;
+
/*
 * manage shared (per-core, per-cpu) registers
 * used on Intel NHM/WSM/SNB
@@ -129,6 +133,11 @@ struct cpu_hw_events {
 * AMD specific bits
 */
struct amd_nb   *amd_nb;
+
+   /*
+* Intel guest/host-only support
+*/
+   struct perf_guest_switch_msr guest_switch_msrs[X86_PMC_IDX_MAX];
 };
 
 #define __EVENT_CONSTRAINT(c, n, m, w) {\
@@ -292,6 +301,11 @@ struct x86_pmu {
 */
struct extra_reg *extra_regs;
unsigned int er_flags;
+
+   /*
+* Guest event support
+*/
+   struct perf_guest_switch_msr* (*guest_get_msrs)(int *nr);
 };
 
 #define ERF_NO_HT_SHARING  1
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c 
b/arch/x86/kernel/cpu/perf_event_intel.c
index f88af2c..e35b916 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -746,7 +746,8 @@ static void intel_pmu_enable_all(int added)
 
intel_pmu_pebs_enable_all();
intel_pmu_lbr_enable_all();
-   wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, x86_pmu.intel_ctrl);
+   wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL,
+   x86_pmu.intel_ctrl  ~cpuc-intel_ctrl_guest_mask);
 
if (test_bit(X86_PMC_IDX_FIXED_BTS, cpuc-active_mask)) {
struct perf_event *event =
@@ -869,6 +870,7 @@ static void intel_pmu_disable_fixed(struct hw_perf_event 
*hwc)
 static void intel_pmu_disable_event(struct perf_event *event)
 {
struct hw_perf_event *hwc = event-hw;
+   struct cpu_hw_events *cpuc = __get_cpu_var(cpu_hw_events);
 
if (unlikely(hwc-idx == X86_PMC_IDX_FIXED_BTS)) {
intel_pmu_disable_bts();
@@ -876,6 +878,9 @@ static void intel_pmu_disable_event(struct perf_event 
*event)
return;
}
 
+   cpuc-intel_ctrl_guest_mask = ~(1ull  hwc-idx);
+   cpuc-intel_ctrl_host_mask = ~(1ull  hwc-idx);
+
if (unlikely(hwc-config_base == MSR_ARCH_PERFMON_FIXED_CTR_CTRL)) {
intel_pmu_disable_fixed(hwc);
return;
@@ -921,6 +926,7 @@ static void intel_pmu_enable_fixed(struct hw_perf_event 
*hwc)
 static void intel_pmu_enable_event(struct perf_event *event)
 {
struct hw_perf_event *hwc = event-hw;
+   struct cpu_hw_events *cpuc = __get_cpu_var(cpu_hw_events);
 
if (unlikely(hwc-idx == X86_PMC_IDX_FIXED_BTS)) {
if (!__this_cpu_read(cpu_hw_events.enabled))
@@ -930,6 +936,11 @@ static void intel_pmu_enable_event(struct perf_event 
*event)
return;
}
 
+   if (event-attr.exclude_host)
+   cpuc-intel_ctrl_guest_mask |= (1ull  hwc-idx);
+   if (event-attr.exclude_guest)
+   cpuc-intel_ctrl_host_mask |= (1ull  

[PATCH v2 7/9] KVM, VMX: add support for switching of PERF_GLOBAL_CTRL

2011-10-05 Thread Gleb Natapov
Some cpus have special support for switching PERF_GLOBAL_CTRL msr.
Add logic to detect if such support exists and works properly and extend
msr switching code to use it if available. Also extend number of generic
msr switching entries to 8.

Signed-off-by: Gleb Natapov g...@redhat.com
---
 arch/x86/kvm/vmx.c |  104 ++-
 1 files changed, 93 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index f3ec38f..a938ddf 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -118,7 +118,7 @@ module_param(ple_gap, int, S_IRUGO);
 static int ple_window = KVM_VMX_DEFAULT_PLE_WINDOW;
 module_param(ple_window, int, S_IRUGO);
 
-#define NR_AUTOLOAD_MSRS 1
+#define NR_AUTOLOAD_MSRS 8
 #define VMCS02_POOL_SIZE 1
 
 struct vmcs {
@@ -622,6 +622,7 @@ static unsigned long *vmx_msr_bitmap_legacy;
 static unsigned long *vmx_msr_bitmap_longmode;
 
 static bool cpu_has_load_ia32_efer;
+static bool cpu_has_load_perf_global_ctrl;
 
 static DECLARE_BITMAP(vmx_vpid_bitmap, VMX_NR_VPIDS);
 static DEFINE_SPINLOCK(vmx_vpid_lock);
@@ -1191,15 +1192,34 @@ static void update_exception_bitmap(struct kvm_vcpu 
*vcpu)
vmcs_write32(EXCEPTION_BITMAP, eb);
 }
 
+static void clear_atomic_switch_msr_special(unsigned long entry,
+   unsigned long exit)
+{
+   vmcs_clear_bits(VM_ENTRY_CONTROLS, entry);
+   vmcs_clear_bits(VM_EXIT_CONTROLS, exit);
+}
+
 static void clear_atomic_switch_msr(struct vcpu_vmx *vmx, unsigned msr)
 {
unsigned i;
struct msr_autoload *m = vmx-msr_autoload;
 
-   if (msr == MSR_EFER  cpu_has_load_ia32_efer) {
-   vmcs_clear_bits(VM_ENTRY_CONTROLS, VM_ENTRY_LOAD_IA32_EFER);
-   vmcs_clear_bits(VM_EXIT_CONTROLS, VM_EXIT_LOAD_IA32_EFER);
-   return;
+   switch (msr) {
+   case MSR_EFER:
+   if (cpu_has_load_ia32_efer) {
+   clear_atomic_switch_msr_special(VM_ENTRY_LOAD_IA32_EFER,
+   VM_EXIT_LOAD_IA32_EFER);
+   return;
+   }
+   break;
+   case MSR_CORE_PERF_GLOBAL_CTRL:
+   if (cpu_has_load_perf_global_ctrl) {
+   clear_atomic_switch_msr_special(
+   VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL,
+   VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL);
+   return;
+   }
+   break;
}
 
for (i = 0; i  m-nr; ++i)
@@ -1215,18 +1235,44 @@ static void clear_atomic_switch_msr(struct vcpu_vmx 
*vmx, unsigned msr)
vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, m-nr);
 }
 
+static void add_atomic_switch_msr_special(unsigned long entry,
+   unsigned long exit, unsigned long guest_val_vmcs,
+   unsigned long host_val_vmcs, u64 guest_val, u64 host_val)
+{
+   vmcs_write64(guest_val_vmcs, guest_val);
+   vmcs_write64(host_val_vmcs, host_val);
+   vmcs_set_bits(VM_ENTRY_CONTROLS, entry);
+   vmcs_set_bits(VM_EXIT_CONTROLS, exit);
+}
+
 static void add_atomic_switch_msr(struct vcpu_vmx *vmx, unsigned msr,
  u64 guest_val, u64 host_val)
 {
unsigned i;
struct msr_autoload *m = vmx-msr_autoload;
 
-   if (msr == MSR_EFER  cpu_has_load_ia32_efer) {
-   vmcs_write64(GUEST_IA32_EFER, guest_val);
-   vmcs_write64(HOST_IA32_EFER, host_val);
-   vmcs_set_bits(VM_ENTRY_CONTROLS, VM_ENTRY_LOAD_IA32_EFER);
-   vmcs_set_bits(VM_EXIT_CONTROLS, VM_EXIT_LOAD_IA32_EFER);
-   return;
+   switch (msr) {
+   case MSR_EFER:
+   if (cpu_has_load_ia32_efer) {
+   add_atomic_switch_msr_special(VM_ENTRY_LOAD_IA32_EFER,
+   VM_EXIT_LOAD_IA32_EFER,
+   GUEST_IA32_EFER,
+   HOST_IA32_EFER,
+   guest_val, host_val);
+   return;
+   }
+   break;
+   case MSR_CORE_PERF_GLOBAL_CTRL:
+   if (cpu_has_load_perf_global_ctrl) {
+   add_atomic_switch_msr_special(
+   VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL,
+   VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL,
+   GUEST_IA32_PERF_GLOBAL_CTRL,
+   HOST_IA32_PERF_GLOBAL_CTRL,
+   guest_val, host_val);
+   return;
+   }
+   break;
}
 
for (i = 0; i  m-nr; ++i)
@@ -2455,6 +2501,42 @@ static __init int setup_vmcs_config(struct vmcs_config 
*vmcs_conf)
 allow_1_setting(MSR_IA32_VMX_EXIT_CTLS,
   VM_EXIT_LOAD_IA32_EFER);
 
+   

[PATCH v2 4/9] perf, tools: Fix copypaste error in perf-kvm option description

2011-10-05 Thread Gleb Natapov
From: Joerg Roedel joerg.roe...@amd.com

The --host option certainly enables host-data collection.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
Signed-off-by: Gleb Natapov g...@redhat.com
---
 tools/perf/builtin-kvm.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/tools/perf/builtin-kvm.c b/tools/perf/builtin-kvm.c
index 34d1e85..032324a 100644
--- a/tools/perf/builtin-kvm.c
+++ b/tools/perf/builtin-kvm.c
@@ -38,7 +38,7 @@ static const struct option kvm_options[] = {
OPT_BOOLEAN(0, guest, perf_guest,
Collect guest os data),
OPT_BOOLEAN(0, host, perf_host,
-   Collect guest os data),
+   Collect host os data),
OPT_STRING(0, guestmount, symbol_conf.guestmount, directory,
   guest mount directory under which every guest os
instance has a subdir),
-- 
1.7.5.3

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 0/9] perf support for x86 guest/host-only bits

2011-10-05 Thread Gleb Natapov
This patch series consists of Joerg series named perf support for amd
guest/host-only bits v2 [1] rebased to 3.1.0-rc7 and in addition,
support for intel cpus for the same functionality.

[1] https://lkml.org/lkml/2011/6/17/171

Changelog:
 v1-v2
  - move perf_guest_switch_msr array to perf code.
  - small cosmetic changes.

Gleb Natapov (4):
  perf, intel: Use GO/HO bits in perf-ctr
  KVM, VMX: add support for switching of PERF_GLOBAL_CTRL
  KVM, VMX: Add support for guest/host-only profiling
  KVM, VMX: Check for automatic switch msr table overflow.

Joerg Roedel (5):
  perf, core: Introduce attrs to count in either host or guest mode
  perf, amd: Use GO/HO bits in perf-ctr
  perf, tools: Add support for guest/host-only profiling
  perf, tools: Fix copypaste error in perf-kvm option description
  perf, tools: Do guest-only counting in perf-kvm by default

 arch/x86/include/asm/perf_event.h  |   15 
 arch/x86/kernel/cpu/perf_event.c   |   14 
 arch/x86/kernel/cpu/perf_event_amd.c   |   13 +++
 arch/x86/kernel/cpu/perf_event_intel.c |   90 +-
 arch/x86/kvm/vmx.c |  131 +---
 include/linux/perf_event.h |5 +-
 tools/perf/builtin-kvm.c   |5 +-
 tools/perf/util/event.c|8 ++
 tools/perf/util/event.h|2 +
 tools/perf/util/evlist.c   |5 +-
 tools/perf/util/parse-events.c |   15 +++-
 11 files changed, 282 insertions(+), 21 deletions(-)

-- 
1.7.5.3

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 9/9] KVM, VMX: Check for automatic switch msr table overflow.

2011-10-05 Thread Gleb Natapov

Signed-off-by: Gleb Natapov g...@redhat.com
---
 arch/x86/kvm/vmx.c |6 +-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index d0d4afa..6e28d58 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1280,7 +1280,11 @@ static void add_atomic_switch_msr(struct vcpu_vmx *vmx, 
unsigned msr,
if (m-guest[i].index == msr)
break;
 
-   if (i == m-nr) {
+   if (i == NR_AUTOLOAD_MSRS) {
+   printk_once(KERN_WARNINGNot enough mst switch entries. 
+   Can't add msr %x\n, msr);
+   return;
+   } else if (i == m-nr) {
++m-nr;
vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, m-nr);
vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, m-nr);
-- 
1.7.5.3

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 2/9] perf, amd: Use GO/HO bits in perf-ctr

2011-10-05 Thread Gleb Natapov
From: Joerg Roedel joerg.roe...@amd.com

The AMD perf-counters support counting in guest or host-mode
only. Make use of that feature when user-space specified
guest/host-mode only counting.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
Signed-off-by: Gleb Natapov g...@redhat.com
---
 arch/x86/include/asm/perf_event.h|3 +++
 arch/x86/kernel/cpu/perf_event_amd.c |   13 +
 2 files changed, 16 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/perf_event.h 
b/arch/x86/include/asm/perf_event.h
index 094fb30..ce2bfb3 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -29,6 +29,9 @@
 #define ARCH_PERFMON_EVENTSEL_INV  (1ULL  23)
 #define ARCH_PERFMON_EVENTSEL_CMASK0xFF00ULL
 
+#define AMD_PERFMON_EVENTSEL_GUESTONLY (1ULL  40)
+#define AMD_PERFMON_EVENTSEL_HOSTONLY  (1ULL  41)
+
 #define AMD64_EVENTSEL_EVENT   \
(ARCH_PERFMON_EVENTSEL_EVENT | (0x0FULL  32))
 #define INTEL_ARCH_EVENT_MASK  \
diff --git a/arch/x86/kernel/cpu/perf_event_amd.c 
b/arch/x86/kernel/cpu/perf_event_amd.c
index 941caa2..5c01a73 100644
--- a/arch/x86/kernel/cpu/perf_event_amd.c
+++ b/arch/x86/kernel/cpu/perf_event_amd.c
@@ -132,6 +132,19 @@ static int amd_pmu_hw_config(struct perf_event *event)
if (ret)
return ret;
 
+   if (event-attr.exclude_host  event-attr.exclude_guest)
+   /*
+* When HO == GO == 1 the hardware treats that as GO == HO == 0
+* and will count in both modes. We don't want to count in that
+* case so we emulate no-counting by setting US = OS = 0.
+*/
+   event-hw.config = ~(ARCH_PERFMON_EVENTSEL_USR |
+ ARCH_PERFMON_EVENTSEL_OS);
+   else if (event-attr.exclude_host)
+   event-hw.config |= AMD_PERFMON_EVENTSEL_GUESTONLY;
+   else if (event-attr.exclude_guest)
+   event-hw.config |= AMD_PERFMON_EVENTSEL_HOSTONLY;
+
if (event-attr.type != PERF_TYPE_RAW)
return 0;
 
-- 
1.7.5.3

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 8/9] KVM, VMX: Add support for guest/host-only profiling

2011-10-05 Thread Gleb Natapov
Support guest/host-only profiling by switch perf msrs on
a guest entry if needed.

Signed-off-by: Gleb Natapov g...@redhat.com
---
 arch/x86/kvm/vmx.c |   21 +
 1 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index a938ddf..d0d4afa 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -39,6 +39,7 @@
 #include asm/mce.h
 #include asm/i387.h
 #include asm/xcr.h
+#include asm/perf_event.h
 
 #include trace.h
 
@@ -6054,6 +6055,24 @@ static void vmx_cancel_injection(struct kvm_vcpu *vcpu)
vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0);
 }
 
+static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
+{
+   int i, nr_msrs;
+   struct perf_guest_switch_msr *msrs;
+
+   msrs = perf_guest_get_msrs(nr_msrs);
+
+   if (!msrs)
+   return;
+
+   for (i = 0; i  nr_msrs; i++)
+   if (msrs[i].host == msrs[i].guest)
+   clear_atomic_switch_msr(vmx, msrs[i].msr);
+   else
+   add_atomic_switch_msr(vmx, msrs[i].msr, msrs[i].guest,
+   msrs[i].host);
+}
+
 #ifdef CONFIG_X86_64
 #define R r
 #define Q q
@@ -6103,6 +6122,8 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
if (vcpu-guest_debug  KVM_GUESTDBG_SINGLESTEP)
vmx_set_interrupt_shadow(vcpu, 0);
 
+   atomic_switch_perf_msrs(vmx);
+
vmx-__launched = vmx-loaded_vmcs-launched;
asm(
/* Store host registers */
-- 
1.7.5.3

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/2] virtio-net: Prevent NULL dereference

2011-10-05 Thread Sasha Levin
On Mon, 2011-10-03 at 20:40 +0200, Michael S. Tsirkin wrote:
 On Wed, Sep 28, 2011 at 05:40:55PM +0300, Sasha Levin wrote:
  This patch prevents a NULL dereference when the user has passed a length
  longer than an actual buffer to virtio-net.
  
  Cc: Rusty Russell ru...@rustcorp.com.au
  Cc: Michael S. Tsirkin m...@redhat.com
  Cc: virtualizat...@lists.linux-foundation.org
  Cc: net...@vger.kernel.org
  Cc: kvm@vger.kernel.org
  Signed-off-by: Sasha Levin levinsasha...@gmail.com
  ---
   drivers/net/virtio_net.c |   12 +++-
   1 files changed, 11 insertions(+), 1 deletions(-)
  
  diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
  index bde0dec..4a53d2a 100644
  --- a/drivers/net/virtio_net.c
  +++ b/drivers/net/virtio_net.c
  @@ -208,12 +208,22 @@ static struct sk_buff *page_to_skb(struct 
  virtnet_info *vi,
  return NULL;
  }
   
  -   while (len) {
  +   while (len  page) {
  set_skb_frag(skb, page, offset, len);
  page = (struct page *)page-private;
  offset = 0;
  }
   
  +   /*
  +* This is the case where we ran out of pages in our linked list, but
  +* supposedly have more data to read.
 
 Again, let's clarify that this only happens with broken devices.

I think that the code within the if() makes it clear that it isn't the
regular path.

-- 

Sasha.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 1/2] virtio-net: Verify page list size before fitting into skb

2011-10-05 Thread Sasha Levin
On Mon, 2011-10-03 at 21:04 +0200, Michael S. Tsirkin wrote:
 On Wed, Sep 28, 2011 at 05:40:54PM +0300, Sasha Levin wrote:
  This patch verifies that the length of a buffer stored in a linked list
  of pages is small enough to fit into a skb.
  
  If the size is larger than a max size of a skb, it means that we shouldn't
  go ahead building skbs anyway since we won't be able to send the buffer as
  the user requested.
  
  Cc: Rusty Russell ru...@rustcorp.com.au
  Cc: Michael S. Tsirkin m...@redhat.com
  Cc: virtualizat...@lists.linux-foundation.org
  Cc: net...@vger.kernel.org
  Cc: kvm@vger.kernel.org
  Signed-off-by: Sasha Levin levinsasha...@gmail.com
  ---
   drivers/net/virtio_net.c |   13 +
   1 files changed, 13 insertions(+), 0 deletions(-)
  
  diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
  index 0c7321c..bde0dec 100644
  --- a/drivers/net/virtio_net.c
  +++ b/drivers/net/virtio_net.c
  @@ -195,6 +195,19 @@ static struct sk_buff *page_to_skb(struct virtnet_info 
  *vi,
  len -= copy;
  offset += copy;
   
  +   /*
  +* Verify that we can indeed put this data into a skb.
  +* This is here to handle cases when the device erroneously
  +* tries to receive more than is possible. This is usually
  +* the case of a broken device.
  +*/
  +   if (unlikely(len  MAX_SKB_FRAGS * PAGE_SIZE)) {
  +   if (net_ratelimit())
  +   pr_debug(%s: too much data\n, skb-dev-name);
  +   dev_kfree_skb(skb);
  +   return NULL;
  +   }
  +
 
 BTW, receive_mergeable does
 pr_debug(%s: packet too long\n, skb-dev-name);
 skb-dev-stats.rx_length_errors++;
 
 which makes sense.

Do you think we should increase rx_length_errors here as well?

-- 

Sasha.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 00/11] KVM: x86: optimize for writing guest page

2011-10-05 Thread Avi Kivity

On 09/23/2011 02:51 PM, Marcelo Tosatti wrote:

On Thu, Sep 22, 2011 at 04:52:40PM +0800, Xiao Guangrong wrote:
  This patchset is against https://github.com/avikivity/kvm.git next branch.

  In this version, some changes come from Avi's comments:
  - fix instruction retried for nested guest
  - skip write-flooding for the sp whose level is 1
  - rename some functions

Looks good to me.



To me as well.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 0/9] perf support for x86 guest/host-only bits

2011-10-05 Thread Avi Kivity

On 10/05/2011 02:01 PM, Gleb Natapov wrote:

This patch series consists of Joerg series named perf support for amd
guest/host-only bits v2 [1] rebased to 3.1.0-rc7 and in addition,
support for intel cpus for the same functionality.



Looks good to me.  Peter, Ingo, if it works for you as well, please 
merge 1-6 into a branch on tip that I can pull and apply the rest on top of.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 6/9] perf, intel: Use GO/HO bits in perf-ctr

2011-10-05 Thread Peter Zijlstra


Had some serious conflicts with tip/master, queued the below merge.

---
Subject: perf, intel: Use GO/HO bits in perf-ctr
From: Gleb Natapov g...@redhat.com
Date: Wed, 5 Oct 2011 14:01:21 +0200

Intel does not have guest/host-only bit in perf counters like AMD
does.  To support GO/HO bits KVM needs to switch EVENTSELn values
(or PERF_GLOBAL_CTRL if available) at a guest entry. If a counter is
configured to count only in a guest mode it stays disabled in a host,
but VMX is configured to switch it to enabled value during guest entry.

This patch adds GO/HO tracking to Intel perf code and provides interface
for KVM to get a list of MSRs that need to be switched on a guest entry.

Only cpus with architectural PMU (v1 or later) are supported with this
patch.  To my knowledge there is not p6 models with VMX but without
architectural PMU and p4 with VMX are rare and the interface is general
enough to support them if need arise.

Signed-off-by: Gleb Natapov g...@redhat.com
Signed-off-by: Peter Zijlstra a.p.zijls...@chello.nl
Link: http://lkml.kernel.org/r/1317816084-18026-7-git-send-email-g...@redhat.com
---
 arch/x86/include/asm/perf_event.h  |   12 
 arch/x86/kernel/cpu/perf_event.h   |   12 
 arch/x86/kernel/cpu/perf_event_intel.c |   91 +++--
 3 files changed, 112 insertions(+), 3 deletions(-)
Index: linux-2.6/arch/x86/include/asm/perf_event.h
===
--- linux-2.6.orig/arch/x86/include/asm/perf_event.h
+++ linux-2.6/arch/x86/include/asm/perf_event.h
@@ -162,7 +162,19 @@ extern unsigned long perf_misc_flags(str
);  \
 }
 
+struct perf_guest_switch_msr {
+   unsigned msr;
+   u64 host, guest;
+};
+
+extern struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr);
 #else
+static inline perf_guest_switch_msr *perf_guest_get_msrs(int *nr)
+{
+   *nr = 0;
+   return NULL;
+}
+
 static inline void perf_events_lapic_init(void){ }
 #endif
 
Index: linux-2.6/arch/x86/kernel/cpu/perf_event_intel.c
===
--- linux-2.6.orig/arch/x86/kernel/cpu/perf_event_intel.c
+++ linux-2.6/arch/x86/kernel/cpu/perf_event_intel.c
@@ -749,7 +749,8 @@ static void intel_pmu_enable_all(int add
 
intel_pmu_pebs_enable_all();
intel_pmu_lbr_enable_all();
-   wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, x86_pmu.intel_ctrl);
+   wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL,
+   x86_pmu.intel_ctrl  ~cpuc-intel_ctrl_guest_mask);
 
if (test_bit(X86_PMC_IDX_FIXED_BTS, cpuc-active_mask)) {
struct perf_event *event =
@@ -872,6 +873,7 @@ static void intel_pmu_disable_fixed(stru
 static void intel_pmu_disable_event(struct perf_event *event)
 {
struct hw_perf_event *hwc = event-hw;
+   struct cpu_hw_events *cpuc = __get_cpu_var(cpu_hw_events);
 
if (unlikely(hwc-idx == X86_PMC_IDX_FIXED_BTS)) {
intel_pmu_disable_bts();
@@ -879,6 +881,9 @@ static void intel_pmu_disable_event(stru
return;
}
 
+   cpuc-intel_ctrl_guest_mask = ~(1ull  hwc-idx);
+   cpuc-intel_ctrl_host_mask = ~(1ull  hwc-idx);
+
if (unlikely(hwc-config_base == MSR_ARCH_PERFMON_FIXED_CTR_CTRL)) {
intel_pmu_disable_fixed(hwc);
return;
@@ -924,6 +929,7 @@ static void intel_pmu_enable_fixed(struc
 static void intel_pmu_enable_event(struct perf_event *event)
 {
struct hw_perf_event *hwc = event-hw;
+   struct cpu_hw_events *cpuc = __get_cpu_var(cpu_hw_events);
 
if (unlikely(hwc-idx == X86_PMC_IDX_FIXED_BTS)) {
if (!__this_cpu_read(cpu_hw_events.enabled))
@@ -933,6 +939,11 @@ static void intel_pmu_enable_event(struc
return;
}
 
+   if (event-attr.exclude_host)
+   cpuc-intel_ctrl_guest_mask |= (1ull  hwc-idx);
+   if (event-attr.exclude_guest)
+   cpuc-intel_ctrl_host_mask |= (1ull  hwc-idx);
+
if (unlikely(hwc-config_base == MSR_ARCH_PERFMON_FIXED_CTR_CTRL)) {
intel_pmu_enable_fixed(hwc);
return;
@@ -1302,12 +1313,84 @@ static int intel_pmu_hw_config(struct pe
return 0;
 }
 
+struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr)
+{
+   if (x86_pmu.guest_get_msrs)
+   return x86_pmu.guest_get_msrs(nr);
+   *nr = 0;
+   return NULL;
+}
+EXPORT_SYMBOL_GPL(perf_guest_get_msrs);
+
+static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr)
+{
+   struct cpu_hw_events *cpuc = __get_cpu_var(cpu_hw_events);
+   struct perf_guest_switch_msr *arr = cpuc-guest_switch_msrs;
+
+   arr[0].msr = MSR_CORE_PERF_GLOBAL_CTRL;
+   arr[0].host = x86_pmu.intel_ctrl  ~cpuc-intel_ctrl_guest_mask;
+   arr[0].guest = x86_pmu.intel_ctrl  ~cpuc-intel_ctrl_host_mask;
+
+   *nr = 1;
+   return arr;
+}
+
+static struct 

Re: [PATCH v2 8/9] KVM, VMX: Add support for guest/host-only profiling

2011-10-05 Thread Peter Zijlstra
On Wed, 2011-10-05 at 14:01 +0200, Gleb Natapov wrote:
 +static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
 +{
 +   int i, nr_msrs;
 +   struct perf_guest_switch_msr *msrs;
 +
 +   msrs = perf_guest_get_msrs(nr_msrs);
 +
 +   if (!msrs)
 +   return;
 +
 +   for (i = 0; i  nr_msrs; i++)
 +   if (msrs[i].host == msrs[i].guest)
 +   clear_atomic_switch_msr(vmx, msrs[i].msr);
 +   else
 +   add_atomic_switch_msr(vmx, msrs[i].msr, msrs[i].guest,
 +   msrs[i].host);
 +} 

I don't think this will actually compile with PERF_EVENTS=n due to
struct perf_guest_switch_msr not being defined.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 0/9] perf support for x86 guest/host-only bits

2011-10-05 Thread Peter Zijlstra
On Wed, 2011-10-05 at 15:48 +0200, Avi Kivity wrote:
 On 10/05/2011 02:01 PM, Gleb Natapov wrote:
  This patch series consists of Joerg series named perf support for amd
  guest/host-only bits v2 [1] rebased to 3.1.0-rc7 and in addition,
  support for intel cpus for the same functionality.
 
 
 Looks good to me.  Peter, Ingo, if it works for you as well, please 
 merge 1-6 into a branch on tip that I can pull and apply the rest on top of.

I queued bits, and they should hopefully find their way to tip:perf/core
soonish.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] KVM: PPC: E500: Support hugetlbfs

2011-10-05 Thread Alexander Graf
With hugetlbfs support emerging on e500, we should also support KVM
backing its guest memory by it.

This patch adds support for hugetlbfs into the e500 shadow mmu code.

Signed-off-by: Alexander Graf ag...@suse.de

---

v1 - v2:

  - address scott's comments
---
 arch/powerpc/kvm/e500_tlb.c |   24 
 1 files changed, 24 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/kvm/e500_tlb.c b/arch/powerpc/kvm/e500_tlb.c
index ec17148..1dd96a9 100644
--- a/arch/powerpc/kvm/e500_tlb.c
+++ b/arch/powerpc/kvm/e500_tlb.c
@@ -24,6 +24,7 @@
 #include linux/sched.h
 #include linux/rwsem.h
 #include linux/vmalloc.h
+#include linux/hugetlb.h
 #include asm/kvm_ppc.h
 #include asm/kvm_e500.h
 
@@ -673,12 +674,31 @@ static inline void kvmppc_e500_shadow_map(struct 
kvmppc_vcpu_e500 *vcpu_e500,
pfn = ~(tsize_pages - 1);
break;
}
+   } else if (vma  hva = vma-vm_start 
+   (vma-vm_flags  VM_HUGETLB)) {
+   unsigned long psize = vma_kernel_pagesize(vma);
+
+   tsize = (gtlbe-mas1  MAS1_TSIZE_MASK) 
+   MAS1_TSIZE_SHIFT;
+
+   /*
+* Take the largest page size that satisfies both host
+* and guest mapping
+*/
+   tsize = min(__ilog2(psize) - 10, tsize);
+
+   /*
+* e500 doesn't implement the lowest tsize bit,
+* or 1K pages.
+*/
+   tsize = max(BOOK3E_PAGESZ_4K, tsize  ~1);
}
 
up_read(current-mm-mmap_sem);
}
 
if (likely(!pfnmap)) {
+   unsigned long tsize_pages = 1  (tsize + 10 - PAGE_SHIFT);
pfn = gfn_to_pfn_memslot(vcpu_e500-vcpu.kvm, slot, gfn);
if (is_error_pfn(pfn)) {
printk(KERN_ERR Couldn't get real page for gfn %lx!\n,
@@ -686,6 +706,10 @@ static inline void kvmppc_e500_shadow_map(struct 
kvmppc_vcpu_e500 *vcpu_e500,
kvm_release_pfn_clean(pfn);
return;
}
+
+   /* Align guest and physical address to page map boundaries */
+   pfn = ~(tsize_pages - 1);
+   gvaddr = ~((tsize_pages  PAGE_SHIFT) - 1);
}
 
/* Drop old ref and setup new one. */
-- 
1.6.0.2

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] qemu-kvm: Remove extboot support

2011-10-05 Thread Marcelo Tosatti
On Wed, Oct 05, 2011 at 09:52:12AM +0200, Jan Kiszka wrote:
 From: Jan Kiszka jan.kis...@siemens.com
 
 Except for booting from SCSI via the LSI controller, SeaBIOS has native
 support for mass storage interfaces now. And SCSI can be worked around
 via [1] - or someone finally adds the necessary bits to SeaBIOS if there
 is a real need.
 
 [1] http://thread.gmane.org/gmane.comp.emulators.qemu/78467
 
 Signed-off-by: Jan Kiszka jan.kis...@siemens.com

Applied, thanks.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 1/2] virtio-net: Verify page list size before fitting into skb

2011-10-05 Thread Michael S. Tsirkin
On Wed, Oct 05, 2011 at 03:50:54PM +0200, Sasha Levin wrote:
 On Mon, 2011-10-03 at 21:04 +0200, Michael S. Tsirkin wrote:
  On Wed, Sep 28, 2011 at 05:40:54PM +0300, Sasha Levin wrote:
   This patch verifies that the length of a buffer stored in a linked list
   of pages is small enough to fit into a skb.
   
   If the size is larger than a max size of a skb, it means that we shouldn't
   go ahead building skbs anyway since we won't be able to send the buffer as
   the user requested.
   
   Cc: Rusty Russell ru...@rustcorp.com.au
   Cc: Michael S. Tsirkin m...@redhat.com
   Cc: virtualizat...@lists.linux-foundation.org
   Cc: net...@vger.kernel.org
   Cc: kvm@vger.kernel.org
   Signed-off-by: Sasha Levin levinsasha...@gmail.com
   ---
drivers/net/virtio_net.c |   13 +
1 files changed, 13 insertions(+), 0 deletions(-)
   
   diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
   index 0c7321c..bde0dec 100644
   --- a/drivers/net/virtio_net.c
   +++ b/drivers/net/virtio_net.c
   @@ -195,6 +195,19 @@ static struct sk_buff *page_to_skb(struct 
   virtnet_info *vi,
 len -= copy;
 offset += copy;

   + /*
   +  * Verify that we can indeed put this data into a skb.
   +  * This is here to handle cases when the device erroneously
   +  * tries to receive more than is possible. This is usually
   +  * the case of a broken device.
   +  */
   + if (unlikely(len  MAX_SKB_FRAGS * PAGE_SIZE)) {
   + if (net_ratelimit())
   + pr_debug(%s: too much data\n, skb-dev-name);
   + dev_kfree_skb(skb);
   + return NULL;
   + }
   +
  
  BTW, receive_mergeable does
  pr_debug(%s: packet too long\n, skb-dev-name);
  skb-dev-stats.rx_length_errors++;
  
  which makes sense.
 
 Do you think we should increase rx_length_errors here as well?

this is all debugging tool for devices/drivers, right?
so maybe not worth the noise.

 -- 
 
 Sasha.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/2] virtio-net: Prevent NULL dereference

2011-10-05 Thread Michael S. Tsirkin
On Wed, Oct 05, 2011 at 03:50:14PM +0200, Sasha Levin wrote:
 On Mon, 2011-10-03 at 20:40 +0200, Michael S. Tsirkin wrote:
  On Wed, Sep 28, 2011 at 05:40:55PM +0300, Sasha Levin wrote:
   This patch prevents a NULL dereference when the user has passed a length
   longer than an actual buffer to virtio-net.
   
   Cc: Rusty Russell ru...@rustcorp.com.au
   Cc: Michael S. Tsirkin m...@redhat.com
   Cc: virtualizat...@lists.linux-foundation.org
   Cc: net...@vger.kernel.org
   Cc: kvm@vger.kernel.org
   Signed-off-by: Sasha Levin levinsasha...@gmail.com
   ---
drivers/net/virtio_net.c |   12 +++-
1 files changed, 11 insertions(+), 1 deletions(-)
   
   diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
   index bde0dec..4a53d2a 100644
   --- a/drivers/net/virtio_net.c
   +++ b/drivers/net/virtio_net.c
   @@ -208,12 +208,22 @@ static struct sk_buff *page_to_skb(struct 
   virtnet_info *vi,
 return NULL;
 }

   - while (len) {
   + while (len  page) {
 set_skb_frag(skb, page, offset, len);
 page = (struct page *)page-private;
 offset = 0;
 }

   + /*
   +  * This is the case where we ran out of pages in our linked list, but
   +  * supposedly have more data to read.
  
  Again, let's clarify that this only happens with broken devices.
 
 I think that the code within the if() makes it clear that it isn't the
 regular path.

It doesn't make it clear that this never happens in absence of bugs.

 -- 
 
 Sasha.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 8/9] KVM, VMX: Add support for guest/host-only profiling

2011-10-05 Thread Gleb Natapov
On Wed, Oct 05, 2011 at 04:19:39PM +0200, Peter Zijlstra wrote:
 On Wed, 2011-10-05 at 14:01 +0200, Gleb Natapov wrote:
  +static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
  +{
  +   int i, nr_msrs;
  +   struct perf_guest_switch_msr *msrs;
  +
  +   msrs = perf_guest_get_msrs(nr_msrs);
  +
  +   if (!msrs)
  +   return;
  +
  +   for (i = 0; i  nr_msrs; i++)
  +   if (msrs[i].host == msrs[i].guest)
  +   clear_atomic_switch_msr(vmx, msrs[i].msr);
  +   else
  +   add_atomic_switch_msr(vmx, msrs[i].msr, 
  msrs[i].guest,
  +   msrs[i].host);
  +} 
 
 I don't think this will actually compile with PERF_EVENTS=n due to
 struct perf_guest_switch_msr not being defined.
Oops you are right. Turns out it is not enough to remove PERF_EVENTS
from .config to disable it. It re-appears after make oldconfig. Should
I send incremental patch to fix that?

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 8/9] KVM, VMX: Add support for guest/host-only profiling

2011-10-05 Thread Peter Zijlstra
On Wed, 2011-10-05 at 17:29 +0200, Gleb Natapov wrote:
 On Wed, Oct 05, 2011 at 04:19:39PM +0200, Peter Zijlstra wrote:
  On Wed, 2011-10-05 at 14:01 +0200, Gleb Natapov wrote:
   +static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
   +{
   +   int i, nr_msrs;
   +   struct perf_guest_switch_msr *msrs;
   +
   +   msrs = perf_guest_get_msrs(nr_msrs);
   +
   +   if (!msrs)
   +   return;
   +
   +   for (i = 0; i  nr_msrs; i++)
   +   if (msrs[i].host == msrs[i].guest)
   +   clear_atomic_switch_msr(vmx, msrs[i].msr);
   +   else
   +   add_atomic_switch_msr(vmx, msrs[i].msr, 
   msrs[i].guest,
   +   msrs[i].host);
   +} 
  
  I don't think this will actually compile with PERF_EVENTS=n due to
  struct perf_guest_switch_msr not being defined.
 Oops you are right. Turns out it is not enough to remove PERF_EVENTS
 from .config to disable it. It re-appears after make oldconfig. Should
 I send incremental patch to fix that?

Frederic, what's the status of being able to disable PERF on x86 again?
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Kernel bug - 3.1.0 - rc9

2011-10-05 Thread Steve
I'm not sure exactly where to report this (and the kernel.org reporting 
bugs page is still offline)


RC9 of the 3.1 kernel is causing VirtualBox to fail (certainly here on 
amd64 - Phenom II 965 quad). syslog output is:


Message from syslogd@Tex at Oct  5 16:16:51 ...
 kernel:Oops:  [#1] PREEMPT SMP

Message from syslogd@Tex at Oct  5 16:16:51 ...
 kernel:Stack:

Message from syslogd@Tex at Oct  5 16:16:51 ...
 kernel:Call Trace:

Message from syslogd@Tex at Oct  5 16:16:51 ...
 kernel:Code: ac e0 66 8b b3 b8 02 00 00 8b bb b0 02 00 00 48 c7 83 68 
02 00 00 00 00 00 00 e8 68 ff ff ff 48 c7 c7 00 aa 62 a0 e8 8e d4 cd e0


Message from syslogd@Tex at Oct  5 16:16:51 ...
 kernel:CR2: 0088

vboxsrv drivers install correctly and there are no obvious errors from 
vbox itself. The bug manifests itself by simply hanging the VM at startup.


Versions pre rc9 work fine in this regard.

I hope that this is of some help.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: PPC: E500: Support hugetlbfs

2011-10-05 Thread Scott Wood
On 10/05/2011 09:37 AM, Alexander Graf wrote:
 diff --git a/arch/powerpc/kvm/e500_tlb.c b/arch/powerpc/kvm/e500_tlb.c
 index ec17148..1dd96a9 100644
 --- a/arch/powerpc/kvm/e500_tlb.c
 +++ b/arch/powerpc/kvm/e500_tlb.c
 @@ -24,6 +24,7 @@
  #include linux/sched.h
  #include linux/rwsem.h
  #include linux/vmalloc.h
 +#include linux/hugetlb.h
  #include asm/kvm_ppc.h
  #include asm/kvm_e500.h
  
 @@ -673,12 +674,31 @@ static inline void kvmppc_e500_shadow_map(struct 
 kvmppc_vcpu_e500 *vcpu_e500,
   pfn = ~(tsize_pages - 1);
   break;
   }
 + } else if (vma  hva = vma-vm_start 
 +   (vma-vm_flags  VM_HUGETLB)) {
 + unsigned long psize = vma_kernel_pagesize(vma);
 +
 + tsize = (gtlbe-mas1  MAS1_TSIZE_MASK) 
 + MAS1_TSIZE_SHIFT;
 +
 + /*
 +  * Take the largest page size that satisfies both host
 +  * and guest mapping
 +  */
 + tsize = min(__ilog2(psize) - 10, tsize);

Any reason for __ilog2() rather than ilog2()?  Shouldn't make a
difference, just curious about avoiding the public interface.

Either way,
Acked-by: Scott Wood scottw...@freescale.com

-Scott

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: KVM: emulate lapic tsc deadline timer for guest

2011-10-05 Thread Liu, Jinsong
Yes, Avi has noticed this issue and fix the bug as attached.

Thanks,
Jinsong

Dan Carpenter wrote:
 This patch causes a NULL dereference for me when I start qemu.
 
 [  136.130978] BUG: unable to handle kernel NULL pointer dereference
 at 0078 [  136.131032] IP: [a015a3d3]
 update_cpuid+0x63/0x90 [kvm] [  136.131076] PGD 3fcac4067 PUD
 3fc91a067 PMD 0 [  136.131108] Oops: 0002 [#1] SMP
 [  136.131132] CPU 0
 [  136.131145] Modules linked in: e1000e fuse kvm_intel kvm radeon
 ttm [last unloaded: e1000e] [  136.131208]
 [  136.131219] Pid: 2678, comm: qemu-system-x86 Not tainted
 3.1.0-rc8-next-20110930+ #92 System manufacturer System Product
 Name/P8Z68-V PRO [  136.131289] RIP: 0010:[a015a3d3] 
 [a015a3d3] update_cpuid+0x63/0x90 [kvm] [  136.131341] RSP:
 0018:880404761d20  EFLAGS: 00010282 [  136.131370] RAX:
  RBX: 8803fc408000 RCX: 8803fc408b18 [ 
 136.131408] RDX: 80802001 RSI: 0001 RDI:
 8803fc408000 [  136.131445] RBP: 880404761d28 R08:
 0015 R09: 0003 [  136.131483] R10:
 0003 R11:  R12: 8803fc408000 [ 
 136.131520] R13: 7fffb5e8d870 R14: 0015 R15:
  [  136.131559] FS:  7f96c6324760()
 GS:88042f40() knlGS: [  136.131602] CS: 
 0010 DS:  ES:  CR0: 80050033 [  136.131632] CR2:
 0078 CR3: 0003fcac5000 CR4: 000426f0 [ 
 136.131670] DR0:  DR1:  DR2:
  [  136.131707] DR3:  DR6:
 0ff0 DR7: 0400 [  136.131745] Process
 qemu-system-x86 (pid: 2678, threadinfo 88040476, task
 880404618000) [  136.131792] Stack: [  136.131804] 
  880404761df8 a016635f  [
 136.131851]  0015  0200
  [  136.131897]  88042f7ee000 880404761da8
 a01beba9 c90006722000 [  136.131944] Call Trace: [ 
 136.131966]  [a016635f] kvm_arch_vcpu_ioctl+0xd5f/0x15d0
 [kvm] [  136.132005]  [a01beba9] ? vmx_vcpu_load+0x39/0x1b0
 [kvm_intel] [  136.132046]  [a01654fb] ?
 kvm_arch_vcpu_load+0x6b/0x170 [kvm]  
 
 Here is the code listing from gdb:
 
 (gdb) list *(update_cpuid+0x63)
 0xc3d3 is in update_cpuid (arch/x86/kvm/x86.c:618).
 613   }
 614
 615   if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL 
 616   best-function == 0x1) {
 617   best-ecx |= bit(X86_FEATURE_TSC_DEADLINE_TIMER);
 618   vcpu-arch.apic-lapic_timer.timer_mode_mask = (3  
 17);
 619   } else
 620   vcpu-arch.apic-lapic_timer.timer_mode_mask = (1  
 17);
 621   }
 622
 (gdb)
 
 Reverting the patch fixes things for me.  I'm using linux-next from
 Friday.
 
 regards,
 dan carpenter

---BeginMessage---
vcpu-arch.apic may be NULL.

Signed-off-by: Avi Kivity a...@redhat.com
---
 arch/x86/kvm/x86.c |9 +++--
 1 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 83b839f..aa11707 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -600,6 +600,8 @@ static bool guest_cpuid_has_fsgsbase(struct kvm_vcpu *vcpu)
 static void update_cpuid(struct kvm_vcpu *vcpu)
 {
struct kvm_cpuid_entry2 *best;
+   struct kvm_lapic *apic = vcpu-arch.apic;
+   u32 timer_mode_mask;
 
best = kvm_find_cpuid_entry(vcpu, 1, 0);
if (!best)
@@ -615,9 +617,12 @@ static void update_cpuid(struct kvm_vcpu *vcpu)
if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL 
best-function == 0x1) {
best-ecx |= bit(X86_FEATURE_TSC_DEADLINE_TIMER);
-   vcpu-arch.apic-lapic_timer.timer_mode_mask = (3  17);
+   timer_mode_mask = 3  17;
} else
-   vcpu-arch.apic-lapic_timer.timer_mode_mask = (1  17);
+   timer_mode_mask = 1  17;
+
+   if (apic)
+   apic-lapic_timer.timer_mode_mask = timer_mode_mask;
 }
 
 int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
-- 
1.7.6.3

---End Message---


RE: [PATCH] KVM: emulate lapic tsc deadline timer for guest

2011-10-05 Thread Liu, Jinsong
Marcelo Tosatti wrote:
 On Sun, Sep 25, 2011 at 10:47:46PM +0800, Liu, Jinsong wrote:
 Marcelo Tosatti wrote:
 On Fri, Sep 23, 2011 at 04:25:51PM +0800, Liu, Jinsong wrote:
 Marcelo Tosatti wrote:
 On Thu, Sep 22, 2011 at 04:55:52PM +0800, Liu, Jinsong wrote:
 From 4d5b83aba40ce0d421add9a41a6c591a8590a32e Mon Sep 17
 00:00:00 2001
 From: Liu, Jinsong jinsong@intel.com
 Date: Thu, 22 Sep 2011 14:00:08 +0800
 Subject: [PATCH 2/2] KVM: emulate lapic tsc deadline timer for
 guest 
 
 This patch emulate lapic tsc deadline timer for guest:
 Enumerate tsc deadline timer capability by CPUID;
 Enable tsc deadline timer mode by lapic MMIO;
 Start tsc deadline timer by WRMSR;
 
 Signed-off-by: Liu, Jinsong jinsong@intel.com ---
  arch/x86/include/asm/kvm_host.h |2 +
  arch/x86/kvm/kvm_timer.h|2 +
  arch/x86/kvm/lapic.c|  123
  --- arch/x86/kvm/lapic.h
  |3 + arch/x86/kvm/x86.c  |   16 +-
  5 files changed, 123 insertions(+), 23 deletions(-)
 
 Looks good, please rebase against branch master of
 
 git://github.com/avikivity/kvm.git
 
 Rebased as attached.
 
 Thanks,
 Jinsong
 
 Please write a simple test case to arm a lapic timer via wrmsr (see
 https://github.com/avikivity/kvm-unit-tests).
 
 Kernel patches have been applied, thanks.
 
 Marcelo,
 
 I'm not quite clear the purpose and usage of test case of the
 kvm-unit-tests. Can you give me some hint?
 
 The purpose is to add unit tests for new features (such as lapic
 deadline timer). There are examples that make it relatively easy to
 construct new test case (or modify existing ones to accomodate new
 tests).
 
 Please add a new test case for lapic deadline timer, thanks.


Thanks Marcelo. I will add the test case. Sorry for slow email reply because of 
holiday.

Regards,
Jinsong
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: tcpdump locks up kvm host for a while.

2011-10-05 Thread Avi Kivity

On 10/04/2011 09:40 PM, Robin Lee Powell wrote:

  When I run tcpdump on a *guest*, the entire guest completely
  freezes up; no response even to hitting enter on the console.
  virsh list also locks up whenever it tries to print state about
  that VM (but the others work fine), as does any other operation
  that touches the state of that VM.  The VM takes up 100% of CPU
  on one core while this is happening.  Eventually it gets better.

  You can use 'perf kvm' to figure out where the guest is spinning.

OK, gathered with:

sudo  perf kvm --guest --host record -o /tmp/kvm_perf -a

I don't know how to read it at all, so it's at
http://users.digitalkingdom.org/~rlpowell/media/public/kvm_perf



Not accessible.

Please post the output of 'perf kvm report  log'.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


accessing a 'native' disk from KVM ?

2011-10-05 Thread Sujee Maniyam
HI all

Using KVM (qemu-kvm-0.12.1) on CentOS-6.

how can I supply an entire disk for KVM to use?  Is this correct
snippet in kvm.xml ?

disk type='block' device='disk'
  driver name='qemu' type='raw'/
  source dev='/dev/sdi1'/
  target dev='vdb' bus='virtio'/
/disk

thanks very much
Sujee Maniyam
http://sujee.net
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [kvm] accessing a 'native' disk from KVM ?

2011-10-05 Thread Robin Lee Powell
On Wed, Oct 05, 2011 at 11:41:13AM -0700, Sujee Maniyam wrote:
 HI all
 
 Using KVM (qemu-kvm-0.12.1) on CentOS-6.
 
 how can I supply an entire disk for KVM to use?  Is this correct
 snippet in kvm.xml ?
 
 disk type='block' device='disk'
   driver name='qemu' type='raw'/
   source dev='/dev/sdi1'/
   target dev='vdb' bus='virtio'/
 /disk

This is what I have:

disk type='block' device='disk'
  driver name='qemu' type='raw' cache='none'/
  source dev='/dev/mapper/local-vrici_root'/
  target dev='vda' bus='virtio'/
  address type='pci' domain='0x' bus='0x00' slot='0x04' 
function='0x0'/
/disk

So, yeah, pretty much, although note tat you're making /dev/sdi*1*
available, which is a partition, not a disk.  Regardless,
http://publib.boulder.ibm.com/infocenter/lnxinfo/v3r0m0/index.jsp?topic=%2Fliaat%2Fliaattuncache.htm
(and other things I've seen) suggest cache=none for these cases.

-Robin
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: PPC: E500: Support hugetlbfs

2011-10-05 Thread Alexander Graf

On 05.10.2011, at 18:06, Scott Wood wrote:

 On 10/05/2011 09:37 AM, Alexander Graf wrote:
 diff --git a/arch/powerpc/kvm/e500_tlb.c b/arch/powerpc/kvm/e500_tlb.c
 index ec17148..1dd96a9 100644
 --- a/arch/powerpc/kvm/e500_tlb.c
 +++ b/arch/powerpc/kvm/e500_tlb.c
 @@ -24,6 +24,7 @@
 #include linux/sched.h
 #include linux/rwsem.h
 #include linux/vmalloc.h
 +#include linux/hugetlb.h
 #include asm/kvm_ppc.h
 #include asm/kvm_e500.h
 
 @@ -673,12 +674,31 @@ static inline void kvmppc_e500_shadow_map(struct 
 kvmppc_vcpu_e500 *vcpu_e500,
  pfn = ~(tsize_pages - 1);
  break;
  }
 +} else if (vma  hva = vma-vm_start 
 +   (vma-vm_flags  VM_HUGETLB)) {
 +unsigned long psize = vma_kernel_pagesize(vma);
 +
 +tsize = (gtlbe-mas1  MAS1_TSIZE_MASK) 
 +MAS1_TSIZE_SHIFT;
 +
 +/*
 + * Take the largest page size that satisfies both host
 + * and guest mapping
 + */
 +tsize = min(__ilog2(psize) - 10, tsize);
 
 Any reason for __ilog2() rather than ilog2()?  Shouldn't make a
 difference, just curious about avoiding the public interface.

I grep'ed through the kernel tree and only found __ilog2 defined as well as 
mostly users for __ilog2, so I figured there's got to be a reason ;)


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[QEMU PATCH] kvm: support TSC deadline MSR with subsection

2011-10-05 Thread Marcelo Tosatti

Jinsong, please test this qemu-kvm patch by migrating a guest which is
currently using TSC deadline timer. Using subsections avoids breaking
migration to older qemu versions when the guest does not make use of TSC
deadline feature.

-

From: Liu, Jinsong jinsong@intel.com

KVM add emulation of lapic tsc deadline timer for guest.
This patch is co-operation work at qemu side.

Use subsections to save/restore the field (mtosatti).

Signed-off-by: Liu, Jinsong jinsong@intel.com

diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index ae36489..29412dc 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -283,6 +283,7 @@
 #define MSR_IA32_APICBASE_BSP   (18)
 #define MSR_IA32_APICBASE_ENABLE(111)
 #define MSR_IA32_APICBASE_BASE  (0xf12)
+#define MSR_IA32_TSCDEADLINE0x6e0
 
 #define MSR_MTRRcap0xfe
 #define MSR_MTRRcap_VCNT   8
@@ -687,6 +688,7 @@ typedef struct CPUX86State {
 uint64_t async_pf_en_msr;
 
 uint64_t tsc;
+uint64_t tsc_deadline;
 
 uint64_t mcg_status;
 
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index b6eef04..90a6ffb 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -59,6 +59,7 @@ const KVMCapabilityInfo kvm_arch_required_capabilities[] = {
 
 static bool has_msr_star;
 static bool has_msr_hsave_pa;
+static bool has_msr_tsc_deadline;
 static bool has_msr_async_pf_en;
 static int lm_capable_kernel;
 
@@ -568,6 +569,10 @@ static int kvm_get_supported_msrs(KVMState *s)
 has_msr_hsave_pa = true;
 continue;
 }
+if (kvm_msr_list-indices[i] == MSR_IA32_TSCDEADLINE) {
+has_msr_tsc_deadline = true;
+continue;
+}
 }
 }
 
@@ -881,6 +886,9 @@ static int kvm_put_msrs(CPUState *env, int level)
 if (has_msr_hsave_pa) {
 kvm_msr_entry_set(msrs[n++], MSR_VM_HSAVE_PA, env-vm_hsave);
 }
+if (has_msr_tsc_deadline) {
+kvm_msr_entry_set(msrs[n++], MSR_IA32_TSCDEADLINE, env-tsc_deadline);
+}
 #ifdef TARGET_X86_64
 if (lm_capable_kernel) {
 kvm_msr_entry_set(msrs[n++], MSR_CSTAR, env-cstar);
@@ -1127,6 +1135,9 @@ static int kvm_get_msrs(CPUState *env)
 if (has_msr_hsave_pa) {
 msrs[n++].index = MSR_VM_HSAVE_PA;
 }
+if (has_msr_tsc_deadline) {
+msrs[n++].index = MSR_IA32_TSCDEADLINE;
+}
 
 if (!env-tsc_valid) {
 msrs[n++].index = MSR_IA32_TSC;
@@ -1195,6 +1206,9 @@ static int kvm_get_msrs(CPUState *env)
 case MSR_IA32_TSC:
 env-tsc = msrs[i].data;
 break;
+case MSR_IA32_TSCDEADLINE:
+env-tsc_deadline = msrs[i].data;
+break;
 case MSR_VM_HSAVE_PA:
 env-vm_hsave = msrs[i].data;
 break;
diff --git a/target-i386/machine.c b/target-i386/machine.c
index 9aca8e0..176d372 100644
--- a/target-i386/machine.c
+++ b/target-i386/machine.c
@@ -310,6 +310,24 @@ static const VMStateDescription vmstate_fpop_ip_dp = {
 }
 };
 
+static bool tscdeadline_needed(void *opaque)
+{
+CPUState *env = opaque;
+
+return env-tsc_deadline != 0;
+}
+
+static const VMStateDescription vmstate_msr_tscdeadline = {
+.name = cpu/msr_tscdeadline,
+.version_id = 1,
+.minimum_version_id = 1,
+.minimum_version_id_old = 1,
+.fields  = (VMStateField []) {
+VMSTATE_UINT64(tsc_deadline, CPUState),
+VMSTATE_END_OF_LIST()
+}
+};
+
 static const VMStateDescription vmstate_cpu = {
 .name = cpu,
 .version_id = CPU_SAVE_VERSION,
@@ -420,6 +438,9 @@ static const VMStateDescription vmstate_cpu = {
 } , {
 .vmsd = vmstate_fpop_ip_dp,
 .needed = fpop_ip_dp_needed,
+}, {
+.vmsd = vmstate_msr_tscdeadline,
+.needed = tscdeadline_needed,
 } , {
 /* empty */
 }
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: PPC: E500: Support hugetlbfs

2011-10-05 Thread Scott Wood
On 10/05/2011 01:55 PM, Alexander Graf wrote:
 
 On 05.10.2011, at 18:06, Scott Wood wrote:
 
 Any reason for __ilog2() rather than ilog2()?  Shouldn't make a
 difference, just curious about avoiding the public interface.
 
 I grep'ed through the kernel tree and only found __ilog2 defined as well as 
 mostly users for __ilog2, so I figured there's got to be a reason ;)

ilog2() is defined in include/linux/ilog2.h.  It produces constant
output if the input is constant, and appears to be the front door to
__ilog2_u32/__ilog2_u64.  Plain __ilog2 is older and powerpc-specific,
which is probably why there are more users of that in arch/powerpc.

-Scott

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Perf for only one VM?

2011-10-05 Thread Robin Lee Powell
Is it possible to get perf kvm record to record only what's going on
on a single VM?  The options seem to be --host (all host data) and
--guest (all VM data).

I'm getting 74.86%  qemu-kvm  [unknown][u]
0x7f3795bc408d, and I think that's because I've got kallsyms loaded
from a particular guest, but not the others.

-Robin
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/5] virtio: support unlocked queue kick

2011-10-05 Thread Christoph Hellwig
Split virtqueue_kick to be able to do the actual notification outside the
lock protecting the virtqueue.  This patch was originally done by
Stefan Hajnoczi, but I can't find the original one anymore and had to
recreated it from memory.  Pointers to the original or corrections for
the commit message are welcome.

Index: linux-2.6/drivers/virtio/virtio_ring.c
===
--- linux-2.6.orig/drivers/virtio/virtio_ring.c 2011-09-15 15:28:55.891347016 
+0200
+++ linux-2.6/drivers/virtio/virtio_ring.c  2011-10-03 18:45:32.492738431 
+0200
@@ -237,9 +237,11 @@ add_head:
 }
 EXPORT_SYMBOL_GPL(virtqueue_add_buf_gfp);
 
-void virtqueue_kick(struct virtqueue *_vq)
+bool virtqueue_kick_prepare(struct virtqueue *_vq)
 {
struct vring_virtqueue *vq = to_vvq(_vq);
+   bool need_kick = false;
+
u16 new, old;
START_USE(vq);
/* Descriptors and available array need to be set before we expose the
@@ -253,15 +255,32 @@ void virtqueue_kick(struct virtqueue *_v
/* Need to update avail index before checking if we should notify */
virtio_mb();
 
-   if (vq-event ?
-   vring_need_event(vring_avail_event(vq-vring), new, old) :
-   !(vq-vring.used-flags  VRING_USED_F_NO_NOTIFY))
-   /* Prod other side to tell it about changes. */
-   vq-notify(vq-vq);
-
+   if (vq-event) {
+   if (vring_need_event(vring_avail_event(vq-vring), new, old))
+   need_kick = true;
+   } else {
+   if (!(vq-vring.used-flags  VRING_USED_F_NO_NOTIFY))
+   need_kick = true;
+   }
END_USE(vq);
+   return need_kick;
+}
+EXPORT_SYMBOL_GPL(virtqueue_kick_prepare);
+
+void virtqueue_notify(struct virtqueue *_vq)
+{
+   struct vring_virtqueue *vq = to_vvq(_vq);
+
+   /* Prod other side to tell it about changes. */
+   vq-notify(_vq);
+}
+EXPORT_SYMBOL_GPL(virtqueue_notify);
+
+void virtqueue_kick(struct virtqueue *vq)
+{
+   if (virtqueue_kick_prepare(vq))
+   virtqueue_notify(vq);
 }
-EXPORT_SYMBOL_GPL(virtqueue_kick);
 
 static void detach_buf(struct vring_virtqueue *vq, unsigned int head)
 {
Index: linux-2.6/include/linux/virtio.h
===
--- linux-2.6.orig/include/linux/virtio.h   2011-09-15 15:28:57.078857804 
+0200
+++ linux-2.6/include/linux/virtio.h2011-10-03 18:41:07.309766531 +0200
@@ -86,6 +86,8 @@ static inline int virtqueue_add_buf(stru
 }
 
 void virtqueue_kick(struct virtqueue *vq);
+bool virtqueue_kick_prepare(struct virtqueue *vq);
+void virtqueue_notify(struct virtqueue *vq);
 
 void *virtqueue_get_buf(struct virtqueue *vq, unsigned int *len);
 

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 5/5] virtio-blk: implement -make_request

2011-10-05 Thread Christoph Hellwig
Add an alternate I/O path that implements -make_request for virtio-blk.
This is required for high IOPs devices which get slowed down to 1/5th of
the native speed by all the locking, memory allocation and other overhead
in the request based I/O path.

This patch is not quite merge ready due to two issues:

 - it doesn't implement FUA and FLUSH requests yet
 - it hardcodes which I/O path to chose

Signed-off-by: Christoph Hellwig h...@lst.de

Index: linux-2.6/drivers/block/virtio_blk.c
===
--- linux-2.6.orig/drivers/block/virtio_blk.c   2011-10-05 10:36:42.883913334 
-0400
+++ linux-2.6/drivers/block/virtio_blk.c2011-10-05 15:29:35.591405323 
-0400
@@ -11,6 +11,8 @@
 
 #define PART_BITS 4
 
+static int use_make_request = 1;
+
 static int major, index;
 struct workqueue_struct *virtblk_wq;
 
@@ -20,6 +22,7 @@ struct virtio_blk
 
struct virtio_device *vdev;
struct virtqueue *vq;
+   wait_queue_head_t queue_wait;
 
/* The disk structure for the kernel. */
struct gendisk *disk;
@@ -39,11 +42,13 @@ struct virtio_blk
 struct virtblk_req
 {
void *private;
+   struct virtblk_req *next;
struct virtio_blk_outhdr out_hdr;
struct virtio_scsi_inhdr in_hdr;
u8 kind;
 #define VIRTIO_BLK_REQUEST 0x00
-#define VIRTIO_BLK_INTERNAL0x01
+#define VIRTIO_BLK_BIO 0x01
+#define VIRTIO_BLK_INTERNAL0x02
u8 status;
 };
 
@@ -74,10 +79,17 @@ static void virtblk_request_done(struct
mempool_free(vbr, vblk-pool);
 }
 
+static void virtblk_bio_done(struct virtio_blk *vblk,
+   struct virtblk_req *vbr)
+{
+   bio_endio(vbr-private, virtblk_result(vbr));
+   mempool_free(vbr, vblk-pool);
+}
+
 static void blk_done(struct virtqueue *vq)
 {
struct virtio_blk *vblk = vq-vdev-priv;
-   struct virtblk_req *vbr;
+   struct virtblk_req *vbr, *head = NULL, *tail = NULL;
unsigned int len;
unsigned long flags;
 
@@ -88,15 +100,47 @@ static void blk_done(struct virtqueue *v
virtblk_request_done(vblk, vbr);
break;
case VIRTIO_BLK_INTERNAL:
-   complete(vbr-private);
+   case VIRTIO_BLK_BIO:
+   if (head) {
+   tail-next = vbr;
+   tail = vbr;
+   } else {
+   tail = head = vbr;
+   }
break;
default:
BUG();
}
}
-   /* In case queue is stopped waiting for more buffers. */
-   blk_start_queue(vblk-disk-queue);
+
+   if (!use_make_request) {
+   /* In case queue is stopped waiting for more buffers. */
+   blk_start_queue(vblk-disk-queue);
+   }
spin_unlock_irqrestore(vblk-lock, flags);
+
+   wake_up(vblk-queue_wait);
+
+   /*
+* Process completions after freeing up space in the virtqueue and
+* dropping the lock.
+*/
+   while (head) {
+   vbr = head;
+   head = head-next;
+
+   switch (vbr-kind) {
+   case VIRTIO_BLK_BIO:
+   virtblk_bio_done(vblk, vbr);
+   break;
+   case VIRTIO_BLK_INTERNAL:
+   complete(vbr-private);
+   break;
+   default:
+   BUG();
+   }
+
+   }
 }
 
 static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
@@ -111,6 +155,7 @@ static bool do_req(struct request_queue
return false;
 
vbr-private = req;
+   vbr-next = NULL;
vbr-kind = VIRTIO_BLK_REQUEST;
 
if (req-cmd_flags  REQ_FLUSH) {
@@ -199,6 +244,128 @@ static void do_virtblk_request(struct re
virtqueue_kick(vblk-vq);
 }
 
+struct virtblk_plug_cb {
+   struct blk_plug_cb cb;
+   struct virtio_blk *vblk;
+};
+
+static void virtblk_unplug(struct blk_plug_cb *bcb)
+{
+   struct virtblk_plug_cb *cb =
+   container_of(bcb, struct virtblk_plug_cb, cb);
+
+   virtqueue_notify(cb-vblk-vq);
+   kfree(cb);
+}
+
+static bool virtblk_plugged(struct virtio_blk *vblk)
+{
+   struct blk_plug *plug = current-plug;
+   struct virtblk_plug_cb *cb;
+
+   if (!plug)
+   return false;
+
+   list_for_each_entry(cb, plug-cb_list, cb.list) {
+   if (cb-cb.callback == virtblk_unplug  cb-vblk == vblk)
+   return true;
+   }
+
+   /* Not currently on the callback list */
+   cb = kmalloc(sizeof(*cb), GFP_ATOMIC);
+   if (!cb)
+   return false;
+
+   cb-vblk = vblk;
+   cb-cb.callback = virtblk_unplug;
+   list_add(cb-cb.list, plug-cb_list);
+   return true;
+}
+
+static void 

[PATCH 3/5] virtio-blk: remove the unused list of pending requests

2011-10-05 Thread Christoph Hellwig
Signed-off-by: Christoph Hellwig h...@lst.de

Index: linux-2.6/drivers/block/virtio_blk.c
===
--- linux-2.6.orig/drivers/block/virtio_blk.c   2011-10-03 19:55:29.061215040 
+0200
+++ linux-2.6/drivers/block/virtio_blk.c2011-10-03 19:55:54.196412176 
+0200
@@ -24,9 +24,6 @@ struct virtio_blk
/* The disk structure for the kernel. */
struct gendisk *disk;
 
-   /* Request tracking. */
-   struct list_head reqs;
-
mempool_t *pool;
 
/* Process context for config space updates */
@@ -41,7 +38,6 @@ struct virtio_blk
 
 struct virtblk_req
 {
-   struct list_head list;
struct request *req;
struct virtio_blk_outhdr out_hdr;
struct virtio_scsi_inhdr in_hdr;
@@ -85,7 +81,6 @@ static void blk_done(struct virtqueue *v
}
 
__blk_end_request_all(vbr-req, error);
-   list_del(vbr-list);
mempool_free(vbr, vblk-pool);
}
/* In case queue is stopped waiting for more buffers. */
@@ -170,7 +165,6 @@ static bool do_req(struct request_queue
return false;
}
 
-   list_add_tail(vbr-list, vblk-reqs);
return true;
 }
 
@@ -368,7 +362,6 @@ static int __devinit virtblk_probe(struc
goto out;
}
 
-   INIT_LIST_HEAD(vblk-reqs);
spin_lock_init(vblk-lock);
vblk-vdev = vdev;
vblk-sg_elems = sg_elems;
@@ -526,9 +519,6 @@ static void __devexit virtblk_remove(str
 
flush_work(vblk-config_work);
 
-   /* Nothing should be pending. */
-   BUG_ON(!list_empty(vblk-reqs));
-
/* Stop all the virtqueues. */
vdev-config-reset(vdev);
 

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/5] RFC: -make_request support for virtio-blk

2011-10-05 Thread Christoph Hellwig
This patchset allows the virtio-blk driver to support much higher IOP
rates which can be driven out of modern PCI-e flash devices.  At this
point it really is just a RFC due to various issues.

The first four patches are infrastructure that could go in fairly
soon as far as I'm concerned.  Patch 5 implements the actual -make_request
support and still has a few issues, see there for more details.  With
it I can driver my PCI-e test devices to 85-90% of the native IOPS
and bandwith, but be warned that this is still a fairly low end setup
as far as expensive flash storage is concerned.

One big downside that is has is that it current exposes a nasty race
in the qemu virtqueue code - just running xfstests inside a guest
using the new virtio-blk driver (even on a slow device) will trigger
it and lead to a filesystem shutdown.  I've tracked it down to getting
data I/O segments overwritten with status s/g list entries, but got
lost at that point.  I can start a separate thread on it.

Besides that it is missing a few features, and we have to decided
how to select which mode to use in virtio-blk - either a module option,
sysfs attribute or something that the host communicates.  Or maybe
decide that just going with -make_request alone is fine, even on
my cheap laptop SSD it actually is just as fast if not slightly
faster than the request based variant on my laptop.

There are a few other bottlenecks in virtio that this exposes.  The
first one is the low queue length of just 128 entries in the virtio-blk
queue - to drive higher IOPs with a deep queue we absolutely need
to increment that.

Comments welcome!

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 4/5] virtio-blk: reimplement the serial attribute without using requests

2011-10-05 Thread Christoph Hellwig
If we want to do bio-based I/O in virtio-blk we have to implement reading
the serial attribute ourselves.  Do that and also prepare struct virtblk_req
for dealing with different types of requests.

Signed-off-by: Christoph Hellwig h...@lst.de

Index: linux-2.6/drivers/block/virtio_blk.c
===
--- linux-2.6.orig/drivers/block/virtio_blk.c   2011-10-03 20:32:12.997713070 
+0200
+++ linux-2.6/drivers/block/virtio_blk.c2011-10-03 20:37:28.836714193 
+0200
@@ -38,12 +38,42 @@ struct virtio_blk
 
 struct virtblk_req
 {
-   struct request *req;
+   void *private;
struct virtio_blk_outhdr out_hdr;
struct virtio_scsi_inhdr in_hdr;
+   u8 kind;
+#define VIRTIO_BLK_REQUEST 0x00
+#define VIRTIO_BLK_INTERNAL0x01
u8 status;
 };
 
+static inline int virtblk_result(struct virtblk_req *vbr)
+{
+   switch (vbr-status) {
+   case VIRTIO_BLK_S_OK:
+   return 0;
+   case VIRTIO_BLK_S_UNSUPP:
+   return -ENOTTY;
+   default:
+   return -EIO;
+   }
+}
+
+static void virtblk_request_done(struct virtio_blk *vblk,
+   struct virtblk_req *vbr)
+{
+   struct request *req = vbr-private;
+
+   if (req-cmd_type == REQ_TYPE_BLOCK_PC) {
+   req-resid_len = vbr-in_hdr.residual;
+   req-sense_len = vbr-in_hdr.sense_len;
+   req-errors = vbr-in_hdr.errors;
+   }
+
+   __blk_end_request_all(req, virtblk_result(vbr));
+   mempool_free(vbr, vblk-pool);
+}
+
 static void blk_done(struct virtqueue *vq)
 {
struct virtio_blk *vblk = vq-vdev-priv;
@@ -53,35 +83,16 @@ static void blk_done(struct virtqueue *v
 
spin_lock_irqsave(vblk-lock, flags);
while ((vbr = virtqueue_get_buf(vblk-vq, len)) != NULL) {
-   int error;
-
-   switch (vbr-status) {
-   case VIRTIO_BLK_S_OK:
-   error = 0;
-   break;
-   case VIRTIO_BLK_S_UNSUPP:
-   error = -ENOTTY;
-   break;
-   default:
-   error = -EIO;
-   break;
-   }
-
-   switch (vbr-req-cmd_type) {
-   case REQ_TYPE_BLOCK_PC:
-   vbr-req-resid_len = vbr-in_hdr.residual;
-   vbr-req-sense_len = vbr-in_hdr.sense_len;
-   vbr-req-errors = vbr-in_hdr.errors;
+   switch (vbr-kind) {
+   case VIRTIO_BLK_REQUEST:
+   virtblk_request_done(vblk, vbr);
break;
-   case REQ_TYPE_SPECIAL:
-   vbr-req-errors = (error != 0);
+   case VIRTIO_BLK_INTERNAL:
+   complete(vbr-private);
break;
default:
-   break;
+   BUG();
}
-
-   __blk_end_request_all(vbr-req, error);
-   mempool_free(vbr, vblk-pool);
}
/* In case queue is stopped waiting for more buffers. */
blk_start_queue(vblk-disk-queue);
@@ -99,28 +110,24 @@ static bool do_req(struct request_queue
/* When another request finishes we'll try again. */
return false;
 
-   vbr-req = req;
+   vbr-private = req;
+   vbr-kind = VIRTIO_BLK_REQUEST;
 
if (req-cmd_flags  REQ_FLUSH) {
vbr-out_hdr.type = VIRTIO_BLK_T_FLUSH;
vbr-out_hdr.sector = 0;
-   vbr-out_hdr.ioprio = req_get_ioprio(vbr-req);
+   vbr-out_hdr.ioprio = req_get_ioprio(req);
} else {
switch (req-cmd_type) {
case REQ_TYPE_FS:
vbr-out_hdr.type = 0;
-   vbr-out_hdr.sector = blk_rq_pos(vbr-req);
-   vbr-out_hdr.ioprio = req_get_ioprio(vbr-req);
+   vbr-out_hdr.sector = blk_rq_pos(req);
+   vbr-out_hdr.ioprio = req_get_ioprio(req);
break;
case REQ_TYPE_BLOCK_PC:
vbr-out_hdr.type = VIRTIO_BLK_T_SCSI_CMD;
vbr-out_hdr.sector = 0;
-   vbr-out_hdr.ioprio = req_get_ioprio(vbr-req);
-   break;
-   case REQ_TYPE_SPECIAL:
-   vbr-out_hdr.type = VIRTIO_BLK_T_GET_ID;
-   vbr-out_hdr.sector = 0;
-   vbr-out_hdr.ioprio = req_get_ioprio(vbr-req);
+   vbr-out_hdr.ioprio = req_get_ioprio(req);
break;
default:
/* We don't put anything else in the queue. */
@@ -136,13 +143,14 @@ static bool do_req(struct request_queue
 * block, and before the normal inhdr we put the sense data and the
 * inhdr with 

[PATCH 1/5] block: add bio_map_sg

2011-10-05 Thread Christoph Hellwig
Add a helper to map a bio to a scatterlist, modelled after blk_rq_map_sg.
This helper is useful for any driver that wants to create a scatterlist
from its -make_request method.

Signed-off-by: Christoph Hellwig h...@lst.de

Index: linux-2.6/block/blk-merge.c
===
--- linux-2.6.orig/block/blk-merge.c2011-10-04 11:49:32.857014742 -0400
+++ linux-2.6/block/blk-merge.c 2011-10-04 13:37:51.305630951 -0400
@@ -199,6 +199,69 @@ new_segment:
 }
 EXPORT_SYMBOL(blk_rq_map_sg);
 
+/*
+ * map a bio to a scatterlist, return number of sg entries setup. Caller
+ * must make sure sg can hold bio-bi_phys_segments entries
+ */
+int bio_map_sg(struct request_queue *q, struct bio *bio,
+ struct scatterlist *sglist)
+{
+   struct bio_vec *bvec, *bvprv;
+   struct scatterlist *sg;
+   int nsegs, cluster;
+   unsigned long i;
+
+   nsegs = 0;
+   cluster = blk_queue_cluster(q);
+
+   bvprv = NULL;
+   sg = NULL;
+   bio_for_each_segment(bvec, bio, i) {
+   int nbytes = bvec-bv_len;
+
+   if (bvprv  cluster) {
+   if (sg-length + nbytes  queue_max_segment_size(q))
+   goto new_segment;
+
+   if (!BIOVEC_PHYS_MERGEABLE(bvprv, bvec))
+   goto new_segment;
+   if (!BIOVEC_SEG_BOUNDARY(q, bvprv, bvec))
+   goto new_segment;
+
+   sg-length += nbytes;
+   } else {
+new_segment:
+   if (!sg)
+   sg = sglist;
+   else {
+   /*
+* If the driver previously mapped a shorter
+* list, we could see a termination bit
+* prematurely unless it fully inits the sg
+* table on each mapping. We KNOW that there
+* must be more entries here or the driver
+* would be buggy, so force clear the
+* termination bit to avoid doing a full
+* sg_init_table() in drivers for each command.
+*/
+   sg-page_link = ~0x02;
+   sg = sg_next(sg);
+   }
+
+   sg_set_page(sg, bvec-bv_page, nbytes, bvec-bv_offset);
+   nsegs++;
+   }
+   bvprv = bvec;
+   } /* segments in bio */
+
+   if (sg)
+   sg_mark_end(sg);
+
+   BUG_ON(bio-bi_phys_segments  nsegs  bio-bi_phys_segments);
+   return nsegs;
+}
+EXPORT_SYMBOL(bio_map_sg);
+
 static inline int ll_new_hw_segment(struct request_queue *q,
struct request *req,
struct bio *bio)
Index: linux-2.6/include/linux/blkdev.h
===
--- linux-2.6.orig/include/linux/blkdev.h   2011-10-04 13:37:13.216148915 
-0400
+++ linux-2.6/include/linux/blkdev.h2011-10-04 13:37:51.317613617 -0400
@@ -854,6 +854,8 @@ extern void blk_queue_flush_queueable(st
 extern struct backing_dev_info *blk_get_backing_dev_info(struct block_device 
*bdev);
 
 extern int blk_rq_map_sg(struct request_queue *, struct request *, struct 
scatterlist *);
+extern int bio_map_sg(struct request_queue *q, struct bio *bio,
+   struct scatterlist *sglist);
 extern void blk_dump_rq_flags(struct request *, char *);
 extern long nr_blockdev_pages(void);
 

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] pci_cap_init: add 82599 VF quirk

2011-10-05 Thread Marcelo Tosatti
On Tue, Oct 04, 2011 at 03:02:18PM -0400, Donald Dutile wrote:
 v2: Updated to reflect changes requested by reviewers.
  
 Add check when PCIe capability structure is version 0
 and VID and DID is 82599 VF.  In this case, the size
 of the PCIe cap structure should be the same as a version 2
 cap structure.
 Documented in 82599 Errata 35, and is still marked No Fix.
 According to Intel, it's in silicon not fw, and needs a sw workaround.
 
 Signed-off-by: Donald Dutile ddut...@redhat.com
 ---
  hw/device-assignment.c |   17 +++--
  1 files changed, 15 insertions(+), 2 deletions(-)

Applied, thanks.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [kvm] Re: tcpdump locks up kvm host for a while.

2011-10-05 Thread Robin Lee Powell
On Wed, Oct 05, 2011 at 11:36:53AM -0700, Robin Lee Powell wrote:
 On Wed, Oct 05, 2011 at 08:21:53PM +0200, Avi Kivity wrote:
  On 10/04/2011 09:40 PM, Robin Lee Powell wrote:
When I run tcpdump on a *guest*, the entire guest completely
freezes up; no response even to hitting enter on the console.
virsh list also locks up whenever it tries to print state
about that VM (but the others work fine), as does any other
operation that touches the state of that VM.  The VM takes up
100% of CPU on one core while this is happening.  Eventually
it gets better.
  
You can use 'perf kvm' to figure out where the guest is
spinning.
  
  OK, gathered with:
  
  sudo  perf kvm --guest --host record -o /tmp/kvm_perf -a
  
  I don't know how to read it at all, so it's at
  http://users.digitalkingdom.org/~rlpowell/media/public/kvm_perf
  
  
  Not accessible.
 
 -_-  Fixed.
 
  Please post the output of 'perf kvm report  log'.
 
 # Events: 42K
 #
 # Overhead   Command Shared Object  Symbol
 #       ..
 #
100.00%  qemu-kvm  [unknown] [g] 0x8111c7a5
 
 
 #
 # (For a higher level overview, try: perf report --sort comm,dso)
 #
 
 How helpful is that?  -_-
 
 I'm guessing I need --guestkallsyms= ; since they're all the same
 kernel I thought it'd figure it out.  I'll redo.

OK, here's a better version.

# Events: 46K cycles
#
# Overhead   CommandShared Object   Symbol
#     ...  ...
#
74.81%  qemu-kvm  [unknown][u] 0x7fbdffd4c18a
25.14%  qemu-kvm  [guest.kernel.kallsyms]  [g] 0x82f0
 0.03%  qemu-kvm  [virtio_net] [g] 0x83e8
 0.01%  qemu-kvm  [virtio_balloon] [g] 0x103b
 0.00%  qemu-kvm  [ip6_tables] [g] compat_standard_to_user
 0.00%  qemu-kvm  [ipv6]   [g] icmpv6_send
 0.00%  qemu-kvm  [virtio_blk] [g] 0x7783
 0.00%  qemu-kvm  [ipv6]   [g] raw6_seq_show
 0.00%  qemu-kvm  [ipv6]   [g] icmpv6_rcv
 0.00%  qemu-kvm  [virtio_net] [g] fini
 0.00%  qemu-kvm  [ip6table_filter][g] 0x9b5


#
# (For a higher level overview, try: perf report --sort comm,dso)
#

The file is also updated.

-Robin
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/5] RFC: -make_request support for virtio-blk

2011-10-05 Thread Vivek Goyal
On Wed, Oct 05, 2011 at 03:54:03PM -0400, Christoph Hellwig wrote:
 This patchset allows the virtio-blk driver to support much higher IOP
 rates which can be driven out of modern PCI-e flash devices.  At this
 point it really is just a RFC due to various issues.
 

So you no longer believe that request queue overhead can be brought
down to mangeable levels for these fast devices. And instead go for
bio based drivers and give up on merging and implement own FLUSH/FUA
machinery.

Not that I am advocating for continuing with request based driver. Just
curious..

Thanks
Vivek
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/5] RFC: -make_request support for virtio-blk

2011-10-05 Thread Christoph Hellwig
On Wed, Oct 05, 2011 at 04:31:16PM -0400, Vivek Goyal wrote:
 So you no longer believe that request queue overhead can be brought
 down to mangeable levels for these fast devices. And instead go for
 bio based drivers and give up on merging and implement own FLUSH/FUA
 machinery.

Not in a reasonable time frame. Having a common interface is still the
grand plan, but the virtio performance issues are hurting people so
badly that we need a fix soon.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Perf for only one VM?

2011-10-05 Thread David Ahern
On 10/05/2011 01:54 PM, Robin Lee Powell wrote:
 Is it possible to get perf kvm record to record only what's going on
 on a single VM?  The options seem to be --host (all host data) and
 --guest (all VM data).

-p pid argument after the record.

David
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5] block: add bio_map_sg

2011-10-05 Thread Boaz Harrosh
On 10/05/2011 09:54 PM, Christoph Hellwig wrote:
 Add a helper to map a bio to a scatterlist, modelled after blk_rq_map_sg.
 This helper is useful for any driver that wants to create a scatterlist
 from its -make_request method.
 
 Signed-off-by: Christoph Hellwig h...@lst.de
 

I have some questions.

- Could we later use this bio_map_sg() to implement blk_rq_map_sg() and
  remove some duplicated code?

- Don't you need to support a chained bio (bio-next != NULL)? Because
  I did not see any looping in the last patch 
[PATCH 5/5] virtio-blk: implement -make_request
  Or is it that -make_request() is a single bio at a time?
  If so could we benefit from both bio-chaining and sg-chaning to
  make bigger IOs?

Thanks
Boaz

 Index: linux-2.6/block/blk-merge.c
 ===
 --- linux-2.6.orig/block/blk-merge.c  2011-10-04 11:49:32.857014742 -0400
 +++ linux-2.6/block/blk-merge.c   2011-10-04 13:37:51.305630951 -0400
 @@ -199,6 +199,69 @@ new_segment:
  }
  EXPORT_SYMBOL(blk_rq_map_sg);
  
 +/*
 + * map a bio to a scatterlist, return number of sg entries setup. Caller
 + * must make sure sg can hold bio-bi_phys_segments entries
 + */
 +int bio_map_sg(struct request_queue *q, struct bio *bio,
 +   struct scatterlist *sglist)
 +{
 + struct bio_vec *bvec, *bvprv;
 + struct scatterlist *sg;
 + int nsegs, cluster;
 + unsigned long i;
 +
 + nsegs = 0;
 + cluster = blk_queue_cluster(q);
 +
 + bvprv = NULL;
 + sg = NULL;
 + bio_for_each_segment(bvec, bio, i) {
 + int nbytes = bvec-bv_len;
 +
 + if (bvprv  cluster) {
 + if (sg-length + nbytes  queue_max_segment_size(q))
 + goto new_segment;
 +
 + if (!BIOVEC_PHYS_MERGEABLE(bvprv, bvec))
 + goto new_segment;
 + if (!BIOVEC_SEG_BOUNDARY(q, bvprv, bvec))
 + goto new_segment;
 +
 + sg-length += nbytes;
 + } else {
 +new_segment:
 + if (!sg)
 + sg = sglist;
 + else {
 + /*
 +  * If the driver previously mapped a shorter
 +  * list, we could see a termination bit
 +  * prematurely unless it fully inits the sg
 +  * table on each mapping. We KNOW that there
 +  * must be more entries here or the driver
 +  * would be buggy, so force clear the
 +  * termination bit to avoid doing a full
 +  * sg_init_table() in drivers for each command.
 +  */
 + sg-page_link = ~0x02;
 + sg = sg_next(sg);
 + }
 +
 + sg_set_page(sg, bvec-bv_page, nbytes, bvec-bv_offset);
 + nsegs++;
 + }
 + bvprv = bvec;
 + } /* segments in bio */
 +
 + if (sg)
 + sg_mark_end(sg);
 +
 + BUG_ON(bio-bi_phys_segments  nsegs  bio-bi_phys_segments);
 + return nsegs;
 +}
 +EXPORT_SYMBOL(bio_map_sg);
 +
  static inline int ll_new_hw_segment(struct request_queue *q,
   struct request *req,
   struct bio *bio)
 Index: linux-2.6/include/linux/blkdev.h
 ===
 --- linux-2.6.orig/include/linux/blkdev.h 2011-10-04 13:37:13.216148915 
 -0400
 +++ linux-2.6/include/linux/blkdev.h  2011-10-04 13:37:51.317613617 -0400
 @@ -854,6 +854,8 @@ extern void blk_queue_flush_queueable(st
  extern struct backing_dev_info *blk_get_backing_dev_info(struct block_device 
 *bdev);
  
  extern int blk_rq_map_sg(struct request_queue *, struct request *, struct 
 scatterlist *);
 +extern int bio_map_sg(struct request_queue *q, struct bio *bio,
 + struct scatterlist *sglist);
  extern void blk_dump_rq_flags(struct request *, char *);
  extern long nr_blockdev_pages(void);
  
 
 --
 To unsubscribe from this list: send the line unsubscribe linux-kernel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 Please read the FAQ at  http://www.tux.org/lkml/
 

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [kvm] Re: Perf for only one VM?

2011-10-05 Thread Robin Lee Powell
On Wed, Oct 05, 2011 at 04:03:13PM -0600, David Ahern wrote:
 On 10/05/2011 01:54 PM, Robin Lee Powell wrote:
  Is it possible to get perf kvm record to record only what's
  going on on a single VM?  The options seem to be --host (all
  host data) and --guest (all VM data).
 
 -p pid argument after the record.

Thanks!

-Robin
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH 2/2] LAPIC: make lapic support cpu hotplug

2011-10-05 Thread liu ping fan
On Wed, Oct 5, 2011 at 7:01 PM, Jan Kiszka jan.kis...@web.de wrote:
 On 2011-10-05 12:26, liu ping fan wrote:
   And make the creation of apic as part of cpu initialization, so
 apic's state has been ready, before setting kvm_apic.

 There is no kvm-apic upstream yet, so it's hard to judge why we need
 this here. If we do, this has to be a separate patch. But I seriously
 doubt we need it (my hack worked without it, and that was not because of
 its hack nature).

 Sorry, I did not explain it clearly. What I mean is that “env-apic_state”
 must be prepared
 before qemu_kvm_cpu_thread_fn() - ... - kvm_put_sregs(), where we get
 apic_base by
 “ sregs.apic_base = cpu_get_apic_base(env-apic_state);”
 and then call “kvm_vcpu_ioctl(env, KVM_SET_SREGS, sregs);” which will
 finally affect the
 kvm_apic structure in kernel.

 But as current code, in pc_new_cpu(), we call apic_init() to initialize
 apic_state, after cpu_init(),
 so we can not guarantee the order of apic_state initializaion and the
 setting to kernel.

 Because LAPIC is part of x86 chip, I want to move it into cpu_x86_init(),
 and ensure apic_init()
 called before thread “qemu_kvm_cpu_thread_fn()” creation.

 The LAPIC is part of the CPU, the classic APIC was a dedicated chip.
Sorry, a little puzzle.  I think x86 interrupt system consists of two
parts: IOAPIC/LAPIC.
For we have hw/ioapic.c to simulate IOAPIC,  I think hw/apic.c
takes the role as LAPIC,
especially that we create an APICState instance for each CPUX86State,
just like each LAPIC
for x86 CPU in real machine.
So we can consider apic_init() to create a LAPIC instance, rather than
create a  classic APIC?

I guess If there is lack of something in IOAPIC/LAPIC bus topology,
that will be the arbitrator of ICC bus, right?
So the classic APIC was a dedicated chip what you said, play this
role,  right?
Could you tell me a sample chipset of APIC, and I can increase my
knowledge about it, thanks.


 For various reasons, a safer approach for creating a new CPU is to stop
 the machine, add the new device models, run cpu_synchronize_post_init on
 that new cpu (looks like you missed that) and then resume everything.
 See
 http://git.kiszka.org/?p=qemu-kvm.git;a=commitdiff;h=be8f21c6b54eac82f7add7ee9d4ecf9cb8ebb320

Great job. And I am interesting about it. Could you give some sample
reason about why we need to stop
the machine, or list all of the reasons, so we can resolve it one by
one. I can not figure out such scenes by myself.
From my view, especially for KVM, the creation of vcpu are protected
well by lock mechanism from other
vcpu threads in kernel, so we need not to stop all of the threads.

Hope for suggestion and direct from you, and make a further step for
CPU hot-plug feature,

Thanks and regards,
ping fan


 ...
 diff --git a/hw/icc_bus.c b/hw/icc_bus.c
 new file mode 100644
 index 000..360ca2a
 --- /dev/null
 +++ b/hw/icc_bus.c
 @@ -0,0 +1,62 @@
 +/*
 +*/
 +#define ICC_BUS_PLUG
 +#ifdef ICC_BUS_PLUG
 +#include icc_bus.h
 +
 +
 +
 +struct icc_bus_info icc_info = {
 +    .qinfo.name = icc,
 +    .qinfo.size = sizeof(struct icc_bus),
 +    .qinfo.props = (Property[]) {
 +        DEFINE_PROP_END_OF_LIST(),
 +    }
 +
 +};
 +
 +
 +static const VMStateDescription vmstate_icc_bus = {
 +    .name = icc_bus,
 +    .version_id = 1,
 +    .minimum_version_id = 1,
 +    .minimum_version_id_old = 1,
 +    .pre_save = NULL,
 +    .post_load = NULL,
 +};
 +
 +struct icc_bus *g_iccbus;
 +
 +struct icc_bus *icc_init_bus(DeviceState *parent, const char *name)
 +{
 +    struct icc_bus *bus;
 +
 +    bus = FROM_QBUS(icc_bus, qbus_create(icc_info.qinfo, parent,
 name));
 +    bus-qbus.allow_hotplug = 1; /* Yes, we can */
 +    bus-qbus.name = icc;
 +    vmstate_register(NULL, -1, vmstate_icc_bus, bus);

 The chipset is the owner of this bus and instantiates it. So it also
 provides a vmstate. You can drop this unneeded one here (it's created
 via an obsolete API anyway).


 No familiar with Qemu bus emulation, keep on learning :) . But what I
 thought is,
 the x86-ICC bus is not the same as bus like PCI.
 For a PCI bus, it lies behind a host bridge, but ICC is shared by all x86
 processors in SMP system,
 so there is not a outstanding owner.  And I right?

 ICC is also attached to the chipset (due to the IOAPIC). So it looks
 reasonable to me to let the chipset do the lifecycle management as well.
 It is the fixed point, CPUs may come and go.

 Jan


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH 2/2] LAPIC: make lapic support cpu hotplug

2011-10-05 Thread Anthony Liguori

On 10/05/2011 08:13 PM, liu ping fan wrote:

On Wed, Oct 5, 2011 at 7:01 PM, Jan Kiszkajan.kis...@web.de  wrote:

On 2011-10-05 12:26, liu ping fan wrote:

And make the creation of apic as part of cpu initialization, so

apic's state has been ready, before setting kvm_apic.


There is no kvm-apic upstream yet, so it's hard to judge why we need
this here. If we do, this has to be a separate patch. But I seriously
doubt we need it (my hack worked without it, and that was not because of
its hack nature).

Sorry, I did not explain it clearly. What I mean is that “env-apic_state”

must be prepared
before qemu_kvm_cpu_thread_fn() -  ... -  kvm_put_sregs(), where we get
apic_base by
“ sregs.apic_base = cpu_get_apic_base(env-apic_state);”
and then call “kvm_vcpu_ioctl(env, KVM_SET_SREGS,sregs);” which will
finally affect the
kvm_apic structure in kernel.

But as current code, in pc_new_cpu(), we call apic_init() to initialize
apic_state, after cpu_init(),
so we can not guarantee the order of apic_state initializaion and the
setting to kernel.

Because LAPIC is part of x86 chip, I want to move it into cpu_x86_init(),
and ensure apic_init()
called before thread “qemu_kvm_cpu_thread_fn()” creation.


The LAPIC is part of the CPU, the classic APIC was a dedicated chip.

Sorry, a little puzzle.  I think x86 interrupt system consists of two
parts: IOAPIC/LAPIC.
For we have hw/ioapic.c to simulate IOAPIC,  I think hw/apic.c
takes the role as LAPIC,
especially that we create an APICState instance for each CPUX86State,
just like each LAPIC
for x86 CPU in real machine.
So we can consider apic_init() to create a LAPIC instance, rather than
create a  classic APIC?

I guess If there is lack of something in IOAPIC/LAPIC bus topology,
that will be the arbitrator of ICC bus, right?
So the classic APIC was a dedicated chip what you said, play this
role,  right?
Could you tell me a sample chipset of APIC, and I can increase my
knowledge about it, thanks.


I think Jan meant the PIC was a dedicated chip.  hw/apic.c implements an LAPIC, 
hw/i8259.c implements an I8259A otherwise known as the PIC.  hw/ioapic.c 
implements an I/O APIC.


Together, the I/O APIC and the LAPIC form an 'APIC Architecture'.  Usually, the 
legacy PIC can hang off of the BSP LAPIC.


Regards,

Anthony Liguori





For various reasons, a safer approach for creating a new CPU is to stop
the machine, add the new device models, run cpu_synchronize_post_init on
that new cpu (looks like you missed that) and then resume everything.
See
http://git.kiszka.org/?p=qemu-kvm.git;a=commitdiff;h=be8f21c6b54eac82f7add7ee9d4ecf9cb8ebb320


Great job. And I am interesting about it. Could you give some sample
reason about why we need to stop
the machine, or list all of the reasons, so we can resolve it one by
one. I can not figure out such scenes by myself.

From my view, especially for KVM, the creation of vcpu are protected

well by lock mechanism from other
vcpu threads in kernel, so we need not to stop all of the threads.

Hope for suggestion and direct from you, and make a further step for
CPU hot-plug feature,

Thanks and regards,
ping fan



...

diff --git a/hw/icc_bus.c b/hw/icc_bus.c
new file mode 100644
index 000..360ca2a
--- /dev/null
+++ b/hw/icc_bus.c
@@ -0,0 +1,62 @@
+/*
+*/
+#define ICC_BUS_PLUG
+#ifdef ICC_BUS_PLUG
+#include icc_bus.h
+
+
+
+struct icc_bus_info icc_info = {
+.qinfo.name = icc,
+.qinfo.size = sizeof(struct icc_bus),
+.qinfo.props = (Property[]) {
+DEFINE_PROP_END_OF_LIST(),
+}
+
+};
+
+
+static const VMStateDescription vmstate_icc_bus = {
+.name = icc_bus,
+.version_id = 1,
+.minimum_version_id = 1,
+.minimum_version_id_old = 1,
+.pre_save = NULL,
+.post_load = NULL,
+};
+
+struct icc_bus *g_iccbus;
+
+struct icc_bus *icc_init_bus(DeviceState *parent, const char *name)
+{
+struct icc_bus *bus;
+
+bus = FROM_QBUS(icc_bus, qbus_create(icc_info.qinfo, parent,

name));

+bus-qbus.allow_hotplug = 1; /* Yes, we can */
+bus-qbus.name = icc;
+vmstate_register(NULL, -1,vmstate_icc_bus, bus);


The chipset is the owner of this bus and instantiates it. So it also
provides a vmstate. You can drop this unneeded one here (it's created
via an obsolete API anyway).



No familiar with Qemu bus emulation, keep on learning :) . But what I
thought is,
the x86-ICC bus is not the same as bus like PCI.
For a PCI bus, it lies behind a host bridge, but ICC is shared by all x86
processors in SMP system,
so there is not a outstanding owner.  And I right?


ICC is also attached to the chipset (due to the IOAPIC). So it looks
reasonable to me to let the chipset do the lifecycle management as well.
It is the fixed point, CPUs may come and go.

Jan






--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv3 4/4] acpi: automatically generated ssdt proc

2011-10-05 Thread Kevin O'Connor
On Wed, Oct 05, 2011 at 08:35:26AM -0200, Michael S. Tsirkin wrote:
 On Tue, Oct 04, 2011 at 10:52:33PM -0400, Kevin O'Connor wrote:
  Something like:
  
  ACPI_EXTRACT ssdt_proc_obj
  Processor (CPAA, 0xAA, 0xb010, 0x06) {
  
 
 I considered this, sure. We could parse AML to figure out
 what is the object type we are trying to look up.
 
 However I decided sanity-checking that we get the right type of object
 in AML is better. This way if iasl output format breaks
 we will have a better chance to detect that.
 Makes sense?

Yes.  I guess one could do ACPI_EXTRACT_PROCESSOR for the sanity
check.

 Also this can not be as generic as it seems: each type of
 object in AML bytecode is encoded slightly differently.
 So we would still have types of objects we support
 and types of object we don't.

Yes.

  which would produce something like:
  
  static struct aml_object ssdt_proc_obj = {.addr=0x24, .size=0x40, 
  .param=0x28};
 
 What is the param offset here?

The location of the first byte of the parameters (the same as you had
for ssdt_proc_name).  Normally, AML objects take the form: id
length fixed params variable param list.  The length is itself
of variable length, so passing in the start of the fixed parameters
would make manipulating the results easier.

  As for the other parts of this patch series - I'm still leary of
  changing the DSDT dynamically.
 
 Hmm, not sure I understand why. Could you explain pls?

Sure:

- The DSDT is big and has several cross-functional users.  Patching up
  the DSDT for hotplug when the DSDT also has unrelated stuff (eg,
  mouse) seems ugly.

- The PCI hotplug stuff is generating a whole bunch of devices and the
  dynamic code is effectively disabling the unwanted ones.  It seems
  nicer to dynamically generate the desired entries instead of bulk
  generating and dynamically blanking.

- The CPU hotplug has similar requirements, but is implemented
  differently - it generates the CPU objects dynamically.  It's not
  desirable to bulk generate the CPU objects and blank them
  dynamically, because 255 CPU objects would noticeably increase
  SeaBIOS' static size.

- Some time back there were patches floating around to pass the DSDT
  into SeaBIOS via fw_cfg interface.  Those patches never made it in
  (I forget why), but the basic functionality seemed sound.  Patching
  the DSDT in SeaBIOS would seem to eliminate that possibility.

None of these would be road-blocks.  However, they make me want to
consider other approaches.

  and then just memcpy the hotplug_obj N number of times into the ssdt
  for each available slot.  (This would be on top of the DSDT
  simplification patch series that I posted previously.)
 
 This assumes all devices are the same. But unfortunately this will not
 work for other devices such as VGA.

The VGA device can't be hotplugged, so I don't see why that would be
an issue.

-Kevin
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/5] virtio-blk: implement -make_request

2011-10-05 Thread Rusty Russell
On Wed, 05 Oct 2011 15:54:08 -0400, Christoph Hellwig h...@infradead.org 
wrote:
 Add an alternate I/O path that implements -make_request for virtio-blk.
 This is required for high IOPs devices which get slowed down to 1/5th of
 the native speed by all the locking, memory allocation and other overhead
 in the request based I/O path.

Ouch.

I'd be tempted to just switch across to this, though I'd be interested
to see if the simple add_buf change I referred to before has some effect
by itself (I doubt it).

Also, though it's overkill I'd use standard list primitives rather than
open-coding a single linked list.

Thanks!
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: PPC: E500: Support hugetlbfs

2011-10-05 Thread Alexander Graf

On 05.10.2011, at 18:06, Scott Wood wrote:

 On 10/05/2011 09:37 AM, Alexander Graf wrote:
 diff --git a/arch/powerpc/kvm/e500_tlb.c b/arch/powerpc/kvm/e500_tlb.c
 index ec17148..1dd96a9 100644
 --- a/arch/powerpc/kvm/e500_tlb.c
 +++ b/arch/powerpc/kvm/e500_tlb.c
 @@ -24,6 +24,7 @@
 #include linux/sched.h
 #include linux/rwsem.h
 #include linux/vmalloc.h
 +#include linux/hugetlb.h
 #include asm/kvm_ppc.h
 #include asm/kvm_e500.h
 
 @@ -673,12 +674,31 @@ static inline void kvmppc_e500_shadow_map(struct 
 kvmppc_vcpu_e500 *vcpu_e500,
  pfn = ~(tsize_pages - 1);
  break;
  }
 +} else if (vma  hva = vma-vm_start 
 +   (vma-vm_flags  VM_HUGETLB)) {
 +unsigned long psize = vma_kernel_pagesize(vma);
 +
 +tsize = (gtlbe-mas1  MAS1_TSIZE_MASK) 
 +MAS1_TSIZE_SHIFT;
 +
 +/*
 + * Take the largest page size that satisfies both host
 + * and guest mapping
 + */
 +tsize = min(__ilog2(psize) - 10, tsize);
 
 Any reason for __ilog2() rather than ilog2()?  Shouldn't make a
 difference, just curious about avoiding the public interface.

I grep'ed through the kernel tree and only found __ilog2 defined as well as 
mostly users for __ilog2, so I figured there's got to be a reason ;)


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: PPC: E500: Support hugetlbfs

2011-10-05 Thread Scott Wood
On 10/05/2011 01:55 PM, Alexander Graf wrote:
 
 On 05.10.2011, at 18:06, Scott Wood wrote:
 
 Any reason for __ilog2() rather than ilog2()?  Shouldn't make a
 difference, just curious about avoiding the public interface.
 
 I grep'ed through the kernel tree and only found __ilog2 defined as well as 
 mostly users for __ilog2, so I figured there's got to be a reason ;)

ilog2() is defined in include/linux/ilog2.h.  It produces constant
output if the input is constant, and appears to be the front door to
__ilog2_u32/__ilog2_u64.  Plain __ilog2 is older and powerpc-specific,
which is probably why there are more users of that in arch/powerpc.

-Scott

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html