from:"Glauber Costa"

Re: [PATCH v1 0/2] Add timeout mechanism to qmp actions

2020-12-08 Thread Glauber Costa

On Tue, Dec 8, 2020 at 8:11 AM Stefan Hajnoczi  wrote:
>
> On Thu, Oct 22, 2020 at 05:29:16PM +0100, Fam Zheng wrote:
> > On Tue, 2020-10-20 at 09:34 +0800, Zhenyu Ye wrote:
> > > On 2020/10/19 21:25, Paolo Bonzini wrote:
> > > > On 19/10/20 14:40, Zhenyu Ye wrote:
> > > > > The kernel backtrace for io_submit in GUEST is:
> > > > >
> > > > > guest# ./offcputime -K -p `pgrep -nx fio`
> > > > > b'finish_task_switch'
> > > > > b'__schedule'
> > > > > b'schedule'
> > > > > b'io_schedule'
> > > > > b'blk_mq_get_tag'
> > > > > b'blk_mq_get_request'
> > > > > b'blk_mq_make_request'
> > > > > b'generic_make_request'
> > > > > b'submit_bio'
> > > > > b'blkdev_direct_IO'
> > > > > b'generic_file_read_iter'
> > > > > b'aio_read'
> > > > > b'io_submit_one'
> > > > > b'__x64_sys_io_submit'
> > > > > b'do_syscall_64'
> > > > > b'entry_SYSCALL_64_after_hwframe'
> > > > > -fio (1464)
> > > > > 40031912
> > > > >
> > > > > And Linux io_uring can avoid the latency problem.
> >
> > Thanks for the info. What this tells us is basically the inflight
> > requests are high. It's sad that the linux-aio is in practice
> > implemented as a blocking API.

it is.

> >
> > Host side backtrace will be of more help. Can you get that too?
>
> I guess Linux AIO didn't set the BLK_MQ_REQ_NOWAIT flag so the task went
> to sleep when it ran out of blk-mq tags. The easiest solution is to move
> to io_uring. Linux AIO is broken - it's not AIO :).

Agree!
>
> If we know that no other process is writing to the host block device
> then maybe we can determine the blk-mq tags limit (the queue depth) and
> avoid sending more requests. That way QEMU doesn't block, but I don't
> think this approach works when other processes are submitting I/O to the
> same host block device :(.
>
> Fam's original suggestion of invoking io_submit(2) from a worker thread
> is an option, but I'm afraid it will slow down the uncontended case.
>
> I'm CCing Glauber in case he battled this in the past in ScyllaDB.

We have, and a lot. I don't recall seeing this particular lock, but
XFS would block us all the time
if it had to update metadata to submit the operation, lock inodes, etc.

The work we did at the time was in fixing those things in the kernel
as much as we could.
But the API is just like that...

>
> Stefan

Re: [Qemu-devel] [PATCH v3] Add an isa device for SGA

2011-06-07 Thread Glauber Costa


On 06/07/2011 04:17 PM, Anthony Liguori wrote:

On 05/16/2011 01:45 PM, Glauber Costa wrote:

This patch adds a dummy legacy ISA device whose responsibility is to
deploy sgabios, an option rom for a serial graphics adapter.
The proposal is that this device is always-on when -nographics,
but can otherwise be enable in any setup when -device sga is used.

[v2: suggestions on qdev by Markus ]
[v3: cleanups and documentation, per list suggestions ]

Signed-off-by: Glauber Costa


Applied. But I'd like to figure out what to do about sgabios.bin. I
think we should ship a copy.


Agree.


Regards,

Anthony Liguori


---
Makefile.target | 2 +-
hw/pc.c | 9 
hw/sga.c | 56 +++
3 files changed, 66 insertions(+), 1 deletions(-)
create mode 100644 hw/sga.c

diff --git a/Makefile.target b/Makefile.target
index fdbdc6c..004ea7e 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -224,7 +224,7 @@ obj-$(CONFIG_KVM) += ivshmem.o
# Hardware support
obj-i386-y += vga.o
obj-i386-y += mc146818rtc.o i8259.o pc.o
-obj-i386-y += cirrus_vga.o apic.o ioapic.o piix_pci.o
+obj-i386-y += cirrus_vga.o sga.o apic.o ioapic.o piix_pci.o
obj-i386-y += vmport.o
obj-i386-y += device-hotplug.o pci-hotplug.o smbios.o wdt_ib700.o
obj-i386-y += extboot.o
diff --git a/hw/pc.c b/hw/pc.c
index 8d351ba..5a8e00a 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -1096,6 +1096,15 @@ void pc_vga_init(PCIBus *pci_bus)
isa_vga_init();
}
}
+
+ /*
+ * sga does not suppress normal vga output. So a machine can have both a
+ * vga card and sga manually enabled. Output will be seen on both.
+ * For nographic case, sga is enabled at all times
+ */
+ if (display_type == DT_NOGRAPHIC) {
+ isa_create_simple("sga");
+ }
}

static void cpu_request_exit(void *opaque, int irq, int level)
diff --git a/hw/sga.c b/hw/sga.c
new file mode 100644
index 000..7ef750a
--- /dev/null
+++ b/hw/sga.c
@@ -0,0 +1,56 @@
+/*
+ * QEMU dummy ISA device for loading sgabios option rom.
+ *
+ * Copyright (c) 2011 Glauber Costa, Red Hat Inc.
+ *
+ * Permission is hereby granted, free of charge, to any person
obtaining a copy
+ * of this software and associated documentation files (the
"Software"), to deal
+ * in the Software without restriction, including without limitation
the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense,
and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be
included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT
SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES
OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
DEALINGS IN
+ * THE SOFTWARE.
+ *
+ * sgabios code originally available at code.google.com/p/sgabios
+ *
+ */
+#include "pci.h"
+#include "pc.h"
+#include "loader.h"
+#include "sysemu.h"
+
+#define SGABIOS_FILENAME "sgabios.bin"
+
+typedef struct ISAGAState {
+ ISADevice dev;
+} ISASGAState;
+
+static int isa_cirrus_vga_initfn(ISADevice *dev)
+{
+ rom_add_vga(SGABIOS_FILENAME);
+ return 0;
+}
+
+static ISADeviceInfo sga_info = {
+ .qdev.name = "sga",
+ .qdev.desc = "Serial Graphics Adapter",
+ .qdev.size = sizeof(ISASGAState),
+ .init = isa_cirrus_vga_initfn,
+};
+
+static void sga_register(void)
+{
+ isa_qdev_register(&sga_info);
+}
+
+device_init(sga_register);

[Qemu-devel] [PATCH v3] Add an isa device for SGA

2011-05-16 Thread Glauber Costa

This patch adds a dummy legacy ISA device whose responsibility is to
deploy sgabios, an option rom for a serial graphics adapter.
The proposal is that this device is always-on when -nographics,
but can otherwise be enable in any setup when -device sga is used.

[v2: suggestions on qdev by Markus ]
[v3: cleanups and documentation, per list suggestions ]

Signed-off-by: Glauber Costa 
---
 Makefile.target |2 +-
 hw/pc.c |9 
 hw/sga.c|   56 +++
 3 files changed, 66 insertions(+), 1 deletions(-)
 create mode 100644 hw/sga.c

diff --git a/Makefile.target b/Makefile.target
index fdbdc6c..004ea7e 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -224,7 +224,7 @@ obj-$(CONFIG_KVM) += ivshmem.o
 # Hardware support
 obj-i386-y += vga.o
 obj-i386-y += mc146818rtc.o i8259.o pc.o
-obj-i386-y += cirrus_vga.o apic.o ioapic.o piix_pci.o
+obj-i386-y += cirrus_vga.o sga.o apic.o ioapic.o piix_pci.o
 obj-i386-y += vmport.o
 obj-i386-y += device-hotplug.o pci-hotplug.o smbios.o wdt_ib700.o
 obj-i386-y += extboot.o
diff --git a/hw/pc.c b/hw/pc.c
index 8d351ba..5a8e00a 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -1096,6 +1096,15 @@ void pc_vga_init(PCIBus *pci_bus)
 isa_vga_init();
 }
 }
+
+/*
+ * sga does not suppress normal vga output. So a machine can have both a
+ * vga card and sga manually enabled. Output will be seen on both.
+ * For nographic case, sga is enabled at all times
+ */
+if (display_type == DT_NOGRAPHIC) {
+isa_create_simple("sga");
+}
 }
 
 static void cpu_request_exit(void *opaque, int irq, int level)
diff --git a/hw/sga.c b/hw/sga.c
new file mode 100644
index 000..7ef750a
--- /dev/null
+++ b/hw/sga.c
@@ -0,0 +1,56 @@
+/*
+ * QEMU dummy ISA device for loading sgabios option rom.
+ *
+ * Copyright (c) 2011 Glauber Costa, Red Hat Inc.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to 
deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING 
FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ *
+ * sgabios code originally available at code.google.com/p/sgabios
+ *
+ */
+#include "pci.h"
+#include "pc.h"
+#include "loader.h"
+#include "sysemu.h"
+
+#define SGABIOS_FILENAME "sgabios.bin"
+
+typedef struct ISAGAState {
+ISADevice dev;
+} ISASGAState;
+
+static int isa_cirrus_vga_initfn(ISADevice *dev)
+{
+rom_add_vga(SGABIOS_FILENAME);
+return 0;
+}
+
+static ISADeviceInfo sga_info = {
+.qdev.name= "sga",
+.qdev.desc= "Serial Graphics Adapter",
+.qdev.size= sizeof(ISASGAState),
+.init = isa_cirrus_vga_initfn,
+};
+
+static void sga_register(void)
+{
+  isa_qdev_register(&sga_info);
+}
+
+device_init(sga_register);
-- 
1.7.4.2

Re: [Qemu-devel] [PATCH v2] Add an isa device for SGA

2011-05-13 Thread Glauber Costa

On Fri, May 13, 2011 at 10:43:33AM +0200, Markus Armbruster wrote:
> Glauber Costa  writes:
> 
> > This patch adds a dummy legacy ISA device whose responsibility is to
> > deploy sgabios, an option rom for a serial graphics adapter.
> > The proposal is that this device is always-on when -nographics,
> > but can otherwise be enable in any setup when -device sga is used.
> >
> > [v2: suggestions on qdev by Markus ]
> 
> One more question: should "-device sga" suppress the default VGA?  The
> other three VGA devices do; check out default_list[] in vl.c.
No, it should not. This rom is specially designed not to:
It will chain all int10h requests to whatever it was there before.

> 
> [...]
> > diff --git a/hw/pc.h b/hw/pc.h
> > index 1291e2d..a00e054 100644
> > --- a/hw/pc.h
> > +++ b/hw/pc.h
> > @@ -219,6 +219,9 @@ int isa_vga_mm_init(target_phys_addr_t vram_base,
> >  void pci_cirrus_vga_init(PCIBus *bus);
> >  void isa_cirrus_vga_init(void);
> >  
> > +/* serial graphics */
> > +void isa_sga_init(void);
> > +
> 
> No longer exists, please drop.
You're right, my bad.

Anthony, is it needed to send a v3 to take this away, or will you remove 
manually ?

[Qemu-devel] [PATCH v2] Add an isa device for SGA

2011-05-12 Thread Glauber Costa

This patch adds a dummy legacy ISA device whose responsibility is to
deploy sgabios, an option rom for a serial graphics adapter.
The proposal is that this device is always-on when -nographics,
but can otherwise be enable in any setup when -device sga is used.

[v2: suggestions on qdev by Markus ]

Signed-off-by: Glauber Costa 
---
 Makefile.target |2 +-
 hw/pc.c |4 
 hw/pc.h |3 +++
 hw/sga.c|   31 +++
 4 files changed, 39 insertions(+), 1 deletions(-)
 create mode 100644 hw/sga.c

diff --git a/Makefile.target b/Makefile.target
index fdbdc6c..004ea7e 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -224,7 +224,7 @@ obj-$(CONFIG_KVM) += ivshmem.o
 # Hardware support
 obj-i386-y += vga.o
 obj-i386-y += mc146818rtc.o i8259.o pc.o
-obj-i386-y += cirrus_vga.o apic.o ioapic.o piix_pci.o
+obj-i386-y += cirrus_vga.o sga.o apic.o ioapic.o piix_pci.o
 obj-i386-y += vmport.o
 obj-i386-y += device-hotplug.o pci-hotplug.o smbios.o wdt_ib700.o
 obj-i386-y += extboot.o
diff --git a/hw/pc.c b/hw/pc.c
index 8d351ba..78bf7de 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -1096,6 +1096,10 @@ void pc_vga_init(PCIBus *pci_bus)
 isa_vga_init();
 }
 }
+
+if (display_type == DT_NOGRAPHIC) {
+isa_create_simple("sga");
+}
 }
 
 static void cpu_request_exit(void *opaque, int irq, int level)
diff --git a/hw/pc.h b/hw/pc.h
index 1291e2d..a00e054 100644
--- a/hw/pc.h
+++ b/hw/pc.h
@@ -219,6 +219,9 @@ int isa_vga_mm_init(target_phys_addr_t vram_base,
 void pci_cirrus_vga_init(PCIBus *bus);
 void isa_cirrus_vga_init(void);
 
+/* serial graphics */
+void isa_sga_init(void);
+
 /* ne2000.c */
 static inline bool isa_ne2000_init(int base, int irq, NICInfo *nd)
 {
diff --git a/hw/sga.c b/hw/sga.c
new file mode 100644
index 000..08cef56
--- /dev/null
+++ b/hw/sga.c
@@ -0,0 +1,31 @@
+#include "pci.h"
+#include "pc.h"
+#include "loader.h"
+#include "sysemu.h"
+
+#define SGABIOS_FILENAME "sgabios.bin"
+
+typedef struct ISAGAState {
+ISADevice dev;
+} ISASGAState;
+
+
+static int isa_cirrus_vga_initfn(ISADevice *dev)
+{
+rom_add_vga(SGABIOS_FILENAME);
+return 0;
+}
+
+static ISADeviceInfo sga_info = {
+.qdev.name= "sga",
+.qdev.desc= "Serial Graphics Adapter",
+.qdev.size= sizeof(ISASGAState),
+.init = isa_cirrus_vga_initfn,
+};
+
+static void sga_register(void)
+{
+  isa_qdev_register(&sga_info);
+}
+
+device_init(sga_register);
-- 
1.7.4.2

[Qemu-devel] [PATCH] Avoid segmentation fault for qdev device not found

2011-05-12 Thread Glauber Costa

qdev_try_create will cope well with a NULL bus, since it will assume
the main system bus by default. qdev_create, however, wants to print
a message, in which it instantiates the bus name. That simple and at
first inoffensive message will generate a segmentation found if the
reason for failure is a NULL bus.

I propose we avoid that - thus generating the normal hw_error by
always passing a valid bus to qdev_try_create - if none, be it the
main system bus.

The code for testing a NULL bus is not remove from qdev_try_create
because it is a externally visible function, and we want it to continue
working fine.

Signed-off-by: Glauber Costa 
---
 hw/qdev.c |4 
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/hw/qdev.c b/hw/qdev.c
index 1aa1ea0..21ef075 100644
--- a/hw/qdev.c
+++ b/hw/qdev.c
@@ -108,6 +108,10 @@ DeviceState *qdev_create(BusState *bus, const char *name)
 {
 DeviceState *dev;
 
+if (!bus) {
+bus = sysbus_get_default();
+}
+
 dev = qdev_try_create(bus, name);
 if (!dev) {
 hw_error("Unknown device '%s' for bus '%s'\n", name, bus->info->name);
-- 
1.7.4.2

[Qemu-devel] [PATCH] Add an isa device for SGA

2011-05-11 Thread Glauber Costa

This patch adds a dummy legacy ISA device whose responsibility is to
deploy sgabios, an option rom for a serial graphics adapter.
The proposal is that this device is always-on when -nographics,
but can otherwise be enable in any setup when -device sga is used.

Signed-off-by: Glauber Costa 
---
 Makefile.target |2 +-
 hw/pc.c |2 ++
 hw/pc.h |3 +++
 hw/sga.c|   49 +
 4 files changed, 55 insertions(+), 1 deletions(-)
 create mode 100644 hw/sga.c

diff --git a/Makefile.target b/Makefile.target
index fdbdc6c..004ea7e 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -224,7 +224,7 @@ obj-$(CONFIG_KVM) += ivshmem.o
 # Hardware support
 obj-i386-y += vga.o
 obj-i386-y += mc146818rtc.o i8259.o pc.o
-obj-i386-y += cirrus_vga.o apic.o ioapic.o piix_pci.o
+obj-i386-y += cirrus_vga.o sga.o apic.o ioapic.o piix_pci.o
 obj-i386-y += vmport.o
 obj-i386-y += device-hotplug.o pci-hotplug.o smbios.o wdt_ib700.o
 obj-i386-y += extboot.o
diff --git a/hw/pc.c b/hw/pc.c
index 8d351ba..56c3887 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -1096,6 +1096,8 @@ void pc_vga_init(PCIBus *pci_bus)
 isa_vga_init();
 }
 }
+
+isa_sga_init();
 }
 
 static void cpu_request_exit(void *opaque, int irq, int level)
diff --git a/hw/pc.h b/hw/pc.h
index 1291e2d..a00e054 100644
--- a/hw/pc.h
+++ b/hw/pc.h
@@ -219,6 +219,9 @@ int isa_vga_mm_init(target_phys_addr_t vram_base,
 void pci_cirrus_vga_init(PCIBus *bus);
 void isa_cirrus_vga_init(void);
 
+/* serial graphics */
+void isa_sga_init(void);
+
 /* ne2000.c */
 static inline bool isa_ne2000_init(int base, int irq, NICInfo *nd)
 {
diff --git a/hw/sga.c b/hw/sga.c
new file mode 100644
index 000..411191b
--- /dev/null
+++ b/hw/sga.c
@@ -0,0 +1,49 @@
+#include "pci.h"
+#include "pc.h"
+#include "loader.h"
+#include "sysemu.h"
+
+#define SGABIOS_FILENAME "sgabios.bin"
+
+typedef struct ISAGAState {
+ISADevice dev;
+} ISASGAState;
+
+/* We can have both -device, and the initfn called, so better
+ * avoid to have the rom loaded twice */
+static void deploy_rom(void)
+{
+static int rom_deployed = 0;
+
+if (!rom_deployed++) {
+rom_add_vga(SGABIOS_FILENAME);
+}
+}
+
+static int isa_cirrus_vga_initfn(ISADevice *dev)
+{
+deploy_rom();
+
+return 0;
+}
+
+void isa_sga_init(void)
+{
+if (display_type == DT_NOGRAPHIC) {
+deploy_rom();
+}
+}
+
+static ISADeviceInfo sga_info = {
+.qdev.name= "sga",
+.qdev.desc= "Serial Graphics Adapter",
+.qdev.size= sizeof(ISASGAState),
+.init = isa_cirrus_vga_initfn,
+};
+
+static void sga_register(void)
+{
+  isa_qdev_register(&sga_info);
+}
+
+device_init(sga_register);
-- 
1.7.4.2

Re: [Qemu-devel] [PATCH v3 5/5] hpet 'driftfix': add code in hpet_timer() to compensate delayed callbacks and coalesced interrupts

2011-05-04 Thread Glauber Costa

On Wed, 2011-05-04 at 16:46 +0300, Gleb Natapov wrote:
> On Wed, May 04, 2011 at 10:36:12AM -0300, Glauber Costa wrote:
> > On Wed, 2011-05-04 at 06:09 -0300, Marcelo Tosatti wrote:
> > > On Wed, May 04, 2011 at 04:06:59AM -0400, Ulrich Obergfell wrote:
> > > > 
> > > > Hi Marcelo,
> > > >  
> > > > > Whats prev_period for, since in practice the period will not change
> > > > > between interrupts (OS programs comparator once, or perhaps twice
> > > > > during bootup) ?
> > > > 
> > > > 'prev_period' is needed if a guest o/s changes the comparator period
> > > > 'on the fly' (without stopping and restarting the timer).
> > > > 
> > > > 
> > > >  guest o/s changes period
> > > >|
> > > >   ti(n-1)  |ti(n)  ti(n+1)
> > > > |  v  |  |
> > > > +-+--+
> > > > 
> > > >  <--- prev_period ---> <-- period -->
> > > > 
> > > > 
> > > > The idea is that each timer interrupt represents a certain quantum
> > > > of time (the comparator period). If a guest o/s changes the period
> > > > between timer interrupt 'n-1' and timer interrupt 'n', I think the
> > > > new value should not take effect before timer interrupt 'n'. Timer
> > > > interrupt 'n' still represents the old/previous quantum, and timer
> > > > interrupt 'n+1' represents the new quantum.
> > > > 
> > > > Hence, the patch decrements 'ticks_not_accounted' by 'prev_period'
> > > > and sets 'prev_period' to 'period' when an interrupt was delivered
> > > > to the guest o/s.
> > > > 
> > > > +irq_delivered = update_irq(t, 1);
> > > > +if (irq_delivered) {
> > > > +t->ticks_not_accounted -= t->prev_period;
> > > > +t->prev_period = t->period;
> > > > +} else {
> > > > 
> > > > Most of the time 'prev_period' is equal to 'period'. It should only
> > > > be different in the scenario shown above.
> > > 
> > > OK, makes sense. You should probably reset ticks_not_accounted to zero
> > > on HPET initialization (for example, to avoid miscalibration when
> > > kexec'ing a new kernel).
> > 
> > Everybody resetting the machine in anyway is expected to force devices
> > to be reinitialized, right ?
> > I may be wrong, but I was under the impression that kexec would do this
> > as well. In this case, the reset function should be enough.
> > 
> kexec does not reset a machine. That's the whole point of kexec in
> fact.
Sure thing, but doesn't it force the initialization routine of the devices 
themselves, without
going through the bios ?

> --
>   Gleb.

Re: [Qemu-devel] [PATCH v3 5/5] hpet 'driftfix': add code in hpet_timer() to compensate delayed callbacks and coalesced interrupts

2011-05-04 Thread Glauber Costa

On Wed, 2011-05-04 at 06:09 -0300, Marcelo Tosatti wrote:
> On Wed, May 04, 2011 at 04:06:59AM -0400, Ulrich Obergfell wrote:
> > 
> > Hi Marcelo,
> >  
> > > Whats prev_period for, since in practice the period will not change
> > > between interrupts (OS programs comparator once, or perhaps twice
> > > during bootup) ?
> > 
> > 'prev_period' is needed if a guest o/s changes the comparator period
> > 'on the fly' (without stopping and restarting the timer).
> > 
> > 
> >  guest o/s changes period
> >|
> >   ti(n-1)  |ti(n)  ti(n+1)
> > |  v  |  |
> > +-+--+
> > 
> >  <--- prev_period ---> <-- period -->
> > 
> > 
> > The idea is that each timer interrupt represents a certain quantum
> > of time (the comparator period). If a guest o/s changes the period
> > between timer interrupt 'n-1' and timer interrupt 'n', I think the
> > new value should not take effect before timer interrupt 'n'. Timer
> > interrupt 'n' still represents the old/previous quantum, and timer
> > interrupt 'n+1' represents the new quantum.
> > 
> > Hence, the patch decrements 'ticks_not_accounted' by 'prev_period'
> > and sets 'prev_period' to 'period' when an interrupt was delivered
> > to the guest o/s.
> > 
> > +irq_delivered = update_irq(t, 1);
> > +if (irq_delivered) {
> > +t->ticks_not_accounted -= t->prev_period;
> > +t->prev_period = t->period;
> > +} else {
> > 
> > Most of the time 'prev_period' is equal to 'period'. It should only
> > be different in the scenario shown above.
> 
> OK, makes sense. You should probably reset ticks_not_accounted to zero
> on HPET initialization (for example, to avoid miscalibration when
> kexec'ing a new kernel).

Everybody resetting the machine in anyway is expected to force devices
to be reinitialized, right ?
I may be wrong, but I was under the impression that kexec would do this
as well. In this case, the reset function should be enough.

Re: [Qemu-devel] [PATCH v3 5/5] hpet 'driftfix': add code in hpet_timer() to compensate delayed callbacks and coalesced interrupts

2011-05-03 Thread Glauber Costa

On Tue, 2011-05-03 at 16:03 -0300, Marcelo Tosatti wrote:
> On Thu, Apr 28, 2011 at 04:25:00PM +0200, Ulrich Obergfell wrote:
> > Loss of periodic timer interrupts caused by delayed callbacks and by
> > interrupt coalescing is compensated by gradually injecting additional
> > interrupts during subsequent timer intervals, starting at a rate of
> > one additional interrupt per interval. The injection of additional
> > interrupts is based on a backlog of unaccounted HPET clock periods
> > (new HPETTimer field 'ticks_not_accounted'). The backlog increases
> > due to delayed callbacks and coalesced interrupts, and it decreases
> > if an interrupt was injected successfully. If the backlog increases
> > while compensation is still in progress, the rate at which additional
> > interrupts are injected is increased too. A limit is imposed on the
> > backlog and on the rate.
> > 
> > Injecting additional timer interrupts to compensate lost interrupts
> > can alleviate long term time drift. However, on a short time scale,
> > this method can have the side effect of making virtual machine time
> > intermittently pass slower and faster than real time (depending on
> > the guest's time keeping algorithm). Compensation is disabled by
> > default and can be enabled for guests where this behaviour may be
> > acceptable.
> > 
> > Signed-off-by: Ulrich Obergfell 
> > ---
> >  hw/hpet.c |   63 
> > +++-
> >  1 files changed, 61 insertions(+), 2 deletions(-)
> > 
> > diff --git a/hw/hpet.c b/hw/hpet.c
> > index 35466ae..92d5f58 100644
> > --- a/hw/hpet.c
> > +++ b/hw/hpet.c
> > @@ -32,6 +32,7 @@
> >  #include "sysbus.h"
> >  #include "mc146818rtc.h"
> >  #include "sysemu.h"
> > +#include 
> >  
> >  //#define HPET_DEBUG
> >  #ifdef HPET_DEBUG
> > @@ -42,6 +43,9 @@
> >  
> >  #define HPET_MSI_SUPPORT0
> >  
> > +#define MAX_TICKS_NOT_ACCOUNTED (uint64_t)5 /* 5 sec */
> > +#define MAX_IRQ_RATE(uint32_t)10
> > +
> >  struct HPETState;
> >  typedef struct HPETTimer {  /* timers */
> >  uint8_t tn; /*timer number*/
> > @@ -326,28 +330,63 @@ static const VMStateDescription vmstate_hpet = {
> >  }
> >  };
> >  
> > +static bool hpet_timer_has_tick_backlog(HPETTimer *t)
> > +{
> > +uint64_t backlog = t->ticks_not_accounted - (t->period + 
> > t->prev_period);
> > +return (backlog >= t->period);
> > +}
> > +
> >  /*
> >   * timer expiration callback
> >   */
> >  static void hpet_timer(void *opaque)
> >  {
> >  HPETTimer *t = opaque;
> > +HPETState *s = t->state;
> >  uint64_t diff;
> > -
> > +int irq_delivered = 0;
> > +uint32_t irq_count = 0;
> >  uint64_t period = t->period;
> >  uint64_t cur_tick = hpet_get_ticks(t->state);
> >  
> > +if (s->driftfix && !t->ticks_not_accounted) {
> > +t->ticks_not_accounted = t->prev_period = t->period;
> > +}
> >  if (timer_is_periodic(t) && period != 0) {
> >  if (t->config & HPET_TN_32BIT) {
> >  while (hpet_time_after(cur_tick, t->cmp)) {
> >  t->cmp = (uint32_t)(t->cmp + t->period);
> > +t->ticks_not_accounted += t->period;
> > +irq_count++;
> >  }
> >  } else {
> >  while (hpet_time_after64(cur_tick, t->cmp)) {
> >  t->cmp += period;
> > +t->ticks_not_accounted += period;
> > +irq_count++;
> >  }
> >  }
> >  diff = hpet_calculate_diff(t, cur_tick);
> > +if (s->driftfix) {
> > +if (t->ticks_not_accounted > MAX_TICKS_NOT_ACCOUNTED) {
> > +t->ticks_not_accounted = t->period + t->prev_period;
> > +}
> > +if (hpet_timer_has_tick_backlog(t)) {
> > +if (t->irq_rate == 1 || irq_count > 1) {
> > +t->irq_rate++;
> > +t->irq_rate = MIN(t->irq_rate, MAX_IRQ_RATE);
> > +}
> > +if (t->divisor == 0) {
> > +assert(irq_count);
> > +}
> > +if (irq_count) {
> > +t->divisor = t->irq_rate;
> > +}
> > +diff /= t->divisor--;
> > +} else {
> > +t->irq_rate = 1;
> > +}
> > +}
> >  qemu_mod_timer(t->qemu_timer,
> > qemu_get_clock_ns(vm_clock) + 
> > (int64_t)ticks_to_ns(diff));
> >  } else if (t->config & HPET_TN_32BIT && !timer_is_periodic(t)) {
> > @@ -358,7 +397,22 @@ static void hpet_timer(void *opaque)
> >  t->wrap_flag = 0;
> >  }
> >  }
> > -update_irq(t, 1);
> > +if (s->driftfix && timer_is_periodic(t) && period != 0) {
> > +if (t->ticks_not_accounted >= t->period + t->prev_period) {
> > +irq_delivered = update_irq(t, 1);
> > +if (irq_delivered) {
> > +t->ticks_not_

Re: [Qemu-devel] [PATCH v2 2a/6] x86: Allow multiple cpu feature matches of lookup_feature

2011-04-27 Thread Glauber Costa

On Tue, 2011-04-19 at 13:06 +0200, Jan Kiszka wrote:
> kvmclock is represented by two feature bits. Therefore, lookup_feature
> needs to continue its search even after the first match. Enhance it
> accordingly and switch to a bool return type at this chance.
> 
> Signed-off-by: Jan Kiszka 
> ---
>  target-i386/cpuid.c |   14 --
>  1 files changed, 8 insertions(+), 6 deletions(-)
> 
> Glauber, could you check/ack this? Marcelo, please respin the series
> afterward. I'd like to see all bits upstream and merged back into
> qemu-kvm to proceed with switching the latter to upstream's kvm
> infrastructure.

Yes, this patch is okay.

Actually, I did sent out something like this, maybe Marcelo applied only
part of the series?

Anyway, Jan's version is handy, please apply it.

> diff --git a/target-i386/cpuid.c b/target-i386/cpuid.c
> index 814d13e..0ac592f 100644
> --- a/target-i386/cpuid.c
> +++ b/target-i386/cpuid.c
> @@ -182,20 +182,22 @@ static int altcmp(const char *s, const char *e, const 
> char *altstr)
>  }
>  
>  /* search featureset for flag *[s..e), if found set corresponding bit in
> - * *pval and return success, otherwise return zero
> + * *pval and return true, otherwise return false
>   */
> -static int lookup_feature(uint32_t *pval, const char *s, const char *e,
> -const char **featureset)
> +static bool lookup_feature(uint32_t *pval, const char *s, const char *e,
> +   const char **featureset)
>  {
>  uint32_t mask;
>  const char **ppc;
> +bool found = false;
>  
> -for (mask = 1, ppc = featureset; mask; mask <<= 1, ++ppc)
> +for (mask = 1, ppc = featureset; mask; mask <<= 1, ++ppc) {
>  if (*ppc && !altcmp(s, e, *ppc)) {
>  *pval |= mask;
> -break;
> +found = true;
>  }
> -return (mask ? 1 : 0);
> +}
> +return found;
>  }
>  
>  static void add_flagname_to_bitmaps(const char *flagname, uint32_t *features,

Re: [Qemu-devel] [PATCH 2/2] add fw_dir option to option-rom switch

2011-04-12 Thread Glauber Costa

On Tue, 2011-04-12 at 14:19 -0500, Anthony Liguori wrote:
> On 04/12/2011 01:47 PM, Glauber Costa wrote:
> > For sure, but if we had this discussion a while ago,
> > sgabios wouldn't exist back then, and now it does =p
> 
> Actually, it's been around for ages :-)
> 
> >>>And the fact that not all of them should
> >>> live in genroms persists.
> >> Actually genroms should disappear.  We should make -option-rom work by
> >> using a dummy PCI device or something like that.
> > In a side track, I think sgabios is a pretty useful addition to qemu.
> > Would you consider including it together with the other roms ?
> 
> Definitely.
> 
> > This would definitely increase the out-of-the box experience.
> >
> > I think we don't even need to have a -vga sga switch then. Since sgabios
> > does not prevent output from going to the normal vga, we could always
> > install it, or install when there is a -serial parameter present.
> 
> Is that true?  It needs to load during video bios init, no?  So doesn't 
> that preclude having a normal video bios installed?

No.

sgabios chains all int10h calls.

Re: [Qemu-devel] [PATCH 2/2] add fw_dir option to option-rom switch

2011-04-12 Thread Glauber Costa

On Tue, 2011-04-12 at 13:31 -0500, Anthony Liguori wrote:
> On 04/12/2011 01:13 PM, Glauber Costa wrote:
> > On Tue, 2011-04-12 at 12:40 -0500, Anthony Liguori wrote:
> >> On 04/12/2011 12:23 PM, Glauber Costa wrote:
> >>> The option-rom puts all roms passed by this switch in the genroms 
> >>> directory,
> >>> through the fw_dir option. But as it turns out, not all roms should be 
> >>> placed there.
> >>>
> >>> VGA roms are a of a different kind. They live in a different segment, and 
> >>> are scanned
> >>> first. This patch allows qemu to use external vgaroms such as sgabios, a 
> >>> rom for a serial
> >>> graphics adapter.
> >> Wouldn't it make more sense to have a new PCI device that had sgabios
> >> that could play the role of the VGA device.
> >>
> >> Then you could say -vga sga
> >>
> >> Regards,
> > This would be good, if we included sga along the roms we ship. We're
> > still, however, left with the problem that in the future, people may
> > want to use their own roms.
> 
> Fortunately, I don't think that there are myriads of people writing 
> 16-bit option roms so I don't think this is a pressing problem :-)
For sure, but if we had this discussion a while ago, 
sgabios wouldn't exist back then, and now it does =p

> >   And the fact that not all of them should
> > live in genroms persists.
> 
> Actually genroms should disappear.  We should make -option-rom work by 
> using a dummy PCI device or something like that.

In a side track, I think sgabios is a pretty useful addition to qemu.
Would you consider including it together with the other roms ?
This would definitely increase the out-of-the box experience.

I think we don't even need to have a -vga sga switch then. Since sgabios
does not prevent output from going to the normal vga, we could always
install it, or install when there is a -serial parameter present.

Re: [Qemu-devel] [PATCH 2/2] add fw_dir option to option-rom switch

2011-04-12 Thread Glauber Costa

On Tue, 2011-04-12 at 12:40 -0500, Anthony Liguori wrote:
> On 04/12/2011 12:23 PM, Glauber Costa wrote:
> > The option-rom puts all roms passed by this switch in the genroms directory,
> > through the fw_dir option. But as it turns out, not all roms should be 
> > placed there.
> >
> > VGA roms are a of a different kind. They live in a different segment, and 
> > are scanned
> > first. This patch allows qemu to use external vgaroms such as sgabios, a 
> > rom for a serial
> > graphics adapter.
> 
> Wouldn't it make more sense to have a new PCI device that had sgabios 
> that could play the role of the VGA device.
> 
> Then you could say -vga sga
> 
> Regards,

This would be good, if we included sga along the roms we ship. We're
still, however, left with the problem that in the future, people may
want to use their own roms. And the fact that not all of them should
live in genroms persists.

> Anthony Liguori
> 
> > Without this patch, we lose all the initial BIOS output: until the option 
> > rom is initialized,
> > the only available vga rom is the default cirrus/vesa one.
> >
> > We're also vulnerable to option rom enumeration order: if a vga oprom is 
> > initialized
> > before, say, gpxe, we are able to see its output in the adapter. If it is 
> > initialized
> > after, we miss it.
> >
> > So I believe the proper solution is to allow users to specify that a rom 
> > belongs in vgaroms
> > directory instead.
> >
> > Signed-off-by: Glauber Costa
> > CC: Gleb Natapov
> > ---
> >   hw/pc.c |7 ++-
> >   qemu-config.c   |3 +++
> >   qemu-options.hx |7 +--
> >   sysemu.h|1 +
> >   vl.c|1 +
> >   5 files changed, 16 insertions(+), 3 deletions(-)
> >
> > diff --git a/hw/pc.c b/hw/pc.c
> > index 8d351ba..736efbb 100644
> > --- a/hw/pc.c
> > +++ b/hw/pc.c
> > @@ -1055,7 +1055,12 @@ void pc_memory_init(ram_addr_t ram_size,
> >   }
> >
> >   for (i = 0; i<  nb_option_roms; i++) {
> > -rom_add_option(option_rom[i].name, option_rom[i].bootindex);
> > +/* do it here, instead of in vl.c, to avoid cluttering that file 
> > with x86 material */
> > +if (!option_rom[i].fw_dir) {
> > +option_rom[i].fw_dir = "genroms";
> > +}
> > +rom_add_file(option_rom[i].name, option_rom[i].fw_dir, 0,
> > + option_rom[i].bootindex);
> >   }
> >   }
> >
> > diff --git a/qemu-config.c b/qemu-config.c
> > index 6d9c238..97b8515 100644
> > --- a/qemu-config.c
> > +++ b/qemu-config.c
> > @@ -450,6 +450,9 @@ QemuOptsList qemu_option_rom_opts = {
> >   }, {
> >   .name = "romfile",
> >   .type = QEMU_OPT_STRING,
> > +}, {
> > +.name = "fw_dir",
> > +.type = QEMU_OPT_STRING,
> >   },
> >   { /* end if list */ }
> >   },
> > diff --git a/qemu-options.hx b/qemu-options.hx
> > index 96927cc..d9eec62 100644
> > --- a/qemu-options.hx
> > +++ b/qemu-options.hx
> > @@ -2079,10 +2079,10 @@ to cope with initialization race conditions.
> >   ETEXI
> >
> >   DEF("option-rom", HAS_ARG, QEMU_OPTION_option_rom, \
> > -"-option-rom rom[,bootindex=idx] load a file, rom, into the option ROM 
> > space\n",
> > +"-option-rom rom[,bootindex=idx][,fw_dir=dir] load a file, rom, into 
> > the option ROM space\n",
> >   QEMU_ARCH_ALL)
> >   STEXI
> > -@item -option-rom @var{file}[,bootindex=@var{bootindex}]
> > +@item -option-rom @var{file}[,bootindex=@var{bootindex}][,fw_dir=@var{dir}]
> >   @findex -option-rom
> >   Load the contents of @var{file} as an option ROM.
> >   This option is useful to load things like EtherBoot.
> > @@ -2091,6 +2091,9 @@ This option is useful to load things like EtherBoot.
> >   @item bootindex=@var{bootindex}
> >   Defines which boot index the option rom will be given in boot entry 
> > vectors,
> >   allowing fine-grained selection of devices boot order.
> > +@item fw_dir=@var{dir} (x86 only)
> > +Specify under which firmware directory entry this rom should live. Current
> > +allowed values are vgaroms and genroms (default).
> >   @end table
> >
> >   ETEXI
> > diff --git a/sysemu.h b/sysemu.h
> > index 3f7e17e..2f8be32 100644
> > --- a/sysemu.h
> > +++ b/sysemu.h
> > @@ -160,6 +160,7 @@ extern uint64

[Qemu-devel] [PATCH 2/2] add fw_dir option to option-rom switch

2011-04-12 Thread Glauber Costa

The option-rom puts all roms passed by this switch in the genroms directory,
through the fw_dir option. But as it turns out, not all roms should be placed 
there.

VGA roms are a of a different kind. They live in a different segment, and are 
scanned
first. This patch allows qemu to use external vgaroms such as sgabios, a rom 
for a serial
graphics adapter.

Without this patch, we lose all the initial BIOS output: until the option rom 
is initialized,
the only available vga rom is the default cirrus/vesa one.

We're also vulnerable to option rom enumeration order: if a vga oprom is 
initialized
before, say, gpxe, we are able to see its output in the adapter. If it is 
initialized
after, we miss it.

So I believe the proper solution is to allow users to specify that a rom 
belongs in vgaroms
directory instead.

Signed-off-by: Glauber Costa 
CC: Gleb Natapov 
---
 hw/pc.c |7 ++-
 qemu-config.c   |3 +++
 qemu-options.hx |7 +--
 sysemu.h|1 +
 vl.c|1 +
 5 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/hw/pc.c b/hw/pc.c
index 8d351ba..736efbb 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -1055,7 +1055,12 @@ void pc_memory_init(ram_addr_t ram_size,
 }
 
 for (i = 0; i < nb_option_roms; i++) {
-rom_add_option(option_rom[i].name, option_rom[i].bootindex);
+/* do it here, instead of in vl.c, to avoid cluttering that file with 
x86 material */
+if (!option_rom[i].fw_dir) {
+option_rom[i].fw_dir = "genroms";
+}
+rom_add_file(option_rom[i].name, option_rom[i].fw_dir, 0,
+ option_rom[i].bootindex);
 }
 }
 
diff --git a/qemu-config.c b/qemu-config.c
index 6d9c238..97b8515 100644
--- a/qemu-config.c
+++ b/qemu-config.c
@@ -450,6 +450,9 @@ QemuOptsList qemu_option_rom_opts = {
 }, {
 .name = "romfile",
 .type = QEMU_OPT_STRING,
+}, {
+.name = "fw_dir",
+.type = QEMU_OPT_STRING,
 },
 { /* end if list */ }
 },
diff --git a/qemu-options.hx b/qemu-options.hx
index 96927cc..d9eec62 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -2079,10 +2079,10 @@ to cope with initialization race conditions.
 ETEXI
 
 DEF("option-rom", HAS_ARG, QEMU_OPTION_option_rom, \
-"-option-rom rom[,bootindex=idx] load a file, rom, into the option ROM 
space\n",
+"-option-rom rom[,bootindex=idx][,fw_dir=dir] load a file, rom, into the 
option ROM space\n",
 QEMU_ARCH_ALL)
 STEXI
-@item -option-rom @var{file}[,bootindex=@var{bootindex}]
+@item -option-rom @var{file}[,bootindex=@var{bootindex}][,fw_dir=@var{dir}]
 @findex -option-rom
 Load the contents of @var{file} as an option ROM.
 This option is useful to load things like EtherBoot.
@@ -2091,6 +2091,9 @@ This option is useful to load things like EtherBoot.
 @item bootindex=@var{bootindex}
 Defines which boot index the option rom will be given in boot entry vectors,
 allowing fine-grained selection of devices boot order.
+@item fw_dir=@var{dir} (x86 only)
+Specify under which firmware directory entry this rom should live. Current
+allowed values are vgaroms and genroms (default).
 @end table
 
 ETEXI
diff --git a/sysemu.h b/sysemu.h
index 3f7e17e..2f8be32 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -160,6 +160,7 @@ extern uint64_t node_cpumask[MAX_NODES];
 typedef struct QEMUOptionRom {
 const char *name;
 int32_t bootindex;
+const char *fw_dir;
 } QEMUOptionRom;
 extern QEMUOptionRom option_rom[MAX_OPTION_ROMS];
 extern int nb_option_roms;
diff --git a/vl.c b/vl.c
index 8bcf2ae..e1d7868 100644
--- a/vl.c
+++ b/vl.c
@@ -2675,6 +2675,7 @@ int main(int argc, char **argv, char **envp)
 fprintf(stderr, "Option ROM file is not specified\n");
 exit(1);
 }
+option_rom[nb_option_roms].fw_dir = qemu_opt_get(opts, 
"fw_dir");
nb_option_roms++;
break;
 case QEMU_OPTION_semihosting:
-- 
1.7.4

[Qemu-devel] [PATCH 1/2] document bootindex option

2011-04-12 Thread Glauber Costa

bootindex option was added to -option-rom switch, but never documented.

Signed-off-by: Glauber Costa 
---
 qemu-options.hx |   11 +--
 1 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/qemu-options.hx b/qemu-options.hx
index 18f54d2..96927cc 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -2079,13 +2079,20 @@ to cope with initialization race conditions.
 ETEXI
 
 DEF("option-rom", HAS_ARG, QEMU_OPTION_option_rom, \
-"-option-rom rom load a file, rom, into the option ROM space\n",
+"-option-rom rom[,bootindex=idx] load a file, rom, into the option ROM 
space\n",
 QEMU_ARCH_ALL)
 STEXI
-@item -option-rom @var{file}
+@item -option-rom @var{file}[,bootindex=@var{bootindex}]
 @findex -option-rom
 Load the contents of @var{file} as an option ROM.
 This option is useful to load things like EtherBoot.
+
+@table @option
+@item bootindex=@var{bootindex}
+Defines which boot index the option rom will be given in boot entry vectors,
+allowing fine-grained selection of devices boot order.
+@end table
+
 ETEXI
 
 DEF("clock", HAS_ARG, QEMU_OPTION_clock, \
-- 
1.7.4

[Qemu-devel] [PATCH 0/2] allow fw_dir to be specified in the option-rom switch

2011-04-12 Thread Glauber Costa

Some roms should not live in genroms/, the default place for all roms passed
through -option-rom switch.
Rather, they'd like to be placed in vgaroms. This patch allows it to happen.

Glauber Costa (2):
  document bootindex option
  add fw_dir option to option-rom switch

 hw/pc.c |7 ++-
 qemu-config.c   |3 +++
 qemu-options.hx |   14 --
 sysemu.h|1 +
 vl.c|1 +
 5 files changed, 23 insertions(+), 3 deletions(-)

-- 
1.7.4

Re: [Qemu-devel] [PATCH v2 3/5] hpet 'driftfix': add fields to HPETTimer and VMStateDescription

2011-04-11 Thread Glauber Costa

On Mon, 2011-04-11 at 08:47 -0500, Anthony Liguori wrote:
> On 04/11/2011 08:39 AM, Glauber Costa wrote:
> > On Mon, 2011-04-11 at 08:10 -0500, Anthony Liguori wrote:
> >> On 04/11/2011 04:08 AM, Avi Kivity wrote:
> >>> On 04/11/2011 12:06 PM, Ulrich Obergfell wrote:
> >>>>>>   vmstate_hpet_timer = {
> >>>>>> VMSTATE_UINT64(fsb, HPETTimer),
> >>>>>> VMSTATE_UINT64(period, HPETTimer),
> >>>>>> VMSTATE_UINT8(wrap_flag, HPETTimer),
> >>>>>>   + VMSTATE_UINT64_V(saved_period, HPETTimer, 3),
> >>>>>>   + VMSTATE_UINT64_V(ticks_not_accounted, HPETTimer, 3),
> >>>>>>   + VMSTATE_UINT32_V(irqs_to_inject, HPETTimer, 3),
> >>>>>>   + VMSTATE_UINT32_V(irq_rate, HPETTimer, 3),
> >>>>>>   + VMSTATE_UINT32_V(divisor, HPETTimer, 3),
> >>>>>   We ought to be able to use a subsection keyed off of whether any
> >>>> ticks
> >>>>>   are currently accumulated, no?
> >>>>
> >>>> Anthony,
> >>>>
> >>>> I'm not sure if I understand your question correctly. Are you suggesting
> >>>> to migrate the driftfix-related state conditionally / only if there are
> >>>> any ticks accumulated in 'ticks_not_accounted' and 'irqs_to_inject' ?
> >>>>
> >>>> The size of the driftfix-related state is 28 bytes per timer and we have
> >>>> 32 timers per HPETState, i.e. 896 additional bytes per HPETState. With a
> >>>> maximum number of 8 HPET blocks (HPETState), this amounts to 7168 bytes.
> >>>> Hence, unconditional migration of the driftfix-related state should not
> >>>> cause significant additional overhead.
> >>>>
> >>> It's not about overhead.
> >>>
> >>>> Maybe I missed something. Could you please explain which benefit you see
> >>>> in using a subsection ?
> >>> In the common case of there being no drift, you can migrate from a
> >>> qemu that supports driftfix to a qemu that doesn't.
> >>>
> >> Right, subsections are a trick.  The idea is that when you introduce new
> >> state for a device model that is not always going to be set, when you do
> >> the migration, you detect whether the state is set or not and if it's
> >> not set, instead of sending empty versions of that state (i.e.
> >> missed_ticks=0) you just don't send the new state at all.
> >>
> >> This means that you can migrate to an older version of QEMU provided the
> >> migration would work correctly.
> > Using subsections and testing for hpet option being disabled vs enabled,
> > is fine. But checking for the existence of drift, like you suggested (or
> > at least how I understood you), is very tricky. It is expected to change
> > many times during guest lifetime, and would make our migration
> > predictability something Heisenberg would be proud of.
> 
> Is this true?  I would expect it to be very tied to workloads.  For idle 
> workloads, you should never have accumulated missed ticks whereas with 
> heavy workloads, you always will have accumulated ticks.
> 
> Is that not correct?
Yes, it is , but we lose a lot of reliability by tying migration to the 
workload. Given that
we still have to start qemu the same way both sides, we end up with a
situation in which at time t, migration is possible, and at time t+1
migration is not.

I'd rather have subsections enabled at all times when the option to
allow driftfix is enabled.

Re: [Qemu-devel] [PATCH v2 3/5] hpet 'driftfix': add fields to HPETTimer and VMStateDescription

2011-04-11 Thread Glauber Costa

On Mon, 2011-04-11 at 08:10 -0500, Anthony Liguori wrote:
> On 04/11/2011 04:08 AM, Avi Kivity wrote:
> > On 04/11/2011 12:06 PM, Ulrich Obergfell wrote:
> >> >>  vmstate_hpet_timer = {
> >> >>VMSTATE_UINT64(fsb, HPETTimer),
> >> >>VMSTATE_UINT64(period, HPETTimer),
> >> >>VMSTATE_UINT8(wrap_flag, HPETTimer),
> >> >>  + VMSTATE_UINT64_V(saved_period, HPETTimer, 3),
> >> >>  + VMSTATE_UINT64_V(ticks_not_accounted, HPETTimer, 3),
> >> >>  + VMSTATE_UINT32_V(irqs_to_inject, HPETTimer, 3),
> >> >>  + VMSTATE_UINT32_V(irq_rate, HPETTimer, 3),
> >> >>  + VMSTATE_UINT32_V(divisor, HPETTimer, 3),
> >> >
> >> >  We ought to be able to use a subsection keyed off of whether any 
> >> ticks
> >> >  are currently accumulated, no?
> >>
> >>
> >> Anthony,
> >>
> >> I'm not sure if I understand your question correctly. Are you suggesting
> >> to migrate the driftfix-related state conditionally / only if there are
> >> any ticks accumulated in 'ticks_not_accounted' and 'irqs_to_inject' ?
> >>
> >> The size of the driftfix-related state is 28 bytes per timer and we have
> >> 32 timers per HPETState, i.e. 896 additional bytes per HPETState. With a
> >> maximum number of 8 HPET blocks (HPETState), this amounts to 7168 bytes.
> >> Hence, unconditional migration of the driftfix-related state should not
> >> cause significant additional overhead.
> >>
> >
> > It's not about overhead.
> >
> >> Maybe I missed something. Could you please explain which benefit you see
> >> in using a subsection ?
> >
> > In the common case of there being no drift, you can migrate from a 
> > qemu that supports driftfix to a qemu that doesn't.
> >
> 
> Right, subsections are a trick.  The idea is that when you introduce new 
> state for a device model that is not always going to be set, when you do 
> the migration, you detect whether the state is set or not and if it's 
> not set, instead of sending empty versions of that state (i.e. 
> missed_ticks=0) you just don't send the new state at all.
> 
> This means that you can migrate to an older version of QEMU provided the 
> migration would work correctly.

Using subsections and testing for hpet option being disabled vs enabled,
is fine. But checking for the existence of drift, like you suggested (or
at least how I understood you), is very tricky. It is expected to change
many times during guest lifetime, and would make our migration
predictability something Heisenberg would be proud of.

[Qemu-devel] Re: [PATCH v2 5/5] hpet 'driftfix': add code in hpet_timer() to compensate delayed callbacks and coalesced interrupts

2011-04-08 Thread Glauber Costa

On Fri, 2011-04-08 at 17:54 +0200, Jan Kiszka wrote:
> > +}
> 
> Did I miss some change in the plan? I thought we were heading for a
> generic, reusable driftfix tool box (or periodic timer service)? Or is
> this intentionally an intermediate step? 

Which is a medium to long way in the future. There is benefit of having
this early, since it can deliver real functionality - a reliable hpet,
and converting to whatever the api may look like in the future.

[Qemu-devel] Re: [PATCH 3/3] don't create kvmclock when one of the flags are present.

2011-03-18 Thread Glauber Costa

On Fri, 2011-03-18 at 11:24 +0100, Jan Kiszka wrote:
> On 2011-03-17 23:42, Glauber Costa wrote:
> > kvmclock presence can be signalled by two different flags. So for
> > device creation, we have to test for both.

> Patch is OK, but the subject's logic is inverted.
Indeed, should have said something like "to test for either of them"

Dear maintainers, is it okay to commit with a minor edit to the
changelogs?

> Jan
> 
> > 
> > Signed-off-by: Glauber Costa 
> > ---
> >  hw/kvmclock.c |6 +-
> >  1 files changed, 5 insertions(+), 1 deletions(-)
> > 
> > diff --git a/hw/kvmclock.c b/hw/kvmclock.c
> > index b6ceddf..004c4ad 100644
> > --- a/hw/kvmclock.c
> > +++ b/hw/kvmclock.c
> > @@ -103,7 +103,11 @@ static SysBusDeviceInfo kvmclock_info = {
> >  void kvmclock_create(void)
> >  {
> >  if (kvm_enabled() &&
> > -first_cpu->cpuid_kvm_features & (1ULL << KVM_FEATURE_CLOCKSOURCE)) 
> > {
> > +first_cpu->cpuid_kvm_features & ((1ULL << KVM_FEATURE_CLOCKSOURCE)
> > +#ifdef KVM_FEATURE_CLOCKSOURCE2
> > +|| (1ULL << KVM_FEATURE_CLOCKSOURCE2)
> > +#endif
> > +)) {
> >  sysbus_create_simple("kvmclock", -1, NULL);
> >  }
> >  }
>

[Qemu-devel] [PATCH 2/3] add kvmclock to its second bit

2011-03-17 Thread Glauber Costa

We have two bits that can represent kvmclock in cpuid.
They signal the guest which msr set to use. When we tweak flags
involving this value - specially when we use "-", we have to act on both.

Besides adding it to the kvm features list, we also have to "break" the
assumption represented by the break in lookup_feature.

Signed-off-by: Glauber Costa 
---
 target-i386/cpuid.c |3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/target-i386/cpuid.c b/target-i386/cpuid.c
index d28de20..48f9bbd 100644
--- a/target-i386/cpuid.c
+++ b/target-i386/cpuid.c
@@ -73,7 +73,7 @@ static const char *ext3_feature_name[] = {
 };
 
 static const char *kvm_feature_name[] = {
-"kvmclock", "kvm_nopiodelay", "kvm_mmu", NULL, "kvm_asyncpf", NULL, NULL, 
NULL,
+"kvmclock", "kvm_nopiodelay", "kvm_mmu", "kvmclock", "kvm_asyncpf", NULL, 
NULL, NULL,
 NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL,
 NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL,
 NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL,
@@ -193,7 +193,6 @@ static int lookup_feature(uint32_t *pval, const char *s, 
const char *e,
 for (mask = 1, ppc = featureset; mask; mask <<= 1, ++ppc)
 if (*ppc && !altcmp(s, e, *ppc)) {
 *pval |= mask;
-break;
 }
 return (mask ? 1 : 0);
 }
-- 
1.7.2.3

[Qemu-devel] [PATCH 3/3] don't create kvmclock when one of the flags are present.

2011-03-17 Thread Glauber Costa

kvmclock presence can be signalled by two different flags. So for
device creation, we have to test for both.

Signed-off-by: Glauber Costa 
---
 hw/kvmclock.c |6 +-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/hw/kvmclock.c b/hw/kvmclock.c
index b6ceddf..004c4ad 100644
--- a/hw/kvmclock.c
+++ b/hw/kvmclock.c
@@ -103,7 +103,11 @@ static SysBusDeviceInfo kvmclock_info = {
 void kvmclock_create(void)
 {
 if (kvm_enabled() &&
-first_cpu->cpuid_kvm_features & (1ULL << KVM_FEATURE_CLOCKSOURCE)) {
+first_cpu->cpuid_kvm_features & ((1ULL << KVM_FEATURE_CLOCKSOURCE)
+#ifdef KVM_FEATURE_CLOCKSOURCE2
+|| (1ULL << KVM_FEATURE_CLOCKSOURCE2)
+#endif
+)) {
 sysbus_create_simple("kvmclock", -1, NULL);
 }
 }
-- 
1.7.2.3

[Qemu-devel] [PATCH 1/3] use kernel-provided para_features instead of statically coming up with new capabilities

2011-03-17 Thread Glauber Costa

According to Avi's comments over my last submission, I decided to take a
different, and more correct direction - we hope.

This patch is now using the features provided by KVM_GET_SUPPORTED_CPUID 
directly to
mask out features from guest-visible cpuid.

The old get_para_features() mechanism is kept for older kernels that do not 
implement it.

Signed-off-by: Glauber Costa 
---
 target-i386/kvm.c |   76 +++-
 1 files changed, 45 insertions(+), 31 deletions(-)

diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index da757fa..dc1e547 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -95,6 +95,35 @@ static struct kvm_cpuid2 *try_get_cpuid(KVMState *s, int max)
 return cpuid;
 }
 
+#ifdef CONFIG_KVM_PARA
+struct kvm_para_features {
+int cap;
+int feature;
+} para_features[] = {
+{ KVM_CAP_CLOCKSOURCE, KVM_FEATURE_CLOCKSOURCE },
+{ KVM_CAP_NOP_IO_DELAY, KVM_FEATURE_NOP_IO_DELAY },
+{ KVM_CAP_PV_MMU, KVM_FEATURE_MMU_OP },
+#ifdef KVM_CAP_ASYNC_PF
+{ KVM_CAP_ASYNC_PF, KVM_FEATURE_ASYNC_PF },
+#endif
+{ -1, -1 }
+};
+
+static int get_para_features(CPUState *env)
+{
+int i, features = 0;
+
+for (i = 0; i < ARRAY_SIZE(para_features) - 1; i++) {
+if (kvm_check_extension(env->kvm_state, para_features[i].cap)) {
+features |= (1 << para_features[i].feature);
+}
+}
+
+return features;
+}
+#endif
+
+
 uint32_t kvm_arch_get_supported_cpuid(CPUState *env, uint32_t function,
   uint32_t index, int reg)
 {
@@ -102,6 +131,7 @@ uint32_t kvm_arch_get_supported_cpuid(CPUState *env, 
uint32_t function,
 int i, max;
 uint32_t ret = 0;
 uint32_t cpuid_1_edx;
+int has_kvm_features = 0;
 
 max = 1;
 while ((cpuid = try_get_cpuid(env->kvm_state, max)) == NULL) {
@@ -111,6 +141,9 @@ uint32_t kvm_arch_get_supported_cpuid(CPUState *env, 
uint32_t function,
 for (i = 0; i < cpuid->nent; ++i) {
 if (cpuid->entries[i].function == function &&
 cpuid->entries[i].index == index) {
+if (cpuid->entries[i].function == KVM_CPUID_FEATURES) {
+has_kvm_features = 1;
+}
 switch (reg) {
 case R_EAX:
 ret = cpuid->entries[i].eax;
@@ -141,41 +174,16 @@ uint32_t kvm_arch_get_supported_cpuid(CPUState *env, 
uint32_t function,
 }
 }
 
+/* fallback for older kernels */
+if (!has_kvm_features && (function == KVM_CPUID_FEATURES)) {
+ret = get_para_features(env);
+}
+
 qemu_free(cpuid);
 
 return ret;
 }
 
-#ifdef CONFIG_KVM_PARA
-struct kvm_para_features {
-int cap;
-int feature;
-} para_features[] = {
-{ KVM_CAP_CLOCKSOURCE, KVM_FEATURE_CLOCKSOURCE },
-{ KVM_CAP_NOP_IO_DELAY, KVM_FEATURE_NOP_IO_DELAY },
-{ KVM_CAP_PV_MMU, KVM_FEATURE_MMU_OP },
-#ifdef KVM_CAP_ASYNC_PF
-{ KVM_CAP_ASYNC_PF, KVM_FEATURE_ASYNC_PF },
-#endif
-{ -1, -1 }
-};
-
-static int get_para_features(CPUState *env)
-{
-int i, features = 0;
-
-for (i = 0; i < ARRAY_SIZE(para_features) - 1; i++) {
-if (kvm_check_extension(env->kvm_state, para_features[i].cap)) {
-features |= (1 << para_features[i].feature);
-}
-}
-#ifdef KVM_CAP_ASYNC_PF
-has_msr_async_pf_en = features & (1 << KVM_FEATURE_ASYNC_PF);
-#endif
-return features;
-}
-#endif
-
 #ifdef KVM_CAP_MCE
 static int kvm_get_mce_cap_supported(KVMState *s, uint64_t *mce_cap,
  int *max_banks)
@@ -363,7 +371,13 @@ int kvm_arch_init_vcpu(CPUState *env)
 c = &cpuid_data.entries[cpuid_i++];
 memset(c, 0, sizeof(*c));
 c->function = KVM_CPUID_FEATURES;
-c->eax = env->cpuid_kvm_features & get_para_features(env);
+c->eax = env->cpuid_kvm_features & kvm_arch_get_supported_cpuid(env,
+KVM_CPUID_FEATURES, 0, R_EAX);
+
+#ifdef KVM_CAP_ASYNC_PF
+has_msr_async_pf_en = c->eax & (1 << KVM_FEATURE_ASYNC_PF);
+#endif
+
 #endif
 
 cpu_x86_cpuid(env, 0, 0, &limit, &unused, &unused, &unused);
-- 
1.7.2.3

[Qemu-devel] [PATCH 0/3] enable newer msr set for kvm

2011-03-17 Thread Glauber Costa

This patch is a follow up to an earlier one that aims to enable
kvmclock newer msr set. This time I'm doing it through a more sane
mechanism of consulting the kernel about the supported msr set.

Glauber Costa (3):
  use kernel-provided para_features instead of statically coming up
with new capabilities
  add kvmclock to its second bit
  don't create kvmclock when one of the flags are present.

 hw/kvmclock.c   |6 +++-
 target-i386/cpuid.c |3 +-
 target-i386/kvm.c   |   76 ++-
 3 files changed, 51 insertions(+), 34 deletions(-)

-- 
1.7.2.3

[Qemu-devel] Re: [PATCH 15/15] kvm: x86: Introduce kvmclock device to save/restore its state

2011-02-07 Thread Glauber Costa

On Mon, 2011-02-07 at 12:19 +0100, Jan Kiszka wrote:
> If kvmclock is used, which implies the kernel supports it, register a
> kvmclock device with the sysbus. Its main purpose is to save and restore
> the kernel state on migration, but this will also allow to visualize it
> one day.
> 
> Signed-off-by: Jan Kiszka 
> CC: Glauber Costa 
> ---
>  Makefile.target |4 +-
>  hw/kvmclock.c   |  125 
> +++
>  hw/kvmclock.h   |   14 ++
>  hw/pc_piix.c|   31 +++---
>  4 files changed, 165 insertions(+), 9 deletions(-)
>  create mode 100644 hw/kvmclock.c
>  create mode 100644 hw/kvmclock.h
> 
> diff --git a/Makefile.target b/Makefile.target
> index b0ba95f..30232fa 100644
> --- a/Makefile.target
> +++ b/Makefile.target
> @@ -37,7 +37,7 @@ ifndef CONFIG_HAIKU
>  LIBS+=-lm
>  endif
>  
> -kvm.o kvm-all.o vhost.o vhost_net.o: QEMU_CFLAGS+=$(KVM_CFLAGS)
> +kvm.o kvm-all.o vhost.o vhost_net.o kvmclock.o: QEMU_CFLAGS+=$(KVM_CFLAGS)
>  
>  config-target.h: config-target.h-timestamp
>  config-target.h-timestamp: config-target.mak
> @@ -218,7 +218,7 @@ obj-i386-y += cirrus_vga.o apic.o ioapic.o piix_pci.o
>  obj-i386-y += vmmouse.o vmport.o hpet.o applesmc.o
>  obj-i386-y += device-hotplug.o pci-hotplug.o smbios.o wdt_ib700.o
>  obj-i386-y += debugcon.o multiboot.o
> -obj-i386-y += pc_piix.o
> +obj-i386-y += pc_piix.o kvmclock.o
>  obj-i386-$(CONFIG_SPICE) += qxl.o qxl-logger.o qxl-render.o
>  
>  # shared objects
> diff --git a/hw/kvmclock.c b/hw/kvmclock.c
> new file mode 100644
> index 000..b6ceddf
> --- /dev/null
> +++ b/hw/kvmclock.c
> @@ -0,0 +1,125 @@
> +/*
> + * QEMU KVM support, paravirtual clock device
> + *
> + * Copyright (C) 2011 Siemens AG
> + *
> + * Authors:
> + *  Jan Kiszka
> + *
> + * This work is licensed under the terms of the GNU GPL version 2.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#include "qemu-common.h"
> +#include "sysemu.h"
> +#include "sysbus.h"
> +#include "kvm.h"
> +#include "kvmclock.h"
> +
> +#if defined(CONFIG_KVM_PARA) && defined(KVM_CAP_ADJUST_CLOCK)
> +
> +#include 
> +#include 
> +
> +typedef struct KVMClockState {
> +SysBusDevice busdev;
> +uint64_t clock;
> +bool clock_valid;
> +} KVMClockState;
> +
> +static void kvmclock_pre_save(void *opaque)
> +{
> +KVMClockState *s = opaque;
> +struct kvm_clock_data data;
> +int ret;
> +
> +if (s->clock_valid) {
> +return;
> +}
> +ret = kvm_vm_ioctl(kvm_state, KVM_GET_CLOCK, &data);
> +if (ret < 0) {
> +fprintf(stderr, "KVM_GET_CLOCK failed: %s\n", strerror(ret));
> +data.clock = 0;
> +}
> +s->clock = data.clock;
> +/*
> + * If the VM is stopped, declare the clock state valid to avoid 
> re-reading
> + * it on next vmsave (which would return a different value). Will be 
> reset
> + * when the VM is continued.
> + */
> +s->clock_valid = !vm_running;
> +}
> +
> +static int kvmclock_post_load(void *opaque, int version_id)
> +{
> +KVMClockState *s = opaque;
> +struct kvm_clock_data data;
> +
> +data.clock = s->clock;
> +data.flags = 0;
> +return kvm_vm_ioctl(kvm_state, KVM_SET_CLOCK, &data);
> +}
> +
> +static void kvmclock_vm_state_change(void *opaque, int running, int reason)
> +{
> +KVMClockState *s = opaque;
> +
> +if (running) {
> +s->clock_valid = false;
> +}
> +}
> +
> +static int kvmclock_init(SysBusDevice *dev)
> +{
> +KVMClockState *s = FROM_SYSBUS(KVMClockState, dev);
> +
> +qemu_add_vm_change_state_handler(kvmclock_vm_state_change, s);
> +return 0;
> +}
> +
> +static const VMStateDescription kvmclock_vmsd = {
> +.name = "kvmclock",
> +.version_id = 1,
> +.minimum_version_id = 1,
> +.minimum_version_id_old = 1,
> +.pre_save = kvmclock_pre_save,
> +.post_load = kvmclock_post_load,
> +.fields = (VMStateField[]) {
> +VMSTATE_UINT64(clock, KVMClockState),
> +VMSTATE_END_OF_LIST()
> +}
> +};
> +
> +static SysBusDeviceInfo kvmclock_info = {
> +.qdev.name = "kvmclock",
> +.qdev.size = sizeof(KVMClockState),
> +.qdev.vmsd = &kvmclock_vmsd,
> +.qdev.no_user = 1,
> +.init = kvmclock_init,
> +};
> +
> +/* Note: Must be called after VCPU initialization. */
> +void kvmclock_create(void)
> +{
> +if (kvm_enabled() &&
> +first_cpu->cpu

[Qemu-devel] Re: [PATCH 15/15] kvm: x86: Introduce kvmclock device to save/restore its state

2011-02-07 Thread Glauber Costa

On Mon, 2011-02-07 at 19:12 +0100, Jan Kiszka wrote:
> On 2011-02-07 19:04, Glauber Costa wrote:
> > On Mon, 2011-02-07 at 15:03 +0100, Jan Kiszka wrote:
> >> On 2011-02-07 14:40, Glauber Costa wrote:
> >>> On Mon, 2011-02-07 at 13:36 +0100, Jan Kiszka wrote:
> >>>> On 2011-02-07 13:27, Glauber Costa wrote:
> >>>>> On Mon, 2011-02-07 at 12:19 +0100, Jan Kiszka wrote:
> >>>>>> If kvmclock is used, which implies the kernel supports it, register a
> >>>>>> kvmclock device with the sysbus. Its main purpose is to save and 
> >>>>>> restore
> >>>>>> the kernel state on migration, but this will also allow to visualize it
> >>>>>> one day.
> >>>>>>
> >>>>>> Signed-off-by: Jan Kiszka 
> >>>>>> CC: Glauber Costa 
> >>>>>> ---
> >>>>>>  Makefile.target |4 +-
> >>>>>>  hw/kvmclock.c   |  125 
> >>>>>> +++
> >>>>>>  hw/kvmclock.h   |   14 ++
> >>>>>>  hw/pc_piix.c|   31 +++---
> >>>>>>  4 files changed, 165 insertions(+), 9 deletions(-)
> >>>>>>  create mode 100644 hw/kvmclock.c
> >>>>>>  create mode 100644 hw/kvmclock.h
> >>>>>>
> >>>>>> diff --git a/Makefile.target b/Makefile.target
> >>>>>> index b0ba95f..30232fa 100644
> >>>>>> --- a/Makefile.target
> >>>>>> +++ b/Makefile.target
> >>>>>> @@ -37,7 +37,7 @@ ifndef CONFIG_HAIKU
> >>>>>>  LIBS+=-lm
> >>>>>>  endif
> >>>>>>  
> >>>>>> -kvm.o kvm-all.o vhost.o vhost_net.o: QEMU_CFLAGS+=$(KVM_CFLAGS)
> >>>>>> +kvm.o kvm-all.o vhost.o vhost_net.o kvmclock.o: 
> >>>>>> QEMU_CFLAGS+=$(KVM_CFLAGS)
> >>>>>>  
> >>>>>>  config-target.h: config-target.h-timestamp
> >>>>>>  config-target.h-timestamp: config-target.mak
> >>>>>> @@ -218,7 +218,7 @@ obj-i386-y += cirrus_vga.o apic.o ioapic.o 
> >>>>>> piix_pci.o
> >>>>>>  obj-i386-y += vmmouse.o vmport.o hpet.o applesmc.o
> >>>>>>  obj-i386-y += device-hotplug.o pci-hotplug.o smbios.o wdt_ib700.o
> >>>>>>  obj-i386-y += debugcon.o multiboot.o
> >>>>>> -obj-i386-y += pc_piix.o
> >>>>>> +obj-i386-y += pc_piix.o kvmclock.o
> >>>>>>  obj-i386-$(CONFIG_SPICE) += qxl.o qxl-logger.o qxl-render.o
> >>>>>>  
> >>>>>>  # shared objects
> >>>>>> diff --git a/hw/kvmclock.c b/hw/kvmclock.c
> >>>>>> new file mode 100644
> >>>>>> index 000..b6ceddf
> >>>>>> --- /dev/null
> >>>>>> +++ b/hw/kvmclock.c
> >>>>>> @@ -0,0 +1,125 @@
> >>>>>> +/*
> >>>>>> + * QEMU KVM support, paravirtual clock device
> >>>>>> + *
> >>>>>> + * Copyright (C) 2011 Siemens AG
> >>>>>> + *
> >>>>>> + * Authors:
> >>>>>> + *  Jan Kiszka
> >>>>>> + *
> >>>>>> + * This work is licensed under the terms of the GNU GPL version 2.
> >>>>>> + * See the COPYING file in the top-level directory.
> >>>>>> + *
> >>>>>> + */
> >>>>>> +
> >>>>>> +#include "qemu-common.h"
> >>>>>> +#include "sysemu.h"
> >>>>>> +#include "sysbus.h"
> >>>>>> +#include "kvm.h"
> >>>>>> +#include "kvmclock.h"
> >>>>>> +
> >>>>>> +#if defined(CONFIG_KVM_PARA) && defined(KVM_CAP_ADJUST_CLOCK)
> >>>>>> +
> >>>>>> +#include 
> >>>>>> +#include 
> >>>>>> +
> >>>>>> +typedef struct KVMClockState {
> >>>>>> +SysBusDevice busdev;
> >>>>>> +uint64_t clock;
> >>>>>> +bool clock_valid;
> >>>>>> +} KVMClockState;
> >>>>>> +
> >>>>>> +static void kvmclock_pre_save(void *opa

[Qemu-devel] Re: [PATCH 15/15] kvm: x86: Introduce kvmclock device to save/restore its state

2011-02-07 Thread Glauber Costa

On Mon, 2011-02-07 at 15:03 +0100, Jan Kiszka wrote:
> On 2011-02-07 14:40, Glauber Costa wrote:
> > On Mon, 2011-02-07 at 13:36 +0100, Jan Kiszka wrote:
> >> On 2011-02-07 13:27, Glauber Costa wrote:
> >>> On Mon, 2011-02-07 at 12:19 +0100, Jan Kiszka wrote:
> >>>> If kvmclock is used, which implies the kernel supports it, register a
> >>>> kvmclock device with the sysbus. Its main purpose is to save and restore
> >>>> the kernel state on migration, but this will also allow to visualize it
> >>>> one day.
> >>>>
> >>>> Signed-off-by: Jan Kiszka 
> >>>> CC: Glauber Costa 
> >>>> ---
> >>>>  Makefile.target |4 +-
> >>>>  hw/kvmclock.c   |  125 
> >>>> +++
> >>>>  hw/kvmclock.h   |   14 ++
> >>>>  hw/pc_piix.c|   31 +++---
> >>>>  4 files changed, 165 insertions(+), 9 deletions(-)
> >>>>  create mode 100644 hw/kvmclock.c
> >>>>  create mode 100644 hw/kvmclock.h
> >>>>
> >>>> diff --git a/Makefile.target b/Makefile.target
> >>>> index b0ba95f..30232fa 100644
> >>>> --- a/Makefile.target
> >>>> +++ b/Makefile.target
> >>>> @@ -37,7 +37,7 @@ ifndef CONFIG_HAIKU
> >>>>  LIBS+=-lm
> >>>>  endif
> >>>>  
> >>>> -kvm.o kvm-all.o vhost.o vhost_net.o: QEMU_CFLAGS+=$(KVM_CFLAGS)
> >>>> +kvm.o kvm-all.o vhost.o vhost_net.o kvmclock.o: 
> >>>> QEMU_CFLAGS+=$(KVM_CFLAGS)
> >>>>  
> >>>>  config-target.h: config-target.h-timestamp
> >>>>  config-target.h-timestamp: config-target.mak
> >>>> @@ -218,7 +218,7 @@ obj-i386-y += cirrus_vga.o apic.o ioapic.o piix_pci.o
> >>>>  obj-i386-y += vmmouse.o vmport.o hpet.o applesmc.o
> >>>>  obj-i386-y += device-hotplug.o pci-hotplug.o smbios.o wdt_ib700.o
> >>>>  obj-i386-y += debugcon.o multiboot.o
> >>>> -obj-i386-y += pc_piix.o
> >>>> +obj-i386-y += pc_piix.o kvmclock.o
> >>>>  obj-i386-$(CONFIG_SPICE) += qxl.o qxl-logger.o qxl-render.o
> >>>>  
> >>>>  # shared objects
> >>>> diff --git a/hw/kvmclock.c b/hw/kvmclock.c
> >>>> new file mode 100644
> >>>> index 000..b6ceddf
> >>>> --- /dev/null
> >>>> +++ b/hw/kvmclock.c
> >>>> @@ -0,0 +1,125 @@
> >>>> +/*
> >>>> + * QEMU KVM support, paravirtual clock device
> >>>> + *
> >>>> + * Copyright (C) 2011 Siemens AG
> >>>> + *
> >>>> + * Authors:
> >>>> + *  Jan Kiszka
> >>>> + *
> >>>> + * This work is licensed under the terms of the GNU GPL version 2.
> >>>> + * See the COPYING file in the top-level directory.
> >>>> + *
> >>>> + */
> >>>> +
> >>>> +#include "qemu-common.h"
> >>>> +#include "sysemu.h"
> >>>> +#include "sysbus.h"
> >>>> +#include "kvm.h"
> >>>> +#include "kvmclock.h"
> >>>> +
> >>>> +#if defined(CONFIG_KVM_PARA) && defined(KVM_CAP_ADJUST_CLOCK)
> >>>> +
> >>>> +#include 
> >>>> +#include 
> >>>> +
> >>>> +typedef struct KVMClockState {
> >>>> +SysBusDevice busdev;
> >>>> +uint64_t clock;
> >>>> +bool clock_valid;
> >>>> +} KVMClockState;
> >>>> +
> >>>> +static void kvmclock_pre_save(void *opaque)
> >>>> +{
> >>>> +KVMClockState *s = opaque;
> >>>> +struct kvm_clock_data data;
> >>>> +int ret;
> >>>> +
> >>>> +if (s->clock_valid) {
> >>>> +return;
> >>>> +}
> >>>> +ret = kvm_vm_ioctl(kvm_state, KVM_GET_CLOCK, &data);
> >>>> +if (ret < 0) {
> >>>> +fprintf(stderr, "KVM_GET_CLOCK failed: %s\n", strerror(ret));
> >>>> +data.clock = 0;
> >>>> +}
> >>>> +s->clock = data.clock;
> >>>> +/*
> >>>> + * If the VM is stopped, declare the clock state valid to avoid 
> >>>> r

[Qemu-devel] Re: [PATCH 15/15] kvm: x86: Introduce kvmclock device to save/restore its state

2011-02-07 Thread Glauber Costa

On Mon, 2011-02-07 at 13:36 +0100, Jan Kiszka wrote:
> On 2011-02-07 13:27, Glauber Costa wrote:
> > On Mon, 2011-02-07 at 12:19 +0100, Jan Kiszka wrote:
> >> If kvmclock is used, which implies the kernel supports it, register a
> >> kvmclock device with the sysbus. Its main purpose is to save and restore
> >> the kernel state on migration, but this will also allow to visualize it
> >> one day.
> >>
> >> Signed-off-by: Jan Kiszka 
> >> CC: Glauber Costa 
> >> ---
> >>  Makefile.target |4 +-
> >>  hw/kvmclock.c   |  125 
> >> +++
> >>  hw/kvmclock.h   |   14 ++
> >>  hw/pc_piix.c|   31 +++---
> >>  4 files changed, 165 insertions(+), 9 deletions(-)
> >>  create mode 100644 hw/kvmclock.c
> >>  create mode 100644 hw/kvmclock.h
> >>
> >> diff --git a/Makefile.target b/Makefile.target
> >> index b0ba95f..30232fa 100644
> >> --- a/Makefile.target
> >> +++ b/Makefile.target
> >> @@ -37,7 +37,7 @@ ifndef CONFIG_HAIKU
> >>  LIBS+=-lm
> >>  endif
> >>  
> >> -kvm.o kvm-all.o vhost.o vhost_net.o: QEMU_CFLAGS+=$(KVM_CFLAGS)
> >> +kvm.o kvm-all.o vhost.o vhost_net.o kvmclock.o: QEMU_CFLAGS+=$(KVM_CFLAGS)
> >>  
> >>  config-target.h: config-target.h-timestamp
> >>  config-target.h-timestamp: config-target.mak
> >> @@ -218,7 +218,7 @@ obj-i386-y += cirrus_vga.o apic.o ioapic.o piix_pci.o
> >>  obj-i386-y += vmmouse.o vmport.o hpet.o applesmc.o
> >>  obj-i386-y += device-hotplug.o pci-hotplug.o smbios.o wdt_ib700.o
> >>  obj-i386-y += debugcon.o multiboot.o
> >> -obj-i386-y += pc_piix.o
> >> +obj-i386-y += pc_piix.o kvmclock.o
> >>  obj-i386-$(CONFIG_SPICE) += qxl.o qxl-logger.o qxl-render.o
> >>  
> >>  # shared objects
> >> diff --git a/hw/kvmclock.c b/hw/kvmclock.c
> >> new file mode 100644
> >> index 000..b6ceddf
> >> --- /dev/null
> >> +++ b/hw/kvmclock.c
> >> @@ -0,0 +1,125 @@
> >> +/*
> >> + * QEMU KVM support, paravirtual clock device
> >> + *
> >> + * Copyright (C) 2011 Siemens AG
> >> + *
> >> + * Authors:
> >> + *  Jan Kiszka
> >> + *
> >> + * This work is licensed under the terms of the GNU GPL version 2.
> >> + * See the COPYING file in the top-level directory.
> >> + *
> >> + */
> >> +
> >> +#include "qemu-common.h"
> >> +#include "sysemu.h"
> >> +#include "sysbus.h"
> >> +#include "kvm.h"
> >> +#include "kvmclock.h"
> >> +
> >> +#if defined(CONFIG_KVM_PARA) && defined(KVM_CAP_ADJUST_CLOCK)
> >> +
> >> +#include 
> >> +#include 
> >> +
> >> +typedef struct KVMClockState {
> >> +SysBusDevice busdev;
> >> +uint64_t clock;
> >> +bool clock_valid;
> >> +} KVMClockState;
> >> +
> >> +static void kvmclock_pre_save(void *opaque)
> >> +{
> >> +KVMClockState *s = opaque;
> >> +struct kvm_clock_data data;
> >> +int ret;
> >> +
> >> +if (s->clock_valid) {
> >> +return;
> >> +}
> >> +ret = kvm_vm_ioctl(kvm_state, KVM_GET_CLOCK, &data);
> >> +if (ret < 0) {
> >> +fprintf(stderr, "KVM_GET_CLOCK failed: %s\n", strerror(ret));
> >> +data.clock = 0;
> >> +}
> >> +s->clock = data.clock;
> >> +/*
> >> + * If the VM is stopped, declare the clock state valid to avoid 
> >> re-reading
> >> + * it on next vmsave (which would return a different value). Will be 
> >> reset
> >> + * when the VM is continued.
> >> + */
> >> +s->clock_valid = !vm_running;
> >> +}
> >> +
> >> +static int kvmclock_post_load(void *opaque, int version_id)
> >> +{
> >> +KVMClockState *s = opaque;
> >> +struct kvm_clock_data data;
> >> +
> >> +data.clock = s->clock;
> >> +data.flags = 0;
> >> +return kvm_vm_ioctl(kvm_state, KVM_SET_CLOCK, &data);
> >> +}
> >> +
> >> +static void kvmclock_vm_state_change(void *opaque, int running, int 
> >> reason)
> >> +{
> >> +KVMClockState *s = opaque;
> >> +
> >> +if (running) {
> >&g

[Qemu-devel] [PATCH v3] make tsc stable over migration and machine start

2011-02-03 Thread Glauber Costa

If the machine is stopped, we should not record two different tsc values
upon a save operation. The same problem happens with kvmclock.

But kvmclock is taking a different diretion, being now seen as a separate
device. Since this is unlikely to happen with the tsc, I am taking the
approach here of simply registering a handler for state change, and
using a per-CPUState variable that prevents double updates for the TSC.

Signed-off-by: Glauber Costa 
CC: Jan Kiszka 

---
v2: updated tsc validation logic, as asked by Jan
v3: regenerated against uq/master
---
 target-i386/cpu.h |1 +
 target-i386/kvm.c |   18 +-
 2 files changed, 18 insertions(+), 1 deletions(-)

diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index af701a4..5f1df8b 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -734,6 +734,7 @@ typedef struct CPUX86State {
 uint32_t sipi_vector;
 uint32_t cpuid_kvm_features;
 uint32_t cpuid_svm_features;
+bool tsc_valid;
 
 /* in order to simplify APIC support, we leave this pointer to the
user */
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 05010bb..ed9557e 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -301,6 +301,15 @@ void kvm_inject_x86_mce(CPUState *cenv, int bank, uint64_t 
status,
 #endif
 }
 
+static void cpu_update_state(void *opaque, int running, int reason)
+{
+CPUState *env = opaque;
+
+if (running) {
+env->tsc_valid = false;
+}
+}
+
 int kvm_arch_init_vcpu(CPUState *env)
 {
 struct {
@@ -434,6 +443,8 @@ int kvm_arch_init_vcpu(CPUState *env)
 }
 #endif
 
+qemu_add_vm_change_state_handler(cpu_update_state, env);
+
 return kvm_vcpu_ioctl(env, KVM_SET_CPUID2, &cpuid_data);
 }
 
@@ -1061,7 +1072,12 @@ static int kvm_get_msrs(CPUState *env)
 if (has_msr_hsave_pa) {
 msrs[n++].index = MSR_VM_HSAVE_PA;
 }
-msrs[n++].index = MSR_IA32_TSC;
+
+if (!env->tsc_valid) {
+msrs[n++].index = MSR_IA32_TSC;
+env->tsc_valid = !vm_running;
+}
+
 #ifdef TARGET_X86_64
 if (lm_capable_kernel) {
 msrs[n++].index = MSR_CSTAR;
-- 
1.7.2.3

[Qemu-devel] Re: [PATCH] make tsc stable over migration and machine start

2011-02-02 Thread Glauber Costa

On Tue, 2011-02-01 at 21:26 +0100, Jan Kiszka wrote:
> On 2011-02-01 20:17, Glauber Costa wrote:
> > If the machine is stopped, we should not record two different tsc values
> > upon a save operation. The same problem happens with kvmclock.
> > 
> > But kvmclock is taking a different diretion, being now seen as a separate
> > device. Since this is unlikely to happen with the tsc, I am taking the
> > approach here of simply registering a handler for state change, and
> > using a per-CPUState variable that prevents double updates for the TSC.
> > 
> > Signed-off-by: Glauber Costa 
> > ---
> >  target-i386/cpu.h |1 +
> >  target-i386/kvm.c |   19 ++-
> >  2 files changed, 19 insertions(+), 1 deletions(-)
> > 
> > diff --git a/target-i386/cpu.h b/target-i386/cpu.h
> > index 6d619e8..7f1c4f8 100644
> > --- a/target-i386/cpu.h
> > +++ b/target-i386/cpu.h
> > @@ -732,6 +732,7 @@ typedef struct CPUX86State {
> >  uint32_t sipi_vector;
> >  uint32_t cpuid_kvm_features;
> >  uint32_t cpuid_svm_features;
> > +uint8_t  update_tsc;
> 
> bool please.
> 
> >  
> >  /* in order to simplify APIC support, we leave this pointer to the
> > user */
> > diff --git a/target-i386/kvm.c b/target-i386/kvm.c
> > index ecb8405..c3925be 100644
> > --- a/target-i386/kvm.c
> > +++ b/target-i386/kvm.c
> > @@ -302,6 +302,16 @@ void kvm_inject_x86_mce(CPUState *cenv, int bank, 
> > uint64_t status,
> >  
> >  static int _kvm_arch_init_vcpu(CPUState *env);
> >  
> > +static void cpu_update_state(void *opaque, int running, int reason)
> > +{
> > +CPUState *env = opaque;
> > +
> > +if (!running) {
> > +env->update_tsc = 1;
> > +}
> > +}
> > +
> > +
> 
> Additional blank line.
> 
> >  int kvm_arch_init_vcpu(CPUState *env)
> >  {
> >  int r;
> > @@ -444,6 +454,8 @@ int kvm_arch_init_vcpu(CPUState *env)
> >  }
> >  #endif
> >  
> > +qemu_add_vm_change_state_handler(cpu_update_state, env);
> > +
> >  return kvm_vcpu_ioctl(env, KVM_SET_CPUID2, &cpuid_data);
> >  }
> >  
> > @@ -1093,7 +1105,12 @@ static int kvm_get_msrs(CPUState *env)
> > msrs[n++].index = MSR_STAR;
> >  if (kvm_has_msr_hsave_pa(env))
> >  msrs[n++].index = MSR_VM_HSAVE_PA;
> > -msrs[n++].index = MSR_IA32_TSC;
> > +
> > +if (env->update_tsc) {
> > +msrs[n++].index = MSR_IA32_TSC;
> > +env->update_tsc = 0;
> > +}
> > +
> >  #ifdef TARGET_X86_64
> >  if (lm_capable_kernel) {
> >  msrs[n++].index = MSR_CSTAR;
> 
> Not quite the logic I'm using for kvmclock:

Ok. I have all the interest in keeping the same logic.
I will respin.

> cpu_update_state()
>   if running
>   tsc_valid = false;
> 
> kvm_get_msrs()
>   ...
>   if (!tsc_valid)
>   read_tsc
>   tsc_valid = !vm_running;
> 
> That ensure we always read the tsc while the VM is running, and not only
> after it was stopped (might otherwise be "surprising" when once
> visualizing the MSRs).
> 
> Jan
>

[Qemu-devel] [PATCH v2] make tsc stable over migration and machine start

2011-02-02 Thread Glauber Costa

If the machine is stopped, we should not record two different tsc values
upon a save operation. The same problem happens with kvmclock.

But kvmclock is taking a different diretion, being now seen as a separate
device. Since this is unlikely to happen with the tsc, I am taking the
approach here of simply registering a handler for state change, and
using a per-CPUState variable that prevents double updates for the TSC.

Signed-off-by: Glauber Costa 
CC: Jan Kiszka 

---
v2: updated tsc validation logic, as asked by Jan
---
 target-i386/cpu.h |1 +
 target-i386/kvm.c |   18 +-
 2 files changed, 18 insertions(+), 1 deletions(-)

diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index 6d619e8..6bb2e87 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -732,6 +732,7 @@ typedef struct CPUX86State {
 uint32_t sipi_vector;
 uint32_t cpuid_kvm_features;
 uint32_t cpuid_svm_features;
+bool tsc_valid;
 
 /* in order to simplify APIC support, we leave this pointer to the
user */
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index ecb8405..9cc198a 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -302,6 +302,15 @@ void kvm_inject_x86_mce(CPUState *cenv, int bank, uint64_t 
status,
 
 static int _kvm_arch_init_vcpu(CPUState *env);
 
+static void cpu_update_state(void *opaque, int running, int reason)
+{
+CPUState *env = opaque;
+
+if (running) {
+env->tsc_valid = false;
+}
+}
+
 int kvm_arch_init_vcpu(CPUState *env)
 {
 int r;
@@ -444,6 +453,8 @@ int kvm_arch_init_vcpu(CPUState *env)
 }
 #endif
 
+qemu_add_vm_change_state_handler(cpu_update_state, env);
+
 return kvm_vcpu_ioctl(env, KVM_SET_CPUID2, &cpuid_data);
 }
 
@@ -1093,7 +1104,12 @@ static int kvm_get_msrs(CPUState *env)
msrs[n++].index = MSR_STAR;
 if (kvm_has_msr_hsave_pa(env))
 msrs[n++].index = MSR_VM_HSAVE_PA;
-msrs[n++].index = MSR_IA32_TSC;
+
+if (!env->tsc_valid) {
+msrs[n++].index = MSR_IA32_TSC;
+env->tsc_valid = !vm_running;
+}
+
 #ifdef TARGET_X86_64
 if (lm_capable_kernel) {
 msrs[n++].index = MSR_CSTAR;
-- 
1.7.2.3

[Qemu-devel] [PATCH] make tsc stable over migration and machine start

2011-02-01 Thread Glauber Costa

If the machine is stopped, we should not record two different tsc values
upon a save operation. The same problem happens with kvmclock.

But kvmclock is taking a different diretion, being now seen as a separate
device. Since this is unlikely to happen with the tsc, I am taking the
approach here of simply registering a handler for state change, and
using a per-CPUState variable that prevents double updates for the TSC.

Signed-off-by: Glauber Costa 
---
 target-i386/cpu.h |1 +
 target-i386/kvm.c |   19 ++-
 2 files changed, 19 insertions(+), 1 deletions(-)

diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index 6d619e8..7f1c4f8 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -732,6 +732,7 @@ typedef struct CPUX86State {
 uint32_t sipi_vector;
 uint32_t cpuid_kvm_features;
 uint32_t cpuid_svm_features;
+uint8_t  update_tsc;
 
 /* in order to simplify APIC support, we leave this pointer to the
user */
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index ecb8405..c3925be 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -302,6 +302,16 @@ void kvm_inject_x86_mce(CPUState *cenv, int bank, uint64_t 
status,
 
 static int _kvm_arch_init_vcpu(CPUState *env);
 
+static void cpu_update_state(void *opaque, int running, int reason)
+{
+CPUState *env = opaque;
+
+if (!running) {
+env->update_tsc = 1;
+}
+}
+
+
 int kvm_arch_init_vcpu(CPUState *env)
 {
 int r;
@@ -444,6 +454,8 @@ int kvm_arch_init_vcpu(CPUState *env)
 }
 #endif
 
+qemu_add_vm_change_state_handler(cpu_update_state, env);
+
 return kvm_vcpu_ioctl(env, KVM_SET_CPUID2, &cpuid_data);
 }
 
@@ -1093,7 +1105,12 @@ static int kvm_get_msrs(CPUState *env)
msrs[n++].index = MSR_STAR;
 if (kvm_has_msr_hsave_pa(env))
 msrs[n++].index = MSR_VM_HSAVE_PA;
-msrs[n++].index = MSR_IA32_TSC;
+
+if (env->update_tsc) {
+msrs[n++].index = MSR_IA32_TSC;
+env->update_tsc = 0;
+}
+
 #ifdef TARGET_X86_64
 if (lm_capable_kernel) {
 msrs[n++].index = MSR_CSTAR;
-- 
1.7.2.3

[Qemu-devel] Re: [PATCH v3 15/21] kvm: x86: Introduce kvmclock device to save/restore its state

2011-01-04 Thread Glauber Costa

On Tue, 2011-01-04 at 12:45 +0100, Jan Kiszka wrote:
> Am 04.01.2011 12:43, Glauber Costa wrote:
> > On Tue, 2011-01-04 at 12:34 +0100, Jan Kiszka wrote:
> >> Am 04.01.2011 12:08, Glauber Costa wrote:
> >>> On Tue, 2011-01-04 at 09:32 +0100, Jan Kiszka wrote:
> >>>> From: Jan Kiszka 
> >>>>
> >>>> If kvmclock is used, which implies the kernel supports it, register a
> >>>> kvmclock device with the sysbus. Its main purpose is to save and restore
> >>>> the kernel state on migration, but this will also allow to visualize it
> >>>> one day.
> >>>
> >>> I would prefer having no pre-save, and doing the ioctl in the state
> >>> change handler. But I won't nitpick about that, if the maintainers think
> >>> this is okay, all the rest of the patch looks fine as well.
> >>
> >> I did this for a reason: to be able to obtain the current clock state
> >> even while the vm is running. It's cleaner IMHO.
> > 
> > but if we're on pre-save, we are certain that the vm is stopped at this
> > moment, no ?
> 
> Maybe at the moment, but not for device state dumping or maybe (dunno)
> for Kemari's continuous sync'ing. I simply want to avoid this implicit
> dependency.

that's fine.

[Qemu-devel] Re: [PATCH v3 15/21] kvm: x86: Introduce kvmclock device to save/restore its state

2011-01-04 Thread Glauber Costa

On Tue, 2011-01-04 at 12:34 +0100, Jan Kiszka wrote:
> Am 04.01.2011 12:08, Glauber Costa wrote:
> > On Tue, 2011-01-04 at 09:32 +0100, Jan Kiszka wrote:
> >> From: Jan Kiszka 
> >>
> >> If kvmclock is used, which implies the kernel supports it, register a
> >> kvmclock device with the sysbus. Its main purpose is to save and restore
> >> the kernel state on migration, but this will also allow to visualize it
> >> one day.
> > 
> > I would prefer having no pre-save, and doing the ioctl in the state
> > change handler. But I won't nitpick about that, if the maintainers think
> > this is okay, all the rest of the patch looks fine as well.
> 
> I did this for a reason: to be able to obtain the current clock state
> even while the vm is running. It's cleaner IMHO.

but if we're on pre-save, we are certain that the vm is stopped at this
moment, no ?

[Qemu-devel] Re: [PATCH v3 15/21] kvm: x86: Introduce kvmclock device to save/restore its state

2011-01-04 Thread Glauber Costa

On Tue, 2011-01-04 at 09:32 +0100, Jan Kiszka wrote:
> From: Jan Kiszka 
> 
> If kvmclock is used, which implies the kernel supports it, register a
> kvmclock device with the sysbus. Its main purpose is to save and restore
> the kernel state on migration, but this will also allow to visualize it
> one day.

I would prefer having no pre-save, and doing the ioctl in the state
change handler. But I won't nitpick about that, if the maintainers think
this is okay, all the rest of the patch looks fine as well.

[Qemu-devel] Re: [PATCH v3 11/21] kvm: x86: Reset paravirtual MSRs

2011-01-04 Thread Glauber Costa

On Tue, 2011-01-04 at 09:32 +0100, Jan Kiszka wrote:
> From: Jan Kiszka 
> 
> Make sure to write the cleared MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
> and MSR_KVM_ASYNC_PF_EN to the kernel state so that a freshly booted
> guest cannot be disturbed by old values.
> 
> Signed-off-by: Jan Kiszka 
> CC: Glauber Costa 

looks good.

>  target-i386/kvm.c |7 +++
>  1 files changed, 7 insertions(+), 0 deletions(-)
> 
> diff --git a/target-i386/kvm.c b/target-i386/kvm.c
> index d8f26bf..8267655 100644
> --- a/target-i386/kvm.c
> +++ b/target-i386/kvm.c
> @@ -845,6 +845,13 @@ static int kvm_put_msrs(CPUState *env, int level)
>  if (smp_cpus == 1 || env->tsc != 0) {
>  kvm_msr_entry_set(&msrs[n++], MSR_IA32_TSC, env->tsc);
>  }
> +}
> +/*
> + * The following paravirtual MSRs have side effects on the guest or are
> + * too heavy for normal writeback. Limit them to reset or full state
> + * updates.
> + */
> +if (level >= KVM_PUT_RESET_STATE) {
>  kvm_msr_entry_set(&msrs[n++], MSR_KVM_SYSTEM_TIME,
>env->system_time_msr);
>  kvm_msr_entry_set(&msrs[n++], MSR_KVM_WALL_CLOCK, 
> env->wall_clock_msr);

[Qemu-devel] Re: [PATCH v2 11/17] kvm: x86: Reset paravirtual MSRs

2011-01-03 Thread Glauber Costa

On Mon, 2011-01-03 at 17:46 +0100, Jan Kiszka wrote:
> Am 03.01.2011 17:40, Glauber Costa wrote:
> > On Mon, 2011-01-03 at 09:33 +0100, Jan Kiszka wrote:
> >> From: Jan Kiszka 
> >>
> >> Make sure to clear MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK, and
> >> MSR_KVM_ASYNC_PF_EN so that a freshly booted guest cannot be disturbed
> >> by old values.
> >>
> >> Signed-off-by: Jan Kiszka 
> >> CC: Glauber Costa 
> >> ---
> >>  target-i386/kvm.c |   10 ++
> >>  1 files changed, 10 insertions(+), 0 deletions(-)
> >>
> >> diff --git a/target-i386/kvm.c b/target-i386/kvm.c
> >> index d8f26bf..664a4a0 100644
> >> --- a/target-i386/kvm.c
> >> +++ b/target-i386/kvm.c
> >> @@ -453,6 +453,9 @@ void kvm_arch_reset_vcpu(CPUState *env)
> >>  env->nmi_injected = 0;
> >>  env->nmi_pending = 0;
> >>  env->xcr0 = 1;
> >> +env->system_time_msr = 0;
> >> +env->wall_clock_msr = 0;
> >> +env->async_pf_en_msr = 0;
> > 
> > Have you seen this happening? I'd expect CPUState to be zeroed out over
> > init. And if it is not, I guess we should...
> 
> Ah, true, those three are part of the section that is zeroed. Will drop
> that hunk on repost.
> 
> Guess we should rather move some other variables in that region too and
> avoid clearing them manually like above...
> 
> Jan
> 
Agreed.

[Qemu-devel] Re: [PATCH v2 14/17] kvm: x86: Introduce kvmclock device to save/restore its state

2011-01-03 Thread Glauber Costa

On Mon, 2011-01-03 at 17:30 +0100, Jan Kiszka wrote:
> Am 03.01.2011 17:04, Avi Kivity wrote:
> > On 01/03/2011 10:33 AM, Jan Kiszka wrote:
> >> From: Jan Kiszka
> >>
> >> If kvmclock is used, which implies the kernel supports it, register a
> >> kvmclock device with the sysbus. Its main purpose is to save and restore
> >> the kernel state on migration, but this will also allow to visualize it
> >> one day.
> >>
> > 
> > kvmclock is a per-cpu affair.
> 
> Nope, it's state (the one save/restored here) is per VM.
> 
> > 
> >>
> >> @@ -534,6 +599,10 @@ int kvm_arch_init(int smp_cpus)
> >>   int ret;
> >>   struct utsname utsname;
> >>
> >> +#ifdef KVM_CAP_ADJUST_CLOCK
> >> +sysbus_register_withprop(&kvmclock_info);
> >> +#endif
> >> +
> > 
> > So this doesn't look right.  I think we're fine with just migrating the
> > MSRs, like we migrate anything else that has to do with the cpu.
> > 
> 
> The kvmclock state is not contained in any MSR. It's an independent
> machine state that can be indirectly obtained via MSR access. Therefore,
> qemu-kvm currently registers only one vmstate entry per machine, and
> this patch just turns this into a clean device - because that's what
> kvmclock is in the end, something like an HPET.

Jan is right, the per-cpu information only cares about which piece of
memory will be written to.

[Qemu-devel] Re: [PATCH v2 14/17] kvm: x86: Introduce kvmclock device to save/restore its state

2011-01-03 Thread Glauber Costa

On Mon, 2011-01-03 at 09:33 +0100, Jan Kiszka wrote:
> From: Jan Kiszka 
> 
> If kvmclock is used, which implies the kernel supports it, register a
> kvmclock device with the sysbus. Its main purpose is to save and restore
> the kernel state on migration, but this will also allow to visualize it
> one day.
> 
> Signed-off-by: Jan Kiszka 
> CC: Glauber Costa 

Hi Jan.

I've just recently posted a patch (not sure what was made from it), that
fixes a bug that you reintroduce here.

The bug is: if we call KVM_GET_CLOCK ioctl in pre_save, this means that
this value will change every time we issue savevm, even if the machine
is not run in between.

Ideally, we'd like to have two consecutive savevms returning the exact
same thing in that situation.

I like the general approach of moving it to sysbus, but please move the
ioctl to change state notifiers.

Cheers!

[Qemu-devel] Re: [PATCH v2 11/17] kvm: x86: Reset paravirtual MSRs

2011-01-03 Thread Glauber Costa

On Mon, 2011-01-03 at 09:33 +0100, Jan Kiszka wrote:
> From: Jan Kiszka 
> 
> Make sure to clear MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK, and
> MSR_KVM_ASYNC_PF_EN so that a freshly booted guest cannot be disturbed
> by old values.
> 
> Signed-off-by: Jan Kiszka 
> CC: Glauber Costa 
> ---
>  target-i386/kvm.c |   10 ++
>  1 files changed, 10 insertions(+), 0 deletions(-)
> 
> diff --git a/target-i386/kvm.c b/target-i386/kvm.c
> index d8f26bf..664a4a0 100644
> --- a/target-i386/kvm.c
> +++ b/target-i386/kvm.c
> @@ -453,6 +453,9 @@ void kvm_arch_reset_vcpu(CPUState *env)
>  env->nmi_injected = 0;
>  env->nmi_pending = 0;
>  env->xcr0 = 1;
> +env->system_time_msr = 0;
> +env->wall_clock_msr = 0;
> +env->async_pf_en_msr = 0;

Have you seen this happening? I'd expect CPUState to be zeroed out over
init. And if it is not, I guess we should...

[Qemu-devel] Re: [PATCH v2 14/17] kvm: x86: Introduce kvmclock device to save/restore its state

2011-01-03 Thread Glauber Costa

On Mon, 2011-01-03 at 18:04 +0200, Avi Kivity wrote:
> On 01/03/2011 10:33 AM, Jan Kiszka wrote:
> > From: Jan Kiszka
> >
> > If kvmclock is used, which implies the kernel supports it, register a
> > kvmclock device with the sysbus. Its main purpose is to save and restore
> > the kernel state on migration, but this will also allow to visualize it
> > one day.
> >
> 
> kvmclock is a per-cpu affair.
> 
> >
> > @@ -534,6 +599,10 @@ int kvm_arch_init(int smp_cpus)
> >   int ret;
> >   struct utsname utsname;
> >
> > +#ifdef KVM_CAP_ADJUST_CLOCK
> > +sysbus_register_withprop(&kvmclock_info);
> > +#endif
> > +
> 
> So this doesn't look right.  I think we're fine with just migrating the 
> MSRs, like we migrate anything else that has to do with the cpu.

The ioctl jan is handling here is a global one, that adjusts the base
offset for the clock over migration. It is okay.

[Qemu-devel] Re: [PATCH v2 1/2] Do not register kvmclock savevm section if kvmclock is disabled.

2010-12-13 Thread Glauber Costa

On Wed, 2010-12-08 at 17:31 -0200, Marcelo Tosatti wrote:
> On Tue, Dec 07, 2010 at 03:12:36PM -0200, Glauber Costa wrote:
> > On Mon, 2010-12-06 at 19:04 -0200, Marcelo Tosatti wrote:
> > > On Mon, Dec 06, 2010 at 09:03:46AM -0500, Glauber Costa wrote:
> > > > Usually nobody usually thinks about that scenario (me included and 
> > > > specially),
> > > > but kvmclock can be actually disabled in the host.
> > > > 
> > > > It happens in two scenarios:
> > > >  1. host too old.
> > > >  2. we passed -kvmclock to our -cpu parameter.
> > > > 
> > > > In both cases, we should not register kvmclock savevm section. This 
> > > > patch
> > > > achives that by registering this section only if kvmclock is actually
> > > > currently enabled in cpuid.
> > > > 
> > > > The only caveat is that we have to register the savevm section a little 
> > > > bit
> > > > later, since we won't know the final kvmclock state before cpuid gets 
> > > > parsed.
> > > 
> > > What is the problem of registering the section? Restoring the value if
> > > the host does not support it returns an error?
> > > 
> > > Can't you ignore the error if kvmclock is not reported in cpuid, in the
> > > restore handler?
> > 
> > We can change the restore handler, but not the restore handler of
> > binaries that are already out there. The motivation here is precisely to
> > address migration to hosts without kvmclock, so it's better to have
> > a way to disable, than to count on the fact that the other side will be
> > able to ignore it.
> 
> OK. Can't you register conditionally on kvmclock cpuid bit at the end of
> kvm_arch_init_vcpu, in target-i386/kvm.c?

Haven't looked at it, but will today. Actually, tsc has (obviously) the
same problem and I plan to respin the patch today including a fix for it
as well.

Thanks!

[Qemu-devel] Re: [PATCH v2 1/2] Do not register kvmclock savevm section if kvmclock is disabled.

2010-12-07 Thread Glauber Costa

On Mon, 2010-12-06 at 19:04 -0200, Marcelo Tosatti wrote:
> On Mon, Dec 06, 2010 at 09:03:46AM -0500, Glauber Costa wrote:
> > Usually nobody usually thinks about that scenario (me included and 
> > specially),
> > but kvmclock can be actually disabled in the host.
> > 
> > It happens in two scenarios:
> >  1. host too old.
> >  2. we passed -kvmclock to our -cpu parameter.
> > 
> > In both cases, we should not register kvmclock savevm section. This patch
> > achives that by registering this section only if kvmclock is actually
> > currently enabled in cpuid.
> > 
> > The only caveat is that we have to register the savevm section a little bit
> > later, since we won't know the final kvmclock state before cpuid gets 
> > parsed.
> 
> What is the problem of registering the section? Restoring the value if
> the host does not support it returns an error?
> 
> Can't you ignore the error if kvmclock is not reported in cpuid, in the
> restore handler?

We can change the restore handler, but not the restore handler of
binaries that are already out there. The motivation here is precisely to
address migration to hosts without kvmclock, so it's better to have
a way to disable, than to count on the fact that the other side will be
able to ignore it.

[Qemu-devel] [PATCH v2 1/2] Do not register kvmclock savevm section if kvmclock is disabled.

2010-12-06 Thread Glauber Costa

Usually nobody usually thinks about that scenario (me included and specially),
but kvmclock can be actually disabled in the host.

It happens in two scenarios:
 1. host too old.
 2. we passed -kvmclock to our -cpu parameter.

In both cases, we should not register kvmclock savevm section. This patch
achives that by registering this section only if kvmclock is actually
currently enabled in cpuid.

The only caveat is that we have to register the savevm section a little bit
later, since we won't know the final kvmclock state before cpuid gets parsed.

Signed-off-by: Glauber Costa 
---
 cpus.c|3 +++
 qemu-kvm-x86.c|   19 +--
 qemu-kvm.h|3 +++
 target-i386/kvm.c |7 +++
 4 files changed, 26 insertions(+), 6 deletions(-)

diff --git a/cpus.c b/cpus.c
index a55c330..a24098e 100644
--- a/cpus.c
+++ b/cpus.c
@@ -97,6 +97,9 @@ void cpu_synchronize_all_post_init(void)
 for (cpu = first_cpu; cpu; cpu = cpu->next_cpu) {
 cpu_synchronize_post_init(cpu);
 }
+if (kvm_enabled()) {
+kvmclock_register_savevm();
+}
 }
 
 int cpu_is_stopped(CPUState *env)
diff --git a/qemu-kvm-x86.c b/qemu-kvm-x86.c
index 20b7d6d..14a52ce 100644
--- a/qemu-kvm-x86.c
+++ b/qemu-kvm-x86.c
@@ -504,6 +504,9 @@ static void kvmclock_pre_save(void *opaque)
 {
 struct kvm_clock_data *cl = opaque;
 
+if (!kvmclock_enabled)
+return;
+
 kvm_vm_ioctl(kvm_state, KVM_GET_CLOCK, cl);
 }
 
@@ -528,6 +531,16 @@ static const VMStateDescription vmstate_kvmclock= {
 };
 #endif
 
+/* This has to happen after vcpu setup*/
+void kvmclock_register_savevm(void)
+{
+#ifdef KVM_CAP_ADJUST_CLOCK
+if (kvmclock_enabled && kvm_check_extension(kvm_state, 
KVM_CAP_ADJUST_CLOCK)) {
+vmstate_register(NULL, 0, &vmstate_kvmclock, &kvmclock_data);
+}
+#endif
+}
+
 int kvm_arch_qemu_create_context(void)
 {
 int r;
@@ -545,12 +558,6 @@ int kvm_arch_qemu_create_context(void)
 return -1;
 }
 
-#ifdef KVM_CAP_ADJUST_CLOCK
-if (kvm_check_extension(kvm_state, KVM_CAP_ADJUST_CLOCK)) {
-vmstate_register(NULL, 0, &vmstate_kvmclock, &kvmclock_data);
-}
-#endif
-
 r = kvm_set_boot_cpu_id(0);
 if (r < 0 && r != -ENOSYS) {
 return r;
diff --git a/qemu-kvm.h b/qemu-kvm.h
index 0f3fb50..0a104ef 100644
--- a/qemu-kvm.h
+++ b/qemu-kvm.h
@@ -752,6 +752,9 @@ int handle_tpr_access(void *opaque, CPUState *env, uint64_t 
rip,
 #define qemu_kvm_cpu_stop(env) do {} while(0)
 #endif
 
+extern int kvmclock_enabled;
+void kvmclock_register_savevm(void);
+
 #ifdef CONFIG_KVM
 
 typedef struct KVMSlot {
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 95e5d02..5443765 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -293,6 +293,7 @@ void kvm_inject_x86_mce(CPUState *cenv, int bank, uint64_t 
status,
 }
 
 static int _kvm_arch_init_vcpu(CPUState *env);
+int kvmclock_enabled = 1;
 
 int kvm_arch_init_vcpu(CPUState *env)
 {
@@ -350,6 +351,12 @@ int kvm_arch_init_vcpu(CPUState *env)
 memset(c, 0, sizeof(*c));
 c->function = KVM_CPUID_FEATURES;
 c->eax = env->cpuid_kvm_features & get_para_features(env);
+
+if (!(c->eax & (1 << KVM_FEATURE_CLOCKSOURCE))) {
+/* In theory cpuid is per-cpu, and this is a global variable,
+ * but we don't expect kvmclock enabled in some cpus only */
+kvmclock_enabled = 0;
+}
 #endif
 
 cpu_x86_cpuid(env, 0, 0, &limit, &unused, &unused, &unused);
-- 
1.7.2.3

[Qemu-devel] [PATCH v2 0/2] savevm odness related to kvmclock

2010-12-06 Thread Glauber Costa

Some users told me that savevm path is behaving oddly wrt kvmclock.
The first oddness is that a guarantee we never made (AFAIK) is being broken:
two consecutive "savevm" operations, with the machine stopped in between
produces different results, due to the call to KVM_GET_CLOCK ioctl.
I believe the assumption that if the vm does not run, its saveable
state won't change is fairly reasonable. Maybe we should formally
guarantee that?

Also, this patch deals with the fact that this happens even if
kvmclock is disabled in cpuid: its savevm section is registered
nevertheless. Here, I try to register it only if it's enabled at
machine start.

v2: improvements suggested by Paolo, and patch reordering.

Glauber Costa (2):
  Do not register kvmclock savevm section if kvmclock is disabled.
  make kvmclock value idempotent for stopped machine

 cpus.c|3 +++
 qemu-kvm-x86.c|   23 +++
 qemu-kvm.h|3 +++
 target-i386/kvm.c |7 +++
 4 files changed, 28 insertions(+), 8 deletions(-)

-- 
1.7.2.3

[Qemu-devel] [PATCH v2 2/2] make kvmclock value idempotent for stopped machine

2010-12-06 Thread Glauber Costa

Although we never made such commitment clear (well, to the best
of my knowledge), some people expect that two savevm issued in sequence
in a stopped machine will yield the same results. This is not a crazy
requirement, since we don't expect a stopped machine to be updating its state,
for any device.

With kvmclock, this is not the case, since the .pre_save hook will issue an
ioctl to the host to acquire a timestamp, which is always changing.

This patch moves the value acquisition to vm state change handlers, conditional
on not being run. This could mean mean our get clock ioctl is issued more times,
but this should be fine since vm_stop is not a hot path.

When we do migrate, we'll transfer that value along.

Signed-off-by: Glauber Costa 
CC: Paolo Bonzini 

---
 qemu-kvm-x86.c |6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/qemu-kvm-x86.c b/qemu-kvm-x86.c
index 14a52ce..0e357ac 100644
--- a/qemu-kvm-x86.c
+++ b/qemu-kvm-x86.c
@@ -500,11 +500,11 @@ static int kvm_enable_tpr_access_reporting(CPUState *env)
 #ifdef KVM_CAP_ADJUST_CLOCK
 static struct kvm_clock_data kvmclock_data;
 
-static void kvmclock_pre_save(void *opaque)
+static void kvmclock_update_clock(void *opaque, int running, int reason)
 {
 struct kvm_clock_data *cl = opaque;
 
-if (!kvmclock_enabled)
+if ((!kvmclock_enabled) || running)
 return;
 
 kvm_vm_ioctl(kvm_state, KVM_GET_CLOCK, cl);
@@ -522,7 +522,6 @@ static const VMStateDescription vmstate_kvmclock= {
 .version_id = 1,
 .minimum_version_id = 1,
 .minimum_version_id_old = 1,
-.pre_save = kvmclock_pre_save,
 .post_load = kvmclock_post_load,
 .fields  = (VMStateField []) {
 VMSTATE_U64(clock, struct kvm_clock_data),
@@ -537,6 +536,7 @@ void kvmclock_register_savevm(void)
 #ifdef KVM_CAP_ADJUST_CLOCK
 if (kvmclock_enabled && kvm_check_extension(kvm_state, 
KVM_CAP_ADJUST_CLOCK)) {
 vmstate_register(NULL, 0, &vmstate_kvmclock, &kvmclock_data);
+qemu_add_vm_change_state_handler(kvmclock_update_clock, 
&kvmclock_data);
 }
 #endif
 }
-- 
1.7.2.3

[Qemu-devel] [PATCH 1/2] make kvmclock value idempotent for stopped machine

2010-12-03 Thread Glauber Costa

Although we never made such commitment clear (well, to the best
of my knowledge), some people expect that two savevm issued in sequence
in a stopped machine will yield the same results. This is not a crazy
requirement, since we don't expect a stopped machine to be updating its state,
for any device.

With kvmclock, this is not the case, since the .pre_save hook will issue an
ioctl to the host to acquire a timestamp, which is always changing.

This patch moves the value acquisition to vm_stop. This should mean our
get clock ioctl is issued more times, but this should be fine since vm_stop
is not a hot path.

When we do migrate, we'll transfer that value along.

Signed-off-by: Glauber Costa 
---
 cpus.c |4 
 qemu-kvm-x86.c |7 ++-
 qemu-kvm.h |2 ++
 3 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/cpus.c b/cpus.c
index a55c330..879a03a 100644
--- a/cpus.c
+++ b/cpus.c
@@ -112,6 +112,10 @@ static void do_vm_stop(int reason)
 pause_all_vcpus();
 vm_state_notify(0, reason);
 monitor_protocol_event(QEVENT_STOP, NULL);
+if (kvm_enabled()) {
+kvmclock_update_clock();
+}
+
 }
 }
 
diff --git a/qemu-kvm-x86.c b/qemu-kvm-x86.c
index 20b7d6d..d099d3d 100644
--- a/qemu-kvm-x86.c
+++ b/qemu-kvm-x86.c
@@ -500,11 +500,9 @@ static int kvm_enable_tpr_access_reporting(CPUState *env)
 #ifdef KVM_CAP_ADJUST_CLOCK
 static struct kvm_clock_data kvmclock_data;
 
-static void kvmclock_pre_save(void *opaque)
+void kvmclock_update_clock(void)
 {
-struct kvm_clock_data *cl = opaque;
-
-kvm_vm_ioctl(kvm_state, KVM_GET_CLOCK, cl);
+kvm_vm_ioctl(kvm_state, KVM_GET_CLOCK, &kvmclock_data);
 }
 
 static int kvmclock_post_load(void *opaque, int version_id)
@@ -519,7 +517,6 @@ static const VMStateDescription vmstate_kvmclock= {
 .version_id = 1,
 .minimum_version_id = 1,
 .minimum_version_id_old = 1,
-.pre_save = kvmclock_pre_save,
 .post_load = kvmclock_post_load,
 .fields  = (VMStateField []) {
 VMSTATE_U64(clock, struct kvm_clock_data),
diff --git a/qemu-kvm.h b/qemu-kvm.h
index 0f3fb50..b0b7ab3 100644
--- a/qemu-kvm.h
+++ b/qemu-kvm.h
@@ -752,6 +752,8 @@ int handle_tpr_access(void *opaque, CPUState *env, uint64_t 
rip,
 #define qemu_kvm_cpu_stop(env) do {} while(0)
 #endif
 
+void kvmclock_update_clock(void);
+
 #ifdef CONFIG_KVM
 
 typedef struct KVMSlot {
-- 
1.7.2.3

[Qemu-devel] [PATCH 2/2] Do not register kvmclock savevm section if kvmclock is disabled.

2010-12-03 Thread Glauber Costa

Usually nobody usually thinks about that scenario (me included and specially),
but kvmclock can be actually disabled in the host.

It happens in two scenarios:
 1. host too old.
 2. we passed -kvmclock to our -cpu parameter.

In both cases, we should not register kvmclock savevm section. This patch
achives that by registering this section only if kvmclock is actually
currently enabled in cpuid.

The only caveat is that we have to register the savevm section a little bit
later, since we won't know the final kvmclock state before cpuid gets parsed.

Signed-off-by: Glauber Costa 
---
 cpus.c|3 +++
 qemu-kvm-x86.c|   20 ++--
 qemu-kvm.h|2 ++
 target-i386/kvm.c |7 +++
 4 files changed, 26 insertions(+), 6 deletions(-)

diff --git a/cpus.c b/cpus.c
index 879a03a..eef716c 100644
--- a/cpus.c
+++ b/cpus.c
@@ -97,6 +97,9 @@ void cpu_synchronize_all_post_init(void)
 for (cpu = first_cpu; cpu; cpu = cpu->next_cpu) {
 cpu_synchronize_post_init(cpu);
 }
+if (kvm_enabled()) {
+kvmclock_register_savevm();
+}
 }
 
 int cpu_is_stopped(CPUState *env)
diff --git a/qemu-kvm-x86.c b/qemu-kvm-x86.c
index d099d3d..668c8cf 100644
--- a/qemu-kvm-x86.c
+++ b/qemu-kvm-x86.c
@@ -502,6 +502,9 @@ static struct kvm_clock_data kvmclock_data;
 
 void kvmclock_update_clock(void)
 {
+if (!kvmclock_enabled)
+return;
+
 kvm_vm_ioctl(kvm_state, KVM_GET_CLOCK, &kvmclock_data);
 }
 
@@ -525,6 +528,17 @@ static const VMStateDescription vmstate_kvmclock= {
 };
 #endif
 
+/* This has to happen after vcpu setup*/
+void kvmclock_register_savevm(void)
+{
+#ifdef KVM_CAP_ADJUST_CLOCK
+if (kvmclock_enabled && kvm_check_extension(kvm_state, 
KVM_CAP_ADJUST_CLOCK)) {
+printf("registering kvmclock savevm section\n");
+vmstate_register(NULL, 0, &vmstate_kvmclock, &kvmclock_data);
+}
+#endif
+}
+
 int kvm_arch_qemu_create_context(void)
 {
 int r;
@@ -542,12 +556,6 @@ int kvm_arch_qemu_create_context(void)
 return -1;
 }
 
-#ifdef KVM_CAP_ADJUST_CLOCK
-if (kvm_check_extension(kvm_state, KVM_CAP_ADJUST_CLOCK)) {
-vmstate_register(NULL, 0, &vmstate_kvmclock, &kvmclock_data);
-}
-#endif
-
 r = kvm_set_boot_cpu_id(0);
 if (r < 0 && r != -ENOSYS) {
 return r;
diff --git a/qemu-kvm.h b/qemu-kvm.h
index b0b7ab3..f51a2d6 100644
--- a/qemu-kvm.h
+++ b/qemu-kvm.h
@@ -753,6 +753,8 @@ int handle_tpr_access(void *opaque, CPUState *env, uint64_t 
rip,
 #endif
 
 void kvmclock_update_clock(void);
+extern int kvmclock_enabled;
+void kvmclock_register_savevm(void);
 
 #ifdef CONFIG_KVM
 
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 95e5d02..5443765 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -293,6 +293,7 @@ void kvm_inject_x86_mce(CPUState *cenv, int bank, uint64_t 
status,
 }
 
 static int _kvm_arch_init_vcpu(CPUState *env);
+int kvmclock_enabled = 1;
 
 int kvm_arch_init_vcpu(CPUState *env)
 {
@@ -350,6 +351,12 @@ int kvm_arch_init_vcpu(CPUState *env)
 memset(c, 0, sizeof(*c));
 c->function = KVM_CPUID_FEATURES;
 c->eax = env->cpuid_kvm_features & get_para_features(env);
+
+if (!(c->eax & (1 << KVM_FEATURE_CLOCKSOURCE))) {
+/* In theory cpuid is per-cpu, and this is a global variable,
+ * but we don't expect kvmclock enabled in some cpus only */
+kvmclock_enabled = 0;
+}
 #endif
 
 cpu_x86_cpuid(env, 0, 0, &limit, &unused, &unused, &unused);
-- 
1.7.2.3

[Qemu-devel] [PATCH 0/2] Fix savevm odness related to kvmclock

2010-12-03 Thread Glauber Costa

Some users told me that savevm path is behaving oddly wrt kvmclock.
The first oddness is that a guarantee we never made (AFAIK) is being broken:
two consecutive "savevm" operations, with the machine stopped in between
produces different results, due to the call to KVM_GET_CLOCK ioctl.
I believe the assumption that if the vm does not run, its saveable
state won't change is fairly reasonable. Maybe we should formally
guarantee that?

The second patch deals with the fact that this happens even if
kvmclock is disabled in cpuid: its savevm section is registered
nevertheless. Here, I try to register it only if it's enabled at
machine start.

Thanks

Glauber Costa (2):
  make kvmclock value idempotent for stopped machine
  Do not register kvmclock savevm section if kvmclock is disabled.

 cpus.c|7 +++
 qemu-kvm-x86.c|   25 +++--
 qemu-kvm.h|4 
 target-i386/kvm.c |7 +++
 4 files changed, 33 insertions(+), 10 deletions(-)

-- 
1.7.2.3

[Qemu-devel] Re: [PATCH 0/5] RFC: distinguish warm reset from cold reset.

2010-08-30 Thread Glauber Costa

On Mon, Aug 30, 2010 at 05:35:20PM +0900, Isaku Yamahata wrote:
> On Mon, Aug 30, 2010 at 10:59:19AM +0300, Avi Kivity wrote:
> >  On 08/30/2010 10:49 AM, Isaku Yamahata wrote:
> >> This patch set distinguish warm reset from cold reset by
> >> introducing warm reset callback handler.
> >> The first 4 patches are trivial clean up patches. The last patch of 5/5
> >> is RFC patch.
> >>
> >> The following thread arose cold reset vs warm reset issues.
> >> http://lists.nongnu.org/archive/html/qemu-devel/2010-08/msg00186.html
> >> The summary is
> >> - warm reset is wanted in qemu
> >>- Pressing the reset button is a warm reset on real machines
> >>- Sparc64 CPU uses different reset vector for warm and cold reset,
> >>  so system_reset acts like a reset button
> >>- Bus reset can be implemented utilizing qdev frame work instead of
> >>  implemeting it each bus layer independently.
> >> - The modification should be incremental.
> >>Anthony would like to see that as an incremental addition to what we 
> >> have
> >>today (like introducing a propagating warm reset callback) and thinking
> >>through what the actual behavior should and shouldn't be.
> >>
> >>
> >> If the direction is okay, The next step would be a patch(set) for qdev 
> >> which
> >> would introduce qdev_cold_reset(), qdev_warm_reset(),
> >> DeviceInfo::cold_reset and DeviceInfo::warm_reset
> >> and would obsolete qdev_reset() and DeviceInfo::reset.
> >>
> >
> > What would be the difference between warm and cold reset?  Former called  
> > on any reset, while the latter called on power up only?
> 
> What I have in mind at the moment is,
> warm reset callback is called on warm reset, not called on power up.
> cold reset callback is called only on power up (and power cycle).
> 
> Hmm, should warm reset handler be called on power up?
> Each cold reset callbacks can call corresponding warm reset handler
> if necessary. So it would be redundant to call qdev_warm_reset() on power up.
> Or would it be more robust to call warm reset in addition to cold reset
> on power on?
I guess a good way to do that would be making "reset" just mean warm reset,
and keep a single list. Those would be called on every kind of reset.
Alternatively, cards that want to have a different power-on code could
then be added to an amendment list, interfaced by a cold-reset API.

Re: [Qemu-devel] [PATCH RESEND] savevm: Reset last block info at beginning of each save

2010-08-18 Thread Glauber Costa

On Wed, Aug 18, 2010 at 08:02:10AM -0600, Alex Williamson wrote:
> If we save more than once we need to reset the last block info or else
> only the first save has the actual block info and each subsequent save
> will only use continue flags, making them unloadable independently.
> 
> Found-by: Miguel Di Ciurcio Filho 
> Signed-off-by: Alex Williamson 
Acked-by: Glauber Costa

[Qemu-devel] Re: [PATCH] qdev: Reset hotplugged devices

2010-08-03 Thread Glauber Costa

On Tue, Aug 03, 2010 at 10:19:47AM -0600, Alex Williamson wrote:
> Several devices rely on their reset() function being called to
> initialize device state, e1000 and rtl8139 in particular.  When
> the device is hot added, the reset doesn't occur, often leaving
> the device in an unusable state.  Adding a call to reset() after
> init() for hotplugged devices puts the device in the expected
> state for the guest.
> 
> Signed-off-by: Alex Williamson 
I like this one better.

[Qemu-devel] Re: [PATCH] e1000: Fix hotplug

2010-08-03 Thread Glauber Costa

On Mon, Aug 02, 2010 at 03:15:17PM -0600, Alex Williamson wrote:
> When we hotplug the device,
> we don't go through a reset cycle, which means a hot added e1000 is
> useless until the VM reboots.  

I do guess, however, that this is true for any device, right?

Wouldn't it be better to just call the newly added reset function at
hotplug? One way to do that, would be to store a value indicated qemu
has already started. If you add a reset handler after that, the function
is called before being placed on the list.

Re: [Qemu-devel] [PATCH] stop cpus before forking.

2010-06-16 Thread Glauber Costa

On Mon, Jun 14, 2010 at 02:58:47PM -0500, Anthony Liguori wrote:
> On 06/14/2010 02:42 PM, Glauber Costa wrote:
> >On Mon, Jun 14, 2010 at 02:33:00PM -0500, Anthony Liguori wrote:
> >>On 06/14/2010 02:27 PM, Glauber Costa wrote:
> >>>This patch fixes a bug that happens with kvm, irqchip-in-kernel,
> >>>while adding a netdev. Despite the situations of reproduction being
> >>>specific to kvm, I believe this fix is pretty generic, and fits here.
> >>>Specially if we ever want to have our own irqchip in kernel too.
> >>>
> >>>The problem happens after the fork system call, and although it is not
> >>>100 % reproduceable, happens pretty often. After fork, the memory where
> >>>the apic is mapped is present in both processes. It ends up confusing
> >>>the vcpus somewhere in the irq<->   ack path, and qemu hangs, with no
> >>>irqs being delivered at all from that point on.
> >>>
> >>>Making sure the vcpus are stopped before forking makes the problem go
> >>>away. Besides, this is a pretty unfrequent operation, which already hangs
> >>>the io-thread for a while. So it should not hurt performance.
> >>>
> >>>Signed-off-by: Glauber Costa
> >>This doesn't make very much sense to me but smells like a kernel bug to me.
> >My interpretation is that by doing that, we make sure no in-flight
> >requests are happening. Actually, a sleep(x), with x sufficiently big
> >is enough to make this problem go away, but that is too hacky.
> 
> vm_stop() is probably just acting a glorified sleep() since it has
> to wait for each thread to stop.
No, it is not. It also makes sure no vcpus are running, and thus, not generating
new requests (or waiting for any replies, for that matter).

(Note: I am not advocating the inclusion of this, just trying to build my own
awareness)

Re: [Qemu-devel] [PATCH] stop cpus before forking.

2010-06-16 Thread Glauber Costa

On Tue, Jun 15, 2010 at 10:33:21AM +0300, Avi Kivity wrote:
> On 06/14/2010 10:33 PM, Anthony Liguori wrote:
> >On 06/14/2010 02:27 PM, Glauber Costa wrote:
> >>This patch fixes a bug that happens with kvm, irqchip-in-kernel,
> >>while adding a netdev. Despite the situations of reproduction being
> >>specific to kvm, I believe this fix is pretty generic, and fits here.
> >>Specially if we ever want to have our own irqchip in kernel too.
> >>
> >>The problem happens after the fork system call, and although it is not
> >>100 % reproduceable, happens pretty often. After fork, the memory where
> >>the apic is mapped is present in both processes. It ends up confusing
> >>the vcpus somewhere in the irq<->  ack path, and qemu hangs, with no
> >>irqs being delivered at all from that point on.
> >>
> >>Making sure the vcpus are stopped before forking makes the problem go
> >>away. Besides, this is a pretty unfrequent operation, which
> >>already hangs
> >>the io-thread for a while. So it should not hurt performance.
> >
> >This doesn't make very much sense to me but smells like a kernel
> >bug to me.
> 
> It is, and the fix would be to create the APIC memory slot as
> sharable across forks (should be easy to fix in the kernel).
Kernel pages are already shared across fork, no?

Re: [Qemu-devel] [PATCH] stop cpus before forking.

2010-06-14 Thread Glauber Costa

On Mon, Jun 14, 2010 at 02:58:47PM -0500, Anthony Liguori wrote:
> On 06/14/2010 02:42 PM, Glauber Costa wrote:
> >On Mon, Jun 14, 2010 at 02:33:00PM -0500, Anthony Liguori wrote:
> >>On 06/14/2010 02:27 PM, Glauber Costa wrote:
> >>>This patch fixes a bug that happens with kvm, irqchip-in-kernel,
> >>>while adding a netdev. Despite the situations of reproduction being
> >>>specific to kvm, I believe this fix is pretty generic, and fits here.
> >>>Specially if we ever want to have our own irqchip in kernel too.
> >>>
> >>>The problem happens after the fork system call, and although it is not
> >>>100 % reproduceable, happens pretty often. After fork, the memory where
> >>>the apic is mapped is present in both processes. It ends up confusing
> >>>the vcpus somewhere in the irq<->   ack path, and qemu hangs, with no
> >>>irqs being delivered at all from that point on.
> >>>
> >>>Making sure the vcpus are stopped before forking makes the problem go
> >>>away. Besides, this is a pretty unfrequent operation, which already hangs
> >>>the io-thread for a while. So it should not hurt performance.
> >>>
> >>>Signed-off-by: Glauber Costa
> >>This doesn't make very much sense to me but smells like a kernel bug to me.
> >My interpretation is that by doing that, we make sure no in-flight
> >requests are happening. Actually, a sleep(x), with x sufficiently big
> >is enough to make this problem go away, but that is too hacky.
> 
> vm_stop() is probably just acting a glorified sleep() since it has
> to wait for each thread to stop.
> 
> >I do agree that this is most likely a kernel bug. But as with any other
> >kernel bugs, I believe this is a easy workaround to have things working
> >even in older kernels until we fix it.
> 
> If we don't know what the bug is, then we do not know whether this
> is a work around.  Rather, this change happens to make the bug more
> difficult to reproduce with your test case.
> 
> >>Even if it isn't, I can't rationalize why stopping the vm like this
> >>is enough to fix such a problem.  Is the problem that the KVM VCPU
> >>threads get duplicated while potentially running or something like
> >>that?
> >I doubt fork is duplicating the vcpu threads. More than that, this
> >bug does not happen with userspace irqchip.
> >So I believe that either irq request or the ack itself is reaching the
> >wrong process, forever stalling the apic.
> 
> That sounds more like a signal delivery issue.  It's not obvious to
> me that we're doing the wrong thing with signal mask though.
> 
> If it's a signal mask related issue, then vm_stop isn't a proper fix
> as there would be still be a race.
I do can investigate it further, but I doubt it is signal-delivery related.
I spent the first days believing it was, but now, I believe it is much
more likely to be apic-related. We don't need to wait for the child to exit for
this bug to happen, so SIGCHLD is never raised. And with in-kernel irqchip,
we don't deliver signals during normal vcpu execution.

Re: [Qemu-devel] [PATCH] stop cpus before forking.

2010-06-14 Thread Glauber Costa

On Mon, Jun 14, 2010 at 02:33:00PM -0500, Anthony Liguori wrote:
> On 06/14/2010 02:27 PM, Glauber Costa wrote:
> >This patch fixes a bug that happens with kvm, irqchip-in-kernel,
> >while adding a netdev. Despite the situations of reproduction being
> >specific to kvm, I believe this fix is pretty generic, and fits here.
> >Specially if we ever want to have our own irqchip in kernel too.
> >
> >The problem happens after the fork system call, and although it is not
> >100 % reproduceable, happens pretty often. After fork, the memory where
> >the apic is mapped is present in both processes. It ends up confusing
> >the vcpus somewhere in the irq<->  ack path, and qemu hangs, with no
> >irqs being delivered at all from that point on.
> >
> >Making sure the vcpus are stopped before forking makes the problem go
> >away. Besides, this is a pretty unfrequent operation, which already hangs
> >the io-thread for a while. So it should not hurt performance.
> >
> >Signed-off-by: Glauber Costa
> 
> This doesn't make very much sense to me but smells like a kernel bug to me.
My interpretation is that by doing that, we make sure no in-flight
requests are happening. Actually, a sleep(x), with x sufficiently big
is enough to make this problem go away, but that is too hacky.

I do agree that this is most likely a kernel bug. But as with any other
kernel bugs, I believe this is a easy workaround to have things working
even in older kernels until we fix it.

> 
> Even if it isn't, I can't rationalize why stopping the vm like this
> is enough to fix such a problem.  Is the problem that the KVM VCPU
> threads get duplicated while potentially running or something like
> that?

I doubt fork is duplicating the vcpu threads. More than that, this
bug does not happen with userspace irqchip.
So I believe that either irq request or the ack itself is reaching the
wrong process, forever stalling the apic.

[Qemu-devel] [PATCH] stop cpus before forking.

2010-06-14 Thread Glauber Costa

This patch fixes a bug that happens with kvm, irqchip-in-kernel,
while adding a netdev. Despite the situations of reproduction being
specific to kvm, I believe this fix is pretty generic, and fits here.
Specially if we ever want to have our own irqchip in kernel too.

The problem happens after the fork system call, and although it is not
100 % reproduceable, happens pretty often. After fork, the memory where
the apic is mapped is present in both processes. It ends up confusing
the vcpus somewhere in the irq <-> ack path, and qemu hangs, with no
irqs being delivered at all from that point on.

Making sure the vcpus are stopped before forking makes the problem go
away. Besides, this is a pretty unfrequent operation, which already hangs
the io-thread for a while. So it should not hurt performance.

Signed-off-by: Glauber Costa 
---
 net/tap.c |4 
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/net/tap.c b/net/tap.c
index 0147dab..f34dd9c 100644
--- a/net/tap.c
+++ b/net/tap.c
@@ -330,6 +330,9 @@ static int launch_script(const char *setup_script, const 
char *ifname, int fd)
 sigaddset(&mask, SIGCHLD);
 sigprocmask(SIG_BLOCK, &mask, &oldmask);
 
+/* make sure no cpus are running, so the apic does not
+ * get confused */
+vm_stop(0);
 /* try to launch network script */
 pid = fork();
 if (pid == 0) {
@@ -350,6 +353,7 @@ static int launch_script(const char *setup_script, const 
char *ifname, int fd)
 execv(setup_script, args);
 _exit(1);
 } else if (pid > 0) {
+vm_start();
 while (waitpid(pid, &status, 0) != pid) {
 /* loop */
 }
-- 
1.7.0.1

[Qemu-devel] [PATCH] introduce -machine switch

2010-06-04 Thread Glauber Costa

This patch adds initial support for the -machine option, that allows
command line specification of machine attributes.
Besides its value per-se, it is the saner way we found to
allow for enabling/disabling of kvm's in-kernel irqchip.

machine-related options like kernel, initrd, etc, are now
accepted under this switch.

Note: This is against anthony's staging.

---
 hw/boards.h |4 +++
 hw/pc_piix.c|3 ++
 qemu-options.hx |   14 ++
 vl.c|   72 +++
 4 files changed, 67 insertions(+), 26 deletions(-)

diff --git a/hw/boards.h b/hw/boards.h
index 18b6b8f..bac8583 100644
--- a/hw/boards.h
+++ b/hw/boards.h
@@ -35,6 +35,10 @@ extern QEMUMachine *current_machine;
 
 #define COMMON_MACHINE_OPTS()  \
 {   \
+.name = "machine",  \
+.type = QEMU_OPT_STRING,\
+},  \
+{   \
 .name = "ram_size", \
 .type = QEMU_OPT_NUMBER,\
 },  \
diff --git a/hw/pc_piix.c b/hw/pc_piix.c
index f01194c..3ddb695 100644
--- a/hw/pc_piix.c
+++ b/hw/pc_piix.c
@@ -67,6 +67,9 @@ static void pc_init1(QemuOpts *opts, int pci_enabled)
 
 vmport_init();
 
+if (!kernel_cmdline)
+kernel_cmdline = "";
+
 /* allocate ram and load rom/bios */
 pc_memory_init(ram_size, kernel_filename, kernel_cmdline, initrd_filename,
&below_4g_mem_size, &above_4g_mem_size);
diff --git a/qemu-options.hx b/qemu-options.hx
index a6928b7..76ca866 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -35,6 +35,20 @@ STEXI
 Select the emulated @var{machine} (@code{-M ?} for list)
 ETEXI
 
+DEF("machine", HAS_ARG, QEMU_OPTION_machine,
+"-machine [machine=m][,ram_size=ram][,boot_device=dev]\n"
+" [,kernel=vmlinux][,cmdline=kernel_cmdline][,initrd=initrd]\n"
+" [,cpu=cpu_type]\n"
+"pc-specific options: [,acpi=on|off]\n"
+"kvm-x86 specific options: [,apic_in_kernel=on|off]\n"
+"select emulated machine (-machine ? for list)\n",
+QEMU_ARCH_ALL)
+STEXI
+...@item -machine @var{machine}[,@var{option}]
+...@findex -machine
+Select the emulated @var{machine} (@code{-machine ?} for list)
+ETEXI
+
 DEF("cpu", HAS_ARG, QEMU_OPTION_cpu,
 "-cpu cpuselect CPU (-cpu ? for list)\n", QEMU_ARCH_ALL)
 STEXI
diff --git a/vl.c b/vl.c
index 96b8d35..177ffe2 100644
--- a/vl.c
+++ b/vl.c
@@ -1605,6 +1605,16 @@ static QEMUMachine *find_machine(const char *name)
 if (m->alias && !strcmp(m->alias, name))
 return m;
 }
+
+printf("Supported machines are:\n");
+for(m = first_machine; m != NULL; m = m->next) {
+   if (m->alias)
+   printf("%-10s %s (alias of %s)\n",
+  m->alias, m->desc, m->name);
+   printf("%-10s %s%s\n",
+  m->name, m->desc,
+  m->is_default ? " (default)" : "");
+}
 return NULL;
 }
 
@@ -2567,7 +2577,7 @@ int main(int argc, char **argv, char **envp)
 DisplayState *ds;
 DisplayChangeListener *dcl;
 int cyls, heads, secs, translation;
-QemuOpts *hda_opts = NULL, *opts;
+QemuOpts *hda_opts = NULL, *machine_opts = NULL, *opts = NULL;
 int optind;
 const char *optarg;
 const char *loadvm = NULL;
@@ -2697,21 +2707,29 @@ int main(int argc, char **argv, char **envp)
 exit(1);
 }
 switch(popt->index) {
+case QEMU_OPTION_machine: {
+   const char *mach;
+
+machine_opts = qemu_opts_parse(&qemu_machine_opts, optarg, 0);
+if (!machine_opts) {
+fprintf(stderr, "parse error: %s\n", optarg);
+exit(1);
+}
+mach = qemu_opt_get(machine_opts, "machine");
+
+   if (!mach)
+   break;
+
+machine = find_machine(mach);
+
+if (!machine)
+exit(*mach != '?');
+   break;
+   }
 case QEMU_OPTION_M:
 machine = find_machine(optarg);
-if (!machine) {
-QEMUMachine *m;
-printf("Supported machines are:\n");
-for(m = first_machine; m != NULL; m = m->next) {
-if (m->alias)
-printf("%-10s %s (alias of %s)\n",
-   m->alias, m->desc, m->name);
-printf("%-10s %s%s\n",
-   m->name, m->desc,
-   m->is_default ? " (default)" : "");
-}
-exit(*optarg != '?');
-}
+if (!machine) 
+   exit(*optarg != '?');
 break;

[Qemu-devel] Re: [PATCH v2 2/2] basic machine opts framework

2010-06-02 Thread Glauber Costa

On Wed, Jun 02, 2010 at 09:15:10AM +0200, Jan Kiszka wrote:
> >  
> > +QemuOptsList qemu_machine_opts = {
> > +.name = "M",
> > +.head = QTAILQ_HEAD_INITIALIZER(qemu_machine_opts.head),
> > +.desc = {
> > +{
> > +.name = "mach",
> > +.type = QEMU_OPT_STRING,
> > +},{
> > +.name = "irqchip",
> > +.type = QEMU_OPT_STRING,
> > +},
> 
> Can't we make the concrete machine define what options it needs? Pushing
> this into the generic code may soon end up in a bunch of very special
> switches that are unused on most machines or even have no meaning for them.
> 
> Also, I would suggest to introduce the generic part first, and then add
> first users like the x86 irqchip.
Yeah, in general, I do agree with you.

Me and anthony talked about it for a while some time ago, and more or less
concluded that it could be possible to avoid that, putting a little think
which options to add.

the "irqchip" option, if you note, is not x86-specific, in any case.
Any machine has an irqchip. The first idea was to use something like
"apic=in_kernel|userspace" which would be, that, very x86-centric.

So, since letting machines define their own options adds complexity,
my take would be to add those common options, and add infrastructure
for machine-specific options when we see something that makes it
unavoidable.

What do you think?

[Qemu-devel] [PATCH v2 0/2] basic machine opts framework

2010-06-01 Thread Glauber Costa

Hello,

This is a resent (rebased) of an old patch set of mine, I sent
some time ago. With that, we should have all the needed infrastructure
to select the in-kernel irqchip for KVM.


Glauber Costa (2):
  early set current_machine
  basic machine opts framework

 hw/boards.h |   10 +
 hw/pc.c |   45 -
 qemu-config.c   |   16 ++
 qemu-config.h   |1 +
 qemu-options.hx |9 
 vl.c|   59 +-
 6 files changed, 132 insertions(+), 8 deletions(-)

[Qemu-devel] [PATCH v2 2/2] basic machine opts framework

2010-06-01 Thread Glauber Costa

This patch adds initial support for the -machine option, that allows
command line specification of machine attributes (always relying on safe
defaults). Besides its value per-se, it is the saner way we found to
allow for enabling/disabling of kvm's in-kernel irqchip.

A machine with in-kernel-irqchip could be specified as:
-machine irqchip=apic-kvm
And one without it:
-machine irqchip=apic

To demonstrate how it'd work, this patch introduces a choice between
"pic" and "apic", pic being the old-style isa thing.
---
 hw/boards.h |   10 ++
 hw/pc.c |   45 +++--
 qemu-config.c   |   16 
 qemu-config.h   |1 +
 qemu-options.hx |9 +
 vl.c|   54 ++
 6 files changed, 129 insertions(+), 6 deletions(-)

diff --git a/hw/boards.h b/hw/boards.h
index d889341..187794e 100644
--- a/hw/boards.h
+++ b/hw/boards.h
@@ -12,6 +12,15 @@ typedef void QEMUMachineInitFunc(ram_addr_t ram_size,
  const char *initrd_filename,
  const char *cpu_model);
 
+typedef void (QEMUIrqchipFunc)(void *opaque);
+
+typedef struct QEMUIrqchip {
+const char *name;
+QEMUIrqchipFunc *init; 
+int used;
+int is_default;
+} QEMUIrqchip;
+
 typedef struct QEMUMachine {
 const char *name;
 const char *alias;
@@ -21,6 +30,7 @@ typedef struct QEMUMachine {
 int max_cpus;
 int is_default;
 CompatProperty *compat_props;
+QEMUIrqchip *irqchip;
 struct QEMUMachine *next;
 } QEMUMachine;
 
diff --git a/hw/pc.c b/hw/pc.c
index 408d6d6..b3de30a 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -1007,21 +1007,43 @@ int cpu_is_bsp(CPUState *env)
 return env->cpuid_apic_id == 0;
 }
 
+static void qemu_apic_init(void *opaque)
+{
+CPUState *env = opaque;
+if (!(env->cpuid_features & CPUID_APIC)) {
+fprintf(stderr, "CPU lacks APIC cpuid flag\n");
+exit(1);
+}
+env->cpuid_apic_id = env->cpu_index;
+/* APIC reset callback resets cpu */
+apic_init(env);
+}
+
+static void qemu_pic_init(void *opaque)
+{
+CPUState *env = opaque;
+
+if (smp_cpus > 1) {
+fprintf(stderr, "PIC can't support smp systems\n");
+exit(1);
+}
+qemu_register_reset((QEMUResetHandler*)cpu_reset, env);
+}
+
 static CPUState *pc_new_cpu(const char *cpu_model)
 {
 CPUState *env;
+QEMUIrqchip *ic;
 
 env = cpu_init(cpu_model);
 if (!env) {
 fprintf(stderr, "Unable to find x86 CPU definition\n");
 exit(1);
 }
-if ((env->cpuid_features & CPUID_APIC) || smp_cpus > 1) {
-env->cpuid_apic_id = env->cpu_index;
-/* APIC reset callback resets cpu */
-apic_init(env);
-} else {
-qemu_register_reset((QEMUResetHandler*)cpu_reset, env);
+
+for (ic = current_machine->irqchip; ic->name != NULL; ic++) {
+if (ic->used)
+ic->init(env);
 }
 return env;
 }
@@ -1370,6 +1392,17 @@ static QEMUMachine pc_machine = {
 .desc = "Standard PC",
 .init = pc_init_pci,
 .max_cpus = 255,
+.irqchip = (QEMUIrqchip[]){
+{
+.name = "apic",
+.init = qemu_apic_init,
+.is_default = 1,
+},{
+.name = "pic",
+.init = qemu_pic_init,
+},
+{ /* end of list */ },
+},
 .is_default = 1,
 };
 
diff --git a/qemu-config.c b/qemu-config.c
index cae92f7..e83b301 100644
--- a/qemu-config.c
+++ b/qemu-config.c
@@ -196,6 +196,21 @@ QemuOptsList qemu_rtc_opts = {
 },
 };
 
+QemuOptsList qemu_machine_opts = {
+.name = "M",
+.head = QTAILQ_HEAD_INITIALIZER(qemu_machine_opts.head),
+.desc = {
+{
+.name = "mach",
+.type = QEMU_OPT_STRING,
+},{
+.name = "irqchip",
+.type = QEMU_OPT_STRING,
+},
+{ /* end of list */ }
+},
+};
+
 static QemuOptsList *lists[] = {
 &qemu_drive_opts,
 &qemu_chardev_opts,
@@ -203,6 +218,7 @@ static QemuOptsList *lists[] = {
 &qemu_netdev_opts,
 &qemu_net_opts,
 &qemu_rtc_opts,
+&qemu_machine_opts,
 NULL,
 };
 
diff --git a/qemu-config.h b/qemu-config.h
index 3cc8864..9b02ee0 100644
--- a/qemu-config.h
+++ b/qemu-config.h
@@ -7,6 +7,7 @@ extern QemuOptsList qemu_device_opts;
 extern QemuOptsList qemu_netdev_opts;
 extern QemuOptsList qemu_net_opts;
 extern QemuOptsList qemu_rtc_opts;
+extern QemuOptsList qemu_machine_opts;
 
 int qemu_set_option(const char *str);
 
diff --git a/qemu-options.hx b/qemu-options.hx
index 20aa242..4dfc55a 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -31,6 +31,15 @@ STEXI
 Select the emulated @var{machine} (@code{-M ?} for list)
 ETEXI
 
+DEF("machine", HAS_ARG, QEMU_OPTION_machine,
+"-machine mach=m[,irqchip=chip]\n"
+"select emulated machine (-machine ? for list)\n")
+STEXI
+...@item -machine @var{machine}[,@var{op

[Qemu-devel] [PATCH v2 1/2] early set current_machine

2010-06-01 Thread Glauber Costa

this way, the machine_init function itself can know which machine is current
in use, not only the late init code.
Signed-off-by: Glauber Costa 
---
 vl.c |5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/vl.c b/vl.c
index 96838f8..7a8b20b 100644
--- a/vl.c
+++ b/vl.c
@@ -5824,6 +5824,9 @@ int main(int argc, char **argv, char **envp)
 if (machine->compat_props) {
 qdev_prop_register_compat(machine->compat_props);
 }
+
+current_machine = machine;
+
 machine->init(ram_size, boot_devices,
   kernel_filename, kernel_cmdline, initrd_filename, cpu_model);
 
@@ -5841,8 +5844,6 @@ int main(int argc, char **argv, char **envp)
 }
 }
 
-current_machine = machine;
-
 /* init USB devices */
 if (usb_enabled) {
 if (foreach_device_config(DEV_USB, usb_parse) < 0)
-- 
1.6.2.2

Re: [Qemu-devel] [PATCH 2/2] machine opts framework

2010-03-24 Thread Glauber Costa

On Wed, Mar 24, 2010 at 02:43:35PM -0500, Anthony Liguori wrote:
> On 03/24/2010 02:26 PM, Glauber Costa wrote:
> >This patch adds initial support for the -machine option, that allows
> >command line specification of machine attributes (always relying on safe
> >defaults). Besides its value per-se, it is the saner way we found to
> >allow for enabling/disabling of kvm's in-kernel irqchip.
> >
> >A machine with in-kernel-irqchip could be specified as:
> > -machine irqchip=apic-kvm
> >And one without it:
> > -machine irqchip=apic
> >
> >To demonstrate how it'd work, this patch introduces a choice between
> >"pic" and "apic", pic being the old-style isa thing.
> 
> I started from a different place.  See machine-qemuopts in my staging tree.
> 
> I think we should combine efforts.
> 
> Regards,
> 

Absolutely. I see little overlap between what we did. Just a comment on yours:

-static void an5206_init(ram_addr_t ram_size,
- const char *boot_device,
- const char *kernel_filename, const char *kernel_cmdline,
- const char *initrd_filename, const char *cpu_model)
+static void an5206_init(QemuOpts *opts)
 {

Since we're changing init functions anyway, I believe we should also pass
a pointer to the machine structure. With that, we can avoing using the global
current_machine.

[Qemu-devel] [PATCH 1/2] early set current_machine

2010-03-24 Thread Glauber Costa

this way, the machine_init function itself can know which machine is current
in use, not only the late init code.
---
 vl.c |5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/vl.c b/vl.c
index d69250c..ceddeac 100644
--- a/vl.c
+++ b/vl.c
@@ -4841,6 +4841,9 @@ int main(int argc, char **argv, char **envp)
 }
 qemu_add_globals();
 
+
+current_machine = machine;
+
 machine->init(ram_size, boot_devices,
   kernel_filename, kernel_cmdline, initrd_filename, cpu_model);
 
@@ -4859,8 +4862,6 @@ int main(int argc, char **argv, char **envp)
 }
 }
 
-current_machine = machine;
-
 /* init USB devices */
 if (usb_enabled) {
 if (foreach_device_config(DEV_USB, usb_parse) < 0)
-- 
1.6.6

[Qemu-devel] [PATCH 0/2] machine opts framework

2010-03-24 Thread Glauber Costa

This patch adds initial support for the -machine option, that allows
command line specification of machine attributes (always relying on safe
defaults). Besides its value per-se, it is the saner way we found to
allow for enabling/disabling of kvm's in-kernel irqchip.

A machine with in-kernel-irqchip could be specified as:
-machine irqchip=apic-kvm
And one without it:
-machine irqchip=apic

To demonstrate how it'd work, this patch introduces a choice between
"pic" and "apic", pic being the old-style isa thing. It does introduce
a behavioral change, however: In old code, pic would always be a fallback.
Now, it will have to be explicitly selected. I believe it is better behaviour,
but this is not the most important part of it, so I can easily go back
if people want it out.

Let the flames begin!

Glauber Costa (2):
  early set current_machine
  machine opts framework

 hw/boards.h |   10 +
 hw/pc.c |   45 -
 qemu-config.c   |   16 ++
 qemu-config.h   |1 +
 qemu-options.hx |9 
 vl.c|   59 +-
 6 files changed, 132 insertions(+), 8 deletions(-)

[Qemu-devel] [PATCH 2/2] machine opts framework

2010-03-24 Thread Glauber Costa

This patch adds initial support for the -machine option, that allows
command line specification of machine attributes (always relying on safe
defaults). Besides its value per-se, it is the saner way we found to
allow for enabling/disabling of kvm's in-kernel irqchip.

A machine with in-kernel-irqchip could be specified as:
-machine irqchip=apic-kvm
And one without it:
-machine irqchip=apic

To demonstrate how it'd work, this patch introduces a choice between
"pic" and "apic", pic being the old-style isa thing. 

---
 hw/boards.h |   10 ++
 hw/pc.c |   45 +++--
 qemu-config.c   |   16 
 qemu-config.h   |1 +
 qemu-options.hx |9 +
 vl.c|   54 ++
 6 files changed, 129 insertions(+), 6 deletions(-)

diff --git a/hw/boards.h b/hw/boards.h
index 6f0f0d7..831728c 100644
--- a/hw/boards.h
+++ b/hw/boards.h
@@ -12,6 +12,15 @@ typedef void QEMUMachineInitFunc(ram_addr_t ram_size,
  const char *initrd_filename,
  const char *cpu_model);
 
+typedef void (QEMUIrqchipFunc)(void *opaque);
+
+typedef struct QEMUIrqchip {
+const char *name;
+QEMUIrqchipFunc *init; 
+int used;
+int is_default;
+} QEMUIrqchip;
+
 typedef struct QEMUMachine {
 const char *name;
 const char *alias;
@@ -28,6 +37,7 @@ typedef struct QEMUMachine {
 no_sdcard:1;
 int is_default;
 GlobalProperty *compat_props;
+QEMUIrqchip *irqchip;
 struct QEMUMachine *next;
 } QEMUMachine;
 
diff --git a/hw/pc.c b/hw/pc.c
index ba14df0..43ec022 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -752,21 +752,43 @@ int cpu_is_bsp(CPUState *env)
 return env->cpu_index == 0;
 }
 
+static void qemu_apic_init(void *opaque)
+{
+CPUState *env = opaque;
+if (!(env->cpuid_features & CPUID_APIC)) {
+fprintf(stderr, "CPU lacks APIC cpuid flag\n");
+exit(1);
+}
+env->cpuid_apic_id = env->cpu_index;
+/* APIC reset callback resets cpu */
+apic_init(env);
+}
+
+static void qemu_pic_init(void *opaque)
+{
+CPUState *env = opaque;
+
+if (smp_cpus > 1) {
+fprintf(stderr, "PIC can't support smp systems\n");
+exit(1);
+}
+qemu_register_reset((QEMUResetHandler*)cpu_reset, env);
+}
+
 static CPUState *pc_new_cpu(const char *cpu_model)
 {
 CPUState *env;
+QEMUIrqchip *ic;
 
 env = cpu_init(cpu_model);
 if (!env) {
 fprintf(stderr, "Unable to find x86 CPU definition\n");
 exit(1);
 }
-if ((env->cpuid_features & CPUID_APIC) || smp_cpus > 1) {
-env->cpuid_apic_id = env->cpu_index;
-/* APIC reset callback resets cpu */
-apic_init(env);
-} else {
-qemu_register_reset((QEMUResetHandler*)cpu_reset, env);
+
+for (ic = current_machine->irqchip; ic->name != NULL; ic++) {
+if (ic->used)
+ic->init(env);
 }
 return env;
 }
@@ -1074,6 +1096,17 @@ static QEMUMachine pc_machine = {
 .desc = "Standard PC",
 .init = pc_init_pci,
 .max_cpus = 255,
+.irqchip = (QEMUIrqchip[]){
+{
+.name = "apic",
+.init = qemu_apic_init,
+.is_default = 1,
+},{
+.name = "pic",
+.init = qemu_pic_init,
+},
+{ /* end of list */ },
+},
 .is_default = 1,
 };
 
diff --git a/qemu-config.c b/qemu-config.c
index 150157c..2b985a9 100644
--- a/qemu-config.c
+++ b/qemu-config.c
@@ -296,6 +296,21 @@ QemuOptsList qemu_cpudef_opts = {
 },
 };
 
+QemuOptsList qemu_machine_opts = {
+.name = "M",
+.head = QTAILQ_HEAD_INITIALIZER(qemu_machine_opts.head),
+.desc = {
+{
+.name = "mach",
+.type = QEMU_OPT_STRING,
+},{
+.name = "irqchip",
+.type = QEMU_OPT_STRING,
+},
+{ /* end of list */ }
+},
+};
+
 static QemuOptsList *lists[] = {
 &qemu_drive_opts,
 &qemu_chardev_opts,
@@ -306,6 +321,7 @@ static QemuOptsList *lists[] = {
 &qemu_global_opts,
 &qemu_mon_opts,
 &qemu_cpudef_opts,
+&qemu_machine_opts,
 NULL,
 };
 
diff --git a/qemu-config.h b/qemu-config.h
index f217c58..ea302f0 100644
--- a/qemu-config.h
+++ b/qemu-config.h
@@ -10,6 +10,7 @@ extern QemuOptsList qemu_rtc_opts;
 extern QemuOptsList qemu_global_opts;
 extern QemuOptsList qemu_mon_opts;
 extern QemuOptsList qemu_cpudef_opts;
+extern QemuOptsList qemu_machine_opts;
 
 QemuOptsList *qemu_find_opts(const char *group);
 int qemu_set_option(const char *str);
diff --git a/qemu-options.hx b/qemu-options.hx
index 8450b45..585ecd2 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -34,6 +34,15 @@ STEXI
 Select the emulated @var{machine} (@code{-M ?} for list)
 ETEXI
 
+DEF("machine", HAS_ARG, QEMU_OPTION_machine,
+"-machine mach=m[,irqchip=chip]\n"
+"select emulated machine (-machine ? for lis

Re: [Qemu-devel] Re: qdev property bug?

2009-12-14 Thread Glauber Costa

On Mon, Dec 14, 2009 at 08:22:12AM -0600, Anthony Liguori wrote:
> Michael S. Tsirkin wrote:
> >Well I am pretty sure that I used virtio + e1000 with 0.11
> >and apparently I can't now.
> >So it does look like a regression to me ...
> 
> That's what I said, we should make sure that we stop loading roms
> when we run out of room as opposed to trampling over the bios space.

Can't we first load the bios?

Then we're pretty sure oproms will never trample it.

Re: [Qemu-devel] Re: qdev property bug?

2009-12-14 Thread Glauber Costa

On Mon, Dec 14, 2009 at 04:11:43PM +0200, Michael S. Tsirkin wrote:
> On Mon, Dec 14, 2009 at 08:11:59AM -0600, Anthony Liguori wrote:
> > Alexander Graf wrote:
> >> Michael S. Tsirkin wrote:
> >>   
> >>> On Mon, Dec 14, 2009 at 12:55:28PM +0100, Alexander Graf wrote:
> >>>   
>  Am 14.12.2009 um 11:59 schrieb "Michael S. Tsirkin" :
> 
>    
> > On Mon, Dec 14, 2009 at 11:16:34AM +0100, Gerd Hoffmann wrote:
> >   
> >> On 12/14/09 10:44, Michael S. Tsirkin wrote:
> >>   
> >>> No, it did not even start booting the kernel. Just gave me 
> >>> blank  screen.
> >>>   
> >> [ testing ]
> >>
> >> Oh.  That is something completely different.  A bug in the rom  
> >> loader.
> >> It fails to fit both e1000 (default nic) and virtio-net boot 
> >> roms  into
> >> the option rom area and bails out (before loading seabios).  vl.c
> >> doesn't check the return value and happily continues (without bios).
> >> Which doesn't work out very well ...
> >>
> >> With two identical nics the (single) rom fits and qemu boots.
> >>
> >> Hmm.  Of course vl.c must be fixed to check the return value.
> >>   
> > Yes.
> >
> >   
> >> Not sure how to deal with the rom size issue.  The gPXE roms 
> >> look  quit
> >> big compared to the older roms we had.
> >>   
> > Hmm, it's a regression then ...
> >   
>  How does real hw handle this? I'm pretty sure most servers these 
>  days  use more option rom space than this. They usually have some 
>  onboard raid bios, external storage, on-board nic, pci nic, ...
>    
> >>> Real hardware might do several things I know about
> >>> - option rom is typically small.
> >>> - option rom is not loaded always (BIOS option), or not for all cards.
> >>> There are might be other tricks.
> >>>   
> >>
> >> There are probably other tricks. I was booting up a machine that had
> >> like 5 options roms going through their initialization that all weren't
> >> exactly small.
> >>
> >>   
>  So there must be some way to just have more option rom space.   
>    
> >>> What do you mean?
> >>>   
> >>
> >> Well, what's keeping us from having 5 MB of option roms?
> >>   
> >
> > For starters, option roms run in real mode when you only have 1MB of  
> > addressable memory :-)
> >
>  Implementing anything else would just be a waste of time. It'd 
>  break  again when ppl do device assignment.
> 
>  Alex
>    
> >>> We need some solution for 0.12 though IMO.
> >>> This does not need to address device assignment,
> >>> but it must be simple.
> >>>   
> >>
> >> Agreed. If there is a solution that gives us the chance to support an
> >> arbitrary number of option roms that wouldn't take forever to implement,
> >> I'd rather take that one though.
> >>   
> >
> > For 0.12, we just need to fail gracefully (meaning stop loading option  
> > roms when we run out of room).  It's not a regression compared to 0.11.
> >
> > Regards,
> >
> > Anthony Liguori
> 
> 
> Well I am pretty sure that I used virtio + e1000 with 0.11
> and apparently I can't now.
> So it does look like a regression to me ...

e1000 is the problem here, since it is by far the largest roms
of them all.

Re: [Qemu-devel] Re: qdev property bug?

2009-12-14 Thread Glauber Costa

On Mon, Dec 14, 2009 at 04:01:43PM +0200, Michael S. Tsirkin wrote:
> On Mon, Dec 14, 2009 at 02:35:31PM +0100, Alexander Graf wrote:
> > Michael S. Tsirkin wrote:
> > > On Mon, Dec 14, 2009 at 12:55:28PM +0100, Alexander Graf wrote:
> > >   
> > >> Am 14.12.2009 um 11:59 schrieb "Michael S. Tsirkin" :
> > >>
> > >> 
> > >>> On Mon, Dec 14, 2009 at 11:16:34AM +0100, Gerd Hoffmann wrote:
> > >>>   
> >  On 12/14/09 10:44, Michael S. Tsirkin wrote:
> >  
> > > No, it did not even start booting the kernel. Just gave me blank  
> > > screen.
> > >   
> >  [ testing ]
> > 
> >  Oh.  That is something completely different.  A bug in the rom  
> >  loader.
> >  It fails to fit both e1000 (default nic) and virtio-net boot roms  
> >  into
> >  the option rom area and bails out (before loading seabios).  vl.c
> >  doesn't check the return value and happily continues (without bios).
> >  Which doesn't work out very well ...
> > 
> >  With two identical nics the (single) rom fits and qemu boots.
> > 
> >  Hmm.  Of course vl.c must be fixed to check the return value.
> >  
> > >>> Yes.
> > >>>
> > >>>   
> >  Not sure how to deal with the rom size issue.  The gPXE roms look  
> >  quit
> >  big compared to the older roms we had.
> >  
> > >>> Hmm, it's a regression then ...
> > >>>   
> > >> How does real hw handle this? I'm pretty sure most servers these days  
> > >> use more option rom space than this. They usually have some onboard raid 
> > >> bios, external storage, on-board nic, pci nic, ...
> > >> 
> > >
> > > Real hardware might do several things I know about
> > > - option rom is typically small.
> > > - option rom is not loaded always (BIOS option), or not for all cards.
> > > There are might be other tricks.
> > >   
> > 
> > There are probably other tricks. I was booting up a machine that had
> > like 5 options roms going through their initialization that all weren't
> > exactly small.
> 
> This might boil down to better ROMs. E.g.
> maybe they request memory with PMM and then give it up
> after initialization?
That would be my guess.

However, gpxe rom are also used in real hardware. Maybe there
is some config trick we're missing?

Re: [Qemu-devel] Spice project is now open

2009-12-11 Thread Glauber Costa

On Fri, Dec 11, 2009 at 5:22 PM, Izik Eidus  wrote:
> On Fri, 11 Dec 2009 13:06:47 -0600
> Anthony Liguori  wrote:
>
>> Izik Eidus wrote:
>> > I want to add that qemu is not the sole user of spice, Spice will be
>> > used as a protocol to connect into physical windows/linux
>> > machines
>> >
>> > So how can we change the library just for qemu?
>> >
>> A library is not necessarily a problem.
>>
>> What would be a probably is if the library maintains guest visible
>> state.  There are a lot of advantages to keeping qemu as the sole
>> maintainer of guest visible state as it simplifies things like live
>> migration.  More importantly, it allows us to do things like Avi's
>> suggested security sandboxing using seccomp().  For that to work, we
>> need to make sure that we can isolate any code that interacts
>> directly with the guest.
>
> Spice guest visible state inside qemu is just its PCI QXL device.
> This part is qemu specificed.
>

But this part can work together with vnc with no problems, right?
If this is so, why don't we just start by merging it, while trying
to make the case for the rendering protocol in parallel ?

-- 
Glauber  Costa.
"Free as in Freedom"
http://glommer.net

"The less confident you are, the more serious you have to act."

Re: [Qemu-devel] Spice project is now open

2009-12-11 Thread Glauber Costa

>
> But to introduce another protocol where a user has to make a choice to use
> Spice over VNC, I think we need a really good justification for that.  It's
> really about complexity.  A user shouldn't have to know about Spice or VNC.
>  They shouldn't have to contemplate the trade-offs of whether their
> management tool is aware or not.  It should Just Work.
>
>> I would happy to answer more questions if anyone have
>>
>
> Thanks.

Just to make a point clear:

AFAIU, there are two parts of qemu spice support. The protocol (vnc-like), and
the guest device (vga-like). I am right?


-- 
Glauber  Costa.
"Free as in Freedom"
http://glommer.net

"The less confident you are, the more serious you have to act."

Re: [Qemu-devel] Spice project is now open

2009-12-11 Thread Glauber Costa

>> I think we should allow freedom of choice to the users to decide what
>> protcol they want to use, Spice and VNC are all diffrent and were born
>> to meet diffrent goals.
>>
>> I would happy to answer more questions if anyone have
>
> I think the simple point is that, AFAICS, the spice folks are expecting
> the qemu team to integrate their big ugly tarball, instead of doing what
> everyone else does, which is forward port everything to current head
> and then provide a current set of patches against GIT head.
>
> Anything else is just a waste of time. The paths both projects are
> at are too far apart.
>

More important than forward porting, is respecting the design decisions
a huge community has agreed upon. Of course, when you become part
of that community, you can try to shift these designs towards your goals,
but trying to force them is just ridiculous.

That said, I do believe spice can play a essential role in making us go
forward, but the attitude towards it has to change quite a bit.

-- 
Glauber  Costa.
"Free as in Freedom"
http://glommer.net

"The less confident you are, the more serious you have to act."

Re: [Qemu-devel] Spice project is now open

2009-12-11 Thread Glauber Costa

>>
>> Spice is a library, it is library for remote display, it handle by
>> itself all the connection between the spice client to the host that
>> run the guest, it include:
>> sound, display, keyboard, usb, network tunneling (for printers) and so
>> on...
>>
>
> I want to add that qemu is not the sole user of spice, Spice will be
> used as a protocol to connect into physical windows/linux machines
>
> So how can we change the library just for qemu?
>
I don't fully understand spice yet, but what's the difficulty here?
libraries changes every single day to try to acomodate for the needs
of specific users, be it through generalizations, shims, or whatever.

This is just another day in the OSS world.


-- 
Glauber  Costa.
"Free as in Freedom"
http://glommer.net

"The less confident you are, the more serious you have to act."

Re: [Qemu-devel] Spice project is now open

2009-12-11 Thread Glauber Costa

>> It already has.  It's not a git tree with staged patches.  It's a
>> tarball release of a really old version of kvm-userspace that's called
>>
>> 'vdesktop'.
>
>
> This guy is evil and he is motivate by personal agenda. I hope you all will
> wakeup.
>

I don't see anthony with any specific agenda here than making qemu as best
as it can get. If he is evil, I'm the devil myself.

-- 
Glauber  Costa.
"Free as in Freedom"
http://glommer.net

"The less confident you are, the more serious you have to act."

Re: [Qemu-devel] [PATCH 00/17] pci: switch a ton of drivers to symbolic names

2009-12-10 Thread Glauber Costa

On Thu, Dec 10, 2009 at 4:09 PM, Michael S. Tsirkin  wrote:
> The recent e1000 bug made the important of using
> symbolic macros for pci config access clear for me.
> So I started going over drivers and converting
> to symbolic constants instead of hard-coded ones.
> I did a large part until I run out of steam.
> Maybe some brave soul will take up converting
> the rest of them, or maybe I will: note that
> when converting bridges one should be careful
> to use bridge macros where appropriate.
>
> Instead of testing a huge number of configurations,
> I compared binaries before and after conversion.
> Almost all of them generate exact same stripped binary
> before and after the change.
> The only object changed was eepro100, objdump showed
> that the change was because gcc for some reason
> decides to use a bit more stack for init function
> after comments are added there.
>
> This methodology was the reason that I added TODOs where I saw
> deviations from spec or other code ugliness, will have to be fixed
> separately.
>

IMHO, this is a huge enhancement.

I myself was found expending huge amounts of time trying do figure out the
meaning of some specific constants in the past.

+1

-- 
Glauber  Costa.
"Free as in Freedom"
http://glommer.net

"The less confident you are, the more serious you have to act."

Re: [Qemu-devel] Using qemu-thread for synchronization

2009-12-10 Thread Glauber Costa

On Thu, Dec 10, 2009 at 12:46 PM, Liran Schour  wrote:
>
> I want to be able to synchronize between the code that is running the live
> migration with the code that call fro the completion callback of async IO.
> For that I am using qemu-thread.c (i.e QemuCond). I see that I have
> problems while linking if I do not use --enable-io-thread.
> Can someone explain me this. And if I want to do the above what is the
> right way to do that?

qemu_thread is only available when you use --enable-io-thread.

But beware, it is not 100 % stable yet. I had a queue of patches
that makes it slightly better, tough...


-- 
Glauber  Costa.
"Free as in Freedom"
http://glommer.net

"The less confident you are, the more serious you have to act."

Re: [Qemu-devel] [PATCH] Permit zero-sized qemu_malloc() & friends

2009-12-07 Thread Glauber Costa

> So the only correct way would be to write:
>
> array = malloc(size);
> if (array == NULL && size != 0) {
>   return -ENOMEM;
> }
>

Of course we can differentiate. A failed malloc will abort, a successful one
will not.

> If you were writing portable code.  But that's not what people write.  You
> can argue that qemu_malloc() can have any semantics we want and while that's
> true, it doesn't change my above statement.  I think the main argument for
> this behavior in qemu is that people are used to using this idiom with
> malloc() but it's a non-portable practice.
>
> If qemu_malloc() didn't carry the name "malloc()" then semantics with size=0
> would be a different discussion.  But so far, all qemu_* functions tend to
> behave almost exactly like their C counterparts.  Relying on the result of
> size=0 with malloc() is broken.

We can change qemu_malloc to qemu_alloc_memory(), or whatever.
But from the moment we do things like abort on failing, we are already
deviating from its C counterpart.

-- 
Glauber  Costa.
"Free as in Freedom"
http://glommer.net

"The less confident you are, the more serious you have to act."

Re: [Qemu-devel] Re: [PATCH 04/11] qemu_flush_work for remote vcpu execution

2009-12-03 Thread Glauber Costa

On Thu, Dec 3, 2009 at 12:29 PM, Paolo Bonzini  wrote:
> On 12/02/2009 02:48 PM, Glauber Costa wrote:
>>
>> +    if (env == qemu_get_current_env()) {
>
> Will always be false for TCG + iothread.  What's wrong with
> qemu_cpu_self(env)?  It appears to do the same, and it would also make the
> whole thread-local storage stuff redundant.
>
Indeed. I'll turn to qemu_cpu_self. thanks


-- 
Glauber  Costa.
"Free as in Freedom"
http://glommer.net

"The less confident you are, the more serious you have to act."

Re: [Qemu-devel] Re: [PATCH 0/9] in-kernel irqchip, new spin

2009-12-03 Thread Glauber Costa

On Thu, Dec 3, 2009 at 10:45 AM, Avi Kivity  wrote:
> On 12/03/2009 02:39 PM, Michael S. Tsirkin wrote:
>>>
>>> Right now there are no knobs to disable it, since last time I checked,
>>> people were inclined to
>>> solve that by adding a machine type that does not do irqchip in
>>> kernel, if wanted.
>>>
>>
>> Can't everything doable by machine type also doable from
>> command line switch? Disabling irqchip used to be
>> a valuable debugging tool.
>>
>>
>
> Even worse, if we throw everything into the the machine type, we end up with
> 2^n machine types in order to control everything.

I am totally fine re-including the cmdline patch in the series.



-- 
Glauber  Costa.
"Free as in Freedom"
http://glommer.net

"The less confident you are, the more serious you have to act."

Re: [Qemu-devel] Re: [PATCH 11/11] remove smp restriction from kvm

2009-12-03 Thread Glauber Costa

On Thu, Dec 3, 2009 at 10:37 AM, Avi Kivity  wrote:
> On 12/03/2009 02:36 PM, Glauber Costa wrote:
>>
>> On Thu, Dec 3, 2009 at 10:23 AM, Avi Kivity  wrote:
>>
>>>
>>> On 12/02/2009 03:48 PM, Glauber Costa wrote:
>>>
>>>>
>>>> We don't support smp without irqchip in kernel, so only abort in
>>>> that situation
>>>>
>>>>
>>>
>>> What's the reason for this restriction?
>>>
>>
>> It is temporary. But as far as my testing goes, we don't come even close
>> to
>> working without in-kernel irqchip.
>>
>
> This works well in qemu-kvm.  What's the reason?  it may indicate a bug in
> up.

The reason is that we do not handle SIPI in qemu.git yet. Now that the basics
of smp are working, it should not be too difficult to get it working.

-- 
Glauber  Costa.
"Free as in Freedom"
http://glommer.net

"The less confident you are, the more serious you have to act."

Re: [Qemu-devel] Re: [PATCH 04/11] qemu_flush_work for remote vcpu execution

2009-12-03 Thread Glauber Costa

>
> Keep the name then.  The new name is misleading.

ok.

>>>> Totally synchronous,
>>>> and guarantees that a given function will be executed at the specified
>>>> vcpu.
>>>>
>>>> This patch also convert usage within the breakpoints system
>>>>
>>>> +void qemu_queue_work(CPUState *env, void (*func)(void *data), void
>>>> *data);
>>>>
>>>>
>>>
>>> The name suggests that it is asynchronous.
>>>
>>> Why is this patch necessary?
>>>
>>
>> to keep gdbstub working.
>>
>
> "Because it fixes things".
>
> Please elaborate.
>

gdbstub is called from the i/o thread , and call vcpu ioctls. So it
has to use the on_vcpu
mechanism to guarantee its execution in the right thread.

What I meant is that currently, gdbstub is the only user of it, at
least in qemu.git

-- 
Glauber  Costa.
"Free as in Freedom"
http://glommer.net

"The less confident you are, the more serious you have to act."

Re: [Qemu-devel] Re: [PATCH 11/11] remove smp restriction from kvm

2009-12-03 Thread Glauber Costa

On Thu, Dec 3, 2009 at 10:23 AM, Avi Kivity  wrote:
> On 12/02/2009 03:48 PM, Glauber Costa wrote:
>>
>> We don't support smp without irqchip in kernel, so only abort in
>> that situation
>>
>
>
> What's the reason for this restriction?

It is temporary. But as far as my testing goes, we don't come even close to
working without in-kernel irqchip.


-- 
Glauber  Costa.
"Free as in Freedom"
http://glommer.net

"The less confident you are, the more serious you have to act."

Re: [Qemu-devel] Re: [PATCH 04/11] qemu_flush_work for remote vcpu execution

2009-12-03 Thread Glauber Costa

On Thu, Dec 3, 2009 at 10:20 AM, Avi Kivity  wrote:
> On 12/02/2009 03:48 PM, Glauber Costa wrote:
>>
>> This function is similar to qemu-kvm's on_vcpu mechanism.
>
> Is similar?  You're replacing on_vcpu().
Yeah, it began similar, now it is pretty much the same thing, but
using qemu-specific
data structures

>
>> Totally synchronous,
>> and guarantees that a given function will be executed at the specified
>> vcpu.
>>
>> This patch also convert usage within the breakpoints system
>>
>> +void qemu_queue_work(CPUState *env, void (*func)(void *data), void
>> *data);
>>
>
> The name suggests that it is asynchronous.
>
> Why is this patch necessary?

to keep gdbstub working.

>
> --
> error compiling committee.c: too many arguments to function
>
>
>
>



-- 
Glauber  Costa.
"Free as in Freedom"
http://glommer.net

"The less confident you are, the more serious you have to act."

Re: [Qemu-devel] Re: [PATCH 0/9] in-kernel irqchip, new spin

2009-12-03 Thread Glauber Costa

>> Haven't tested. But since I see that msix need some special code in qemu-kvm,
>> it probably won't. But I assume we can just add a patch ontop of this to add
>> that code and make it work, right?
>
> Sure. However - are you making in-kernel irqchip the default?
> If so, I think the best way to do it would be:
> 1. add in-kernel irqchip, off by default
> 2. fix msix with in-kernel
> 3. make in-kernel on by default
>
Right now there are no knobs to disable it, since last time I checked,
people were inclined to
solve that by adding a machine type that does not do irqchip in
kernel, if wanted.

However, since it is probably not going to reach 0.12 anyway, we can
come up with a patch
that fixes msix, and bundle in this series before I actually enable
the irqchip (which is one
of the last patches)

-- 
Glauber  Costa.
"Free as in Freedom"
http://glommer.net

"The less confident you are, the more serious you have to act."

Re: [Qemu-devel] Re: [PATCH 0/9] in-kernel irqchip, new spin

2009-12-03 Thread Glauber Costa

>
> Does msix work with this patchset when in-kernel irqchip
> is enabled?

Haven't tested. But since I see that msix need some special code in qemu-kvm,
it probably won't. But I assume we can just add a patch ontop of this to add
that code and make it work, right?

-- 
Glauber  Costa.
"Free as in Freedom"
http://glommer.net

"The less confident you are, the more serious you have to act."

Re: [Qemu-devel] [PATCH v2 04/11] qemu_flush_work for remote vcpu execution

2009-12-02 Thread Glauber Costa

On Wed, Dec 02, 2009 at 11:54:45AM -0200, Marcelo Tosatti wrote:
> On Wed, Dec 02, 2009 at 11:41:19AM -0200, Glauber Costa wrote:
> > > KVM vcpu threads should block SIGUSR1, set the in-kernel signal mask 
> > > with KVM_SET_SIGNAL_MASK ioctl, and eat the signal in
> > > qemu_wait_io_event (qemu_flush_work should run after eating
> > > the signal). Similarly to qemu-kvm's kvm_main_loop_wait.
> > > 
> > > Otherwise a vcpu thread can lose a signal (say handle SIGUSR1 when not
> > > holding qemu_global_mutex before kernel entry).
> > > 
> > > I think this the source of the problems patch 8 attempts to fix.
> > 
> > I remember you had patches for both theses fixes.
> 
> Only qemu_wait_io_event split.
> 
> > Can you resend them, ontop of this series?
> 
> There is not much sense to split qemu_wait_io_event after what your
> patch set does (it should be split before).

What's the difference here?

[Qemu-devel] [PATCH 10/11] Use __thread where available.

2009-12-02 Thread Glauber Costa

It is much faster than pthread_{g,s}et_specific.

Signed-off-by: Glauber Costa 
---
 configure |   17 +
 vl.c  |   16 
 2 files changed, 33 insertions(+), 0 deletions(-)

diff --git a/configure b/configure
index dca5a43..ce7bcc4 100755
--- a/configure
+++ b/configure
@@ -233,6 +233,7 @@ blobs="yes"
 pkgversion=""
 check_utests="no"
 user_pie="no"
+tls_thread="yes"
 
 # OS specific
 if check_define __linux__ ; then
@@ -1017,6 +1018,19 @@ EOF
 fi
 
 ##
+# Check for availability of the __thread keyword
+cat > $TMPC < $TMPC <> 
$config_host_mak
 if test "$mixemu" = "yes" ; then
   echo "CONFIG_MIXEMU=y" >> $config_host_mak
 fi
+if test "$tls_thread" = "yes" ; then
+  echo "CONFIG_TLS_HAS_THREAD=y" >> $config_host_mak
+fi
 if test "$vnc_tls" = "yes" ; then
   echo "CONFIG_VNC_TLS=y" >> $config_host_mak
   echo "VNC_TLS_CFLAGS=$vnc_tls_cflags" >> $config_host_mak
diff --git a/vl.c b/vl.c
index ad2e7d6..16a41e3 100644
--- a/vl.c
+++ b/vl.c
@@ -3445,21 +3445,37 @@ static void block_io_signals(void);
 static void unblock_io_signals(void);
 static int tcg_has_work(void);
 
+#ifdef CONFIG_TLS_HAS_THREAD
+static __thread CPUState *current_env;
+#else
 static pthread_key_t current_env;
+#endif
 
 static CPUState *qemu_get_current_env(void)
 {
+#ifdef CONFIG_TLS_HAS_THREAD
+return current_env;
+#else
 return pthread_getspecific(current_env);
+#endif
 }
 
 static void qemu_set_current_env(CPUState *env)
 {
+#ifdef CONFIG_TLS_HAS_THREAD
+current_env = env;
+#else
 pthread_setspecific(current_env, env);
+#endif
 }
 
 static void qemu_init_current_env(void)
 {
+#ifdef CONFIG_TLS_HAS_THREAD
+current_env = NULL;
+#else
 pthread_key_create(¤t_env, NULL);
+#endif
 }
 
 static int qemu_init_main_loop(void)
-- 
1.6.5.2

[Qemu-devel] [PATCH 11/11] remove smp restriction from kvm

2009-12-02 Thread Glauber Costa

We don't support smp without irqchip in kernel, so only abort in
that situation

Signed-off-by: Glauber Costa 
---
 kvm-all.c |   10 +-
 kvm.h |2 ++
 target-i386/kvm.c |7 +++
 target-ppc/kvm.c  |5 +
 4 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/kvm-all.c b/kvm-all.c
index 22d84a3..b17bd74 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -427,7 +427,12 @@ static int kvm_create_irqchip(KVMState *s)
 return ret;
 
 s->irqchip_in_kernel = 1;
+
 #endif
+if (!kvm_arch_can_smp() && (smp_cpus > 1)) {
+fprintf(stderr, "No SMP KVM support, use '-smp 1'\n");
+ret = -EINVAL;
+}
 return ret;
 }
 
@@ -440,11 +445,6 @@ int kvm_init(int smp_cpus)
 int ret;
 int i;
 
-if (smp_cpus > 1) {
-fprintf(stderr, "No SMP KVM support, use '-smp 1'\n");
-return -EINVAL;
-}
-
 s = qemu_mallocz(sizeof(KVMState));
 
 #ifdef KVM_CAP_SET_GUEST_DEBUG
diff --git a/kvm.h b/kvm.h
index 7b9d8b3..91a4bf4 100644
--- a/kvm.h
+++ b/kvm.h
@@ -101,6 +101,8 @@ int kvm_arch_init_vcpu(CPUState *env);
 
 void kvm_arch_reset_vcpu(CPUState *env);
 
+int kvm_arch_can_smp(void);
+
 #ifdef TARGET_I386
 int kvm_set_lapic(CPUState *env, struct kvm_lapic_state *s);
 int kvm_get_lapic(CPUState *env, struct kvm_lapic_state *s);
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 65ad2a1..cea0cf1 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -1171,3 +1171,10 @@ int kvm_get_lapic(CPUState *env, struct kvm_lapic_state 
*s)
 
 return kvm_vcpu_ioctl(env, KVM_GET_LAPIC, s);
 }
+
+int kvm_arch_can_smp(void)
+{
+if (kvm_irqchip_in_kernel())
+return 1;
+return 0;
+}
diff --git a/target-ppc/kvm.c b/target-ppc/kvm.c
index d08639b..cfe467f 100644
--- a/target-ppc/kvm.c
+++ b/target-ppc/kvm.c
@@ -62,6 +62,11 @@ int kvm_arch_set_irq(KVMState *s, int irq, int level, int 
*status)
 return -ENOSYS;
 }
 
+int kvm_arch_can_smp(void)
+{
+return 0;
+}
+
 int kvm_arch_put_registers(CPUState *env)
 {
 struct kvm_regs regs;
-- 
1.6.5.2

[Qemu-devel] [PATCH 09/11] Use per-cpu reset handlers.

2009-12-02 Thread Glauber Costa

The proposal in this patch is to add a system_reset caller that only
resets state related to the cpu. This will guarantee that does functions
are called from the cpu-threads, not the I/O thread.

In principle, it might seem close to the remote execution mechanism, but:
 * It does not involve any extra signalling, so it should be faster.
 * The cpu is guaranteed to be stopped, so it is much less racy.
 * What runs where becomes more explicit.
 * This is much, much less racy

The previous implementation was giving me races on reset. This one makes
it work flawlesly w.r.t reset.

Signed-off-by: Glauber Costa 
---
 cpu-defs.h|2 ++
 exec-all.h|   12 
 exec.c|   32 
 hw/apic-kvm.c |   16 
 kvm-all.c |7 +++
 vl.c  |   10 ++
 6 files changed, 67 insertions(+), 12 deletions(-)

diff --git a/cpu-defs.h b/cpu-defs.h
index 18792fc..37fd11c 100644
--- a/cpu-defs.h
+++ b/cpu-defs.h
@@ -185,6 +185,8 @@ typedef struct QemuWorkItem {
 CPUWatchpoint *watchpoint_hit;  \
 \
 QemuWorkItem queued_work;   \
+int reset_requested;\
+QTAILQ_HEAD(reset_head, CPUResetEntry) reset_handlers;  \
 uint64_t queued_local, queued_total;\
 struct QemuMutex queue_lock;\
 \
diff --git a/exec-all.h b/exec-all.h
index 820b59e..5acf363 100644
--- a/exec-all.h
+++ b/exec-all.h
@@ -78,6 +78,18 @@ void cpu_io_recompile(CPUState *env, void *retaddr);
 TranslationBlock *tb_gen_code(CPUState *env, 
   target_ulong pc, target_ulong cs_base, int flags,
   int cflags);
+
+typedef void CPUResetHandler(CPUState *env);
+typedef struct CPUResetEntry {
+QTAILQ_ENTRY(CPUResetEntry) entry;
+CPUResetHandler *func;
+void *opaque;
+} CPUResetEntry;
+
+void qemu_register_reset_cpu(CPUState *env, CPUResetHandler *func);
+void qemu_unregister_reset_cpu(CPUState *env, CPUResetHandler *func);
+
+void qemu_cpu_reset(CPUState *env);
 void cpu_exec_init(CPUState *env);
 void QEMU_NORETURN cpu_loop_exit(void);
 int page_unprotect(target_ulong address, unsigned long pc, void *puc);
diff --git a/exec.c b/exec.c
index eb1ee51..f91009d 100644
--- a/exec.c
+++ b/exec.c
@@ -588,6 +588,7 @@ void cpu_exec_init(CPUState *env)
 env->numa_node = 0;
 QTAILQ_INIT(&env->breakpoints);
 QTAILQ_INIT(&env->watchpoints);
+QTAILQ_INIT(&env->reset_handlers);
 *penv = env;
 #if defined(CONFIG_USER_ONLY)
 cpu_list_unlock();
@@ -599,6 +600,37 @@ void cpu_exec_init(CPUState *env)
 #endif
 }
 
+void qemu_register_reset_cpu(CPUState *env, CPUResetHandler *func)
+{
+CPUResetEntry *re = qemu_mallocz(sizeof(CPUResetEntry));
+
+re->func = func;
+re->opaque = env;
+QTAILQ_INSERT_TAIL(&env->reset_handlers, re, entry);
+}
+
+void qemu_unregister_reset_cpu(CPUState *env, CPUResetHandler *func)
+{
+CPUResetEntry *re;
+
+QTAILQ_FOREACH(re, &env->reset_handlers, entry) {
+if (re->func == func && re->opaque == env) {
+QTAILQ_REMOVE(&env->reset_handlers, re, entry);
+qemu_free(re);
+return;
+}
+}
+}
+
+void qemu_cpu_reset(CPUState *env)
+{
+CPUResetEntry *re, *nre;
+/* reset all devices */
+QTAILQ_FOREACH_SAFE(re, &env->reset_handlers, entry, nre) {
+re->func(re->opaque);
+}
+}
+
 static inline void invalidate_page_bitmap(PageDesc *p)
 {
 if (p->code_bitmap) {
diff --git a/hw/apic-kvm.c b/hw/apic-kvm.c
index b2864b6..91a77d0 100644
--- a/hw/apic-kvm.c
+++ b/hw/apic-kvm.c
@@ -91,20 +91,20 @@ static const VMStateDescription vmstate_apic = {
 .post_load = apic_post_load,
 };
 
-static void kvm_apic_reset(void *opaque)
+static void kvm_apic_reset(CPUState *env)
 {
-APICState *s = opaque;
+APICState *s = env->apic_state;
 int bsp;
 int i;
 
-cpu_synchronize_state(s->cpu_env);
+cpu_synchronize_state(env);
 
-bsp = cpu_is_bsp(s->cpu_env);
+bsp = cpu_is_bsp(env);
 
 s->dev.apicbase = 0xfee0 |
 (bsp ? MSR_IA32_APICBASE_BSP : 0) | MSR_IA32_APICBASE_ENABLE;
 
-cpu_reset(s->cpu_env);
+cpu_reset(env);
 
 s->dev.tpr = 0;
 s->dev.spurious_vec = 0xff;
@@ -125,7 +125,7 @@ static void kvm_apic_reset(void *opaque)
 s->dev.wait_for_sipi = 1;
 
 
-s->cpu_env->mp_state
+env->mp_state
 = bsp ? KVM_MP_STATE_RUNNABLE : KVM_MP_STATE_UNINITIALIZED;
 
 /* We have to tell the kernel about mp_state, but also save sregs, since
@@ -141,7 +141,7 @@ static void

[Qemu-devel] [PATCH 06/11] flush state in migration post_load

2009-12-02 Thread Glauber Costa

This have already been identified in qemu-kvm. We have to synchronously
tell the kernel about the APIC state. Otherwise, other cpus can see
bogus state for this lapic.

Signed-off-by: Glauber Costa 
---
 hw/apic-kvm.c |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/hw/apic-kvm.c b/hw/apic-kvm.c
index 9e9790f..b2864b6 100644
--- a/hw/apic-kvm.c
+++ b/hw/apic-kvm.c
@@ -73,6 +73,8 @@ static int apic_post_load(void *opaque, int version_id)
 {
 APICState *s = opaque;
 
+cpu_flush_state(s->cpu_env);
+
 return kvm_set_lapic(s->cpu_env, &s->kvm_lapic_state);
 }
 
-- 
1.6.5.2

[Qemu-devel] [PATCH 07/11] Don't call kvm cpu reset on initialization

2009-12-02 Thread Glauber Costa

All reset functions are called from the same place, and this was a leftover

Signed-off-by: Glauber Costa 
---
 kvm-all.c |2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/kvm-all.c b/kvm-all.c
index 596416a..1072d63 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -204,8 +204,6 @@ int kvm_init_vcpu(CPUState *env)
 ret = kvm_arch_init_vcpu(env);
 if (ret == 0) {
 qemu_register_reset(kvm_reset_vcpu, env);
-kvm_arch_reset_vcpu(env);
-ret = kvm_arch_put_registers(env);
 }
 err:
 return ret;
-- 
1.6.5.2

[Qemu-devel] [PATCH 08/11] use cpu_kick instead of direct signalling.

2009-12-02 Thread Glauber Costa

Before signalling a cpu, we have to set exit_request = 1, otherwise
it may go back to executing itself. So every cpu wakeup becomes
at least two statements. The qemu_cpu_kick already provides semantics
to that. So use it all over.

Signed-off-by: Glauber Costa 
---
 vl.c |6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/vl.c b/vl.c
index c7b46a9..97446fc 100644
--- a/vl.c
+++ b/vl.c
@@ -3568,6 +3568,7 @@ void qemu_cpu_kick(void *_env)
 {
 CPUState *env = _env;
 qemu_cond_broadcast(env->halt_cond);
+env->exit_request = 1;
 if (kvm_enabled())
 qemu_thread_signal(env->thread, SIGUSR1);
 }
@@ -3589,7 +3590,7 @@ void qemu_queue_work(CPUState *env, void (*func)(void 
*data), void *data)
 wii->data = data;
 wii->done = 0;
 
-qemu_thread_signal(env->thread, SIGUSR1);
+qemu_cpu_kick(env);
 
 while (!wii->done) {
 qemu_cond_wait(&env->work_cond, &qemu_global_mutex);
@@ -3716,7 +3717,7 @@ static void pause_all_vcpus(void)
 qemu_cond_timedwait(&qemu_pause_cond, &qemu_global_mutex, 100);
 penv = first_cpu;
 while (penv) {
-qemu_thread_signal(penv->thread, SIGUSR1);
+qemu_cpu_kick(penv);
 penv = (CPUState *)penv->next_cpu;
 }
 }
@@ -3729,7 +3730,6 @@ static void resume_all_vcpus(void)
 while (penv) {
 penv->stop = 0;
 penv->stopped = 0;
-qemu_thread_signal(penv->thread, SIGUSR1);
 qemu_cpu_kick(penv);
 penv = (CPUState *)penv->next_cpu;
 }
-- 
1.6.5.2

[Qemu-devel] [PATCH 05/11] tell kernel about all registers instead of just mp_state

2009-12-02 Thread Glauber Costa

This fix a bug with -smp in kvm. Since we have updated apic_base,
we also have to tell kernel about it. So instead of just updating
mp_state, update every regs.

It is mandatory that this happens synchronously, without waiting for
the next vcpu run. Otherwise, if we are migrating, or initializing
the cpu's APIC, other cpus can still see an invalid state.

Since putting registers already happen in vcpu entry, we factor
out the required code in cpu_flush_state()

Signed-off-by: Glauber Costa 
---
 hw/apic-kvm.c |5 -
 kvm-all.c |   13 +
 kvm.h |8 
 3 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/hw/apic-kvm.c b/hw/apic-kvm.c
index e5a0bfc..9e9790f 100644
--- a/hw/apic-kvm.c
+++ b/hw/apic-kvm.c
@@ -126,7 +126,10 @@ static void kvm_apic_reset(void *opaque)
 s->cpu_env->mp_state
 = bsp ? KVM_MP_STATE_RUNNABLE : KVM_MP_STATE_UNINITIALIZED;
 
-kvm_put_mp_state(s->cpu_env);
+/* We have to tell the kernel about mp_state, but also save sregs, since
+ * apic base was just updated
+ */
+cpu_flush_state(s->cpu_env);
 
 if (bsp) {
 /*
diff --git a/kvm-all.c b/kvm-all.c
index f7d89c6..596416a 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -612,6 +612,14 @@ void kvm_cpu_synchronize_state(CPUState *env)
 }
 }
 
+void kvm_cpu_flush_state(CPUState *env)
+{
+if (env->kvm_state->regs_modified) {
+kvm_arch_put_registers(env);
+env->kvm_state->regs_modified = 0;
+}
+}
+
 int kvm_cpu_exec(CPUState *env)
 {
 struct kvm_run *run = env->kvm_run;
@@ -626,10 +634,7 @@ int kvm_cpu_exec(CPUState *env)
 break;
 }
 
-if (env->kvm_state->regs_modified) {
-kvm_arch_put_registers(env);
-env->kvm_state->regs_modified = 0;
-}
+kvm_cpu_flush_state(env);
 
 kvm_arch_pre_run(env, run);
 qemu_mutex_unlock_iothread();
diff --git a/kvm.h b/kvm.h
index 15fb34a..7b9d8b3 100644
--- a/kvm.h
+++ b/kvm.h
@@ -144,6 +144,7 @@ int kvm_check_extension(KVMState *s, unsigned int 
extension);
 uint32_t kvm_arch_get_supported_cpuid(CPUState *env, uint32_t function,
   int reg);
 void kvm_cpu_synchronize_state(CPUState *env);
+void kvm_cpu_flush_state(CPUState *env);
 
 /* generic hooks - to be moved/refactored once there are more users */
 
@@ -154,4 +155,11 @@ static inline void cpu_synchronize_state(CPUState *env)
 }
 }
 
+static inline void cpu_flush_state(CPUState *env)
+{
+if (kvm_enabled()) {
+kvm_cpu_flush_state(env);
+}
+}
+
 #endif
-- 
1.6.5.2

[Qemu-devel] [PATCH 03/11] update halted state on mp_state sync

2009-12-02 Thread Glauber Costa

If we are using in-kernel irqchip, halted state belongs in the kernel.
So everytime we grab kernel's idea of mpstate, we also need to propagate
halted state to userspace.

Signed-off-by: Glauber Costa 
---
 target-i386/kvm.c |5 +
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 7e1ce15..65ad2a1 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -825,6 +825,11 @@ static int kvm_get_mp_state(CPUState *env)
 return ret;
 }
 env->mp_state = mp_state.mp_state;
+
+if (kvm_irqchip_in_kernel()) {
+env->halted = (env->mp_state == KVM_MP_STATE_HALTED);
+}
+
 return 0;
 }
 
-- 
1.6.5.2

[Qemu-devel] [PATCH 04/11] qemu_flush_work for remote vcpu execution

2009-12-02 Thread Glauber Costa

This function is similar to qemu-kvm's on_vcpu mechanism. Totally synchronous,
and guarantees that a given function will be executed at the specified vcpu.

This patch also convert usage within the breakpoints system

Signed-off-by: Glauber Costa 
---
 cpu-all.h  |3 ++
 cpu-defs.h |   14 +
 kvm-all.c  |   17 +--
 vl.c   |   61 +++
 4 files changed, 75 insertions(+), 20 deletions(-)

diff --git a/cpu-all.h b/cpu-all.h
index e214374..8270d43 100644
--- a/cpu-all.h
+++ b/cpu-all.h
@@ -763,6 +763,9 @@ extern CPUState *cpu_single_env;
 extern int64_t qemu_icount;
 extern int use_icount;
 
+void qemu_queue_work(CPUState *env, void (*func)(void *data), void *data);
+void qemu_flush_work(CPUState *env);
+
 #define CPU_INTERRUPT_HARD   0x02 /* hardware interrupt pending */
 #define CPU_INTERRUPT_EXITTB 0x04 /* exit the current TB (use for x86 a20 
case) */
 #define CPU_INTERRUPT_TIMER  0x08 /* internal timer exception pending */
diff --git a/cpu-defs.h b/cpu-defs.h
index 95068b5..18792fc 100644
--- a/cpu-defs.h
+++ b/cpu-defs.h
@@ -31,6 +31,8 @@
 #include "qemu-queue.h"
 #include "targphys.h"
 
+#include "qemu-thread.h"
+
 #ifndef TARGET_LONG_BITS
 #error TARGET_LONG_BITS must be defined before including this header
 #endif
@@ -134,6 +136,13 @@ typedef struct CPUWatchpoint {
 QTAILQ_ENTRY(CPUWatchpoint) entry;
 } CPUWatchpoint;
 
+typedef struct QemuWorkItem {
+void (*func)(void *data);
+void *data;
+int done;
+} QemuWorkItem;
+
+
 #define CPU_TEMP_BUF_NLONGS 128
 #define CPU_COMMON  \
 struct TranslationBlock *current_tb; /* currently executing TB  */  \
@@ -175,6 +184,10 @@ typedef struct CPUWatchpoint {
 QTAILQ_HEAD(watchpoints_head, CPUWatchpoint) watchpoints;\
 CPUWatchpoint *watchpoint_hit;  \
 \
+QemuWorkItem queued_work;   \
+uint64_t queued_local, queued_total;\
+struct QemuMutex queue_lock;\
+\
 struct GDBRegisterState *gdb_regs;  \
 \
 /* Core interrupt code */   \
@@ -194,6 +207,7 @@ typedef struct CPUWatchpoint {
 uint32_t created;   \
 struct QemuThread *thread;  \
 struct QemuCond *halt_cond; \
+struct QemuCond work_cond;  \
 const char *cpu_model_str;  \
 struct KVMState *kvm_state; \
 struct kvm_run *kvm_run;\
diff --git a/kvm-all.c b/kvm-all.c
index a5739c4..f7d89c6 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -617,7 +617,7 @@ int kvm_cpu_exec(CPUState *env)
 struct kvm_run *run = env->kvm_run;
 int ret;
 
-dprintf("kvm_cpu_exec()\n");
+dprintf("kvm_cpu_exec() %d\n", env->cpu_index);
 
 do {
 if (env->exit_request) {
@@ -932,19 +932,6 @@ void kvm_setup_guest_memory(void *start, size_t size)
 }
 
 #ifdef KVM_CAP_SET_GUEST_DEBUG
-static void on_vcpu(CPUState *env, void (*func)(void *data), void *data)
-{
-#ifdef CONFIG_IOTHREAD
-if (env == cpu_single_env) {
-func(data);
-return;
-}
-abort();
-#else
-func(data);
-#endif
-}
-
 struct kvm_sw_breakpoint *kvm_find_sw_breakpoint(CPUState *env,
  target_ulong pc)
 {
@@ -992,7 +979,7 @@ int kvm_update_guest_debug(CPUState *env, unsigned long 
reinject_trap)
 data.dbg.control |= reinject_trap;
 data.env = env;
 
-on_vcpu(env, kvm_invoke_set_guest_debug, &data);
+qemu_queue_work(env, kvm_invoke_set_guest_debug, &data);
 return data.err;
 }
 
diff --git a/vl.c b/vl.c
index 321b18d..c7b46a9 100644
--- a/vl.c
+++ b/vl.c
@@ -3403,6 +3403,11 @@ void qemu_notify_event(void)
 }
 }
 
+void qemu_queue_work(CPUState *env, void (*func)(void *data), void *data)
+{
+func(data);
+}
+
 void qemu_mutex_lock_iothread(void) {}
 void qemu_mutex_unlock_iothread(void) {}
 
@@ -3436,8 +3441,7 @@ static int tcg_has_work(void);
 
 static pthread_key_t current_env;
 
-CPUState *qemu_get_current_env(void);
-CPUState *qemu_get_current_env(void)
+static CPUState *qemu_get_current_env(void)
 {
 return pthread_getspecific(current_env);
 }
@@ -3474,8 +3478,10 @@ static int qemu_init_main_loop(void)
 
 static void qemu_wait_io_event(

[Qemu-devel] [PATCH 01/11] Don't mess with halted state.

2009-12-02 Thread Glauber Costa

When we have irqchip in kernel, halted state is kernel
business. So don't initialize it in our code.

Signed-off-by: Glauber Costa 
---
 hw/apic-kvm.c |3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/hw/apic-kvm.c b/hw/apic-kvm.c
index 089fa45..e5a0bfc 100644
--- a/hw/apic-kvm.c
+++ b/hw/apic-kvm.c
@@ -122,10 +122,9 @@ static void kvm_apic_reset(void *opaque)
 s->dev.next_time = 0;
 s->dev.wait_for_sipi = 1;
 
-s->cpu_env->halted = !(s->dev.apicbase & MSR_IA32_APICBASE_BSP);
 
 s->cpu_env->mp_state
-= s->cpu_env->halted ? KVM_MP_STATE_UNINITIALIZED : 
KVM_MP_STATE_RUNNABLE;
+= bsp ? KVM_MP_STATE_RUNNABLE : KVM_MP_STATE_UNINITIALIZED;
 
 kvm_put_mp_state(s->cpu_env);
 
-- 
1.6.5.2

[Qemu-devel] [PATCH 02/11] store thread-specific env information

2009-12-02 Thread Glauber Costa

Since we'll have multiple cpu threads, at least for kvm, we need a way to store
and retrieve the CPUState associated with the current execution thread.
For the I/O thread, this will be NULL.

I am using pthread functions for that, for portability, but we could as well
use __thread keyword.

Signed-off-by: Glauber Costa 
---
 vl.c |   21 +
 1 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/vl.c b/vl.c
index 44763af..321b18d 100644
--- a/vl.c
+++ b/vl.c
@@ -3434,6 +3434,24 @@ static void block_io_signals(void);
 static void unblock_io_signals(void);
 static int tcg_has_work(void);
 
+static pthread_key_t current_env;
+
+CPUState *qemu_get_current_env(void);
+CPUState *qemu_get_current_env(void)
+{
+return pthread_getspecific(current_env);
+}
+
+static void qemu_set_current_env(CPUState *env)
+{
+pthread_setspecific(current_env, env);
+}
+
+static void qemu_init_current_env(void)
+{
+pthread_key_create(¤t_env, NULL);
+}
+
 static int qemu_init_main_loop(void)
 {
 int ret;
@@ -3446,6 +3464,7 @@ static int qemu_init_main_loop(void)
 qemu_mutex_init(&qemu_fair_mutex);
 qemu_mutex_init(&qemu_global_mutex);
 qemu_mutex_lock(&qemu_global_mutex);
+qemu_init_current_env();
 
 unblock_io_signals();
 qemu_thread_self(&io_thread);
@@ -3484,6 +3503,8 @@ static void *kvm_cpu_thread_fn(void *arg)
 
 block_io_signals();
 qemu_thread_self(env->thread);
+qemu_set_current_env(env);
+
 if (kvm_enabled())
 kvm_init_vcpu(env);
 
-- 
1.6.5.2

[Qemu-devel] [PATCH 00/11] SMP support in qemu.git

2009-12-02 Thread Glauber Costa

Avi/Marcelo

Please include this in a branch in qemu-kvm for future inclusion
in qemu.git. It is the same material people have already commented
on in the ML.

Glauber Costa (11):
  Don't mess with halted state.
  store thread-specific env information
  update halted state on mp_state sync
  qemu_flush_work for remote vcpu execution
  tell kernel about all registers instead of just mp_state
  flush state in migration post_load
  Don't call kvm cpu reset on initialization
  use cpu_kick instead of direct signalling.
  Use per-cpu reset handlers.
  Use __thread where available.
  remove smp restriction from kvm

 configure |   17 
 cpu-all.h |3 +
 cpu-defs.h|   16 
 exec-all.h|   12 ++
 exec.c|   32 
 hw/apic-kvm.c |   26 +++-
 kvm-all.c |   49 +---
 kvm.h |   10 +
 target-i386/kvm.c |   12 ++
 target-ppc/kvm.c  |5 ++
 vl.c  |  108 ++--
 11 files changed, 244 insertions(+), 46 deletions(-)

1 2 3 >

1 - 100 of 208 matches

Mail list logo