Re: [Qemu-devel] [PATCH 2/3] usb-redir: Call qemu_chr_guest_open/close

2011-08-08 Thread Hans de Goede

Hi,

On 08/07/2011 11:30 PM, Anthony Liguori wrote:

On 08/07/2011 12:41 PM, Hans de Goede wrote:

Hi,

On 08/07/2011 05:52 PM, Anthony Liguori wrote:

On 08/07/2011 08:21 AM, Hans de Goede wrote:

To let the chardev know we're ready to start receiving data. This is
necessary with the spicevmc chardev to get it registered with the
spice-server.

Signed-off-by: Hans de Goede
---
usb-redir.c | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/usb-redir.c b/usb-redir.c
index e212993..ec88c0b 100644
--- a/usb-redir.c
+++ b/usb-redir.c
@@ -809,6 +809,8 @@ static int usbredir_initfn(USBDevice *udev)

qemu_chr_add_handlers(dev->cs, usbredir_chardev_can_read,
usbredir_chardev_read, usbredir_chardev_event, dev);
+ /* Let the other side know we are ready */
+ qemu_chr_guest_open(dev->cs);



You should do guest_open before adding handlers.


Erm, no, guest_open may lead to a callback in the chardev, which may
respond by immediately queuing a few writes / doing a read.


So after my char-flow changes, you won't be allowed to set handlers unless 
you've called open.



Why not do it the other way around, i.e. don't allow open until the
handlers are set? My reasoning behind this is that eventually we will
want to have a struct describing a pipe endpoint, which will contain
handlers (by then identical for both sides) and, besides the struct, a
priv / user_data pointer which will get passed to the handlers when
called.

Then we will have a chardev_create or pipe_create call which will take
a struct + user data ptr for both ends (so twice); a sketch follows
below. This matches what our current set-handlers call does. But I
would expect the open to come after the creation of the pipe.
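
A minimal sketch of what such an endpoint description could look like
(all names and signatures here are hypothetical, not an existing QEMU
API):

#include <stdint.h>

/* Hypothetical endpoint description; by design identical for both sides. */
typedef struct PipeEndpoint {
    int  (*can_read)(void *opaque);
    void (*read)(void *opaque, const uint8_t *buf, int len);
    void (*event)(void *opaque, int event);
} PipeEndpoint;

typedef struct Pipe Pipe;

/* Creation takes an endpoint struct + user_data pointer for both ends
 * (so twice), mirroring today's set-handlers call... */
Pipe *pipe_create(const PipeEndpoint *host_ops, void *host_opaque,
                  const PipeEndpoint *guest_ops, void *guest_opaque);

/* ...and the open happens explicitly afterwards, per end. */
void pipe_open(Pipe *pipe, int end);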

At least to me it is much more logical to first set the handlers (which
are really part of object creation) and then later do the open; this
matches the common programming paradigm of having an init/create
function and an open function.

Also, forcing the set handlers after the open does not work well with
virtio_console, as these are not open until the port inside the guest
is opened. So then it would need to delay its set handlers till the
first open, and what should it do at close? Do a set handlers NULL
before doing the actual close?

Regards,

Hans



Re: [Qemu-devel] Build broken

2011-08-08 Thread Kevin Wolf
On 05.08.2011 18:49, malc wrote:
> On Fri, 5 Aug 2011, Stefan Hajnoczi wrote:
> 
>> On Fri, Aug 5, 2011 at 7:22 AM, malc  wrote:
>>>
>>> /home/malc/x/rcs/git/qemuorg/coroutine-ucontext.c: In function 
>>> 'coroutine_new':
>>> /home/malc/x/rcs/git/qemuorg/coroutine-ucontext.c:160:16: error: 'arg.i[1]' 
>>> may be used uninitialized in this function
>>> /home/malc/x/rcs/git/qemuorg/coroutine-ucontext.c:136:18: note: 'arg.i[1]' 
>>> was declared here
>>>
>>> diff --git a/coroutine-ucontext.c b/coroutine-ucontext.c
>>> index 41c2379..42dc3e2 100644
>>> --- a/coroutine-ucontext.c
>>> +++ b/coroutine-ucontext.c
>>> @@ -133,7 +133,7 @@ static Coroutine *coroutine_new(void)
>>> CoroutineUContext *co;
>>> ucontext_t old_uc, uc;
>>> jmp_buf old_env;
>>> -union cc_arg arg;
>>> +union cc_arg arg = {0};
>>>
>>> /* The ucontext functions preserve signal masks which incurs a system 
>>> call
>>>  * overhead.  setjmp()/longjmp() does not preserve signal masks but only
>>>
>>> I guess gcc should yell not only here on ppc32 but on any machine where
>>> pointer size is less than the size of two ints.
>>
>> Makes sense.  Are you going to commit a fix or send a signed-off-by patch?
>>
> 
> If the author(s) (you and Kevin? just you?) agree with the above I can just 
> push it.

Feel free to push it. (Original code was by Anthony, then heavily
modified by me, and after that modified again by Stefan)

Acked-by: Kevin Wolf 



Re: [Qemu-devel] Safely reopening image files by stashing fds

2011-08-08 Thread Kevin Wolf
On 08.08.2011 09:02, Supriya Kannery wrote:
> On 08/05/2011 09:19 PM, Anthony Liguori wrote:
>> On 08/05/2011 10:43 AM, Kevin Wolf wrote:
>>> On 05.08.2011 17:24, Stefan Hajnoczi wrote:
 On Fri, Aug 5, 2011 at 3:28 PM, Christoph Hellwig wrote:
> On Fri, Aug 05, 2011 at 02:12:48PM +0100, Daniel P. Berrange wrote:
>>> Because you cannot change O_DIRECT on an open fd :(. This is why
>>> we're going through this pain.
>>
>> Hmm, I remember hearing that before, but looking at the current
>> fcntl() manpage, it claims you *can* change O_DIRECT using F_SETFL.
>> Perhaps this is a newish feature, but it'd be nicer to use it if
>> possible?
>
> It's been there since day 1 of O_DIRECT support.

 Sorry, my bad. So for Linux we could just use fcntl for
 block_set_hostcache and not bother with reopening. However, we will
 need to reopen should we wish to support changing O_DSYNC.
>>>
>>> We do wish to support that.
>>>
>>> Anthony thinks that allowing the guest to toggle WCE is a prerequisite
>>> for making cache=writeback the default. And this is something that I
>>> definitely want to do for 1.0.
>>
>> Indeed.
>>
> 
> We discussed the following so far...
> 1. How to safely reopen image files
> 2. Dynamic hostcache change
> 3. Support for dynamic change of O_DSYNC
> 
> Since 2 is independent of 1, shall I go ahead and implement
> hostcache change using fcntl?
> 
> Implementation for safely reopening image files using "BDRVReopenState"
> can be done separately as a pre-requisite before implementing 3

Doing it separately means that we would introduce yet another callback
that is used just to change O_DIRECT. In the end we want it to use
bdrv_reopen(), too, so I'm not sure if there is a need for a temporary
solution.

Actually, once we know what we really want (I haven't seen many comments
on the BDRVReopenState suggestion yet), it should be pretty easy to
implement.

Kevin
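
A minimal sketch of the fcntl() approach Stefan describes above,
assuming Linux's F_SETFL handling of O_DIRECT (the function name is
illustrative, not an existing QEMU function):

#define _GNU_SOURCE     /* for O_DIRECT */
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>

/* Toggle the host page cache on an already-open fd: cache off means
 * O_DIRECT on. */
static int set_hostcache(int fd, bool enable_cache)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags < 0) {
        return -errno;
    }
    flags = enable_cache ? (flags & ~O_DIRECT) : (flags | O_DIRECT);
    if (fcntl(fd, F_SETFL, flags) < 0) {
        return -errno;
    }
    return 0;
}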



Re: [Qemu-devel] [PATCH] Permit -mem-path without sync mmu

2011-08-08 Thread Avi Kivity

On 08/08/2011 09:03 AM, David Gibson wrote:

Second, if userspace qemu passing hugepages to kvm can cause (host)
kernel memory corruption, that is clearly a host kernel bug.  So am I
correct in thinking this is basically just a safety feature in case qemu
is run on a buggy kernel?


Seems so, yes.  2.6.2[456] are exploitable.  We only found out after 
these were all released.



Presumably this bug was corrected at some
point?  Is the presence of the SYNC_MMU feature just being used as a
proxy for "is this kernel recent enough to have the corruption bug
fixed"?


SYNC_MMU actually fixes the bug.


In any case this test sure as hell needs a big comment next to it
explaining this context.


Yes.




>  Why are mmu notifiers not implemented for PPC again?

It's just not done yet; we're working on it.  (That is, mmu notifiers
are certainly present on PPC, it's just they're not wired up to kvm,
yet).



If ppc doesn't have this issue even without SYNC_MMU, we can make the 
check x86 specific.


--
error compiling committee.c: too many arguments to function




Re: [Qemu-devel] kvm PCI assignment & VFIO ramblings

2011-08-08 Thread Avi Kivity

On 08/03/2011 05:04 AM, David Gibson wrote:

I still don't understand the distinction you're making.  We're saying
the group is "owned" by a given user or guest in the sense that no-one
else may use anything in the group (including host drivers).  At that
point none, some or all of the devices in the group may actually be
used by the guest.

You seem to be making a distinction between "owned by" and "assigned
to" and "used by" and I really don't see what it is.



Alex (and I) think that we should work with device/function granularity, 
as is common with other archs, and that the group thing is just a 
constraint on which functions may be assigned where, while you think 
that we should work at group granularity, with 1-function groups for 
archs which don't have constraints.


Is this an accurate way of putting it?

--
error compiling committee.c: too many arguments to function




Re: [Qemu-devel] [PATCH 13/16] usb-hid: add hid_has_events()

2011-08-08 Thread TeLeMan
On Thu, Aug 4, 2011 at 23:10, Gerd Hoffmann  wrote:
> Add a hid_has_events function, use it to figure out whether there are
> pending events instead of checking and updating USBHIDState->changed.
>
> Setting ->changed to 1 on init is removed, that should have absolutely
> no effect as the initial state of ->idle is 0 so we report hid state
> anyway until the guest configures some idle time.  Also should clear
> ->idle on reset.
>
> Signed-off-by: Gerd Hoffmann 
> ---
>  hw/usb-hid.c |   16 +++-
>  1 files changed, 7 insertions(+), 9 deletions(-)
>
> diff --git a/hw/usb-hid.c b/hw/usb-hid.c
> index 870cc66..b730692 100644
> --- a/hw/usb-hid.c
> +++ b/hw/usb-hid.c
> @@ -88,7 +88,6 @@ typedef struct USBHIDState {
>     int32_t protocol;
>     uint8_t idle;
>     int64_t next_idle_clock;
> -    int changed;
>     void *datain_opaque;
>     void (*datain)(void *);
>  } USBHIDState;
> @@ -444,12 +443,15 @@ static const uint8_t usb_hid_usage_keys[0x100] = {
>     0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
>  };
>
> +static bool hid_has_events(HIDState *hs)
> +{
> +    return hs->n > 0;
> +}
> +
>  static void usb_hid_changed(HIDState *hs)
>  {
>     USBHIDState *us = container_of(hs, USBHIDState, hid);
>
> -    us->changed = 1;
> -
>     if (us->datain) {
>         us->datain(us->datain_opaque);
>     }
> @@ -742,6 +744,7 @@ static void usb_hid_handle_reset(USBDevice *dev)
>
>     hid_handle_reset(&us->hid);
>     us->protocol = 1;
> +    us->idle = 0;
>  }
>
>  static void usb_hid_set_next_idle(USBHIDState *s, int64_t curtime)
> @@ -798,7 +801,6 @@ static int usb_hid_handle_control(USBDevice *dev, 
> USBPacket *p,
>         } else if (hs->kind == HID_KEYBOARD) {
>             ret = hid_keyboard_poll(hs, data, length);
>         }
> -        us->changed = hs->n > 0;
>         break;
>     case SET_REPORT:
>         if (hs->kind == HID_KEYBOARD) {
> @@ -849,7 +851,7 @@ static int usb_hid_handle_data(USBDevice *dev, USBPacket 
> *p)
>     case USB_TOKEN_IN:
>         if (p->devep == 1) {
>             int64_t curtime = qemu_get_clock_ns(vm_clock);
> -            if (!us->changed &&
> +            if (!hid_has_events(hs) &&
>                 (!us->idle || us->next_idle_clock - curtime > 0)) {
>                 return USB_RET_NAK;
>             }
> @@ -860,7 +862,6 @@ static int usb_hid_handle_data(USBDevice *dev, USBPacket 
> *p)
>                 ret = hid_keyboard_poll(hs, buf, p->iov.size);
>             }
>             usb_packet_copy(p, buf, ret);
> -            us->changed = hs->n > 0;
>         } else {
>             goto fail;
>         }
> @@ -914,9 +915,6 @@ static int usb_hid_initfn(USBDevice *dev, int kind)
>
>     usb_desc_init(dev);
>     hid_init(&us->hid, kind, usb_hid_changed);
> -
> -    /* Force poll routine to be run and grab input the first time.  */
> -    us->changed = 1;
USB tablet does not work on the winxp guest. I think this code can't be removed.

>     return 0;
>  }
>
> --
> 1.7.1
>
>
>



Re: [Qemu-devel] [PATCH v3] XBZRLE delta for live migration of large memory apps

2011-08-08 Thread Shribman, Aidan
> -Original Message-
> From: Stefan Hajnoczi [mailto:stefa...@gmail.com]
> Sent: Tuesday, August 02, 2011 9:06 PM
> To: Shribman, Aidan
> Cc: qemu-devel@nongnu.org; Anthony Liguori
> Subject: Re: [PATCH v3] XBZRLE delta for live migration of
> large memory apps
>
> On Tue, Aug 02, 2011 at 03:45:56PM +0200, Shribman, Aidan wrote:
> > Subject: [PATCH v3] XBZRLE delta for live migration of
> large memory apps
> > From: Aidan Shribman 
> >
> > By using XBZRLE (Xor Binary Zero Run-Length-Encoding) we
> can reduce VM downtime
> > and total live-migration time for VMs running memory write
> intensive workloads
> > typical of large enterprise applications such as SAP ERP
> Systems, and generally
> > speaking for representative of any application with a
> sparse memory update pattern.
> >
> > On the sender side XBZRLE is used as a compact delta
> encoding of page updates,
> > retrieving the old page content from an LRU cache (default
> size of 64 MB). The
> > receiving side uses the existing page content and XBZRLE to
> decode the new page
> > content.
> >
> > Work was originally based on research results published VEE
> 2011: Evaluation of
> > Delta Compression Techniques for Efficient Live Migration
> of Large Virtual
> > Machines by Benoit, Svard, Tordsson and Elmroth.
> Additionally the delta encoder
> > XBRLE was improved further using XBZRLE instead.
> >
> > XBZRLE has a sustained bandwidth of 1.5-2.2 GB/s for
> typical workloads making it
> > ideal for in-line, real-time encoding such as is needed for
> live-migration.
>
> What is the CPU cost of xbzrle live migration on the source host?  I'm
> thinking about a graph showing CPU utilization (e.g. from mpstat(1))
> that has two datasets: migration without xbzrle and migration with
> xbzrle.
>

xbzrle.out indicates that xbzrle is using 50% of the compute capacity during 
the xbzrle live-migration (which completed in a few seconds). In vanilla.out, 
between 30%-60% of compute is directed toward the live-migration itself; in 
this case live-migration is not able to complete.

-

root@ilrsh01:~#
root@ilrsh01:~# cat xbzrle.out
Linux 2.6.35-22-server (ilrsh01)    08/07/2011    _x86_64_    (2 CPU)

10:55:37 AM  CPU    %usr  %nice   %sys  %iowait  %irq  %soft  %steal  %guest   %idle
10:55:38 AM  all   40.50   0.00   1.00     1.50  0.00   9.00    0.00    0.00   48.00
10:55:38 AM    0    0.00   0.00   1.00     3.00  0.00   0.00    0.00    0.00   96.00
10:55:38 AM    1   81.00   0.00   1.00     0.00  0.00  18.00    0.00    0.00    0.00

10:55:38 AM  CPU    %usr  %nice   %sys  %iowait  %irq  %soft  %steal  %guest   %idle
10:55:39 AM  all   47.00   0.00   1.50     0.00  0.00   2.00    0.00    0.00   49.50
10:55:39 AM    0    0.00   0.00   1.00     0.00  0.00   0.00    0.00    0.00   99.00
10:55:39 AM    1   94.00   0.00   2.00     0.00  0.00   4.00    0.00    0.00    0.00

10:55:39 AM  CPU    %usr  %nice   %sys  %iowait  %irq  %soft  %steal  %guest   %idle
10:55:40 AM  all   50.00   0.00   0.50     0.00  0.00   0.00    0.00    0.00   49.50
10:55:40 AM    0    0.00   0.00   1.00     0.00  0.00   0.00    0.00    0.00   99.00
10:55:40 AM    1  100.00   0.00   0.00     0.00  0.00   0.00    0.00    0.00    0.00

10:55:40 AM  CPU    %usr  %nice   %sys  %iowait  %irq  %soft  %steal  %guest   %idle
10:55:41 AM  all   49.75   0.00   1.99     0.00  0.00   0.00    0.00    0.00   48.26
10:55:41 AM    0   10.00   0.00   0.00     0.00  0.00   0.00    0.00    0.00   90.00
10:55:41 AM    1   89.11   0.00   3.96     0.00  0.00   0.00    0.00    0.00    6.93

10:55:41 AM  CPU    %usr  %nice   %sys  %iowait  %irq  %soft  %steal  %guest   %idle
10:55:42 AM  all   47.26   0.00   8.96     0.00  0.00   1.99    0.00    0.00   41.79
10:55:42 AM    0   51.00   0.00   0.00     0.00  0.00   0.00    0.00    0.00   49.00
10:55:42 AM    1   43.56   0.00  17.82     0.00  0.00   3.96    0.00    0.00   34.65

10:55:42 AM  CPU    %usr  %nice   %sys  %iowait  %irq  %soft  %steal  %guest   %idle
10:55:43 AM  all   50.00   0.00  11.50     2.00  0.00   1.00    0.00    0.00   35.50
10:55:43 AM    0  100.00   0.00   0.00     0.00  0.00   0.00    0.00    0.00    0.00
10:55:43 AM    1    0.00   0.00  23.00     4.00  0.00   2.00    0.00    0.00   71.00

10:55:43 AM  CPU    %usr  %nice   %sys  %iowait  %irq  %soft  %steal  %guest   %idle
10:55:44 AM  all   50.00   0.00  22.00     0.00  0.00   0.00    0.00    0.00   28.00
10:55:44 AM    0  100.00   0.00   0.00     0.00  0.00   0.00    0.00    0.00    0.00
10:55:44 AM    1    0.00   0.00  44.00     0.00  0.00   0.00    0.00    0.00   56.00

10:55:44 AM  CPU    %usr  %nice   %sys  %iowait  %irq  %soft  %steal  %guest   %idle
10:55:45 AM  all   49.50   0.00  23.50     0.00  0.00   2.00    0.00    0.00   25.00
10:55:45 AM    0   99.00   0.00   1.00     0.00  0.00   0.00    0.00    0.00    0.00

[Qemu-devel] [PATCH v4] XBZRLE delta for live migration of large memory apps

2011-08-08 Thread Shribman, Aidan
Subject: [PATCH v4] XBZRLE delta for live migration of large memory apps
From: Aidan Shribman 

By using XBZRLE (Xor Binary Zero Run-Length-Encoding) we can reduce VM downtime
and total live-migration time of VMs running memory write intensive workloads
typical of large enterprise applications such as SAP ERP Systems, and generally
speaking for any application with a sparse memory update pattern.

On the sender side XBZRLE is used as a compact delta encoding of page updates,
retrieving the old page content from an LRU cache (default size of 64 MB). The
receiving side uses the existing page content and XBZRLE to decode the new page
content.

Work was originally based on research results published at VEE 2011: Evaluation of
Delta Compression Techniques for Efficient Live Migration of Large Virtual
Machines by Benoit, Svard, Tordsson and Elmroth. Additionally, the delta encoder
XBRLE was further improved by using XBZRLE instead.

XBZRLE has a sustained bandwidth of 2-2.5 GB/s for typical workloads, making it
ideal for in-line, real-time encoding such as is needed for live-migration.
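
For reference, the core XOR + zero-run-length idea can be sketched as
below; the token format in this sketch is invented for illustration and
is not the wire format used by the patch (which also carries a header
and magic):

#include <stdint.h>

/* Illustrative only: encode new_page against the cached old copy as
 * alternating <zero-run length> and <nonzero-run length, XOR bytes>
 * tokens, with runs capped at 255 so lengths fit in one byte.  Returns
 * the encoded size, or -1 on overflow (caller then sends the raw page). */
static int xbzrle_encode_sketch(const uint8_t *old, const uint8_t *new_page,
                                uint8_t *out, int page_size, int out_max)
{
    int i = 0, n = 0;

    while (i < page_size) {
        int start = i;
        while (i < page_size && i - start < 255 && old[i] == new_page[i]) {
            i++;                              /* unchanged bytes: XOR == 0 */
        }
        if (n + 1 > out_max) {
            return -1;
        }
        out[n++] = i - start;                 /* zero-run length */
        start = i;
        while (i < page_size && i - start < 255 && old[i] != new_page[i]) {
            i++;                              /* changed bytes */
        }
        if (n + 1 + (i - start) > out_max) {
            return -1;
        }
        out[n++] = i - start;                 /* nonzero-run length */
        for (int j = start; j < i; j++) {
            out[n++] = old[j] ^ new_page[j];  /* the XOR delta itself */
        }
    }
    return n;
}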

A typical usage scenario:
{qemu} migrate_set_cachesize 256m
{qemu} migrate -x -d tcp:destination.host:
{qemu} info migrate
...
transferred ram-duplicate: A kbytes
transferred ram-duplicate: B pages
transferred ram-normal: C kbytes
transferred ram-normal: D pages
transferred ram-xbrle: E kbytes
transferred ram-xbrle: F pages
overflow ram-xbrle: G pages
cache-hit ram-xbrle: H pages
cache-lookup ram-xbrle: J pages

Testing: live migration with XBZRLE completed in 110 seconds; without it, live
migration was not able to complete.

A simple synthetic memory r/w load generator:

#include <stdio.h>
#include <stdlib.h>

int main()
{
    char *buf = (char *) calloc(4096, 4096);
    while (1) {
        int i;
        for (i = 0; i < 4096 * 4; i++) {
            buf[i * 4096 / 4]++;   /* touch one byte per 1 KB stride */
        }
        printf(".");
    }
}

Signed-off-by: Benoit Hudzia 
Signed-off-by: Petter Svard 
Signed-off-by: Aidan Shribman 

--

 Makefile.target   |1 +
 arch_init.c   |  351 ++--
 block-migration.c |3 +-
 hash.h|   72 +++
 hmp-commands.hx   |   36 --
 hw/hw.h   |3 +-
 lru.c |  142 +
 lru.h |   13 ++
 migration-exec.c  |6 +-
 migration-fd.c|6 +-
 migration-tcp.c   |6 +-
 migration-unix.c  |6 +-
 migration.c   |  119 +-
 migration.h   |   25 +++-
 qmp-commands.hx   |   43 ++-
 savevm.c  |   13 ++-
 sysemu.h  |   13 ++-
 xbzrle.c  |  126 +++
 xbzrle.h  |   12 ++
 19 files changed, 917 insertions(+), 79 deletions(-)

diff --git a/Makefile.target b/Makefile.target
index 2800f47..b3215de 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -186,6 +186,7 @@ endif #CONFIG_BSD_USER
 ifdef CONFIG_SOFTMMU

 obj-y = arch_init.o cpus.o monitor.o machine.o gdbstub.o balloon.o
+obj-y += lru.o xbzrle.o
 # virtio has to be here due to weird dependency between PCI and virtio-net.
 # need to fix this properly
 obj-y += virtio-blk.o virtio-balloon.o virtio-net.o virtio-serial-bus.o
diff --git a/arch_init.c b/arch_init.c
old mode 100644
new mode 100755
index 4486925..d67dc82
--- a/arch_init.c
+++ b/arch_init.c
@@ -40,6 +40,17 @@
 #include "net.h"
 #include "gdbstub.h"
 #include "hw/smbios.h"
+#include "lru.h"
+#include "xbzrle.h"
+
+//#define DEBUG_ARCH_INIT
+#ifdef DEBUG_ARCH_INIT
+#define DPRINTF(fmt, ...) \
+do { fprintf(stdout, "arch_init: " fmt, ## __VA_ARGS__); } while (0)
+#else
+#define DPRINTF(fmt, ...) \
+do { } while (0)
+#endif

 #ifdef TARGET_SPARC
 int graphic_width = 1024;
@@ -88,6 +99,161 @@ const uint32_t arch_type = QEMU_ARCH;
 #define RAM_SAVE_FLAG_PAGE 0x08
 #define RAM_SAVE_FLAG_EOS  0x10
 #define RAM_SAVE_FLAG_CONTINUE 0x20
+#define RAM_SAVE_FLAG_XBZRLE0x40
+
+/***/
+/* RAM Migration State */
+typedef struct ArchMigrationState {
+int use_xbrle;
+int64_t xbrle_cache_size;
+} ArchMigrationState;
+
+static ArchMigrationState arch_mig_state;
+
+void arch_set_params(int blk_enable, int shared_base, int use_xbrle,
+int64_t xbrle_cache_size, void *opaque)
+{
+arch_mig_state.use_xbrle = use_xbrle;
+arch_mig_state.xbrle_cache_size = xbrle_cache_size;
+}
+
+#define BE16_MAGIC 0x0123
+
+/***/
+/* XBZRLE (Xor Binary Zero Run-Length Encoding) */
+typedef struct XBZRLEHeader {
+uint32_t xh_cksum; /* not used */
+uint16_t xh_magic;
+uint16_t xh_len;
+uint8_t xh_flags;
+} XBZRLEHeader;
+
+static uint8_t dup_buf[TARGET_PAGE_SIZE];
+
+/***/
+/* accounting */
+typedef struct AccountingInfo{
+uint64_t dup_pages;
+

Re: [Qemu-devel] [PATCH 2/3] Introduce a 'client_add' monitor command accepting an open FD

2011-08-08 Thread Daniel P. Berrange
On Sat, Aug 06, 2011 at 09:38:03AM -0500, Anthony Liguori wrote:
> On 06/23/2011 07:31 AM, Daniel P. Berrange wrote:
> >Allow client connections for VNC and socket based character
> >devices to be passed in over the monitor using SCM_RIGHTS.
> >
> >One intended usage scenario is to start QEMU with VNC on a
> >UNIX domain socket. An unprivileged user which cannot access
> >the UNIX domain socket, can then connect to QEMU's VNC server
> >by passing an open FD to libvirt, which passes it onto QEMU.
> >
> >  { "execute": "get_fd", "arguments": { "fdname": "myclient" } }
> >  { "return": {} }
> >  { "execute": "add_client", "arguments": { "protocol": "vnc",
> >"fdname": "myclient",
> >"skipauth": true } }
> >  { "return": {} }
> >
> >In this case 'protocol' can be 'vnc' or 'spice', or the name
> >of a character device (eg from -chardev id=)
> >
> >The 'skipauth' parameter can be used to skip any configured
> >VNC authentication scheme, which is useful if the mgmt layer
> >talking to the monitor has already authenticated the client
> >in another way.
> >
> >* console.h: Define 'vnc_display_add_client' method
> >* monitor.c: Implement 'client_add' command
> >* qemu-char.c, qemu-char.h: Add 'qemu_char_add_client' method
> 
> I don't feel all that good about this anymore.  The notion of adding
> a fd to an arbitrary character device is a big layering violation
> and doesn't make much sense conceptually.
> 
> A chardev cannot have multiple connections.  It seems like the
> use-case is much better suited to having an fd: protocol to create
> char devices with.

The same idea of being able to configure a VNC server to listen
on a UNIX socket path by default, but be able to inject the client
connection via libvirt + FD passing, also applies to chardevs IMHO.
So having a fd: protocol chardev doesn't seem right to me. Even if
we don't support multiple connections, it is still useful to be able
to pass in the initial connection, for TCP/UNIX based chardevs.
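
For reference, the fd handoff itself is plain SCM_RIGHTS ancillary data
over a UNIX socket; a minimal sender sketch (helper name illustrative):

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send an open fd over a connected UNIX socket as SCM_RIGHTS ancillary
 * data -- the mechanism the monitor's getfd command receives on. */
static int send_fd(int sock, int fd)
{
    char byte = 0;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    char control[CMSG_SPACE(sizeof(int))];
    struct msghdr msg = { 0 };
    struct cmsghdr *cmsg;

    memset(control, 0, sizeof(control));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = control;
    msg.msg_controllen = sizeof(control);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}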

> I'd like to partially revert this removing the qemu_chr_add_client
> interface.

The VNC bit was the most immediately useful part to me, so having the
chardev bits now is not critical.

Regards,
Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|



[Qemu-devel] build failure with coroutine-gthread

2011-08-08 Thread Aneesh Kumar K.V


  LINK  qemu-ga
coroutine-gthread.o: In function `coroutine_init':
/home/opensource/sources/qemu/qemu-upstream/coroutine-gthread.c:39: undefined 
reference to `g_thread_init'
collect2: ld returned 1 exit status

The patch below fixes the failure.  I also added the patch in the VirtFS
pull request.

commit 4b76a481ee28166d5f415ef97833c624f4fc0792
Author: Stefan Hajnoczi 
Date:   Mon Aug 8 13:04:05 2011 +0530

coroutine: add gthread dependency

Commit 1fc7bd4a86a2bfeafcec29445871eb97469a2699 removed the gthread and
gio dependency since qemu-ga did not require it.  Coroutines require
gthread, so add it back in.

Signed-off-by: Stefan Hajnoczi 

diff --git a/configure b/configure
index 0c67a4a..a6687f7 100755
--- a/configure
+++ b/configure
@@ -1844,16 +1844,14 @@ fi
 
 ##
 # glib support probe
-if test "$guest_agent" != "no" ; then
-if $pkg_config --modversion glib-2.0 > /dev/null 2>&1 ; then
-glib_cflags=`$pkg_config --cflags glib-2.0 2>/dev/null`
-glib_libs=`$pkg_config --libs glib-2.0 2>/dev/null`
-libs_softmmu="$glib_libs $libs_softmmu"
-libs_tools="$glib_libs $libs_tools"
-else
-echo "glib-2.0 required to compile QEMU"
-exit 1
-fi
+if $pkg_config --modversion gthread-2.0 > /dev/null 2>&1 ; then
+glib_cflags=`$pkg_config --cflags gthread-2.0 2>/dev/null`
+glib_libs=`$pkg_config --libs gthread-2.0 2>/dev/null`
+libs_softmmu="$glib_libs $libs_softmmu"
+libs_tools="$glib_libs $libs_tools"
+else
+echo "glib-2.0 required to compile QEMU"
+exit 1
 fi
 
 ##



[Qemu-devel] [PULL] VirtFS coroutine changes

2011-08-08 Thread Aneesh Kumar K.V

The following changes since commit 23ddf2bb1e4bfe2b72a726fe5e828807b65941ad:

  Fix forcing multicast msgs to loopback on OpenBSD. (2011-08-07 11:06:43 +)

are available in the git repository at:
  git://repo.or.cz/qemu/v9fs.git for-upstream-1

Aneesh Kumar K.V (32):
  hw/9pfs: Add yield support for readdir related coroutines
  hw/9pfs: Update v9fs_readdir to use coroutines
  hw/9pfs: Add yield support to statfs coroutine
  hw/9pfs: Update v9fs_statfs to use coroutines
  hw/9pfs: Add yield support to lstat coroutine
  hw/9pfs: Update v9fs_getattr to use coroutines
  hw/9pfs: Add yield support to setattr related coroutines
  hw/9pfs: Update v9fs_setattr to use coroutines
  hw/9pfs: Add yield support to xattr related coroutine
  hw/9pfs: Update v9fs_xattrwalk to coroutines
  hw/9pfs: Update v9fs_xattrcreate to use coroutines
  hw/9pfs: Add yield support to mknod coroutine
  hw/9pfs: Update v9fs_mknod to use coroutines
  hw/9pfs: Add yield support to rename coroutine
  hw/9pfs: Update vfs_rename to use coroutines
  hw/9pfs: Add yield support for fstat coroutine
  hw/9pfs: Update v9fs_lock to use coroutines
  hw/9pfs: Update v9fs_getlock to use coroutines
  hw/9pfs: Add yield support for open and opendir coroutine
  hw/9pfs: Update v9fs_open to use coroutines
  hw/9pfs: Update v9fs_stat to use coroutines
  hw/9pfs: Update v9fs_walk to use coroutines
  hw/9pfs: Add yield support for clunk related coroutine
  hw/9pfs: Update v9fs_clunk to use coroutines
  hw/9pfs: Add yield support for fsync coroutine
  hw/9pfs: Update v9fs_fsync to use coroutines
  hw/9pfs: Add yield support for pwritev coroutine
  hw/9pfs: Update v9fs_write to use coroutines
  hw/9pfs: Update v9fs_wstat to use coroutines
  hw/9pfs: Update v9fs_attach to use coroutines
  hw/9pfs: Add yield support for preadv coroutine
  hw/9pfs: Update v9fs_read to use coroutines

Harsh Prateek Bora (1):
  use readdir_r instead of readdir for reentrancy

Stefan Hajnoczi (1):
  coroutine: add gthread dependency

Venkateswararao Jujjuri (JV) (20):
  [virtio-9p] Add infrastructure to support glib threads and coroutines.
  [virtio-9p] Change all pdu handlers to coroutines.
  [virtio-9p] Remove post functions for v9fs_readlink.
  [virtio-9p] clean up v9fs_readlink.
  [virtio-9p] coroutines for readlink
  [virtio-9p] Remove post functions for v9fs_mkdir.
  [virtio-9p] clean up v9fs_mkdir.
  [virtio-9p] coroutine and threading for mkdir
  [virtio-9p] Remove post functions for v9fs_remove
  [virtio-9p] clean up v9fs_remove.
  [virtio-9p] coroutine and threading for remove/unlink
  [virtio-9p] Remove post functions for v9fs_lcreate
  [virtio-9p] clean up v9fs_lcreate
  [virtio-9p] coroutine and threading for open2
  [virtio-9p] Remove post functions for v9fs_create
  [virtio-9p] clean up v9fs_create Rearrange the code
  [virtio-9p] Remove post functions for v9fs_symlink
  [virtio-9p] clean up v9fs_symlink
  [virtio-9p] coroutine and threading for v9fs_do_symlink
  [virtio-9p] coroutine and threading for v9fs_do_link

 Makefile.objs  |3 +
 configure  |   18 +-
 fsdev/file-op-9p.h |2 +-
 hw/9pfs/codir.c|  117 ++
 hw/9pfs/cofile.c   |  163 +++
 hw/9pfs/cofs.c |  191 +++
 hw/9pfs/coxattr.c  |   84 ++
 hw/9pfs/virtio-9p-coth.c   |  102 ++
 hw/9pfs/virtio-9p-coth.h   |   96 ++
 hw/9pfs/virtio-9p-device.c |7 +-
 hw/9pfs/virtio-9p-local.c  |7 +-
 hw/9pfs/virtio-9p.c| 3052 
 hw/9pfs/virtio-9p.h|  155 +---
 13 files changed, 1848 insertions(+), 2149 deletions(-)
 create mode 100644 hw/9pfs/codir.c
 create mode 100644 hw/9pfs/cofile.c
 create mode 100644 hw/9pfs/cofs.c
 create mode 100644 hw/9pfs/coxattr.c
 create mode 100644 hw/9pfs/virtio-9p-coth.c
 create mode 100644 hw/9pfs/virtio-9p-coth.h



Re: [Qemu-devel] [RFC] postcopy livemigration proposal

2011-08-08 Thread Dor Laor

On 08/08/2011 06:24 AM, Isaku Yamahata wrote:

This mail is on "Yabusame: Postcopy Live Migration for Qemu/KVM",
on which we'll give a talk at KVM Forum.
The purpose of this mail is to let developers know about it in advance
so that we can get better feedback on its design/implementation approach
early, before we start to implement it.


Background
==
* What is postcopy livemigration
It is yet another live migration mechanism for Qemu/KVM, which
implements the migration technique known as "postcopy" or "lazy"
migration. Just after the "migrate" command is invoked, the execution
host of a VM is instantaneously switched to a destination host.

The benefit is that total migration time is shorter, because it transfers
each page only once. Precopy, on the other hand, may send the same pages
again and again because they can be dirtied.
The switching time from the source to the destination is several
hundred milliseconds, so it enables quick load balancing.
For details, please refer to the papers.

We believe this is useful for others, so we'd like to merge this
feature into the upstream qemu/kvm. The existing implementation that
we have right now is very ad-hoc because it's for academic research.
For the upstream merge, we're starting to re-design/implement it and
we'd like to get feedback early.  Although many improvements/optimizations
are possible, we should first implement/merge a simple/clean, but
extensible, version and then improve/optimize it later.

postcopy livemigration will be introduced as an optional feature. The existing
precopy livemigration remains the default behavior.


* related links:
project page
http://sites.google.com/site/grivonhome/quick-kvm-migration

Enabling Instantaneous Relocation of Virtual Machines with a
Lightweight VMM Extension,
(proof-of-concept, ad-hoc prototype. not a new design)
http://grivon.googlecode.com/svn/pub/docs/ccgrid2010-hirofuchi-paper.pdf
http://grivon.googlecode.com/svn/pub/docs/ccgrid2010-hirofuchi-talk.pdf

Reactive consolidation of virtual machines enabled by postcopy live migration
(advantage for VM consolidation)
http://portal.acm.org/citation.cfm?id=1996125
http://www.emn.fr/x-info/ascola/lib/exe/fetch.php?media=internet:vtdc-postcopy.pdf

Qemu wiki
http://wiki.qemu.org/Features/PostCopyLiveMigration


Design/Implementation
=
The basic idea of postcopy livemigration is to use a sort of distributed
shared memory between the migration source and destination.

The migration procedure looks like
   - start migration
 stop the guest VM on the source and send the machine states except
 guest RAM to the destination
   - resume the guest VM on the destination without guest RAM contents
   - Hook guest access to pages, and pull page contents from the source
 This continues until all the pages are pulled to the destination

   The big picture is depicted at
   http://wiki.qemu.org/File:Postcopy-livemigration.png
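
The "Hook guest access to pages" step can be demonstrated in miniature
with a user-space toy (a conceptual sketch only, not the proposed
QEMU/KVM design): map a region with no access rights and fill each page
on first touch from a SIGSEGV handler, where postcopy would instead
pull the page contents from the source:

#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define PAGE 4096
static char *region;

static void fault_handler(int sig, siginfo_t *si, void *ctx)
{
    char *page = (char *)((uintptr_t)si->si_addr & ~(uintptr_t)(PAGE - 1));
    if (page < region || page >= region + 16 * PAGE) {
        _exit(1);                     /* a real crash, not our demo fault */
    }
    /* "Pull" the page: here we just fill it; postcopy would fetch the
     * contents from the migration source before continuing. */
    mprotect(page, PAGE, PROT_READ | PROT_WRITE);
    memset(page, 0xaa, PAGE);
}

int main(void)
{
    struct sigaction sa = { .sa_sigaction = fault_handler,
                            .sa_flags = SA_SIGINFO };
    sigaction(SIGSEGV, &sa, NULL);
    region = mmap(NULL, 16 * PAGE, PROT_NONE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    for (int i = 0; i < 16; i++) {    /* every first touch faults once */
        printf("page %d: %02x\n", i, region[i * PAGE] & 0xff);
    }
    return 0;
}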


That's terrific (nice video also)!
Orit and myself had the exact same idea too (now we can't patent it..).

Advantages:
- No down time due to memory copying.
- Efficient: reduces needed traffic, since there is no need to re-send pages.
- Reduced overall RAM consumption of the source and destination,
as opposed to current live migration (where both the source and the
destination allocate the memory until the live migration
completes). We can free copied memory once the destination guest
has received it, and save RAM.
- Increased parallelism for SMP guests: we can have multiple
virtual CPUs handle their demand paging. Less time to hold a
global lock, less thread contention.
- Virtual machines are using more and more memory resources;
for a virtual machine with a very large working set, doing live
migration with reasonable down time is impossible today.

Disadvantages:
- During the live migration the guest will run slower than in
today's live migration. We need to remember that even today
guests suffer a performance penalty on the source during the
COW stage (memory copy).
- Failure of the source or destination or the network will cause
us to lose the running virtual machine. Those failures are very
rare.
In case there is shared storage, we can store a copy of the
memory there that can be recovered in case of such a failure.

Overall, it looks like a better approach for the vast majority of cases.
Hope it will get merged into kvm and become the default way.




There are several design points.
   - who takes care of pulling page contents:
     an independent daemon vs a thread in qemu.
     The daemon approach is preferable because an independent daemon would
     make it easy to debug the postcopy memory mechanism without qemu.
     If required, it wouldn't be difficult to convert the daemon into
     a thread in qemu

   - connection between the source and the destinati

Re: [Qemu-devel] build failure with coroutine-gthread

2011-08-08 Thread Stefan Hajnoczi
On Mon, Aug 8, 2011 at 10:01 AM, Aneesh Kumar K.V
 wrote:
>
>
>  LINK  qemu-ga
> coroutine-gthread.o: In function `coroutine_init':
> /home/opensource/sources/qemu/qemu-upstream/coroutine-gthread.c:39: undefined 
> reference to `g_thread_init'
> collect2: ld returned 1 exit status
>
> The patch below fixes the failure.  I also added the patch in the VirtFS
> pull request.

It appears coroutine-core v7 was merged instead of v8.  The
differences are minor:
 * Bisectability: introduce gthread implementation before ucontext/fibers
 * Depend on gthread, not just glib, for multithreaded programming

Here is the patchwork link to the gthread dependency patch in case a
committer wants to merge it immediately:
http://patchwork.ozlabs.org/patch/106810/

Stefan



Re: [Qemu-devel] [RFC] postcopy livemigration proposal

2011-08-08 Thread Stefan Hajnoczi
On Mon, Aug 8, 2011 at 4:24 AM, Isaku Yamahata  wrote:
> This mail is on "Yabusame: Postcopy Live Migration for Qemu/KVM"
> on which we'll give a talk at KVM-forum.

I'm curious if this approach is compatible with asynchronous page
faults?  The idea there was to tell the guest about a page fault so it
can continue to do useful work in the meantime (if the fault was in
guest userspace).

Stefan



Re: [Qemu-devel] [RFC] postcopy livemigration proposal

2011-08-08 Thread Yaniv Kaul

On 08/08/2011 12:20, Dor Laor wrote:

On 08/08/2011 06:24 AM, Isaku Yamahata wrote:

This mail is on "Yabusame: Postcopy Live Migration for Qemu/KVM",
on which we'll give a talk at KVM Forum.
The purpose of this mail is to let developers know about it in advance
so that we can get better feedback on its design/implementation approach
early, before we start to implement it.


Background
==
* What is postcopy livemigration
It is yet another live migration mechanism for Qemu/KVM, which
implements the migration technique known as "postcopy" or "lazy"
migration. Just after the "migrate" command is invoked, the execution
host of a VM is instantaneously switched to a destination host.

The benefit is that total migration time is shorter, because it transfers
each page only once. Precopy, on the other hand, may send the same pages
again and again because they can be dirtied.
The switching time from the source to the destination is several
hundred milliseconds, so it enables quick load balancing.
For details, please refer to the papers.

We believe this is useful for others, so we'd like to merge this
feature into the upstream qemu/kvm. The existing implementation that
we have right now is very ad-hoc because it's for academic research.
For the upstream merge, we're starting to re-design/implement it and
we'd like to get feedback early.  Although many improvements/optimizations
are possible, we should first implement/merge a simple/clean, but
extensible, version and then improve/optimize it later.

postcopy livemigration will be introduced as an optional feature. The
existing precopy livemigration remains the default behavior.


* related links:
project page
http://sites.google.com/site/grivonhome/quick-kvm-migration

Enabling Instantaneous Relocation of Virtual Machines with a
Lightweight VMM Extension,
(proof-of-concept, ad-hoc prototype. not a new design)
http://grivon.googlecode.com/svn/pub/docs/ccgrid2010-hirofuchi-paper.pdf
http://grivon.googlecode.com/svn/pub/docs/ccgrid2010-hirofuchi-talk.pdf

Reactive consolidation of virtual machines enabled by postcopy live 
migration

(advantage for VM consolidation)
http://portal.acm.org/citation.cfm?id=1996125
http://www.emn.fr/x-info/ascola/lib/exe/fetch.php?media=internet:vtdc-postcopy.pdf 



Qemu wiki
http://wiki.qemu.org/Features/PostCopyLiveMigration


Design/Implementation
=
The basic idea of postcopy livemigration is to use a sort of distributed
shared memory between the migration source and destination.

The migration procedure looks like
   - start migration
 stop the guest VM on the source and send the machine states except
 guest RAM to the destination
   - resume the guest VM on the destination without guest RAM contents
   - Hook guest access to pages, and pull page contents from the source
 This continues until all the pages are pulled to the destination

   The big picture is depicted at
   http://wiki.qemu.org/File:Postcopy-livemigration.png


That's terrific (nice video also)!
Orit and myself had the exact same idea too (now we can't patent it..).

Advantages:
- No down time due to memory copying.
- Efficient: reduces needed traffic, since there is no need to re-send pages.
- Reduced overall RAM consumption of the source and destination,
as opposed to current live migration (where both the source and the
destination allocate the memory until the live migration
completes). We can free copied memory once the destination guest
has received it, and save RAM.
- Increased parallelism for SMP guests: we can have multiple
virtual CPUs handle their demand paging. Less time to hold a
global lock, less thread contention.
- Virtual machines are using more and more memory resources;
for a virtual machine with a very large working set, doing live
migration with reasonable down time is impossible today.

Disadvantages:
- During the live migration the guest will run slower than in
today's live migration. We need to remember that even today
guests suffer a performance penalty on the source during the
COW stage (memory copy).
- Failure of the source or destination or the network will cause
us to lose the running virtual machine. Those failures are very
rare.


I highly doubt that's acceptable in enterprise deployments.


In case there is shared storage, we can store a copy of the
memory there that can be recovered in case of such a failure.

Overall, it looks like a better approach for the vast majority of cases.
Hope it will get merged into kvm and become the default way.




There are several design points.
   - who takes care of pulling page contents:
     an independent daemon vs a thread in qemu.
     The daemon approach is preferable because an independent daemon would
     make it easy to debug the postcopy memory mechanism without qemu.
     If required, it wouldn'

Re: [Qemu-devel] [RFC] postcopy livemigration proposal

2011-08-08 Thread Isaku Yamahata
On Mon, Aug 08, 2011 at 10:38:35AM +0100, Stefan Hajnoczi wrote:
> On Mon, Aug 8, 2011 at 4:24 AM, Isaku Yamahata  wrote:
> > This mail is on "Yabusame: Postcopy Live Migration for Qemu/KVM"
> > on which we'll give a talk at KVM-forum.
> 
> I'm curious if this approach is compatible with asynchronous page
> faults?  The idea there was to tell the guest about a page fault so it
> can continue to do useful work in the meantime (if the fault was in
> guest userspace).

Yes. It's quite possible to inject an async page fault into the guest
when the faulted page isn't available on the destination. At the same
time the page will be requested from the source of the migration.
I think it's not so difficult.
-- 
yamahata



Re: [Qemu-devel] [PULL 0/7] ARM patch queue (for master)

2011-08-08 Thread Peter Maydell
Ping?

On 27 July 2011 14:26, Peter Maydell  wrote:
> This is a pull request for various outstanding ARM related
> patches; they've been on the list for a week or so.
>
> Some of these are bug fixes which I want to get into 0.15; I'm
> going to do a parallel pullreq for the 0.15 patches.
>
> Thanks
> -- PMM
>
>
> The following changes since commit c886edfb851c0c590d4e77f058f2ec8ed95ad1b5:
>
>  Let users select their pythons (2011-07-25 16:50:12 +)
>
> are available in the git repository at:
>  git://git.linaro.org/people/pmaydell/qemu-arm.git for-upstream
>
> Jamie Iles (2):
>      target-arm: make VMSAv7 remapping and AP dependent on V6K
>      target-arm: support for ARM1176JZF-s cores
>
> Peter Maydell (5):
>      target-arm: Mark 1136r1 as a v6K core
>      target-arm: Support v6 barriers in linux-user mode
>      target-arm: Handle UNDEF and UNPREDICTABLE cases for VLDM, VSTM
>      target-arm: UNDEF on a VCVTT/VCVTB UNPREDICTABLE to avoid TCG assert
>      target-arm: Don't print debug messages for various UNDEF cases
>
>  target-arm/cpu.h       |    2 +
>  target-arm/helper.c    |   47 ++-
>  target-arm/translate.c |  114 +++
>  3 files changed, 121 insertions(+), 42 deletions(-)



Re: [Qemu-devel] [PATCH v3 01/39] virtio-pci: get config on init

2011-08-08 Thread Michael S. Tsirkin
On Fri, Aug 05, 2011 at 08:52:25AM -0500, Anthony Liguori wrote:
> On 08/04/2011 08:05 AM, Avi Kivity wrote:
> >From: "Michael S. Tsirkin"
> >
> >We originally did get config on map, so that
> >following write accesses are done on an updated config.
> >New memory API doesn't give us a callback
> >on map, and arguably, devices don't know when
> >cpu really can access there. So updating on
> >init seems cleaner.
> >
> >Signed-off-by: Michael S. Tsirkin
> >Signed-off-by: Avi Kivity
> >---
> >  hw/virtio-pci.c |7 ---
> >  1 files changed, 4 insertions(+), 3 deletions(-)
> >
> >diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
> >index d685243..ca1f12f 100644
> >--- a/hw/virtio-pci.c
> >+++ b/hw/virtio-pci.c
> >@@ -506,9 +506,6 @@ static void virtio_map(PCIDevice *pci_dev, int 
> >region_num,
> >  register_ioport_read(addr, config_len, 1, virtio_pci_config_readb, 
> > proxy);
> >  register_ioport_read(addr, config_len, 2, virtio_pci_config_readw, 
> > proxy);
> >  register_ioport_read(addr, config_len, 4, virtio_pci_config_readl, 
> > proxy);
> >-
> >-if (vdev->config_len)
> >-vdev->get_config(vdev, vdev->config);
> >  }
> >
> >  static void virtio_write_config(PCIDevice *pci_dev, uint32_t address,
> >@@ -689,6 +686,10 @@ void virtio_init_pci(VirtIOPCIProxy *proxy, 
> >VirtIODevice *vdev)
> >  proxy->host_features |= 0x1<<  VIRTIO_F_NOTIFY_ON_EMPTY;
> >  proxy->host_features |= 0x1<<  VIRTIO_F_BAD_FEATURE;
> >  proxy->host_features = vdev->get_features(vdev, proxy->host_features);
> >+
> >+if (vdev->config_len) {
> >+vdev->get_config(vdev, vdev->config);
> >+}
> 
> Thinking more closely, I don't think this is right.
> 
> Updating on map ensured that the config was refreshed after each
> time the bar was mapped.  In the very least, the config needs to be
> refreshed during reset because the guest may write to the guest
> space which should get cleared after reset.
> 
> Regards,
> 
> Anthony Liguori

Not sure I understand. Which register, for example,
do you have in mind?
Could you clarify please?


> >  }
> >
> >  static int virtio_blk_init_pci(PCIDevice *pci_dev)



Re: [Qemu-devel] Build broken

2011-08-08 Thread malc
On Mon, 8 Aug 2011, Kevin Wolf wrote:

> On 05.08.2011 18:49, malc wrote:
> > On Fri, 5 Aug 2011, Stefan Hajnoczi wrote:
> > 

[..snip..]

> 
> Feel free to push it. (Original code was by Anthony, then heavily
> modified by me, and after that modified again by Stefan)
> 
> Acked-by: Kevin Wolf 

Pushed. 

-- 
mailto:av1...@comtv.ru



Re: [Qemu-devel] build failure with coroutine-gthread

2011-08-08 Thread Aneesh Kumar K.V
On Mon, 8 Aug 2011 10:29:04 +0100, Stefan Hajnoczi  wrote:
> On Mon, Aug 8, 2011 at 10:01 AM, Aneesh Kumar K.V
>  wrote:
> >
> >
> >  LINK  qemu-ga
> > coroutine-gthread.o: In function `coroutine_init':
> > /home/opensource/sources/qemu/qemu-upstream/coroutine-gthread.c:39: 
> > undefined reference to `g_thread_init'
> > collect2: ld returned 1 exit status
> >
> > The patch below fixes the failure.  I also added the patch in the VirtFS
> > pull request.
> 
> It appears coroutine-core v7 was merged instead of v8.  The
> differences are minor:
>  * Bisectability: introduce gthread implementation before ucontext/fibers
>  * Depend on gthread, not just glib, for multithreaded programming
> 
> Here is the patchwork link to the gthread dependency patch in case a
> committer wants to merge it immediately:
> http://patchwork.ozlabs.org/patch/106810/
> 

I added a version that applies to master in the pull request

http://article.gmane.org/gmane.comp.emulators.qemu/112648

Message-id:"87bow09sw1@skywalker.in.ibm.com"

http://repo.or.cz/w/qemu/v9fs.git/commit/4b76a481ee28166d5f415ef97833c624f4fc0792

-aneesh



Re: [Qemu-devel] [RFC] postcopy livemigration proposal

2011-08-08 Thread Nadav Har'El
> >* What is postcopy livemigration
> >It is yet another live migration mechanism for Qemu/KVM, which
> >implements the migration technique known as "postcopy" or "lazy"
> >migration. Just after the "migrate" command is invoked, the execution
> >host of a VM is instantaneously switched to a destination host.

Sounds like a cool idea.

> >The benefit is that total migration time is shorter, because it transfers
> >each page only once. Precopy, on the other hand, may send the same pages
> >again and again because they can be dirtied.
> >The switching time from the source to the destination is several
> >hundred milliseconds, so it enables quick load balancing.
> >For details, please refer to the papers.

While these are the obvious benefits, the possible downside (that, as
always, depends on the workload) is the amount of time that the guest
workload runs more slowly than usual, waiting for pages it needs to
continue. There is a whole spectrum between the guest pausing completely
(which would solve all the problems of migration, but is often considered
unacceptable) and running at full speed. Is it acceptable that the guest
runs at 90% speed during the migration? 50%? 10%?
I guess we have nothing to lose from having both options, and choosing
the most appropriate technique for each guest!

> That's terrific  (nice video also)!
> Orit and myself had the exact same idea too (now we can't patent it..).

I think a new implementation is not the only reason why you cannot patent
this idea :-) Demand-paged migration has actually been discussed (and done)
for nearly a quarter of a century (!) in the area of *process* migration.

The first use I'm aware of was in CMU's Accent in 1987 - see [1].
Another paper, [2], written in 1991, discusses how process migration is done
in UCB's Sprite operating system, and evaluates the various alternatives
common at the time (20 years ago), including what it calls "lazy copying",
which is more-or-less the same thing as "post copy". Mosix (a project which,
in some sense, is still alive today) also used some sort of cross between
pre-copying (of dirty pages) and copying on-demand of clean pages (from
their backing store on the source machine).


References
[1] "Attacking the Process Migration Bottleneck"
 http://www.nd.edu/~dthain/courses/cse598z/fall2004/papers/accent.pdf
[2]  "Transparent Process Migration: Design Alternatives and the Sprite
 Implementation"
 http://nd.edu/~dthain/courses/cse598z/fall2004/papers/sprite-migration.pdf

> Advantages:
> - Virtual machines are using more and more memory resources;
> for a virtual machine with a very large working set, doing live
> migration with reasonable down time is impossible today.

If a guest actually constantly uses (working set) most of its allocated
memory, it will basically be unable to do any significant amount of work
on the destination VM until this large working set is transfered to the
destination. So in this scenario, "post copying" doesn't give any
significant advantages over plain-old "pause guest and send it to the
destination". Or am I missing something?

> Disadvantages:
> - During the live migration the guest will run slower than in
> today's live migration. We need to remember that even today
> guests suffer a performance penalty on the source during the
> COW stage (memory copy).

I wonder if something like asynchronous page faults can help somewhat with
multi-process guest workloads (and modified (PV) guest OS). 

> - Failure of the source or destination or the network will cause
> us to lose the running virtual machine. Those failures are very
> rare.

How is this different from a VM running on a single machine that fails?
Just that the small probability of failure (roughly) doubles for the
relatively-short duration of the transfer?


-- 
Nadav Har'El|   Monday, Aug  8 2011, 8 Av 5771
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |If glory comes after death, I'm not in a
http://nadav.harel.org.il   |hurry. (Latin proverb)



Re: [Qemu-devel] [PATCH 2/7] Replace VMSTOP macros with a proper QemuState type

2011-08-08 Thread Avi Kivity

On 08/04/2011 01:17 PM, Jan Kiszka wrote:

>
>  Why "QemuState"?  In general, "qemu" can be inferred from the fact that
>  we're in qemu.git.  Suggest "RunState".
>
>  Second, these states can coexist.  A user may pause the VM
>  simultaneously with the watchdog firing or entering premigrate state.
>  In fact, with multiple monitors, each monitor can pause and resume the
>  vm independently.
>
>  So I think we should keep a reference count instead of just a start/stop
>  state.  Perhaps
>
>  vm_stop(QemuState s)
>  {
>  ++stopcount[s];
>  }
>
>  vm_is_stopped()
>  {
>  for (s in states)
>if (stopcount[s])
>return true;
>  return false;
>  }

I don't think this makes sense nor is user-friendly. If one command
channel suspends the machine, others have the chance to subscribe for
that event.


It's inherently racy.


Maintaining a suspension counter would mean you also need a
channel to query its value.


Why?


IMHO, there is also no use for defining stopped orthogonally to
premigrate and other states that imply that the machine is stopped.
Basically they mean "stopped for/because of X". We just need to avoid
that you can enter plain stopped state from them by issuing the
corresponding monitor command. The other way around might be possible,
though, if there are race windows.



I'm worried about the following race:

  stop
  (qemu stopped for internal reason)
  stop command processed

  resume

The (qemu stopped for internal reason) part is lost.
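
Rendering the counting scheme above as a compilable sketch (the state
names are illustrative) makes the point concrete: a stop held for one
reason cannot be clobbered by a resume issued for another.

#include <stdbool.h>

typedef enum {
    RUN_STOP_USER,        /* explicit monitor "stop" */
    RUN_STOP_WATCHDOG,    /* watchdog fired */
    RUN_STOP_PREMIGRATE,  /* migration about to start */
    RUN_STOP_MAX
} RunStopReason;

static int stopcount[RUN_STOP_MAX];

static void vm_stop(RunStopReason r)
{
    stopcount[r]++;
}

static void vm_resume(RunStopReason r)
{
    if (stopcount[r]) {
        stopcount[r]--;
    }
}

/* The VM may run only when no reason at all is outstanding. */
static bool vm_is_stopped(void)
{
    for (int r = 0; r < RUN_STOP_MAX; r++) {
        if (stopcount[r]) {
            return true;
        }
    }
    return false;
}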

--
error compiling committee.c: too many arguments to function




[Qemu-devel] [PATCH] posix-aio-compat: fix latency issues

2011-08-08 Thread Avi Kivity
In certain circumstances, posix-aio-compat can incur a lot of latency:
 - threads are created by vcpu threads, so if vcpu affinity is set,
   aio threads inherit vcpu affinity.  This can cause many aio threads
   to compete for one cpu.
 - we can create up to max_threads (64) aio threads in one go; since a
   pthread_create can take around 30μs, we have up to 2ms of cpu time
   under a global lock.

Fix by:
 - moving thread creation to the main thread, so we inherit the main
   thread's affinity instead of the vcpu thread's affinity.
 - if a thread is currently being created and we need to create yet
   another thread, let the thread being born create the new thread,
   reducing the amount of time we spend in the main thread.
 - drop the local lock while creating a thread (we may still hold the
   global mutex, though)

Note this doesn't eliminate latency completely; scheduler artifacts or
lack of host cpu resources can still cause it.  We may want pre-allocated
threads when this cannot be tolerated.

Thanks to Uli Obergfell of Red Hat for his excellent analysis and suggestions.

Signed-off-by: Avi Kivity 
---
 posix-aio-compat.c |   48 ++--
 1 files changed, 46 insertions(+), 2 deletions(-)

diff --git a/posix-aio-compat.c b/posix-aio-compat.c
index 8dc00cb..aa30673 100644
--- a/posix-aio-compat.c
+++ b/posix-aio-compat.c
@@ -30,6 +30,7 @@
 
 #include "block/raw-posix-aio.h"
 
+static void do_spawn_thread(void);
 
 struct qemu_paiocb {
 BlockDriverAIOCB common;
@@ -64,6 +65,9 @@ static pthread_attr_t attr;
 static int max_threads = 64;
 static int cur_threads = 0;
 static int idle_threads = 0;
+static int new_threads = 0; /* backlog of threads we need to create */
+static int pending_threads = 0; /* threads created but not running yet */
+static QEMUBH *new_thread_bh;
 static QTAILQ_HEAD(, qemu_paiocb) request_list;
 
 #ifdef CONFIG_PREADV
@@ -311,6 +315,13 @@ static void *aio_thread(void *unused)
 
 pid = getpid();
 
+mutex_lock(&lock);
+if (new_threads) {
+do_spawn_thread();
+}
+pending_threads--;
+mutex_unlock(&lock);
+
 while (1) {
 struct qemu_paiocb *aiocb;
 ssize_t ret = 0;
@@ -381,11 +392,18 @@ static void *aio_thread(void *unused)
 return NULL;
 }
 
-static void spawn_thread(void)
+static void do_spawn_thread(void)
 {
 sigset_t set, oldset;
 
-cur_threads++;
+if (!new_threads) {
+return;
+}
+
+new_threads--;
+pending_threads++;
+
+mutex_unlock(&lock);
 
 /* block all signals */
 if (sigfillset(&set)) die("sigfillset");
@@ -394,6 +412,31 @@ static void spawn_thread(void)
 thread_create(&thread_id, &attr, aio_thread, NULL);
 
 if (sigprocmask(SIG_SETMASK, &oldset, NULL)) die("sigprocmask restore");
+
+mutex_lock(&lock);
+}
+
+static void spawn_thread_bh_fn(void *opaque)
+{
+mutex_lock(&lock);
+do_spawn_thread();
+mutex_unlock(&lock);
+}
+
+static void spawn_thread(void)
+{
+cur_threads++;
+new_threads++;
+/* If there are threads being created, they will spawn new workers, so
+ * we don't spend time creating many threads in a loop holding a mutex or
+ * starving the current vcpu.
+ *
+ * If there are no idle threads, ask the main thread to create one, so we
+ * inherit the correct affinity instead of the vcpu affinity.
+ */
+if (!pending_threads) {
+qemu_bh_schedule(new_thread_bh);
+}
 }
 
 static void qemu_paio_submit(struct qemu_paiocb *aiocb)
@@ -665,6 +708,7 @@ int paio_init(void)
 die2(ret, "pthread_attr_setdetachstate");
 
 QTAILQ_INIT(&request_list);
+new_thread_bh = qemu_bh_new(spawn_thread_bh_fn, NULL);
 
 posix_aio_state = s;
 return 0;
-- 
1.7.5.3




Re: [Qemu-devel] [RFC] postcopy livemigration proposal

2011-08-08 Thread Dor Laor

On 08/08/2011 01:59 PM, Nadav Har'El wrote:

* What is postcopy livemigration
It is yet another live migration mechanism for Qemu/KVM, which
implements the migration technique known as "postcopy" or "lazy"
migration. Just after the "migrate" command is invoked, the execution
host of a VM is instantaneously switched to a destination host.


Sounds like a cool idea.


The benefit is that total migration time is shorter, because it transfers
each page only once. Precopy, on the other hand, may send the same pages
again and again because they can be dirtied.
The switching time from the source to the destination is several
hundred milliseconds, so it enables quick load balancing.
For details, please refer to the papers.


While these are the obvious benefits, the possible downside (that, as
always, depends on the workload) is the amount of time that the guest
workload runs more slowly than usual, waiting for pages it needs to
continue. There is a whole spectrum between the guest pausing completely
(which would solve all the problems of migration, but is often considered
unacceptable) and running at full speed. Is it acceptable that the guest
runs at 90% speed during the migration? 50%? 10%?
I guess we have nothing to lose from having both options, and choosing
the most appropriate technique for each guest!


+1




That's terrific  (nice video also)!
Orit and myself had the exact same idea too (now we can't patent it..).


I think a new implementation is not the only reason why you cannot patent
this idea :-) Demand-paged migration has actually been discussed (and done)
for nearly a quarter of a century (!) in the area of *process* migration.

The first use I'm aware of was in CMU's Accent in 1987 - see [1].
Another paper, [2], written in 1991, discusses how process migration is done
in UCB's Sprite operating system, and evaluates the various alternatives
common at the time (20 years ago), including what it calls "lazy copying",
which is more-or-less the same thing as "post copy". Mosix (a project which,
in some sense, is still alive today) also used some sort of cross between
pre-copying (of dirty pages) and copying on-demand of clean pages (from
their backing store on the source machine).


References
[1] "Attacking the Process Migration Bottleneck"
  http://www.nd.edu/~dthain/courses/cse598z/fall2004/papers/accent.pdf


w/o reading the internals, patents enable you to implement an existing 
idea in a new field. Anyway, there won't be any patent in this case. 
Still, let's have the kvm innovation merged.



[2]  "Transparent Process Migration: Design Alternatives and the Sprite
  Implementation"
  http://nd.edu/~dthain/courses/cse598z/fall2004/papers/sprite-migration.pdf


Advantages:
 - Virtual machines are using more and more memory resources;
 for a virtual machine with a very large working set, doing live
 migration with reasonable down time is impossible today.


If a guest actually constantly uses (working set) most of its allocated
memory, it will basically be unable to do any significant amount of work
on the destination VM until this large working set is transferred to the
destination. So in this scenario, "post copying" doesn't give any
significant advantages over plain-old "pause guest and send it to the
destination". Or am I missing something?


There is one key advantage in this scheme/use case - if you have a guest 
with a very large working set, you'll need a very large downtime in 
order to migrate it with today's algorithm. With post copy (aka 
streaming/demand paging), the guest won't have any downtime but will run 
slower than expected.


There are guests today that are impractical to really live migrate.

btw: Even today, marking pages RO also carries some performance penalty.




Disadvantages:
 - During the live migration the guest will run slower than in
 today's live migration. We need to remember that even today
 guests suffer from a performance penalty on the source during the
 COW stage (memory copy).


I wonder if something like asynchronous page faults can help somewhat with
multi-process guest workloads (and modified (PV) guest OS).


They should come into play to some extent. Note that only newer Linux
guests will benefit from them.





 - Failure of the source or destination or the network will cause
 us to lose the running virtual machine. Those failures are very
 rare.


How is this different from a VM running on a single machine that fails?
Just that the small probability of failure (roughly) doubles for the
relatively-short duration of the transfer?


Exactly my point; this is not a major disadvantage because of the low
probability.





[Qemu-devel] [PATCH] qemu-img: Use qemu_blockalign

2011-08-08 Thread Kevin Wolf
Now that you can use cache=none for the output file in qemu-img, we should
properly align our buffers so that raw-posix doesn't have to use its (smaller)
bounce buffer.

Signed-off-by: Kevin Wolf 
---
 qemu-img.c |   12 ++--
 1 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/qemu-img.c b/qemu-img.c
index 5ee207b..27e07e9 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -814,7 +814,7 @@ static int img_convert(int argc, char **argv)
 bs_i = 0;
 bs_offset = 0;
 bdrv_get_geometry(bs[0], &bs_sectors);
-buf = qemu_malloc(IO_BUF_SIZE);
+buf = qemu_blockalign(out_bs, IO_BUF_SIZE);
 
 if (compress) {
 ret = bdrv_get_info(out_bs, &bdi);
@@ -986,7 +986,7 @@ out:
 qemu_progress_end();
 free_option_parameters(create_options);
 free_option_parameters(param);
-qemu_free(buf);
+qemu_vfree(buf);
 if (out_bs) {
 bdrv_delete(out_bs);
 }
@@ -1353,8 +1353,8 @@ static int img_rebase(int argc, char **argv)
 uint8_t * buf_new;
 float local_progress;
 
-buf_old = qemu_malloc(IO_BUF_SIZE);
-buf_new = qemu_malloc(IO_BUF_SIZE);
+buf_old = qemu_blockalign(bs, IO_BUF_SIZE);
+buf_new = qemu_blockalign(bs, IO_BUF_SIZE);
 
 bdrv_get_geometry(bs, &num_sectors);
 
@@ -1410,8 +1410,8 @@ static int img_rebase(int argc, char **argv)
 qemu_progress_print(local_progress, 100);
 }
 
-qemu_free(buf_old);
-qemu_free(buf_new);
+qemu_vfree(buf_old);
+qemu_vfree(buf_new);
 }
 
 /*
-- 
1.7.6
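
As background for the change: qemu_blockalign() amounts to roughly the
following sketch (an approximation for illustration; the real helper may
differ in detail):

void *qemu_blockalign(BlockDriverState *bs, size_t size)
{
    /* align to the device's preferred alignment, falling back to 512 */
    size_t align = (bs && bs->buffer_alignment) ? bs->buffer_alignment : 512;
    return qemu_memalign(align, size);
}

Buffers obtained this way are released with qemu_vfree() rather than
qemu_free(), which is why the patch swaps both the allocation and the free.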




Re: [Qemu-devel] [PATCH] qcow2: fix typo in documentation for qcow2_get_cluster_offset()

2011-08-08 Thread Kevin Wolf
Am 08.08.2011 01:47, schrieb Devin Nakamura:
> Documentation states the num is measured in clusters, but it's
> actually measured in sectors
> 
> Signed-off-by: Devin Nakamura 

Thanks, applied to the block branch.

Kevin



Re: [Qemu-devel] [PATCH] posix-aio-compat: fix latency issues

2011-08-08 Thread Anthony Liguori

On 08/08/2011 06:37 AM, Avi Kivity wrote:

In certain circumstances, posix-aio-compat can incur a lot of latency:
  - threads are created by vcpu threads, so if vcpu affinity is set,
aio threads inherit vcpu affinity.  This can cause many aio threads
to compete for one cpu.
  - we can create up to max_threads (64) aio threads in one go; since a
pthread_create can take around 30μs, we have up to 2ms of cpu time
under a global lock.

Fix by:
  - moving thread creation to the main thread, so we inherit the main
thread's affinity instead of the vcpu thread's affinity.
  - if a thread is currently being created, and we need to create yet
another thread, let thread being born create the new thread, reducing
the amount of time we spend under the main thread.
  - drop the local lock while creating a thread (we may still hold the
global mutex, though)

Note this doesn't eliminate latency completely; scheduler artifacts or
lack of host cpu resources can still cause it.  We may want pre-allocated
threads when this cannot be tolerated.

Thanks to Uli Obergfell of Red Hat for his excellent analysis and suggestions.


Do you have a scenario where you can measure the benefits of this
change?  The idle time in the thread pool is rather large; it surprises
me that it'd be an issue in practice.


Regards,

Anthony Liguori



Signed-off-by: Avi Kivity
---
  posix-aio-compat.c |   48 ++--
  1 files changed, 46 insertions(+), 2 deletions(-)

diff --git a/posix-aio-compat.c b/posix-aio-compat.c
index 8dc00cb..aa30673 100644
--- a/posix-aio-compat.c
+++ b/posix-aio-compat.c
@@ -30,6 +30,7 @@

  #include "block/raw-posix-aio.h"

+static void do_spawn_thread(void);

  struct qemu_paiocb {
  BlockDriverAIOCB common;
@@ -64,6 +65,9 @@ static pthread_attr_t attr;
  static int max_threads = 64;
  static int cur_threads = 0;
  static int idle_threads = 0;
+static int new_threads = 0; /* backlog of threads we need to create */
+static int pending_threads = 0; /* threads created but not running yet */
+static QEMUBH *new_thread_bh;
  static QTAILQ_HEAD(, qemu_paiocb) request_list;

  #ifdef CONFIG_PREADV
@@ -311,6 +315,13 @@ static void *aio_thread(void *unused)

  pid = getpid();

+mutex_lock(&lock);
+if (new_threads) {
+do_spawn_thread();
+}
+pending_threads--;
+mutex_unlock(&lock);
+
  while (1) {
  struct qemu_paiocb *aiocb;
  ssize_t ret = 0;
@@ -381,11 +392,18 @@ static void *aio_thread(void *unused)
  return NULL;
  }

-static void spawn_thread(void)
+static void do_spawn_thread(void)
  {
  sigset_t set, oldset;

-cur_threads++;
+if (!new_threads) {
+return;
+}
+
+new_threads--;
+pending_threads++;
+
+mutex_unlock(&lock);

  /* block all signals */
  if (sigfillset(&set)) die("sigfillset");
@@ -394,6 +412,31 @@ static void spawn_thread(void)
  thread_create(&thread_id,&attr, aio_thread, NULL);

  if (sigprocmask(SIG_SETMASK,&oldset, NULL)) die("sigprocmask restore");
+
+mutex_lock(&lock);
+}
+
+static void spawn_thread_bh_fn(void *opaque)
+{
+mutex_lock(&lock);
+do_spawn_thread();
+mutex_unlock(&lock);
+}
+
+static void spawn_thread(void)
+{
+cur_threads++;
+new_threads++;
+/* If there are threads being created, they will spawn new workers, so
+ * we don't spend time creating many threads in a loop holding a mutex or
+ * starving the current vcpu.
+ *
+ * If there are no idle threads, ask the main thread to create one, so we
+ * inherit the correct affinity instead of the vcpu affinity.
+ */
+if (!pending_threads) {
+qemu_bh_schedule(new_thread_bh);
+}
  }

  static void qemu_paio_submit(struct qemu_paiocb *aiocb)
@@ -665,6 +708,7 @@ int paio_init(void)
  die2(ret, "pthread_attr_setdetachstate");

  QTAILQ_INIT(&request_list);
+new_thread_bh = qemu_bh_new(spawn_thread_bh_fn, NULL);

  posix_aio_state = s;
  return 0;
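
For readers unfamiliar with QEMU bottom halves, which this fix relies on: a
scheduled bottom half runs from the main loop thread, which is what lets the
newly created aio threads inherit the main thread's affinity. A minimal usage
sketch (names are illustrative):

static void my_bh_cb(void *opaque)
{
    /* runs in the main loop thread, not in the thread that scheduled it */
}

/* registered once at init time: */
QEMUBH *bh = qemu_bh_new(my_bh_cb, NULL);

/* later, e.g. from a vcpu thread under the global mutex: */
qemu_bh_schedule(bh);    /* my_bh_cb will be invoked from the main loop */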





Re: [Qemu-devel] [PULL] VirtFS coroutine changes

2011-08-08 Thread Anthony Liguori

On 08/08/2011 04:03 AM, Aneesh Kumar K.V wrote:


The following changes since commit 23ddf2bb1e4bfe2b72a726fe5e828807b65941ad:

   Fix forcing multicast msgs to loopback on OpenBSD. (2011-08-07 11:06:43 
+)


This needs to be sent to the list first as patches for review before you
can do a pull request.


Regards,

Anthony Liguori



Re: [Qemu-devel] [RFC] postcopy livemigration proposal

2011-08-08 Thread Avi Kivity

On 08/08/2011 06:24 AM, Isaku Yamahata wrote:

This mail is about "Yabusame: Postcopy Live Migration for Qemu/KVM",
on which we'll give a talk at KVM-forum.
The purpose of this mail is to let developers know about it in advance
so that we can get better feedback on its design/implementation approach
early, before we start to implement it.


Interesting; what is the impact of increased latency on memory reads?




There are several design points.
   - who takes care of pulling page contents.
 an independent daemon vs a thread in qemu
 The daemon approach is preferable because an independent daemon would
 make it easy to debug the postcopy memory mechanism without qemu.
 If required, it wouldn't be difficult to convert a daemon into
 a thread in qemu


Isn't this equivalent to touching each page in sequence?

Care must be taken that we don't post too many requests, or it could 
affect the latency of synchronous accesses by the guest.




   - connection between the source and the destination
 The connection for live migration can be re-used after sending machine
 state.

   - transfer protocol
 The protocol that exists today can be extended.

   - hooking guest RAM access
 Introduce a character device to handle page faults.
 When a page fault occurs, it queues a page request up to the user space
 daemon at the destination. The daemon pulls the page contents from the
 source and serves them into the character device. Then the page fault is
 resolved.
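
 A rough sketch of that flow (the device name, the request format and the
 page-fetch helper below are illustrative assumptions, not part of the
 proposal):

#include <fcntl.h>
#include <unistd.h>

struct umem_req {
    unsigned long gpfn;     /* faulting guest page frame number */
};

/* fetches a page over the migration connection; assumed for illustration */
extern void fetch_page_from_source(unsigned long gpfn, void *buf);

static void postcopy_daemon(void)
{
    struct umem_req req;
    char page[4096];
    int fd = open("/dev/umem", O_RDWR);  /* hypothetical device; qemu would
                                          * mmap() the same device to back
                                          * guest RAM */

    /* each read blocks until a guest page fault is queued by the kernel */
    while (read(fd, &req, sizeof(req)) == sizeof(req)) {
        fetch_page_from_source(req.gpfn, page);
        write(fd, page, sizeof(page));   /* serve the page, resolving the
                                          * fault */
    }
}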


This doesn't play well with host swapping, transparent hugepages, or 
ksm, does it?


I see you note this later on.


* More on hooking guest RAM access
There are several candidates for the implementation. Our preference is
the character device approach.

   - inserting hooks everywhere in qemu/kvm
 This is impractical.

   - backing store for guest ram
 a block device or a file can be used to back guest RAM,
 thus hooking guest RAM access.

 pros
 - a new device driver isn't needed.
 cons
 - future improvement would be difficult
 - some KVM host features (KSM, THP) wouldn't work

   - character device
 qemu mmap()s the dedicated character device, and page faults are then hooked.

 pros
 - straightforward approach
 - future improvement would be easy
 cons
 - a new driver is needed
 - some KVM host features (KSM, THP) wouldn't work
   They check if a given VMA is anonymous. This can be fixed.

   - swap device
 When creating the guest, it is set up as if all the guest RAM is swapped
 out to a dedicated swap device, which may be an nbd disk (or some kind of
 user space block device, BUSE?).
 When the VM tries to access memory, swap-in is triggered and IO to the
 swap device is issued. The IO to swap is then routed to the daemon
 in user space with the nbd protocol (or BUSE, AOE, iSCSI...). The daemon
 pulls pages from the migration source and services the IO request.

 pros
 - After the page transfer is complete, everything is the same as the
   normal case.
 - no new device driver is needed
 cons
 - future improvement would be difficult
 - administration: setting up nbd, swap device



Using a swap device would be my preference.  We'd still be using 
anonymous memory so thp/ksm/ordinary swap still work.


It would need to be a special kind of swap device since we only want to 
swap in, and never out, to that device.  We'd also need a special way of 
telling the kernel that memory comes from that device.  In that respect it's
similar to your second option.


Maybe we should use a backing file (using nbd) and have a madvise() call 
that converts the vma to anonymous memory once the migration is finished.
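
Something along these lines (MADV_CONVERT_ANON is a made-up flag; no such
interface exists today):

/* once the last page has been pulled from the source: */
if (madvise(guest_ram, ram_size, MADV_CONVERT_ANON) < 0) {
    perror("madvise");
}
/* from here on the VMA is ordinary anonymous memory again, so KSM, THP
 * and host swapping all apply as usual */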


--
error compiling committee.c: too many arguments to function




Re: [Qemu-devel] [PATCH] posix-aio-compat: fix latency issues

2011-08-08 Thread Avi Kivity

On 08/08/2011 03:34 PM, Anthony Liguori wrote:

On 08/08/2011 06:37 AM, Avi Kivity wrote:

In certain circumstances, posix-aio-compat can incur a lot of latency:
  - threads are created by vcpu threads, so if vcpu affinity is set,
aio threads inherit vcpu affinity.  This can cause many aio threads
to compete for one cpu.
  - we can create up to max_threads (64) aio threads in one go; since a
pthread_create can take around 30μs, we have up to 2ms of cpu time
under a global lock.

Fix by:
  - moving thread creation to the main thread, so we inherit the main
thread's affinity instead of the vcpu thread's affinity.
  - if a thread is currently being created, and we need to create yet
another thread, let thread being born create the new thread, 
reducing

the amount of time we spend under the main thread.
  - drop the local lock while creating a thread (we may still hold the
global mutex, though)

Note this doesn't eliminate latency completely; scheduler artifacts or
lack of host cpu resources can still cause it.  We may want 
pre-allocated

threads when this cannot be tolerated.

Thanks to Uli Obergfell of Red Hat for his excellent analysis and 
suggestions.


Do you have a scenario where you can measure the benefits of this change? 


It's a customer scenario, so I can't share it.  Not that I know exactly 
what happened there in terms of workload.


The idle time in the thread pool is rather large, it surprises me that 
it'd be an issue in practice.




Just starting up a virtio guest will fill the queue with > max_threads 
requests, and if the vcpu is pinned, all 64 thread creations and 
executions will have to run on the same cpu, and will likely preempt the 
vcpu since it's classified as a "cpu hog" by some schedulers.


--
error compiling committee.c: too many arguments to function




Re: [Qemu-devel] [RFC] postcopy livemigration proposal

2011-08-08 Thread Anthony Liguori

On 08/08/2011 04:20 AM, Dor Laor wrote:

On 08/08/2011 06:24 AM, Isaku Yamahata wrote:

This mail is about "Yabusame: Postcopy Live Migration for Qemu/KVM",
on which we'll give a talk at KVM-forum.
The purpose of this mail is to let developers know about it in advance
so that we can get better feedback on its design/implementation approach
early, before we start to implement it.


Background
==
* What is postcopy livemigration
It is yet another live migration mechanism for Qemu/KVM, which
implements the migration technique known as "postcopy" or "lazy"
migration. Just after the "migrate" command is invoked, the execution
host of a VM is instantaneously switched to a destination host.

The benefit is that total migration time is shorter, because each page is
transferred only once. Precopy, on the other hand, may send the same pages
again and again because they can be dirtied.
The switching time from the source to the destination is several
hundred milliseconds, which enables quick load balancing.
For details, please refer to the papers.

We believe this is useful for others, so we'd like to merge this
feature into the upstream qemu/kvm. The existing implementation that
we have right now is very ad-hoc because it's for academic research.
For the upstream merge, we're starting to re-design/implement it and
we'd like to get feedback early. Although many improvements/optimizations
are possible, we should first implement/merge a simple and clean, but also
extensible, version and then improve/optimize it later.

postcopy livemigration will be introduced as an optional feature. The
existing
precopy livemigration remains the default behavior.


* related links:
project page
http://sites.google.com/site/grivonhome/quick-kvm-migration

Enabling Instantaneous Relocation of Virtual Machines with a
Lightweight VMM Extension,
(proof-of-concept, ad-hoc prototype. not a new design)
http://grivon.googlecode.com/svn/pub/docs/ccgrid2010-hirofuchi-paper.pdf
http://grivon.googlecode.com/svn/pub/docs/ccgrid2010-hirofuchi-talk.pdf

Reactive consolidation of virtual machines enabled by postcopy live
migration
(advantage for VM consolidation)
http://portal.acm.org/citation.cfm?id=1996125
http://www.emn.fr/x-info/ascola/lib/exe/fetch.php?media=internet:vtdc-postcopy.pdf


Qemu wiki
http://wiki.qemu.org/Features/PostCopyLiveMigration


Design/Implementation
=
The basic idea of postcopy livemigration is to use a sort of distributed
shared memory between the migration source and destination.

The migration procedure looks like
- start migration
stop the guest VM on the source and send the machine states except
guest RAM to the destination
- resume the guest VM on the destination without guest RAM contents
- Hook guest access to pages, and pull page contents from the source
This continues until all the pages are pulled to the destination

The big picture is depicted at
http://wiki.qemu.org/File:Postcopy-livemigration.png


That's terrific (nice video also)!
Orit and I had the exact same idea too (now we can't patent it..).

Advantages:
- No down time due to memory copying.


But non-deterministic down time due to network latency while trying to 
satisfy a page fault.



- Efficient: reduces needed traffic, since there is no need to re-send pages.


It's not quite that simple.  Post-copy needs to introduce a protocol 
capable of requesting pages.
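
The wire format for such a protocol could be as small as a request/response
pair; the structures below are purely illustrative, nothing like them exists
in the migration protocol today:

struct page_request {       /* destination -> source */
    uint64_t gpa;           /* guest-physical address of the needed page */
};

struct page_response {      /* source -> destination */
    uint64_t gpa;
    uint8_t  data[4096];    /* page contents; 4K page size assumed */
};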


I think in presenting something like this, it's important to collect 
quite a bit of performance data.  I'd suggest doing runs while running 
jitterd in the guest to attempt to quantify the actual downtime 
experienced too.


http://git.codemonkey.ws/cgit/jitterd.git/

There's a lot of potential in something like this, but it's not obvious 
to me whether it's a net win.  Should make for a very interesting 
presentation :-)



- Reduce overall RAM consumption of the source and destination,
as opposed to current live migration (both the source and the
destination allocate the memory until the live migration
completes). We can free copied memory once the destination guest
has received it and save RAM.
- Increase parallelism: for SMP guests we can have multiple
virtual CPUs handle their demand paging. Less time holding a
global lock, less thread contention.
- Virtual machines are using more and more memory resources;
for a virtual machine with a very large working set, doing live
migration with reasonable downtime is impossible today.


This is really just a limitation of our implementation.  In theory, 
pre-copy allows you to exert fine-grained resource control over the guest,
which you can use to encourage convergence.



Disadvantages:
- During the live migration the guest will run slower than in
today's live migration. We need to remember that even today
guests suffer from a performance penalty on the source during the
COW stage (memory copy).
- Failure of the source or destination or the network will cause
us to lose the running virtual machine. Those failures are very
rare.
In case

Re: [Qemu-devel] [PATCH v3 01/39] virtio-pci: get config on init

2011-08-08 Thread Anthony Liguori

On 08/08/2011 05:36 AM, Michael S. Tsirkin wrote:

Thinking more closely, I don't think this is right.

Updating on map ensured that the config was refreshed after each
time the bar was mapped.  In the very least, the config needs to be
refreshed during reset because the guest may write to the guest
space which should get cleared after reset.

Regards,

Anthony Liguori


Not sure I understand. Which register, for example,
do you have in mind?
Could you clarify please?


Actually, you never need to call config_get() AFAICT.  It's called in 
every read/write access.  So I think the code you changed is extraneous now.


Regards,

Anthony Liguori




  }

  static int virtio_blk_init_pci(PCIDevice *pci_dev)







Re: [Qemu-devel] [PATCH v3 01/39] virtio-pci: get config on init

2011-08-08 Thread Avi Kivity

On 08/08/2011 03:45 PM, Anthony Liguori wrote:


Actually, you never need to call config_get() AFAICT.  It's called in 
every read/write access.  So I think the code you changed is 
extraneous now.


Ok; I'll drop this patch and report (and just remove the code in 
virtio_map()).


--
error compiling committee.c: too many arguments to function




Re: [Qemu-devel] [PATCH] posix-aio-compat: fix latency issues

2011-08-08 Thread Frediano Ziglio
2011/8/8 Avi Kivity :
> In certain circumstances, posix-aio-compat can incur a lot of latency:
>  - threads are created by vcpu threads, so if vcpu affinity is set,
>   aio threads inherit vcpu affinity.  This can cause many aio threads
>   to compete for one cpu.
>  - we can create up to max_threads (64) aio threads in one go; since a
>   pthread_create can take around 30μs, we have up to 2ms of cpu time
>   under a global lock.
>
> Fix by:
>  - moving thread creation to the main thread, so we inherit the main
>   thread's affinity instead of the vcpu thread's affinity.
>  - if a thread is currently being created, and we need to create yet
>   another thread, let thread being born create the new thread, reducing
>   the amount of time we spend under the main thread.
>  - drop the local lock while creating a thread (we may still hold the
>   global mutex, though)
>
> Note this doesn't eliminate latency completely; scheduler artifacts or
> lack of host cpu resources can still cause it.  We may want pre-allocated
> threads when this cannot be tolerated.
>
> Thanks to Uli Obergfell of Red Hat for his excellent analysis and suggestions.
>
> Signed-off-by: Avi Kivity 

Why not call pthread_attr_setaffinity_np (where available) before
thread creation, or sched_setaffinity at thread start, instead of telling
another thread to create a thread for us just to get the affinity cleared?

Regards
  Frediano

> ---
>  posix-aio-compat.c |   48 ++--
>  1 files changed, 46 insertions(+), 2 deletions(-)
>
> diff --git a/posix-aio-compat.c b/posix-aio-compat.c
> index 8dc00cb..aa30673 100644
> --- a/posix-aio-compat.c
> +++ b/posix-aio-compat.c
> @@ -30,6 +30,7 @@
>
>  #include "block/raw-posix-aio.h"
>
> +static void do_spawn_thread(void);
>
>  struct qemu_paiocb {
>     BlockDriverAIOCB common;
> @@ -64,6 +65,9 @@ static pthread_attr_t attr;
>  static int max_threads = 64;
>  static int cur_threads = 0;
>  static int idle_threads = 0;
> +static int new_threads = 0;     /* backlog of threads we need to create */
> +static int pending_threads = 0; /* threads created but not running yet */
> +static QEMUBH *new_thread_bh;
>  static QTAILQ_HEAD(, qemu_paiocb) request_list;
>
>  #ifdef CONFIG_PREADV
> @@ -311,6 +315,13 @@ static void *aio_thread(void *unused)
>
>     pid = getpid();
>
> +    mutex_lock(&lock);
> +    if (new_threads) {
> +        do_spawn_thread();
> +    }
> +    pending_threads--;
> +    mutex_unlock(&lock);
> +
>     while (1) {
>         struct qemu_paiocb *aiocb;
>         ssize_t ret = 0;
> @@ -381,11 +392,18 @@ static void *aio_thread(void *unused)
>     return NULL;
>  }
>
> -static void spawn_thread(void)
> +static void do_spawn_thread(void)
>  {
>     sigset_t set, oldset;
>
> -    cur_threads++;
> +    if (!new_threads) {
> +        return;
> +    }
> +
> +    new_threads--;
> +    pending_threads++;
> +
> +    mutex_unlock(&lock);
>
>     /* block all signals */
>     if (sigfillset(&set)) die("sigfillset");
> @@ -394,6 +412,31 @@ static void spawn_thread(void)
>     thread_create(&thread_id, &attr, aio_thread, NULL);
>
>     if (sigprocmask(SIG_SETMASK, &oldset, NULL)) die("sigprocmask restore");
> +
> +    mutex_lock(&lock);
> +}
> +
> +static void spawn_thread_bh_fn(void *opaque)
> +{
> +    mutex_lock(&lock);
> +    do_spawn_thread();
> +    mutex_unlock(&lock);
> +}
> +
> +static void spawn_thread(void)
> +{
> +    cur_threads++;
> +    new_threads++;
> +    /* If there are threads being created, they will spawn new workers, so
> +     * we don't spend time creating many threads in a loop holding a mutex or
> +     * starving the current vcpu.
> +     *
> +     * If there are no idle threads, ask the main thread to create one, so we
> +     * inherit the correct affinity instead of the vcpu affinity.
> +     */
> +    if (!pending_threads) {
> +        qemu_bh_schedule(new_thread_bh);
> +    }
>  }
>
>  static void qemu_paio_submit(struct qemu_paiocb *aiocb)
> @@ -665,6 +708,7 @@ int paio_init(void)
>         die2(ret, "pthread_attr_setdetachstate");
>
>     QTAILQ_INIT(&request_list);
> +    new_thread_bh = qemu_bh_new(spawn_thread_bh_fn, NULL);
>
>     posix_aio_state = s;
>     return 0;
> --
> 1.7.5.3
>
>
>



Re: [Qemu-devel] [PATCH 2/3] usb-redir: Call qemu_chr_guest_open/close

2011-08-08 Thread Anthony Liguori

On 08/08/2011 03:01 AM, Hans de Goede wrote:

Hi,

On 08/07/2011 11:30 PM, Anthony Liguori wrote:

On 08/07/2011 12:41 PM, Hans de Goede wrote:

Hi,

On 08/07/2011 05:52 PM, Anthony Liguori wrote:

On 08/07/2011 08:21 AM, Hans de Goede wrote:

To let the chardev now we're ready start receiving data. This is
necessary
with the spicevmc chardev to get it registered with the spice-server.

Signed-off-by: Hans de Goede
---
usb-redir.c | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/usb-redir.c b/usb-redir.c
index e212993..ec88c0b 100644
--- a/usb-redir.c
+++ b/usb-redir.c
@@ -809,6 +809,8 @@ static int usbredir_initfn(USBDevice *udev)

qemu_chr_add_handlers(dev->cs, usbredir_chardev_can_read,
usbredir_chardev_read, usbredir_chardev_event, dev);
+ /* Let the other side know we are ready */
+ qemu_chr_guest_open(dev->cs);



You should do guest_open before adding handlers.


Erm, no, guest_open may lead to a callback in the
chardev, to which it may respond by immediately queuing a few writes /
doing a read.


So after my char-flow changes, you won't be allowed to set handlers
unless you've called open.



Why not do it the other way around? So don't allow open until the
handlers are set. My reasoning
behind this is that eventually we will want to have a struct describing
a pipe endpoint, which
will contain handlers (by then identical for both sides) and besides the
struct a priv / user_data
pointer which will get passed by the handlers when called.

Then we will have a chardev_create or pipe_create call which will take a
struct + user data ptr
for both ends (so twice). This matches what currently our set handlers
call does. But I would
expect the open to come after the creation of the pipe.


BTW, I'm 90% of the way there in my queue:

http://repo.or.cz/w/qemu/aliguori.git/shortlog/refs/heads/char-flow

My plan is to have a CharPipe structure that has two CharDriverStates 
embedded in it.  The backend/frontends need to attach themselves to the 
CharDriverState.  I see that as open().




At least to me it is much more logical to first set the handlers (which
are really part
of object creation) and then later do the open, this matches the common
programming
paradigm of having an init/create function and an open function.


But you need to change the handlers all of the time to implement flow 
control.  Today we overload the setting of handlers to have semantic 
meaning beyond setting the callbacks for various events.


The paradigm I think of is open()'ing a file, and then select()'ing on a 
file descriptor.



Also forcing the set handlers after the open does not work well with
virtio_console, as these
are not open until the port inside the guest is opened. So then it would
need to delay its
set handlers till the first open,


Right, what's the problem with this?

 and what should it do at close, do a

set handlers NULL
before doing the actual close ??


No, close will automatically remove any added handlers.

Regards,

Anthony Liguori



Regards,

Hans






Re: [Qemu-devel] [PATCH] posix-aio-compat: fix latency issues

2011-08-08 Thread Avi Kivity

On 08/08/2011 03:49 PM, Frediano Ziglio wrote:

2011/8/8 Avi Kivity:
>  In certain circumstances, posix-aio-compat can incur a lot of latency:
>- threads are created by vcpu threads, so if vcpu affinity is set,
> aio threads inherit vcpu affinity.  This can cause many aio threads
> to compete for one cpu.
>- we can create up to max_threads (64) aio threads in one go; since a
> pthread_create can take around 30μs, we have up to 2ms of cpu time
> under a global lock.
>
>  Fix by:
>- moving thread creation to the main thread, so we inherit the main
> thread's affinity instead of the vcpu thread's affinity.
>- if a thread is currently being created, and we need to create yet
> another thread, let thread being born create the new thread, reducing
> the amount of time we spend under the main thread.
>- drop the local lock while creating a thread (we may still hold the
> global mutex, though)
>
>  Note this doesn't eliminate latency completely; scheduler artifacts or
>  lack of host cpu resources can still cause it.  We may want pre-allocated
>  threads when this cannot be tolerated.
>
>  Thanks to Uli Obergfell of Red Hat for his excellent analysis and 
suggestions.
>
>  Signed-off-by: Avi Kivity

Why not call pthread_attr_setaffinity_np (where available) before
thread creation, or sched_setaffinity at thread start, instead of telling
another thread to create a thread for us just to get the affinity cleared?



The entire qemu process may be affined to a subset of the host cpus; we 
don't want to break that.


For example:

   taskset 0xf0 qemu 
   (qemu) info cpus



--
error compiling committee.c: too many arguments to function




Re: [Qemu-devel] [PATCH v3 01/39] virtio-pci: get config on init

2011-08-08 Thread Michael S. Tsirkin
On Mon, Aug 08, 2011 at 07:45:19AM -0500, Anthony Liguori wrote:
> On 08/08/2011 05:36 AM, Michael S. Tsirkin wrote:
> >>Thinking more closely, I don't think this is right.
> >>
> >>Updating on map ensured that the config was refreshed after each
> >>time the bar was mapped.  In the very least, the config needs to be
> >>refreshed during reset because the guest may write to the guest
> >>space which should get cleared after reset.
> >>
> >>Regards,
> >>
> >>Anthony Liguori
> >
> >Not sure I understand. Which register, for example,
> >do you have in mind?
> >Could you clarify please?
> 
> Actually, you never need to call config_get() AFAICT.  It's called
> in every read/write access.

Every read, yes. But every write? Are you sure?

>  So I think the code you changed is
> extraneous now.
> 
> Regards,
> 
> Anthony Liguori


-- 
MST



[Qemu-devel] [PATCH] Introduce short names for fixed width integer types

2011-08-08 Thread Avi Kivity
QEMU deals with a lot of fixed width integer types; their names
(uint64_t etc) are clumsy to use and take up a lot of space.

Following Linux, introduce shorter names, for example U64 for
uint64_t.

Signed-off-by: Avi Kivity 
---
 qemu-common.h |9 +
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/qemu-common.h b/qemu-common.h
index 0fdecf1..52a2300 100644
--- a/qemu-common.h
+++ b/qemu-common.h
@@ -112,6 +112,15 @@ static inline char *realpath(const char *path, char 
*resolved_path)
 int qemu_main(int argc, char **argv, char **envp);
 #endif
 
+typedef int8_t   S8;
+typedef uint8_t  U8;
+typedef int16_t  S16;
+typedef uint16_t U16;
+typedef int32_t  S32;
+typedef uint32_t U32;
+typedef int64_t  S64;
+typedef uint64_t U64;
+
 /* bottom halves */
 typedef void QEMUBHFunc(void *opaque);
 
-- 
1.7.5.3
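
For a flavor of what this changes at use sites (purely illustrative):

/* before */
static uint64_t ivshmem_io_read(void *opaque, target_phys_addr_t addr,
                                unsigned size);

/* after */
static U64 ivshmem_io_read(void *opaque, target_phys_addr_t addr,
                           unsigned size);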




Re: [Qemu-devel] [PATCH] Introduce short names for fixed width integer types

2011-08-08 Thread Anthony Liguori

On 08/08/2011 07:56 AM, Avi Kivity wrote:

QEMU deals with a lot of fixed width integer types; their names
(uint64_t etc) are clumsy to use and take up a lot of space.

Following Linux, introduce shorter names, for example U64 for
uint64_t.


Except Linux uses lower case letters.

I personally think Linux style is wrong here.  The int8_t types are 
standard types.


Besides, we save lots of characters by using 4-space tabs instead of 
8-space tabs.  We can afford to spend some of those saved characters on 
using proper type names :-)


Regards,

Anthony Liguori



Signed-off-by: Avi Kivity
---
  qemu-common.h |9 +
  1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/qemu-common.h b/qemu-common.h
index 0fdecf1..52a2300 100644
--- a/qemu-common.h
+++ b/qemu-common.h
@@ -112,6 +112,15 @@ static inline char *realpath(const char *path, char 
*resolved_path)
  int qemu_main(int argc, char **argv, char **envp);
  #endif

+typedef int8_t   S8;
+typedef uint8_t  U8;
+typedef int16_t  S16;
+typedef uint16_t U16;
+typedef int32_t  S32;
+typedef uint32_t U32;
+typedef int64_t  S64;
+typedef uint64_t U64;
+
  /* bottom halves */
  typedef void QEMUBHFunc(void *opaque);






Re: [Qemu-devel] [PATCH 2/3] usb-redir: Call qemu_chr_guest_open/close

2011-08-08 Thread Hans de Goede

Hi,

On 08/08/2011 02:52 PM, Anthony Liguori wrote:

On 08/08/2011 03:01 AM, Hans de Goede wrote:





So after my char-flow changes, you won't be allowed to set handlers
unless you've called open.



Why not do it the other way around? So don't allow open until the
handlers are set. My reasoning
behind this is that eventually we will want to have a struct describing
a pipe endpoint, which
will contain handlers (by then identical for both sides) and besides the
struct a priv / user_data
pointer which will get passed by the handlers when called.

Then we will have a chardev_create or pipe_create call which will take a
struct + user data ptr
for both ends (so twice). This matches what currently our set handlers
call does. But I would
expect the open to come after the creation of the pipe.


BTW, I'm 90% of the way there in my queue:

http://repo.or.cz/w/qemu/aliguori.git/shortlog/refs/heads/char-flow

My plan is to have a CharPipe structure that has two CharDriverStates embedded 
in it. The backend/frontends need to attach themselves to the CharDriverState. 
I see that as open().



So the attaching will cause the other end to see an open() (if the
other end is already attached)? Or will there still be a separate
send open event call? And should that call be made before or after
the attach?

Regards,

Hans



Re: [Qemu-devel] [PATCH v3 01/39] virtio-pci: get config on init

2011-08-08 Thread Anthony Liguori

On 08/08/2011 07:56 AM, Michael S. Tsirkin wrote:

On Mon, Aug 08, 2011 at 07:45:19AM -0500, Anthony Liguori wrote:

On 08/08/2011 05:36 AM, Michael S. Tsirkin wrote:

Thinking more closely, I don't think this is right.

Updating on map ensured that the config was refreshed after each
time the bar was mapped.  In the very least, the config needs to be
refreshed during reset because the guest may write to the guest
space which should get cleared after reset.

Regards,

Anthony Liguori


Not sure I understand. Which register, for example,
do you have in mind?
Could you clarify please?


Actually, you never need to call config_get() AFAICT.  It's called
in every read/write access.


Every read, yes. But every write? Are you sure?


Yeah, not on write, but I think this is a bug.  get_config() should be 
called before doing the memcpy() in order to have a proper RMW.
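
In other words, the write path would want to look roughly like this (names
simplified; the actual virtio-pci code is structured differently, and
endianness handling is omitted):

static void virtio_config_write(VirtIODevice *vdev, uint32_t addr,
                                uint32_t val, unsigned size)
{
    vdev->get_config(vdev, vdev->config);     /* read: refresh from device */
    memcpy((uint8_t *)vdev->config + addr, &val, size);   /* modify */
    vdev->set_config(vdev, vdev->config);     /* write back to the device */
}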


Regards,

Anthony Liguori




  So I think the code you changed is
extraneous now.

Regards,

Anthony Liguori








Re: [Qemu-devel] [PATCH 2/3] usb-redir: Call qemu_chr_guest_open/close

2011-08-08 Thread Anthony Liguori

On 08/08/2011 08:03 AM, Hans de Goede wrote:

Hi,

On 08/08/2011 02:52 PM, Anthony Liguori wrote:

On 08/08/2011 03:01 AM, Hans de Goede wrote:





So after my char-flow changes, you won't be allowed to set handlers
unless you've called open.



Why not do it the other way around? So don't allow open until the
handlers are set. My reasoning
behind this is that eventually we will want to have a struct describing
a pipe endpoint, which
will contain handlers (by then identical for both sides) and besides the
struct a priv / user_data
pointer which will get passed by the handlers when called.

Then we will have a chardev_create or pipe_create call which will take a
struct + user data ptr
for both ends (so twice). This matches what currently our set handlers
call does. But I would
expect the open to come after the creation of the pipe.


BTW, I'm 90% of the way there in my queue:

http://repo.or.cz/w/qemu/aliguori.git/shortlog/refs/heads/char-flow

My plan is to have a CharPipe structure that has two CharDriverStates
embedded in it. The backend/frontends need to attach themselves to the
CharDriverState. I see that as open().



So the attaching will cause the other end to see an open() (if the
other end is already attached)? Or will there still be a separate
send open event call? And should that call be made before or after
the attach?


Doing an open() will result in an open event being generated.  Neither 
side should ever be directly involved in sending open/close events.


Regards,

Anthony Liguori



Regards,

Hans






[Qemu-devel] [PATCH v4 01/39] memory: rename PORTIO_END to PORTIO_END_OF_LIST

2011-08-08 Thread Avi Kivity
For consistency with other _END_OF_LIST macros.

Signed-off-by: Avi Kivity 
---
 memory.h |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/memory.h b/memory.h
index 4e518b2..da00a3b 100644
--- a/memory.h
+++ b/memory.h
@@ -133,7 +133,7 @@ struct MemoryRegionPortio {
 IOPortWriteFunc *write;
 };
 
-#define PORTIO_END { }
+#define PORTIO_END_OF_LIST() { }
 
 /**
  * memory_region_init: Initialize a memory region
-- 
1.7.5.3
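
Callers terminate their MemoryRegionPortio arrays with it, e.g. (handler
names are illustrative):

static const MemoryRegionPortio my_portio[] = {
    { 0, 8, 1, .read = my_readb, .write = my_writeb },
    PORTIO_END_OF_LIST(),
};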




[Qemu-devel] [PATCH v4 19/39] ivshmem: convert to memory API

2011-08-08 Thread Avi Kivity
excluding msix.

Reviewed-by: Richard Henderson 
Reviewed-by: Anthony Liguori 
Signed-off-by: Avi Kivity 
---
 hw/ivshmem.c |  148 --
 1 files changed, 50 insertions(+), 98 deletions(-)

diff --git a/hw/ivshmem.c b/hw/ivshmem.c
index 3055dd2..f80e7b6 100644
--- a/hw/ivshmem.c
+++ b/hw/ivshmem.c
@@ -56,11 +56,15 @@ typedef struct IVShmemState {
 
 CharDriverState **eventfd_chr;
 CharDriverState *server_chr;
-int ivshmem_mmio_io_addr;
+MemoryRegion ivshmem_mmio;
 
 pcibus_t mmio_addr;
-pcibus_t shm_pci_addr;
-uint64_t ivshmem_offset;
+/* We might need to register the BAR before we actually have the memory.
+ * So prepare a container MemoryRegion for the BAR immediately and
+ * add a subregion when we have the memory.
+ */
+MemoryRegion bar;
+MemoryRegion ivshmem;
 uint64_t ivshmem_size; /* size of shared memory region */
 int shm_fd; /* shared memory file descriptor */
 
@@ -96,23 +100,6 @@ static inline bool is_power_of_two(uint64_t x) {
 return (x & (x - 1)) == 0;
 }
 
-static void ivshmem_map(PCIDevice *pci_dev, int region_num,
-pcibus_t addr, pcibus_t size, int type)
-{
-IVShmemState *s = DO_UPCAST(IVShmemState, dev, pci_dev);
-
-s->shm_pci_addr = addr;
-
-if (s->ivshmem_offset > 0) {
-cpu_register_physical_memory(s->shm_pci_addr, s->ivshmem_size,
-s->ivshmem_offset);
-}
-
-IVSHMEM_DPRINTF("guest pci addr = %" FMT_PCIBUS ", guest h/w addr = %"
-PRIu64 ", size = %" FMT_PCIBUS "\n", addr, s->ivshmem_offset, size);
-
-}
-
 /* accessing registers - based on rtl8139 */
 static void ivshmem_update_irq(IVShmemState *s, int val)
 {
@@ -168,15 +155,8 @@ static uint32_t ivshmem_IntrStatus_read(IVShmemState *s)
 return ret;
 }
 
-static void ivshmem_io_writew(void *opaque, target_phys_addr_t addr,
-uint32_t val)
-{
-
-IVSHMEM_DPRINTF("We shouldn't be writing words\n");
-}
-
-static void ivshmem_io_writel(void *opaque, target_phys_addr_t addr,
-uint32_t val)
+static void ivshmem_io_write(void *opaque, target_phys_addr_t addr,
+ uint64_t val, unsigned size)
 {
 IVShmemState *s = opaque;
 
@@ -219,20 +199,8 @@ static void ivshmem_io_writel(void *opaque, 
target_phys_addr_t addr,
 }
 }
 
-static void ivshmem_io_writeb(void *opaque, target_phys_addr_t addr,
-uint32_t val)
-{
-IVSHMEM_DPRINTF("We shouldn't be writing bytes\n");
-}
-
-static uint32_t ivshmem_io_readw(void *opaque, target_phys_addr_t addr)
-{
-
-IVSHMEM_DPRINTF("We shouldn't be reading words\n");
-return 0;
-}
-
-static uint32_t ivshmem_io_readl(void *opaque, target_phys_addr_t addr)
+static uint64_t ivshmem_io_read(void *opaque, target_phys_addr_t addr,
+unsigned size)
 {
 
 IVShmemState *s = opaque;
@@ -265,23 +233,14 @@ static uint32_t ivshmem_io_readl(void *opaque, 
target_phys_addr_t addr)
 return ret;
 }
 
-static uint32_t ivshmem_io_readb(void *opaque, target_phys_addr_t addr)
-{
-IVSHMEM_DPRINTF("We shouldn't be reading bytes\n");
-
-return 0;
-}
-
-static CPUReadMemoryFunc * const ivshmem_mmio_read[3] = {
-ivshmem_io_readb,
-ivshmem_io_readw,
-ivshmem_io_readl,
-};
-
-static CPUWriteMemoryFunc * const ivshmem_mmio_write[3] = {
-ivshmem_io_writeb,
-ivshmem_io_writew,
-ivshmem_io_writel,
+static const MemoryRegionOps ivshmem_mmio_ops = {
+.read = ivshmem_io_read,
+.write = ivshmem_io_write,
+.endianness = DEVICE_NATIVE_ENDIAN,
+.impl = {
+.min_access_size = 4,
+.max_access_size = 4,
+},
 };
 
 static void ivshmem_receive(void *opaque, const uint8_t *buf, int size)
@@ -371,12 +330,12 @@ static void create_shared_memory_BAR(IVShmemState *s, int 
fd) {
 
 ptr = mmap(0, s->ivshmem_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
 
-s->ivshmem_offset = qemu_ram_alloc_from_ptr(&s->dev.qdev, "ivshmem.bar2",
-s->ivshmem_size, ptr);
+memory_region_init_ram_ptr(&s->ivshmem, &s->dev.qdev, "ivshmem.bar2",
+   s->ivshmem_size, ptr);
+memory_region_add_subregion(&s->bar, 0, &s->ivshmem);
 
 /* region for shared memory */
-pci_register_bar(&s->dev, 2, s->ivshmem_size,
-PCI_BASE_ADDRESS_SPACE_MEMORY, ivshmem_map);
+pci_register_bar_region(&s->dev, 2, PCI_BASE_ADDRESS_SPACE_MEMORY, 
&s->bar);
 }
 
 static void close_guest_eventfds(IVShmemState *s, int posn)
@@ -401,8 +360,12 @@ static void setup_ioeventfds(IVShmemState *s) {
 
 for (i = 0; i <= s->max_peer; i++) {
 for (j = 0; j < s->peers[i].nb_eventfds; j++) {
-kvm_set_ioeventf

[Qemu-devel] [PATCH v4 23/39] lsi53c895a: convert to memory API

2011-08-08 Thread Avi Kivity
An optimization that fast-pathed DMA reads from the SCRIPTS memory
was removed in the process.  Likely it breaks with iommus anyway.

Reviewed-by: Richard Henderson 
Reviewed-by: Anthony Liguori 
Signed-off-by: Avi Kivity 
---
 hw/lsi53c895a.c |  258 ---
 1 files changed, 56 insertions(+), 202 deletions(-)

diff --git a/hw/lsi53c895a.c b/hw/lsi53c895a.c
index e9904c4..0ab8c78 100644
--- a/hw/lsi53c895a.c
+++ b/hw/lsi53c895a.c
@@ -185,9 +185,9 @@ typedef struct lsi_request {
 
 typedef struct {
 PCIDevice dev;
-int mmio_io_addr;
-int ram_io_addr;
-uint32_t script_ram_base;
+MemoryRegion mmio_io;
+MemoryRegion ram_io;
+MemoryRegion io_io;
 
 int carry; /* ??? Should this be an a visible register somewhere?  */
 int status;
@@ -391,10 +391,9 @@ static inline uint32_t read_dword(LSIState *s, uint32_t 
addr)
 {
 uint32_t buf;
 
-/* Optimize reading from SCRIPTS RAM.  */
-if ((addr & 0xe000) == s->script_ram_base) {
-return s->script_ram[(addr & 0x1fff) >> 2];
-}
+/* XXX: an optimization here used to fast-path the read from scripts
+ * memory.  But that bypasses any iommu.
+ */
 cpu_physical_memory_read(addr, (uint8_t *)&buf, 4);
 return cpu_to_le32(buf);
 }
@@ -1899,232 +1898,90 @@ static void lsi_reg_writeb(LSIState *s, int offset, 
uint8_t val)
 #undef CASE_SET_REG32
 }
 
-static void lsi_mmio_writeb(void *opaque, target_phys_addr_t addr, uint32_t 
val)
+static void lsi_mmio_write(void *opaque, target_phys_addr_t addr,
+   uint64_t val, unsigned size)
 {
 LSIState *s = opaque;
 
 lsi_reg_writeb(s, addr & 0xff, val);
 }
 
-static void lsi_mmio_writew(void *opaque, target_phys_addr_t addr, uint32_t 
val)
-{
-LSIState *s = opaque;
-
-addr &= 0xff;
-lsi_reg_writeb(s, addr, val & 0xff);
-lsi_reg_writeb(s, addr + 1, (val >> 8) & 0xff);
-}
-
-static void lsi_mmio_writel(void *opaque, target_phys_addr_t addr, uint32_t 
val)
-{
-LSIState *s = opaque;
-
-addr &= 0xff;
-lsi_reg_writeb(s, addr, val & 0xff);
-lsi_reg_writeb(s, addr + 1, (val >> 8) & 0xff);
-lsi_reg_writeb(s, addr + 2, (val >> 16) & 0xff);
-lsi_reg_writeb(s, addr + 3, (val >> 24) & 0xff);
-}
-
-static uint32_t lsi_mmio_readb(void *opaque, target_phys_addr_t addr)
+static uint64_t lsi_mmio_read(void *opaque, target_phys_addr_t addr,
+  unsigned size)
 {
 LSIState *s = opaque;
 
 return lsi_reg_readb(s, addr & 0xff);
 }
 
-static uint32_t lsi_mmio_readw(void *opaque, target_phys_addr_t addr)
-{
-LSIState *s = opaque;
-uint32_t val;
-
-addr &= 0xff;
-val = lsi_reg_readb(s, addr);
-val |= lsi_reg_readb(s, addr + 1) << 8;
-return val;
-}
-
-static uint32_t lsi_mmio_readl(void *opaque, target_phys_addr_t addr)
-{
-LSIState *s = opaque;
-uint32_t val;
-addr &= 0xff;
-val = lsi_reg_readb(s, addr);
-val |= lsi_reg_readb(s, addr + 1) << 8;
-val |= lsi_reg_readb(s, addr + 2) << 16;
-val |= lsi_reg_readb(s, addr + 3) << 24;
-return val;
-}
-
-static CPUReadMemoryFunc * const lsi_mmio_readfn[3] = {
-lsi_mmio_readb,
-lsi_mmio_readw,
-lsi_mmio_readl,
-};
-
-static CPUWriteMemoryFunc * const lsi_mmio_writefn[3] = {
-lsi_mmio_writeb,
-lsi_mmio_writew,
-lsi_mmio_writel,
+static const MemoryRegionOps lsi_mmio_ops = {
+.read = lsi_mmio_read,
+.write = lsi_mmio_write,
+.endianness = DEVICE_NATIVE_ENDIAN,
+.impl = {
+.min_access_size = 1,
+.max_access_size = 1,
+},
 };
 
-static void lsi_ram_writeb(void *opaque, target_phys_addr_t addr, uint32_t val)
+static void lsi_ram_write(void *opaque, target_phys_addr_t addr,
+  uint64_t val, unsigned size)
 {
 LSIState *s = opaque;
 uint32_t newval;
+uint32_t mask;
 int shift;
 
-addr &= 0x1fff;
 newval = s->script_ram[addr >> 2];
 shift = (addr & 3) * 8;
-newval &= ~(0xff << shift);
+mask = ((uint64_t)1 << (size * 8)) - 1;
+newval &= ~(mask << shift);
 newval |= val << shift;
 s->script_ram[addr >> 2] = newval;
 }
 
-static void lsi_ram_writew(void *opaque, target_phys_addr_t addr, uint32_t val)
-{
-LSIState *s = opaque;
-uint32_t newval;
-
-addr &= 0x1fff;
-newval = s->script_ram[addr >> 2];
-if (addr & 2) {
-newval = (newval & 0xffff) | (val << 16);
-} else {
-newval = (newval & 0xffff0000) | val;
-}
-s->script_ram[addr >> 2] = newval;
-}
-
-
-static void lsi_ram_writel(void *opaque, target_phys_addr_t addr, uint32_t val)
-{
-LSIState *s = opaque;
-
-addr &= 0x1fff;
-s->script_ram[addr >> 2] = val;
-}
-
-static uint32_t lsi_ram_readb(void *opaque, target_phys_addr_t addr)
+static uint64_t lsi_ram_read(void *opaque, target_phys_addr_t addr,
+ unsigned size)
 {
 LSIState *s = opaque;
 uint32_t val;
+uint32_t mask;

[Qemu-devel] [PATCH v4 14/39] ac97: convert to memory API

2011-08-08 Thread Avi Kivity
fixes BAR sizing as well.

Reviewed-by: Richard Henderson 
Reviewed-by: Anthony Liguori 
Signed-off-by: Avi Kivity 
---
 hw/ac97.c |   89 +++-
 1 files changed, 52 insertions(+), 37 deletions(-)

diff --git a/hw/ac97.c b/hw/ac97.c
index 0b59896..52f0f0d 100644
--- a/hw/ac97.c
+++ b/hw/ac97.c
@@ -160,8 +160,9 @@ typedef struct AC97LinkState {
 SWVoiceIn *voice_mc;
 int invalid_freq[3];
 uint8_t silence[128];
-uint32_t base[2];
 int bup_flag;
+MemoryRegion io_nam;
+MemoryRegion io_nabm;
 } AC97LinkState;
 
 enum {
@@ -583,7 +584,7 @@ static uint32_t nam_readw (void *opaque, uint32_t addr)
 {
 AC97LinkState *s = opaque;
 uint32_t val = ~0U;
-uint32_t index = addr - s->base[0];
+uint32_t index = addr;
 s->cas = 0;
 val = mixer_load (s, index);
 return val;
@@ -611,7 +612,7 @@ static void nam_writeb (void *opaque, uint32_t addr, 
uint32_t val)
 static void nam_writew (void *opaque, uint32_t addr, uint32_t val)
 {
 AC97LinkState *s = opaque;
-uint32_t index = addr - s->base[0];
+uint32_t index = addr;
 s->cas = 0;
 switch (index) {
 case AC97_Reset:
@@ -714,7 +715,7 @@ static uint32_t nabm_readb (void *opaque, uint32_t addr)
 {
 AC97LinkState *s = opaque;
 AC97BusMasterRegs *r = NULL;
-uint32_t index = addr - s->base[1];
+uint32_t index = addr;
 uint32_t val = ~0U;
 
 switch (index) {
@@ -769,7 +770,7 @@ static uint32_t nabm_readw (void *opaque, uint32_t addr)
 {
 AC97LinkState *s = opaque;
 AC97BusMasterRegs *r = NULL;
-uint32_t index = addr - s->base[1];
+uint32_t index = addr;
 uint32_t val = ~0U;
 
 switch (index) {
@@ -798,7 +799,7 @@ static uint32_t nabm_readl (void *opaque, uint32_t addr)
 {
 AC97LinkState *s = opaque;
 AC97BusMasterRegs *r = NULL;
-uint32_t index = addr - s->base[1];
+uint32_t index = addr;
 uint32_t val = ~0U;
 
 switch (index) {
@@ -848,7 +849,7 @@ static void nabm_writeb (void *opaque, uint32_t addr, 
uint32_t val)
 {
 AC97LinkState *s = opaque;
 AC97BusMasterRegs *r = NULL;
-uint32_t index = addr - s->base[1];
+uint32_t index = addr;
 switch (index) {
 case PI_LVI:
 case PO_LVI:
@@ -904,7 +905,7 @@ static void nabm_writew (void *opaque, uint32_t addr, 
uint32_t val)
 {
 AC97LinkState *s = opaque;
 AC97BusMasterRegs *r = NULL;
-uint32_t index = addr - s->base[1];
+uint32_t index = addr;
 switch (index) {
 case PI_SR:
 case PO_SR:
@@ -924,7 +925,7 @@ static void nabm_writel (void *opaque, uint32_t addr, 
uint32_t val)
 {
 AC97LinkState *s = opaque;
 AC97BusMasterRegs *r = NULL;
-uint32_t index = addr - s->base[1];
+uint32_t index = addr;
 switch (index) {
 case PI_BDBAR:
 case PO_BDBAR:
@@ -1230,31 +1231,33 @@ static const VMStateDescription vmstate_ac97 = {
 }
 };
 
-static void ac97_map (PCIDevice *pci_dev, int region_num,
-  pcibus_t addr, pcibus_t size, int type)
-{
-AC97LinkState *s = DO_UPCAST (AC97LinkState, dev, pci_dev);
-PCIDevice *d = &s->dev;
-
-if (!region_num) {
-s->base[0] = addr;
-register_ioport_read (addr, 256 * 1, 1, nam_readb, d);
-register_ioport_read (addr, 256 * 2, 2, nam_readw, d);
-register_ioport_read (addr, 256 * 4, 4, nam_readl, d);
-register_ioport_write (addr, 256 * 1, 1, nam_writeb, d);
-register_ioport_write (addr, 256 * 2, 2, nam_writew, d);
-register_ioport_write (addr, 256 * 4, 4, nam_writel, d);
-}
-else {
-s->base[1] = addr;
-register_ioport_read (addr, 64 * 1, 1, nabm_readb, d);
-register_ioport_read (addr, 64 * 2, 2, nabm_readw, d);
-register_ioport_read (addr, 64 * 4, 4, nabm_readl, d);
-register_ioport_write (addr, 64 * 1, 1, nabm_writeb, d);
-register_ioport_write (addr, 64 * 2, 2, nabm_writew, d);
-register_ioport_write (addr, 64 * 4, 4, nabm_writel, d);
-}
-}
+static const MemoryRegionPortio nam_portio[] = {
+{ 0, 256 * 1, 1, .read = nam_readb, },
+{ 0, 256 * 2, 2, .read = nam_readw, },
+{ 0, 256 * 4, 4, .read = nam_readl, },
+{ 0, 256 * 1, 1, .write = nam_writeb, },
+{ 0, 256 * 2, 2, .write = nam_writew, },
+{ 0, 256 * 4, 4, .write = nam_writel, },
+PORTIO_END_OF_LIST(),
+};
+
+static const MemoryRegionOps ac97_io_nam_ops = {
+.old_portio = nam_portio,
+};
+
+static const MemoryRegionPortio nabm_portio[] = {
+{ 0, 64 * 1, 1, .read = nabm_readb, },
+{ 0, 64 * 2, 2, .read = nabm_readw, },
+{ 0, 64 * 4, 4, .read = nabm_readl, },
+{ 0, 64 * 1, 1, .write = nabm_writeb, },
+{ 0, 64 * 2, 2, .write = nabm_writew, },
+{ 0, 64 * 4, 4, .write = nabm_writel, },
+PORTIO_END_OF_LIST()
+};
+
+static const MemoryRegionOps ac97_io_nabm_ops = {
+.old_portio = nabm_portio,
+};
 
 static void ac97_on_reset (void *opaque)
 {
@@ -1311,15 +1314,26 @@ static 

[Qemu-devel] [PATCH v4 18/39] ide: convert to memory API

2011-08-08 Thread Avi Kivity
Reviewed-by: Richard Henderson 
Reviewed-by: Anthony Liguori 
Signed-off-by: Avi Kivity 
---
 hw/ide/cmd646.c |  208 +++
 hw/ide/pci.c|   25 ---
 hw/ide/pci.h|   19 -
 hw/ide/piix.c   |   64 +
 hw/ide/via.c|   65 +
 5 files changed, 261 insertions(+), 120 deletions(-)

diff --git a/hw/ide/cmd646.c b/hw/ide/cmd646.c
index 56302b5..13e6f2f 100644
--- a/hw/ide/cmd646.c
+++ b/hw/ide/cmd646.c
@@ -44,35 +44,95 @@
 
 static void cmd646_update_irq(PCIIDEState *d);
 
-static void ide_map(PCIDevice *pci_dev, int region_num,
-pcibus_t addr, pcibus_t size, int type)
+static uint64_t cmd646_cmd_read(void *opaque, target_phys_addr_t addr,
+unsigned size)
 {
-PCIIDEState *d = DO_UPCAST(PCIIDEState, dev, pci_dev);
-IDEBus *bus;
-
-if (region_num <= 3) {
-bus = &d->bus[(region_num >> 1)];
-if (region_num & 1) {
-register_ioport_read(addr + 2, 1, 1, ide_status_read, bus);
-register_ioport_write(addr + 2, 1, 1, ide_cmd_write, bus);
+CMD646BAR *cmd646bar = opaque;
+
+if (addr != 2 || size != 1) {
+return ((uint64_t)1 << (size * 8)) - 1;
+}
+return ide_status_read(cmd646bar->bus, addr + 2);
+}
+
+static void cmd646_cmd_write(void *opaque, target_phys_addr_t addr,
+ uint64_t data, unsigned size)
+{
+CMD646BAR *cmd646bar = opaque;
+
+if (addr != 2 || size != 1) {
+return;
+}
+ide_cmd_write(cmd646bar->bus, addr + 2, data);
+}
+
+static MemoryRegionOps cmd646_cmd_ops = {
+.read = cmd646_cmd_read,
+.write = cmd646_cmd_write,
+.endianness = DEVICE_LITTLE_ENDIAN,
+};
+
+static uint64_t cmd646_data_read(void *opaque, target_phys_addr_t addr,
+ unsigned size)
+{
+CMD646BAR *cmd646bar = opaque;
+
+if (size == 1) {
+return ide_ioport_read(cmd646bar->bus, addr);
+} else if (addr == 0) {
+if (size == 2) {
+return ide_data_readw(cmd646bar->bus, addr);
 } else {
-register_ioport_write(addr, 8, 1, ide_ioport_write, bus);
-register_ioport_read(addr, 8, 1, ide_ioport_read, bus);
-
-/* data ports */
-register_ioport_write(addr, 2, 2, ide_data_writew, bus);
-register_ioport_read(addr, 2, 2, ide_data_readw, bus);
-register_ioport_write(addr, 4, 4, ide_data_writel, bus);
-register_ioport_read(addr, 4, 4, ide_data_readl, bus);
+return ide_data_readl(cmd646bar->bus, addr);
 }
 }
+return ((uint64_t)1 << (size * 8)) - 1;
 }
 
-static uint32_t bmdma_readb_common(PCIIDEState *pci_dev, BMDMAState *bm,
-   uint32_t addr)
+static void cmd646_data_write(void *opaque, target_phys_addr_t addr,
+ uint64_t data, unsigned size)
 {
+CMD646BAR *cmd646bar = opaque;
+
+if (size == 1) {
+return ide_ioport_write(cmd646bar->bus, addr, data);
+} else if (addr == 0) {
+if (size == 2) {
+return ide_data_writew(cmd646bar->bus, addr, data);
+} else {
+return ide_data_writel(cmd646bar->bus, addr, data);
+}
+}
+}
+
+static MemoryRegionOps cmd646_data_ops = {
+.read = cmd646_data_read,
+.write = cmd646_data_write,
+.endianness = DEVICE_LITTLE_ENDIAN,
+};
+
+static void setup_cmd646_bar(PCIIDEState *d, int bus_num)
+{
+IDEBus *bus = &d->bus[bus_num];
+CMD646BAR *bar = &d->cmd646_bar[bus_num];
+
+bar->bus = bus;
+bar->pci_dev = d;
+memory_region_init_io(&bar->cmd, &cmd646_cmd_ops, bar, "cmd646-cmd", 4);
+memory_region_init_io(&bar->data, &cmd646_data_ops, bar, "cmd646-data", 8);
+}
+
+static uint64_t bmdma_read(void *opaque, target_phys_addr_t addr,
+   unsigned size)
+{
+BMDMAState *bm = opaque;
+PCIIDEState *pci_dev = bm->pci_dev;
 uint32_t val;
 
+if (size != 1) {
+return ((uint64_t)1 << (size * 8)) - 1;
+}
+
 switch(addr & 3) {
 case 0:
 val = bm->cmd;
@@ -100,31 +160,22 @@ static uint32_t bmdma_readb_common(PCIIDEState *pci_dev, 
BMDMAState *bm,
 return val;
 }
 
-static uint32_t bmdma_readb_0(void *opaque, uint32_t addr)
+static void bmdma_write(void *opaque, target_phys_addr_t addr,
+uint64_t val, unsigned size)
 {
-PCIIDEState *pci_dev = opaque;
-BMDMAState *bm = &pci_dev->bmdma[0];
-
-return bmdma_readb_common(pci_dev, bm, addr);
-}
+BMDMAState *bm = opaque;
+PCIIDEState *pci_dev = bm->pci_dev;
 
-static uint32_t bmdma_readb_1(void *opaque, uint32_t addr)
-{
-PCIIDEState *pci_dev = opaque;
-BMDMAState *bm = &pci_dev->bmdma[1];
-
-return bmdma_readb_common(pci_dev, bm, addr);
-}
+if (size != 1) {
+return;
+}
 
-static void bmdma_writeb_common(PCIIDEState *pci_dev, BMDMAState *bm,
-  

[Qemu-devel] [PATCH v4 25/39] ne2000: convert to memory API

2011-08-08 Thread Avi Kivity
Reviewed-by: Richard Henderson 
Signed-off-by: Avi Kivity 
---
 hw/ne2000-isa.c |   13 ++---
 hw/ne2000.c |   77 +-
 hw/ne2000.h |8 +
 3 files changed, 58 insertions(+), 40 deletions(-)

diff --git a/hw/ne2000-isa.c b/hw/ne2000-isa.c
index e41dbba..756ed5c 100644
--- a/hw/ne2000-isa.c
+++ b/hw/ne2000-isa.c
@@ -27,6 +27,7 @@
 #include "qdev.h"
 #include "net.h"
 #include "ne2000.h"
+#include "exec-memory.h"
 
 typedef struct ISANE2000State {
 ISADevice dev;
@@ -66,19 +67,11 @@ static int isa_ne2000_initfn(ISADevice *dev)
 ISANE2000State *isa = DO_UPCAST(ISANE2000State, dev, dev);
 NE2000State *s = &isa->ne2000;
 
-register_ioport_write(isa->iobase, 16, 1, ne2000_ioport_write, s);
-register_ioport_read(isa->iobase, 16, 1, ne2000_ioport_read, s);
+ne2000_setup_io(s, 0x20);
 isa_init_ioport_range(dev, isa->iobase, 16);
-
-register_ioport_write(isa->iobase + 0x10, 1, 1, ne2000_asic_ioport_write, 
s);
-register_ioport_read(isa->iobase + 0x10, 1, 1, ne2000_asic_ioport_read, s);
-register_ioport_write(isa->iobase + 0x10, 2, 2, ne2000_asic_ioport_write, 
s);
-register_ioport_read(isa->iobase + 0x10, 2, 2, ne2000_asic_ioport_read, s);
 isa_init_ioport_range(dev, isa->iobase + 0x10, 2);
-
-register_ioport_write(isa->iobase + 0x1f, 1, 1, ne2000_reset_ioport_write, 
s);
-register_ioport_read(isa->iobase + 0x1f, 1, 1, ne2000_reset_ioport_read, 
s);
 isa_init_ioport(dev, isa->iobase + 0x1f);
+memory_region_add_subregion(get_system_io(), isa->iobase, &s->io);
 
 isa_init_irq(dev, &s->irq, isa->isairq);
 
diff --git a/hw/ne2000.c b/hw/ne2000.c
index f8acaae..5b76acf 100644
--- a/hw/ne2000.c
+++ b/hw/ne2000.c
@@ -297,7 +297,7 @@ ssize_t ne2000_receive(VLANClientState *nc, const uint8_t 
*buf, size_t size_)
 return size_;
 }
 
-void ne2000_ioport_write(void *opaque, uint32_t addr, uint32_t val)
+static void ne2000_ioport_write(void *opaque, uint32_t addr, uint32_t val)
 {
 NE2000State *s = opaque;
 int offset, page, index;
@@ -394,7 +394,7 @@ void ne2000_ioport_write(void *opaque, uint32_t addr, 
uint32_t val)
 }
 }
 
-uint32_t ne2000_ioport_read(void *opaque, uint32_t addr)
+static uint32_t ne2000_ioport_read(void *opaque, uint32_t addr)
 {
 NE2000State *s = opaque;
 int offset, page, ret;
@@ -544,7 +544,7 @@ static inline void ne2000_dma_update(NE2000State *s, int len)
 }
 }
 
-void ne2000_asic_ioport_write(void *opaque, uint32_t addr, uint32_t val)
+static void ne2000_asic_ioport_write(void *opaque, uint32_t addr, uint32_t val)
 {
 NE2000State *s = opaque;
 
@@ -564,7 +564,7 @@ void ne2000_asic_ioport_write(void *opaque, uint32_t addr, uint32_t val)
 }
 }
 
-uint32_t ne2000_asic_ioport_read(void *opaque, uint32_t addr)
+static uint32_t ne2000_asic_ioport_read(void *opaque, uint32_t addr)
 {
 NE2000State *s = opaque;
 int ret;
@@ -612,12 +612,12 @@ static uint32_t ne2000_asic_ioport_readl(void *opaque, uint32_t addr)
 return ret;
 }
 
-void ne2000_reset_ioport_write(void *opaque, uint32_t addr, uint32_t val)
-static void ne2000_reset_ioport_write(void *opaque, uint32_t addr, uint32_t val)
 {
 /* nothing to do (end of reset pulse) */
 }
 
-uint32_t ne2000_reset_ioport_read(void *opaque, uint32_t addr)
+static uint32_t ne2000_reset_ioport_read(void *opaque, uint32_t addr)
 {
 NE2000State *s = opaque;
 ne2000_reset(s);
@@ -676,27 +676,55 @@ static const VMStateDescription vmstate_pci_ne2000 = {
 }
 };
 
-/***/
-/* PCI NE2000 definitions */
+static uint64_t ne2000_read(void *opaque, target_phys_addr_t addr,
+unsigned size)
+{
+NE2000State *s = opaque;
 
-static void ne2000_map(PCIDevice *pci_dev, int region_num,
-   pcibus_t addr, pcibus_t size, int type)
+if (addr < 0x10 && size == 1) {
+return ne2000_ioport_read(s, addr);
+} else if (addr == 0x10) {
+if (size <= 2) {
+return ne2000_asic_ioport_read(s, addr);
+} else {
+return ne2000_asic_ioport_readl(s, addr);
+}
+} else if (addr == 0x1f && size == 1) {
+return ne2000_reset_ioport_read(s, addr);
+}
+return ((uint64_t)1 << (size * 8)) - 1;
+}
+
+static void ne2000_write(void *opaque, target_phys_addr_t addr,
+ uint64_t data, unsigned size)
 {
-PCINE2000State *d = DO_UPCAST(PCINE2000State, dev, pci_dev);
-NE2000State *s = &d->ne2000;
+NE2000State *s = opaque;
+
+if (addr < 0x10 && size == 1) {
+return ne2000_ioport_write(s, addr, data);
+} else if (addr == 0x10) {
+if (size <= 2) {
+return ne2000_asic_ioport_write(s, addr, data);
+} else {
+return ne2000_asic_ioport_writel(s, addr, data);
+}
+} else if (addr == 0x1f && size == 1) {
+return ne2000_reset_ioport_write(s, addr, data);
+}
+}

Re: [Qemu-devel] [PATCH] Introduce short names for fixed width integer types

2011-08-08 Thread Avi Kivity

On 08/08/2011 04:00 PM, Anthony Liguori wrote:

On 08/08/2011 07:56 AM, Avi Kivity wrote:

QEMU deals with a lot of fixed width integer types; their names
(uint64_t etc) are clumsy to use and take up a lot of space.

Following Linux, introduce shorter names, for example U64 for
uint64_t.


Except Linux uses lower case letters.

I personally think Linux style is wrong here.  The int8_t types are 
standard types.


Besides, we save lots of characters by using 4-space tabs instead of 
8-space tabs.  We can afford to spend some of those saved characters 
on using proper type names :-)




It's not about saving space, it's about improving readability.  We have 
about 21k uses of these types, they deserve short names.


--
error compiling committee.c: too many arguments to function




Re: [Qemu-devel] [PATCH v3 01/39] virtio-pci: get config on init

2011-08-08 Thread Michael S. Tsirkin
On Mon, Aug 08, 2011 at 08:02:08AM -0500, Anthony Liguori wrote:
> On 08/08/2011 07:56 AM, Michael S. Tsirkin wrote:
> >On Mon, Aug 08, 2011 at 07:45:19AM -0500, Anthony Liguori wrote:
> >>On 08/08/2011 05:36 AM, Michael S. Tsirkin wrote:
> Thinking more closely, I don't think this is right.
> 
> Updating on map ensured that the config was refreshed after each
> time the bar was mapped.  In the very least, the config needs to be
> refreshed during reset because the guest may write to the guest
> space which should get cleared after reset.
> 
> Regards,
> 
> Anthony Liguori
> >>>
> >>>Not sure I understand. Which register, for example,
> >>>do you have in mind?
> >>>Could you clarify please?
> >>
> >>Actually, you never need to call config_get() AFAICT.  It's called
> >>in every read/write access.
> >
> >Every read, yes. But every write? Are you sure?
> 
> Yeah, not on write, but I think this is a bug.  get_config() should
> be called before doing the memcpy() in order to have a proper RMW.
> 
> Regards,
> 
> Anthony Liguori

Probably not noticeable because guests don't do the RMW
in practice.
We also send the config over on migration.
That's probably a bug as well ...

-- 
MST
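
A minimal self-contained sketch of the read-modify-write being discussed; the
Device type and callbacks here are hypothetical stand-ins, not QEMU's actual
virtio API:

#include <stdint.h>
#include <string.h>

typedef struct Device {
    uint8_t config[64];                      /* shadow of the config space */
    void (*get_config)(struct Device *, uint8_t *);
    void (*set_config)(struct Device *, const uint8_t *);
} Device;

/* Refresh the shadow from the device before merging a partial guest write,
 * so the bytes the guest did not touch are not stale (a proper RMW). */
static void config_write_rmw(Device *dev, uint32_t addr,
                             const uint8_t *buf, unsigned len)
{
    dev->get_config(dev, dev->config);       /* re-read device state first */
    memcpy(dev->config + addr, buf, len);    /* merge the guest's bytes */
    dev->set_config(dev, dev->config);       /* push the merged view back */
}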



Re: [Qemu-devel] [PATCH v3 01/39] virtio-pci: get config on init

2011-08-08 Thread Anthony Liguori

On 08/08/2011 08:14 AM, Michael S. Tsirkin wrote:

Probably not noticeable because guests don't do the RMW
in practice.
We also send the config over on migration.
That's probably a bug as well ...


It's probably unnecessary, but I don't think it's a bug.

Regards,

Anthony Liguori








Re: [Qemu-devel] [PATCH] Introduce short names for fixed width integer types

2011-08-08 Thread Anthony Liguori

On 08/08/2011 08:12 AM, Avi Kivity wrote:

On 08/08/2011 04:00 PM, Anthony Liguori wrote:

On 08/08/2011 07:56 AM, Avi Kivity wrote:

QEMU deals with a lot of fixed width integer types; their names
(uint64_t etc) are clumsy to use and take up a lot of space.

Following Linux, introduce shorter names, for example U64 for
uint64_t.


Except Linux uses lower case letters.

I personally think Linux style is wrong here. The int8_t types are
standard types.

Besides, we save lots of characters by using 4-space tabs instead of
8-space tabs. We can afford to spend some of those saved characters on
using proper type names :-)



It's not about saving space, it's about improving readability. We have
about 21k uses of these types, they deserve short names.


This is one of the few areas that we're actually consistent with today. 
 Introducing a new set of types will just create inconsistency.


Most importantly, these are standard types.  Every modern library and C 
program should be using them.  TBH, having short names is just a bad 
case of NIH.


Regards,

Anthony Liguori




Re: [Qemu-devel] [PATCH] Introduce short names for fixed width integer types

2011-08-08 Thread Peter Maydell
On 8 August 2011 13:56, Avi Kivity  wrote:
> QEMU deals with a lot of fixed width integer types; their names
> (uint64_t etc) are clumsy to use and take up a lot of space.
>
> Following Linux, introduce shorter names, for example U64 for
> uint64_t.

Strongly disagree. uint64_t &c are standard types and it's
immediately clear to a competent C programmer what they are.
Random qemu-specific funny named types just introduces an
unnecessary level of indirection.

We only just recently managed to get rid of the nonstandard
typenames for these from fpu/...

-- PMM



Re: [Qemu-devel] [PATCH] Introduce short names for fixed width integer types

2011-08-08 Thread Avi Kivity

On 08/08/2011 04:17 PM, Anthony Liguori wrote:


This is one of the few areas that we're actually consistent with 
today.  Introducing a new set of types will just create inconsistency.


Most importantly, these are standard types.  Every modern library and 
C program should be using them.  TBH, having short names is just a bad 
case of NIH.




Those are exactly the same types, compatible with all the libraries.  
NIH would be redefining them ourselves (and breaking pointer 
compatibility etc.)


--
error compiling committee.c: too many arguments to function




Re: [Qemu-devel] [PATCH] posix-aio-compat: fix latency issues

2011-08-08 Thread Frediano Ziglio
2011/8/8 Avi Kivity :
> On 08/08/2011 03:49 PM, Frediano Ziglio wrote:
>>
>> 2011/8/8 Avi Kivity:
>> >  In certain circumstances, posix-aio-compat can incur a lot of latency:
>> >    - threads are created by vcpu threads, so if vcpu affinity is set,
>> >     aio threads inherit vcpu affinity.  This can cause many aio threads
>> >     to compete for one cpu.
>> >    - we can create up to max_threads (64) aio threads in one go; since a
>> >     pthread_create can take around 30μs, we have up to 2ms of cpu time
>> >     under a global lock.
>> >
>> >  Fix by:
>> >    - moving thread creation to the main thread, so we inherit the main
>> >     thread's affinity instead of the vcpu thread's affinity.
>> >    - if a thread is currently being created, and we need to create yet
>> >     another thread, let thread being born create the new thread,
>> > reducing
>> >     the amount of time we spend under the main thread.
>> >    - drop the local lock while creating a thread (we may still hold the
>> >     global mutex, though)
>> >
>> >  Note this doesn't eliminate latency completely; scheduler artifacts or
>> >  lack of host cpu resources can still cause it.  We may want
>> > pre-allocated
>> >  threads when this cannot be tolerated.
>> >
>> >  Thanks to Uli Obergfell of Red Hat for his excellent analysis and
>> > suggestions.
>> >
>> >  Signed-off-by: Avi Kivity
>>
>> Why not call pthread_attr_setaffinity_np (where available) before
>> thread creation or sched_setaffinity at thread start, instead of telling
>> another thread to create a thread for us just to get the affinity cleared?
>>
>
> The entire qemu process may be affined to a subset of the host cpus; we
> don't want to break that.
>
> For example:
>
>   taskset 0xf0 qemu 
>   (qemu) info cpus
> 
>
>

Just call sched_getaffinity at program start, save it to a global
variable and then set this affinity for io threads.
I haven't used affinity that much, but from the manual it seems that if
you own the process you can set the affinity as you like.
IMHO this patch introduces a delay in io thread creation due to posting
thread creation to another thread just to set a different affinity.

Frediano
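
A minimal sketch of this alternative, assuming the process-wide mask is
captured once at startup (Avi's reply below notes that mask may change later);
sched_getaffinity() and pthread_attr_setaffinity_np() are the glibc calls
involved:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static cpu_set_t process_affinity;

/* Call early in main(), before any vcpu thread narrows its own mask. */
void save_process_affinity(void)
{
    sched_getaffinity(0, sizeof(process_affinity), &process_affinity);
}

/* Create an aio thread that inherits the saved process-wide mask
 * instead of the creating vcpu thread's (possibly pinned) mask. */
int spawn_aio_thread(void *(*fn)(void *), void *arg)
{
    pthread_t tid;
    pthread_attr_t attr;
    int ret;

    pthread_attr_init(&attr);
    pthread_attr_setaffinity_np(&attr, sizeof(process_affinity),
                                &process_affinity);
    ret = pthread_create(&tid, &attr, fn, arg);
    pthread_attr_destroy(&attr);
    return ret;
}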



Re: [Qemu-devel] [PATCH 2/7] Replace VMSTOP macros with a proper QemuState type

2011-08-08 Thread Luiz Capitulino
On Mon, 08 Aug 2011 14:22:14 +0300
Avi Kivity  wrote:

> On 08/04/2011 01:17 PM, Jan Kiszka wrote:
> > >
> > >  Why "QemuState"?  In general, "qemu" can be inferred from the fact that
> > >  we're in qemu.git.  Suggest "RunState".
> > >
> > >  Second, these states can coexist.  A user may pause the VM
> > >  simultaneously with the watchdog firing or entering premigrate state.
> > >  In fact, with multiple monitors, each monitor can pause and resume the
> > >  vm independently.
> > >
> > >  So I think we should keep a reference count instead of just a start/stop
> > >  state.  Perhaps
> > >
> > >  vm_stop(QemuState s)
> > >  {
> > >  ++stopcount[s];
> > >  }
> > >
> > >  vm_is_stopped()
> > >  {
> > >  for (s in states)
> > >if (stopcount[s])
> > >return true;
> > >  return false;
> > >  }
> >
> > I don't think this makes sense nor is user-friendly. If one command
> > channel suspends the machine, others have the chance to subscribe for
> > that event.
> 
> It's inherently racy.
> 
> > Maintaining a suspension counter would mean you also need a
> > channel to query its value.
> 
> Why?
> 
> > IMHO, there is also no use for defining stopped orthogonally to
> > premigrate and other states that imply that the machine is stopped.
> > Basically they mean "stopped for/because of X". We just need to avoid
> > that you can enter plain stopped state from them by issuing the
> > corresponding monitor command. The other way around might be possible,
> > though, if there are race windows.
> >
> 
> I'm worried about the following race:
> 
>stop
>(qemu stopped for internal reason)
>stop comment processed
> 
>resume
> 
> The (qemu stopped for internal reason) part is lost.

If the "stop" you're referring to happens through vm_stop(), then no,
it won't be lost because do_vm_stop() doesn't allow qemu to be stopped
twice.
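
For reference, the reference count Avi sketched above, written out as
self-contained C; the state names are illustrative, not qemu.git's actual enum:

#include <stdbool.h>

typedef enum RunState {
    RSTATE_USER_PAUSED,    /* monitor "stop" */
    RSTATE_WATCHDOG,       /* watchdog fired */
    RSTATE_PRE_MIGRATE,    /* migration about to complete */
    RSTATE_MAX
} RunState;

static int stopcount[RSTATE_MAX];

void vm_stop(RunState s)
{
    ++stopcount[s];
}

void vm_resume(RunState s)
{
    --stopcount[s];
}

/* The vm runs only when no reason for stopping remains, so a resume
 * for one reason cannot mask a stop that happened for another. */
bool vm_is_stopped(void)
{
    for (int s = 0; s < RSTATE_MAX; s++) {
        if (stopcount[s]) {
            return true;
        }
    }
    return false;
}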



Re: [Qemu-devel] [PATCH] posix-aio-compat: fix latency issues

2011-08-08 Thread Avi Kivity

On 08/08/2011 04:21 PM, Frediano Ziglio wrote:

>
>  The entire qemu process may be affined to a subset of the host cpus; we
>  don't want to break that.
>
>  For example:
>
> taskset 0xf0 qemu 
> (qemu) info cpus
>  
>
>

Just call sched_getaffinity at program start, save to a global
variable and then set this affinity for io threads.


This affinity may change later on.


I haven't used affinity that much, but from the manual it seems that if
you own the process you can set the affinity as you like.
IMHO this patch introduces a delay in io thread creation due to posting
thread creation to another thread just to set a different affinity.


It does.  But aio threads have a long life, so this happens very rarely.

--
error compiling committee.c: too many arguments to function




Re: [Qemu-devel] RFC: moving fsfreeze support from the userland guest agent to the guest kernel

2011-08-08 Thread Luiz Capitulino
On Sun, 07 Aug 2011 21:28:17 +0300
Ronen Hod  wrote:

> Well, we want to support Microsoft's VSS, and that requires a guest 
> agent that communicates with all the "writers" (applications), waiting 
> for them to flush their app data in order to generate a consistent 
> app-level snapshot. The VSS platform does most of the work.
> Still, at the bottom line, the agent's role is only to find the right 
> moment in time. This moment can be relayed back to libvirt, and from 
> there do it according to your suggestion, so that the guest agent does 
> not do the freeze, and it is actually not a mandatory component.

I think this discussion has reached the point where patches will speak
louder than words.



Re: [Qemu-devel] [PATCH 2/7] Replace VMSTOP macros with a proper QemuState type

2011-08-08 Thread Avi Kivity

On 08/08/2011 04:25 PM, Luiz Capitulino wrote:

>
>  I'm worried about the following race:
>
> stop
> (qemu stopped for internal reason)
> stop comment processed
>
> resume
>
>  The (qemu stopped for internal reason) part is lost.

If the "stop" you're referring to happens through vm_stop(), then no,
it won't be lost because do_vm_stop() doesn't allow qemu to be stopped
twice.


What happens then?  The user sees an error?

--
error compiling committee.c: too many arguments to function




Re: [Qemu-devel] [PATCH 2/7] Replace VMSTOP macros with a proper QemuState type

2011-08-08 Thread Luiz Capitulino
On Mon, 08 Aug 2011 16:27:05 +0300
Avi Kivity  wrote:

> On 08/08/2011 04:25 PM, Luiz Capitulino wrote:
> > >
> > >  I'm worried about the following race:
> > >
> > > stop
> > > (qemu stopped for internal reason)
> > > stop comment processed
> > >
> > > resume
> > >
> > >  The (qemu stopped for internal reason) part is lost.
> >
> > If the "stop" you're referring to happens through vm_stop(), then no,
> > it won't be lost because do_vm_stop() doesn't allow qemu to be stopped
> > twice.
> 
> What happens then?  The user sees an error?

It's ignored.



Re: [Qemu-devel] [PATCH v4] XBZRLE delta for live migration of large memory apps

2011-08-08 Thread Anthony Liguori

On 08/08/2011 03:42 AM, Shribman, Aidan wrote:

Subject: [PATCH v4] XBZRLE delta for live migration of large memory apps
From: Aidan Shribman

By using XBZRLE (Xor Binary Zero Run-Length-Encoding) we can reduce VM downtime
and total live-migration time of VMs running memory write intensive workloads
typical of large enterprise applications such as SAP ERP Systems, and generally
speaking for any application with a sparse memory update pattern.

On the sender side XBZRLE is used as a compact delta encoding of page updates,
retrieving the old page content from an LRU cache (default size of 64 MB). The
receiving side uses the existing page content and XBZRLE to decode the new page
content.

Work was originally based on research results published at VEE 2011: Evaluation of
Delta Compression Techniques for Efficient Live Migration of Large Virtual
Machines by Benoit, Svard, Tordsson and Elmroth. Additionally the delta encoder
XBRLE was improved further using XBZRLE instead.

XBZRLE has a sustained bandwidth of 2-2.5 GB/s for typical workloads making it
ideal for in-line, real-time encoding such as is needed for live-migration.

A typical usage scenario:
 {qemu} migrate_set_cachesize 256m
 {qemu} migrate -x -d tcp:destination.host:
 {qemu} info migrate
 ...
 transferred ram-duplicate: A kbytes
 transferred ram-duplicate: B pages
 transferred ram-normal: C kbytes
 transferred ram-normal: D pages
 transferred ram-xbrle: E kbytes
 transferred ram-xbrle: F pages
 overflow ram-xbrle: G pages
 cache-hit ram-xbrle: H pages
 cache-lookup ram-xbrle: J pages

Testing: live migration with XBZRLE completed in 110 seconds, without live
migration was not able to complete.

A simple synthetic memory r/w load generator:
#include <stdio.h>
#include <stdlib.h>

int main()
{
    char *buf = (char *) calloc(4096, 4096);
    while (1) {
        int i;
        for (i = 0; i < 4096 * 4; i++) {
            buf[i * 4096 / 4]++;
        }
        printf(".");
    }
}

Signed-off-by: Benoit Hudzia
Signed-off-by: Petter Svard
Signed-off-by: Aidan Shribman
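
For intuition, a minimal sketch of the XOR + zero-run-length step described
above, with run lengths capped at one byte for simplicity; this is not the
patch's actual encoder or wire format:

#include <stdint.h>
#include <string.h>

/* Encode 'cur' against the cached copy 'old' as alternating
 * (zero-run, literal-run) pairs; returns 0 if the output would not
 * fit in 'avail' bytes, i.e. the page should be sent uncompressed. */
static size_t xbzrle_sketch_encode(const uint8_t *old, const uint8_t *cur,
                                   size_t page, uint8_t *out, size_t avail)
{
    size_t i = 0, olen = 0;

    while (i < page) {
        size_t zrun = 0, nzrun = 0;

        /* run of unchanged bytes (their XOR with the cached page is 0) */
        while (i + zrun < page && zrun < 255 && old[i + zrun] == cur[i + zrun]) {
            zrun++;
        }
        i += zrun;
        /* run of changed bytes, emitted literally */
        while (i + nzrun < page && nzrun < 255
               && old[i + nzrun] != cur[i + nzrun]) {
            nzrun++;
        }
        if (olen + 2 + nzrun > avail) {
            return 0;              /* didn't compress: send the full page */
        }
        out[olen++] = (uint8_t)zrun;
        out[olen++] = (uint8_t)nzrun;
        memcpy(out + olen, cur + i, nzrun);
        olen += nzrun;
        i += nzrun;
    }
    return olen;
}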


One thing that strikes me about this algorithm is that it's very good 
for a particular type of workload--shockingly good really.


I think workload aware migration compression is possible for a lot of 
different types of workloads.  That makes me a bit wary of QEMU growing 
quite a lot of compression mechanisms.


It makes me think that this logic may really belong at a higher level 
where more information is known about the workload.  For instance, I can 
imagine XBZRLE living in something like libvirt.


Today, parsing migration traffic is pretty horrible but I think we're 
pretty strongly committed to fixing that in 1.0.  That makes me wonder 
if it would be nicer architecturally for a higher level tool to own 
something like this.


Originally, when I added migration, I had the view that we would have 
transport plugins based on the exec: protocol.  That hasn't really 
happened since libvirt really owns migration but I think having XBZRLE 
as a transport plugin for libvirt is something worth considering.


I'm curious what people think about this type of approach.  CC'ing 
libvirt to get their input.


Regards,

Anthony Liguori



--

  Makefile.target   |1 +
  arch_init.c   |  351 ++--
  block-migration.c |3 +-
  hash.h|   72 +++
  hmp-commands.hx   |   36 --
  hw/hw.h   |3 +-
  lru.c |  142 +
  lru.h |   13 ++
  migration-exec.c  |6 +-
  migration-fd.c|6 +-
  migration-tcp.c   |6 +-
  migration-unix.c  |6 +-
  migration.c   |  119 +-
  migration.h   |   25 +++-
  qmp-commands.hx   |   43 ++-
  savevm.c  |   13 ++-
  sysemu.h  |   13 ++-
  xbzrle.c  |  126 +++
  xbzrle.h  |   12 ++
  19 files changed, 917 insertions(+), 79 deletions(-)

diff --git a/Makefile.target b/Makefile.target
index 2800f47..b3215de 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -186,6 +186,7 @@ endif #CONFIG_BSD_USER
  ifdef CONFIG_SOFTMMU

  obj-y = arch_init.o cpus.o monitor.o machine.o gdbstub.o balloon.o
+obj-y += lru.o xbzrle.o
  # virtio has to be here due to weird dependency between PCI and virtio-net.
  # need to fix this properly
  obj-y += virtio-blk.o virtio-balloon.o virtio-net.o virtio-serial-bus.o
diff --git a/arch_init.c b/arch_init.c
old mode 100644
new mode 100755
index 4486925..d67dc82
--- a/arch_init.c
+++ b/arch_init.c
@@ -40,6 +40,17 @@
  #include "net.h"
  #include "gdbstub.h"
  #include "hw/smbios.h"
+#include "lru.h"
+#include "xbzrle.h"
+
+//#define DEBUG_ARCH_INIT
+#ifdef DEBUG_ARCH_INIT
+#define DPRINTF(fmt, ...) \
+do { fprintf(stdout, "arch_init: " fmt, ## __VA_ARGS__); } while (0)
+#else
+#define DPRINTF(fmt, ...) \
+do { } while (0)
+#endif

[Qemu-devel] [PATCH v4 03/39] vmsvga: don't remember pci BAR address in callback any more

2011-08-08 Thread Avi Kivity
We're going to remove the callback, so we can't use it to save the
address.  Use the pci API instead.

Reviewed-by: Richard Henderson 
Reviewed-by: Anthony Liguori 
Signed-off-by: Avi Kivity 
---
 hw/vmware_vga.c |   12 ++--
 1 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/hw/vmware_vga.c b/hw/vmware_vga.c
index 354c221..190b005 100644
--- a/hw/vmware_vga.c
+++ b/hw/vmware_vga.c
@@ -52,8 +52,6 @@ struct vmsvga_state_s {
 int on;
 } cursor;
 
-target_phys_addr_t vram_base;
-
 int index;
 int scratch_size;
 uint32_t *scratch;
@@ -761,8 +759,11 @@ static uint32_t vmsvga_value_read(void *opaque, uint32_t address)
 case SVGA_REG_BYTES_PER_LINE:
 return ((s->depth + 7) >> 3) * s->new_width;
 
-case SVGA_REG_FB_START:
-return s->vram_base;
+case SVGA_REG_FB_START: {
+struct pci_vmsvga_state_s *pci_vmsvga
+= container_of(s, struct pci_vmsvga_state_s, chip);
+return pci_get_bar_addr(&pci_vmsvga->card, 1);
+}
 
 case SVGA_REG_FB_OFFSET:
 return 0x0;
@@ -1247,14 +1248,13 @@ static void pci_vmsvga_map_mem(PCIDevice *pci_dev, int region_num,
 struct vmsvga_state_s *s = &d->chip;
 ram_addr_t iomemtype;
 
-s->vram_base = addr;
 #ifdef DIRECT_VRAM
 iomemtype = cpu_register_io_memory(vmsvga_vram_read,
 vmsvga_vram_write, s, DEVICE_NATIVE_ENDIAN);
 #else
 iomemtype = s->vga.vram_offset | IO_MEM_RAM;
 #endif
-cpu_register_physical_memory(s->vram_base, s->vga.vram_size,
+cpu_register_physical_memory(addr, s->vga.vram_size,
 iomemtype);
 
 s->vga.map_addr = addr;
-- 
1.7.5.3




Re: [Qemu-devel] [PATCH 2/7] Replace VMSTOP macros with a proper QemuState type

2011-08-08 Thread Avi Kivity

On 08/08/2011 04:28 PM, Luiz Capitulino wrote:

On Mon, 08 Aug 2011 16:27:05 +0300
Avi Kivity  wrote:

>  On 08/08/2011 04:25 PM, Luiz Capitulino wrote:
>  >  >
>  >  >   I'm worried about the following race:
>  >  >
>  >  >  stop
>  >  >  (qemu stopped for internal reason)
>  >  >  stop comment processed
>  >  >
>  >  >  resume
>  >  >
>  >  >   The (qemu stopped for internal reason) part is lost.
>  >
>  >  If the "stop" you're referring to happens through vm_stop(), then no,
>  >  it won't be lost because do_vm_stop() doesn't allow qemu to be stopped
>  >  twice.
>
>  What happens then?  The user sees an error?

It's ignored.


Well, then, the user won't know something happened and will happily 
resume the guest, like I outlined above.


When you ignore something in the first act, something breaks in the third.

--
error compiling committee.c: too many arguments to function




Re: [Qemu-devel] [PATCH v4] XBZRLE delta for live migration of large memory apps

2011-08-08 Thread Alexander Graf

On 08.08.2011, at 15:29, Anthony Liguori wrote:

> On 08/08/2011 03:42 AM, Shribman, Aidan wrote:
>> Subject: [PATCH v4] XBZRLE delta for live migration of large memory apps
>> From: Aidan Shribman
>> 
>> By using XBZRLE (Xor Binary Zero Run-Length-Encoding) we can reduce VM 
>> downtime
>> and total live-migration time of VMs running memory write intensive workloads
>> typical of large enterprise applications such as SAP ERP Systems, and 
>> generally
>> speaking for any application with a sparse memory update pattern.
>> 
>> On the sender side XBZRLE is used as a compact delta encoding of page 
>> updates,
>> retrieving the old page content from an LRU cache (default size of 64 MB). 
>> The
>> receiving side uses the existing page content and XBZRLE to decode the new 
>> page
>> content.
>> 
>> Work was originally based on research results published VEE 2011: Evaluation 
>> of
>> Delta Compression Techniques for Efficient Live Migration of Large Virtual
>> Machines by Benoit, Svard, Tordsson and Elmroth. Additionally the delta 
>> encoder
>> XBRLE was improved further using XBZRLE instead.
>> 
>> XBZRLE has a sustained bandwidth of 2-2.5 GB/s for typical workloads making 
>> it
>> ideal for in-line, real-time encoding such as is needed for live-migration.
>> 
>> A typical usage scenario:
>> {qemu} migrate_set_cachesize 256m
>> {qemu} migrate -x -d tcp:destination.host:
>> {qemu} info migrate
>> ...
>> transferred ram-duplicate: A kbytes
>> transferred ram-duplicate: B pages
>> transferred ram-normal: C kbytes
>> transferred ram-normal: D pages
>> transferred ram-xbrle: E kbytes
>> transferred ram-xbrle: F pages
>> overflow ram-xbrle: G pages
>> cache-hit ram-xbrle: H pages
>> cache-lookup ram-xbrle: J pages
>> 
>> Testing: live migration with XBZRLE completed in 110 seconds, without live
>> migration was not able to complete.
>> 
>> A simple synthetic memory r/w load generator:
>> #include <stdio.h>
>> #include <stdlib.h>
>> int main()
>> {
>>     char *buf = (char *) calloc(4096, 4096);
>>     while (1) {
>>         int i;
>>         for (i = 0; i < 4096 * 4; i++) {
>>             buf[i * 4096 / 4]++;
>>         }
>>         printf(".");
>>     }
>> }
>> 
>> Signed-off-by: Benoit Hudzia
>> Signed-off-by: Petter Svard
>> Signed-off-by: Aidan Shribman
> 
> One thing that strikes me about this algorithm is that it's very good for a 
> particular type of workload--shockingly good really.
> 
> I think workload aware migration compression is possible for a lot of 
> different types of workloads.  That makes me a bit wary of QEMU growing quite 
> a lot of compression mechanisms.
> 
> It makes me think that this logic may really belong at a higher level where 
> more information is known about the workload.  For instance, I can imagine 
> XBZRLE living in something like libvirt.
> 
> Today, parsing migration traffic is pretty horrible but I think we're pretty 
> strongly committed to fixing that in 1.0.  That makes me wonder if it would 
> be nicer architecturally for a higher level tool to own something like this.
> 
> Originally, when I added migration, I had the view that we would have 
> transport plugins based on the exec: protocol.  That hasn't really happened 
> since libvirt really owns migration but I think having XBZRLE as a transport 
> plugin for libvirt is something worth considering.
> 
> I'm curious what people think about this type of approach.  CC'ing libvirt to 
> get their input.

In general, I believe it's a good idea to keep looking at libvirt as a vm 
management layer and only a vm management layer. Directly working with the 
migration protocol basically ties us to libvirt if we want to do migration, 
killing competition in the management stack. Just look at how xm is tied to xen 
- it's one of the major points I dislike about it :).


Alex




[Qemu-devel] [PATCH v4 16/39] eepro100: convert to memory API

2011-08-08 Thread Avi Kivity
Note: the existing code aliases the flash BAR into the MMIO bar.  This is
probably a bug.  This patch does not correct the problem.

Reviewed-by: Richard Henderson 
Reviewed-by: Anthony Liguori 
Signed-off-by: Avi Kivity 
---
 hw/eepro100.c |  182 -
 1 files changed, 37 insertions(+), 145 deletions(-)

diff --git a/hw/eepro100.c b/hw/eepro100.c
index 9b6f4a5..04723f3 100644
--- a/hw/eepro100.c
+++ b/hw/eepro100.c
@@ -228,13 +228,14 @@ typedef struct {
 PCIDevice dev;
 /* Hash register (multicast mask array, multiple individual addresses). */
 uint8_t mult[8];
-int mmio_index;
+MemoryRegion mmio_bar;
+MemoryRegion io_bar;
+MemoryRegion flash_bar;
 NICState *nic;
 NICConf conf;
 uint8_t scb_stat;   /* SCB stat/ack byte */
 uint8_t int_stat;   /* PCI interrupt status */
 /* region must not be saved by nic_save. */
-uint32_t region1;   /* PCI region 1 address */
 uint16_t mdimem[32];
 eeprom_t *eeprom;
 uint32_t device;/* device variant */
@@ -1584,147 +1585,36 @@ static void eepro100_write4(EEPRO100State * s, uint32_t addr, uint32_t val)
 }
 }
 
-/*
- *
- * Port mapped I/O.
- *
- /
-
-static uint32_t ioport_read1(void *opaque, uint32_t addr)
-{
-EEPRO100State *s = opaque;
-#if 0
-logout("addr=%s\n", regname(addr));
-#endif
-return eepro100_read1(s, addr - s->region1);
-}
-
-static uint32_t ioport_read2(void *opaque, uint32_t addr)
-{
-EEPRO100State *s = opaque;
-return eepro100_read2(s, addr - s->region1);
-}
-
-static uint32_t ioport_read4(void *opaque, uint32_t addr)
-{
-EEPRO100State *s = opaque;
-return eepro100_read4(s, addr - s->region1);
-}
-
-static void ioport_write1(void *opaque, uint32_t addr, uint32_t val)
-{
-EEPRO100State *s = opaque;
-#if 0
-logout("addr=%s val=0x%02x\n", regname(addr), val);
-#endif
-eepro100_write1(s, addr - s->region1, val);
-}
-
-static void ioport_write2(void *opaque, uint32_t addr, uint32_t val)
-{
-EEPRO100State *s = opaque;
-eepro100_write2(s, addr - s->region1, val);
-}
-
-static void ioport_write4(void *opaque, uint32_t addr, uint32_t val)
-{
-EEPRO100State *s = opaque;
-eepro100_write4(s, addr - s->region1, val);
-}
-
-/***/
-/* PCI EEPRO100 definitions */
-
-static void pci_map(PCIDevice * pci_dev, int region_num,
-pcibus_t addr, pcibus_t size, int type)
-{
-EEPRO100State *s = DO_UPCAST(EEPRO100State, dev, pci_dev);
-
-TRACE(OTHER, logout("region %d, addr=0x%08"FMT_PCIBUS", "
-  "size=0x%08"FMT_PCIBUS", type=%d\n",
-  region_num, addr, size, type));
-
-assert(region_num == 1);
-register_ioport_write(addr, size, 1, ioport_write1, s);
-register_ioport_read(addr, size, 1, ioport_read1, s);
-register_ioport_write(addr, size, 2, ioport_write2, s);
-register_ioport_read(addr, size, 2, ioport_read2, s);
-register_ioport_write(addr, size, 4, ioport_write4, s);
-register_ioport_read(addr, size, 4, ioport_read4, s);
-
-s->region1 = addr;
-}
-
-/*
- *
- * Memory mapped I/O.
- *
- /
-
-static void pci_mmio_writeb(void *opaque, target_phys_addr_t addr, uint32_t val)
-{
-EEPRO100State *s = opaque;
-#if 0
-logout("addr=%s val=0x%02x\n", regname(addr), val);
-#endif
-eepro100_write1(s, addr, val);
-}
-
-static void pci_mmio_writew(void *opaque, target_phys_addr_t addr, uint32_t val)
+static uint64_t eepro100_read(void *opaque, target_phys_addr_t addr,
+  unsigned size)
 {
 EEPRO100State *s = opaque;
-#if 0
-logout("addr=%s val=0x%02x\n", regname(addr), val);
-#endif
-eepro100_write2(s, addr, val);
-}
 
-static void pci_mmio_writel(void *opaque, target_phys_addr_t addr, uint32_t val)
-{
-EEPRO100State *s = opaque;
-#if 0
-logout("addr=%s val=0x%02x\n", regname(addr), val);
-#endif
-eepro100_write4(s, addr, val);
-}
-
-static uint32_t pci_mmio_readb(void *opaque, target_phys_addr_t addr)
-{
-EEPRO100State *s = opaque;
-#if 0
-logout("addr=%s\n", regname(addr));
-#endif
-return eepro100_read1(s, addr);
+switch (size) {
+case 1: return eepro100_read1(s, addr);
+case 2: return eepro100_read2(s, addr);
+case 4: return eepro100_read4(s, addr);
+default: abort();
+}
 }
 
-static uint32_t pci_mmio_readw(void *opaque, target_phys_addr_t addr)
+static void eepro100_write(void *opaque, target_phys_addr_t addr,
+   uint64_t data, unsigned size)
 {
 EEPRO100State *s = opaque;
-#if 0
-logout("addr=%s\n", regname(addr));
-#endif

Re: [Qemu-devel] [PATCH v4] XBZRLE delta for live migration of large memory apps

2011-08-08 Thread Anthony Liguori

On 08/08/2011 08:41 AM, Alexander Graf wrote:


On 08.08.2011, at 15:29, Anthony Liguori wrote:


One thing that strikes me about this algorithm is that it's very good for a 
particular type of workload--shockingly good really.

I think workload aware migration compression is possible for a lot of different 
types of workloads.  That makes me a bit wary of QEMU growing quite a lot of 
compression mechanisms.

It makes me think that this logic may really belong at a higher level where 
more information is known about the workload.  For instance, I can imagine 
XBZRLE living in something like libvirt.

Today, parsing migration traffic is pretty horrible but I think we're pretty 
strongly committed to fixing that in 1.0.  That makes me wonder if it would be 
nicer architecturally for a higher level tool to own something like this.

Originally, when I added migration, I had the view that we would have transport 
plugins based on the exec: protocol.  That hasn't really happened since libvirt 
really owns migration but I think having XBZRLE as a transport plugin for 
libvirt is something worth considering.

I'm curious what people think about this type of approach.  CC'ing libvirt to 
get their input.


In general, I believe it's a good idea to keep looking at libvirt as a vm 
management layer and only a vm management layer. Directly working with the 
migration protocol basically ties us to libvirt if we want to do migration, 
killing competition in the management stack. Just look at how xm is tied to xen 
- it's one of the major points I dislike about it :).


The way I originally envisioned things, you'd have:

(qemu) migrate xbzrle://destination?opt1=value1&opt2=value2

Which would in turn be equivalent to:

(qemu) migrate exec:///usr/libexec/qemu/migration-helper-xbzrle --opt1=value1 --opt2=value2


But even if we supported that, it wouldn't get exposed via libvirt 
unless the libvirt guys exposed QEMU URIs directly.


So I think the open question is, how do we do transport plugins in a way 
that makes libvirt and QEMU both happy?


Regards,

Anthony Liguori



Alex







Re: [Qemu-devel] [PATCH 2/7] Replace VMSTOP macros with a proper QemuState type

2011-08-08 Thread Luiz Capitulino
On Mon, 08 Aug 2011 16:40:17 +0300
Avi Kivity  wrote:

> On 08/08/2011 04:28 PM, Luiz Capitulino wrote:
> > On Mon, 08 Aug 2011 16:27:05 +0300
> > Avi Kivity  wrote:
> >
> > >  On 08/08/2011 04:25 PM, Luiz Capitulino wrote:
> > >  >  >
> > >  >  >   I'm worried about the following race:
> > >  >  >
> > >  >  >  stop
> > >  >  >  (qemu stopped for internal reason)
> > >  >  >  stop comment processed
> > >  >  >
> > >  >  >  resume
> > >  >  >
> > >  >  >   The (qemu stopped for internal reason) part is lost.
> > >  >
> > >  >  If the "stop" you're referring to happens through vm_stop(), then no,
> > >  >  it won't be lost because do_vm_stop() doesn't allow qemu to be stopped
> > >  >  twice.
> > >
> > >  What happens then?  The user sees an error?
> >
> > It's ignored.
> 
> Well, then, the user won't know something happened and will happily 
> resume the guest, like I outlined above.

I think it makes sense to return an error in the monitor if the user
tries to stop qemu when it's already stopped. Not sure if it will do what you
think it should do, but we should always tell the user when we're unable to
carry out his/her orders.

But it does make sense to me to not allow stopping twice. First because it
doesn't make sense to stop something which is not moving and second because
what else can stop the vm if it's already stopped?

Maybe vm_stop() should return an error, but I think this goes beyond this
series.

> 
> When you ignore something in the first set, something breaks in the third.
> 




Re: [Qemu-devel] [PATCH] qemu-img: Use qemu_blockalign

2011-08-08 Thread Stefan Hajnoczi
On Mon, Aug 8, 2011 at 1:14 PM, Kevin Wolf  wrote:
> Now that you can use cache=none for the output file in qemu-img, we should
> properly align our buffers so that raw-posix doesn't have to use its (smaller)
> bounce buffer.
>
> Signed-off-by: Kevin Wolf 
> ---
>  qemu-img.c |   12 ++--
>  1 files changed, 6 insertions(+), 6 deletions(-)

Reviewed-by: Stefan Hajnoczi 



[Qemu-devel] [PATCH v4 26/39] pcnet: convert to memory API

2011-08-08 Thread Avi Kivity
Also related chips.

Reviewed-by: Richard Henderson 
Reviewed-by: Anthony Liguori 
Signed-off-by: Avi Kivity 
---
 hw/lance.c |   31 ++-
 hw/pcnet-pci.c |   74 +--
 hw/pcnet.h |4 ++-
 3 files changed, 61 insertions(+), 48 deletions(-)

diff --git a/hw/lance.c b/hw/lance.c
index ddb1cbb..8e20360 100644
--- a/hw/lance.c
+++ b/hw/lance.c
@@ -55,8 +55,8 @@ static void parent_lance_reset(void *opaque, int irq, int level)
 pcnet_h_reset(&d->state);
 }
 
-static void lance_mem_writew(void *opaque, target_phys_addr_t addr,
- uint32_t val)
+static void lance_mem_write(void *opaque, target_phys_addr_t addr,
+uint64_t val, unsigned size)
 {
 SysBusPCNetState *d = opaque;
 
@@ -64,7 +64,8 @@ static void lance_mem_writew(void *opaque, target_phys_addr_t addr,
 pcnet_ioport_writew(&d->state, addr, val & 0x);
 }
 
-static uint32_t lance_mem_readw(void *opaque, target_phys_addr_t addr)
+static uint64_t lance_mem_read(void *opaque, target_phys_addr_t addr,
+   unsigned size)
 {
 SysBusPCNetState *d = opaque;
 uint32_t val;
@@ -74,16 +75,14 @@ static uint32_t lance_mem_readw(void *opaque, target_phys_addr_t addr)
 return val & 0x;
 }
 
-static CPUReadMemoryFunc * const lance_mem_read[3] = {
-NULL,
-lance_mem_readw,
-NULL,
-};
-
-static CPUWriteMemoryFunc * const lance_mem_write[3] = {
-NULL,
-lance_mem_writew,
-NULL,
+static const MemoryRegionOps lance_mem_ops = {
+.read = lance_mem_read,
+.write = lance_mem_write,
+.endianness = DEVICE_NATIVE_ENDIAN,
+.valid = {
+.min_access_size = 2,
+.max_access_size = 2,
+},
 };
 
 static void lance_cleanup(VLANClientState *nc)
@@ -117,13 +116,11 @@ static int lance_init(SysBusDevice *dev)
 SysBusPCNetState *d = FROM_SYSBUS(SysBusPCNetState, dev);
 PCNetState *s = &d->state;
 
-s->mmio_index =
-cpu_register_io_memory(lance_mem_read, lance_mem_write, d,
-   DEVICE_NATIVE_ENDIAN);
+memory_region_init_io(&s->mmio, &lance_mem_ops, s, "lance-mmio", 4);
 
 qdev_init_gpio_in(&dev->qdev, parent_lance_reset, 1);
 
-sysbus_init_mmio(dev, 4, s->mmio_index);
+sysbus_init_mmio_region(dev, &s->mmio);
 
 sysbus_init_irq(dev, &s->irq);
 
diff --git a/hw/pcnet-pci.c b/hw/pcnet-pci.c
index 216cf81..a25f565 100644
--- a/hw/pcnet-pci.c
+++ b/hw/pcnet-pci.c
@@ -46,6 +46,7 @@
 typedef struct {
 PCIDevice pci_dev;
 PCNetState state;
+MemoryRegion io_bar;
 } PCIPCNetState;
 
 static void pcnet_aprom_writeb(void *opaque, uint32_t addr, uint32_t val)
@@ -69,25 +70,41 @@ static uint32_t pcnet_aprom_readb(void *opaque, uint32_t addr)
 return val;
 }
 
-static void pcnet_ioport_map(PCIDevice *pci_dev, int region_num,
- pcibus_t addr, pcibus_t size, int type)
+static uint64_t pcnet_ioport_read(void *opaque, target_phys_addr_t addr,
+  unsigned size)
 {
-PCNetState *d = &DO_UPCAST(PCIPCNetState, pci_dev, pci_dev)->state;
+PCNetState *d = opaque;
 
-#ifdef PCNET_DEBUG_IO
-printf("pcnet_ioport_map addr=0x%04"FMT_PCIBUS" size=0x%04"FMT_PCIBUS"\n",
-   addr, size);
-#endif
+if (addr < 16 && size == 1) {
+return pcnet_aprom_readb(d, addr);
+} else if (addr >= 0x10 && addr < 0x20 && size == 2) {
+return pcnet_ioport_readw(d, addr);
+} else if (addr >= 0x10 && addr < 0x20 && size == 4) {
+return pcnet_ioport_readl(d, addr);
+}
+return ((uint64_t)1 << (size * 8)) - 1;
+}
 
-register_ioport_write(addr, 16, 1, pcnet_aprom_writeb, d);
-register_ioport_read(addr, 16, 1, pcnet_aprom_readb, d);
+static void pcnet_ioport_write(void *opaque, target_phys_addr_t addr,
+   uint64_t data, unsigned size)
+{
+PCNetState *d = opaque;
 
-register_ioport_write(addr + 0x10, 0x10, 2, pcnet_ioport_writew, d);
-register_ioport_read(addr + 0x10, 0x10, 2, pcnet_ioport_readw, d);
-register_ioport_write(addr + 0x10, 0x10, 4, pcnet_ioport_writel, d);
-register_ioport_read(addr + 0x10, 0x10, 4, pcnet_ioport_readl, d);
+if (addr < 16 && size == 1) {
+return pcnet_aprom_writeb(d, addr, data);
+} else if (addr >= 0x10 && addr < 0x20 && size == 2) {
+return pcnet_ioport_writew(d, addr, data);
+} else if (addr >= 0x10 && addr < 0x20 && size == 4) {
+return pcnet_ioport_writel(d, addr, data);
+}
 }
 
+static const MemoryRegionOps pcnet_io_ops = {
+.read = pcnet_ioport_read,
+.write = pcnet_ioport_write,
+.endianness = DEVICE_NATIVE_ENDIAN,
+};
+
 static void pcnet_mmio_writeb(void *opaque, target_phys_addr_t addr, uint32_t val)
 {
 PCNetState *d = opaque;
@@ -202,16 +219,12 @@ static const VMStateDescription vmstate_pci_pcnet = {
 
 /* PCI interface */
 
static CPUWriteMemoryFunc

Re: [Qemu-devel] [PATCH v4] XBZRLE delta for live migration of large memory apps

2011-08-08 Thread Avi Kivity

On 08/08/2011 04:41 PM, Alexander Graf wrote:

In general, I believe it's a good idea to keep looking at libvirt as a vm 
management layer and only a vm management layer.


Very much yes.

--
error compiling committee.c: too many arguments to function




Re: [Qemu-devel] [PATCH v4] XBZRLE delta for live migration of large memory apps

2011-08-08 Thread Avi Kivity

On 08/08/2011 04:29 PM, Anthony Liguori wrote:


One thing that strikes me about this algorithm is that it's very good 
for a particular type of workload--shockingly good really.


Poking bytes at random places in memory is fairly generic.  If you have 
a lot of small objects, and modify a subset of them, this is the pattern 
you get.




I think workload aware migration compression is possible for a lot of 
different types of workloads.  That makes me a bit wary of QEMU 
growing quite a lot of compression mechanisms.


It makes me think that this logic may really belong at a higher level 
where more information is known about the workload.  For instance, I 
can imagine XBZRLE living in something like libvirt.


A better model would be plugin based.

--
error compiling committee.c: too many arguments to function




Re: [Qemu-devel] [PATCH 2/7] Replace VMSTOP macros with a proper QemuState type

2011-08-08 Thread Avi Kivity

On 08/08/2011 04:47 PM, Luiz Capitulino wrote:

>
>  Well, then, the user won't know something happened and will happily
>  resume the guest, like I outlined above.

I think it makes sense to return an error in the monitor if the user
tries to stop qemu when it's already stopped. Not sure if it will do what you
think it should do, but we should always tell the user when we're unable to
carry his/her orders.

But it does make sense to me to not allow stopping twice. First because it
doesn't make sense to stop something which is not moving and second because
what else can stop the vm if it's already stopped?

Maybe vm_stop() should return an error, but I think this goes beyond this
series.



This is why I suggested a reference count.  In this case, we can always 
stop the guest "twice", because we don't lose information when we resume.



--
error compiling committee.c: too many arguments to function




[Qemu-devel] [PATCH v4 13/39] rtl8139: convert to memory API

2011-08-08 Thread Avi Kivity
Reviewed-by: Richard Henderson 
Reviewed-by: Anthony Liguori 
Signed-off-by: Avi Kivity 
---
 hw/rtl8139.c |   72 ++---
 1 files changed, 38 insertions(+), 34 deletions(-)

diff --git a/hw/rtl8139.c b/hw/rtl8139.c
index 5214b8c..f07af35 100644
--- a/hw/rtl8139.c
+++ b/hw/rtl8139.c
@@ -474,7 +474,6 @@ typedef struct RTL8139State {
 
 NICState *nic;
 NICConf conf;
-int rtl8139_mmio_io_addr;
 
 /* C ring mode */
 uint32_t   currTxDesc;
@@ -506,6 +505,9 @@ typedef struct RTL8139State {
 QEMUTimer *timer;
 int64_t TimerExpire;
 
+MemoryRegion bar_io;
+MemoryRegion bar_mem;
+
 /* Support migration to/from old versions */
 int rtl8139_mmio_io_addr_dummy;
 } RTL8139State;
@@ -3283,7 +3285,7 @@ static void rtl8139_pre_save(void *opaque)
 rtl8139_set_next_tctr_time(s, current_time);
 s->TCTR = muldiv64(current_time - s->TCTR_base, PCI_FREQUENCY, get_ticks_per_sec());
-s->rtl8139_mmio_io_addr_dummy = s->rtl8139_mmio_io_addr;
+s->rtl8139_mmio_io_addr_dummy = 0;
 }
 
 static const VMStateDescription vmstate_rtl8139 = {
@@ -3379,31 +3381,35 @@ static const VMStateDescription vmstate_rtl8139 = {
 /***/
 /* PCI RTL8139 definitions */
 
-static void rtl8139_ioport_map(PCIDevice *pci_dev, int region_num,
-   pcibus_t addr, pcibus_t size, int type)
-{
-RTL8139State *s = DO_UPCAST(RTL8139State, dev, pci_dev);
-
-register_ioport_write(addr, 0x100, 1, rtl8139_ioport_writeb, s);
-register_ioport_read( addr, 0x100, 1, rtl8139_ioport_readb,  s);
-
-register_ioport_write(addr, 0x100, 2, rtl8139_ioport_writew, s);
-register_ioport_read( addr, 0x100, 2, rtl8139_ioport_readw,  s);
-
-register_ioport_write(addr, 0x100, 4, rtl8139_ioport_writel, s);
-register_ioport_read( addr, 0x100, 4, rtl8139_ioport_readl,  s);
-}
+static const MemoryRegionPortio rtl8139_portio[] = {
+{ 0, 0x100, 1, .read = rtl8139_ioport_readb, },
+{ 0, 0x100, 1, .write = rtl8139_ioport_writeb, },
+{ 0, 0x100, 2, .read = rtl8139_ioport_readw, },
+{ 0, 0x100, 2, .write = rtl8139_ioport_writew, },
+{ 0, 0x100, 4, .read = rtl8139_ioport_readl, },
+{ 0, 0x100, 4, .write = rtl8139_ioport_writel, },
+PORTIO_END_OF_LIST()
+};
 
-static CPUReadMemoryFunc * const rtl8139_mmio_read[3] = {
-rtl8139_mmio_readb,
-rtl8139_mmio_readw,
-rtl8139_mmio_readl,
+static const MemoryRegionOps rtl8139_io_ops = {
+.old_portio = rtl8139_portio,
+.endianness = DEVICE_LITTLE_ENDIAN,
 };
 
-static CPUWriteMemoryFunc * const rtl8139_mmio_write[3] = {
-rtl8139_mmio_writeb,
-rtl8139_mmio_writew,
-rtl8139_mmio_writel,
+static const MemoryRegionOps rtl8139_mmio_ops = {
+.old_mmio = {
+.read = {
+rtl8139_mmio_readb,
+rtl8139_mmio_readw,
+rtl8139_mmio_readl,
+},
+.write = {
+rtl8139_mmio_writeb,
+rtl8139_mmio_writew,
+rtl8139_mmio_writel,
+},
+},
+.endianness = DEVICE_LITTLE_ENDIAN,
 };
 
 static void rtl8139_timer(void *opaque)
@@ -3432,7 +3438,8 @@ static int pci_rtl8139_uninit(PCIDevice *dev)
 {
 RTL8139State *s = DO_UPCAST(RTL8139State, dev, dev);
 
-cpu_unregister_io_memory(s->rtl8139_mmio_io_addr);
+memory_region_destroy(&s->bar_io);
+memory_region_destroy(&s->bar_mem);
 if (s->cplus_txbuffer) {
 qemu_free(s->cplus_txbuffer);
 s->cplus_txbuffer = NULL;
@@ -3462,15 +3469,12 @@ static int pci_rtl8139_init(PCIDevice *dev)
  * list bit in status register, and offset 0xdc seems unused. */
 pci_conf[PCI_CAPABILITY_LIST] = 0xdc;
 
-/* I/O handler for memory-mapped I/O */
-s->rtl8139_mmio_io_addr =
-cpu_register_io_memory(rtl8139_mmio_read, rtl8139_mmio_write, s,
-   DEVICE_LITTLE_ENDIAN);
-
-pci_register_bar(&s->dev, 0, 0x100,
-   PCI_BASE_ADDRESS_SPACE_IO,  rtl8139_ioport_map);
-
-pci_register_bar_simple(&s->dev, 1, 0x100, 0, s->rtl8139_mmio_io_addr);
+memory_region_init_io(&s->bar_io, &rtl8139_io_ops, s, "rtl8139", 0x100);
+memory_region_init_io(&s->bar_mem, &rtl8139_mmio_ops, s, "rtl8139", 0x100);
+pci_register_bar_region(&s->dev, 0, PCI_BASE_ADDRESS_SPACE_IO,
+&s->bar_io);
+pci_register_bar_region(&s->dev, 1, PCI_BASE_ADDRESS_SPACE_MEMORY,
+&s->bar_mem);
 
 qemu_macaddr_default_if_unset(&s->conf.macaddr);
 
-- 
1.7.5.3




Re: [Qemu-devel] support for Freescale MPC8xx (850/860) processors/platforms

2011-08-08 Thread Alexander Graf

On 08.08.2011, at 03:36, Brendan Simon (eTRIX) wrote:

> Hi,
> 
> Anyone working on Freescale MPC8xx (embedded PowerPC) processors ??
> 
> I'm trying to ascertain if the MPC8xx (MPC850) processors are supported, and 
> if not, how much effort would be required to get it working.

I'm not aware of anyone working on MPC8xx CPU support. In fact, I don't really 
know what they look like - I guess they're 601 derived?

Ben, any idea how much different 8xx is?


Alex

> 
> Thanks for any help.
> Brendan.
> 
> 
> On 5/08/11 9:31 PM, Brendan Simon (eTRIX) wrote:
>> 
>> Hello,
>> 
>> Does QEMU support the Freescale MPC8xx (MPC850) processors or platforms (e.g 
>> FADS860) ??
>> 
>> Googling shows some code that suggests that it is not supported.
>> > cpu_abort(env, "MPC8xx MMU model is not implemented\n");
>> 
>> If not supported, does anyone have any idea how much work there would be to 
>> add support for MPC850/860 and to create a platform that has supports DRAM, 
>> Flash, the CPM peripherals (SMC/USARTS, SCC/UARTS, SCC/Ethernet, BRG timers, 
>> etc) ??
>> 
>> The platform I would like to build has 1 x Ethernet (using SCC2), 1 x UART 
>> (on SMC1), 1 x UART (on SCC3), 1 x UART (external UART chip), and an Altera 
>> CPLD with digital I/O.
>> 
>> Are there any other similar platforms I could look at, to either port or use 
>> as a reference, to build an MPC850 platform as described ??
>> 
>> Thanks,
>> Brendan.
>> 
> 



[Qemu-devel] [PATCH v4 05/39] cirrus: simplify mmio BAR access functions

2011-08-08 Thread Avi Kivity
Make use of the memory API's ability to satisfy multi-byte accesses via
multiple single-byte accesses.

Reviewed-by: Richard Henderson 
Signed-off-by: Avi Kivity 
---
 hw/cirrus_vga.c |   78 +-
 1 files changed, 8 insertions(+), 70 deletions(-)

diff --git a/hw/cirrus_vga.c b/hw/cirrus_vga.c
index ad23c4a..4f57b92 100644
--- a/hw/cirrus_vga.c
+++ b/hw/cirrus_vga.c
@@ -2827,12 +2827,11 @@ static void cirrus_vga_ioport_write(void *opaque, uint32_t addr, uint32_t val)
  *
  ***/
 
-static uint32_t cirrus_mmio_readb(void *opaque, target_phys_addr_t addr)
+static uint64_t cirrus_mmio_read(void *opaque, target_phys_addr_t addr,
+ unsigned size)
 {
 CirrusVGAState *s = opaque;
 
-addr &= CIRRUS_PNPMMIO_SIZE - 1;
-
 if (addr >= 0x100) {
 return cirrus_mmio_blt_read(s, addr - 0x100);
 } else {
@@ -2840,33 +2839,11 @@ static uint32_t cirrus_mmio_readb(void *opaque, target_phys_addr_t addr)
 }
 }
 
-static uint32_t cirrus_mmio_readw(void *opaque, target_phys_addr_t addr)
-{
-uint32_t v;
-
-v = cirrus_mmio_readb(opaque, addr);
-v |= cirrus_mmio_readb(opaque, addr + 1) << 8;
-return v;
-}
-
-static uint32_t cirrus_mmio_readl(void *opaque, target_phys_addr_t addr)
-{
-uint32_t v;
-
-v = cirrus_mmio_readb(opaque, addr);
-v |= cirrus_mmio_readb(opaque, addr + 1) << 8;
-v |= cirrus_mmio_readb(opaque, addr + 2) << 16;
-v |= cirrus_mmio_readb(opaque, addr + 3) << 24;
-return v;
-}
-
-static void cirrus_mmio_writeb(void *opaque, target_phys_addr_t addr,
-  uint32_t val)
+static void cirrus_mmio_write(void *opaque, target_phys_addr_t addr,
+  uint64_t val, unsigned size)
 {
 CirrusVGAState *s = opaque;
 
-addr &= CIRRUS_PNPMMIO_SIZE - 1;
-
 if (addr >= 0x100) {
cirrus_mmio_blt_write(s, addr - 0x100, val);
 } else {
@@ -2874,53 +2851,14 @@ static void cirrus_mmio_writeb(void *opaque, target_phys_addr_t addr,
 }
 }
 
-static void cirrus_mmio_writew(void *opaque, target_phys_addr_t addr,
-  uint32_t val)
-{
-cirrus_mmio_writeb(opaque, addr, val & 0xff);
-cirrus_mmio_writeb(opaque, addr + 1, (val >> 8) & 0xff);
-}
-
-static void cirrus_mmio_writel(void *opaque, target_phys_addr_t addr,
-  uint32_t val)
-{
-cirrus_mmio_writeb(opaque, addr, val & 0xff);
-cirrus_mmio_writeb(opaque, addr + 1, (val >> 8) & 0xff);
-cirrus_mmio_writeb(opaque, addr + 2, (val >> 16) & 0xff);
-cirrus_mmio_writeb(opaque, addr + 3, (val >> 24) & 0xff);
-}
-
-
-static uint64_t cirrus_mmio_read(void *opaque, target_phys_addr_t addr,
- unsigned size)
-{
-CirrusVGAState *s = opaque;
-
-switch (size) {
-case 1: return cirrus_mmio_readb(s, addr);
-case 2: return cirrus_mmio_readw(s, addr);
-case 4: return cirrus_mmio_readl(s, addr);
-default: abort();
-}
-};
-
-static void cirrus_mmio_write(void *opaque, target_phys_addr_t addr,
-  uint64_t data, unsigned size)
-{
-CirrusVGAState *s = opaque;
-
-switch (size) {
-case 1: return cirrus_mmio_writeb(s, addr, data);
-case 2: return cirrus_mmio_writew(s, addr, data);
-case 4: return cirrus_mmio_writel(s, addr, data);
-default: abort();
-}
-};
-
 static const MemoryRegionOps cirrus_mmio_io_ops = {
 .read = cirrus_mmio_read,
 .write = cirrus_mmio_write,
 .endianness = DEVICE_LITTLE_ENDIAN,
+.impl = {
+.min_access_size = 1,
+.max_access_size = 1,
+},
 };
 
 /* load/save state */
-- 
1.7.5.3




Re: [Qemu-devel] [PATCH v3] XBZRLE delta for live migration of large memory apps

2011-08-08 Thread Stefan Hajnoczi
On Mon, Aug 8, 2011 at 9:42 AM, Shribman, Aidan  wrote:
>> -Original Message-
>> From: Stefan Hajnoczi [mailto:stefa...@gmail.com]
>> Sent: Tuesday, August 02, 2011 9:06 PM
>> To: Shribman, Aidan
>> Cc: qemu-devel@nongnu.org; Anthony Liguori
>> Subject: Re: [PATCH v3] XBZRLE delta for live migration of
>> large memory apps
>>
>> On Tue, Aug 02, 2011 at 03:45:56PM +0200, Shribman, Aidan wrote:
>> > Subject: [PATCH v3] XBZRLE delta for live migration of
>> large memory apps
>> > From: Aidan Shribman 
>> >
>> > By using XBZRLE (Xor Binary Zero Run-Length-Encoding) we
>> can reduce VM downtime
>> > and total live-migration time for VMs running memory write
>> intensive workloads
>> > typical of large enterprise applications such as SAP ERP
>> Systems, and generally
>> speaking for any application with a
>> sparse memory update pattern.
>> >
>> > On the sender side XBZRLE is used as a compact delta
>> encoding of page updates,
>> > retrieving the old page content from an LRU cache (default
>> size of 64 MB). The
>> > receiving side uses the existing page content and XBZRLE to
>> decode the new page
>> > content.
>> >
>> > Work was originally based on research results published VEE
>> 2011: Evaluation of
>> > Delta Compression Techniques for Efficient Live Migration
>> of Large Virtual
>> > Machines by Benoit, Svard, Tordsson and Elmroth.
>> Additionally the delta encoder
>> > XBRLE was improved further using XBZRLE instead.
>> >
>> > XBZRLE has a sustained bandwidth of 1.5-2.2 GB/s for
>> typical workloads making it
>> > ideal for in-line, real-time encoding such as is needed for
>> live-migration.
>>
>> What is the CPU cost of xbzrle live migration on the source host?  I'm
>> thinking about a graph showing CPU utilization (e.g. from mpstat(1))
>> that has two datasets: migration without xbzrle and migration with
>> xbzrle.
>>
>
> xbzrle.out indicates that xbzrle is using 50% of the compute capacity during
> the xbzrle live-migration (which completed in a few seconds). In vanilla.out,
> between 30%-60% of compute is directed toward the live-migration itself - in
> this case live-migration is not able to complete.
>
> -
>
> root@ilrsh01:~#
> root@ilrsh01:~# cat xbzrle.out
> Linux 2.6.35-22-server (ilrsh01)        08/07/2011      _x86_64_        (2 CPU)
>
> 10:55:37 AM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest    %idle
> 10:55:38 AM  all   40.50    0.00    1.00    1.50    0.00    9.00    0.00    0.00   48.00
> 10:55:38 AM    0    0.00    0.00    1.00    3.00    0.00    0.00    0.00    0.00   96.00
> 10:55:38 AM    1   81.00    0.00    1.00    0.00    0.00   18.00    0.00    0.00    0.00

Too bad mpstat %guest is not being displayed correctly here.  That
would make it much easier to see how much CPU is spent executing guest
code and how much doing live migration.  Old system?

>> > +int xbzrle_encode(uint8_t *xbzrle, const uint8_t *old,
>> const uint8_t *curr,
>> > +    const size_t max_compressed_len)
>> > +{
>> > +    int compressed_len;
>> > +
>> > +    xor_encode_word(xor_buf, old, curr);
>> > +    compressed_len = rle_encode((uint64_t *)xor_buf,
>> > +        sizeof(xor_buf)/sizeof(uint64_t), xbzrle_buf,
>> > +        sizeof(xbzrle_buf));
>> > +    if (compressed_len > max_compressed_len) {
>> > +        return -1;
>> > +    }
>> > +    memcpy(xbzrle, xbzrle_buf, compressed_len);
>>
>> Why the intermediate xbrzle_buf buffer and why the memcpy()?
>
> xbzrle encoding may take up to 150% of the input size in a rare worst case
> scenario - to avoid having to check during each xbzrle iteration, or
> alternatively adding a loop that checks for overflow potential during the
> xbzrle encoding, I use xbzrle_buf as a working area. memcpy() is a factor
> faster than xbzrle, so its slow-down is insignificant.

I missed that the encode/decode functions do not check their dlen
parameter.  dlen is unused and should be removed.

Stefan



[Qemu-devel] [PATCH v4 37/39] pci: fold BAR mapping function into its caller

2011-08-08 Thread Avi Kivity
There is only one function, so no need for a function pointer.

Reviewed-by: Richard Henderson 
Reviewed-by: Anthony Liguori 
Signed-off-by: Avi Kivity 
---
 hw/pci.c |   25 +
 hw/pci.h |1 -
 2 files changed, 9 insertions(+), 17 deletions(-)

diff --git a/hw/pci.c b/hw/pci.c
index 62b34d4..aa17395 100644
--- a/hw/pci.c
+++ b/hw/pci.c
@@ -881,18 +881,6 @@ static int pci_unregister_device(DeviceState *dev)
 return 0;
 }
 
-static void pci_simple_bar_mapfunc_region(PCIDevice *pci_dev, int region_num,
-  pcibus_t addr, pcibus_t size,
-  int type)
-{
-PCIIORegion *r = &pci_dev->io_regions[region_num];
-
-memory_region_add_subregion_overlap(r->address_space,
-addr,
-r->memory,
-1);
-}
-
 void pci_register_bar_region(PCIDevice *pci_dev, int region_num,
  uint8_t type, MemoryRegion *memory)
 {
@@ -914,7 +902,6 @@ void pci_register_bar_region(PCIDevice *pci_dev, int 
region_num,
 r->size = size;
 r->filtered_size = size;
 r->type = type;
-r->map_func = pci_simple_bar_mapfunc_region;
 r->memory = NULL;
 
 wmask = ~(size - 1);
@@ -1102,10 +1089,16 @@ static void pci_update_mappings(PCIDevice *d)
  * addr & (size - 1) != 0.
  */
 if (r->type & PCI_BASE_ADDRESS_SPACE_IO) {
-r->map_func(d, i, r->addr, r->filtered_size, r->type);
+memory_region_add_subregion_overlap(r->address_space,
+r->addr,
+r->memory,
+1);
 } else {
-r->map_func(d, i, pci_to_cpu_addr(d->bus, r->addr),
-r->filtered_size, r->type);
+memory_region_add_subregion_overlap(r->address_space,
+pci_to_cpu_addr(d->bus,
+r->addr),
+r->memory,
+1);
 }
 }
 }
diff --git a/hw/pci.h b/hw/pci.h
index 8028176..8d1662a 100644
--- a/hw/pci.h
+++ b/hw/pci.h
@@ -92,7 +92,6 @@ typedef struct PCIIORegion {
 pcibus_t size;
 pcibus_t filtered_size;
 uint8_t type;
-PCIMapIORegionFunc *map_func;
 MemoryRegion *memory;
 MemoryRegion *address_space;
 } PCIIORegion;
-- 
1.7.5.3




Re: [Qemu-devel] [libvirt] [PATCH v4] XBZRLE delta for live migration of large memory apps

2011-08-08 Thread Daniel P. Berrange
On Mon, Aug 08, 2011 at 08:29:51AM -0500, Anthony Liguori wrote:
> On 08/08/2011 03:42 AM, Shribman, Aidan wrote:
> >Subject: [PATCH v4] XBZRLE delta for live migration of large memory apps
> >From: Aidan Shribman
> >
> >By using XBZRLE (Xor Binary Zero Run-Length-Encoding) we can reduce VM 
> >downtime
> >and total live-migration time of VMs running memory write intensive workloads
> >typical of large enterprise applications such as SAP ERP Systems, and 
> >generally
> >speaking for any application with a sparse memory update pattern.

[snip]

> One thing that strikes me about this algorithm is that it's very
> good for a particular type of workload--shockingly good really.
> 
> I think workload aware migration compression is possible for a lot
> of different types of workloads.  That makes me a bit wary of QEMU
> growing quite a lot of compression mechanisms.
> 
> It makes me think that this logic may really belong at a higher
> level where more information is known about the workload.  For
> instance, I can imagine XBZRLE living in something like libvirt.
> 
> Today, parsing migration traffic is pretty horrible but I think
> we're pretty strongly committed to fixing that in 1.0.  That makes
> me wonder if it would be nicer architecturally for a higher level
> tool to own something like this.
> 
> Originally, when I added migration, I had the view that we would
> have transport plugins based on the exec: protocol.  That hasn't
> really happened since libvirt really owns migration but I think
> having XBZRLE as a transport plugin for libvirt is something worth
> considering.

NB I've not been much of a fan of the exec: migration code, since it
has proved rather buggy in practice when we used it for 'save/restore
to/from file' support. It has been hard to diagnose when things go
wrong, and difficult for QEMU to report any useful error messages.
Even with the tcp: protocol, QEMU is seemingly unable to provide any
useful error reporting even of things as simple as "unable to connect
to remote host". So with one exception, current libvirt now uses the
'fd:' protocol for everything, and the last exception will be removed
soon too.

> I'm curious what people think about this type of approach.  CC'ing
> libvirt to get their input.

In "normal" migration though, even when using fd:, we don't make
any attempt to touch the data stream. We just pass a pre-connected
TCP socket into QEMU and let it write directly to it. This avoids
extra data copying via libvirt.

In our alternative "tunnelled" migration mode, libvirt does touch
the data stream, passing a pipe FD into QEMU, and copying the data
from the pipe into packets to be sent over libvirtd's existing
secure RPC stream, and then copying it back to QEMU on the destination.
The downside here is that we've added several extra data copies.

In our "save/restore to file" code, we use 'fd:' and always have
to send the data via a filter program. For example, we have the
ability to compress/decompress data via gzip, bzip, xz, and lzop,
for which we instead pass QEMU a pipe FD to the external compression
helper program. We also have another new option where we send data
via another I/O helper program that uses O_DIRECT, so save/restore
does not pollute the page cache.
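
(For illustration, the filter-program wiring described here boils down to
something like the following sketch; the helper choice and error handling are
simplified and this is not libvirt's actual code:)

/* Sketch: hand QEMU the write end of a pipe whose read end feeds an
 * external compression helper writing the save file. Illustrative only. */
#include <fcntl.h>
#include <unistd.h>

int open_compressed_save_fd(const char *path)
{
    int pfd[2];

    if (pipe(pfd) < 0) {
        return -1;
    }
    if (fork() == 0) {                    /* child: gzip from pipe to file */
        int out = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
        dup2(pfd[0], STDIN_FILENO);
        dup2(out, STDOUT_FILENO);
        close(pfd[1]);
        execlp("gzip", "gzip", "-c", (char *)NULL);
        _exit(127);
    }
    close(pfd[0]);
    return pfd[1];                        /* passed to QEMU via 'fd:' */
}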

With this kind of existing precedent, I won't strongly argue against
libvirt adding a filter to support this XBZRLE encoding scheme for
migration, or indeed save/restore too, if it proves better than
lzop, which is our current optimal speed/compression winner.

My main concern with all these scenarios where libvirt touches the
actual data stream though is that we're introducing extra data copies
into the migration path which potentially waste CPU cycles.
If QEMU can directly XBZRLE encode data into the FD passed via 'fd:'
then we minimize data copies. Whether this is a big enough benefit
to offset the burden of having to maintain various compression code
options in QEMU I can't answer.

Regards,
Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|



Re: [Qemu-devel] [PATCH 2/7] Replace VMSTOP macros with a proper QemuState type

2011-08-08 Thread Luiz Capitulino
On Mon, 08 Aug 2011 16:54:16 +0300
Avi Kivity  wrote:

> On 08/08/2011 04:47 PM, Luiz Capitulino wrote:
> > >
> > >  Well, then, the user won't know something happened and will happily
> > >  resume the guest, like I outlined above.
> >
> > I think it makes sense to return an error in the monitor if the user
> > tries to stop qemu when it's already stopped. Not sure if it will do what 
> > you
> > think it should do, but we should always tell the user when we're unable to
> > carry his/her orders.
> >
> > But it does make sense to me to not allow stopping twice. First because it
> > doesn't make sense to stop something which is not moving and second because
> > what else can stop the vm if it's already stopped?
> >
> > Maybe vm_stop() should return an error, but I think this goes beyond this
> > series.
> >
> 
> This is why I suggested a reference count.  In this case, we can always 
> stop the guest "twice", because we don't lose information when we resume.

I could give it a try in the near future, as I really think it's independent
from this series, but I still don't understand what can stop an already stopped
VM besides the user. This is a real question: is it really possible?

If only the user can do that, then the refcount is overkill as just returning
an error will do it.
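
(For illustration, the reference-count idea being discussed amounts to
something like this sketch; the function names are made up and this is not
QEMU's actual API:)

/* Illustrative stop/resume refcount; names are made up. */
static int vm_stop_count;

void vm_stop_ref(int reason)
{
    if (vm_stop_count++ == 0) {
        do_vm_stop(reason);     /* actually pause the guest on 0 -> 1 */
    }
}

int vm_resume_unref(void)
{
    if (vm_stop_count == 0) {
        return -1;              /* already running: report an error */
    }
    if (--vm_stop_count == 0) {
        vm_start();             /* resume once every stopper has released */
    }
    return 0;
}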



Re: [Qemu-devel] support for Freescale MPC8xx (850/860) processors/platforms

2011-08-08 Thread Benjamin Herrenschmidt
On Mon, 2011-08-08 at 15:57 +0200, Alexander Graf wrote:
> > Hi,
> > 
> > Anyone working on Freescale MPC8xx (embedded PowerPC) processors ??
> > 
> > I'm trying to ascertain if the MPC8xx (MPC850) processors are
> > supported, and if not, how much effort would be required to get it
> > working.
> > 
> 
> 
> I'm not aware of anyone working on MPC8xx CPU support. In fact, I
> don't really know what they look like - I guess they're 601 derived? 
> 
> Ben, any idea how much different 8xx is? 

Gothic horrors :-)

They have an MMU of their own: a software-loaded TLB, but different from
anything else. They have a lot of other "quirks" too.

Cheers,
Ben.





Re: [Qemu-devel] [PATCH] Introduce short names for fixed width integer types

2011-08-08 Thread Kevin Wolf
Am 08.08.2011 15:00, schrieb Anthony Liguori:
> On 08/08/2011 07:56 AM, Avi Kivity wrote:
>> QEMU deals with a lot of fixed width integer types; their names
>> (uint64_t etc) are clumsy to use and take up a lot of space.
>>
>> Following Linux, introduce shorter names, for example U64 for
>> uint64_t.
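
(For reference, the proposal boils down to shorthand typedefs along these
lines; an illustrative sketch, not the actual patch:)

#include <stdint.h>

/* Illustrative short names as proposed; not the actual patch. */
typedef uint8_t  U8;
typedef uint16_t U16;
typedef uint32_t U32;
typedef uint64_t U64;
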
> 
> Except Linux uses lower case letters.
> 
> I personally think Linux style is wrong here.  The int8_t types are 
> standard types.

I fully agree, we should use the standard types.

> Besides, we save lots of characters by using 4-space tabs instead of 
> 8-space tabs.  We can afford to spend some of those saved characters on 
> using proper type names :-)

Heh, I like this reasoning. :-)

Kevin



Re: [Qemu-devel] [PATCH v4] XBZRLE delta for live migration of large memory apps

2011-08-08 Thread Anthony Liguori

On 08/08/2011 08:51 AM, Avi Kivity wrote:

On 08/08/2011 04:29 PM, Anthony Liguori wrote:


One thing that strikes me about this algorithm is that it's very good
for a particular type of workload--shockingly good really.


Poking bytes at random places in memory is fairly generic. If you have a
lot of small objects, and modify a subset of them, this is the pattern
you get.



I think workload aware migration compression is possible for a lot of
different types of workloads. That makes me a bit wary of QEMU growing
quite a lot of compression mechanisms.

It makes me think that this logic may really belong at a higher level
where more information is known about the workload. For instance, I
can imagine XBZRLE living in something like libvirt.


A better model would be plugin based.


exec helpers are plugins.  They just live in a different address space
and use a channel (a pipe) to exchange data.


If we did .so plugins, which I'm really not opposed to, I'd want the 
interface to be something like:


typedef struct MigrationTransportClass
{
   ssize_t (*writev)(MigrationTransport *obj,
 struct iovec *iov,
 int iovcnt);
} MigrationTransportClass;

I think it's useful to use an interface like this because it makes it 
easy to put the transport in a dedicated thread that doesn't hold 
qemu_mutex (which is sort of equivalent to using a fork'd helper but is 
zero-copy at the expense of less isolation).
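
(A minimal sketch of how such a transport could be driven from a dedicated
thread; MigrationTransport, the queue, and the failure hook below are
assumptions, not an existing QEMU interface:)

#include <sys/uio.h>   /* struct iovec */

/* Drain buffers through the plugin's writev() from a thread that never
 * takes qemu_mutex. All names below are illustrative assumptions. */
typedef struct MigrationTransport {
    const MigrationTransportClass *klass;
} MigrationTransport;

static void *migration_writer_thread(void *opaque)
{
    MigrationTransport *t = opaque;
    struct iovec iov[16];
    int n;

    while ((n = migration_queue_pop(iov, 16)) > 0) {  /* blocks; no lock */
        if (t->klass->writev(t, iov, n) < 0) {
            migration_fail();                         /* assumed hook */
            break;
        }
    }
    return NULL;
}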


Regards,

Anthony Liguori








[Qemu-devel] [PATCH v4 17/39] es1370: convert to memory API

2011-08-08 Thread Avi Kivity
Reviewed-by: Richard Henderson 
Reviewed-by: Anthony Liguori 
Signed-off-by: Avi Kivity 
---
 hw/es1370.c |   43 +--
 1 files changed, 25 insertions(+), 18 deletions(-)

diff --git a/hw/es1370.c b/hw/es1370.c
index 1ed62b7..4e43c4a 100644
--- a/hw/es1370.c
+++ b/hw/es1370.c
@@ -268,6 +268,7 @@ struct chan {
 typedef struct ES1370State {
 PCIDevice dev;
 QEMUSoundCard card;
+MemoryRegion io;
 struct chan chan[NB_CHANNELS];
 SWVoiceOut *dac_voice[2];
 SWVoiceIn *adc_voice;
@@ -775,7 +776,6 @@ IO_READ_PROTO (es1370_readl)
 return val;
 }
 
-
 static void es1370_transfer_audio (ES1370State *s, struct chan *d, int 
loop_sel,
int max, int *irq)
 {
@@ -906,23 +906,20 @@ static void es1370_adc_callback (void *opaque, int avail)
 es1370_run_channel (s, ADC_CHANNEL, avail);
 }
 
-static void es1370_map (PCIDevice *pci_dev, int region_num,
-pcibus_t addr, pcibus_t size, int type)
-{
-ES1370State *s = DO_UPCAST (ES1370State, dev, pci_dev);
-
-(void) region_num;
-(void) size;
-(void) type;
-
-register_ioport_write (addr, 0x40 * 4, 1, es1370_writeb, s);
-register_ioport_write (addr, 0x40 * 2, 2, es1370_writew, s);
-register_ioport_write (addr, 0x40, 4, es1370_writel, s);
+static const MemoryRegionPortio es1370_portio[] = {
+{ 0, 0x40 * 4, 1, .write = es1370_writeb, },
+{ 0, 0x40 * 2, 2, .write = es1370_writew, },
+{ 0, 0x40, 4, .write = es1370_writel, },
+{ 0, 0x40 * 4, 1, .read = es1370_readb, },
+{ 0, 0x40 * 2, 2, .read = es1370_readw, },
+{ 0, 0x40, 4, .read = es1370_readl, },
+PORTIO_END_OF_LIST()
+};
 
-register_ioport_read (addr, 0x40 * 4, 1, es1370_readb, s);
-register_ioport_read (addr, 0x40 * 2, 2, es1370_readw, s);
-register_ioport_read (addr, 0x40, 4, es1370_readl, s);
-}
+static const MemoryRegionOps es1370_io_ops = {
+.old_portio = es1370_portio,
+.endianness = DEVICE_LITTLE_ENDIAN,
+};
 
 static const VMStateDescription vmstate_es1370_channel = {
 .name = "es1370_channel",
@@ -1011,7 +1008,8 @@ static int es1370_initfn (PCIDevice *dev)
 c[PCI_MIN_GNT] = 0x0c;
 c[PCI_MAX_LAT] = 0x80;
 
-pci_register_bar (&s->dev, 0, 256, PCI_BASE_ADDRESS_SPACE_IO, es1370_map);
+memory_region_init_io (&s->io, &es1370_io_ops, s, "es1370", 256);
+pci_register_bar_region (&s->dev, 0, PCI_BASE_ADDRESS_SPACE_IO, &s->io);
 qemu_register_reset (es1370_on_reset, s);
 
 AUD_register_card ("es1370", &s->card);
@@ -1019,6 +1017,14 @@ static int es1370_initfn (PCIDevice *dev)
 return 0;
 }
 
+static int es1370_exitfn(PCIDevice *dev)
+{
+ES1370State *s = DO_UPCAST (ES1370State, dev, dev);
+
+memory_region_destroy (&s->io);
+return 0;
+}
+
 int es1370_init (PCIBus *bus)
 {
 pci_create_simple (bus, -1, "ES1370");
@@ -1031,6 +1037,7 @@ static PCIDeviceInfo es1370_info = {
 .qdev.size= sizeof (ES1370State),
 .qdev.vmsd= &vmstate_es1370,
 .init = es1370_initfn,
+.exit = es1370_exitfn,
 .vendor_id= PCI_VENDOR_ID_ENSONIQ,
 .device_id= PCI_DEVICE_ID_ENSONIQ_ES1370,
 .class_id = PCI_CLASS_MULTIMEDIA_AUDIO,
-- 
1.7.5.3




[Qemu-devel] [PATCH v4 35/39] pci: convert pci rom to memory API

2011-08-08 Thread Avi Kivity
Reviewed-by: Richard Henderson 
Reviewed-by: Anthony Liguori 
Signed-off-by: Avi Kivity 
---
 hw/pci.c |   20 +++-
 hw/pci.h |3 ++-
 2 files changed, 9 insertions(+), 14 deletions(-)

diff --git a/hw/pci.c b/hw/pci.c
index 7a70037..f885d4e 100644
--- a/hw/pci.c
+++ b/hw/pci.c
@@ -1855,11 +1855,6 @@ static uint8_t pci_find_capability_list(PCIDevice *pdev, 
uint8_t cap_id,
 return next;
 }
 
-static void pci_map_option_rom(PCIDevice *pdev, int region_num, pcibus_t addr, 
pcibus_t size, int type)
-{
-cpu_register_physical_memory(addr, size, pdev->rom_offset);
-}
-
 /* Patch the PCI vendor and device ids in a PCI rom image if necessary.
This is needed for an option rom which is used for more than one device. */
 static void pci_patch_ids(PCIDevice *pdev, uint8_t *ptr, int size)
@@ -1963,9 +1958,9 @@ static int pci_add_option_rom(PCIDevice *pdev, bool 
is_default_rom)
 snprintf(name, sizeof(name), "%s.rom", pdev->qdev.info->vmsd->name);
 else
 snprintf(name, sizeof(name), "%s.rom", pdev->qdev.info->name);
-pdev->rom_offset = qemu_ram_alloc(&pdev->qdev, name, size);
-
-ptr = qemu_get_ram_ptr(pdev->rom_offset);
+pdev->has_rom = true;
+memory_region_init_ram(&pdev->rom, &pdev->qdev, name, size);
+ptr = memory_region_get_ram_ptr(&pdev->rom);
 load_image(path, ptr);
 qemu_free(path);
 
@@ -1976,19 +1971,18 @@ static int pci_add_option_rom(PCIDevice *pdev, bool 
is_default_rom)
 
 qemu_put_ram_ptr(ptr);
 
-pci_register_bar(pdev, PCI_ROM_SLOT, size,
- 0, pci_map_option_rom);
+pci_register_bar_region(pdev, PCI_ROM_SLOT, 0, &pdev->rom);
 
 return 0;
 }
 
 static void pci_del_option_rom(PCIDevice *pdev)
 {
-if (!pdev->rom_offset)
+if (!pdev->has_rom)
 return;
 
-qemu_ram_free(pdev->rom_offset);
-pdev->rom_offset = 0;
+memory_region_destroy(&pdev->rom);
+pdev->has_rom = false;
 }
 
 /*
diff --git a/hw/pci.h b/hw/pci.h
index 25e28b1..6e2bcea 100644
--- a/hw/pci.h
+++ b/hw/pci.h
@@ -191,7 +191,8 @@ struct PCIDevice {
 
 /* Location of option rom */
 char *romfile;
-ram_addr_t rom_offset;
+bool has_rom;
+MemoryRegion rom;
 uint32_t rom_bar;
 };
 
-- 
1.7.5.3




[Qemu-devel] [PATCH v4 04/39] vga: convert vga and its derivatives to the memory API

2011-08-08 Thread Avi Kivity
Convert all vga memory to the memory API.  Note we need to fall back to
get_system_memory(), since the various buses don't pass the vga window
as a memory region.

We no longer need to sync the dirty bitmap of the cirrus mapped memory
banks, since the memory API takes care of that for us.

[jan: fix vga-pci logging]

Reviewed-by: Richard Henderson 
Signed-off-by: Avi Kivity 
---
 hw/cirrus_vga.c |  342 ---
 hw/qxl-render.c |2 +-
 hw/qxl.c|  135 --
 hw/qxl.h|6 +-
 hw/vga-isa-mm.c |   46 +---
 hw/vga-isa.c|   10 +-
 hw/vga-pci.c|   28 +
 hw/vga.c|  146 +++-
 hw/vga_int.h|   14 +--
 hw/vmware_vga.c |  143 ---
 10 files changed, 437 insertions(+), 435 deletions(-)

diff --git a/hw/cirrus_vga.c b/hw/cirrus_vga.c
index f39d1f8..ad23c4a 100644
--- a/hw/cirrus_vga.c
+++ b/hw/cirrus_vga.c
@@ -32,6 +32,7 @@
 #include "console.h"
 #include "vga_int.h"
 #include "loader.h"
+#include "exec-memory.h"
 
 /*
  * TODO:
@@ -200,9 +201,14 @@ typedef void (*cirrus_fill_t)(struct CirrusVGAState *s,
 typedef struct CirrusVGAState {
 VGACommonState vga;
 
-int cirrus_linear_io_addr;
-int cirrus_linear_bitblt_io_addr;
-int cirrus_mmio_io_addr;
+MemoryRegion cirrus_linear_io;
+MemoryRegion cirrus_linear_bitblt_io;
+MemoryRegion cirrus_mmio_io;
+MemoryRegion pci_bar;
+bool linear_vram;  /* vga.vram mapped over cirrus_linear_io */
+MemoryRegion low_mem_container; /* container for 0xa0000-0xc0000 */
+MemoryRegion low_mem;   /* always mapped, overridden by: */
+MemoryRegion *cirrus_bank[2];   /*   aliases at 0xa0000-0xb0000  */
 uint32_t cirrus_addr_mask;
 uint32_t linear_mmio_mask;
 uint8_t cirrus_shadow_gr0;
@@ -612,7 +618,7 @@ static void cirrus_invalidate_region(CirrusVGAState * s, 
int off_begin,
off_cur_end = (off_cur + bytesperline) & s->cirrus_addr_mask;
off_cur &= TARGET_PAGE_MASK;
while (off_cur < off_cur_end) {
-   cpu_physical_memory_set_dirty(s->vga.vram_offset + off_cur);
+   memory_region_set_dirty(&s->vga.vram, off_cur);
off_cur += TARGET_PAGE_SIZE;
}
off_begin += off_pitch;
@@ -1177,12 +1183,6 @@ static void cirrus_update_bank_ptr(CirrusVGAState * s, 
unsigned bank_index)
 }
 
 if (limit > 0) {
-/* Thinking about changing bank base? First, drop the dirty bitmap information
- * on the current location, otherwise we lose this pointer forever */
-if (s->vga.lfb_vram_mapped) {
-target_phys_addr_t base_addr = isa_mem_base + 0xa0000 + bank_index * 0x8000;
-cpu_physical_sync_dirty_bitmap(base_addr, base_addr + 0x8000);
-}
s->cirrus_bank_base[bank_index] = offset;
s->cirrus_bank_limit[bank_index] = limit;
 } else {
@@ -1921,8 +1921,8 @@ static void 
cirrus_mem_writeb_mode4and5_8bpp(CirrusVGAState * s,
val <<= 1;
dst++;
 }
-cpu_physical_memory_set_dirty(s->vga.vram_offset + offset);
-cpu_physical_memory_set_dirty(s->vga.vram_offset + offset + 7);
+memory_region_set_dirty(&s->vga.vram, offset);
+memory_region_set_dirty(&s->vga.vram, offset + 7);
 }
 
 static void cirrus_mem_writeb_mode4and5_16bpp(CirrusVGAState * s,
@@ -1946,8 +1946,8 @@ static void 
cirrus_mem_writeb_mode4and5_16bpp(CirrusVGAState * s,
val <<= 1;
dst += 2;
 }
-cpu_physical_memory_set_dirty(s->vga.vram_offset + offset);
-cpu_physical_memory_set_dirty(s->vga.vram_offset + offset + 15);
+memory_region_set_dirty(&s->vga.vram, offset);
+memory_region_set_dirty(&s->vga.vram, offset + 15);
 }
 
 /***
@@ -2057,8 +2057,7 @@ static void cirrus_vga_mem_writeb(void *opaque, 
target_phys_addr_t addr,
mode = s->vga.gr[0x05] & 0x7;
if (mode < 4 || mode > 5 || ((s->vga.gr[0x0B] & 0x4) == 0)) {
*(s->vga.vram_ptr + bank_offset) = mem_value;
-   cpu_physical_memory_set_dirty(s->vga.vram_offset +
- bank_offset);
+   memory_region_set_dirty(&s->vga.vram, bank_offset);
} else {
if ((s->vga.gr[0x0B] & 0x14) != 0x14) {
cirrus_mem_writeb_mode4and5_8bpp(s, mode,
@@ -2099,16 +2098,37 @@ static void cirrus_vga_mem_writel(void *opaque, 
target_phys_addr_t addr, uint32_
 cirrus_vga_mem_writeb(opaque, addr + 3, (val >> 24) & 0xff);
 }
 
-static CPUReadMemoryFunc * const cirrus_vga_mem_read[3] = {
-cirrus_vga_mem_readb,
-cirrus_vga_mem_readw,
-cirrus_vga_mem_readl,
+static uint64_t cirrus_vga_mem_read(void *opaque,
+target_phys_addr_t addr,
+uint32_t size)
+{
+CirrusVGAState *s = opaque;
+
+switch (size) {
+case 1: retur

[Qemu-devel] [PATCH v4 24/39] ppc: convert to memory API

2011-08-08 Thread Avi Kivity
Reviewed-by: Richard Henderson 
Reviewed-by: Anthony Liguori 
Signed-off-by: Avi Kivity 
---
 hw/cuda.c |6 ++-
 hw/escc.c |   42 +--
 hw/escc.h |2 +-
 hw/heathrow_pic.c |   29 --
 hw/ide.h  |2 +-
 hw/ide/macio.c|   36 ---
 hw/mac_dbdma.c|   32 ++--
 hw/mac_dbdma.h|4 ++-
 hw/mac_nvram.c|   39 ++---
 hw/macio.c|   74 +++-
 hw/openpic.c  |   81 +
 hw/openpic.h  |2 +-
 hw/ppc_mac.h  |   16 ++
 hw/ppc_newworld.c |   30 +--
 hw/ppc_oldworld.c |   23 +++
 15 files changed, 201 insertions(+), 217 deletions(-)

diff --git a/hw/cuda.c b/hw/cuda.c
index 065c362..5c92d81 100644
--- a/hw/cuda.c
+++ b/hw/cuda.c
@@ -117,6 +117,7 @@ typedef struct CUDATimer {
 } CUDATimer;
 
 typedef struct CUDAState {
+MemoryRegion mem;
 /* cuda registers */
 uint8_t b;  /* B-side data */
 uint8_t a;  /* A-side data */
@@ -722,7 +723,7 @@ static void cuda_reset(void *opaque)
 set_counter(s, &s->timers[1], 0xffff);
 }
 
-void cuda_init (int *cuda_mem_index, qemu_irq irq)
+void cuda_init (MemoryRegion **cuda_mem, qemu_irq irq)
 {
 struct tm tm;
 CUDAState *s = &cuda_state;
@@ -738,8 +739,9 @@ void cuda_init (int *cuda_mem_index, qemu_irq irq)
 s->tick_offset = (uint32_t)mktimegm(&tm) + RTC_OFFSET;
 
 s->adb_poll_timer = qemu_new_timer_ns(vm_clock, cuda_adb_poll, s);
-*cuda_mem_index = cpu_register_io_memory(cuda_read, cuda_write, s,
+cpu_register_io_memory(cuda_read, cuda_write, s,
  DEVICE_NATIVE_ENDIAN);
+*cuda_mem = &s->mem;
 vmstate_register(NULL, -1, &vmstate_cuda, s);
 qemu_register_reset(cuda_reset, s);
 }
diff --git a/hw/escc.c b/hw/escc.c
index f6fd919..bea5873 100644
--- a/hw/escc.c
+++ b/hw/escc.c
@@ -126,7 +126,7 @@ struct SerialState {
 SysBusDevice busdev;
 struct ChannelState chn[2];
 uint32_t it_shift;
-int mmio_index;
+MemoryRegion mmio;
 uint32_t disabled;
 uint32_t frequency;
 };
@@ -490,7 +490,8 @@ static void escc_update_parameters(ChannelState *s)
 qemu_chr_ioctl(s->chr, CHR_IOCTL_SERIAL_SET_PARAMS, &ssp);
 }
 
-static void escc_mem_writeb(void *opaque, target_phys_addr_t addr, uint32_t 
val)
+static void escc_mem_write(void *opaque, target_phys_addr_t addr,
+   uint64_t val, unsigned size)
 {
 SerialState *serial = opaque;
 ChannelState *s;
@@ -592,7 +593,8 @@ static void escc_mem_writeb(void *opaque, 
target_phys_addr_t addr, uint32_t val)
 }
 }
 
-static uint32_t escc_mem_readb(void *opaque, target_phys_addr_t addr)
+static uint64_t escc_mem_read(void *opaque, target_phys_addr_t addr,
+  unsigned size)
 {
 SerialState *serial = opaque;
 ChannelState *s;
@@ -627,6 +629,16 @@ static uint32_t escc_mem_readb(void *opaque, 
target_phys_addr_t addr)
 return 0;
 }
 
+static const MemoryRegionOps escc_mem_ops = {
+.read = escc_mem_read,
+.write = escc_mem_write,
+.endianness = DEVICE_NATIVE_ENDIAN,
+.valid = {
+.min_access_size = 1,
+.max_access_size = 1,
+},
+};
+
 static int serial_can_receive(void *opaque)
 {
 ChannelState *s = opaque;
@@ -668,18 +680,6 @@ static void serial_event(void *opaque, int event)
 serial_receive_break(s);
 }
 
-static CPUReadMemoryFunc * const escc_mem_read[3] = {
-escc_mem_readb,
-NULL,
-NULL,
-};
-
-static CPUWriteMemoryFunc * const escc_mem_write[3] = {
-escc_mem_writeb,
-NULL,
-NULL,
-};
-
 static const VMStateDescription vmstate_escc_chn = {
 .name ="escc_chn",
 .version_id = 2,
@@ -712,7 +712,7 @@ static const VMStateDescription vmstate_escc = {
 }
 };
 
-int escc_init(target_phys_addr_t base, qemu_irq irqA, qemu_irq irqB,
+MemoryRegion *escc_init(target_phys_addr_t base, qemu_irq irqA, qemu_irq irqB,
   CharDriverState *chrA, CharDriverState *chrB,
   int clock, int it_shift)
 {
@@ -737,7 +737,7 @@ int escc_init(target_phys_addr_t base, qemu_irq irqA, 
qemu_irq irqB,
 }
 
 d = FROM_SYSBUS(SerialState, s);
-return d->mmio_index;
+return &d->mmio;
 }
 
 static const uint8_t keycodes[128] = {
@@ -901,7 +901,6 @@ void slavio_serial_ms_kbd_init(target_phys_addr_t base, 
qemu_irq irq,
 static int escc_init1(SysBusDevice *dev)
 {
 SerialState *s = FROM_SYSBUS(SerialState, dev);
-int io;
 unsigned int i;
 
 s->chn[0].disabled = s->disabled;
@@ -918,10 +917,9 @@ static int escc_init1(SysBusDevice *dev)
 s->chn[0].otherchn = &s->chn[1];
 s->chn[1].otherchn = &s->chn[0];
 
-io = cpu_register_io_memory(escc_mem_read, escc_mem_write, s,
-DEVICE_NATIVE_ENDIAN);
-sysbus_init_mmio(dev, ESCC_SIZE << s->it_shift, io);

[Qemu-devel] [PATCH v4 36/39] pci: remove pci_register_bar()

2011-08-08 Thread Avi Kivity
Superseded by pci_register_bar_region().  The implementations
are folded together.

Reviewed-by: Richard Henderson 
Reviewed-by: Anthony Liguori 
Signed-off-by: Avi Kivity 
---
 hw/pci.c |   42 +-
 hw/pci.h |3 ---
 2 files changed, 17 insertions(+), 28 deletions(-)

diff --git a/hw/pci.c b/hw/pci.c
index f885d4e..62b34d4 100644
--- a/hw/pci.c
+++ b/hw/pci.c
@@ -881,13 +881,25 @@ static int pci_unregister_device(DeviceState *dev)
 return 0;
 }
 
-void pci_register_bar(PCIDevice *pci_dev, int region_num,
-pcibus_t size, uint8_t type,
-PCIMapIORegionFunc *map_func)
+static void pci_simple_bar_mapfunc_region(PCIDevice *pci_dev, int region_num,
+  pcibus_t addr, pcibus_t size,
+  int type)
+{
+PCIIORegion *r = &pci_dev->io_regions[region_num];
+
+memory_region_add_subregion_overlap(r->address_space,
+addr,
+r->memory,
+1);
+}
+
+void pci_register_bar_region(PCIDevice *pci_dev, int region_num,
+ uint8_t type, MemoryRegion *memory)
 {
 PCIIORegion *r;
 uint32_t addr;
 uint64_t wmask;
+pcibus_t size = memory_region_size(memory);
 
 assert(region_num >= 0);
 assert(region_num < PCI_NUM_REGIONS);
@@ -902,7 +914,7 @@ void pci_register_bar(PCIDevice *pci_dev, int region_num,
 r->size = size;
 r->filtered_size = size;
 r->type = type;
-r->map_func = map_func;
+r->map_func = pci_simple_bar_mapfunc_region;
 r->memory = NULL;
 
 wmask = ~(size - 1);
@@ -920,29 +932,9 @@ void pci_register_bar(PCIDevice *pci_dev, int region_num,
 pci_set_long(pci_dev->wmask + addr, wmask & 0xffffffff);
 pci_set_long(pci_dev->cmask + addr, 0xffffffff);
 }
-}
-
-static void pci_simple_bar_mapfunc_region(PCIDevice *pci_dev, int region_num,
-  pcibus_t addr, pcibus_t size,
-  int type)
-{
-PCIIORegion *r = &pci_dev->io_regions[region_num];
-
-memory_region_add_subregion_overlap(r->address_space,
-addr,
-r->memory,
-1);
-}
-
-void pci_register_bar_region(PCIDevice *pci_dev, int region_num,
- uint8_t attr, MemoryRegion *memory)
-{
-pci_register_bar(pci_dev, region_num, memory_region_size(memory),
- attr,
- pci_simple_bar_mapfunc_region);
 pci_dev->io_regions[region_num].memory = memory;
 pci_dev->io_regions[region_num].address_space
-= attr & PCI_BASE_ADDRESS_SPACE_IO
+= type & PCI_BASE_ADDRESS_SPACE_IO
 ? pci_dev->bus->address_space_io
 : pci_dev->bus->address_space_mem;
 }
diff --git a/hw/pci.h b/hw/pci.h
index 6e2bcea..8028176 100644
--- a/hw/pci.h
+++ b/hw/pci.h
@@ -201,9 +201,6 @@ PCIDevice *pci_register_device(PCIBus *bus, const char 
*name,
PCIConfigReadFunc *config_read,
PCIConfigWriteFunc *config_write);
 
-void pci_register_bar(PCIDevice *pci_dev, int region_num,
-pcibus_t size, uint8_t type,
-PCIMapIORegionFunc *map_func);
 void pci_register_bar_region(PCIDevice *pci_dev, int region_num,
  uint8_t attr, MemoryRegion *memory);
 pcibus_t pci_get_bar_addr(PCIDevice *pci_dev, int region_num);
-- 
1.7.5.3




Re: [Qemu-devel] [PATCH v4] XBZRLE delta for live migration of large memory apps

2011-08-08 Thread Avi Kivity

On 08/08/2011 05:15 PM, Anthony Liguori wrote:




I think workload aware migration compression is possible for a lot of
different types of workloads. That makes me a bit wary of QEMU growing
quite a lot of compression mechanisms.

It makes me think that this logic may really belong at a higher level
where more information is known about the workload. For instance, I
can imagine XBZRLE living in something like libvirt.


A better model would be plugin based.



exec helpers are plugins.  They just live in a different address space
and use a channel (a pipe) to exchange data.


libvirt isn't an exec helper.



If we did .so plugins, which I'm really not opposed to, I'd want the 
interface to be something like:


typedef struct MigrationTransportClass
{
   ssize_t (*writev)(MigrationTransport *obj,
 struct iovec *iov,
 int iovcnt);
} MigrationTransportClass;

I think it's useful to use an interface like this because it makes it 
easy to put the transport in a dedicated thread that doesn't hold 
qemu_mutex (which is sort of equivalent to using a fork'd helper but 
is zero-copy at the expense of less isolation).


If we have a shared object helper, the thread should be maintained by 
qemu proper, not the plugin.


I wouldn't call it "migration transport", but instead a 
compression/decompression plugin.


I don't think it merits a plugin at all though.  There's limited scope 
for compression and it best sits in qemu proper.  If anything, it needs 
to be more integrated (for example turning itself off if it doesn't 
match enough).


--
error compiling committee.c: too many arguments to function




[Qemu-devel] [PATCH v4 00/39] Memory API, batch 2: PCI devices

2011-08-08 Thread Avi Kivity
This is a mostly mindless conversion of all QEMU PCI devices to the memory API.
After this patchset is applied, it is no longer possible to create a PCI device
using the old API.

An immediate benefit is that PCI BARs that overlap each other are now handled
correctly: currently, the sequence

  map BAR 0
  map BAR 1 at an overlapping address
  unmap either BAR 0 or BAR 1

will leave a hole where the overlap exists.  With the patchset, the memory map
is restored correctly.
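
(For illustration, that sequence maps onto memory API calls roughly like this
sketch; the address and region variables are made up:)

/* Sketch: overlapping BARs; deleting one subregion re-exposes the other
 * instead of leaving a hole. Address and region variables are made up. */
MemoryRegion *sysmem = get_system_memory();

memory_region_add_subregion_overlap(sysmem, 0xe0000000, bar0_region, 1);
memory_region_add_subregion_overlap(sysmem, 0xe0000000, bar1_region, 1);
/* guest disables or moves BAR 1: */
memory_region_del_subregion(sysmem, bar1_region);   /* BAR 0 visible again */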

Note that overlaps of PCI BARs with memory or non-PCI resources are still not
resolved correctly; this will be fixed later on.

The vga patches have ugly intermediate states; however the result is fairly 
clean.

Changes from v3:
 - dropped virtio-pci config patch; will be fixed outside this patchset if
   necessary
 - minor style fixes

Changes from v2:
 - added patch from Michael simplifying virtio-pci config setup

Changes from v1:
 - cmd646 type fix
 - folded a fixlet into its parent

Avi Kivity (39):
  memory: rename PORTIO_END to PORTIO_END_OF_LIST
  pci: add API to get a BAR's mapped address
  vmsvga: don't remember pci BAR address in callback any more
  vga: convert vga and its derivatives to the memory API
  cirrus: simplify mmio BAR access functions
  cirrus: simplify bitblt BAR access functions
  cirrus: simplify vga window mmio access functions
  vga: simplify vga window mmio access functions
  cirrus: simplify linear framebuffer access functions
  Integrate I/O memory regions into qemu
  pci: pass I/O address space to new PCI bus
  pci: allow I/O BARs to be registered with pci_register_bar_region()
  rtl8139: convert to memory API
  ac97: convert to memory API
  e1000: convert to memory API
  eepro100: convert to memory API
  es1370: convert to memory API
  ide: convert to memory API
  ivshmem: convert to memory API
  virtio-pci: convert to memory API
  ahci: convert to memory API
  intel-hda: convert to memory API
  lsi53c895a: convert to memory API
  ppc: convert to memory API
  ne2000: convert to memory API
  pcnet: convert to memory API
  i6300esb: convert to memory API
  isa-mmio: convert to memory API
  sun4u: convert to memory API
  ehci: convert to memory API
  uhci: convert to memory API
  xen-platform: convert to memory API
  msix: convert to memory API
  pci: remove pci_register_bar_simple()
  pci: convert pci rom to memory API
  pci: remove pci_register_bar()
  pci: fold BAR mapping function into its caller
  pci: rename pci_register_bar_region() to pci_register_bar()
  pci: remove support for pre memory API BARs

 exec-memory.h  |5 +
 exec.c |   10 ++
 hw/ac97.c  |   88 ++-
 hw/apb_pci.c   |1 +
 hw/bonito.c|1 +
 hw/cirrus_vga.c|  459 ---
 hw/cuda.c  |6 +-
 hw/e1000.c |  113 ++
 hw/eepro100.c  |  181 -
 hw/es1370.c|   43 +++--
 hw/escc.c  |   42 +++---
 hw/escc.h  |2 +-
 hw/grackle_pci.c   |8 +-
 hw/gt64xxx.c   |4 +-
 hw/heathrow_pic.c  |   29 ++--
 hw/ide.h   |2 +-
 hw/ide/ahci.c  |   31 ++--
 hw/ide/ahci.h  |2 +-
 hw/ide/cmd646.c|  204 +++-
 hw/ide/ich.c   |3 +-
 hw/ide/macio.c |   36 +++--
 hw/ide/pci.c   |   25 ++--
 hw/ide/pci.h   |   19 ++-
 hw/ide/piix.c  |   63 ++--
 hw/ide/via.c   |   64 ++--
 hw/intel-hda.c |   35 +++--
 hw/isa.h   |2 +
 hw/isa_mmio.c  |   29 ++--
 hw/ivshmem.c   |  158 +++
 hw/lance.c |   31 ++--
 hw/lsi53c895a.c|  257 +++---
 hw/mac_dbdma.c |   32 ++--
 hw/mac_dbdma.h |4 +-
 hw/mac_nvram.c |   39 ++---
 hw/macio.c |   73 -
 hw/msix.c  |   64 +++-
 hw/msix.h  |6 +-
 hw/ne2000-isa.c|   13 +--
 hw/ne2000.c|   77 ++---
 hw/ne2000.h|8 +-
 hw/openpic.c   |   81 +-
 hw/openpic.h   |2 +-
 hw/pc.h|4 +-
 hw/pc_piix.c   |6 +-
 hw/pci.c   |  133 +---
 hw/pci.h   |   26 ++--
 hw/pci_internals.h |3 +-
 hw/pcnet-pci.c |   74 +
 hw/pcnet.h |4 +-
 hw/piix_pci.c  |   14 +-
 hw/ppc4xx_pci.c|1 +
 hw/ppc_mac.h   |   27 ++--
 hw/ppc_newworld.c  |   34 ++--
 hw/ppc_oldworld.c  |   27 ++--
 hw/ppc_prep.c  |2 +-
 hw/ppce500_pci.c   |7 +-
 hw/prep_pci.c  |8 +-
 hw/prep_pci.h  |4 +-
 hw/qxl-render.c|2 +-
 hw/qxl.c   |  129 ++--
 hw/qxl.h   |6 +-
 hw/rtl8139.c   |   70 
 hw/sh_pci.c|4 +-
 hw/sun4u.c |   53 +++
 hw/unin_pci.c  |   16 ++-
 hw/usb-ehci.c  |   36 +---
 hw/usb-ohci.c  |2 +-
 hw/usb-uhci.c  |   41 +++--
 hw/versatile_pci.c |2 +-
 hw/vga-isa-mm.c|   46 --
 hw/vga-isa.c   |   10 +-
 hw/vga-pci.c   |   27 +---

[Qemu-devel] [PATCH v4 22/39] intel-hda: convert to memory API

2011-08-08 Thread Avi Kivity
Reviewed-by: Richard Henderson 
Reviewed-by: Anthony Liguori 
Signed-off-by: Avi Kivity 
---
 hw/intel-hda.c |   35 +++
 1 files changed, 19 insertions(+), 16 deletions(-)

diff --git a/hw/intel-hda.c b/hw/intel-hda.c
index 5a2bc3a..1e4c71e 100644
--- a/hw/intel-hda.c
+++ b/hw/intel-hda.c
@@ -177,7 +177,7 @@ struct IntelHDAState {
 IntelHDAStream st[8];
 
 /* state */
-int mmio_addr;
+MemoryRegion mmio;
 uint32_t rirb_count;
 int64_t wall_base_ns;
 
@@ -1084,16 +1084,20 @@ static uint32_t intel_hda_mmio_readl(void *opaque, 
target_phys_addr_t addr)
 return intel_hda_reg_read(d, reg, 0xffffffff);
 }
 
-static CPUReadMemoryFunc * const intel_hda_mmio_read[3] = {
-intel_hda_mmio_readb,
-intel_hda_mmio_readw,
-intel_hda_mmio_readl,
-};
-
-static CPUWriteMemoryFunc * const intel_hda_mmio_write[3] = {
-intel_hda_mmio_writeb,
-intel_hda_mmio_writew,
-intel_hda_mmio_writel,
+static const MemoryRegionOps intel_hda_mmio_ops = {
+.old_mmio = {
+.read = {
+intel_hda_mmio_readb,
+intel_hda_mmio_readw,
+intel_hda_mmio_readl,
+},
+.write = {
+intel_hda_mmio_writeb,
+intel_hda_mmio_writew,
+intel_hda_mmio_writel,
+},
+},
+.endianness = DEVICE_NATIVE_ENDIAN,
 };
 
 /* - */
@@ -1130,10 +1134,9 @@ static int intel_hda_init(PCIDevice *pci)
 /* HDCTL off 0x40 bit 0 selects signaling mode (1-HDA, 0 - Ac97) 18.1.19 */
 conf[0x40] = 0x01;
 
-d->mmio_addr = cpu_register_io_memory(intel_hda_mmio_read,
-  intel_hda_mmio_write, d,
-  DEVICE_NATIVE_ENDIAN);
-pci_register_bar_simple(&d->pci, 0, 0x4000, 0, d->mmio_addr);
+memory_region_init_io(&d->mmio, &intel_hda_mmio_ops, d,
+  "intel-hda", 0x4000);
+pci_register_bar_region(&d->pci, 0, 0, &d->mmio);
 if (d->msi) {
 msi_init(&d->pci, 0x50, 1, true, false);
 }
@@ -1149,7 +1152,7 @@ static int intel_hda_exit(PCIDevice *pci)
 IntelHDAState *d = DO_UPCAST(IntelHDAState, pci, pci);
 
 msi_uninit(&d->pci);
-cpu_unregister_io_memory(d->mmio_addr);
+memory_region_destroy(&d->mmio);
 return 0;
 }
 
-- 
1.7.5.3




[Qemu-devel] [PATCH v4 08/39] vga: simplify vga window mmio access functions

2011-08-08 Thread Avi Kivity
Make use of the memory API's ability to satisfy multi-byte accesses via
multiple single-byte accesses.

We have to keep vga_mem_{read,write}b() since they're used by cirrus.
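
(Conceptually, with .impl.max_access_size = 1 the memory core now does what
the removed helpers did by hand; roughly like this sketch, which is not the
core's actual code:)

/* Rough sketch of a 4-byte little-endian read built from 1-byte impl
 * accesses; illustrative only, not the memory core's actual code. */
static uint64_t split_read4(const MemoryRegionOps *ops, void *opaque,
                            target_phys_addr_t addr)
{
    uint64_t v = 0;
    unsigned i;

    for (i = 0; i < 4; i++) {
        v |= ops->read(opaque, addr + i, 1) << (8 * i);
    }
    return v;
}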

Reviewed-by: Richard Henderson 
Reviewed-by: Anthony Liguori 
Signed-off-by: Avi Kivity 
---
 hw/cirrus_vga.c |4 +-
 hw/vga.c|   56 +++---
 hw/vga_int.h|4 +-
 3 files changed, 12 insertions(+), 52 deletions(-)

diff --git a/hw/cirrus_vga.c b/hw/cirrus_vga.c
index 5ded1ff..2a9bd25 100644
--- a/hw/cirrus_vga.c
+++ b/hw/cirrus_vga.c
@@ -1966,7 +1966,7 @@ static uint64_t cirrus_vga_mem_read(void *opaque,
 uint32_t val;
 
 if ((s->vga.sr[0x07] & 0x01) == 0) {
-   return vga_mem_readb(s, addr);
+return vga_mem_readb(&s->vga, addr);
 }
 
 if (addr < 0x1) {
@@ -2011,7 +2011,7 @@ static void cirrus_vga_mem_write(void *opaque,
 unsigned mode;
 
 if ((s->vga.sr[0x07] & 0x01) == 0) {
-   vga_mem_writeb(s, addr, mem_value);
+vga_mem_writeb(&s->vga, addr, mem_value);
 return;
 }
 
diff --git a/hw/vga.c b/hw/vga.c
index 8b6e6b6..33dc478 100644
--- a/hw/vga.c
+++ b/hw/vga.c
@@ -708,9 +708,8 @@ static void vbe_ioport_write_data(void *opaque, uint32_t 
addr, uint32_t val)
 #endif
 
 /* called for accesses between 0xa0000 and 0xc0000 */
-uint32_t vga_mem_readb(void *opaque, target_phys_addr_t addr)
+uint32_t vga_mem_readb(VGACommonState *s, target_phys_addr_t addr)
 {
-VGACommonState *s = opaque;
 int memory_map_mode, plane;
 uint32_t ret;
 
@@ -764,28 +763,9 @@ uint32_t vga_mem_readb(void *opaque, target_phys_addr_t 
addr)
 return ret;
 }
 
-static uint32_t vga_mem_readw(void *opaque, target_phys_addr_t addr)
-{
-uint32_t v;
-v = vga_mem_readb(opaque, addr);
-v |= vga_mem_readb(opaque, addr + 1) << 8;
-return v;
-}
-
-static uint32_t vga_mem_readl(void *opaque, target_phys_addr_t addr)
-{
-uint32_t v;
-v = vga_mem_readb(opaque, addr);
-v |= vga_mem_readb(opaque, addr + 1) << 8;
-v |= vga_mem_readb(opaque, addr + 2) << 16;
-v |= vga_mem_readb(opaque, addr + 3) << 24;
-return v;
-}
-
 /* called for accesses between 0xa0000 and 0xc0000 */
-void vga_mem_writeb(void *opaque, target_phys_addr_t addr, uint32_t val)
+void vga_mem_writeb(VGACommonState *s, target_phys_addr_t addr, uint32_t val)
 {
-VGACommonState *s = opaque;
 int memory_map_mode, plane, write_mode, b, func_select, mask;
 uint32_t write_mask, bit_mask, set_mask;
 
@@ -917,20 +897,6 @@ void vga_mem_writeb(void *opaque, target_phys_addr_t addr, 
uint32_t val)
 }
 }
 
-static void vga_mem_writew(void *opaque, target_phys_addr_t addr, uint32_t val)
-{
-vga_mem_writeb(opaque, addr, val & 0xff);
-vga_mem_writeb(opaque, addr + 1, (val >> 8) & 0xff);
-}
-
-static void vga_mem_writel(void *opaque, target_phys_addr_t addr, uint32_t val)
-{
-vga_mem_writeb(opaque, addr, val & 0xff);
-vga_mem_writeb(opaque, addr + 1, (val >> 8) & 0xff);
-vga_mem_writeb(opaque, addr + 2, (val >> 16) & 0xff);
-vga_mem_writeb(opaque, addr + 3, (val >> 24) & 0xff);
-}
-
 typedef void vga_draw_glyph8_func(uint8_t *d, int linesize,
  const uint8_t *font_ptr, int h,
  uint32_t fgcol, uint32_t bgcol);
@@ -2105,12 +2071,7 @@ static uint64_t vga_mem_read(void *opaque, 
target_phys_addr_t addr,
 {
 VGACommonState *s = opaque;
 
-switch (size) {
-case 1: return vga_mem_readb(s, addr);
-case 2: return vga_mem_readw(s, addr);
-case 4: return vga_mem_readl(s, addr);
-default: abort();
-}
+return vga_mem_readb(s, addr);
 }
 
 static void vga_mem_write(void *opaque, target_phys_addr_t addr,
@@ -2118,18 +2079,17 @@ static void vga_mem_write(void *opaque, 
target_phys_addr_t addr,
 {
 VGACommonState *s = opaque;
 
-switch (size) {
-case 1: return vga_mem_writeb(s, addr, data);
-case 2: return vga_mem_writew(s, addr, data);
-case 4: return vga_mem_writel(s, addr, data);
-default: abort();
-}
+return vga_mem_writeb(s, addr, data);
 }
 
 const MemoryRegionOps vga_mem_ops = {
 .read = vga_mem_read,
 .write = vga_mem_write,
 .endianness = DEVICE_LITTLE_ENDIAN,
+.impl = {
+.min_access_size = 1,
+.max_access_size = 1,
+},
 };
 
 static int vga_common_post_load(void *opaque, int version_id)
diff --git a/hw/vga_int.h b/hw/vga_int.h
index 4592d2c..100d98c 100644
--- a/hw/vga_int.h
+++ b/hw/vga_int.h
@@ -198,8 +198,8 @@ void vga_dirty_log_restart(VGACommonState *s);
 extern const VMStateDescription vmstate_vga_common;
 uint32_t vga_ioport_read(void *opaque, uint32_t addr);
 void vga_ioport_write(void *opaque, uint32_t addr, uint32_t val);
-uint32_t vga_mem_readb(void *opaque, target_phys_addr_t addr);
-void vga_mem_writeb(void *opaque, target_phys_addr_t addr, uint32_t val);
+uint32_t vga_mem_readb(VGACommonState *s, target_phys_addr_t addr);
 void vga_mem_writeb(VGACommonState *s, target_phys_addr_t addr, uint32_t val);

[Qemu-devel] [PATCH v4 33/39] msix: convert to memory API

2011-08-08 Thread Avi Kivity
The msix table is defined as a subregion, to allow for a BAR that
mixes device specific regions with the msix table.
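
(For illustration, a device mixing its own registers with the msix table in
one BAR could set things up roughly like this; the offsets, sizes, and field
names are made up:)

/* Sketch: device registers plus the msix table inside one BAR.
 * Offsets, sizes, and field names are illustrative. */
memory_region_init(&s->bar, "mydev-bar", 0x4000);
memory_region_init_io(&s->regs, &mydev_reg_ops, s, "mydev-regs", 0x1000);
memory_region_add_subregion(&s->bar, 0, &s->regs);
if (!msix_init(&s->dev, s->vectors, &s->bar, 0, 0x4000)) {
    pci_register_bar_region(&s->dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY,
                            &s->bar);
}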

Reviewed-by: Richard Henderson 
Reviewed-by: Anthony Liguori 
Signed-off-by: Avi Kivity 
---
 hw/ivshmem.c|   11 +
 hw/msix.c   |   64 +++
 hw/msix.h   |6 +---
 hw/pci.h|2 +-
 hw/virtio-pci.c |   16 -
 hw/virtio-pci.h |1 +
 6 files changed, 42 insertions(+), 58 deletions(-)

diff --git a/hw/ivshmem.c b/hw/ivshmem.c
index f80e7b6..bacba60 100644
--- a/hw/ivshmem.c
+++ b/hw/ivshmem.c
@@ -65,6 +65,7 @@ typedef struct IVShmemState {
  */
 MemoryRegion bar;
 MemoryRegion ivshmem;
+MemoryRegion msix_bar;
 uint64_t ivshmem_size; /* size of shared memory region */
 int shm_fd; /* shared memory file descriptor */
 
@@ -540,11 +541,11 @@ static void ivshmem_setup_msi(IVShmemState * s) {
 
 /* allocate the MSI-X vectors */
 
-if (!msix_init(&s->dev, s->vectors, 1, 0)) {
-pci_register_bar(&s->dev, 1,
- msix_bar_size(&s->dev),
- PCI_BASE_ADDRESS_SPACE_MEMORY,
- msix_mmio_map);
+memory_region_init(&s->msix_bar, "ivshmem-msix", 4096);
+if (!msix_init(&s->dev, s->vectors, &s->msix_bar, 1, 0)) {
+pci_register_bar_region(&s->dev, 1,
+PCI_BASE_ADDRESS_SPACE_MEMORY,
+&s->msix_bar);
 IVSHMEM_DPRINTF("msix initialized (%d vectors)\n", s->vectors);
 } else {
 IVSHMEM_DPRINTF("msix initialization failed\n");
diff --git a/hw/msix.c b/hw/msix.c
index e67e700..8536c3f 100644
--- a/hw/msix.c
+++ b/hw/msix.c
@@ -82,7 +82,8 @@ static int msix_add_config(struct PCIDevice *pdev, unsigned 
short nentries,
 return 0;
 }
 
-static uint32_t msix_mmio_readl(void *opaque, target_phys_addr_t addr)
+static uint64_t msix_mmio_read(void *opaque, target_phys_addr_t addr,
+   unsigned size)
 {
 PCIDevice *dev = opaque;
 unsigned int offset = addr & (MSIX_PAGE_SIZE - 1) & ~0x3;
@@ -91,12 +92,6 @@ static uint32_t msix_mmio_readl(void *opaque, 
target_phys_addr_t addr)
 return pci_get_long(page + offset);
 }
 
-static uint32_t msix_mmio_read_unallowed(void *opaque, target_phys_addr_t addr)
-{
-fprintf(stderr, "MSI-X: only dword read is allowed!\n");
-return 0;
-}
-
 static uint8_t msix_pending_mask(int vector)
 {
 return 1 << (vector % 8);
@@ -169,8 +164,8 @@ void msix_write_config(PCIDevice *dev, uint32_t addr,
 }
 }
 
-static void msix_mmio_writel(void *opaque, target_phys_addr_t addr,
- uint32_t val)
+static void msix_mmio_write(void *opaque, target_phys_addr_t addr,
+uint64_t val, unsigned size)
 {
 PCIDevice *dev = opaque;
 unsigned int offset = addr & (MSIX_PAGE_SIZE - 1) & ~0x3;
@@ -179,37 +174,25 @@ static void msix_mmio_writel(void *opaque, 
target_phys_addr_t addr,
 msix_handle_mask_update(dev, vector);
 }
 
-static void msix_mmio_write_unallowed(void *opaque, target_phys_addr_t addr,
-  uint32_t val)
-{
-fprintf(stderr, "MSI-X: only dword write is allowed!\n");
-}
-
-static CPUWriteMemoryFunc * const msix_mmio_write[] = {
-msix_mmio_write_unallowed, msix_mmio_write_unallowed, msix_mmio_writel
-};
-
-static CPUReadMemoryFunc * const msix_mmio_read[] = {
-msix_mmio_read_unallowed, msix_mmio_read_unallowed, msix_mmio_readl
+static const MemoryRegionOps msix_mmio_ops = {
+.read = msix_mmio_read,
+.write = msix_mmio_write,
+.endianness = DEVICE_NATIVE_ENDIAN,
+.valid = {
+.min_access_size = 4,
+.max_access_size = 4,
+},
 };
 
-/* Should be called from device's map method. */
-void msix_mmio_map(PCIDevice *d, int region_num,
-   pcibus_t addr, pcibus_t size, int type)
+static void msix_mmio_setup(PCIDevice *d, MemoryRegion *bar)
 {
 uint8_t *config = d->config + d->msix_cap;
 uint32_t table = pci_get_long(config + PCI_MSIX_TABLE);
 uint32_t offset = table & ~(MSIX_PAGE_SIZE - 1);
 /* TODO: for assigned devices, we'll want to make it possible to map
  * pending bits separately in case they are in a separate bar. */
-int table_bir = table & PCI_MSIX_FLAGS_BIRMASK;
 
-if (table_bir != region_num)
-return;
-if (size <= offset)
-return;
-cpu_register_physical_memory(addr + offset, size - offset,
- d->msix_mmio_index);
+memory_region_add_subregion(bar, offset, &d->msix_mmio);
 }
 
 static void msix_mask_all(struct PCIDevice *dev, unsigned nentries)
@@ -225,6 +208,7 @@ static void msix_mask_all(struct PCIDevice *dev, unsigned 
nentries)
 /* Initialize the MSI-X structures. Note: if MSI-X is supported, BAR size is
  * modified, it should be retrieved with msix_bar_size. */
 int msix_init(struct PCIDevice *dev, unsigned 

Re: [Qemu-devel] [PATCH 2/7] Replace VMSTOP macros with a proper QemuState type

2011-08-08 Thread Avi Kivity

On 08/08/2011 05:06 PM, Luiz Capitulino wrote:

>
>  This is why I suggested a reference count.  In this case, we can always
>  stop the guest "twice", because we don't lose information when we resume.

I could give it a try in the near future, as I really think it's independent
from this series, but I still don't understand what can stop an already stopped
VM besides the user. This is a real question: is it really possible?

If only the user can do that, then the refcount is overkill as just returning
an error will do it.


Most of the reasons in QemuState are asynchronous and can happen at any 
time while the guest is running.  Because they're asynchronous, they can 
trigger before a monitor stop is issued but be caught after it is processed.


We could possibly synchronize during user stop, but that lets us capture 
only the first non-user reason.


And even if we return an error, that only pushes the refcounting to the 
user; you probably don't want to return a "vm is already stopped" error 
to the user; he'll just ask "why are you telling me that".


--
error compiling committee.c: too many arguments to function




[Qemu-devel] [PATCH v4 31/39] uhci: convert to memory API

2011-08-08 Thread Avi Kivity
Reviewed-by: Richard Henderson 
Reviewed-by: Anthony Liguori 
Signed-off-by: Avi Kivity 
---
 hw/usb-uhci.c |   42 --
 1 files changed, 28 insertions(+), 14 deletions(-)

diff --git a/hw/usb-uhci.c b/hw/usb-uhci.c
index 824e3a5..ea38169 100644
--- a/hw/usb-uhci.c
+++ b/hw/usb-uhci.c
@@ -129,6 +129,7 @@ typedef struct UHCIPort {
 
 struct UHCIState {
 PCIDevice dev;
+MemoryRegion io_bar;
 USBBus bus; /* Note unused when we're a companion controller */
 uint16_t cmd; /* cmd register */
 uint16_t status;
@@ -1096,18 +1097,19 @@ static void uhci_frame_timer(void *opaque)
 qemu_mod_timer(s->frame_timer, s->expire_time);
 }
 
-static void uhci_map(PCIDevice *pci_dev, int region_num,
-pcibus_t addr, pcibus_t size, int type)
-{
-UHCIState *s = (UHCIState *)pci_dev;
-
-register_ioport_write(addr, 32, 2, uhci_ioport_writew, s);
-register_ioport_read(addr, 32, 2, uhci_ioport_readw, s);
-register_ioport_write(addr, 32, 4, uhci_ioport_writel, s);
-register_ioport_read(addr, 32, 4, uhci_ioport_readl, s);
-register_ioport_write(addr, 32, 1, uhci_ioport_writeb, s);
-register_ioport_read(addr, 32, 1, uhci_ioport_readb, s);
-}
+static const MemoryRegionPortio uhci_portio[] = {
+{ 0, 32, 2, .write = uhci_ioport_writew, },
+{ 0, 32, 2, .read = uhci_ioport_readw, },
+{ 0, 32, 4, .write = uhci_ioport_writel, },
+{ 0, 32, 4, .read = uhci_ioport_readl, },
+{ 0, 32, 1, .write = uhci_ioport_writeb, },
+{ 0, 32, 1, .read = uhci_ioport_readb, },
+PORTIO_END_OF_LIST()
+};
+
+static const MemoryRegionOps uhci_ioport_ops = {
+.old_portio = uhci_portio,
+};
 
 static USBPortOps uhci_port_ops = {
 .attach = uhci_attach,
@@ -1154,10 +1156,11 @@ static int usb_uhci_common_initfn(PCIDevice *dev)
 
 qemu_register_reset(uhci_reset, s);
 
+memory_region_init_io(&s->io_bar, &uhci_ioport_ops, s, "uhci", 0x20);
 /* Use region 4 for consistency with real hardware.  BSD guests seem
to rely on this.  */
-pci_register_bar(&s->dev, 4, 0x20,
-   PCI_BASE_ADDRESS_SPACE_IO, uhci_map);
+pci_register_bar_region(&s->dev, 4,
+PCI_BASE_ADDRESS_SPACE_IO, &s->io_bar);
 
 return 0;
 }
@@ -1177,6 +1180,14 @@ static int usb_uhci_vt82c686b_initfn(PCIDevice *dev)
 return usb_uhci_common_initfn(dev);
 }
 
+static int usb_uhci_exit(PCIDevice *dev)
+{
+UHCIState *s = DO_UPCAST(UHCIState, dev, dev);
+
+memory_region_destroy(&s->io_bar);
+return 0;
+}
+
 static Property uhci_properties[] = {
 DEFINE_PROP_STRING("masterbus", UHCIState, masterbus),
 DEFINE_PROP_UINT32("firstport", UHCIState, firstport, 0),
@@ -1189,6 +1200,7 @@ static PCIDeviceInfo uhci_info[] = {
 .qdev.size= sizeof(UHCIState),
 .qdev.vmsd= &vmstate_uhci,
 .init = usb_uhci_common_initfn,
+.exit = usb_uhci_exit,
 .vendor_id= PCI_VENDOR_ID_INTEL,
 .device_id= PCI_DEVICE_ID_INTEL_82371SB_2,
 .revision = 0x01,
@@ -1199,6 +1211,7 @@ static PCIDeviceInfo uhci_info[] = {
 .qdev.size= sizeof(UHCIState),
 .qdev.vmsd= &vmstate_uhci,
 .init = usb_uhci_common_initfn,
+.exit = usb_uhci_exit,
 .vendor_id= PCI_VENDOR_ID_INTEL,
 .device_id= PCI_DEVICE_ID_INTEL_82371AB_2,
 .revision = 0x01,
@@ -1209,6 +1222,7 @@ static PCIDeviceInfo uhci_info[] = {
 .qdev.size= sizeof(UHCIState),
 .qdev.vmsd= &vmstate_uhci,
 .init = usb_uhci_vt82c686b_initfn,
+.exit = usb_uhci_exit,
 .vendor_id= PCI_VENDOR_ID_VIA,
 .device_id= PCI_DEVICE_ID_VIA_UHCI,
 .revision = 0x01,
-- 
1.7.5.3




Re: [Qemu-devel] [PATCH v4] XBZRLE delta for live migration of large memory apps

2011-08-08 Thread Anthony Liguori

On 08/08/2011 09:23 AM, Avi Kivity wrote:

On 08/08/2011 05:15 PM, Anthony Liguori wrote:


If we did .so plugins, which I'm really not opposed to, I'd want the
interface to be something like:

typedef struct MigrationTransportClass
{
ssize_t (*writev)(MigrationTransport *obj,
struct iovec *iov,
int iovcnt);
} MigrationTransportClass;

I think it's useful to use an interface like this because it makes it
easy to put the transport in a dedicated thread that doesn't hold
qemu_mutex (which is sort of equivalent to using a fork'd helper but
is zero-copy at the expense of less isolation).


If we have a shared object helper, the thread should be maintained by
qemu proper, not the plugin.

I wouldn't call it "migration transport", but instead a
compression/decompression plugin.

I don't think it merits a plugin at all though. There's limited scope
for compression and it best sits in qemu proper. If anything, it needs
to be more integrated (for example turning itself off if it doesn't
match enough).


That adds a tremendous amount of complexity to QEMU.  If we're going to 
change our compression algorithm, we would need to use a single 
algorithm that worked well for a wide variety of workloads.


We struggle enough with migration as it is; it would only get worse if
we had 10 different algorithms that we were dynamically enabling and disabling.


The other option is to allow 1-off compression algorithms in the form of 
plugins.  I think in this case, plugins are a pretty good compromise in 
terms of isolating complexity while allowing something that at least 
works very well for one particular type of workload.


Regards,

Anthony Liguori



[Qemu-devel] [PATCH v4 20/39] virtio-pci: convert to memory API

2011-08-08 Thread Avi Kivity
except msix.

[jan: fix build]

Reviewed-by: Richard Henderson 
Reviewed-by: Anthony Liguori 
Signed-off-by: Avi Kivity 
---
 hw/virtio-pci.c |   71 +--
 hw/virtio-pci.h |2 +-
 2 files changed, 28 insertions(+), 45 deletions(-)

diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
index f3b3293..5df380d 100644
--- a/hw/virtio-pci.c
+++ b/hw/virtio-pci.c
@@ -162,7 +162,8 @@ static int 
virtio_pci_set_host_notifier_internal(VirtIOPCIProxy *proxy,
 {
 VirtQueue *vq = virtio_get_queue(proxy->vdev, n);
 EventNotifier *notifier = virtio_queue_get_host_notifier(vq);
-int r;
+int r = 0;
+
 if (assign) {
 r = event_notifier_init(notifier, 1);
 if (r < 0) {
@@ -170,24 +171,11 @@ static int 
virtio_pci_set_host_notifier_internal(VirtIOPCIProxy *proxy,
  __func__, r);
 return r;
 }
-r = kvm_set_ioeventfd_pio_word(event_notifier_get_fd(notifier),
-   proxy->addr + VIRTIO_PCI_QUEUE_NOTIFY,
-   n, assign);
-if (r < 0) {
-error_report("%s: unable to map ioeventfd: %d",
- __func__, r);
-event_notifier_cleanup(notifier);
-}
+memory_region_add_eventfd(&proxy->bar, VIRTIO_PCI_QUEUE_NOTIFY, 2,
+  true, n, event_notifier_get_fd(notifier));
 } else {
-r = kvm_set_ioeventfd_pio_word(event_notifier_get_fd(notifier),
-   proxy->addr + VIRTIO_PCI_QUEUE_NOTIFY,
-   n, assign);
-if (r < 0) {
-error_report("%s: unable to unmap ioeventfd: %d",
- __func__, r);
-return r;
-}
-
+memory_region_del_eventfd(&proxy->bar, VIRTIO_PCI_QUEUE_NOTIFY, 2,
+  true, n, event_notifier_get_fd(notifier));
 /* Handle the race condition where the guest kicked and we deassigned
  * before we got around to handling the kick.
  */
@@ -424,7 +412,6 @@ static uint32_t virtio_pci_config_readb(void *opaque, 
uint32_t addr)
 {
 VirtIOPCIProxy *proxy = opaque;
 uint32_t config = VIRTIO_PCI_CONFIG(&proxy->pci_dev);
-addr -= proxy->addr;
 if (addr < config)
 return virtio_ioport_read(proxy, addr);
 addr -= config;
@@ -435,7 +422,6 @@ static uint32_t virtio_pci_config_readw(void *opaque, 
uint32_t addr)
 {
 VirtIOPCIProxy *proxy = opaque;
 uint32_t config = VIRTIO_PCI_CONFIG(&proxy->pci_dev);
-addr -= proxy->addr;
 if (addr < config)
 return virtio_ioport_read(proxy, addr);
 addr -= config;
@@ -446,7 +432,6 @@ static uint32_t virtio_pci_config_readl(void *opaque, 
uint32_t addr)
 {
 VirtIOPCIProxy *proxy = opaque;
 uint32_t config = VIRTIO_PCI_CONFIG(&proxy->pci_dev);
-addr -= proxy->addr;
 if (addr < config)
 return virtio_ioport_read(proxy, addr);
 addr -= config;
@@ -457,7 +442,6 @@ static void virtio_pci_config_writeb(void *opaque, uint32_t 
addr, uint32_t val)
 {
 VirtIOPCIProxy *proxy = opaque;
 uint32_t config = VIRTIO_PCI_CONFIG(&proxy->pci_dev);
-addr -= proxy->addr;
 if (addr < config) {
 virtio_ioport_write(proxy, addr, val);
 return;
@@ -470,7 +454,6 @@ static void virtio_pci_config_writew(void *opaque, uint32_t 
addr, uint32_t val)
 {
 VirtIOPCIProxy *proxy = opaque;
 uint32_t config = VIRTIO_PCI_CONFIG(&proxy->pci_dev);
-addr -= proxy->addr;
 if (addr < config) {
 virtio_ioport_write(proxy, addr, val);
 return;
@@ -483,7 +466,6 @@ static void virtio_pci_config_writel(void *opaque, uint32_t 
addr, uint32_t val)
 {
 VirtIOPCIProxy *proxy = opaque;
 uint32_t config = VIRTIO_PCI_CONFIG(&proxy->pci_dev);
-addr -= proxy->addr;
 if (addr < config) {
 virtio_ioport_write(proxy, addr, val);
 return;
@@ -492,30 +474,26 @@ static void virtio_pci_config_writel(void *opaque, 
uint32_t addr, uint32_t val)
 virtio_config_writel(proxy->vdev, addr, val);
 }
 
-static void virtio_map(PCIDevice *pci_dev, int region_num,
-   pcibus_t addr, pcibus_t size, int type)
-{
-VirtIOPCIProxy *proxy = container_of(pci_dev, VirtIOPCIProxy, pci_dev);
-VirtIODevice *vdev = proxy->vdev;
-unsigned config_len = VIRTIO_PCI_REGION_SIZE(pci_dev) + vdev->config_len;
-
-proxy->addr = addr;
-
-register_ioport_write(addr, config_len, 1, virtio_pci_config_writeb, 
proxy);
-register_ioport_write(addr, config_len, 2, virtio_pci_config_writew, 
proxy);
-register_ioport_write(addr, config_len, 4, virtio_pci_config_writel, 
proxy);
-register_ioport_read(addr, config_len, 1, virtio_pci_config_readb, proxy);
-register_ioport_read(addr, config_len, 2, virtio_pci_config_readw, proxy);
-register_ioport_read(addr, config_len, 4, virtio_pci_config_readl, 

[Qemu-devel] [PATCH v4 06/39] cirrus: simplify bitblt BAR access functions

2011-08-08 Thread Avi Kivity
Make use of the memory API's ability to satisfy multi-byte accesses via
multiple single-byte accesses.

Reviewed-by: Richard Henderson 
Reviewed-by: Anthony Liguori 
Signed-off-by: Avi Kivity 
---
 hw/cirrus_vga.c |   81 +--
 1 files changed, 13 insertions(+), 68 deletions(-)

diff --git a/hw/cirrus_vga.c b/hw/cirrus_vga.c
index 4f57b92..c39acb9 100644
--- a/hw/cirrus_vga.c
+++ b/hw/cirrus_vga.c
@@ -2446,37 +2446,23 @@ static void cirrus_linear_write(void *opaque, 
target_phys_addr_t addr,
  ***/
 
 
-static uint32_t cirrus_linear_bitblt_readb(void *opaque, target_phys_addr_t 
addr)
+static uint64_t cirrus_linear_bitblt_read(void *opaque,
+  target_phys_addr_t addr,
+  unsigned size)
 {
+CirrusVGAState *s = opaque;
 uint32_t ret;
 
 /* XXX handle bitblt */
+(void)s;
 ret = 0xff;
 return ret;
 }
 
-static uint32_t cirrus_linear_bitblt_readw(void *opaque, target_phys_addr_t 
addr)
-{
-uint32_t v;
-
-v = cirrus_linear_bitblt_readb(opaque, addr);
-v |= cirrus_linear_bitblt_readb(opaque, addr + 1) << 8;
-return v;
-}
-
-static uint32_t cirrus_linear_bitblt_readl(void *opaque, target_phys_addr_t 
addr)
-{
-uint32_t v;
-
-v = cirrus_linear_bitblt_readb(opaque, addr);
-v |= cirrus_linear_bitblt_readb(opaque, addr + 1) << 8;
-v |= cirrus_linear_bitblt_readb(opaque, addr + 2) << 16;
-v |= cirrus_linear_bitblt_readb(opaque, addr + 3) << 24;
-return v;
-}
-
-static void cirrus_linear_bitblt_writeb(void *opaque, target_phys_addr_t addr,
-uint32_t val)
+static void cirrus_linear_bitblt_write(void *opaque,
+   target_phys_addr_t addr,
+   uint64_t val,
+   unsigned size)
 {
 CirrusVGAState *s = opaque;
 
@@ -2489,55 +2475,14 @@ static void cirrus_linear_bitblt_writeb(void *opaque, 
target_phys_addr_t addr,
 }
 }
 
-static void cirrus_linear_bitblt_writew(void *opaque, target_phys_addr_t addr,
-uint32_t val)
-{
-cirrus_linear_bitblt_writeb(opaque, addr, val & 0xff);
-cirrus_linear_bitblt_writeb(opaque, addr + 1, (val >> 8) & 0xff);
-}
-
-static void cirrus_linear_bitblt_writel(void *opaque, target_phys_addr_t addr,
-uint32_t val)
-{
-cirrus_linear_bitblt_writeb(opaque, addr, val & 0xff);
-cirrus_linear_bitblt_writeb(opaque, addr + 1, (val >> 8) & 0xff);
-cirrus_linear_bitblt_writeb(opaque, addr + 2, (val >> 16) & 0xff);
-cirrus_linear_bitblt_writeb(opaque, addr + 3, (val >> 24) & 0xff);
-}
-
-static uint64_t cirrus_linear_bitblt_read(void *opaque,
-  target_phys_addr_t addr,
-  unsigned size)
-{
-CirrusVGAState *s = opaque;
-
-switch (size) {
-case 1: return cirrus_linear_bitblt_readb(s, addr);
-case 2: return cirrus_linear_bitblt_readw(s, addr);
-case 4: return cirrus_linear_bitblt_readl(s, addr);
-default: abort();
-}
-};
-
-static void cirrus_linear_bitblt_write(void *opaque,
-   target_phys_addr_t addr,
-   uint64_t data,
-   unsigned size)
-{
-CirrusVGAState *s = opaque;
-
-switch (size) {
-case 1: return cirrus_linear_bitblt_writeb(s, addr, data);
-case 2: return cirrus_linear_bitblt_writew(s, addr, data);
-case 4: return cirrus_linear_bitblt_writel(s, addr, data);
-default: abort();
-}
-};
-
 static const MemoryRegionOps cirrus_linear_bitblt_io_ops = {
 .read = cirrus_linear_bitblt_read,
 .write = cirrus_linear_bitblt_write,
 .endianness = DEVICE_LITTLE_ENDIAN,
+.impl = {
+.min_access_size = 1,
+.max_access_size = 1,
+},
 };
 
 static void unmap_bank(CirrusVGAState *s, unsigned bank)
-- 
1.7.5.3




Re: [Qemu-devel] [PULL] VirtFS coroutine changes

2011-08-08 Thread Aneesh Kumar K.V
On Mon, 08 Aug 2011 07:36:48 -0500, Anthony Liguori  
wrote:
> On 08/08/2011 04:03 AM, Aneesh Kumar K.V wrote:
> >
> > The following changes since commit 23ddf2bb1e4bfe2b72a726fe5e828807b65941ad:
> >
> >    Fix forcing multicast msgs to loopback on OpenBSD. (2011-08-07 11:06:43 +)
> 
> This needs to be sent to the list first as patches for review before you
> can do a pull request.

The patch series was sent to the list earlier:

Message-id: 1306367597-797-1-git-send-email-jv...@linux.vnet.ibm.com

http://thread.gmane.org/gmane.comp.emulators.qemu/104120

-aneesh





[Qemu-devel] [PATCH v4 30/39] ehci: convert to memory API

2011-08-08 Thread Avi Kivity
Reviewed-by: Richard Henderson 
Reviewed-by: Anthony Liguori 
Signed-off-by: Avi Kivity 
---
 hw/usb-ehci.c |   36 +---
 1 files changed, 9 insertions(+), 27 deletions(-)

diff --git a/hw/usb-ehci.c b/hw/usb-ehci.c
index 2b43895..6ef7798 100644
--- a/hw/usb-ehci.c
+++ b/hw/usb-ehci.c
@@ -370,8 +370,7 @@ struct EHCIState {
     PCIDevice dev;
     USBBus bus;
     qemu_irq irq;
-    target_phys_addr_t mem_base;
-    int mem;
+    MemoryRegion mem;
     int companion_count;
 
     /* properties */
@@ -2179,29 +2178,15 @@ static void ehci_frame_timer(void *opaque)
     qemu_mod_timer(ehci->frame_timer, expire_time);
 }
 
-static CPUReadMemoryFunc *ehci_readfn[3]={
-    ehci_mem_readb,
-    ehci_mem_readw,
-    ehci_mem_readl
-};
 
-static CPUWriteMemoryFunc *ehci_writefn[3]={
-    ehci_mem_writeb,
-    ehci_mem_writew,
-    ehci_mem_writel
+static const MemoryRegionOps ehci_mem_ops = {
+    .old_mmio = {
+        .read = { ehci_mem_readb, ehci_mem_readw, ehci_mem_readl },
+        .write = { ehci_mem_writeb, ehci_mem_writew, ehci_mem_writel },
+    },
+    .endianness = DEVICE_LITTLE_ENDIAN,
 };
 
-static void ehci_map(PCIDevice *pci_dev, int region_num,
-                     pcibus_t addr, pcibus_t size, int type)
-{
-    EHCIState *s =(EHCIState *)pci_dev;
-
-    DPRINTF("ehci_map: region %d, addr %08" PRIx64 ", size %" PRId64 ", s->mem %08X\n",
-            region_num, addr, size, s->mem);
-    s->mem_base = addr;
-    cpu_register_physical_memory(addr, size, s->mem);
-}
-
 static int usb_ehci_initfn(PCIDevice *dev);
 
 static USBPortOps ehci_port_ops = {
@@ -2316,11 +2301,8 @@ static int usb_ehci_initfn(PCIDevice *dev)
 
     qemu_register_reset(ehci_reset, s);
 
-    s->mem = cpu_register_io_memory(ehci_readfn, ehci_writefn, s,
-                                    DEVICE_LITTLE_ENDIAN);
-
-    pci_register_bar(&s->dev, 0, MMIO_SIZE, PCI_BASE_ADDRESS_SPACE_MEMORY,
-                     ehci_map);
+    memory_region_init_io(&s->mem, &ehci_mem_ops, s, "ehci", MMIO_SIZE);
+    pci_register_bar_region(&s->dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY, &s->mem);
 
     fprintf(stderr, "*** EHCI support is under development ***\n");
 
-- 
1.7.5.3
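
Unlike the cirrus conversion above, this one does not touch the handlers at all: .old_mmio wraps the existing byte/word/long CPUReadMemoryFunc/CPUWriteMemoryFunc callbacks, so a device can move to MemoryRegion registration before its handlers are rewritten for the new read/write signature. A sketch of the transitional shape, assuming foo_readb() and friends already exist with the legacy signatures (the foo_* names are illustrative):

/* indices 0, 1, 2 hold the 1-, 2- and 4-byte accessors */
static const MemoryRegionOps foo_mem_ops = {
    .old_mmio = {
        .read  = { foo_readb, foo_readw, foo_readl },
        .write = { foo_writeb, foo_writew, foo_writel },
    },
    .endianness = DEVICE_LITTLE_ENDIAN,
};

The payoff is in usb_ehci_initfn(): memory_region_init_io() plus pci_register_bar_region() replace cpu_register_io_memory(), pci_register_bar(), the hand-written ehci_map() callback, and the mem_base bookkeeping.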




[Qemu-devel] [PATCH v4 21/39] ahci: convert to memory API

2011-08-08 Thread Avi Kivity
Reviewed-by: Richard Henderson 
Reviewed-by: Anthony Liguori 
Signed-off-by: Avi Kivity 
---
 hw/ide/ahci.c |   31 +--
 hw/ide/ahci.h |2 +-
 hw/ide/ich.c  |3 +--
 3 files changed, 15 insertions(+), 21 deletions(-)

diff --git a/hw/ide/ahci.c b/hw/ide/ahci.c
index 1f008a3..e207ca0 100644
--- a/hw/ide/ahci.c
+++ b/hw/ide/ahci.c
@@ -276,12 +276,12 @@ static void  ahci_port_write(AHCIState *s, int port, int offset, uint32_t val)
     }
 }
 
-static uint32_t ahci_mem_readl(void *ptr, target_phys_addr_t addr)
+static uint64_t ahci_mem_read(void *opaque, target_phys_addr_t addr,
+                              unsigned size)
 {
-    AHCIState *s = ptr;
+    AHCIState *s = opaque;
     uint32_t val = 0;
 
-    addr = addr & 0xfff;
     if (addr < AHCI_GENERIC_HOST_CONTROL_REGS_MAX_ADDR) {
         switch (addr) {
         case HOST_CAP:
@@ -314,10 +314,10 @@ static uint32_t ahci_mem_readl(void *ptr, target_phys_addr_t addr)
 
 
 
-static void ahci_mem_writel(void *ptr, target_phys_addr_t addr, uint32_t val)
+static void ahci_mem_write(void *opaque, target_phys_addr_t addr,
+                           uint64_t val, unsigned size)
 {
-    AHCIState *s = ptr;
-    addr = addr & 0xfff;
+    AHCIState *s = opaque;
 
     /* Only aligned reads are allowed on AHCI */
     if (addr & 3) {
@@ -364,16 +364,10 @@ static void ahci_mem_writel(void *ptr, target_phys_addr_t addr, uint32_t val)
 
 }
 
-static CPUReadMemoryFunc * const ahci_readfn[3]={
-    ahci_mem_readl,
-    ahci_mem_readl,
-    ahci_mem_readl
-};
-
-static CPUWriteMemoryFunc * const ahci_writefn[3]={
-    ahci_mem_writel,
-    ahci_mem_writel,
-    ahci_mem_writel
+static MemoryRegionOps ahci_mem_ops = {
+    .read = ahci_mem_read,
+    .write = ahci_mem_write,
+    .endianness = DEVICE_LITTLE_ENDIAN,
 };
 
 static void ahci_reg_init(AHCIState *s)
@@ -1131,8 +1125,8 @@ void ahci_init(AHCIState *s, DeviceState *qdev, int ports)
     s->ports = ports;
     s->dev = qemu_mallocz(sizeof(AHCIDevice) * ports);
     ahci_reg_init(s);
-    s->mem = cpu_register_io_memory(ahci_readfn, ahci_writefn, s,
-                                    DEVICE_LITTLE_ENDIAN);
+    /* XXX BAR size should be 1k, but that breaks, so bump it to 4k for now */
+    memory_region_init_io(&s->mem, &ahci_mem_ops, s, "ahci", 0x1000);
     irqs = qemu_allocate_irqs(ahci_irq_set, s, s->ports);
 
     for (i = 0; i < s->ports; i++) {
@@ -1151,6 +1145,7 @@ void ahci_init(AHCIState *s, DeviceState *qdev, int ports)
 
 void ahci_uninit(AHCIState *s)
 {
+    memory_region_destroy(&s->mem);
     qemu_free(s->dev);
 }
 
diff --git a/hw/ide/ahci.h b/hw/ide/ahci.h
index dc86951..e456193 100644
--- a/hw/ide/ahci.h
+++ b/hw/ide/ahci.h
@@ -289,7 +289,7 @@ struct AHCIDevice {
 typedef struct AHCIState {
     AHCIDevice *dev;
     AHCIControlRegs control_regs;
-    int mem;
+    MemoryRegion mem;
    int ports;
     qemu_irq irq;
 } AHCIState;
diff --git a/hw/ide/ich.c b/hw/ide/ich.c
index d241ea8..698b5f6 100644
--- a/hw/ide/ich.c
+++ b/hw/ide/ich.c
@@ -98,8 +98,7 @@ static int pci_ich9_ahci_init(PCIDevice *dev)
     msi_init(dev, 0x50, 1, true, false);
     d->ahci.irq = d->card.irq[0];
 
-    /* XXX BAR size should be 1k, but that breaks, so bump it to 4k for now */
-    pci_register_bar_simple(&d->card, 5, 0x1000, 0, d->ahci.mem);
+    pci_register_bar_region(&d->card, 5, 0, &d->ahci.mem);
 
     return 0;
 }
-- 
1.7.5.3
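
Two details of this conversion are easy to miss. First, both 'addr = addr & 0xfff;' lines go away: memory API callbacks receive an offset relative to the region's base rather than an absolute physical address, so the device no longer masks off the BAR base itself. Second, ahci_uninit() gains a memory_region_destroy() to pair with the memory_region_init_io() in ahci_init(). A sketch of the resulting lifecycle, assuming the API as used in this series (error handling elided):

/* creation: build the region once; the PCI core maps and unmaps it */
memory_region_init_io(&s->mem, &ahci_mem_ops, s, "ahci", 0x1000);
pci_register_bar_region(&d->card, 5, 0, &d->ahci.mem);

/* teardown: every memory_region_init_io() needs exactly one destroy,
 * issued once the guest can no longer reach the region */
memory_region_destroy(&s->mem);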



