Re: [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage

2016-11-01 Thread Michael R. Hines

On 10/31/2016 05:00 PM, Michael R. Hines wrote:

On 10/18/2016 05:47 AM, Peter Lieven wrote:

On 12.10.2016 at 23:18, Michael R. Hines wrote:

Peter,

Greetings from DigitalOcean. We're experiencing the same symptoms 
without this patch.
We have, collectively, many gigabytes of un-planned-for RSS being 
used per-hypervisor

that we would like to get rid of =).

Without explicitly trying this patch (will do that ASAP), we 
immediately noticed that the
192MB mentioned immediately melts away (Yay) when we disabled the 
coroutine thread pool explicitly,
with another ~100MB in additional stack usage that would likely also 
go away if we

applied the entirety of your patch.

Is there any chance you have revisited this or have a timeline for it?


Hi Michael,

the current master already includes some of the patches of this 
original series. There are still some changes left, but

what works for me is the current master +

diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
index 5816702..3eaef68 100644
--- a/util/qemu-coroutine.c
+++ b/util/qemu-coroutine.c
@@ -25,8 +25,6 @@ enum {
 };
 
 /** Free list to speed up creation */
-static QSLIST_HEAD(, Coroutine) release_pool = QSLIST_HEAD_INITIALIZER(pool);
-static unsigned int release_pool_size;
 static __thread QSLIST_HEAD(, Coroutine) alloc_pool = QSLIST_HEAD_INITIALIZER(pool);
 static __thread unsigned int alloc_pool_size;
 static __thread Notifier coroutine_pool_cleanup_notifier;
@@ -49,20 +47,10 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
     if (CONFIG_COROUTINE_POOL) {
         co = QSLIST_FIRST(&alloc_pool);
         if (!co) {
-            if (release_pool_size > POOL_BATCH_SIZE) {
-                /* Slow path; a good place to register the destructor, too.  */
-                if (!coroutine_pool_cleanup_notifier.notify) {
-                    coroutine_pool_cleanup_notifier.notify = coroutine_pool_cleanup;
-                    qemu_thread_atexit_add(&coroutine_pool_cleanup_notifier);
-                }
-
-                /* This is not exact; there could be a little skew between
-                 * release_pool_size and the actual size of release_pool.  But
-                 * it is just a heuristic, it does not need to be perfect.
-                 */
-                alloc_pool_size = atomic_xchg(&release_pool_size, 0);
-                QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool);
-                co = QSLIST_FIRST(&alloc_pool);
+            /* Slow path; a good place to register the destructor, too.  */
+            if (!coroutine_pool_cleanup_notifier.notify) {
+                coroutine_pool_cleanup_notifier.notify = coroutine_pool_cleanup;
+                qemu_thread_atexit_add(&coroutine_pool_cleanup_notifier);
             }
         }
         if (co) {
@@ -85,11 +73,6 @@ static void coroutine_delete(Coroutine *co)
     co->caller = NULL;
 
     if (CONFIG_COROUTINE_POOL) {
-        if (release_pool_size < POOL_BATCH_SIZE * 2) {
-            QSLIST_INSERT_HEAD_ATOMIC(&release_pool, co, pool_next);
-            atomic_inc(&release_pool_size);
-            return;
-        }
         if (alloc_pool_size < POOL_BATCH_SIZE) {
             QSLIST_INSERT_HEAD(&alloc_pool, co, pool_next);
             alloc_pool_size++;

+ invoking qemu with the following environment variable set:

MALLOC_MMAP_THRESHOLD_=32768 qemu-system-x86_64

The last one makes glibc automatically use mmap when the malloced memory
exceeds 32 kByte.




Peter,

I tested the above patch (and the environment variable) --- it doesn't
quite come close to as lean of an RSS tally as the original patchset;
there's still about 70-80 MB of remaining RSS.


Any chance you could trim the remaining fat before merging this? =)




False alarm! I didn't set the MMAP threshold low enough. Now the results 
are on-par with the other patchset.


Thank you!



Re: [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage

2016-10-31 Thread Michael R. Hines

On 10/18/2016 05:47 AM, Peter Lieven wrote:

On 12.10.2016 at 23:18, Michael R. Hines wrote:

Peter,

Greetings from DigitalOcean. We're experiencing the same symptoms 
without this patch.
We have, collectively, many gigabytes of un-planned-for RSS being 
used per-hypervisor

that we would like to get rid of =).

Without explicitly trying this patch (will do that ASAP), we 
immediately noticed that the
192MB mentioned immediately melts away (Yay) when we disabled the 
coroutine thread pool explicitly,
with another ~100MB in additional stack usage that would likely also 
go away if we

applied the entirety of your patch.

Is there any chance you have revisited this or have a timeline for it?


Hi Michael,

the current master already includes some of the patches of this 
original series. There are still some changes left, but

what works for me is the current master +

diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
index 5816702..3eaef68 100644
--- a/util/qemu-coroutine.c
+++ b/util/qemu-coroutine.c
@@ -25,8 +25,6 @@ enum {
 };
 
 /** Free list to speed up creation */
-static QSLIST_HEAD(, Coroutine) release_pool = QSLIST_HEAD_INITIALIZER(pool);
-static unsigned int release_pool_size;
 static __thread QSLIST_HEAD(, Coroutine) alloc_pool = QSLIST_HEAD_INITIALIZER(pool);
 static __thread unsigned int alloc_pool_size;
 static __thread Notifier coroutine_pool_cleanup_notifier;
@@ -49,20 +47,10 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
     if (CONFIG_COROUTINE_POOL) {
         co = QSLIST_FIRST(&alloc_pool);
         if (!co) {
-            if (release_pool_size > POOL_BATCH_SIZE) {
-                /* Slow path; a good place to register the destructor, too.  */
-                if (!coroutine_pool_cleanup_notifier.notify) {
-                    coroutine_pool_cleanup_notifier.notify = coroutine_pool_cleanup;
-                    qemu_thread_atexit_add(&coroutine_pool_cleanup_notifier);
-                }
-
-                /* This is not exact; there could be a little skew between
-                 * release_pool_size and the actual size of release_pool.  But
-                 * it is just a heuristic, it does not need to be perfect.
-                 */
-                alloc_pool_size = atomic_xchg(&release_pool_size, 0);
-                QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool);
-                co = QSLIST_FIRST(&alloc_pool);
+            /* Slow path; a good place to register the destructor, too.  */
+            if (!coroutine_pool_cleanup_notifier.notify) {
+                coroutine_pool_cleanup_notifier.notify = coroutine_pool_cleanup;
+                qemu_thread_atexit_add(&coroutine_pool_cleanup_notifier);
             }
         }
         if (co) {
@@ -85,11 +73,6 @@ static void coroutine_delete(Coroutine *co)
     co->caller = NULL;
 
     if (CONFIG_COROUTINE_POOL) {
-        if (release_pool_size < POOL_BATCH_SIZE * 2) {
-            QSLIST_INSERT_HEAD_ATOMIC(&release_pool, co, pool_next);
-            atomic_inc(&release_pool_size);
-            return;
-        }
         if (alloc_pool_size < POOL_BATCH_SIZE) {
             QSLIST_INSERT_HEAD(&alloc_pool, co, pool_next);
             alloc_pool_size++;

+ invoking qemu with the following environment variable set:

MALLOC_MMAP_THRESHOLD_=32768 qemu-system-x86_64

The last one makes glibc automatically use mmap when the malloced memory
exceeds 32 kByte.




Peter,

I tested the above patch (and the environment variable) --- it doesn't
quite come close to as lean of an RSS tally as the original patchset;
there's still about 70-80 MB of remaining RSS.


Any chance you could trim the remaining fat before merging this? =)


/*
 * Michael R. Hines
 * Senior Engineer, DigitalOcean.
 */





Re: [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage

2016-10-19 Thread Michael R. Hines

Thank you for the response! I'll run off and test that. =)

/*
 * Michael R. Hines
 * Senior Engineer, DigitalOcean.
 */

On 10/18/2016 05:47 AM, Peter Lieven wrote:

On 12.10.2016 at 23:18, Michael R. Hines wrote:

Peter,

Greetings from DigitalOcean. We're experiencing the same symptoms 
without this patch.
We have, collectively, many gigabytes of un-planned-for RSS being 
used per-hypervisor

that we would like to get rid of =).

Without explicitly trying this patch (will do that ASAP), we 
immediately noticed that the
192MB mentioned immediately melts away (Yay) when we disabled the 
coroutine thread pool explicitly,
with another ~100MB in additional stack usage that would likely also 
go away if we

applied the entirety of your patch.

Is there any chance you have revisited this or have a timeline for it?


Hi Michael,

the current master already includes some of the patches of this 
original series. There are still some changes left, but

what works for me is the current master +

diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
index 5816702..3eaef68 100644
--- a/util/qemu-coroutine.c
+++ b/util/qemu-coroutine.c
@@ -25,8 +25,6 @@ enum {
 };
 
 /** Free list to speed up creation */
-static QSLIST_HEAD(, Coroutine) release_pool = QSLIST_HEAD_INITIALIZER(pool);
-static unsigned int release_pool_size;
 static __thread QSLIST_HEAD(, Coroutine) alloc_pool = QSLIST_HEAD_INITIALIZER(pool);
 static __thread unsigned int alloc_pool_size;
 static __thread Notifier coroutine_pool_cleanup_notifier;
@@ -49,20 +47,10 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
     if (CONFIG_COROUTINE_POOL) {
         co = QSLIST_FIRST(&alloc_pool);
         if (!co) {
-            if (release_pool_size > POOL_BATCH_SIZE) {
-                /* Slow path; a good place to register the destructor, too.  */
-                if (!coroutine_pool_cleanup_notifier.notify) {
-                    coroutine_pool_cleanup_notifier.notify = coroutine_pool_cleanup;
-                    qemu_thread_atexit_add(&coroutine_pool_cleanup_notifier);
-                }
-
-                /* This is not exact; there could be a little skew between
-                 * release_pool_size and the actual size of release_pool.  But
-                 * it is just a heuristic, it does not need to be perfect.
-                 */
-                alloc_pool_size = atomic_xchg(&release_pool_size, 0);
-                QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool);
-                co = QSLIST_FIRST(&alloc_pool);
+            /* Slow path; a good place to register the destructor, too.  */
+            if (!coroutine_pool_cleanup_notifier.notify) {
+                coroutine_pool_cleanup_notifier.notify = coroutine_pool_cleanup;
+                qemu_thread_atexit_add(&coroutine_pool_cleanup_notifier);
             }
         }
         if (co) {
@@ -85,11 +73,6 @@ static void coroutine_delete(Coroutine *co)
     co->caller = NULL;
 
     if (CONFIG_COROUTINE_POOL) {
-        if (release_pool_size < POOL_BATCH_SIZE * 2) {
-            QSLIST_INSERT_HEAD_ATOMIC(&release_pool, co, pool_next);
-            atomic_inc(&release_pool_size);
-            return;
-        }
         if (alloc_pool_size < POOL_BATCH_SIZE) {
             QSLIST_INSERT_HEAD(&alloc_pool, co, pool_next);
             alloc_pool_size++;

+ invoking qemu with the following environment variable set:

MALLOC_MMAP_THRESHOLD_=32768 qemu-system-x86_64

The last one makes glibc automatically use mmap when the malloced memory
exceeds 32 kByte.


Hope this helps,
Peter






Re: [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage

2016-10-18 Thread Peter Lieven

On 12.10.2016 at 23:18, Michael R. Hines wrote:

Peter,

Greetings from DigitalOcean. We're experiencing the same symptoms without this 
patch.
We have, collectively, many gigabytes of un-planned-for RSS being used 
per-hypervisor
that we would like to get rid of =).

Without explicitly trying this patch (will do that ASAP), we immediately 
noticed that the
192MB mentioned immediately melts away (Yay) when we disabled the coroutine 
thread pool explicitly,
with another ~100MB in additional stack usage that would likely also go away if 
we
applied the entirety of your patch.

Is there any chance you have revisited this or have a timeline for it?


Hi Michael,

the current master already includes some of the patches of this original 
series. There are still some changes left, but
what works for me is the current master +

diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
index 5816702..3eaef68 100644
--- a/util/qemu-coroutine.c
+++ b/util/qemu-coroutine.c
@@ -25,8 +25,6 @@ enum {
 };
 
 /** Free list to speed up creation */
-static QSLIST_HEAD(, Coroutine) release_pool = QSLIST_HEAD_INITIALIZER(pool);
-static unsigned int release_pool_size;
 static __thread QSLIST_HEAD(, Coroutine) alloc_pool = QSLIST_HEAD_INITIALIZER(pool);
 static __thread unsigned int alloc_pool_size;
 static __thread Notifier coroutine_pool_cleanup_notifier;
@@ -49,20 +47,10 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
     if (CONFIG_COROUTINE_POOL) {
         co = QSLIST_FIRST(&alloc_pool);
         if (!co) {
-            if (release_pool_size > POOL_BATCH_SIZE) {
-                /* Slow path; a good place to register the destructor, too.  */
-                if (!coroutine_pool_cleanup_notifier.notify) {
-                    coroutine_pool_cleanup_notifier.notify = coroutine_pool_cleanup;
-                    qemu_thread_atexit_add(&coroutine_pool_cleanup_notifier);
-                }
-
-                /* This is not exact; there could be a little skew between
-                 * release_pool_size and the actual size of release_pool.  But
-                 * it is just a heuristic, it does not need to be perfect.
-                 */
-                alloc_pool_size = atomic_xchg(&release_pool_size, 0);
-                QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool);
-                co = QSLIST_FIRST(&alloc_pool);
+            /* Slow path; a good place to register the destructor, too.  */
+            if (!coroutine_pool_cleanup_notifier.notify) {
+                coroutine_pool_cleanup_notifier.notify = coroutine_pool_cleanup;
+                qemu_thread_atexit_add(&coroutine_pool_cleanup_notifier);
             }
         }
         if (co) {
@@ -85,11 +73,6 @@ static void coroutine_delete(Coroutine *co)
     co->caller = NULL;
 
     if (CONFIG_COROUTINE_POOL) {
-        if (release_pool_size < POOL_BATCH_SIZE * 2) {
-            QSLIST_INSERT_HEAD_ATOMIC(&release_pool, co, pool_next);
-            atomic_inc(&release_pool_size);
-            return;
-        }
         if (alloc_pool_size < POOL_BATCH_SIZE) {
             QSLIST_INSERT_HEAD(&alloc_pool, co, pool_next);
             alloc_pool_size++;

+ invoking qemu with the following environment variable set:

MALLOC_MMAP_THRESHOLD_=32768 qemu-system-x86_64

The last one makes glibc automatically use mmap when the malloced memory
exceeds 32 kByte.
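As an aside, the same threshold can also be set programmatically via glibc's
mallopt(); a minimal sketch, for illustration only and not part of this series:

#include <malloc.h>

int main(void)
{
    /* Ask glibc to serve any allocation larger than 32 KiB via mmap(), so
     * the memory goes back to the kernel on free().  This mirrors setting
     * MALLOC_MMAP_THRESHOLD_=32768 in the environment and, like the
     * environment variable, disables glibc's dynamic threshold adjustment. */
    mallopt(M_MMAP_THRESHOLD, 32 * 1024);

    /* ... rest of the program ... */
    return 0;
}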

Hope this helps,
Peter




Re: [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage

2016-10-12 Thread Michael R. Hines

Peter,

Greetings from DigitalOcean. We're experiencing the same symptoms 
without this patch.
We have, collectively, many gigabytes of un-planned-for RSS being used 
per-hypervisor

that we would like to get rid of =).

Without explicitly trying this patch (will do that ASAP), we immediately 
noticed that the
192MB mentioned immediately melts away (Yay) when we disabled the 
coroutine thread pool explicitly,
with another ~100MB in additional stack usage that would likely also go 
away if we

applied the entirety of your patch.

Is there any chance you have revisited this or have a timeline for it?

- Michael

/*
 * Michael R. Hines
 * Senior Engineer, DigitalOcean.
 */

On 06/28/2016 04:01 AM, Peter Lieven wrote:

I recently found that Qemu is using several hundred megabytes of RSS memory
more than older versions such as Qemu 2.2.0. So I started tracing
memory allocation and found 2 major reasons for this.

1) We changed the qemu coroutine pool to have a per thread and a global release
pool. The chosen pool size and the changed algorithm could lead to up to
192 free coroutines with just a single iothread. Each of the coroutines
in the pool has 1MB of stack memory.

2) Between Qemu 2.2.0 and 2.3.0 RCU was introduced which led to delayed freeing
of memory. This led to higher heap allocations which could not effectively
be returned to the kernel (most likely due to fragmentation).

The following series is what I came up with. Besides the coroutine patches I
changed some allocations to forcibly use mmap. All these allocations are not
repeatedly made during runtime, so the impact of using mmap should be negligible.

There are still some big malloced allocations left which cannot be easily
changed (e.g. the pixman buffers in VNC). So it might be an idea to set a
lower mmap threshold for malloc since this threshold seems to be in the
order of several megabytes on modern systems.

Peter Lieven (15):
   coroutine-ucontext: mmap stack memory
   coroutine-ucontext: add a switch to monitor maximum stack size
   coroutine-ucontext: reduce stack size to 64kB
   coroutine: add a knob to disable the shared release pool
   util: add a helper to mmap private anonymous memory
   exec: use mmap for subpages
   qapi: use mmap for QmpInputVisitor
   virtio: use mmap for VirtQueue
   loader: use mmap for ROMs
   vmware_svga: use mmap for scratch pad
   qom: use mmap for bigger Objects
   util: add a function to realloc mmapped memory
   exec: use mmap for PhysPageMap->nodes
   vnc-tight: make the encoding palette static
   vnc: use mmap for VncState

  configure | 33 ++--
  exec.c| 11 ---
  hw/core/loader.c  | 16 +-
  hw/display/vmware_vga.c   |  3 +-
  hw/virtio/virtio.c|  5 +--
  include/qemu/mmap-alloc.h |  7 +
  include/qom/object.h  |  1 +
  qapi/qmp-input-visitor.c  |  5 +--
  qom/object.c  | 20 ++--
  ui/vnc-enc-tight.c| 21 ++---
  ui/vnc.c  |  5 +--
  ui/vnc.h  |  1 +
  util/coroutine-ucontext.c | 66 +--
  util/mmap-alloc.c | 27 
  util/qemu-coroutine.c | 79 ++-
  15 files changed, 225 insertions(+), 75 deletions(-)





Re: [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage

2016-06-28 Thread Peter Lieven

On 28.06.2016 at 16:43, Peter Lieven wrote:

On 28.06.2016 at 14:56, Dr. David Alan Gilbert wrote:

* Peter Lieven (p...@kamp.de) wrote:

On 28.06.2016 at 14:29, Paolo Bonzini wrote:

On 28.06.2016 at 13:37, Paolo Bonzini wrote:

On 28/06/2016 11:01, Peter Lieven wrote:

I recently found that Qemu is using several hundred megabytes of RSS
memory
more than older versions such as Qemu 2.2.0. So I started tracing
memory allocation and found 2 major reasons for this.

1) We changed the qemu coroutine pool to have a per thread and a global
   release pool. The chosen pool size and the changed algorithm could lead
   to up to 192 free coroutines with just a single iothread. Each of the
   coroutines in the pool has 1MB of stack memory.

But the fix, as you correctly note, is to reduce the stack size.  It
would be nice to compile block-obj-y with -Wstack-usage=2048 too.

To reveal if there are any big stack allocations in the block layer?

Yes.  Most should be fixed by now, but a handful are probably still there.
(definitely one in vvfat.c).


As it seems reducing to 64kB breaks live migration in some (non reproducible) 
cases.

Does it hit the guard page?

What would that look like? I get segfaults like this:

segfault at 7f91aa642b78 ip 555ab714ef7d sp 7f91aa642b50 error 6 in 
qemu-system-x86_64[555ab6f2c000+794000]

most of the time error 6. Sometimes error 7. segfault is near the sp.

A backtrace would be good.


Here we go. My old friend nc_sendv_compat ;-)


This has already been fixed in master. My test systems use an older Qemu ;-)

Peter



Again the question: Would you go for reducing the stack size and eliminating all
stack eaters?

The static netbuf in nc_sendv_compat is no problem.

And: I would go for adding the guard page without MAP_GROWSDOWN and mmapping
the rest of the stack with this flag if available. So we are safe on non-Linux
systems, on Linux before 3.9, or with merged memory regions.
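A minimal sketch of that idea (an illustration with hypothetical names, not the
actual patch): map the stack plus one extra page and mprotect() the lowest page
so that an overflow faults immediately.

#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical sketch: allocate a coroutine stack with an explicit guard
 * page at the bottom instead of relying on MAP_GROWSDOWN.  The usable
 * stack starts one page above the mapping base. */
static void *alloc_guarded_stack(size_t stack_size)
{
    size_t page = sysconf(_SC_PAGESIZE);
    void *base = mmap(NULL, stack_size + page, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) {
        return NULL;
    }
    /* make the lowest page inaccessible: a stack overflow now segfaults
     * right away instead of silently corrupting other memory */
    if (mprotect(base, page, PROT_NONE) < 0) {
        munmap(base, stack_size + page);
        return NULL;
    }
    return (char *)base + page;
}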

Peter

---

Program received signal SIGSEGV, Segmentation fault.
0x55a2ee35 in nc_sendv_compat (nc=0x0, iov=0x0, iovcnt=0, flags=0)
at net/net.c:701
(gdb) bt full
#0  0x55a2ee35 in nc_sendv_compat (nc=0x0, iov=0x0, iovcnt=0, flags=0)
at net/net.c:701
buf = '\000' ...
buffer = 0x0
offset = 0
#1  0x55a2f058 in qemu_deliver_packet_iov (sender=0x565a46b0,
flags=0, iov=0x77e98d20, iovcnt=1, opaque=0x57802370)
at net/net.c:745
nc = 0x57802370
ret = 21845
#2  0x55a3132d in qemu_net_queue_deliver (queue=0x57802590,
sender=0x565a46b0, flags=0, data=0x5659e2a8 "", size=74)
at net/queue.c:163
ret = -1
iov = {iov_base = 0x5659e2a8, iov_len = 74}
#3  0x55a3178b in qemu_net_queue_flush (queue=0x57802590)
at net/queue.c:260
packet = 0x5659e280
ret = 21845
#4  0x55a2eb7a in qemu_flush_or_purge_queued_packets (
nc=0x57802370, purge=false) at net/net.c:629
No locals.
#5  0x55a2ebe4 in qemu_flush_queued_packets (nc=0x57802370)
at net/net.c:642
No locals.
#6  0x557747b7 in virtio_net_set_status (vdev=0x56fb32a8,
status=7 '\a') at /usr/src/qemu-2.5.0/hw/net/virtio-net.c:178
ncs = 0x57802370
queue_started = true
n = 0x56fb32a8
__func__ = "virtio_net_set_status"
q = 0x57308b50
i = 0
queue_status = 7 '\a'
#7  0x55795501 in virtio_set_status (vdev=0x56fb32a8, val=7 '\a')
at /usr/src/qemu-2.5.0/hw/virtio/virtio.c:618
k = 0x5657eb40
__func__ = "virtio_set_status"
#8  0x557985e6 in virtio_vmstate_change (opaque=0x56fb32a8,
running=1, state=RUN_STATE_RUNNING)
at /usr/src/qemu-2.5.0/hw/virtio/virtio.c:1539
vdev = 0x56fb32a8
qbus = 0x56fb3240
__func__ = "virtio_vmstate_change"
k = 0x56570420
backend_run = true
#9  0x558592ae in vm_state_notify (running=1, state=RUN_STATE_RUNNING)
at vl.c:1601
e = 0x57320cf0
next = 0x57af4c40
#10 0x5585737d in vm_start () at vl.c:756
requested = RUN_STATE_MAX
#11 0x55a209ec in process_incoming_migration_co (opaque=0x566a1600)
at migration/migration.c:392
f = 0x566a1600
local_err = 0x0
mis = 0x575ab0e0
ps = POSTCOPY_INCOMING_NONE
ret = 0
#12 0x55b61efd in coroutine_trampoline (i0=1465036928, i1=21845)
at util/coroutine-ucontext.c:80
arg = {p = 0x5752b080, i = {1465036928, 21845}}
self = 0x5752b080
co = 0x5752b080
#13 0x75cb7800 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#14 0x7fffcb40 in ?? ()
No symbol table info available.
#15 0x in ?? ()
No symbol table info available.




Dave





Re: [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage

2016-06-28 Thread Peter Lieven

On 28.06.2016 at 14:56, Dr. David Alan Gilbert wrote:

* Peter Lieven (p...@kamp.de) wrote:

On 28.06.2016 at 14:29, Paolo Bonzini wrote:

On 28.06.2016 at 13:37, Paolo Bonzini wrote:

On 28/06/2016 11:01, Peter Lieven wrote:

I recently found that Qemu is using several hundred megabytes of RSS
memory
more than older versions such as Qemu 2.2.0. So I started tracing
memory allocation and found 2 major reasons for this.

1) We changed the qemu coroutine pool to have a per thread and a global
   release pool. The chosen pool size and the changed algorithm could lead
   to up to 192 free coroutines with just a single iothread. Each of the
   coroutines in the pool has 1MB of stack memory.

But the fix, as you correctly note, is to reduce the stack size.  It
would be nice to compile block-obj-y with -Wstack-usage=2048 too.

To reveal if there are any big stack allocations in the block layer?

Yes.  Most should be fixed by now, but a handful are probably still there.
(definitely one in vvfat.c).


As it seems reducing to 64kB breaks live migration in some (non reproducible) 
cases.

Does it hit the guard page?

What would that look like? I get segfaults like this:

segfault at 7f91aa642b78 ip 555ab714ef7d sp 7f91aa642b50 error 6 in 
qemu-system-x86_64[555ab6f2c000+794000]

most of the time error 6. Sometimes error 7. segfault is near the sp.

A backtrace would be good.


Here we go. My old friend nc_sendv_compat ;-)

Again the question: Would you go for reducing the stack size and eliminating all
stack eaters?

The static netbuf in nc_sendv_compat is no problem.

And: I would go for adding the guard page without MAP_GROWSDOWN and mmapping
the rest of the stack with this flag if available. So we are safe on non-Linux
systems, on Linux before 3.9, or with merged memory regions.

Peter

---

Program received signal SIGSEGV, Segmentation fault.
0x55a2ee35 in nc_sendv_compat (nc=0x0, iov=0x0, iovcnt=0, flags=0)
at net/net.c:701
(gdb) bt full
#0  0x55a2ee35 in nc_sendv_compat (nc=0x0, iov=0x0, iovcnt=0, flags=0)
at net/net.c:701
buf = '\000' ...
buffer = 0x0
offset = 0
#1  0x55a2f058 in qemu_deliver_packet_iov (sender=0x565a46b0,
flags=0, iov=0x77e98d20, iovcnt=1, opaque=0x57802370)
at net/net.c:745
nc = 0x57802370
ret = 21845
#2  0x55a3132d in qemu_net_queue_deliver (queue=0x57802590,
sender=0x565a46b0, flags=0, data=0x5659e2a8 "", size=74)
at net/queue.c:163
ret = -1
iov = {iov_base = 0x5659e2a8, iov_len = 74}
#3  0x55a3178b in qemu_net_queue_flush (queue=0x57802590)
at net/queue.c:260
packet = 0x5659e280
ret = 21845
#4  0x55a2eb7a in qemu_flush_or_purge_queued_packets (
nc=0x57802370, purge=false) at net/net.c:629
No locals.
#5  0x55a2ebe4 in qemu_flush_queued_packets (nc=0x57802370)
at net/net.c:642
No locals.
#6  0x557747b7 in virtio_net_set_status (vdev=0x56fb32a8,
status=7 '\a') at /usr/src/qemu-2.5.0/hw/net/virtio-net.c:178
ncs = 0x57802370
queue_started = true
n = 0x56fb32a8
__func__ = "virtio_net_set_status"
q = 0x57308b50
i = 0
queue_status = 7 '\a'
#7  0x55795501 in virtio_set_status (vdev=0x56fb32a8, val=7 '\a')
at /usr/src/qemu-2.5.0/hw/virtio/virtio.c:618
k = 0x5657eb40
__func__ = "virtio_set_status"
#8  0x557985e6 in virtio_vmstate_change (opaque=0x56fb32a8,
running=1, state=RUN_STATE_RUNNING)
at /usr/src/qemu-2.5.0/hw/virtio/virtio.c:1539
vdev = 0x56fb32a8
qbus = 0x56fb3240
__func__ = "virtio_vmstate_change"
k = 0x56570420
backend_run = true
#9  0x558592ae in vm_state_notify (running=1, state=RUN_STATE_RUNNING)
at vl.c:1601
e = 0x57320cf0
next = 0x57af4c40
#10 0x5585737d in vm_start () at vl.c:756
requested = RUN_STATE_MAX
#11 0x55a209ec in process_incoming_migration_co (opaque=0x566a1600)
at migration/migration.c:392
f = 0x566a1600
local_err = 0x0
mis = 0x575ab0e0
ps = POSTCOPY_INCOMING_NONE
ret = 0
#12 0x55b61efd in coroutine_trampoline (i0=1465036928, i1=21845)
at util/coroutine-ucontext.c:80
arg = {p = 0x5752b080, i = {1465036928, 21845}}
self = 0x5752b080
co = 0x5752b080
#13 0x75cb7800 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#14 0x7fffcb40 in ?? ()
No symbol table info available.
#15 0x in ?? ()
No symbol table info available.




Dave





Re: [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage

2016-06-28 Thread Paolo Bonzini


- Original Message -
> From: "Peter Lieven" 
> To: "Paolo Bonzini" 
> Cc: qemu-devel@nongnu.org, kw...@redhat.com, "peter maydell" 
> , m...@redhat.com,
> dgilb...@redhat.com, mre...@redhat.com, kra...@redhat.com
> Sent: Tuesday, June 28, 2016 2:33:02 PM
> Subject: Re: [PATCH 00/15] optimize Qemu RSS usage
> 
> On 28.06.2016 at 14:29, Paolo Bonzini wrote:
> >> On 28.06.2016 at 13:37, Paolo Bonzini wrote:
> >>> On 28/06/2016 11:01, Peter Lieven wrote:
>  I recently found that Qemu is using several hundred megabytes of RSS
>  memory
>  more than older versions such as Qemu 2.2.0. So I started tracing
>  memory allocation and found 2 major reasons for this.
> 
>  1) We changed the qemu coroutine pool to have a per thread and a global
>  release pool. The chosen pool size and the changed algorithm could lead
>  to up to 192 free coroutines with just a single iothread. Each of the
>  coroutines in the pool has 1MB of stack memory.
> >>> But the fix, as you correctly note, is to reduce the stack size.  It
> >>> would be nice to compile block-obj-y with -Wstack-usage=2048 too.
> >> To reveal if there are any big stack allocations in the block layer?
> > Yes.  Most should be fixed by now, but a handful are probably still there.
> > (definitely one in vvfat.c).
> >
> >> As it seems reducing to 64kB breaks live migration in some (non
> >> reproducible) cases.
> > Does it hit the guard page?
> 
> What would that look like? I get segfaults like this:
> 
> segfault at 7f91aa642b78 ip 555ab714ef7d sp 7f91aa642b50 error 6 in
> qemu-system-x86_64[555ab6f2c000+794000]
> 
> most of the time error 6. Sometimes error 7. segfault is near the sp.

You can use "p ((CoroutineUContext*)current)->stack" from gdb
to check the stack base of the currently running coroutine (do it in the thread
that received the segfault).

You can also check the instruction with that ip and try to get a backtrace.

Paolo


>  2) Between Qemu 2.2.0 and 2.3.0 RCU was introduced which led to delayed
>  freeing of memory. This led to higher heap allocations which could not
>  effectively be returned to the kernel (most likely due to fragmentation).
> >>> I agree that some of the exec.c allocations need some care, but I would
> >>> prefer to use a custom free list or lazy allocation instead of mmap.
> >> This would only help if the elements from the free list would be allocated
> >> using mmap? The issue is that RCU delays the freeing so that the number of
> >> concurrent allocations is high and then a bunch is freed at once. If the
> >> memory
> >> was malloced it would still have caused trouble.
> > The free list should improve reuse and fragmentation.  I'll take a look at
> > lazy allocation of subpages, too.
> 
> Ok, that would be good. And for the PhysPageMap we use mmap and try to avoid
> the realloc?

I think that with lazy allocation of subpages the PhysPageMap will be much
smaller, but I need to check.

Paolo



Re: [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage

2016-06-28 Thread Dr. David Alan Gilbert
* Peter Lieven (p...@kamp.de) wrote:
> On 28.06.2016 at 14:29, Paolo Bonzini wrote:
> > > On 28.06.2016 at 13:37, Paolo Bonzini wrote:
> > > > On 28/06/2016 11:01, Peter Lieven wrote:
> > > > > I recently found that Qemu is using several hundred megabytes of RSS
> > > > > memory
> > > > > more than older versions such as Qemu 2.2.0. So I started tracing
> > > > > memory allocation and found 2 major reasons for this.
> > > > > 
> > > > > 1) We changed the qemu coroutine pool to have a per thread and a
> > > > > global release pool. The chosen pool size and the changed algorithm
> > > > > could lead to up to 192 free coroutines with just a single iothread.
> > > > > Each of the coroutines in the pool has 1MB of stack memory.
> > > > But the fix, as you correctly note, is to reduce the stack size.  It
> > > > would be nice to compile block-obj-y with -Wstack-usage=2048 too.
> > > To reveal if there are any big stack allocations in the block layer?
> > Yes.  Most should be fixed by now, but a handful are probably still there.
> > (definitely one in vvfat.c).
> > 
> > > As it seems reducing to 64kB breaks live migration in some (non 
> > > reproducible) cases.
> > Does it hit the guard page?
> 
> What would that look like? I get segfaults like this:
> 
> segfault at 7f91aa642b78 ip 555ab714ef7d sp 7f91aa642b50 error 6 in 
> qemu-system-x86_64[555ab6f2c000+794000]
> 
> most of the time error 6. Sometimes error 7. segfault is near the sp.

A backtrace would be good.

Dave

> 
> 
> > 
> > > > > 2) Between Qemu 2.2.0 and 2.3.0 RCU was introduced which led to
> > > > > delayed freeing of memory. This led to higher heap allocations which
> > > > > could not effectively be returned to the kernel (most likely due to
> > > > > fragmentation).
> > > > I agree that some of the exec.c allocations need some care, but I would
> > > > prefer to use a custom free list or lazy allocation instead of mmap.
> > > This would only help if the elements from the free list would be allocated
> > > using mmap? The issue is that RCU delays the freeing so that the number of
> > > concurrent allocations is high and then a bunch is freed at once. If the 
> > > memory
> > > was malloced it would still have caused trouble.
> > The free list should improve reuse and fragmentation.  I'll take a look at
> > lazy allocation of subpages, too.
> 
> Ok, that would be good. And for the PhysPageMap we use mmap and try to avoid
> the realloc?
> 
> Peter
> 
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage

2016-06-28 Thread Peter Lieven

On 28.06.2016 at 14:29, Paolo Bonzini wrote:

On 28.06.2016 at 13:37, Paolo Bonzini wrote:

On 28/06/2016 11:01, Peter Lieven wrote:

I recently found that Qemu is using several hundred megabytes of RSS
memory
more than older versions such as Qemu 2.2.0. So I started tracing
memory allocation and found 2 major reasons for this.

1) We changed the qemu coroutine pool to have a per thread and a global
 release pool. The chosen pool size and the changed algorithm could lead
 to up to 192 free coroutines with just a single iothread. Each of the
 coroutines in the pool has 1MB of stack memory.

But the fix, as you correctly note, is to reduce the stack size.  It
would be nice to compile block-obj-y with -Wstack-usage=2048 too.

To reveal if there are any big stack allocations in the block layer?

Yes.  Most should be fixed by now, but a handful are probably still there.
(definitely one in vvfat.c).


As it seems reducing to 64kB breaks live migration in some (non reproducible) 
cases.

Does it hit the guard page?


What would that look like? I get segfaults like this:

segfault at 7f91aa642b78 ip 555ab714ef7d sp 7f91aa642b50 error 6 in 
qemu-system-x86_64[555ab6f2c000+794000]

most of the time error 6. Sometimes error 7. segfault is near the sp.





2) Between Qemu 2.2.0 and 2.3.0 RCU was introduced which led to delayed
 freeing of memory. This led to higher heap allocations which could not
 effectively be returned to the kernel (most likely due to fragmentation).

I agree that some of the exec.c allocations need some care, but I would
prefer to use a custom free list or lazy allocation instead of mmap.

This would only help if the elements from the free list would be allocated
using mmap? The issue is that RCU delays the freeing so that the number of
concurrent allocations is high and then a bunch is freed at once. If the memory
was malloced it would still have caused trouble.

The free list should improve reuse and fragmentation.  I'll take a look at
lazy allocation of subpages, too.


Ok, that would be good. And for the PhysPageMap we use mmap and try to avoid
the realloc?

Peter




Re: [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage

2016-06-28 Thread Paolo Bonzini
> On 28.06.2016 at 13:37, Paolo Bonzini wrote:
> > On 28/06/2016 11:01, Peter Lieven wrote:
> >> I recently found that Qemu is using several hundred megabytes of RSS
> >> memory
> >> more than older versions such as Qemu 2.2.0. So I started tracing
> >> memory allocation and found 2 major reasons for this.
> >>
> >> 1) We changed the qemu coroutine pool to have a per thread and a global
> >> release pool. The chosen pool size and the changed algorithm could lead
> >> to up to 192 free coroutines with just a single iothread. Each of the
> >> coroutines in the pool has 1MB of stack memory.
> > But the fix, as you correctly note, is to reduce the stack size.  It
> > would be nice to compile block-obj-y with -Wstack-usage=2048 too.
> 
> To reveal if there are any big stack allocations in the block layer?

Yes.  Most should be fixed by now, but a handful are probably still there.
(definitely one in vvfat.c).

> As it seems reducing to 64kB breaks live migration in some (non reproducible) 
> cases.

Does it hit the guard page?

> >> 2) Between Qemu 2.2.0 and 2.3.0 RCU was introduced which led to delayed
> >> freeing of memory. This led to higher heap allocations which could not
> >> effectively be returned to the kernel (most likely due to fragmentation).
> > I agree that some of the exec.c allocations need some care, but I would
> > prefer to use a custom free list or lazy allocation instead of mmap.
> 
> This would only help if the elements from the free list would be allocated
> using mmap? The issue is that RCU delays the freeing so that the number of
> concurrent allocations is high and then a bunch is freed at once. If the 
> memory
> was malloced it would still have caused trouble.

The free list should improve reuse and fragmentation.  I'll take a look at
lazy allocation of subpages, too.

Paolo



Re: [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage

2016-06-28 Thread Peter Lieven

On 28.06.2016 at 13:37, Paolo Bonzini wrote:


On 28/06/2016 11:01, Peter Lieven wrote:

I recently found that Qemu is using several hundred megabytes of RSS memory
more than older versions such as Qemu 2.2.0. So I started tracing
memory allocation and found 2 major reasons for this.

1) We changed the qemu coroutine pool to have a per thread and a global release
pool. The chosen pool size and the changed algorithm could lead to up to
192 free coroutines with just a single iothread. Each of the coroutines
in the pool has 1MB of stack memory.

But the fix, as you correctly note, is to reduce the stack size.  It
would be nice to compile block-obj-y with -Wstack-usage=2048 too.


To reveal if there are any big stack allocations in the block layer?

As it seems reducing to 64kB breaks live migration in some (non reproducible) 
cases.
The question is which way to go? Reduce the stack size and fix the big stack 
allocations
or keep the stack size at 1MB?




2) Between Qemu 2.2.0 and 2.3.0 RCU was introduced which led to delayed freeing
of memory. This led to higher heap allocations which could not effectively
be returned to the kernel (most likely due to fragmentation).

I agree that some of the exec.c allocations need some care, but I would
prefer to use a custom free list or lazy allocation instead of mmap.


This would only help if the elements from the free list would be allocated using
mmap? The issue is that RCU delays the freeing so that the number of concurrent
allocations is high and then a bunch is freed at once. If the memory was 
malloced
it would still have caused trouble.



Changing allocations to use mmap also is not really useful if you do it
for objects that are never freed (as in patches 8-9-10-15 at least, and
probably 11 too which is one of the most contentious).


9 actually frees the memory ;-)
15 frees the memory as soon as the vnc client disconnects.

The others I agree with. Whether the objects in Patch 11 are freed needs to be checked.



In other words, the effort tracking down the allocation is really,
really appreciated.  But the patches look like you only had a hammer at
hand, and everything looked like a nail. :)


I have just observed that forcing ptmalloc to use mmap for everything
above 4kB significantly reduced the RSS usage.

Peter




Re: [Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage

2016-06-28 Thread Paolo Bonzini


On 28/06/2016 11:01, Peter Lieven wrote:
> I recently found that Qemu is using several hundred megabytes of RSS memory
> more than older versions such as Qemu 2.2.0. So I started tracing
> memory allocation and found 2 major reasons for this.
> 
> 1) We changed the qemu coroutine pool to have a per thread and a global 
> release
>    pool. The chosen pool size and the changed algorithm could lead to up to
>    192 free coroutines with just a single iothread. Each of the coroutines
>    in the pool has 1MB of stack memory.

But the fix, as you correctly note, is to reduce the stack size.  It
would be nice to compile block-obj-y with -Wstack-usage=2048 too.
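
For context, a hedged sketch of the kind of code that flag would catch
(hypothetical example, not taken from the QEMU tree): any function whose
frame exceeds 2048 bytes, e.g. a large on-stack buffer.

#include <string.h>

/* Hypothetical example of what -Wstack-usage=2048 flags: a function whose
 * stack frame (here a 4 KiB on-stack buffer) exceeds the 2048-byte limit. */
static void fill_scratch(char *out, size_t len)
{
    char buf[4096];                 /* large on-stack allocation */
    memset(buf, 0, sizeof(buf));
    memcpy(out, buf, len < sizeof(buf) ? len : sizeof(buf));
}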

> 2) Between Qemu 2.2.0 and 2.3.0 RCU was introduced which led to delayed
>    freeing of memory. This led to higher heap allocations which could not
>    effectively be returned to the kernel (most likely due to fragmentation).

I agree that some of the exec.c allocations need some care, but I would
prefer to use a custom free list or lazy allocation instead of mmap.
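
A rough sketch of the free-list idea might look like this (an illustration
only, with hypothetical names, not code from any patch): recycle fixed-size
nodes instead of handing them back to malloc, so RCU-delayed frees do not
fragment the heap.

#include <stdlib.h>

/* Hypothetical sketch of a trivial free list for same-sized objects. */
typedef struct FreeNode {
    struct FreeNode *next;
} FreeNode;

static FreeNode *free_list;

static void *node_alloc(size_t size)
{
    /* reuse a previously freed node if one is available */
    if (free_list) {
        FreeNode *n = free_list;
        free_list = n->next;
        return n;
    }
    /* make sure the object is big enough to hold the list link later */
    return malloc(size > sizeof(FreeNode) ? size : sizeof(FreeNode));
}

static void node_free(void *ptr)
{
    /* keep the object for reuse instead of returning it to the heap */
    FreeNode *n = ptr;
    n->next = free_list;
    free_list = n;
}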

Changing allocations to use mmap also is not really useful if you do it
for objects that are never freed (as in patches 8-9-10-15 at least, and
probably 11 too which is one of the most contentious).

In other words, the effort tracking down the allocation is really,
really appreciated.  But the patches look like you only had a hammer at
hand, and everything looked like a nail. :)

Paolo

> The following series is what I came up with. Besides the coroutine patches I
> changed some allocations to forcibly use mmap. All these allocations are not
> repeatedly made during runtime, so the impact of using mmap should be negligible.
> 
> There are still some big malloced allocations left which cannot be easily
> changed (e.g. the pixman buffers in VNC). So it might be an idea to set a
> lower mmap threshold for malloc since this threshold seems to be in the
> order of several megabytes on modern systems.
> 
> Peter Lieven (15):
>   coroutine-ucontext: mmap stack memory
>   coroutine-ucontext: add a switch to monitor maximum stack size
>   coroutine-ucontext: reduce stack size to 64kB
>   coroutine: add a knob to disable the shared release pool
>   util: add a helper to mmap private anonymous memory
>   exec: use mmap for subpages
>   qapi: use mmap for QmpInputVisitor
>   virtio: use mmap for VirtQueue
>   loader: use mmap for ROMs
>   vmware_svga: use mmap for scratch pad
>   qom: use mmap for bigger Objects
>   util: add a function to realloc mmapped memory
>   exec: use mmap for PhysPageMap->nodes
>   vnc-tight: make the encoding palette static
>   vnc: use mmap for VncState
> 
>  configure | 33 ++--
>  exec.c| 11 ---
>  hw/core/loader.c  | 16 +-
>  hw/display/vmware_vga.c   |  3 +-
>  hw/virtio/virtio.c|  5 +--
>  include/qemu/mmap-alloc.h |  7 +
>  include/qom/object.h  |  1 +
>  qapi/qmp-input-visitor.c  |  5 +--
>  qom/object.c  | 20 ++--
>  ui/vnc-enc-tight.c| 21 ++---
>  ui/vnc.c  |  5 +--
>  ui/vnc.h  |  1 +
>  util/coroutine-ucontext.c | 66 +--
>  util/mmap-alloc.c | 27 
>  util/qemu-coroutine.c | 79 
> ++-
>  15 files changed, 225 insertions(+), 75 deletions(-)
> 



[Qemu-devel] [PATCH 00/15] optimize Qemu RSS usage

2016-06-28 Thread Peter Lieven
I recently found that Qemu is using several hundred megabytes of RSS memory
more than older versions such as Qemu 2.2.0. So I started tracing
memory allocation and found 2 major reasons for this.

1) We changed the qemu coroutine pool to have a per thread and a global release
   pool. The chosen pool size and the changed algorithm could lead to up to
   192 free coroutines with just a single iothread. Each of the coroutines
   in the pool has 1MB of stack memory.

2) Between Qemu 2.2.0 and 2.3.0 RCU was introduced which led to delayed freeing
   of memory. This led to higher heap allocations which could not effectively
   be returned to the kernel (most likely due to fragmentation).

The following series is what I came up with. Besides the coroutine patches I
changed some allocations to forcibly use mmap. All these allocations are not
repeatedly made during runtime, so the impact of using mmap should be negligible.
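
For illustration, a minimal sketch of what such a private anonymous mmap
helper might look like (hypothetical names; the actual patch may differ):

#include <sys/mman.h>
#include <stddef.h>

/* Hypothetical helper: get zeroed pages straight from the kernel so that
 * freeing them really returns the memory instead of growing the heap. */
static void *anon_mmap_alloc(size_t size)
{
    void *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return ptr == MAP_FAILED ? NULL : ptr;
}

static void anon_mmap_free(void *ptr, size_t size)
{
    if (ptr) {
        munmap(ptr, size);
    }
}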

There are still some big malloced allocations left which cannot be easily
changed (e.g. the pixman buffers in VNC). So it might be an idea to set a
lower mmap threshold for malloc since this threshold seems to be in the
order of several megabytes on modern systems.

Peter Lieven (15):
  coroutine-ucontext: mmap stack memory
  coroutine-ucontext: add a switch to monitor maximum stack size
  coroutine-ucontext: reduce stack size to 64kB
  coroutine: add a knob to disable the shared release pool
  util: add a helper to mmap private anonymous memory
  exec: use mmap for subpages
  qapi: use mmap for QmpInputVisitor
  virtio: use mmap for VirtQueue
  loader: use mmap for ROMs
  vmware_svga: use mmap for scratch pad
  qom: use mmap for bigger Objects
  util: add a function to realloc mmapped memory
  exec: use mmap for PhysPageMap->nodes
  vnc-tight: make the encoding palette static
  vnc: use mmap for VncState

 configure | 33 ++--
 exec.c| 11 ---
 hw/core/loader.c  | 16 +-
 hw/display/vmware_vga.c   |  3 +-
 hw/virtio/virtio.c|  5 +--
 include/qemu/mmap-alloc.h |  7 +
 include/qom/object.h  |  1 +
 qapi/qmp-input-visitor.c  |  5 +--
 qom/object.c  | 20 ++--
 ui/vnc-enc-tight.c| 21 ++---
 ui/vnc.c  |  5 +--
 ui/vnc.h  |  1 +
 util/coroutine-ucontext.c | 66 +--
 util/mmap-alloc.c | 27 
 util/qemu-coroutine.c | 79 ++-
 15 files changed, 225 insertions(+), 75 deletions(-)

-- 
1.9.1