Re: [Qemu-devel] [PULL 42/57] Page request: Consume pages off the post-copy queue

2015-11-16 Thread Andreas Färber
Am 12.11.2015 um 14:36 schrieb Markus Armbruster:
> Peter Maydell  writes:
> 
>> [...] it's usually painful to get a
>> backtrace out of this kind of qtest, because it's clearly starting
>> a whole pile of QEMUs and there's no way I know of to say "only
>> run a few of these tests, not the whole huge pile".
> 
> Not sure this helps, but here goes anyway: run the make with V=1, fish
> out the gtester line, add -p /name/of/the/test.  Here's an example from
> my shell history:
> 
> $ MALLOC_PERTURB_=${MALLOC_PERTURB_:-$((RANDOM % 255 + 1))} gtester -k 
> --verbose -m=quick -p /dicts/large_dict tests/check-qjson

You can just run the test from the command line. For QTests you also
need to pass a variable telling it which QEMU binary to use and since
recently also which qemu-img or so.

If someone has an idea how we could implement make
check-qtest-- or check-unit- that would be
appreciated. check-qtest- would seem to collide with our
wildcard handling of targets, but might be handy, too.

Regards,
Andreas

-- 
SUSE Linux GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
GF: Felix Imendörffer, Jane Smithard, Graham Norton; HRB 21284 (AG Nürnberg)



Re: [Qemu-devel] [PULL 42/57] Page request: Consume pages off the post-copy queue

2015-11-12 Thread Peter Maydell
On 10 November 2015 at 14:25, Juan Quintela  wrote:
> From: "Dr. David Alan Gilbert" 
>
> When transmitting RAM pages, consume pages that have been queued by
> MIG_RPCOMM_REQPAGE commands and send them ahead of normal page scanning.
>
> Note:
>   a) After a queued page the linear walk carries on from after the
> unqueued page; there is a reasonable chance that the destination
> was about to ask for other closeby pages anyway.
>
>   b) We have to be careful of any assumptions that the page walking
> code makes, in particular it does some short cuts on its first linear
> walk that break as soon as we do a queued page.
>
>   c) We have to be careful to not break up host-page size chunks, since
> this makes it harder to place the pages on the destination.
>
> Signed-off-by: Dr. David Alan Gilbert 
> Reviewed-by: Juan Quintela 
> Signed-off-by: Juan Quintela 

I've just discovered that this is causing 'make check' failures on
my OSX host (unfortunately something in my setup is causing
'make check' failures to not always cause a build failure, so I
didn't notice earlier):

manooth$ (make -C build/x86 -j8 && cd build/x86 &&
QTEST_QEMU_BINARY=x86_64-softmmu/qemu-system-x86_64
QTEST_QEMU_IMG=qemu-img MALLOC_PERTURB_=${MALLOC_PERTURB_:-$((RANDOM %
255 + 1))}  tests/ahci-test)
[...]

/x86_64/ahci/flush/simple: OK
/x86_64/ahci/flush/retry: OK
/x86_64/ahci/flush/migrate: qemu: qemu_mutex_lock: Invalid argument
qemu-system-x86_64:Broken pipe
 Not a migration stream
qemu-system-x86_64: load of migration failed: Invalid argument

thanks
-- PMM



Re: [Qemu-devel] [PULL 42/57] Page request: Consume pages off the post-copy queue

2015-11-12 Thread Peter Maydell
On 12 November 2015 at 12:04, Dr. David Alan Gilbert
 wrote:
> * Peter Maydell (peter.mayd...@linaro.org) wrote:
>> On 10 November 2015 at 14:25, Juan Quintela  wrote:
>> > From: "Dr. David Alan Gilbert" 
>> >
>> > When transmitting RAM pages, consume pages that have been queued by
>> > MIG_RPCOMM_REQPAGE commands and send them ahead of normal page scanning.
>> >
>> > Note:
>> >   a) After a queued page the linear walk carries on from after the
>> > unqueued page; there is a reasonable chance that the destination
>> > was about to ask for other closeby pages anyway.
>> >
>> >   b) We have to be careful of any assumptions that the page walking
>> > code makes, in particular it does some short cuts on its first linear
>> > walk that break as soon as we do a queued page.
>> >
>> >   c) We have to be careful to not break up host-page size chunks, since
>> > this makes it harder to place the pages on the destination.
>> >
>> > Signed-off-by: Dr. David Alan Gilbert 
>> > Reviewed-by: Juan Quintela 
>> > Signed-off-by: Juan Quintela 
>>
>> I've just discovered that this is causing 'make check' failures on
>> my OSX host (unfortunately something in my setup is causing
>> 'make check' failures to not always cause a build failure, so I
>> didn't notice earlier):
>
> It's only failing on OSX? Every time or only sometimes?

Only OSX, and always. I think OSX is pickier about mutexes really
needing to be initialized before use.

> If you can find a way to get a backtrace off that qemu_mutex_lock case
> that would be great; I'd assume the later errors are the fall out from that.

I'll have a look after lunch, but it's usually painful to get a
backtrace out of this kind of qtest, because it's clearly starting
a whole pile of QEMUs and there's no way I know of to say "only
run a few of these tests, not the whole huge pile".

thanks
-- PMM



Re: [Qemu-devel] [PULL 42/57] Page request: Consume pages off the post-copy queue

2015-11-12 Thread Dr. David Alan Gilbert
* Peter Maydell (peter.mayd...@linaro.org) wrote:
> On 12 November 2015 at 12:04, Dr. David Alan Gilbert
>  wrote:
> > * Peter Maydell (peter.mayd...@linaro.org) wrote:
> >> On 10 November 2015 at 14:25, Juan Quintela  wrote:
> >> > From: "Dr. David Alan Gilbert" 
> >> >
> >> > When transmitting RAM pages, consume pages that have been queued by
> >> > MIG_RPCOMM_REQPAGE commands and send them ahead of normal page scanning.
> >> >
> >> > Note:
> >> >   a) After a queued page the linear walk carries on from after the
> >> > unqueued page; there is a reasonable chance that the destination
> >> > was about to ask for other closeby pages anyway.
> >> >
> >> >   b) We have to be careful of any assumptions that the page walking
> >> > code makes, in particular it does some short cuts on its first linear
> >> > walk that break as soon as we do a queued page.
> >> >
> >> >   c) We have to be careful to not break up host-page size chunks, since
> >> > this makes it harder to place the pages on the destination.
> >> >
> >> > Signed-off-by: Dr. David Alan Gilbert 
> >> > Reviewed-by: Juan Quintela 
> >> > Signed-off-by: Juan Quintela 
> >>
> >> I've just discovered that this is causing 'make check' failures on
> >> my OSX host (unfortunately something in my setup is causing
> >> 'make check' failures to not always cause a build failure, so I
> >> didn't notice earlier):
> >
> > It's only failing on OSX? Every time or only sometimes?
> 
> Only OSX, and always. I think OSX is pickier about mutexes really
> needing to be initialized before use.

OK, at least an 'always' should be easier to debug.

> > If you can find a way to get a backtrace off that qemu_mutex_lock case
> > that would be great; I'd assume the later errors are the fall out from that.
> 
> I'll have a look after lunch, but it's usually painful to get a
> backtrace out of this kind of qtest, because it's clearly starting
> a whole pile of QEMUs and there's no way I know of to say "only
> run a few of these tests, not the whole huge pile".

You could add an abort/assert into util/qemu-thread-posix.c qemu_mutex_lock
in the error path.

Could you also add:

diff --git a/migration/migration.c b/migration/migration.c
index 9bd2ce7..85e5766 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -93,6 +93,7 @@ MigrationState *migrate_get_current(void)
 };
 
 if (!once) {
+fprintf(stderr,"migrate_get_current do init of current_migration 
%d\n", getpid());
 qemu_mutex_init(_migration.src_page_req_mutex);
 once = true;
 }
diff --git a/migration/ram.c b/migration/ram.c
index 4266687..72b46f2 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1036,6 +1036,7 @@ static RAMBlock *unqueue_page(MigrationState *ms, 
ram_addr_t *offset,
 {
 RAMBlock *block = NULL;
 
+fprintf(stderr,"unqueue_page %d\n", getpid());
 qemu_mutex_lock(>src_page_req_mutex);
 if (!QSIMPLEQ_EMPTY(>src_page_requests)) {
 struct MigrationSrcPageRequest *entry =


and make sure that the init happens before the first unqueue (you'll get
loads of calls to unqueue).

Dave

> 
> thanks
> -- PMM
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [PULL 42/57] Page request: Consume pages off the post-copy queue

2015-11-12 Thread Peter Maydell
On 12 November 2015 at 12:23, Dr. David Alan Gilbert
 wrote:
> * Peter Maydell (peter.mayd...@linaro.org) wrote:
>> On 12 November 2015 at 12:04, Dr. David Alan Gilbert
>>  wrote:
>> > * Peter Maydell (peter.mayd...@linaro.org) wrote:
>> >> On 10 November 2015 at 14:25, Juan Quintela  wrote:
>> >> > From: "Dr. David Alan Gilbert" 
>> >> >
>> >> > When transmitting RAM pages, consume pages that have been queued by
>> >> > MIG_RPCOMM_REQPAGE commands and send them ahead of normal page scanning.
>> >> >
>> >> > Note:
>> >> >   a) After a queued page the linear walk carries on from after the
>> >> > unqueued page; there is a reasonable chance that the destination
>> >> > was about to ask for other closeby pages anyway.
>> >> >
>> >> >   b) We have to be careful of any assumptions that the page walking
>> >> > code makes, in particular it does some short cuts on its first linear
>> >> > walk that break as soon as we do a queued page.
>> >> >
>> >> >   c) We have to be careful to not break up host-page size chunks, since
>> >> > this makes it harder to place the pages on the destination.
>> >> >
>> >> > Signed-off-by: Dr. David Alan Gilbert 
>> >> > Reviewed-by: Juan Quintela 
>> >> > Signed-off-by: Juan Quintela 
>> >>
>> >> I've just discovered that this is causing 'make check' failures on
>> >> my OSX host (unfortunately something in my setup is causing
>> >> 'make check' failures to not always cause a build failure, so I
>> >> didn't notice earlier):
>> >
>> > It's only failing on OSX? Every time or only sometimes?
>>
>> Only OSX, and always. I think OSX is pickier about mutexes really
>> needing to be initialized before use.
>
> OK, at least an 'always' should be easier to debug.
>
>> > If you can find a way to get a backtrace off that qemu_mutex_lock case
>> > that would be great; I'd assume the later errors are the fall out from 
>> > that.
>>
>> I'll have a look after lunch, but it's usually painful to get a
>> backtrace out of this kind of qtest, because it's clearly starting
>> a whole pile of QEMUs and there's no way I know of to say "only
>> run a few of these tests, not the whole huge pile".
>
> You could add an abort/assert into util/qemu-thread-posix.c qemu_mutex_lock
> in the error path.

abort/assert doesn't print a backtrace.

I added some OSX backtrace-gathering/printing functions to the errorpath,
and got this:

0   qemu-system-x86_64  0x00010c66d203 qemu_mutex_lock + 83
1   qemu-system-x86_64  0x00010c2ac7af unqueue_page + 47
2   qemu-system-x86_64  0x00010c2ac386 get_queued_page + 54
3   qemu-system-x86_64  0x00010c2ac135
ram_find_and_save_block + 165
4   qemu-system-x86_64  0x00010c2ab5a2
ram_save_iterate + 130
5   qemu-system-x86_64  0x00010c2afa2e
qemu_savevm_state_iterate + 302
6   qemu-system-x86_64  0x00010c53acbb
migration_thread + 571
7   libsystem_pthread.dylib 0x7fff9146c05a _pthread_body + 131
8   libsystem_pthread.dylib 0x7fff9146bfd7 _pthread_body + 0
9   libsystem_pthread.dylib 0x7fff914693ed thread_start + 13


>
> Could you also add:
>
> diff --git a/migration/migration.c b/migration/migration.c
> index 9bd2ce7..85e5766 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -93,6 +93,7 @@ MigrationState *migrate_get_current(void)
>  };
>
>  if (!once) {
> +fprintf(stderr,"migrate_get_current do init of current_migration 
> %d\n", getpid());
>  qemu_mutex_init(_migration.src_page_req_mutex);
>  once = true;
>  }
> diff --git a/migration/ram.c b/migration/ram.c
> index 4266687..72b46f2 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1036,6 +1036,7 @@ static RAMBlock *unqueue_page(MigrationState *ms, 
> ram_addr_t *offset,
>  {
>  RAMBlock *block = NULL;
>
> +fprintf(stderr,"unqueue_page %d\n", getpid());
>  qemu_mutex_lock(>src_page_req_mutex);
>  if (!QSIMPLEQ_EMPTY(>src_page_requests)) {
>  struct MigrationSrcPageRequest *entry =
>
>
> and make sure that the init happens before the first unqueue (you'll get
> loads of calls to unqueue).

With that change, plus the backtracing:

/x86_64/ahci/flush/retry: OK
/x86_64/ahci/flush/migrate: migrate_get_current do init of
current_migration 60427
migrate_get_current do init of current_migration 60428
unqueue_page 60427
0   qemu-system-x86_64  0x000101a751c3 qemu_mutex_lock + 83
1   qemu-system-x86_64  0x0001016b4749 unqueue_page + 89
2   qemu-system-x86_64  0x0001016b42f6 get_queued_page + 54
3   qemu-system-x86_64  0x0001016b40a5
ram_find_and_save_block + 165
4   qemu-system-x86_64  0x0001016b3512
ram_save_iterate + 

Re: [Qemu-devel] [PULL 42/57] Page request: Consume pages off the post-copy queue

2015-11-12 Thread Dr. David Alan Gilbert
* Peter Maydell (peter.mayd...@linaro.org) wrote:
> On 10 November 2015 at 14:25, Juan Quintela  wrote:
> > From: "Dr. David Alan Gilbert" 
> >
> > When transmitting RAM pages, consume pages that have been queued by
> > MIG_RPCOMM_REQPAGE commands and send them ahead of normal page scanning.
> >
> > Note:
> >   a) After a queued page the linear walk carries on from after the
> > unqueued page; there is a reasonable chance that the destination
> > was about to ask for other closeby pages anyway.
> >
> >   b) We have to be careful of any assumptions that the page walking
> > code makes, in particular it does some short cuts on its first linear
> > walk that break as soon as we do a queued page.
> >
> >   c) We have to be careful to not break up host-page size chunks, since
> > this makes it harder to place the pages on the destination.
> >
> > Signed-off-by: Dr. David Alan Gilbert 
> > Reviewed-by: Juan Quintela 
> > Signed-off-by: Juan Quintela 
> 
> I've just discovered that this is causing 'make check' failures on
> my OSX host (unfortunately something in my setup is causing
> 'make check' failures to not always cause a build failure, so I
> didn't notice earlier):

It's only failing on OSX? Every time or only sometimes?

> manooth$ (make -C build/x86 -j8 && cd build/x86 &&
> QTEST_QEMU_BINARY=x86_64-softmmu/qemu-system-x86_64
> QTEST_QEMU_IMG=qemu-img MALLOC_PERTURB_=${MALLOC_PERTURB_:-$((RANDOM %
> 255 + 1))}  tests/ahci-test)
> [...]
> 
> /x86_64/ahci/flush/simple: OK
> /x86_64/ahci/flush/retry: OK
> /x86_64/ahci/flush/migrate: qemu: qemu_mutex_lock: Invalid argument
> qemu-system-x86_64:Broken pipe
>  Not a migration stream
> qemu-system-x86_64: load of migration failed: Invalid argument

OK, let me see if I can find - I do add a mutex in there (src_page_req_mutex)
that I guess is the most likely culprit; it's initialised by 
migrate_get_current (once
using a bool) so it should be OK.

If you can find a way to get a backtrace off that qemu_mutex_lock case
that would be great; I'd assume the later errors are the fall out from that.

Dave

> 
> thanks
> -- PMM
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [PULL 42/57] Page request: Consume pages off the post-copy queue

2015-11-12 Thread Dr. David Alan Gilbert
* Peter Maydell (peter.mayd...@linaro.org) wrote:
> On 12 November 2015 at 13:18, Peter Maydell  wrote:
> > On 12 November 2015 at 13:08, Dr. David Alan Gilbert
> >  wrote:
> >> OK, can you try a simple migration by hand outside of the test harness;
> >> just something simple like:
> >>
> >> ./bin/qemu-system-x86_64 -M pc -nographic
> >> (qemu) migrate "exec: cat > /dev/null"
> >>
> >> and the same with q35 ?
> >
> > (qemu) migrate "exec: cat > /dev/null"
> > migrate_get_current do init of current_migration 65307
> > unqueue_page 65307
> > 0   qemu-system-x86_64  0x0001067c01c3 qemu_mutex_lock 
> > + 83
> 
> This turns out to be because migrate_init() is corrupting the
> mutex memory when it does "memset(s, 0, sizeof(*s))". Presumably
> Linux's initialized-mutex is all-zeroes, but OSX's is not.

OK, thanks for finding that; I've just smoke tested the following
patch and will post it properly after I test it more thoroughly in
a couple of hours.

Dave


commit 689d4964442c3ee34a2dac77411a30b96c214288
Author: Dr. David Alan Gilbert 
Date:   Thu Nov 12 14:10:33 2015 +

migration_init: Fix lock initialisation/make it explicit

Peter reported a lock error on MacOS after my a82d593b
patch.

migrate_get_current does one-time initialisation of
a bunch of variables.
migrate_init does reinitialisation even on a 2nd
migrate after a cancel.

The problem here was that I'd initialised the mutex
in migrate_get_current, and the memset in migrate_init
corrupted it.

Remove the memset and replace it by explicit initialisation
of fields that need initialising; this also turns out to be simpler
than the old code that had to preserve some fields.

Reported-by: Peter Maydell 
Signed-off-by: Dr. David Alan Gilbert 
Fixes: a82d593b

diff --git a/migration/migration.c b/migration/migration.c
index 9bd2ce7..7e4e27b 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -902,38 +902,31 @@ bool migration_in_postcopy(MigrationState *s)
 MigrationState *migrate_init(const MigrationParams *params)
 {
 MigrationState *s = migrate_get_current();
-int64_t bandwidth_limit = s->bandwidth_limit;
-bool enabled_capabilities[MIGRATION_CAPABILITY_MAX];
-int64_t xbzrle_cache_size = s->xbzrle_cache_size;
-int compress_level = s->parameters[MIGRATION_PARAMETER_COMPRESS_LEVEL];
-int compress_thread_count =
-s->parameters[MIGRATION_PARAMETER_COMPRESS_THREADS];
-int decompress_thread_count =
-s->parameters[MIGRATION_PARAMETER_DECOMPRESS_THREADS];
-int x_cpu_throttle_initial =
-s->parameters[MIGRATION_PARAMETER_X_CPU_THROTTLE_INITIAL];
-int x_cpu_throttle_increment =
-s->parameters[MIGRATION_PARAMETER_X_CPU_THROTTLE_INCREMENT];
-
-memcpy(enabled_capabilities, s->enabled_capabilities,
-   sizeof(enabled_capabilities));
 
-memset(s, 0, sizeof(*s));
+/*
+ * Reinitialise all migration state, except
+ * parameters/capabilities that the user set, and
+ * locks.
+ */
+s->bytes_xfer = 0;
+s->xfer_limit = 0;
+s->cleanup_bh = 0;
+s->file = NULL;
+s->state = MIGRATION_STATUS_NONE;
 s->params = *params;
-memcpy(s->enabled_capabilities, enabled_capabilities,
-   sizeof(enabled_capabilities));
-s->xbzrle_cache_size = xbzrle_cache_size;
-
-s->parameters[MIGRATION_PARAMETER_COMPRESS_LEVEL] = compress_level;
-s->parameters[MIGRATION_PARAMETER_COMPRESS_THREADS] =
-   compress_thread_count;
-s->parameters[MIGRATION_PARAMETER_DECOMPRESS_THREADS] =
-   decompress_thread_count;
-s->parameters[MIGRATION_PARAMETER_X_CPU_THROTTLE_INITIAL] =
-x_cpu_throttle_initial;
-s->parameters[MIGRATION_PARAMETER_X_CPU_THROTTLE_INCREMENT] =
-x_cpu_throttle_increment;
-s->bandwidth_limit = bandwidth_limit;
+s->rp_state.from_dst_file = NULL;
+s->rp_state.error = false;
+s->mbps = 0.0;
+s->downtime = 0;
+s->expected_downtime = 0;
+s->dirty_pages_rate = 0;
+s->dirty_bytes_rate = 0;
+s->setup_time = 0;
+s->dirty_sync_count = 0;
+s->start_postcopy = false;
+s->migration_thread_running = false;
+s->last_req_rb = NULL;
+
 migrate_set_state(s, MIGRATION_STATUS_NONE, MIGRATION_STATUS_SETUP);
 
 QSIMPLEQ_INIT(>src_page_requests);
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [PULL 42/57] Page request: Consume pages off the post-copy queue

2015-11-12 Thread Dr. David Alan Gilbert
* Peter Maydell (peter.mayd...@linaro.org) wrote:

> abort/assert doesn't print a backtrace.
> 
> I added some OSX backtrace-gathering/printing functions to the errorpath,
> and got this:
> 
> 0   qemu-system-x86_64  0x00010c66d203 qemu_mutex_lock + 
> 83
> 1   qemu-system-x86_64  0x00010c2ac7af unqueue_page + 47

OK, it is the mutex that I added in that patch.

> 2   qemu-system-x86_64  0x00010c2ac386 get_queued_page + 
> 54
> 3   qemu-system-x86_64  0x00010c2ac135
> ram_find_and_save_block + 165
> 4   qemu-system-x86_64  0x00010c2ab5a2
> ram_save_iterate + 130
> 5   qemu-system-x86_64  0x00010c2afa2e
> qemu_savevm_state_iterate + 302
> 6   qemu-system-x86_64  0x00010c53acbb
> migration_thread + 571
> 7   libsystem_pthread.dylib 0x7fff9146c05a _pthread_body + 131
> 8   libsystem_pthread.dylib 0x7fff9146bfd7 _pthread_body + 0
> 9   libsystem_pthread.dylib 0x7fff914693ed thread_start + 13
> 
> 
> >
> > Could you also add:
> >
> > diff --git a/migration/migration.c b/migration/migration.c
> > index 9bd2ce7..85e5766 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -93,6 +93,7 @@ MigrationState *migrate_get_current(void)
> >  };
> >
> >  if (!once) {
> > +fprintf(stderr,"migrate_get_current do init of current_migration 
> > %d\n", getpid());
> >  qemu_mutex_init(_migration.src_page_req_mutex);
> >  once = true;
> >  }
> > diff --git a/migration/ram.c b/migration/ram.c
> > index 4266687..72b46f2 100644
> > --- a/migration/ram.c
> > +++ b/migration/ram.c
> > @@ -1036,6 +1036,7 @@ static RAMBlock *unqueue_page(MigrationState *ms, 
> > ram_addr_t *offset,
> >  {
> >  RAMBlock *block = NULL;
> >
> > +fprintf(stderr,"unqueue_page %d\n", getpid());
> >  qemu_mutex_lock(>src_page_req_mutex);
> >  if (!QSIMPLEQ_EMPTY(>src_page_requests)) {
> >  struct MigrationSrcPageRequest *entry =
> >
> >
> > and make sure that the init happens before the first unqueue (you'll get
> > loads of calls to unqueue).
> 
> With that change, plus the backtracing:
> 
> /x86_64/ahci/flush/retry: OK
> /x86_64/ahci/flush/migrate: migrate_get_current do init of
> current_migration 60427
> migrate_get_current do init of current_migration 60428
> unqueue_page 60427
> 0   qemu-system-x86_64  0x000101a751c3 qemu_mutex_lock + 
> 83
> 1   qemu-system-x86_64  0x0001016b4749 unqueue_page + 89

OK, so the lock I added is apparently being initialised before it's being 
locked,
so that's good.

OK, can you try a simple migration by hand outside of the test harness;
just something simple like:

./bin/qemu-system-x86_64 -M pc -nographic
(qemu) migrate "exec: cat > /dev/null"

and the same with q35 ?

Dave

> 2   qemu-system-x86_64  0x0001016b42f6 get_queued_page + 
> 54
> 3   qemu-system-x86_64  0x0001016b40a5
> ram_find_and_save_block + 165
> 4   qemu-system-x86_64  0x0001016b3512
> ram_save_iterate + 130
> 5   qemu-system-x86_64  0x0001016b79be
> qemu_savevm_state_iterate + 302
> 6   qemu-system-x86_64  0x000101942c7b
> migration_thread + 571
> 7   libsystem_pthread.dylib 0x7fff9146c05a _pthread_body + 131
> 8   libsystem_pthread.dylib 0x7fff9146bfd7 _pthread_body + 0
> 9   libsystem_pthread.dylib 0x7fff914693ed thread_start + 13
> qemu: qemu_mutex_lock: Invalid argument
> qemu-system-x86_64:Broken pipe
>  Not a migration stream
> qemu-system-x86_64: load of migration failed: Invalid argument
> 
> thanks
> -- PMM
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [PULL 42/57] Page request: Consume pages off the post-copy queue

2015-11-12 Thread Markus Armbruster
Peter Maydell  writes:

> On 12 November 2015 at 12:04, Dr. David Alan Gilbert
>  wrote:
>> * Peter Maydell (peter.mayd...@linaro.org) wrote:
>>> On 10 November 2015 at 14:25, Juan Quintela  wrote:
>>> > From: "Dr. David Alan Gilbert" 
>>> >
>>> > When transmitting RAM pages, consume pages that have been queued by
>>> > MIG_RPCOMM_REQPAGE commands and send them ahead of normal page scanning.
>>> >
>>> > Note:
>>> >   a) After a queued page the linear walk carries on from after the
>>> > unqueued page; there is a reasonable chance that the destination
>>> > was about to ask for other closeby pages anyway.
>>> >
>>> >   b) We have to be careful of any assumptions that the page walking
>>> > code makes, in particular it does some short cuts on its first linear
>>> > walk that break as soon as we do a queued page.
>>> >
>>> >   c) We have to be careful to not break up host-page size chunks, since
>>> > this makes it harder to place the pages on the destination.
>>> >
>>> > Signed-off-by: Dr. David Alan Gilbert 
>>> > Reviewed-by: Juan Quintela 
>>> > Signed-off-by: Juan Quintela 
>>>
>>> I've just discovered that this is causing 'make check' failures on
>>> my OSX host (unfortunately something in my setup is causing
>>> 'make check' failures to not always cause a build failure, so I
>>> didn't notice earlier):
>>
>> It's only failing on OSX? Every time or only sometimes?
>
> Only OSX, and always. I think OSX is pickier about mutexes really
> needing to be initialized before use.
>
>> If you can find a way to get a backtrace off that qemu_mutex_lock case
>> that would be great; I'd assume the later errors are the fall out from that.
>
> I'll have a look after lunch, but it's usually painful to get a
> backtrace out of this kind of qtest, because it's clearly starting
> a whole pile of QEMUs and there's no way I know of to say "only
> run a few of these tests, not the whole huge pile".

Not sure this helps, but here goes anyway: run the make with V=1, fish
out the gtester line, add -p /name/of/the/test.  Here's an example from
my shell history:

$ MALLOC_PERTURB_=${MALLOC_PERTURB_:-$((RANDOM % 255 + 1))} gtester -k 
--verbose -m=quick -p /dicts/large_dict tests/check-qjson



Re: [Qemu-devel] [PULL 42/57] Page request: Consume pages off the post-copy queue

2015-11-12 Thread Peter Maydell
On 12 November 2015 at 13:08, Dr. David Alan Gilbert
 wrote:
> OK, can you try a simple migration by hand outside of the test harness;
> just something simple like:
>
> ./bin/qemu-system-x86_64 -M pc -nographic
> (qemu) migrate "exec: cat > /dev/null"
>
> and the same with q35 ?

(qemu) migrate "exec: cat > /dev/null"
migrate_get_current do init of current_migration 65307
unqueue_page 65307
0   qemu-system-x86_64  0x0001067c01c3 qemu_mutex_lock + 83
1   qemu-system-x86_64  0x0001063ff749 unqueue_page + 89
2   qemu-system-x86_64  0x0001063ff2f6 get_queued_page + 54
3   qemu-system-x86_64  0x0001063ff0a5
ram_find_and_save_block + 165
4   qemu-system-x86_64  0x0001063fe512
ram_save_iterate + 130
5   qemu-system-x86_64  0x0001064029be
qemu_savevm_state_iterate + 302
6   qemu-system-x86_64  0x00010668dc7b
migration_thread + 571
7   libsystem_pthread.dylib 0x7fff9146c05a _pthread_body + 131
8   libsystem_pthread.dylib 0x7fff9146bfd7 _pthread_body + 0
9   libsystem_pthread.dylib 0x7fff914693ed thread_start + 13

(identical behaviour for q35 and pc).

thanks
-- PMM



Re: [Qemu-devel] [PULL 42/57] Page request: Consume pages off the post-copy queue

2015-11-12 Thread Peter Maydell
On 12 November 2015 at 13:18, Peter Maydell  wrote:
> On 12 November 2015 at 13:08, Dr. David Alan Gilbert
>  wrote:
>> OK, can you try a simple migration by hand outside of the test harness;
>> just something simple like:
>>
>> ./bin/qemu-system-x86_64 -M pc -nographic
>> (qemu) migrate "exec: cat > /dev/null"
>>
>> and the same with q35 ?
>
> (qemu) migrate "exec: cat > /dev/null"
> migrate_get_current do init of current_migration 65307
> unqueue_page 65307
> 0   qemu-system-x86_64  0x0001067c01c3 qemu_mutex_lock + 
> 83

This turns out to be because migrate_init() is corrupting the
mutex memory when it does "memset(s, 0, sizeof(*s))". Presumably
Linux's initialized-mutex is all-zeroes, but OSX's is not.

thanks
-- PMM



Re: [Qemu-devel] [PULL 42/57] Page request: Consume pages off the post-copy queue

2015-11-12 Thread Juan Quintela
Peter Maydell  wrote:
> On 12 November 2015 at 13:18, Peter Maydell  wrote:
>> On 12 November 2015 at 13:08, Dr. David Alan Gilbert
>>  wrote:
>>> OK, can you try a simple migration by hand outside of the test harness;
>>> just something simple like:
>>>
>>> ./bin/qemu-system-x86_64 -M pc -nographic
>>> (qemu) migrate "exec: cat > /dev/null"
>>>
>>> and the same with q35 ?
>>
>> (qemu) migrate "exec: cat > /dev/null"
>> migrate_get_current do init of current_migration 65307
>> unqueue_page 65307
>> 0   qemu-system-x86_64  0x0001067c01c3 qemu_mutex_lock + 
>> 83
>
> This turns out to be because migrate_init() is corrupting the
> mutex memory when it does "memset(s, 0, sizeof(*s))". Presumably
> Linux's initialized-mutex is all-zeroes, but OSX's is not.

grrr, that is probably the case.  Thanks for pinponting this.


>
> thanks
> -- PMM



Re: [Qemu-devel] [PULL 42/57] Page request: Consume pages off the post-copy queue

2015-11-12 Thread Dr. David Alan Gilbert
* Juan Quintela (quint...@redhat.com) wrote:
> "Dr. David Alan Gilbert"  wrote:
> > * Peter Maydell (peter.mayd...@linaro.org) wrote:
> >> On 12 November 2015 at 13:18, Peter Maydell  
> >> wrote:
> >> > On 12 November 2015 at 13:08, Dr. David Alan Gilbert
> >> >  wrote:
> >> >> OK, can you try a simple migration by hand outside of the test harness;
> >> >> just something simple like:
> >> >>
> >> >> ./bin/qemu-system-x86_64 -M pc -nographic
> >> >> (qemu) migrate "exec: cat > /dev/null"
> >> >>
> >> >> and the same with q35 ?
> >> >
> >> > (qemu) migrate "exec: cat > /dev/null"
> >> > migrate_get_current do init of current_migration 65307
> >> > unqueue_page 65307
> >> > 0   qemu-system-x86_64  0x0001067c01c3 
> >> > qemu_mutex_lock + 83
> >> 
> >> This turns out to be because migrate_init() is corrupting the
> >> mutex memory when it does "memset(s, 0, sizeof(*s))". Presumably
> >> Linux's initialized-mutex is all-zeroes, but OSX's is not.
> >
> > OK, thanks for finding that; I've just smoke tested the following
> > patch and will post it properly after I test it more thoroughly in
> > a couple of hours.
> 
> I did a patch that was almost identical. It is passing for me virt-test.

and the one I posted seems to survive postcopy as well; so looks good.

Dave

> 
> Later, Juan.
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [PULL 42/57] Page request: Consume pages off the post-copy queue

2015-11-12 Thread Juan Quintela
"Dr. David Alan Gilbert"  wrote:
> * Peter Maydell (peter.mayd...@linaro.org) wrote:
>> On 12 November 2015 at 13:18, Peter Maydell  wrote:
>> > On 12 November 2015 at 13:08, Dr. David Alan Gilbert
>> >  wrote:
>> >> OK, can you try a simple migration by hand outside of the test harness;
>> >> just something simple like:
>> >>
>> >> ./bin/qemu-system-x86_64 -M pc -nographic
>> >> (qemu) migrate "exec: cat > /dev/null"
>> >>
>> >> and the same with q35 ?
>> >
>> > (qemu) migrate "exec: cat > /dev/null"
>> > migrate_get_current do init of current_migration 65307
>> > unqueue_page 65307
>> > 0   qemu-system-x86_64  0x0001067c01c3 qemu_mutex_lock 
>> > + 83
>> 
>> This turns out to be because migrate_init() is corrupting the
>> mutex memory when it does "memset(s, 0, sizeof(*s))". Presumably
>> Linux's initialized-mutex is all-zeroes, but OSX's is not.
>
> OK, thanks for finding that; I've just smoke tested the following
> patch and will post it properly after I test it more thoroughly in
> a couple of hours.

I did a patch that was almost identical. It is passing for me virt-test.

Later, Juan.



[Qemu-devel] [PULL 42/57] Page request: Consume pages off the post-copy queue

2015-11-10 Thread Juan Quintela
From: "Dr. David Alan Gilbert" 

When transmitting RAM pages, consume pages that have been queued by
MIG_RPCOMM_REQPAGE commands and send them ahead of normal page scanning.

Note:
  a) After a queued page the linear walk carries on from after the
unqueued page; there is a reasonable chance that the destination
was about to ask for other closeby pages anyway.

  b) We have to be careful of any assumptions that the page walking
code makes, in particular it does some short cuts on its first linear
walk that break as soon as we do a queued page.

  c) We have to be careful to not break up host-page size chunks, since
this makes it harder to place the pages on the destination.

Signed-off-by: Dr. David Alan Gilbert 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 migration/ram.c | 249 +---
 trace-events|   2 +
 2 files changed, 220 insertions(+), 31 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index 8302d09..d09d5ab 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -548,9 +548,9 @@ static int save_xbzrle_page(QEMUFile *f, uint8_t 
**current_data,
  * Returns: byte offset within memory region of the start of a dirty page
  */
 static inline
-ram_addr_t migration_bitmap_find_and_reset_dirty(RAMBlock *rb,
- ram_addr_t start,
- ram_addr_t *ram_addr_abs)
+ram_addr_t migration_bitmap_find_dirty(RAMBlock *rb,
+   ram_addr_t start,
+   ram_addr_t *ram_addr_abs)
 {
 unsigned long base = rb->offset >> TARGET_PAGE_BITS;
 unsigned long nr = base + (start >> TARGET_PAGE_BITS);
@@ -567,15 +567,24 @@ ram_addr_t migration_bitmap_find_and_reset_dirty(RAMBlock 
*rb,
 next = find_next_bit(bitmap, size, nr);
 }

-if (next < size) {
-clear_bit(next, bitmap);
-migration_dirty_pages--;
-}
 *ram_addr_abs = next << TARGET_PAGE_BITS;
 return (next - base) << TARGET_PAGE_BITS;
 }

-/* Called with rcu_read_lock() to protect migration_bitmap */
+static inline bool migration_bitmap_clear_dirty(ram_addr_t addr)
+{
+bool ret;
+int nr = addr >> TARGET_PAGE_BITS;
+unsigned long *bitmap = atomic_rcu_read(_bitmap_rcu)->bmap;
+
+ret = test_and_clear_bit(nr, bitmap);
+
+if (ret) {
+migration_dirty_pages--;
+}
+return ret;
+}
+
 static void migration_bitmap_sync_range(ram_addr_t start, ram_addr_t length)
 {
 unsigned long *bitmap;
@@ -974,9 +983,8 @@ static int ram_save_compressed_page(QEMUFile *f, RAMBlock 
*block,
 static bool find_dirty_block(QEMUFile *f, PageSearchStatus *pss,
  bool *again, ram_addr_t *ram_addr_abs)
 {
-pss->offset = migration_bitmap_find_and_reset_dirty(pss->block,
-   pss->offset,
-   ram_addr_abs);
+pss->offset = migration_bitmap_find_dirty(pss->block, pss->offset,
+  ram_addr_abs);
 if (pss->complete_round && pss->block == last_seen_block &&
 pss->offset >= last_offset) {
 /*
@@ -1015,6 +1023,107 @@ static bool find_dirty_block(QEMUFile *f, 
PageSearchStatus *pss,
 }
 }

+/*
+ * Helper for 'get_queued_page' - gets a page off the queue
+ *  ms:  MigrationState in
+ * *offset:  Used to return the offset within the RAMBlock
+ * ram_addr_abs: global offset in the dirty/sent bitmaps
+ *
+ * Returns:  block (or NULL if none available)
+ */
+static RAMBlock *unqueue_page(MigrationState *ms, ram_addr_t *offset,
+  ram_addr_t *ram_addr_abs)
+{
+RAMBlock *block = NULL;
+
+qemu_mutex_lock(>src_page_req_mutex);
+if (!QSIMPLEQ_EMPTY(>src_page_requests)) {
+struct MigrationSrcPageRequest *entry =
+QSIMPLEQ_FIRST(>src_page_requests);
+block = entry->rb;
+*offset = entry->offset;
+*ram_addr_abs = (entry->offset + entry->rb->offset) &
+TARGET_PAGE_MASK;
+
+if (entry->len > TARGET_PAGE_SIZE) {
+entry->len -= TARGET_PAGE_SIZE;
+entry->offset += TARGET_PAGE_SIZE;
+} else {
+memory_region_unref(block->mr);
+QSIMPLEQ_REMOVE_HEAD(>src_page_requests, next_req);
+g_free(entry);
+}
+}
+qemu_mutex_unlock(>src_page_req_mutex);
+
+return block;
+}
+
+/*
+ * Unqueue a page from the queue fed by postcopy page requests; skips pages
+ * that are already sent (!dirty)
+ *
+ *  ms:  MigrationState in
+ * pss:  PageSearchStatus structure updated with found block/offset
+ * ram_addr_abs: global offset in the dirty/sent bitmaps
+ *
+ * Returns:  true if a queued page is 

[Qemu-devel] [PULL 42/57] Page request: Consume pages off the post-copy queue

2015-11-09 Thread Juan Quintela
From: "Dr. David Alan Gilbert" 

When transmitting RAM pages, consume pages that have been queued by
MIG_RPCOMM_REQPAGE commands and send them ahead of normal page scanning.

Note:
  a) After a queued page the linear walk carries on from after the
unqueued page; there is a reasonable chance that the destination
was about to ask for other closeby pages anyway.

  b) We have to be careful of any assumptions that the page walking
code makes, in particular it does some short cuts on its first linear
walk that break as soon as we do a queued page.

  c) We have to be careful to not break up host-page size chunks, since
this makes it harder to place the pages on the destination.

Signed-off-by: Dr. David Alan Gilbert 
Reviewed-by: Juan Quintela 
Signed-off-by: Juan Quintela 
---
 migration/ram.c | 249 +---
 trace-events|   2 +
 2 files changed, 220 insertions(+), 31 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index 8302d09..d09d5ab 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -548,9 +548,9 @@ static int save_xbzrle_page(QEMUFile *f, uint8_t 
**current_data,
  * Returns: byte offset within memory region of the start of a dirty page
  */
 static inline
-ram_addr_t migration_bitmap_find_and_reset_dirty(RAMBlock *rb,
- ram_addr_t start,
- ram_addr_t *ram_addr_abs)
+ram_addr_t migration_bitmap_find_dirty(RAMBlock *rb,
+   ram_addr_t start,
+   ram_addr_t *ram_addr_abs)
 {
 unsigned long base = rb->offset >> TARGET_PAGE_BITS;
 unsigned long nr = base + (start >> TARGET_PAGE_BITS);
@@ -567,15 +567,24 @@ ram_addr_t migration_bitmap_find_and_reset_dirty(RAMBlock 
*rb,
 next = find_next_bit(bitmap, size, nr);
 }

-if (next < size) {
-clear_bit(next, bitmap);
-migration_dirty_pages--;
-}
 *ram_addr_abs = next << TARGET_PAGE_BITS;
 return (next - base) << TARGET_PAGE_BITS;
 }

-/* Called with rcu_read_lock() to protect migration_bitmap */
+static inline bool migration_bitmap_clear_dirty(ram_addr_t addr)
+{
+bool ret;
+int nr = addr >> TARGET_PAGE_BITS;
+unsigned long *bitmap = atomic_rcu_read(_bitmap_rcu)->bmap;
+
+ret = test_and_clear_bit(nr, bitmap);
+
+if (ret) {
+migration_dirty_pages--;
+}
+return ret;
+}
+
 static void migration_bitmap_sync_range(ram_addr_t start, ram_addr_t length)
 {
 unsigned long *bitmap;
@@ -974,9 +983,8 @@ static int ram_save_compressed_page(QEMUFile *f, RAMBlock 
*block,
 static bool find_dirty_block(QEMUFile *f, PageSearchStatus *pss,
  bool *again, ram_addr_t *ram_addr_abs)
 {
-pss->offset = migration_bitmap_find_and_reset_dirty(pss->block,
-   pss->offset,
-   ram_addr_abs);
+pss->offset = migration_bitmap_find_dirty(pss->block, pss->offset,
+  ram_addr_abs);
 if (pss->complete_round && pss->block == last_seen_block &&
 pss->offset >= last_offset) {
 /*
@@ -1015,6 +1023,107 @@ static bool find_dirty_block(QEMUFile *f, 
PageSearchStatus *pss,
 }
 }

+/*
+ * Helper for 'get_queued_page' - gets a page off the queue
+ *  ms:  MigrationState in
+ * *offset:  Used to return the offset within the RAMBlock
+ * ram_addr_abs: global offset in the dirty/sent bitmaps
+ *
+ * Returns:  block (or NULL if none available)
+ */
+static RAMBlock *unqueue_page(MigrationState *ms, ram_addr_t *offset,
+  ram_addr_t *ram_addr_abs)
+{
+RAMBlock *block = NULL;
+
+qemu_mutex_lock(>src_page_req_mutex);
+if (!QSIMPLEQ_EMPTY(>src_page_requests)) {
+struct MigrationSrcPageRequest *entry =
+QSIMPLEQ_FIRST(>src_page_requests);
+block = entry->rb;
+*offset = entry->offset;
+*ram_addr_abs = (entry->offset + entry->rb->offset) &
+TARGET_PAGE_MASK;
+
+if (entry->len > TARGET_PAGE_SIZE) {
+entry->len -= TARGET_PAGE_SIZE;
+entry->offset += TARGET_PAGE_SIZE;
+} else {
+memory_region_unref(block->mr);
+QSIMPLEQ_REMOVE_HEAD(>src_page_requests, next_req);
+g_free(entry);
+}
+}
+qemu_mutex_unlock(>src_page_req_mutex);
+
+return block;
+}
+
+/*
+ * Unqueue a page from the queue fed by postcopy page requests; skips pages
+ * that are already sent (!dirty)
+ *
+ *  ms:  MigrationState in
+ * pss:  PageSearchStatus structure updated with found block/offset
+ * ram_addr_abs: global offset in the dirty/sent bitmaps
+ *
+ * Returns:  true if a queued page is