Hi,
  While trying to diagnose a bug I'm seeing, git bisect is pointing to
your patch:

  084140bd49: exec: fix access to ram_list.dirty_memory when sync dirty bitmap

It looks like the dirty page count is wrong for some reason.

Alex Bennée spotted a problem where the postcopy test would occasionally
fail under very heavy load. Attaching a debugger, it looks like the
problem is that we have a migration_dirty_pages count stuck at 2. In the
normal migration tests we don't spot this, because 2 pages is smaller
than the threshold to end migration, so an extra 2 pages doesn't block
it from finishing. However, with a very small downtime setting (like we
use in the postcopy test) and with very low bandwidth (as when Alex ran
the test on a very heavily loaded machine), we end up never calling the
bitmap sync again and never completing the iteration.

I'm using the following addition to spot the problem:

diff --git a/migration/ram.c b/migration/ram.c
index e75f1050e4..3ddf884952 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1350,6 +1350,13 @@ static int ram_find_and_save_block(RAMState *rs, bool last_stage)
         }
     } while (!pages && again);
 
+    if (!pages && !again && pss.complete_round && rs->migration_dirty_pages)
+    {
+        /* Should make this fail migration ? */
+        fprintf(stderr, "%s: no page found, yet dirty_pages=%"PRIu64"\n",
+                __func__, rs->migration_dirty_pages);
+    }
+
     rs->last_seen_block = pss.block;
     rs->last_page = pss.page;
 
(which I might add as a test to fail a migration)

That check triggers easily even on an unloaded machine:

tests/postcopy-test
/x86_64/postcopy: ram_find_and_save_block: no page found, yet dirty_pages=2
ram_find_and_save_block: no page found, yet dirty_pages=2
ram_find_and_save_block: no page found, yet dirty_pages=2
OK

I'll try and debug where our extra two pages are coming from.

Dave
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
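
To illustrate why a stale count of 2 only bites in the postcopy test,
here is a minimal sketch of the completion check as described above. It
is a simplified model, not the actual QEMU code: PAGE_SIZE_BYTES,
threshold_bytes() and can_complete() are hypothetical names, and the
4 KiB page size is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed page size for this sketch (4 KiB). */
#define PAGE_SIZE_BYTES 4096u

/*
 * Hypothetical helper: how many bytes can be sent within the allowed
 * downtime at the measured bandwidth.
 */
static uint64_t threshold_bytes(uint64_t bandwidth_bytes_per_ms,
                                uint64_t downtime_ms)
{
    return bandwidth_bytes_per_ms * downtime_ms;
}

/*
 * Hypothetical helper: migration may finish once the data still
 * reported as dirty is smaller than the threshold. A dirty page count
 * that never drops keeps this false if the threshold is small enough.
 */
static bool can_complete(uint64_t dirty_pages, uint64_t threshold)
{
    return dirty_pages * PAGE_SIZE_BYTES < threshold;
}
```

With illustrative numbers (not QEMU defaults): at 33554 bytes/ms and a
300 ms downtime limit the threshold is about 10 MB, so 2 stale pages
(8192 bytes) pass the check. At 1000 bytes/ms and a 1 ms limit the
threshold is 1000 bytes, so the same 2 pages keep the check from ever
passing.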