Dr. David Alan Gilbert <dgilb...@redhat.com> writes: > * Dr. David Alan Gilbert (dgilb...@redhat.com) wrote: >> Hi, >> Git bisect is pointing to your patch 084140bd49: >> exec: fix access to ram_list.dirty_memory when sync dirty bitmap >> >> trying to diagnose a bug I'm seeing; it looks like the dirty page count >> is wrong for some reason. >> >> Alex Bennée spotted a problem where the postcopy test would occasionally >> fail under very heavy load; attaching a debugger and it looks like >> the problem is we have a migration_dirty_page count stuck at 2; >> in the normal migration tests we don't spot this, because 2 pages is >> smaller than the threshold to end migration and so an extra 2 pages >> doesn't block it finishing. However, with a very >> small downtime setting (like we use in the postcopy test) and with >> very low bandwidth (as when Alex ran the test on a very heavily loaded >> machine) we end up never calling the bitmap sync again and never >> completing the iteration. >> >> I'm using the following addition to spot the problem: > > I think I have the fix for this, or at least the reason; the alignment > check also needs fixing up, but the 08414 patch missed it. > > Alex: Please test the fix included at the bottom > >> >> diff --git a/migration/ram.c b/migration/ram.c >> index e75f1050e4..3ddf884952 100644 >> --- a/migration/ram.c >> +++ b/migration/ram.c >> @@ -1350,6 +1350,13 @@ static int ram_find_and_save_block(RAMState *rs, bool >> last_stage) >> } >> } while (!pages && again); >> >> + if (!pages && !again && pss.complete_round && rs->migration_dirty_pages) >> + { >> + /* Should make this fail migration ? */ >> + fprintf(stderr, "%s: no page found, yet dirty_pages=%"PRIu64"\n", >> + __func__, rs->migration_dirty_pages); >> + } >> + >> rs->last_seen_block = pss.block; >> rs->last_page = pss.page; >> >> (which I might add as a test to fail a migration) >> >> That test fails easily even on an unloaded machine: >> tests/postcopy-test >> /x86_64/postcopy: ram_find_and_save_block: no page found, yet dirty_pages=2 >> ram_find_and_save_block: no page found, yet dirty_pages=2 >> ram_find_and_save_block: no page found, yet dirty_pages=2 >> OK >> >> >> I'll try and debug where our extra two pages are coming from. > > From b2c86818eaaaf790246f21ae21fead903c1d64f0 Mon Sep 17 00:00:00 2001 > From: "Dr. David Alan Gilbert" <dgilb...@redhat.com> > Date: Fri, 21 Jul 2017 19:59:01 +0100 > Subject: [PATCH] cpu_physical_memory_sync_dirty_bitmap: Fix alignment check > > This code has an optimised, word aligned version, and a boring > unaligned version. Recently 084140bd498909 fixed a missing offset > addition from the core of both versions. However, the offset isn't > necessarily aligned and thus the choice between the two versions > needs fixing up to also include the offset. > > DANGER: Friday evening, post-dinner lightly tested patch. > > Symptom: > A few stuck unsent pages during migration; not normally noticed > unless under very low bandwidth. > > Fixes: 084140bd498909 > --- > include/exec/ram_addr.h | 5 +++-- > 1 file changed, 3 insertions(+), 2 deletions(-) > > diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h > index c04f4f6..c802f7f 100644 > --- a/include/exec/ram_addr.h > +++ b/include/exec/ram_addr.h > @@ -378,15 +378,16 @@ uint64_t cpu_physical_memory_sync_dirty_bitmap(RAMBlock > *rb, > { > ram_addr_t addr; > unsigned long page = BIT_WORD(start >> TARGET_PAGE_BITS); > + unsigned long word = BIT_WORD((start + rb->offset) >> TARGET_PAGE_BITS); > uint64_t num_dirty = 0; > unsigned long *dest = rb->bmap; > > /* start address is aligned at the start of a word? */ > - if (((page * BITS_PER_LONG) << TARGET_PAGE_BITS) == start) { > + if (((word * BITS_PER_LONG) << TARGET_PAGE_BITS) == > + (start + rb->offset)) { > int k; > int nr = BITS_TO_LONGS(length >> TARGET_PAGE_BITS); > unsigned long * const *src; > - unsigned long word = BIT_WORD((start + rb->offset) >> > TARGET_PAGE_BITS); > unsigned long idx = (word * BITS_PER_LONG) / DIRTY_MEMORY_BLOCK_SIZE; > unsigned long offset = BIT_WORD((word * BITS_PER_LONG) % > DIRTY_MEMORY_BLOCK_SIZE); > -- > 1.8.3.1 > >> Dave >> -- >> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
I saw no hangs, there where 4 times it failed but that might be another test crashing. I'm investigating that so a cautious: Tested-by: Alex Bennée <alex.ben...@linaro.org> -- Alex Bennée