On Thu, Nov 30, 2017 at 10:24:38AM +0000, Dr. David Alan Gilbert wrote: > * Peter Xu (pet...@redhat.com) wrote: > > If the postcopy down due to some reason, we can always see this on dst: > > > > qemu-system-x86_64: RP: Received invalid message 0x0000 length 0x0000 > > > > However in most cases that's not the real issue. The problem is that > > qemu_get_be16() has no way to show whether the returned data is valid or > > not, and we are _always_ assuming it is valid. That's possibly not wise. > > > > The best approach to solve this would be: refactoring QEMUFile interface > > to allow the APIs to return error if there is. However it needs quite a > > bit of work and testing. For now, let's explicitly check the validity > > first before using the data in all places for qemu_get_*(). > > > > This patch tries to fix most of the cases I can see. Only if we are with > > this, can we make sure we are processing the valid data, and also can we > > make sure we can capture the channel down events correctly. > > > > Signed-off-by: Peter Xu <pet...@redhat.com> > > --- > > migration/migration.c | 5 +++++ > > migration/ram.c | 26 ++++++++++++++++++++++---- > > migration/savevm.c | 40 ++++++++++++++++++++++++++++++++++++++-- > > 3 files changed, 65 insertions(+), 6 deletions(-) > > > > diff --git a/migration/migration.c b/migration/migration.c > > index c0206023d7..eae34d0524 100644 > > --- a/migration/migration.c > > +++ b/migration/migration.c > > @@ -1708,6 +1708,11 @@ static void *source_return_path_thread(void *opaque) > > header_type = qemu_get_be16(rp); > > header_len = qemu_get_be16(rp); > > > > + if (qemu_file_get_error(rp)) { > > + mark_source_rp_bad(ms); > > + goto out; > > + } > > + > > if (header_type >= MIG_RP_MSG_MAX || > > header_type == MIG_RP_MSG_INVALID) { > > error_report("RP: Received invalid message 0x%04x length > > 0x%04x", > > diff --git a/migration/ram.c b/migration/ram.c > > index 8620aa400a..960c726ff2 100644 > > --- a/migration/ram.c > > +++ b/migration/ram.c > > @@ -2687,7 +2687,7 @@ static int ram_load_postcopy(QEMUFile *f) > > void *last_host = NULL; > > bool all_zero = false; > > > > - while (!ret && !(flags & RAM_SAVE_FLAG_EOS)) { > > + while (!(flags & RAM_SAVE_FLAG_EOS)) { > > I still think you need to keep the !ret && - see below; > anyway, there's no harm in keeping it!
Fair enough; I'll keep it no matter what. :-) > > > ram_addr_t addr; > > void *host = NULL; > > void *page_buffer = NULL; > > @@ -2696,6 +2696,16 @@ static int ram_load_postcopy(QEMUFile *f) > > uint8_t ch; > > > > addr = qemu_get_be64(f); > > + > > + /* > > + * If qemu file error, we should stop here, and then "addr" > > + * may be invalid > > + */ > > + ret = qemu_file_get_error(f); > > + if (ret) { > > + break; > > + } > > + > > flags = addr & ~TARGET_PAGE_MASK; > > addr &= TARGET_PAGE_MASK; > > > > @@ -2776,6 +2786,13 @@ static int ram_load_postcopy(QEMUFile *f) > > error_report("Unknown combination of migration flags: %#x" > > " (postcopy mode)", flags); > > ret = -EINVAL; [1] > > + break; > > This 'break' breaks from the switch, but doesn't break the loop and > because you remove dthe !ret && from the top, the loop keeps going when > it shouldn't. Ah yes I missed this one, thanks. What I should have written here is a "goto out", and also I should add that "out" label at the end. I think after this single change current patch should be fine. However I understand that you would prefer me to check the ret every time. IMHO it's a matter of taste. I would prefer current way to do things since I see it awkward to keep checking against (!ret) possibly multiple times even we already know it's non-zero (especially when the failure happens at the beginning of the loop block). But for this patch, I can follow yours (since you asked for twice already :). > > > + } > > + > > + /* Detect for any possible file errors */ > > + if (qemu_file_get_error(f)) { > > + ret = qemu_file_get_error(f); > > + break; > > } > > This is all simpler if you just leave the !ret && at the top, and then > make this: > if (!ret) { > ret = qemu_file_get_error(f); > } Sure. (So to show what I meant: if we failed at [1] above we still need to check this, which is unecessary imho) > > > > > if (place_needed) { > > Make that > > if (!ret && place_needed) { Will do. (same here if we failed at [1], actually we don't need to check the ret value so many times) > > > @@ -2789,9 +2806,10 @@ static int ram_load_postcopy(QEMUFile *f) > > ret = postcopy_place_page(mis, place_dest, > > place_source, block); > > } > > - } > > - if (!ret) { > > - ret = qemu_file_get_error(f); > > + > > + if (ret) { > > + break; > > + } > > And with the !ret check at the top this goes again. Yes, will remove it. Thanks! -- Peter Xu