> On 14.02.2020 at 20:45, Peter Xu <pet...@redhat.com> wrote:
>
> On Fri, Feb 14, 2020 at 07:26:59PM +0100, David Hildenbrand wrote:
>>>>>> +    if (!postcopy_is_running()) {
>>>>>> +        Error *err = NULL;
>>>>>> +
>>>>>> +        /*
>>>>>> +         * Precopy code cannot deal with the size of ram blocks
>>>>>> +         * changing at random points in time. We're still running
>>>>>> +         * on the source, abort the migration and continue running
>>>>>> +         * here. Make sure to wait until migration was canceled.
>>>>>> +         */
>>>>>> +        error_setg(&err, "RAM resized during precopy.");
>>>>>> +        migrate_set_error(migrate_get_current(), err);
>>>>>> +        error_free(err);
>>>>>> +        migration_cancel();
>>>>>> +    } else {
>>>>>> +        /*
>>>>>> +         * Postcopy code cannot deal with the size of ram blocks
>>>>>> +         * changing at random points in time. We're running on the
>>>>>> +         * target. Fail hard.
>>>>>> +         *
>>>>>> +         * TODO: How to handle this in a better way?
>>>>>> +         */
>>>>>> +        error_report("RAM resized during postcopy.");
>>>>>> +        exit(-1);
>>>>>
>>>>> Now I'm rethinking the postcopy case...
>>>>>
>>>>> ram_dirty_bitmap_reload() should only happen during the postcopy
>>>>> recovery, and when that happens the VM should be stopped on both
>>>>> sides. Which means, ram resizing should not trigger there...
>>>>
>>>> But the guest got the chance to run for a bit and eventually reboot,
>>>> AFAIK. Also, there are other data races possible when used_length
>>>> suddenly changes; this is just the most obvious one where things will
>>>> get screwed up.
>>>
>>> Right, the major one could be in ram_load_postcopy() where we call
>>> host_from_ram_block_offset(). However, if FW extension is the major
>>> use case then it seems to still work (still better than crashing,
>>> isn't it? :)
>>
>> "Let's close our eyes and hope it will never happen"? :) No, I don't
>> like that. This screams for a better solution long term, and until then
>> for proper fencing, IMHO.
>> We're making wild guesses here about data races
>> and why they might not be that bad in certain cases (did I mention
>> load/store tearing? used_length is not an atomic value ...).
>
> Yeah, fencing is good, but crashing a VM that wasn't going to crash
> is another thing, imho. You can dump an error message if you really
> like, but instead of exit() I'd really prefer we either still let the
> old way at least work, or really fix it.
I'll do whatever Juan/Dave think is best. I am not convinced that there is
no way to corrupt data or crash later, when the guest is already running
again after a reboot and doing real work.

>
> When I say "really fix it", I mean we can even start to think about
> the shrinking case and how to support that for postcopy. For example,
> in the above case host_from_ram_block_offset() will return NULL then,
> and the fix could be that we drop that extra page because we don't
> need it any more, instead of bailing out.
>

I have patches on the list that will make everything exceeding used_length
inaccessible. If there is still an access, we will crash. Printing a
warning might help figure out what went wrong.

I have a patch lying around that allocates the bitmaps only for the
used_length. Accesses outside of that (esp. the received bitmap) will,
depending on the case, crash or mess up data. Printing an error might help
to figure out what went wrong. Maybe. Just FYI, that is how I found this
issue and why I want to sanitize the code.

And we are trying to keep something alive here that never could have worked
100% reliably, as it is inherently racy.

Cheers!