On Fri, Feb 18, 2022 at 2:36 PM Juan Quintela <quint...@redhat.com> wrote: > > Leonardo Bras Soares Passos <leob...@redhat.com> wrote: > > Hello Peter, thanks for reviewing! > > > > On Mon, Feb 7, 2022 at 11:22 PM Peter Xu <pet...@redhat.com> wrote: > >> > >> On Tue, Feb 01, 2022 at 03:29:03AM -0300, Leonardo Bras wrote: > >> > -void multifd_send_sync_main(QEMUFile *f) > >> > +int multifd_send_sync_main(QEMUFile *f) > >> > { > >> > int i; > >> > + bool flush_zero_copy; > >> > > >> > if (!migrate_use_multifd()) { > >> > - return; > >> > + return 0; > >> > } > >> > if (multifd_send_state->pages->num) { > >> > if (multifd_send_pages(f) < 0) { > >> > error_report("%s: multifd_send_pages fail", __func__); > >> > - return; > >> > + return 0; > >> > >> I've not checked how it used to do if multifd_send_pages() failed, but.. > >> should > >> it returns -1 rather than 0 when there will be a return code? > > > > Yeah, that makes sense. > > The point here is that I was trying not to modify much of the current > > behavior. > > if (qatomic_read(&multifd_send_state->exiting)) { > return -1; > } > > if (p->quit) { > error_report("%s: channel %d has already quit!", __func__, i); > qemu_mutex_unlock(&p->mutex); > return -1; > } > > This are the only two cases where the current code can return one > error. In the 1st case we are exiting, we are already in the middle of > finishing, so we don't really care. > In the second one, we have already quit, and error as already quite big. > > But I agree with both comments: > - we need to improve the error paths > - leonardo changes don't affect what is already there. >
> > > I mean, multifd_send_sync_main() would previously return void, so any > > other errors would not matter to the caller of this function, which > > will continue to run as if nothing happened. > > > > Now, if it fails with flush_zero_copy, the operation needs to be aborted. > > > > Maybe, I should make it different: > > - In any error, return -1. > > - Create/use a specific error code in the case of a failing > > flush_zero_copy, so I can test the return value for it on the caller > > function and return early. > > We need to add the check. It don't matter if the problem is zero_copy > or the existing one, we are under a minor catastrophe and migration has > to be aborted. Ok, I will fix that so we can abort in case of any error. Maybe it's better to do that on a separated patch, before 5/5, right? > > > Or alternatively, the other errors could also return early, but since > > this will change how the code currently works, I would probably need > > another patch for that change. (so it can be easily reverted if > > needed) > > > > What do you think is better? > > > > > >> > } > >> > } > >> > + > >> > + /* > >> > + * When using zero-copy, it's necessary to flush after each > >> > iteration to > >> > + * make sure pages from earlier iterations don't end up replacing > >> > newer > >> > + * pages. > >> > + */ > >> > + flush_zero_copy = migrate_use_zero_copy_send(); > >> > + > >> > for (i = 0; i < migrate_multifd_channels(); i++) { > >> > MultiFDSendParams *p = &multifd_send_state->params[i]; > >> > > >> > @@ -591,7 +600,7 @@ void multifd_send_sync_main(QEMUFile *f) > >> > if (p->quit) { > >> > error_report("%s: channel %d has already quit", __func__, > >> > i); > >> > qemu_mutex_unlock(&p->mutex); > >> > - return; > >> > + return 0; > >> > >> Same question here. > > > > Please see above, > > > >> > >> > } > >> > >> The rest looks good. Thanks, > > Later, Juan. > Thanks for the feedback! Best regards, Leo