On 2015/12/11 17:18, Dr. David Alan Gilbert wrote:
* Hailiang Zhang (zhang.zhanghaili...@huawei.com) wrote:
On 2015/12/11 4:03, Dr. David Alan Gilbert wrote:
* zhanghailiang (zhang.zhanghaili...@huawei.com) wrote:
If the net connection between COLO's two sides is broken while colo/colo 
incoming
thread is blocked in 'read'/'write' socket fd. It will not detect this error 
until
connect timeout. It will be a long time.

Here we shutdown all the related socket file descriptors to wake up the blocking
operation in failover BH. Besides, we should close the corresponding file 
descriptors
after failvoer BH shutdown them, or there will be an error.

Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
Signed-off-by: Li Zhijian <lizhij...@cn.fujitsu.com>
---
v11:
- Only shutdown fd for once
---
  migration/colo.c | 31 +++++++++++++++++++++++++++++--
  1 file changed, 29 insertions(+), 2 deletions(-)

diff --git a/migration/colo.c b/migration/colo.c
index 4cd7b00..994b80d 100644
--- a/migration/colo.c
+++ b/migration/colo.c
@@ -68,6 +68,14 @@ static void secondary_vm_do_failover(void)
          /* recover runstate to normal migration finish state */
          autostart = true;
      }
+    /*
+    * Make sure colo incoming thread not block in recv,
+    * mis->from_src_file and mis->to_src_file use the same fd,
+    * so here we only need to shutdown it for once.
+    */

That assumption is only valid for the current socket transport;
for example I have a (partially-working) RDMA transport where the
forward and reverse paths are quite separate.


Yes, you are right, but we really catch the bad stuck case in test for many 
times,
and it is easy to reproduce. So i think it's necessary to keep this codes.

Yes, I agree you must shutdown from_src_file, but I mean do you also need to
shutdown to_src_file ?  On sockets they share an fd, but that might not always 
be

Yes, in that case, we need to shutdown it, i will add it. Thanks.

true, so you might have to shutdown both QEMUFile's - but only if you can
get blocked on them.


+    if (mis->from_src_file) {
+        qemu_file_shutdown(mis->from_src_file);
+    }

      old_state = failover_set_state(FAILOVER_STATUS_HANDLING,
                                     FAILOVER_STATUS_COMPLETED);
@@ -92,6 +100,15 @@ static void primary_vm_do_failover(void)
                            MIGRATION_STATUS_COMPLETED);
      }

+    /*
+    * Make sure colo thread no block in recv,
+    * Besides, s->rp_state.from_dst_file and s->to_dst_file use the
+    * same fd, so here we only need to shutdown it for once.
+    */
+    if (s->to_dst_file) {
+        qemu_file_shutdown(s->to_dst_file);
+    }
+
      old_state = failover_set_state(FAILOVER_STATUS_HANDLING,
                                     FAILOVER_STATUS_COMPLETED);
      if (old_state != FAILOVER_STATUS_HANDLING) {
@@ -333,7 +350,7 @@ static void colo_process_checkpoint(MigrationState *s)

  out:
      current_time = error_time = qemu_clock_get_ms(QEMU_CLOCK_HOST);
-    if (ret < 0) {
+    if (ret < 0 || (!ret && !failover_request_is_active())) {

Why is this needed - when would you get a 0 return that is actually an error?


If we shutdown the fd, it will return 0, and the ret will be 0, this only happen
when we do failover actively.

OK

          error_report("%s: %s", __func__, strerror(-ret));
          qapi_event_send_colo_exit(COLO_MODE_PRIMARY, COLO_EXIT_REASON_ERROR,
                                    true, strerror(-ret), NULL);
@@ -362,6 +379,11 @@ out:
      qsb_free(buffer);
      buffer = NULL;

+    /* Hope this not to be too long to loop here */
+    while (failover_get_state() != FAILOVER_STATUS_COMPLETED) {
+        ;
+    }
+    /* Must be called after failover BH is completed */

Is this the right patch for this?


Yes, we must ensure the close() called after the shutdown() in failover,
or there maybe an shutdown wrong 'fd' error, the 'fd' maybe just re-used by
other thread after we close here, but before shutdown() in failover. This
is a race case, which is rarely happening. I'll add more comments here.

Thanks, that makes sense; closing the wrong fd can be really hard to debug :-)

Dave


Thanks,
Hailiang


      if (s->rp_state.from_dst_file) {
          qemu_fclose(s->rp_state.from_dst_file);
      }
@@ -534,7 +556,7 @@ void *colo_process_incoming_thread(void *opaque)

  out:
      current_time = error_time = qemu_clock_get_ms(QEMU_CLOCK_HOST);
-    if (ret < 0) {
+    if (ret < 0 || (!ret && !failover_request_is_active())) {
          error_report("colo incoming thread will exit, detect error: %s",
                       strerror(-ret));
          qapi_event_send_colo_exit(COLO_MODE_SECONDARY, COLO_EXIT_REASON_ERROR,
@@ -573,6 +595,11 @@ out:
      */
      colo_release_ram_cache();

+    /* Hope this not to be too long to loop here */
+    while (failover_get_state() != FAILOVER_STATUS_COMPLETED) {
+        ;
+    }
+    /* Must be called after failover BH is completed */
      if (mis->to_src_file) {
          qemu_fclose(mis->to_src_file);
      }
--
1.8.3.1


--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK

.



--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK

.




Reply via email to