From: Zhanghailiang <zhang.zhanghaili...@huawei.com>
Sent: Friday, May 15, 2020 11:29 AM
To: Zhang, Chen <chen.zh...@intel.com>; Dr. David Alan Gilbert <dgilb...@redhat.com>; qemu-devel <qemu-devel@nongnu.org>; Li Zhijian <lizhij...@cn.fujitsu.com>
Cc: Jason Wang <jasow...@redhat.com>; Lukas Straub <lukasstra...@web.de>
Subject: RE: About migration/colo issue
Hi Zhang Chen,

From your tracing log, it seems to be hanging in colo_flush_ram_cache(). Does it hit a dead loop there? Maybe; I haven't looked into it in depth yet. I'll test it with the new QEMU.

Thanks,
Hailiang

From: Zhang, Chen [mailto:chen.zh...@intel.com]
Sent: Friday, May 15, 2020 11:16 AM
To: Zhanghailiang <zhang.zhanghaili...@huawei.com>; Dr. David Alan Gilbert <dgilb...@redhat.com>; qemu-devel <qemu-devel@nongnu.org>; Li Zhijian <lizhij...@cn.fujitsu.com>
Cc: Jason Wang <jasow...@redhat.com>; Lukas Straub <lukasstra...@web.de>
Subject: About migration/colo issue

Hi Hailiang/Dave,

I found an urgent problem in the current upstream code: COLO gets stuck on the second checkpoint and later ones, and the guest hangs because of this issue. I bisected the upstream code; the issue is caused by Hailiang's optimization patch:

From 0393031a16735835a441b6d6e0495a1bd14adb90 Mon Sep 17 00:00:00 2001
From: zhanghailiang <zhang.zhanghaili...@huawei.com>
Date: Mon, 24 Feb 2020 14:54:10 +0800
Subject: [PATCH] COLO: Optimize memory back-up process

This patch reduces the downtime of the VM for the initial process.
Previously, we copied all this memory in the preparing stage of COLO,
during which we needed to stop the VM, which is a time-consuming
process. Here we optimize it with a trick: back up every page during
the migration process while COLO is enabled. Though this affects the
speed of the migration, it obviously reduces the downtime of backing
up all of the SVM's memory in the COLO preparing stage.

Signed-off-by: zhanghailiang <zhang.zhanghaili...@huawei.com>
Message-Id: <20200224065414.36524-5-zhang.zhanghaili...@huawei.com>
Signed-off-by: Dr.
David Alan Gilbert <dgilb...@redhat.com>

  minor typo fixes

Hailiang, do you have time to look into it?

The detailed log:

Primary node:

13322@1589511271.917346:colo_receive_message Receive 'checkpoint-ready' message
{"timestamp": {"seconds": 1589511271, "microseconds": 917406}, "event": "RESUME"}
13322@1589511271.917842:colo_vm_state_change Change 'stop' => 'run'
13322@1589511291.243346:colo_send_message Send 'checkpoint-request' message
13322@1589511291.243978:colo_receive_message Receive 'checkpoint-reply' message
{"timestamp": {"seconds": 1589511291, "microseconds": 244096}, "event": "STOP"}
13322@1589511291.244444:colo_vm_state_change Change 'run' => 'stop'
13322@1589511291.244561:colo_send_message Send 'vmstate-send' message
13322@1589511291.258594:colo_send_message Send 'vmstate-size' message
13322@1589511305.412479:colo_receive_message Receive 'vmstate-received' message
13322@1589511309.031826:colo_receive_message Receive 'vmstate-loaded' message
{"timestamp": {"seconds": 1589511309, "microseconds": 31862}, "event": "RESUME"}
13322@1589511309.033075:colo_vm_state_change Change 'stop' => 'run'
{"timestamp": {"seconds": 1589511311, "microseconds": 111617}, "event": "VNC_CONNECTED", "data": {"server": {"auth": "none", "family": "ipv4", "service": "5907", "host": "0.0.0.0", "websocket": false}, "client": {"family": "ipv4", "service": "51564", "host": "10.239.13.19", "websocket": false}}}
{"timestamp": {"seconds": 1589511311,
"microseconds": 116197}, "event": "VNC_INITIALIZED", "data": {"server": {"auth": "none", "family": "ipv4", "service": "5907", "host": "0.0.0.0", "websocket": false}, "client": {"family": "ipv4", "service": "51564", "host": "10.239.13.19", "websocket": false}}}
13322@1589511311.243271:colo_send_message Send 'checkpoint-request' message
13322@1589511311.351361:colo_receive_message Receive 'checkpoint-reply' message
{"timestamp": {"seconds": 1589511311, "microseconds": 351439}, "event": "STOP"}
13322@1589511311.415779:colo_vm_state_change Change 'run' => 'stop'
13322@1589511311.416001:colo_send_message Send 'vmstate-send' message
13322@1589511311.418620:colo_send_message Send 'vmstate-size' message

Secondary node:

{"timestamp": {"seconds": 1589510920, "microseconds": 778207}, "event": "RESUME"}
23619@1589510920.778835:colo_vm_state_change Change 'stop' => 'run'
23619@1589510920.778891:colo_send_message Send 'checkpoint-ready' message
23619@1589510940.105539:colo_receive_message Receive 'checkpoint-request' message
{"timestamp": {"seconds": 1589510940, "microseconds": 105712}, "event": "STOP"}
23619@1589510940.105917:colo_vm_state_change Change 'run' => 'stop'
23619@1589510940.105971:colo_send_message Send 'checkpoint-reply' message
23619@1589510940.106767:colo_receive_message Receive 'vmstate-send' message
23619@1589510940.122808:colo_flush_ram_cache_begin dirty_pages 2456
23619@1589510953.618672:colo_flush_ram_cache_end
23619@1589510953.945083:colo_receive_message Receive 'vmstate-size' message
23619@1589510954.274816:colo_send_message Send 'vmstate-received' message
qemu-system-x86_64: warning: TSC frequency mismatch between VM (2792980 kHz) and host (2925999 kHz), and TSC scaling unavailable
{"timestamp": {"seconds": 1589510957, "microseconds": 754184}, "event": "RESUME"}
23619@1589510957.894113:colo_vm_state_change Change 'stop' => 'run'
23619@1589510957.894162:colo_send_message Send 'vmstate-loaded' message
23619@1589510960.105977:colo_receive_message Receive 'checkpoint-request' message
{"timestamp": {"seconds": 1589510960, "microseconds": 106148}, "event": "STOP"}
23619@1589510960.213773:colo_vm_state_change Change 'run' => 'stop'
23619@1589510960.213797:colo_send_message Send 'checkpoint-reply' message
23619@1589510960.278771:colo_receive_message Receive 'vmstate-send' message
23619@1589510960.423268:colo_flush_ram_cache_begin dirty_pages 25
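The suspected hang can be illustrated with a toy model of the flush step (plain Python with hypothetical names, not QEMU's actual migration/ram.c code): the flush walks the set of dirty pages and copies each one from the RAM cache into guest RAM, and the loop only terminates if every dirty bit it visits is cleared.

```python
# Toy model of a colo_flush_ram_cache()-style flush loop.
# NOTE: hypothetical names and data structures for illustration only;
# QEMU's real code walks a per-RAMBlock dirty bitmap under RCU.
def flush_ram_cache(dirty_pages, ram_cache, guest_ram):
    """Copy every dirty page from the backup cache into guest RAM.

    Each page index must be removed from `dirty_pages` once copied;
    a dirty bit that is found but never cleared would keep this loop
    spinning forever, which is the kind of hang the trace suggests.
    """
    flushed = 0
    while dirty_pages:                      # loop until the bitmap drains
        page = next(iter(dirty_pages))      # find-dirty analogue
        guest_ram[page] = ram_cache[page]   # memcpy analogue
        dirty_pages.discard(page)           # clear-dirty analogue
        flushed += 1
    return flushed
```

In the trace above, the first flush (`dirty_pages 2456`) reaches `colo_flush_ram_cache_end`, while the second (`dirty_pages 25`) never logs an end event, consistent with a walk that stops making progress.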
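For context, the idea in the bisected patch can be sketched the same way (again a toy model with hypothetical names, not the real implementation): instead of snapshotting all of the SVM's RAM during the preparing stage while the VM is stopped, the secondary backs up each page's old content the first time incoming migration data overwrites it.

```python
# Toy model of the "back up every page during migration" optimization.
# NOTE: hypothetical names for illustration; the real code operates on
# QEMU RAMBlocks and a COLO RAM cache, not Python lists and dicts.
def load_incoming_page(guest_ram, backup, page, data):
    """Receive one page of migration data on the secondary.

    The page's old content is saved to `backup` before the first
    overwrite, so the stop-the-VM preparing stage no longer has to
    copy all of RAM (at the cost of some migration speed).
    """
    if page not in backup:       # first touch: preserve original content
        backup[page] = guest_ram[page]
    guest_ram[page] = data
```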