"Chuang Xu" <[email protected]> writes: > In our long-term experience in Bytedance, we've found that under > the same load, live migration of larger VMs with more devices is > often more difficult to converge (requiring a larger downtime limit). > > Through some testing and calculations, we conclude that bitmap sync time > affects the calculation of live migration bandwidth. > > When the addresses processed are not aligned, a large number of > clear_dirty ioctl occur (e.g. a 4MB misaligned memory can generate > 2048 clear_dirty ioctls from two different memory_listener), > which increases the time required for bitmap_sync and makes it > more difficult for dirty pages to converge. > > For a 64C256G vm with 8 vhost-user-net(32 queue per nic) and > 16 vhost-user-blk(4 queue per blk), the sync time is as high as *73ms* > (tested with 10GBps dirty rate, the sync time increases as the dirty > page rate increases), Here are each part of the sync time: > > - sync from kvm to ram_list: 2.5ms > - vhost_log_sync:3ms > - sync aligned memory from ram_list to RAMBlock: 5ms > - sync misaligned memory from ram_list to RAMBlock: 61ms > > Attempt to merge those fragmented clear_dirty ioctls, then syncing > misaligned memory from ram_list to RAMBlock takes only about 1ms, > and the total sync time is only *12ms*. > > Signed-off-by: Chuang Xu <[email protected]>
Reviewed-by: Fabiano Rosas <[email protected]>
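
For illustration only, the merging idea described above can be sketched outside of QEMU: instead of issuing one clear request per dirty page, contiguous pages are coalesced into runs so that each run needs a single clear_dirty ioctl. The helper name `merge_clear_ranges` and the page-offset representation are hypothetical, not the actual patch code; the 2048-ioctl figure for a 4MB misaligned region comes from the quoted commit message.

```python
def merge_clear_ranges(page_offsets, page_size=4096):
    """Coalesce contiguous dirty-page offsets into (start, length) runs.

    Each run could then be cleared with one ioctl instead of one ioctl
    per page. For a 4MB region that is 1024 pages; with two listeners
    each clearing per page, that matches the 2048 ioctls in the report,
    versus 2 merged calls (one run per listener) after coalescing.
    """
    runs = []
    for off in sorted(page_offsets):
        if runs and runs[-1][0] + runs[-1][1] == off:
            # Extends the previous run: grow it instead of adding a call.
            runs[-1] = (runs[-1][0], runs[-1][1] + page_size)
        else:
            # Gap in the dirty bitmap: a new clear range starts here.
            runs.append((off, page_size))
    return runs

# A fully dirty 4MB region collapses from 1024 per-page clears to 1 run.
dirty_4mb = [i * 4096 for i in range(1024)]
assert merge_clear_ranges(dirty_4mb) == [(0, 4 * 1024 * 1024)]
```

This is the usual interval-coalescing pattern; the real savings in the patch come from avoiding the per-call ioctl overhead, not from clearing fewer pages.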
