* Daniel Cho (daniel...@qnap.com) wrote:
> Hi Hailiang,
>
> I have already patched the file to my branch, but there is a problem while
> doing migration.
> Here is the error message from SVM
> "qemu-system-x86_64: /root/download/qemu-4.1.0/memory.c:1079:
> memory_region_transaction_commit: Assertion `qemu_mutex_iothread_locked()'
> failed."

It's probably worth getting the full backtrace.

Dave

> Do you have this problem?
>
> Best regards,
> Daniel Cho
>
> Daniel Cho <daniel...@qnap.com> wrote on Thu, Feb 20, 2020 at 11:49 AM:
>
> > Hi Zhang,
> >
> > Thanks, I will configure on code for testing first.
> > However, if you have free time, could you please send the patch file to
> > us, Thanks.
> >
> > Best Regard,
> > Daniel Cho
> >
> > Zhang, Chen <chen.zh...@intel.com> wrote on Thu, Feb 20, 2020 at 11:07 AM:
> >
> >> On 2/18/2020 5:22 PM, Daniel Cho wrote:
> >>
> >> Hi Hailiang,
> >> Thanks for your help. If we have any problems we will contact you for
> >> your favor.
> >>
> >> Hi Zhang,
> >>
> >> "If colo-compare got a primary packet without related secondary packet
> >> in a certain time, it will automatically trigger checkpoint."
> >> As you said, the colo-compare will trigger checkpoint, but does it need
> >> to limit checkpoint times?
> >> There is a problem about doing many checkpoints while we use fio to
> >> random write files. Then it will cause low throughput on PVM.
> >> Is this situation normal on COLO?
> >>
> >> Hi Daniel,
> >>
> >> The checkpoint time is designed to be user adjustable based on user
> >> environment (workload/network status/business conditions...).
> >>
> >> In net/colo-compare.c
> >>
> >> /* TODO: Should be configurable */
> >> #define REGULAR_PACKET_CHECK_MS 3000
> >>
> >> If you need, I can send a patch for this issue, so that users can change
> >> the value by QMP and qemu monitor commands.
> >>
> >> Thanks
> >>
> >> Zhang Chen
> >>
> >> Best regards,
> >> Daniel Cho
> >>
> >> Zhang, Chen <chen.zh...@intel.com> wrote on Mon, Feb 17, 2020 at 1:36 PM:
> >>
> >>> On 2/15/2020 11:35 AM, Daniel Cho wrote:
> >>>
> >>> Hi Dave,
> >>>
> >>> Yes, I agree with you, it does need a timeout.
> >>>
> >>> Hi Daniel and Dave,
> >>>
> >>> Current colo-compare already has the timeout mechanism.
> >>>
> >>> Named packet_check_timer, it will scan the primary packet queue to make
> >>> sure no primary packet stays too long.
> >>>
> >>> If colo-compare got a primary packet without related secondary packet in
> >>> a certain time, it will automatically trigger checkpoint.
> >>>
> >>> https://github.com/qemu/qemu/blob/master/net/colo-compare.c#L847
> >>>
> >>> Thanks
> >>>
> >>> Zhang Chen
> >>>
> >>> Hi Hailiang,
> >>>
> >>> We base on qemu-4.1.0 for using COLO feature, in your patch, we found a
> >>> lot of difference between your version and ours.
> >>> Could you give us a latest release version which is close to your
> >>> developing code?
> >>>
> >>> Thanks.
> >>>
> >>> Regards
> >>> Daniel Cho
> >>>
> >>> Dr. David Alan Gilbert <dgilb...@redhat.com> wrote on Thu, Feb 13, 2020 at 6:38 PM:
> >>>
> >>>> * Daniel Cho (daniel...@qnap.com) wrote:
> >>>> > Hi Hailiang,
> >>>> >
> >>>> > 1.
> >>>> > OK, we will try the patch
> >>>> > “0001-COLO-Optimize-memory-back-up-process.patch”,
> >>>> > and thanks for your help.
> >>>> >
> >>>> > 2.
> >>>> > We understand the reason to compare PVM and SVM's packet. However, the
> >>>> > empty of SVM's packet queue might happen on setting COLO feature and
> >>>> > SVM broken.
> >>>> >
> >>>> > On situation 1 (setting COLO feature):
> >>>> > We could force do checkpoint after setting COLO feature finish, then it
> >>>> > will protect the state of PVM and SVM, as Zhang Chen said.
> >>>> >
> >>>> > On situation 2 (SVM broken):
> >>>> > COLO will do failover for PVM, so it might not cause any wrong on PVM.
> >>>> >
> >>>> > However, those situations are our views, so there might be a big
> >>>> > difference between reality and our views.
> >>>> > If we have any wrong views and opinions, please let us know, and correct
> >>>> > us.
> >>>>
> >>>> It does need a timeout; the SVM being broken or being in a state where
> >>>> it never sends the corresponding packet (because of a state difference)
> >>>> can happen and COLO needs to timeout when the packet hasn't arrived
> >>>> after a while and trigger the checkpoint.
> >>>>
> >>>> Dave
> >>>>
> >>>> > Thanks.
> >>>> >
> >>>> > Best regards,
> >>>> > Daniel Cho
> >>>> >
> >>>> > Zhang, Chen <chen.zh...@intel.com> wrote on Thu, Feb 13, 2020 at 10:17 AM:
> >>>> >
> >>>> > > Add cc Jason Wang, he is a network expert.
> >>>> > >
> >>>> > > In case some network things go wrong.
> >>>> > >
> >>>> > > Thanks
> >>>> > >
> >>>> > > Zhang Chen
> >>>> > >
> >>>> > > *From:* Zhang, Chen
> >>>> > > *Sent:* Thursday, February 13, 2020 10:10 AM
> >>>> > > *To:* 'Zhanghailiang' <zhang.zhanghaili...@huawei.com>; Daniel Cho
> >>>> > > <daniel...@qnap.com>
> >>>> > > *Cc:* Dr. David Alan Gilbert <dgilb...@redhat.com>; qemu-devel@nongnu.org
> >>>> > > *Subject:* RE: The issues about architecture of the COLO checkpoint
> >>>> > >
> >>>> > > For the issue 2:
> >>>> > >
> >>>> > > COLO needs to use the network packets to confirm PVM and SVM are in the
> >>>> > > same state.
> >>>> > >
> >>>> > > Generally speaking, we can’t send PVM packets without comparing them
> >>>> > > with SVM packets.
> >>>> > >
> >>>> > > But to prevent jamming, I think COLO can do force checkpoint and send the
> >>>> > > PVM packets in this case.
> >>>> > >
> >>>> > > Thanks
> >>>> > >
> >>>> > > Zhang Chen
> >>>> > >
> >>>> > > *From:* Zhanghailiang <zhang.zhanghaili...@huawei.com>
> >>>> > > *Sent:* Thursday, February 13, 2020 9:45 AM
> >>>> > > *To:* Daniel Cho <daniel...@qnap.com>
> >>>> > > *Cc:* Dr. David Alan Gilbert <dgilb...@redhat.com>; qemu-devel@nongnu.org;
> >>>> > > Zhang, Chen <chen.zh...@intel.com>
> >>>> > > *Subject:* RE: The issues about architecture of the COLO checkpoint
> >>>> > >
> >>>> > > Hi,
> >>>> > >
> >>>> > > 1. After re-walking through the code, yes, you are right, actually,
> >>>> > > after the first migration, we will keep dirty log on in primary side,
> >>>> > > and only send the dirty pages in PVM to SVM. The ram cache in secondary
> >>>> > > side is always a backup of PVM, so we don’t have to
> >>>> > > re-send the non-dirtied pages.
> >>>> > >
> >>>> > > The reason why the first checkpoint takes longer time is we have to
> >>>> > > backup the whole VM’s ram into ram cache, that is colo_init_ram_cache().
> >>>> > >
> >>>> > > It is time consuming, but I have optimized in the second patch
> >>>> > > “0001-COLO-Optimize-memory-back-up-process.patch” which you can find in
> >>>> > > my previous reply.
> >>>> > >
> >>>> > > Besides, I found that, in my previous reply “We can only copy the pages
> >>>> > > that dirtied by PVM and SVM in last checkpoint.”,
> >>>> > >
> >>>> > > We have done this optimization in current upstream code.
> >>>> > >
> >>>> > > 2. I don’t quite understand this question. For COLO, we always need both
> >>>> > > network packets of PVM’s and SVM’s to compare before sending these
> >>>> > > packets to client.
> >>>> > >
> >>>> > > It depends on this to decide whether or not PVM and SVM are in same state.
> >>>> > >
> >>>> > > Thanks,
> >>>> > >
> >>>> > > hailiang
> >>>> > >
> >>>> > > *From:* Daniel Cho [mailto:daniel...@qnap.com <daniel...@qnap.com>]
> >>>> > > *Sent:* Wednesday, February 12, 2020 4:37 PM
> >>>> > > *To:* Zhang, Chen <chen.zh...@intel.com>
> >>>> > > *Cc:* Zhanghailiang <zhang.zhanghaili...@huawei.com>; Dr. David Alan
> >>>> > > Gilbert <dgilb...@redhat.com>; qemu-devel@nongnu.org
> >>>> > > *Subject:* Re: The issues about architecture of the COLO checkpoint
> >>>> > >
> >>>> > > Hi Hailiang,
> >>>> > >
> >>>> > > Thanks for your reply and for explaining in detail.
> >>>> > >
> >>>> > > We will try to use the attachments to enhance memory copy.
> >>>> > >
> >>>> > > However, we have some questions about your reply.
> >>>> > >
> >>>> > > 1. As you said, "for each checkpoint, we have to send the whole PVM's
> >>>> > > pages to SVM", why does only the first checkpoint take more pause time?
> >>>> > >
> >>>> > > In our observation, the first checkpoint takes more time for pausing,
> >>>> > > then other checkpoints take only a little time for pausing. Does it mean
> >>>> > > only the first checkpoint will send the whole pages to SVM, and the
> >>>> > > other checkpoints send the dirty pages to SVM for reloading?
> >>>> > >
> >>>> > > 2. We notice the COLO-COMPARE component will hold the packet until it
> >>>> > > receives packets from PVM and SVM. By this rule, when we add
> >>>> > > COLO-COMPARE to PVM, its network will be stuck until SVM starts. So it
> >>>> > > is another issue that makes PVM stuck while setting COLO feature. With
> >>>> > > this issue, could we let colo-compare pass the PVM's packet when the
> >>>> > > SVM's packet queue is empty?
> >>>> > > Then, the PVM's network won't get stuck, and "if PVM runs
> >>>> > > firstly, it still need to wait for the network packets from SVM to
> >>>> > > compare before send it to client side" won't happen either.
> >>>> > >
> >>>> > > Best regard,
> >>>> > >
> >>>> > > Daniel Cho
> >>>> > >
> >>>> > > Zhang, Chen <chen.zh...@intel.com> wrote on Wed, Feb 12, 2020 at 1:45 PM:
> >>>> > >
> >>>> > > > -----Original Message-----
> >>>> > > > From: Zhanghailiang <zhang.zhanghaili...@huawei.com>
> >>>> > > > Sent: Wednesday, February 12, 2020 11:18 AM
> >>>> > > > To: Dr. David Alan Gilbert <dgilb...@redhat.com>; Daniel Cho
> >>>> > > > <daniel...@qnap.com>; Zhang, Chen <chen.zh...@intel.com>
> >>>> > > > Cc: qemu-devel@nongnu.org
> >>>> > > > Subject: RE: The issues about architecture of the COLO checkpoint
> >>>> > > >
> >>>> > > > Hi,
> >>>> > > >
> >>>> > > > Thank you Dave,
> >>>> > > >
> >>>> > > > I'll reply here directly.
> >>>> > > >
> >>>> > > > -----Original Message-----
> >>>> > > > From: Dr. David Alan Gilbert [mailto:dgilb...@redhat.com]
> >>>> > > > Sent: Wednesday, February 12, 2020 1:48 AM
> >>>> > > > To: Daniel Cho <daniel...@qnap.com>; chen.zh...@intel.com;
> >>>> > > > Zhanghailiang <zhang.zhanghaili...@huawei.com>
> >>>> > > > Cc: qemu-devel@nongnu.org
> >>>> > > > Subject: Re: The issues about architecture of the COLO checkpoint
> >>>> > > >
> >>>> > > > cc'ing in COLO people:
> >>>> > > >
> >>>> > > > * Daniel Cho (daniel...@qnap.com) wrote:
> >>>> > > > > Hi everyone,
> >>>> > > > > We have some issues about setting COLO feature. Hope somebody
> >>>> > > > > could give us some advice.
> >>>> > > > >
> >>>> > > > > Issue 1:
> >>>> > > > > We dynamically set COLO feature for PVM (2 core, 16G memory), but
> >>>> > > > > the Primary VM will pause a long time (based on memory size) for
> >>>> > > > > waiting SVM start.
> >>>> > > > > Does it have any idea to reduce the pause time?
> >>>> > > > >
> >>>> > > >
> >>>> > > > Yes, we do have some ideas to optimize this downtime.
> >>>> > > >
> >>>> > > > The main problem for current version is, for each checkpoint, we have
> >>>> > > > to send the whole PVM's pages to SVM, and then copy the whole VM's
> >>>> > > > state into SVM from ram cache, in this process, we need both of them
> >>>> > > > be paused.
> >>>> > > > Just as you said, the downtime is based on memory size.
> >>>> > > >
> >>>> > > > So firstly, we need to reduce the sending data while do checkpoint,
> >>>> > > > actually, we can migrate parts of PVM's dirty pages in background
> >>>> > > > while both of VMs are running. And then we load these pages into ram
> >>>> > > > cache (backup memory) in SVM temporarily. While do checkpoint,
> >>>> > > > we just send the last dirty pages of PVM to slave side and then copy
> >>>> > > > the ram cache into SVM. Further on, we don't have
> >>>> > > > to send the whole PVM's dirty pages, we can only send the pages that
> >>>> > > > dirtied by PVM or SVM during two checkpoints. (Because
> >>>> > > > if one page is not dirtied by both PVM and SVM, the data of this page
> >>>> > > > will keep same in SVM, PVM, backup memory). This method can reduce
> >>>> > > > the time that consumed in sending data.
> >>>> > > >
> >>>> > > > For the second problem, we can reduce the memory copy by two methods,
> >>>> > > > first one, we don't have to copy the whole pages in ram cache,
> >>>> > > > We can only copy the pages that dirtied by PVM and SVM in last
> >>>> > > > checkpoint.
> >>>> > > > Second, we can use userfault missing function to reduce the
> >>>> > > > time consumed in memory copy. (For the second time, in theory, we can
> >>>> > > > reduce time consumed in memory into ms level).
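
A minimal sketch of the "copy only what was dirtied" idea described in the quoted text above. It is not the code from the attached patch. It assumes one dirty bitmap per side (PVM and SVM) with one bit per page and a flat RAM cache; `copy_dirty_pages`, `SKETCH_PAGE_SIZE` and the bitmap layout are hypothetical names chosen for illustration, not QEMU APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed page size for this sketch only. */
#define SKETCH_PAGE_SIZE 4096u

/*
 * Copy from the RAM cache into SVM memory only the pages marked dirty
 * in either the PVM bitmap or the SVM bitmap since the last checkpoint.
 * A page that neither side touched is already identical in the SVM,
 * the PVM and the cache, so it is skipped.
 * Returns the number of pages copied.
 */
size_t copy_dirty_pages(uint8_t *svm_ram, const uint8_t *ram_cache,
                        const uint64_t *pvm_dirty, const uint64_t *svm_dirty,
                        size_t n_pages)
{
    size_t copied = 0;

    assert(svm_ram && ram_cache && pvm_dirty && svm_dirty);

    for (size_t page = 0; page < n_pages; page++) {
        /* Union of both bitmaps: dirtied by PVM or by SVM. */
        uint64_t word = pvm_dirty[page / 64] | svm_dirty[page / 64];
        uint64_t bit = UINT64_C(1) << (page % 64);

        if (word & bit) {
            memcpy(svm_ram + page * SKETCH_PAGE_SIZE,
                   ram_cache + page * SKETCH_PAGE_SIZE,
                   SKETCH_PAGE_SIZE);
            copied++;
        }
    }
    return copied;
}
```

With bit 0 set in the PVM bitmap and bit 2 set in the SVM bitmap, a call over four pages copies pages 0 and 2 and leaves pages 1 and 3 untouched, so the copy cost follows the number of dirtied pages rather than the guest memory size.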
> >>>> > > >
> >>>> > > > You can find the first optimization in attachment, it is based on an
> >>>> > > > old qemu version (qemu-2.6), it should not be difficult to rebase it
> >>>> > > > into master or your version. And please feel free to send the new
> >>>> > > > version if you want into community ;)
> >>>> > > >
> >>>> > >
> >>>> > > Thanks Hailiang!
> >>>> > > By the way, Do you have time to push the patches to upstream?
> >>>> > > I think this is a better and faster option.
> >>>> > >
> >>>> > > Thanks
> >>>> > > Zhang Chen
> >>>> > >
> >>>> > > > >
> >>>> > > > > Issue 2:
> >>>> > > > > In
> >>>> > > > > https://github.com/qemu/qemu/blob/master/migration/colo.c#L503,
> >>>> > > > > could we move start_vm() before Line 488? Because at first
> >>>> > > > > checkpoint PVM will wait for SVM's reply, it cause PVM stop for a
> >>>> > > > > while.
> >>>> > > > >
> >>>> > > >
> >>>> > > > No, that makes no sense, because if PVM runs firstly, it still need to
> >>>> > > > wait for the network packets from SVM to compare before send it to
> >>>> > > > client side.
> >>>> > > >
> >>>> > > > Thanks,
> >>>> > > > Hailiang
> >>>> > > >
> >>>> > > > > We set the COLO feature on running VM, so we hope the running VM
> >>>> > > > > could continuous service for users.
> >>>> > > > > Do you have any suggestions for those issues?
> >>>> > > > >
> >>>> > > > > Best regards,
> >>>> > > > > Daniel Cho
> >>>> > > > --
> >>>> > > > Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
> >>>> > >
> >>>> --
> >>>> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
> >>>>
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
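
The timeout behaviour discussed in the thread (a primary packet that waits too long without a matching secondary packet should trigger a checkpoint) can be sketched roughly as below. This is only an illustration and not the code in net/colo-compare.c: `struct pending_packet`, `needs_checkpoint` and the use of `REGULAR_PACKET_CHECK_MS` as the default timeout are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Value quoted in the thread from net/colo-compare.c. */
#define REGULAR_PACKET_CHECK_MS 3000

/* Hypothetical stand-in for one queued primary packet. */
struct pending_packet {
    int64_t enqueue_ms; /* time the primary packet was queued */
    bool matched;       /* true once the secondary packet has arrived */
};

/*
 * Scan the primary queue. Returns true if any unmatched packet has been
 * waiting longer than timeout_ms, i.e. a checkpoint should be forced so
 * that the primary's packets can be released.
 */
bool needs_checkpoint(const struct pending_packet *queue, size_t n,
                      int64_t now_ms, int64_t timeout_ms)
{
    assert(timeout_ms >= 0);

    for (size_t i = 0; i < n; i++) {
        if (!queue[i].matched && now_ms - queue[i].enqueue_ms > timeout_ms) {
            return true;
        }
    }
    return false;
}
```

A periodic timer would call `needs_checkpoint(queue, n, now, REGULAR_PACKET_CHECK_MS)` and request a checkpoint when it returns true. Making the timeout a parameter rather than a compile-time constant mirrors the suggestion in the thread that the value should be configurable.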