Hi Hailiang,

I have already applied the patch to my branch, but there is a problem during migration.
Here is the error message from the SVM:

"qemu-system-x86_64: /root/download/qemu-4.1.0/memory.c:1079: memory_region_transaction_commit: Assertion `qemu_mutex_iothread_locked()' failed."

Do you have this problem?

Best regards,
Daniel Cho

On Thu, Feb 20, 2020 at 11:49 AM, Daniel Cho <daniel...@qnap.com> wrote:

> Hi Zhang,
>
> Thanks, I will change the value in the code for testing first.
> However, if you have free time, could you please send the patch file to
> us? Thanks.
>
> Best regards,
> Daniel Cho
>
> On Thu, Feb 20, 2020 at 11:07 AM, Zhang, Chen <chen.zh...@intel.com> wrote:
>
>> On 2/18/2020 5:22 PM, Daniel Cho wrote:
>>
>> Hi Hailiang,
>> Thanks for your help. If we have any problems, we will contact you for
>> help.
>>
>> Hi Zhang,
>>
>> "If colo-compare got a primary packet without related secondary packet
>> in a certain time, it will automatically trigger checkpoint."
>> As you said, colo-compare will trigger a checkpoint, but does the
>> number of checkpoints need to be limited?
>> We see a problem with many checkpoints being taken while we use fio to
>> do random file writes, which causes low throughput on the PVM.
>> Is this situation normal for COLO?
>>
>> Hi Daniel,
>>
>> The checkpoint time is designed to be user-adjustable based on the user
>> environment (workload, network status, business conditions...).
>>
>> In net/colo-compare.c:
>>
>> /* TODO: Should be configurable */
>> #define REGULAR_PACKET_CHECK_MS 3000
>>
>> If you need it, I can send a patch for this issue that lets users change
>> the value through QMP and QEMU monitor commands.
>>
>> Thanks
>>
>> Zhang Chen
>>
>> Best regards,
>> Daniel Cho
>>
>> On Mon, Feb 17, 2020 at 1:36 PM, Zhang, Chen <chen.zh...@intel.com> wrote:
>>
>>> On 2/15/2020 11:35 AM, Daniel Cho wrote:
>>>
>>> Hi Dave,
>>>
>>> Yes, I agree with you, it does need a timeout.
>>>
>>> Hi Daniel and Dave,
>>>
>>> The current colo-compare already has a timeout mechanism.
>>>
>>> It is named packet_check_timer; it scans the primary packet queue to
>>> make sure no primary packet stays there too long.
>>>
>>> If colo-compare gets a primary packet without a related secondary
>>> packet within a certain time, it automatically triggers a checkpoint.
>>>
>>> https://github.com/qemu/qemu/blob/master/net/colo-compare.c#L847
>>>
>>> Thanks
>>>
>>> Zhang Chen
>>>
>>> Hi Hailiang,
>>>
>>> We use the COLO feature on top of qemu-4.1.0, and in your patch we
>>> found a lot of differences between your version and ours.
>>> Could you give us the latest release version that is close to your
>>> development code?
>>>
>>> Thanks.
>>>
>>> Regards
>>> Daniel Cho
>>>
>>> On Thu, Feb 13, 2020 at 6:38 PM, Dr. David Alan Gilbert <dgilb...@redhat.com> wrote:
>>>
>>>> * Daniel Cho (daniel...@qnap.com) wrote:
>>>> > Hi Hailiang,
>>>> >
>>>> > 1.
>>>> > OK, we will try the patch
>>>> > “0001-COLO-Optimize-memory-back-up-process.patch”,
>>>> > and thanks for your help.
>>>> >
>>>> > 2.
>>>> > We understand the reason for comparing the PVM's and SVM's packets.
>>>> > However, the SVM's packet queue might be empty while the COLO
>>>> > feature is being set up or when the SVM is broken.
>>>> >
>>>> > In situation 1 (setting up the COLO feature):
>>>> > We could force a checkpoint after the COLO setup finishes, which
>>>> > would protect the state of the PVM and SVM, as Zhang Chen said.
>>>> >
>>>> > In situation 2 (SVM broken):
>>>> > COLO will do failover for the PVM, so it might not cause any problem
>>>> > on the PVM.
>>>> >
>>>> > However, these are only our views, so there might be a big
>>>> > difference between reality and our views.
>>>> > If any of our views and opinions are wrong, please let us know and
>>>> > correct us.
>>>>
>>>> It does need a timeout; the SVM being broken or being in a state where
>>>> it never sends the corresponding packet (because of a state difference)
>>>> can happen and COLO needs to timeout when the packet hasn't arrived
>>>> after a while and trigger the checkpoint.
>>>>
>>>> Dave
>>>>
>>>> > Thanks.
>>>> >
>>>> > Best regards,
>>>> > Daniel Cho
>>>> >
>>>> > On Thu, Feb 13, 2020 at 10:17 AM, Zhang, Chen <chen.zh...@intel.com> wrote:
>>>> >
>>>> > > Add cc Jason Wang, he is a network expert.
>>>> > >
>>>> > > In case something goes wrong with the network.
>>>> > >
>>>> > > Thanks
>>>> > >
>>>> > > Zhang Chen
>>>> > >
>>>> > > *From:* Zhang, Chen
>>>> > > *Sent:* Thursday, February 13, 2020 10:10 AM
>>>> > > *To:* 'Zhanghailiang' <zhang.zhanghaili...@huawei.com>; Daniel Cho
>>>> > > <daniel...@qnap.com>
>>>> > > *Cc:* Dr. David Alan Gilbert <dgilb...@redhat.com>;
>>>> > > qemu-devel@nongnu.org
>>>> > > *Subject:* RE: The issues about architecture of the COLO checkpoint
>>>> > >
>>>> > > For issue 2:
>>>> > >
>>>> > > COLO needs to use the network packets to confirm that the PVM and
>>>> > > SVM are in the same state.
>>>> > >
>>>> > > Generally speaking, we can’t send PVM packets without comparing
>>>> > > them with SVM packets.
>>>> > >
>>>> > > But to prevent jamming, I think COLO can force a checkpoint and
>>>> > > send the PVM packets in this case.
>>>> > >
>>>> > > Thanks
>>>> > >
>>>> > > Zhang Chen
>>>> > >
>>>> > > *From:* Zhanghailiang <zhang.zhanghaili...@huawei.com>
>>>> > > *Sent:* Thursday, February 13, 2020 9:45 AM
>>>> > > *To:* Daniel Cho <daniel...@qnap.com>
>>>> > > *Cc:* Dr. David Alan Gilbert <dgilb...@redhat.com>;
>>>> > > qemu-devel@nongnu.org; Zhang, Chen <chen.zh...@intel.com>
>>>> > > *Subject:* RE: The issues about architecture of the COLO checkpoint
>>>> > >
>>>> > > Hi,
>>>> > >
>>>> > > 1. After walking through the code again: yes, you are right.
>>>> > > Actually, after the first migration, we keep dirty logging on in
>>>> > > the primary side and only send the dirty pages in the PVM to the
>>>> > > SVM. The ram cache in the secondary side is always a backup of the
>>>> > > PVM, so we don’t have to re-send the non-dirtied pages.
>>>> > >
>>>> > > The reason the first checkpoint takes longer is that we have to
>>>> > > back up the whole VM’s ram into the ram cache, that is,
>>>> > > colo_init_ram_cache().
>>>> > >
>>>> > > It is time-consuming, but I have optimized it in the second patch,
>>>> > > “0001-COLO-Optimize-memory-back-up-process.patch”, which you can
>>>> > > find in my previous reply.
>>>> > >
>>>> > > Besides, regarding the statement in my previous reply, “We can only
>>>> > > copy the pages that dirtied by PVM and SVM in last checkpoint.”, I
>>>> > > found that we have already done this optimization in the current
>>>> > > upstream code.
>>>> > >
>>>> > > 2. I don’t quite understand this question. For COLO, we always need
>>>> > > the network packets of both the PVM and the SVM to compare before
>>>> > > sending the packets to the client.
>>>> > >
>>>> > > COLO depends on this to decide whether or not the PVM and SVM are
>>>> > > in the same state.
>>>> > >
>>>> > > Thanks,
>>>> > >
>>>> > > hailiang
>>>> > >
>>>> > > *From:* Daniel Cho [mailto:daniel...@qnap.com]
>>>> > > *Sent:* Wednesday, February 12, 2020 4:37 PM
>>>> > > *To:* Zhang, Chen <chen.zh...@intel.com>
>>>> > > *Cc:* Zhanghailiang <zhang.zhanghaili...@huawei.com>; Dr. David
>>>> > > Alan Gilbert <dgilb...@redhat.com>; qemu-devel@nongnu.org
>>>> > > *Subject:* Re: The issues about architecture of the COLO checkpoint
>>>> > >
>>>> > > Hi Hailiang,
>>>> > >
>>>> > > Thanks for your reply and the detailed explanation.
>>>> > >
>>>> > > We will try to use the attachments to improve the memory copy.
>>>> > >
>>>> > > However, we have some questions about your reply.
>>>> > >
>>>> > > 1. As you said, "for each checkpoint, we have to send the whole
>>>> > > PVM's pages To SVM", so why does only the first checkpoint take
>>>> > > more pause time?
>>>> > >
>>>> > > In our observation, the first checkpoint takes more time for
>>>> > > pausing, while the other checkpoints take only a little time. Does
>>>> > > this mean that only the first checkpoint sends all the pages to the
>>>> > > SVM, and the other checkpoints send only the dirty pages to the SVM
>>>> > > for reloading?
>>>> > >
>>>> > > 2. We notice that the COLO-COMPARE component holds a packet until
>>>> > > it receives packets from both the PVM and the SVM. Under this rule,
>>>> > > when we add COLO-COMPARE to the PVM, its network is stuck until the
>>>> > > SVM starts. So this is another issue that makes the PVM stall while
>>>> > > setting up the COLO feature. Given this issue, could we let
>>>> > > colo-compare pass the PVM's packets when the SVM's packet queue is
>>>> > > empty? Then the PVM's network won't get stuck, and "if PVM runs
>>>> > > firstly, it still need to wait for The network packets from SVM to
>>>> > > compare before send it to client side" won't happen either.
>>>> > >
>>>> > > Best regards,
>>>> > >
>>>> > > Daniel Cho
>>>> > >
>>>> > > On Wed, Feb 12, 2020 at 1:45 PM, Zhang, Chen <chen.zh...@intel.com> wrote:
>>>> > >
>>>> > > > -----Original Message-----
>>>> > > > From: Zhanghailiang <zhang.zhanghaili...@huawei.com>
>>>> > > > Sent: Wednesday, February 12, 2020 11:18 AM
>>>> > > > To: Dr. David Alan Gilbert <dgilb...@redhat.com>; Daniel Cho
>>>> > > > <daniel...@qnap.com>; Zhang, Chen <chen.zh...@intel.com>
>>>> > > > Cc: qemu-devel@nongnu.org
>>>> > > > Subject: RE: The issues about architecture of the COLO checkpoint
>>>> > > >
>>>> > > > Hi,
>>>> > > >
>>>> > > > Thank you Dave,
>>>> > > >
>>>> > > > I'll reply here directly.
>>>> > > >
>>>> > > > -----Original Message-----
>>>> > > > From: Dr. David Alan Gilbert [mailto:dgilb...@redhat.com]
>>>> > > > Sent: Wednesday, February 12, 2020 1:48 AM
>>>> > > > To: Daniel Cho <daniel...@qnap.com>; chen.zh...@intel.com;
>>>> > > > Zhanghailiang <zhang.zhanghaili...@huawei.com>
>>>> > > > Cc: qemu-devel@nongnu.org
>>>> > > > Subject: Re: The issues about architecture of the COLO checkpoint
>>>> > > >
>>>> > > > cc'ing in COLO people:
>>>> > > >
>>>> > > > * Daniel Cho (daniel...@qnap.com) wrote:
>>>> > > > > Hi everyone,
>>>> > > > > We have some issues with setting up the COLO feature. We hope
>>>> > > > > somebody can give us some advice.
>>>> > > > >
>>>> > > > > Issue 1:
>>>> > > > > We enable the COLO feature dynamically for the PVM (2 cores,
>>>> > > > > 16G memory), but the Primary VM pauses for a long time (based
>>>> > > > > on memory size) while waiting for the SVM to start. Is there
>>>> > > > > any way to reduce the pause time?
>>>> > > > >
>>>> > > >
>>>> > > > Yes, we do have some ideas to optimize this downtime.
>>>> > > >
>>>> > > > The main problem in the current version is that, for each
>>>> > > > checkpoint, we have to send all of the PVM's pages to the SVM and
>>>> > > > then copy the whole VM's state into the SVM from the ram cache.
>>>> > > > During this process, both of them need to be paused.
>>>> > > > Just as you said, the downtime is based on memory size.
>>>> > > >
>>>> > > > So firstly, we need to reduce the data sent during a checkpoint.
>>>> > > > Actually, we can migrate parts of the PVM's dirty pages in the
>>>> > > > background while both VMs are running, and then load these pages
>>>> > > > into the ram cache (backup memory) in the SVM temporarily. During
>>>> > > > a checkpoint, we just send the last dirty pages of the PVM to the
>>>> > > > slave side and then copy the ram cache into the SVM. Furthermore,
>>>> > > > we don't have to send the whole PVM's dirty pages; we can send
>>>> > > > only the pages that were dirtied by the PVM or SVM between two
>>>> > > > checkpoints. (Because if a page is dirtied by neither the PVM nor
>>>> > > > the SVM, its data stays the same in the SVM, the PVM, and the
>>>> > > > backup memory.) This method can reduce the time consumed in
>>>> > > > sending data.
>>>> > > >
>>>> > > > For the second problem, we can reduce the memory copy in two
>>>> > > > ways. First, we don't have to copy all the pages in the ram
>>>> > > > cache; we can copy only the pages that were dirtied by the PVM
>>>> > > > and SVM in the last checkpoint. Second, we can use the userfault
>>>> > > > missing function to reduce the time consumed in memory copy.
>>>> > > > (With the second method, in theory, we can reduce the time
>>>> > > > consumed in memory copy to the ms level.)
>>>> > > >
>>>> > > > You can find the first optimization in the attachment. It is
>>>> > > > based on an old qemu version (qemu-2.6), but it should not be
>>>> > > > difficult to rebase it onto master or your version. And please
>>>> > > > feel free to send the new version to the community if you want ;)
>>>> > > >
>>>> > >
>>>> > > Thanks Hailiang!
>>>> > > By the way, do you have time to push the patches upstream?
>>>> > > I think this is a better and faster option.
>>>> > >
>>>> > > Thanks
>>>> > > Zhang Chen
>>>> > >
>>>> > > > >
>>>> > > > > Issue 2:
>>>> > > > > In
>>>> > > > > https://github.com/qemu/qemu/blob/master/migration/colo.c#L503,
>>>> > > > > could we move start_vm() before Line 488? Because at the first
>>>> > > > > checkpoint the PVM waits for the SVM's reply, which causes the
>>>> > > > > PVM to stop for a while.
>>>> > > > >
>>>> > > >
>>>> > > > No, that makes no sense, because if the PVM runs first, it still
>>>> > > > needs to wait for the network packets from the SVM to compare
>>>> > > > before sending them to the client side.
>>>> > > >
>>>> > > > Thanks,
>>>> > > > Hailiang
>>>> > > >
>>>> > > > > We set up the COLO feature on a running VM, so we hope the
>>>> > > > > running VM can continue to serve users.
>>>> > > > > Do you have any suggestions for those issues?
>>>> > > > >
>>>> > > > > Best regards,
>>>> > > > > Daniel Cho
>>>> > > > --
>>>> > > > Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
>>>> > >
>>>> > >
>>>> --
>>>> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
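
For reference, the timeout behaviour discussed in the quoted thread (a primary packet that waits too long without a matching secondary packet triggers a checkpoint) can be sketched roughly as below. This is only an illustrative model, not the code in net/colo-compare.c: the `SketchPacket` type and the `sketch_needs_checkpoint()` function are hypothetical names made up for this example, and only the `REGULAR_PACKET_CHECK_MS` value of 3000 comes from the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Value quoted in the thread; assumed here to be the timeout in ms. */
#define REGULAR_PACKET_CHECK_MS 3000

/* Hypothetical, simplified stand-in for a queued primary packet. */
typedef struct {
    int64_t creation_ms;  /* time the packet was queued on the primary side */
    bool has_secondary;   /* true once a matching secondary packet arrived */
} SketchPacket;

/*
 * Scan the primary queue and report whether any packet has waited longer
 * than timeout_ms without a matching secondary packet. In this model, a
 * true result is the point at which a checkpoint would be requested.
 */
bool sketch_needs_checkpoint(const SketchPacket *queue, size_t len,
                             int64_t now_ms, int64_t timeout_ms)
{
    assert(timeout_ms > 0);

    for (size_t i = 0; i < len; i++) {
        if (!queue[i].has_secondary &&
            now_ms - queue[i].creation_ms > timeout_ms) {
            return true;
        }
    }
    return false;
}
```

In this model, a smaller timeout makes checkpoints fire sooner when the SVM falls behind, and a larger one lets packets wait longer before a checkpoint is forced. That is the trade-off behind the request to make the value configurable.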