Re: [PATCH V2 4/8] COLO: Optimize memory back-up process

2020-02-24 Thread Daniel Cho
Hi Hailiang,

In version 2, this code appears in migration/ram.c:

+    if (migration_incoming_colo_enabled()) {
+        if (migration_incoming_in_colo_state()) {
+            /* In COLO stage, put all pages into cache temporarily */
+            host = colo_cache_from_block_offset(block, addr);
+        } else {
+            /*
+             * In migration stage but before COLO stage,
+             * Put all pages into both cache and SVM's memory.
+             */
+            host_bak = colo_cache_from_block_offset(block, addr);
+        }
+    }
     if (!host) {
         error_report("Illegal RAM offset " RAM_ADDR_FMT, addr);
         ret = -EINVAL;
         break;
     }

Given the two possible assignments

    host = colo_cache_from_block_offset(block, addr);
    host_bak = colo_cache_from_block_offset(block, addr);

if the else branch runs and only host_bak is assigned, host stays NULL.
Won't the "if (!host)" check then report an illegal RAM offset and break
out, even though the lookup actually succeeded?

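One possible restructuring that keeps the NULL check meaningful — a sketch
only, reusing the existing host_from_ram_block_offset() helper, and not the
author's actual fix — would assign host in every branch and use host_bak
purely to record the extra backup destination:

    if (migration_incoming_colo_enabled()) {
        /* Always load incoming pages into the COLO cache. */
        host = colo_cache_from_block_offset(block, addr);
        if (!migration_incoming_in_colo_state()) {
            /*
             * Before COLO starts, the page must also reach SVM's own
             * memory, so remember the real RAMBlock address as the
             * backup target.
             */
            host_bak = host_from_ram_block_offset(block, addr);
        }
    } else {
        host = host_from_ram_block_offset(block, addr);
    }
    if (!host) {
        /* Now this guards a failed lookup in every branch. */
        error_report("Illegal RAM offset " RAM_ADDR_FMT, addr);
        ret = -EINVAL;
        break;
    }
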
Best regards,
Daniel Cho

zhanghailiang wrote on Mon, Feb 24, 2020 at 2:55 PM:
>
> This patch reduces the VM downtime during the initial COLO process.
> Previously, we copied all of this memory in the COLO preparation stage,
> during which the VM had to be stopped, which is time-consuming.
> Here we optimize it with a trick: back up every page during the migration
> process while COLO is enabled. Although this slows the migration down, it
> clearly reduces the downtime of backing up all of the SVM's memory in the
> COLO preparation stage.
>
> Signed-off-by: zhanghailiang 
> ---
>  migration/colo.c |  3 +++
>  migration/ram.c  | 68 +++-
>  migration/ram.h  |  1 +
>  3 files changed, 54 insertions(+), 18 deletions(-)
>
> diff --git a/migration/colo.c b/migration/colo.c
> index 93c5a452fb..44942c4e23 100644
> --- a/migration/colo.c
> +++ b/migration/colo.c
> @@ -26,6 +26,7 @@
>  #include "qemu/main-loop.h"
>  #include "qemu/rcu.h"
>  #include "migration/failover.h"
> +#include "migration/ram.h"
>  #ifdef CONFIG_REPLICATION
>  #include "replication.h"
>  #endif
> @@ -845,6 +846,8 @@ void *colo_process_incoming_thread(void *opaque)
>   */
>  qemu_file_set_blocking(mis->from_src_file, true);
>
> +colo_incoming_start_dirty_log();
> +
>  bioc = qio_channel_buffer_new(COLO_BUFFER_BASE_SIZE);
>  fb = qemu_fopen_channel_input(QIO_CHANNEL(bioc));
>  object_unref(OBJECT(bioc));
> diff --git a/migration/ram.c b/migration/ram.c
> index ed23ed1c7c..ebf9e6ba51 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -2277,6 +2277,7 @@ static void ram_list_init_bitmaps(void)
>   * dirty_memory[DIRTY_MEMORY_MIGRATION] don't include the whole
>   * guest memory.
>   */
> +
>  block->bmap = bitmap_new(pages);
>  bitmap_set(block->bmap, 0, pages);
>  block->clear_bmap_shift = shift;
> @@ -2986,7 +2987,6 @@ int colo_init_ram_cache(void)
>  }
>  return -errno;
>  }
> -memcpy(block->colo_cache, block->host, block->used_length);
>  }
>  }
>
> @@ -3000,19 +3000,36 @@ int colo_init_ram_cache(void)
>
>  RAMBLOCK_FOREACH_NOT_IGNORED(block) {
>  unsigned long pages = block->max_length >> TARGET_PAGE_BITS;
> -
>  block->bmap = bitmap_new(pages);
> -bitmap_set(block->bmap, 0, pages);
>  }
>  }
> -ram_state = g_new0(RAMState, 1);
> -ram_state->migration_dirty_pages = 0;
> -qemu_mutex_init(_state->bitmap_mutex);
> -memory_global_dirty_log_start();
>
> +ram_state_init(_state);
>  return 0;
>  }
>
> +/* TODO: duplicated with ram_init_bitmaps */
> +void colo_incoming_start_dirty_log(void)
> +{
> +RAMBlock *block = NULL;
> +/* For memory_global_dirty_log_start below. */
> +qemu_mutex_lock_iothread();
> +qemu_mutex_lock_ramlist();
> +
> +memory_global_dirty_log_sync();
> +WITH_RCU_READ_LOCK_GUARD() {
> +RAMBLOCK_FOREACH_NOT_IGNORED(block) {
> +ramblock_sync_dirty_bitmap(ram_state, block);
> +/* Discard this dirty bitmap record */
> +bitmap_zero(block->bmap, block->max_length >> TARGET_PAGE_BITS);
> +}
> +memory_global_dirty_log_start();
> +}
> +ram_state->migration_dirty_pages = 0;
> +qemu_mutex_unlock_ramlist();
> +qemu_mutex_unlock_iothread();
> +}
> +
>  /* It is need to hold the global lock 

Re: The issues about architecture of the COLO checkpoint

2020-02-23 Thread Daniel Cho
Hi Zhang,

Thanks for your help.
However, have you encountered the error where the function qemu_hexdump in
colo-compare.c crashes the QEMU process during network operations?

We are working on a VM fault tolerance study and evaluating the COLO
function first. We do not currently have a confirmed plan beyond that.

Best regards,
Daniel Cho

Zhang, Chen wrote on Mon, Feb 24, 2020 at 2:43 AM:

>
>
>
>
>
> From: Daniel Cho 
> Sent: Thursday, February 20, 2020 11:49 AM
> To: Zhang, Chen 
> Cc: Dr. David Alan Gilbert ; Zhanghailiang 
> ; qemu-devel@nongnu.org; Jason Wang 
> 
> Subject: Re: The issues about architecture of the COLO checkpoint
>
>
>
> Hi Zhang,
>
>
>
> Thanks, I will work on configuring the code for testing first.
>
> However, if you have free time, could you please send us the patch file?
> Thanks.
>
>
>
> OK, I will send this patch soon.
>
> By the way, can you share QNAP’s plan and status for COLO?
>
>
>
> Best regards,
>
> Daniel Cho
>
>
>
>
>
> Zhang, Chen wrote on Thu, Feb 20, 2020 at 11:07 AM:
>
>
>
> On 2/18/2020 5:22 PM, Daniel Cho wrote:
>
> Hi Hailiang,
>
> Thanks for your help. If we run into any problems, we will contact you.
>
>
>
>
>
> Hi Zhang,
>
>
>
> " If colo-compare got a primary packet without related secondary packet in a 
> certain time , it will automatically trigger checkpoint.  "
>
> As you said, colo-compare will trigger a checkpoint, but does it need to
> limit how often checkpoints can happen?
>
> There is a problem when many checkpoints occur while we use fio to do
> random writes to files: it causes low throughput on the PVM.
>
> Is this situation normal for COLO?
>
>
>
> Hi Daniel,
>
> The checkpoint interval is designed to be user-adjustable based on the
> user's environment (workload, network status, business conditions...).
>
> In net/colo-compare.c
>
> /* TODO: Should be configurable */
> #define REGULAR_PACKET_CHECK_MS 3000
>
> If you need, I can send a patch for this issue that lets users change the
> value via QMP and the QEMU monitor commands.
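
One natural shape for such a patch is a runtime-writable object property on
the colo-compare object, settable from the monitor (the property name below
is illustrative, not an existing one):

    (qemu) qom-set /objects/comp0 compare_timeout 5000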
>
> Thanks
>
> Zhang Chen
>
>
>
>
>
> Best regards,
>
> Daniel Cho
>
>
>
> Zhang, Chen wrote on Mon, Feb 17, 2020 at 1:36 PM:
>
>
>
> On 2/15/2020 11:35 AM, Daniel Cho wrote:
>
> Hi Dave,
>
>
>
> Yes, I agree with you, it does need a timeout.
>
>
>
> Hi Daniel and Dave,
>
> Current colo-compare already has a timeout mechanism,
> named packet_check_timer. It scans the primary packet queue to make sure
> no primary packet stays queued for too long.
>
> If colo-compare gets a primary packet without a related secondary packet
> within a certain time, it automatically triggers a checkpoint.
>
> https://github.com/qemu/qemu/blob/master/net/colo-compare.c#L847
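
A self-contained sketch (not the actual QEMU code) of the rule described
above: if any unmatched primary packet has waited longer than the check
interval, colo-compare should force a checkpoint.

    #include <stdbool.h>
    #include <stdint.h>

    #define REGULAR_PACKET_CHECK_MS 3000   /* same default as net/colo-compare.c */

    typedef struct Packet {
        int64_t creation_ms;               /* time the packet was queued */
        struct Packet *next;
    } Packet;

    static bool primary_queue_needs_checkpoint(const Packet *head,
                                               int64_t now_ms)
    {
        for (const Packet *p = head; p; p = p->next) {
            if (now_ms - p->creation_ms > REGULAR_PACKET_CHECK_MS) {
                return true;   /* stale primary packet: force a checkpoint */
            }
        }
        return false;
    }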
>
>
>
> Thanks
>
> Zhang Chen
>
>
>
>
>
> Hi Hailiang,
>
>
>
> We base our use of the COLO feature on qemu-4.1.0; comparing your patch
> with our tree, we found a lot of differences between your version and
> ours.
>
> Could you give us the latest release version that is closest to your
> development code?
>
>
>
> Thanks.
>
>
>
> Regards
>
> Daniel Cho
>
>
>
> Dr. David Alan Gilbert wrote on Thu, Feb 13, 2020 at 6:38 PM:
>
> * Daniel Cho (daniel...@qnap.com) wrote:
> > Hi Hailiang,
> >
> > 1.
> > OK, we will try the patch
> > “0001-COLO-Optimize-memory-back-up-process.patch”,
> > and thanks for your help.
> >
> > 2.
> > We understand the reason to compare the PVM's and SVM's packets.
> > However, the SVM's packet queue might be empty while the COLO feature
> > is being set up, or when the SVM is broken.
> >
> > In situation 1 (setting up the COLO feature):
> > We could force a checkpoint after COLO setup finishes; that would
> > protect the state of the PVM and SVM, as Zhang Chen said.
> >
> > In situation 2 (SVM broken):
> > COLO will do a failover to the PVM, so it should not cause anything
> > wrong on the PVM.
> >
> > However, those situations are only our views, so there might be a big
> > difference between reality and our views. If we have any wrong views
> > or opinions, please let us know and correct us.
>
> It does need a timeout; the SVM being broken or being in a state where
> it never sends the corresponding packet (because of a state difference)
> can happen and COLO needs to timeout when the packet hasn't arrived
> after a while and trigger the checkpoint.
>
> Dave
>
> > Thanks.
> >
> > Best regards,
> > Daniel Cho
> >
> > Zhang, Chen wrote on Thu, Feb 13, 2020 at 10:17 AM:
> >
>

Re: The issues about architecture of the COLO checkpoint

2020-02-19 Thread Daniel Cho
Hi Hailiang,

I have already applied the patch to my branch, but there is a problem
during migration.
Here is the error message from the SVM:
"qemu-system-x86_64: /root/download/qemu-4.1.0/memory.c:1079:
memory_region_transaction_commit: Assertion `qemu_mutex_iothread_locked()'
failed."

Do you have this problem?

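For context, this assertion fires when a code path reaches
memory_region_transaction_commit() without holding the big QEMU lock; the
usual remedy is to take the iothread mutex around the offending path (a
sketch, not a confirmed fix for this report):

    qemu_mutex_lock_iothread();
    /* ... path that ends in memory_region_transaction_commit() ... */
    qemu_mutex_unlock_iothread();

Note that the V2 patch above takes exactly these locks in
colo_incoming_start_dirty_log() before starting the dirty log.
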
Best regards,
Daniel Cho

Daniel Cho wrote on Thu, Feb 20, 2020 at 11:49 AM:

> Hi Zhang,
>
> Thanks, I will work on configuring the code for testing first.
> However, if you have free time, could you please send us the patch file?
> Thanks.
>
> Best regards,
> Daniel Cho
>
>
> Zhang, Chen wrote on Thu, Feb 20, 2020 at 11:07 AM:
>
>>
>> On 2/18/2020 5:22 PM, Daniel Cho wrote:
>>
>> Hi Hailiang,
>> Thanks for your help. If we run into any problems, we will contact you.
>>
>>
>> Hi Zhang,
>>
>> " If colo-compare got a primary packet without related secondary packet
>> in a certain time , it will automatically trigger checkpoint.  "
>> As you said, colo-compare will trigger a checkpoint, but does it need
>> to limit how often checkpoints can happen?
>> There is a problem when many checkpoints occur while we use fio to do
>> random writes to files: it causes low throughput on the PVM.
>> Is this situation normal for COLO?
>>
>>
>> Hi Daniel,
>>
>> The checkpoint interval is designed to be user-adjustable based on the
>> user's environment (workload, network status, business conditions...).
>>
>> In net/colo-compare.c
>>
>> /* TODO: Should be configurable */
>> #define REGULAR_PACKET_CHECK_MS 3000
>>
>> If you need, I can send a patch for this issue that lets users change
>> the value via QMP and the QEMU monitor commands.
>>
>> Thanks
>>
>> Zhang Chen
>>
>>
>>
>> Best regards,
>> Daniel Cho
>>
>> Zhang, Chen wrote on Mon, Feb 17, 2020 at 1:36 PM:
>>
>>>
>>> On 2/15/2020 11:35 AM, Daniel Cho wrote:
>>>
>>> Hi Dave,
>>>
>>> Yes, I agree with you, it does need a timeout.
>>>
>>>
>>> Hi Daniel and Dave,
>>>
>>> Current colo-compare already has a timeout mechanism,
>>> named packet_check_timer. It scans the primary packet queue to make
>>> sure no primary packet stays queued for too long.
>>>
>>> If colo-compare gets a primary packet without a related secondary
>>> packet within a certain time, it automatically triggers a checkpoint.
>>>
>>> https://github.com/qemu/qemu/blob/master/net/colo-compare.c#L847
>>>
>>>
>>> Thanks
>>>
>>> Zhang Chen
>>>
>>>
>>>
>>> Hi Hailiang,
>>>
>>> We base our use of the COLO feature on qemu-4.1.0; comparing your patch
>>> with our tree, we found a lot of differences between your version and
>>> ours.
>>> Could you give us the latest release version that is closest to your
>>> development code?
>>>
>>> Thanks.
>>>
>>> Regards
>>> Daniel Cho
>>>
>>> Dr. David Alan Gilbert wrote on Thu, Feb 13, 2020 at 6:38 PM:
>>>
>>>> * Daniel Cho (daniel...@qnap.com) wrote:
>>>> > Hi Hailiang,
>>>> >
>>>> > 1.
>>>> > OK, we will try the patch
>>>> > “0001-COLO-Optimize-memory-back-up-process.patch”,
>>>> > and thanks for your help.
>>>> >
>>>> > 2.
>>>> > We understand the reason to compare the PVM's and SVM's packets.
>>>> > However, the SVM's packet queue might be empty while the COLO
>>>> > feature is being set up, or when the SVM is broken.
>>>> >
>>>> > In situation 1 (setting up the COLO feature):
>>>> > We could force a checkpoint after COLO setup finishes; that would
>>>> > protect the state of the PVM and SVM, as Zhang Chen said.
>>>> >
>>>> > In situation 2 (SVM broken):
>>>> > COLO will do a failover to the PVM, so it should not cause anything
>>>> > wrong on the PVM.
>>>> >
>>>> > However, those situations are only our views, so there might be a
>>>> > big difference between reality and our views. If we have any wrong
>>>> > views or opinions, please let us know and correct us.
>>>>
>>>> It does need a timeout; the SVM being broken or being in a state where
>>>> it never sends the corresponding packet (because of a state difference)
>&

Re: The issues about architecture of the COLO checkpoint

2020-02-19 Thread Daniel Cho
Hi Zhang,

Thanks, I will work on configuring the code for testing first.
However, if you have free time, could you please send us the patch file?
Thanks.

Best regards,
Daniel Cho


Zhang, Chen wrote on Thu, Feb 20, 2020 at 11:07 AM:

>
> On 2/18/2020 5:22 PM, Daniel Cho wrote:
>
> Hi Hailiang,
> Thanks for your help. If we run into any problems, we will contact you.
>
>
> Hi Zhang,
>
> " If colo-compare got a primary packet without related secondary packet in
> a certain time , it will automatically trigger checkpoint.  "
> As you said, colo-compare will trigger a checkpoint, but does it need to
> limit how often checkpoints can happen?
> There is a problem when many checkpoints occur while we use fio to do
> random writes to files: it causes low throughput on the PVM.
> Is this situation normal for COLO?
>
>
> Hi Daniel,
>
> The checkpoint interval is designed to be user-adjustable based on the
> user's environment (workload, network status, business conditions...).
>
> In net/colo-compare.c
>
> /* TODO: Should be configurable */
> #define REGULAR_PACKET_CHECK_MS 3000
>
> If you need, I can send a patch for this issue that lets users change the
> value via QMP and the QEMU monitor commands.
>
> Thanks
>
> Zhang Chen
>
>
>
> Best regards,
> Daniel Cho
>
> Zhang, Chen wrote on Mon, Feb 17, 2020 at 1:36 PM:
>
>>
>> On 2/15/2020 11:35 AM, Daniel Cho wrote:
>>
>> Hi Dave,
>>
>> Yes, I agree with you, it does need a timeout.
>>
>>
>> Hi Daniel and Dave,
>>
>> Current colo-compare already has a timeout mechanism,
>> named packet_check_timer. It scans the primary packet queue to make sure
>> no primary packet stays queued for too long.
>>
>> If colo-compare gets a primary packet without a related secondary packet
>> within a certain time, it automatically triggers a checkpoint.
>>
>> https://github.com/qemu/qemu/blob/master/net/colo-compare.c#L847
>>
>>
>> Thanks
>>
>> Zhang Chen
>>
>>
>>
>> Hi Hailiang,
>>
>> We base our use of the COLO feature on qemu-4.1.0; comparing your patch
>> with our tree, we found a lot of differences between your version and
>> ours.
>> Could you give us the latest release version that is closest to your
>> development code?
>>
>> Thanks.
>>
>> Regards
>> Daniel Cho
>>
>> Dr. David Alan Gilbert wrote on Thu, Feb 13, 2020 at 6:38 PM:
>>
>>> * Daniel Cho (daniel...@qnap.com) wrote:
>>> > Hi Hailiang,
>>> >
>>> > 1.
>>> > OK, we will try the patch
>>> > “0001-COLO-Optimize-memory-back-up-process.patch”,
>>> > and thanks for your help.
>>> >
>>> > 2.
>>> > We understand the reason to compare the PVM's and SVM's packets.
>>> > However, the SVM's packet queue might be empty while the COLO
>>> > feature is being set up, or when the SVM is broken.
>>> >
>>> > In situation 1 (setting up the COLO feature):
>>> > We could force a checkpoint after COLO setup finishes; that would
>>> > protect the state of the PVM and SVM, as Zhang Chen said.
>>> >
>>> > In situation 2 (SVM broken):
>>> > COLO will do a failover to the PVM, so it should not cause anything
>>> > wrong on the PVM.
>>> >
>>> > However, those situations are only our views, so there might be a
>>> > big difference between reality and our views. If we have any wrong
>>> > views or opinions, please let us know and correct us.
>>>
>>> It does need a timeout; the SVM being broken or being in a state where
>>> it never sends the corresponding packet (because of a state difference)
>>> can happen and COLO needs to timeout when the packet hasn't arrived
>>> after a while and trigger the checkpoint.
>>>
>>> Dave
>>>
>>> > Thanks.
>>> >
>>> > Best regards,
>>> > Daniel Cho
>>> >
>>> > Zhang, Chen wrote on Thu, Feb 13, 2020 at 10:17 AM:
>>> >
>>> > > Adding Jason Wang to cc; he is a network expert,
>>> > >
>>> > > in case some network things go wrong.
>>> > >
>>> > >
>>> > >
>>> > > Thanks
>>> > >
>>> > > Zhang Chen
>>> > >
>>> > >
>>> > >
>>> > > *From:* Zhang, Chen
>>> > > *Sent:* Thursday, February 13, 2020 10:10 AM
>>> > > *To:* 'Zhanghailiang' ; Daniel Cho <

Re: The issues about architecture of the COLO checkpoint

2020-02-18 Thread Daniel Cho
Hi Hailiang,
Thanks for your help. If we run into any problems, we will contact you.


Hi Zhang,

" If colo-compare got a primary packet without related secondary packet in
a certain time , it will automatically trigger checkpoint.  "
As you said, colo-compare will trigger a checkpoint, but does it need to
limit how often checkpoints can happen?
There is a problem when many checkpoints occur while we use fio to do
random writes to files: it causes low throughput on the PVM.
Is this situation normal for COLO?

Best regards,
Daniel Cho

Zhang, Chen wrote on Mon, Feb 17, 2020 at 1:36 PM:

>
> On 2/15/2020 11:35 AM, Daniel Cho wrote:
>
> Hi Dave,
>
> Yes, I agree with you, it does need a timeout.
>
>
> Hi Daniel and Dave,
>
> Current colo-compare already has a timeout mechanism,
> named packet_check_timer. It scans the primary packet queue to make sure
> no primary packet stays queued for too long.
>
> If colo-compare gets a primary packet without a related secondary packet
> within a certain time, it automatically triggers a checkpoint.
>
> https://github.com/qemu/qemu/blob/master/net/colo-compare.c#L847
>
>
> Thanks
>
> Zhang Chen
>
>
>
> Hi Hailiang,
>
> We base our use of the COLO feature on qemu-4.1.0; comparing your patch
> with our tree, we found a lot of differences between your version and
> ours.
> Could you give us the latest release version that is closest to your
> development code?
>
> Thanks.
>
> Regards
> Daniel Cho
>
> Dr. David Alan Gilbert wrote on Thu, Feb 13, 2020 at 6:38 PM:
>
>> * Daniel Cho (daniel...@qnap.com) wrote:
>> > Hi Hailiang,
>> >
>> > 1.
>> > OK, we will try the patch
>> > “0001-COLO-Optimize-memory-back-up-process.patch”,
>> > and thanks for your help.
>> >
>> > 2.
>> > We understand the reason to compare the PVM's and SVM's packets.
>> > However, the SVM's packet queue might be empty while the COLO
>> > feature is being set up, or when the SVM is broken.
>> >
>> > In situation 1 (setting up the COLO feature):
>> > We could force a checkpoint after COLO setup finishes; that would
>> > protect the state of the PVM and SVM, as Zhang Chen said.
>> >
>> > In situation 2 (SVM broken):
>> > COLO will do a failover to the PVM, so it should not cause anything
>> > wrong on the PVM.
>> >
>> > However, those situations are only our views, so there might be a big
>> > difference between reality and our views. If we have any wrong views
>> > or opinions, please let us know and correct us.
>>
>> It does need a timeout; the SVM being broken or being in a state where
>> it never sends the corresponding packet (because of a state difference)
>> can happen and COLO needs to timeout when the packet hasn't arrived
>> after a while and trigger the checkpoint.
>>
>> Dave
>>
>> > Thanks.
>> >
>> > Best regards,
>> > Daniel Cho
>> >
>> > Zhang, Chen wrote on Thu, Feb 13, 2020 at 10:17 AM:
>> >
>> > > Adding Jason Wang to cc; he is a network expert,
>> > >
>> > > in case some network things go wrong.
>> > >
>> > >
>> > >
>> > > Thanks
>> > >
>> > > Zhang Chen
>> > >
>> > >
>> > >
>> > > *From:* Zhang, Chen
>> > > *Sent:* Thursday, February 13, 2020 10:10 AM
>> > > *To:* 'Zhanghailiang' ; Daniel Cho <
>> > > daniel...@qnap.com>
>> > > *Cc:* Dr. David Alan Gilbert ;
>> qemu-devel@nongnu.org
>> > > *Subject:* RE: The issues about architecture of the COLO checkpoint
>> > >
>> > >
>> > >
>> > > For the issue 2:
>> > >
>> > >
>> > >
>> > > COLO needs to use the network packets to confirm that the PVM and
>> > > SVM are in the same state.
>> > >
>> > > Generally speaking, we can't send PVM packets without comparing them
>> > > with SVM packets.
>> > >
>> > > But to prevent jamming, I think COLO can force a checkpoint and send
>> > > the PVM packets in this case.
>> > >
>> > >
>> > >
>> > > Thanks
>> > >
>> > > Zhang Chen
>> > >
>> > >
>> > >
>> > > *From:* Zhanghailiang 
>> > > *Sent:* Thursday, February 13, 2020 9:45 AM
>> > > *To:* Daniel Cho 
>> > > *Cc:* Dr. David Alan Gilbert ;
>> qemu-devel@nongnu.org;
>> > > Zhang, Chen 
>> > > *S

Re: The issues about architecture of the COLO checkpoint

2020-02-14 Thread Daniel Cho
Hi Dave,

Yes, I agree with you, it does need a timeout.

Hi Hailiang,

We base our use of the COLO feature on qemu-4.1.0; comparing your patch
with our tree, we found a lot of differences between your version and ours.
Could you give us the latest release version that is closest to your
development code?

Thanks.

Regards
Daniel Cho

Dr. David Alan Gilbert wrote on Thu, Feb 13, 2020 at 6:38 PM:

> * Daniel Cho (daniel...@qnap.com) wrote:
> > Hi Hailiang,
> >
> > 1.
> > OK, we will try the patch
> > “0001-COLO-Optimize-memory-back-up-process.patch”,
> > and thanks for your help.
> >
> > 2.
> > We understand the reason to compare the PVM's and SVM's packets.
> > However, the SVM's packet queue might be empty while the COLO feature
> > is being set up, or when the SVM is broken.
> >
> > In situation 1 (setting up the COLO feature):
> > We could force a checkpoint after COLO setup finishes; that would
> > protect the state of the PVM and SVM, as Zhang Chen said.
> >
> > In situation 2 (SVM broken):
> > COLO will do a failover to the PVM, so it should not cause anything
> > wrong on the PVM.
> >
> > However, those situations are only our views, so there might be a big
> > difference between reality and our views. If we have any wrong views
> > or opinions, please let us know and correct us.
>
> It does need a timeout; the SVM being broken or being in a state where
> it never sends the corresponding packet (because of a state difference)
> can happen and COLO needs to timeout when the packet hasn't arrived
> after a while and trigger the checkpoint.
>
> Dave
>
> > Thanks.
> >
> > Best regards,
> > Daniel Cho
> >
> > Zhang, Chen wrote on Thu, Feb 13, 2020 at 10:17 AM:
> >
> > > Adding Jason Wang to cc; he is a network expert,
> > >
> > > in case some network things go wrong.
> > >
> > >
> > >
> > > Thanks
> > >
> > > Zhang Chen
> > >
> > >
> > >
> > > *From:* Zhang, Chen
> > > *Sent:* Thursday, February 13, 2020 10:10 AM
> > > *To:* 'Zhanghailiang' ; Daniel Cho <
> > > daniel...@qnap.com>
> > > *Cc:* Dr. David Alan Gilbert ;
> qemu-devel@nongnu.org
> > > *Subject:* RE: The issues about architecture of the COLO checkpoint
> > >
> > >
> > >
> > > For the issue 2:
> > >
> > >
> > >
> > > COLO needs to use the network packets to confirm that the PVM and
> > > SVM are in the same state.
> > >
> > > Generally speaking, we can't send PVM packets without comparing them
> > > with SVM packets.
> > >
> > > But to prevent jamming, I think COLO can force a checkpoint and send
> > > the PVM packets in this case.
> > >
> > >
> > >
> > > Thanks
> > >
> > > Zhang Chen
> > >
> > >
> > >
> > > *From:* Zhanghailiang 
> > > *Sent:* Thursday, February 13, 2020 9:45 AM
> > > *To:* Daniel Cho 
> > > *Cc:* Dr. David Alan Gilbert ;
> qemu-devel@nongnu.org;
> > > Zhang, Chen 
> > > *Subject:* RE: The issues about architecture of the COLO checkpoint
> > >
> > >
> > >
> > > Hi,
> > >
> > >
> > >
> > > 1. After re-walking through the code: yes, you are right. Actually,
> > > after the first migration, we keep dirty logging on on the primary
> > > side,
> > >
> > > and only send the pages dirtied in the PVM to the SVM. The ram cache
> > > on the secondary side is always a backup of the PVM, so we don't have
> > > to
> > >
> > > re-send the non-dirtied pages.
> > >
> > > The reason the first checkpoint takes longer is that we have to back
> > > up the whole VM's RAM into the ram cache, in colo_init_ram_cache().
> > >
> > > It is time-consuming, but I have optimized it in the second patch,
> > > “0001-COLO-Optimize-memory-back-up-process.patch”, which you can find
> > > in my previous reply.
> > >
> > >
> > >
> > > Besides, I found that the optimization from my previous reply, “We can
> > > only copy the pages that dirtied by PVM and SVM in last checkpoint.”,
> > >
> > > has already been done in the current upstream code.
> > >
> > >
> > >
> > > 2. I don't quite understand this question. For COLO, we always need
> > > both the PVM's and the SVM's network packets to compare before sending
> > > a packet to the client.
> > >
> > > It depends on this to decide whether or not the PVM and SVM are

Re: The issues about architecture of the COLO checkpoint

2020-02-12 Thread Daniel Cho
Hi Hailiang,

1.
OK, we will try the patch
“0001-COLO-Optimize-memory-back-up-process.patch”,
and thanks for your help.

2.
We understand the reason to compare the PVM's and SVM's packets. However,
the SVM's packet queue might be empty while the COLO feature is being set
up, or when the SVM is broken.

In situation 1 (setting up the COLO feature):
We could force a checkpoint after COLO setup finishes; that would protect
the state of the PVM and SVM, as Zhang Chen said.

In situation 2 (SVM broken):
COLO will do a failover to the PVM, so it should not cause anything wrong
on the PVM.

However, those situations are only our views, so there might be a big
difference between reality and our views.
If we have any wrong views or opinions, please let us know and correct us.
Thanks.

Best regards,
Daniel Cho

Zhang, Chen wrote on Thu, Feb 13, 2020 at 10:17 AM:

> Adding Jason Wang to cc; he is a network expert,
>
> in case some network things go wrong.
>
>
>
> Thanks
>
> Zhang Chen
>
>
>
> *From:* Zhang, Chen
> *Sent:* Thursday, February 13, 2020 10:10 AM
> *To:* 'Zhanghailiang' ; Daniel Cho <
> daniel...@qnap.com>
> *Cc:* Dr. David Alan Gilbert ; qemu-devel@nongnu.org
> *Subject:* RE: The issues about architecture of the COLO checkpoint
>
>
>
> For the issue 2:
>
>
>
> COLO needs to use the network packets to confirm that the PVM and SVM are
> in the same state.
>
> Generally speaking, we can't send PVM packets without comparing them with
> SVM packets.
>
> But to prevent jamming, I think COLO can force a checkpoint and send the
> PVM packets in this case.
>
>
>
> Thanks
>
> Zhang Chen
>
>
>
> *From:* Zhanghailiang 
> *Sent:* Thursday, February 13, 2020 9:45 AM
> *To:* Daniel Cho 
> *Cc:* Dr. David Alan Gilbert ; qemu-devel@nongnu.org;
> Zhang, Chen 
> *Subject:* RE: The issues about architecture of the COLO checkpoint
>
>
>
> Hi,
>
>
>
> 1. After re-walking through the code: yes, you are right. Actually,
> after the first migration, we keep dirty logging on on the primary side,
>
> and only send the pages dirtied in the PVM to the SVM. The ram cache on
> the secondary side is always a backup of the PVM, so we don't have to
>
> re-send the non-dirtied pages.
>
> The reason the first checkpoint takes longer is that we have to back up
> the whole VM's RAM into the ram cache, in colo_init_ram_cache().
>
> It is time-consuming, but I have optimized it in the second patch,
> “0001-COLO-Optimize-memory-back-up-process.patch”, which you can find in
> my previous reply.
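
A rough illustration of why that first backup was expensive: before the
optimization, colo_init_ram_cache() duplicated every RAMBlock up front
while the VM was stopped (simplified from the memcpy removed by the patch
at the top of this digest):

    RAMBLOCK_FOREACH_NOT_IGNORED(block) {
        /* whole-RAM copy: cost grows with guest memory size */
        memcpy(block->colo_cache, block->host, block->used_length);
    }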
>
>
>
> Besides, I found that the optimization from my previous reply, “We can
> only copy the pages that dirtied by PVM and SVM in last checkpoint.”,
>
> has already been done in the current upstream code.
>
>
>
> 2. I don't quite understand this question. For COLO, we always need both
> the PVM's and the SVM's network packets to compare before sending a
> packet to the client.
>
> It depends on this to decide whether or not the PVM and SVM are in the
> same state.
>
>
>
> Thanks,
>
> hailiang
>
>
>
> *From:* Daniel Cho [mailto:daniel...@qnap.com ]
> *Sent:* Wednesday, February 12, 2020 4:37 PM
> *To:* Zhang, Chen 
> *Cc:* Zhanghailiang ; Dr. David Alan
> Gilbert ; qemu-devel@nongnu.org
> *Subject:* Re: The issues about architecture of the COLO checkpoint
>
>
>
> Hi Hailiang,
>
>
>
> Thanks for your reply and the detailed explanation.
>
> We will try to use the attachments to enhance memory copy.
>
>
>
> However, we have some questions about your reply.
>
>
>
> 1.  As you said, "for each checkpoint, we have to send the whole PVM's
> pages To SVM" — so why does only the first checkpoint take a longer pause?
>
> In our observation, the first checkpoint pauses for a long time, while the
> later checkpoints pause only briefly. Does that mean only the first
> checkpoint sends all pages to the SVM, and the later checkpoints send just
> the dirty pages to the SVM for reloading?
>
>
>
> 2. We notice the COLO-COMPARE component holds a packet until it receives
> packets from both the PVM and SVM. Under this rule, when we add
> COLO-COMPARE to the PVM, its network is stuck until the SVM starts, so
> making the PVM stall while setting up the COLO feature is another issue.
> Given this, could we let colo-compare pass the PVM's packets through when
> the SVM's packet queue is empty? Then the PVM's network wouldn't stall,
> and "if PVM runs firstly, it still need to wait for The network packets
> from SVM to compare before send it to client side" wouldn't happen either.
>
>
>
> Best regards,
>
> Daniel Cho
>
>
>
> Zhang, Chen wrote on Wed, Feb 12, 2020 at 1:45 PM:
>
>
>
> > -Original Message-----
> > From: Zhang

Re: The issues about architecture of the COLO checkpoint

2020-02-12 Thread Daniel Cho
Hi Hailiang,

Thanks for your reply and the detailed explanation.
We will try to use the attachments to enhance memory copy.

However, we have some questions about your reply.

1.  As you said, "for each checkpoint, we have to send the whole PVM's
pages To SVM" — so why does only the first checkpoint take a longer pause?
In our observation, the first checkpoint pauses for a long time, while the
later checkpoints pause only briefly. Does that mean only the first
checkpoint sends all pages to the SVM, and the later checkpoints send just
the dirty pages to the SVM for reloading?

2. We notice the COLO-COMPARE component holds a packet until it receives
packets from both the PVM and SVM. Under this rule, when we add
COLO-COMPARE to the PVM, its network is stuck until the SVM starts, so
making the PVM stall while setting up the COLO feature is another issue.
Given this, could we let colo-compare pass the PVM's packets through when
the SVM's packet queue is empty? Then the PVM's network wouldn't stall, and
"if PVM runs firstly, it still need to wait for The network packets from
SVM to compare before send it to client side" wouldn't happen either.

Best regards,
Daniel Cho

Zhang, Chen wrote on Wed, Feb 12, 2020 at 1:45 PM:

>
>
> > -Original Message-
> > From: Zhanghailiang 
> > Sent: Wednesday, February 12, 2020 11:18 AM
> > To: Dr. David Alan Gilbert ; Daniel Cho
> > ; Zhang, Chen 
> > Cc: qemu-devel@nongnu.org
> > Subject: RE: The issues about architecture of the COLO checkpoint
> >
> > Hi,
> >
> > Thank you Dave,
> >
> > I'll reply here directly.
> >
> > -Original Message-
> > From: Dr. David Alan Gilbert [mailto:dgilb...@redhat.com]
> > Sent: Wednesday, February 12, 2020 1:48 AM
> > To: Daniel Cho ; chen.zh...@intel.com;
> > Zhanghailiang 
> > Cc: qemu-devel@nongnu.org
> > Subject: Re: The issues about architecture of the COLO checkpoint
> >
> >
> > cc'ing in COLO people:
> >
> >
> > * Daniel Cho (daniel...@qnap.com) wrote:
> > > Hi everyone,
> > >  We have some issues about setting COLO feature. Hope somebody
> > > could give us some advice.
> > >
> > > Issue 1:
> > >  We dynamically set the COLO feature for a PVM (2 cores, 16G memory),
> > > but the primary VM pauses for a long time (depending on memory size)
> > > while waiting for the SVM to start. Is there any way to reduce the
> > > pause time?
> > >
> >
> > Yes, we do have some ideas to optimize this downtime.
> >
> > The main problem in the current version is that, for each checkpoint, we
> > have to send all of the PVM's pages to the SVM, and then copy the whole
> > VM's state into the SVM from the ram cache; during this process, both of
> > them need to be paused.
> > Just as you said, the downtime depends on the memory size.
> >
> > So firstly, we need to reduce the amount of data sent at checkpoint
> > time. Actually, we can migrate parts of the PVM's dirty pages in the
> > background while both VMs are running, and load these pages into the ram
> > cache (backup memory) in the SVM temporarily. At checkpoint time, we
> > just send the PVM's last dirty pages to the slave side and then copy the
> > ram cache into the SVM. Further on, we don't have to send all of the
> > PVM's dirty pages; we can send only the pages dirtied by the PVM or SVM
> > between two checkpoints. (Because if a page is dirtied by neither the
> > PVM nor the SVM, its data stays the same in the SVM, the PVM, and the
> > backup memory.) This method reduces the time consumed in sending data.
> >
> > For the second problem, we can reduce the memory copy by two methods.
> > First, we don't have to copy all pages in the ram cache; we can copy
> > only the pages dirtied by the PVM and SVM since the last checkpoint.
> > Second, we can use the userfault missing function to reduce the time
> > consumed in the memory copy. (From the second checkpoint on, in theory,
> > we can reduce the time spent on the memory copy to the ms level.)
> >
> > You can find the first optimization in the attachment. It is based on an
> > old QEMU version (qemu-2.6); it should not be difficult to rebase it
> > onto master or your version. And please feel free to send a new version
> > to the community if you want ;)
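
A minimal, self-contained illustration of the "copy only pages dirtied by
PVM or SVM" rule described above (assumed helpers and data layout, not code
from the patch):

    #include <limits.h>

    #define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

    static int test_bit(unsigned long nr, const unsigned long *map)
    {
        return (map[nr / BITS_PER_LONG] >> (nr % BITS_PER_LONG)) & 1;
    }

    /* At checkpoint time, flush only pages marked dirty by either side;
     * untouched pages are already identical in PVM, SVM and the cache. */
    static void colo_flush_dirty_pages(const unsigned long *pvm_bmap,
                                       const unsigned long *svm_bmap,
                                       unsigned long npages,
                                       void (*copy_page)(unsigned long idx))
    {
        for (unsigned long i = 0; i < npages; i++) {
            if (test_bit(i, pvm_bmap) || test_bit(i, svm_bmap)) {
                copy_page(i);   /* ram cache -> SVM memory */
            }
        }
    }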
> >
> >
>
> Thanks Hailiang!
> By the way, do you have time to push the patches upstream?
> I think this is a better and faster option.
>
> Thanks
> Zhang Chen
>
> > >
> > > Issue 2:
> > >  In
> > > https://github.com/qemu/qemu/blob/master/migration/colo.c#L503,
> > > could we move start_vm() before Line 488? Because at f

The issues about architecture of the COLO checkpoint

2020-02-10 Thread Daniel Cho
Hi everyone,
 We have some issues with setting up the COLO feature. We hope somebody
could give us some advice.

Issue 1:
 We dynamically set the COLO feature for a PVM (2 cores, 16G memory), but
the primary VM pauses for a long time (depending on memory size) while
waiting for the SVM to start. Is there any way to reduce the pause time?


Issue 2:
 In https://github.com/qemu/qemu/blob/master/migration/colo.c#L503,
could we move start_vm() before Line 488? Because at the first checkpoint
the PVM waits for the SVM's reply, which causes the PVM to stop for a while.

 We set the COLO feature on a running VM, so we hope the running VM can
provide continuous service to users.
Do you have any suggestions for these issues?

Best regards,
Daniel Cho


Re: Network connection with COLO VM

2019-12-06 Thread Daniel Cho
Hi Dave,  Zhang,

Thanks for your help. I will try your recommendations.

Best regards,
Daniel Cho

Zhang, Chen wrote on Wed, Dec 4, 2019 at 4:32 PM:

>
> On 12/3/2019 9:25 PM, Dr. David Alan Gilbert wrote:
> > * Daniel Cho (daniel...@qnap.com) wrote:
> >> Hi Dave,
> >>
> >> We could use the existing interface to add the netfilter and chardev;
> >> it might not have the problem you said.
> >>
> >> However, if the netfilter and chardev must be on the primary from the
> >> start, does that mean we cannot dynamically set the COLO feature on a
> >> VM?
> > I wasn't expecting that to be possible - I'd expect you to be able
> > to start in a state that looks the same as a primary+failed secondary;
> > but I'm not sure.
>
> Current COLO (with Lukas's patch) can support dynamically setting up a
> COLO system.
>
> This status is the same as if the secondary had triggered a failover; the
> primary node needs to find a new secondary
>
> node to form a new COLO system.
>
>
> >> We tried changing this chardev to prevent the primary VM from getting
> >> stuck waiting for the secondary VM:
> >>
> >> -chardev socket,id=compare1,host=127.0.0.1,port=9004,server,wait \
> >>
> >> to
> >>
> >> -chardev socket,id=compare1,host=127.0.0.1,port=9004,server,nowait \
> >>
> >> But it makes the primary VM's network not work (it can't get an IP)
> >> until the connection with the secondary VM is established.
>
> I think you need to check whether port 9004 is already connected to the
> pair node.
>
> > I'm not sure of the answer to this; I've not tried doing it - I'm not
> > sure anyone has!
> > But, the colo components do track the state of tcp connections, so I'm
> > expecting that they have to already exist to have the state of those
> > connections available for when you start the secondary.
>
> Yes, you are right.
>
> In this state, we don't need to sync the state of TCP connections,
> because after a failover
>
> (or when running a solo COLO primary node), we have emptied all TCP
> connection state in the COLO module.
>
> In the process of building a new COLO pair, we will sync all the VM state
> to the secondary node and rebuild
>
> the connection tracking in the COLO module.
>
>
> >
> >
> >> Otherwise, the boot times of the primary VM with and without the
> >> netfilter / chardev are very different.
> >> Without netfilter / chardev: about 1 minute
> >> With netfilter / chardev: about 5 minutes
> >> Is this an issue?
> > that sounds like it needs investigating.
> >
> > Dave
>
> Yes. In previous COLO use cases, we needed to make the primary node and
> secondary node boot at the same time.
>
> I didn't expect such a big difference from the netfilter/chardev.
>
> I think you can try without the netfilter/chardev and, after the VM
> boots, rebuild the netfilter/chardev-related setup with the chardev in
> server nowait mode.
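
For reference, the primary-side mirror chardev and filter look like the
following, along the lines of docs/COLO-FT.txt; switching the socket to
nowait, as suggested above, keeps the VM from blocking on the peer (IDs and
the address are illustrative):

    -chardev socket,id=mirror0,host=3.3.3.3,port=9003,server,nowait \
    -object filter-mirror,id=m0,netdev=hn0,queue=tx,outdev=mirror0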
>
>
> Thanks
>
> Zhang Chen
>
> >
> >> Best regards,
> >> Daniel Cho
> >>
> >>
> >> Dr. David Alan Gilbert wrote on Mon, Dec 2, 2019 at 5:58 PM:
> >>
> >>> * Daniel Cho (daniel...@qnap.com) wrote:
> >>>> Hi Zhang,
> >>>>
> >>>> We use qemu-4.1.0 release on this case.
> >>>>
> >>>> I think we need to use block mirror to sync the disk to the
> >>>> secondary node first, then stop the primary VM and build the COLO
> >>>> system.
> >>>>
> >>>> At the stop moment, you need to add some netfilter and chardev
> >>>> socket nodes for COLO; maybe you need to re-check this part.
> >>>>
> >>>>
> >>>> Our test already followed those steps. Maybe I could describe the
> >>>> test flow and issues in detail.
> >>>>
> >>>>
> >>>> Step 1:
> >>>>
> >>>> Create the primary VM without any netfilter and chardev for COLO,
> >>>> and use another host to ping the primary VM continually.
> >>>>
> >>>>
> >>>> Step 2:
> >>>>
> >>>> Create the secondary VM (the same devices/drives as the primary VM)
> >>>> and do a block mirror sync ( pinging the primary VM works normally )
> >>>>
> >>>>
> >>>> Step 3:
> >>>>
> >>>> After the block mirror sync finishes, add the netfilter and chardev
> >>>> to the primary VM and secondary VM for COLO ( *can't* ping the
> >>>> primary VM but those pac

Re: Network connection with COLO VM

2019-12-03 Thread Daniel Cho
Hi Dave,

We could use the existing interface to add the netfilter and chardev; it
might not have the problem you said.

However, if the netfilter and chardev must be on the primary from the
start, does that mean we cannot dynamically set the COLO feature on a VM?

We tried changing this chardev to prevent the primary VM from getting
stuck waiting for the secondary VM:

-chardev socket,id=compare1,host=127.0.0.1,port=9004,server,wait \

to

-chardev socket,id=compare1,host=127.0.0.1,port=9004,server,nowait \

But it makes the primary VM's network not work (it can't get an IP) until
the connection with the secondary VM is established.


Otherwise, the boot times of the primary VM with and without the
netfilter / chardev are very different:
Without netfilter / chardev: about 1 minute
With netfilter / chardev: about 5 minutes
Is this an issue?

Best regards,
Daniel Cho


Dr. David Alan Gilbert wrote on Mon, Dec 2, 2019 at 5:58 PM:

> * Daniel Cho (daniel...@qnap.com) wrote:
> > Hi Zhang,
> >
> > We use the qemu-4.1.0 release in this case.
> >
> > I think we need to use block mirror to sync the disk to the secondary
> > node first, then stop the primary VM and build the COLO system.
> >
> > At the stop moment, you need to add some netfilter and chardev socket
> > nodes for COLO; maybe you need to re-check this part.
> >
> >
> > Our test already followed those steps. Maybe I could describe the test
> > flow and issues in detail.
> >
> >
> > Step 1:
> >
> > Create the primary VM without any netfilter and chardev for COLO, and
> > use another host to ping the primary VM continually.
> >
> >
> > Step 2:
> >
> > Create the secondary VM (the same devices/drives as the primary VM) and
> > do a block mirror sync ( pinging the primary VM works normally )
> >
> >
> > Step 3:
> >
> > After the block mirror sync finishes, add the netfilter and chardev to
> > the primary VM and secondary VM for COLO ( *can't* ping the primary VM,
> > but those packets are received later )
> >
> >
> > Step 4:
> >
> > Start migrating the primary VM to the secondary VM; the primary VM &
> > secondary VM are then both running ( pinging the primary VM works, and
> > the packets queued in step 3 are received )
> >
> >
> >
> >
> > Between step 3 and step 4, it takes 10~20 seconds in our environment.
> >
> > I can imagine this issue (delayed reply packets) is because the COLO
> > proxy is being set up for the transitional state,
> >
> > but we thought 10~20 seconds might be a little long. (If the primary VM
> > is already doing some jobs, it might lose data.)
> >
> >
> > Could we reduce that time, or does the delay depend on the particular
> > VM?
>
> I think you need to set up the netfilter and chardev on the primary at
> the start;  the filter contains the state of the TCP connections working
> with the VM, so adding it later can't gain that state for existing
> connections.
>
> Dave
>
> >
> > Best regards,
> >
> > Daniel Cho.
> >
> >
> >
> > Zhang, Chen wrote on Sat, Nov 30, 2019 at 2:04 AM:
> >
> > >
> > >
> > >
> > >
> > > *From:* Daniel Cho 
> > > *Sent:* Friday, November 29, 2019 10:43 AM
> > > *To:* Zhang, Chen 
> > > *Cc:* Dr. David Alan Gilbert ;
> lukasstra...@web.de;
> > > qemu-devel@nongnu.org
> > > *Subject:* Re: Network connection with COLO VM
> > >
> > >
> > >
> > > Hi David,  Zhang,
> > >
> > >
> > >
> > > Thanks for replying my question.
> > >
> > > We know why will occur this issue.
> > >
> > > As you said, the COLO VM's network needs
> > >
> > > colo-proxy to control packets, so the guest's
> > >
> > > interface should set the filter to solve the problem.
> > >
> > >
> > >
> > > But we found another question, when we set the
> > >
> > > fault-tolerance feature to guest (primary VM is running,
> > >
> > > secondary VM is pausing), the guest's network would not
> > >
> > > responds any request for a while (in our environment
> > >
> > > about 20~30 secs) after secondary VM runs.
> > >
> > >
> > >
> > > Does it be a normal situation, or a known issue?
> > >
> > >
> > >
> > > Our test is creating primary VM for a while, then creating
> > >
> > > secondary VM to make it with COLO feature.
> > >
> > >
> > >
> > > Hi Daniel,
> > >
> > >
> > >
> > > Happy to hear you have solved s

Re: Network connection with COLO VM

2019-12-01 Thread Daniel Cho
Hi Zhang,

We use the qemu-4.1.0 release in this case.

I think we need to use block mirror to sync the disk to the secondary node
first, then stop the primary VM and build the COLO system.

At the stop moment, you need to add some netfilter and chardev socket
nodes for COLO; maybe you need to re-check this part.


Our test already followed those steps. Maybe I could describe the test
flow and issues in detail.


Step 1:

Create the primary VM without any netfilter and chardev for COLO, and use
another host to ping the primary VM continually.


Step 2:

Create the secondary VM (the same devices/drives as the primary VM) and do
a block mirror sync ( pinging the primary VM works normally ); a
drive-mirror sketch follows below.


Step 3:

After the block mirror sync finishes, add the netfilter and chardev to the
primary VM and secondary VM for COLO ( *can't* ping the primary VM, but
those packets are received later ).


Step 4:

Start migrating the primary VM to the secondary VM; the primary VM &
secondary VM are then both running ( pinging the primary VM works, and the
packets queued in step 3 are received ).




Between step 3 and step 4, it takes 10~20 seconds in our environment.

I can imagine this issue (delayed reply packets) is because the COLO proxy
is being set up for the transitional state,

but we thought 10~20 seconds might be a little long. (If the primary VM is
already doing some jobs, it might lose data.)


Could we reduce that time, or does the delay depend on the particular VM?
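
For reference, the block mirror sync in step 2 is typically driven with a
QMP drive-mirror job to an NBD export on the secondary node, along the
lines of the example in docs/COLO-FT.txt (the device name, job id, and host
address below are illustrative):

    { "execute": "drive-mirror",
      "arguments": {
          "device": "colo-disk0",
          "job-id": "resync",
          "target": "nbd://secondary_host_ip:8889/colo-disk0",
          "mode": "existing",
          "format": "raw",
          "sync": "full" } }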


Best regards,

Daniel Cho.



Zhang, Chen wrote on Sat, Nov 30, 2019 at 2:04 AM:

>
>
>
>
> *From:* Daniel Cho 
> *Sent:* Friday, November 29, 2019 10:43 AM
> *To:* Zhang, Chen 
> *Cc:* Dr. David Alan Gilbert ; lukasstra...@web.de;
> qemu-devel@nongnu.org
> *Subject:* Re: Network connection with COLO VM
>
>
>
> Hi David,  Zhang,
>
>
>
> Thanks for replying to my question.
>
> We now know why this issue occurs.
>
> As you said, the COLO VM's network needs
>
> the colo-proxy to control packets, so the guest's
>
> interface should have the filter set to solve the problem.
>
>
>
> But we found another issue: when we set the
>
> fault-tolerance feature on a guest (primary VM running,
>
> secondary VM paused), the guest's network does not
>
> respond to any request for a while (in our environment,
>
> about 20~30 secs) after the secondary VM starts running.
>
>
>
> Is this a normal situation, or a known issue?
>
>
>
> Our test creates the primary VM and runs it for a while, then creates the
>
> secondary VM to enable the COLO feature.
>
>
>
> Hi Daniel,
>
>
>
> Happy to hear you have solved the ssh disconnection issue.
>
>
>
> Did you use Lukas's patch in this case?
>
> I think we need to use block mirror to sync the disk to the secondary
> node first, then stop the primary VM and build the COLO system.
>
> At the stop moment, you need to add some netfilter and chardev socket
> nodes for COLO; maybe you need to re-check this part.
>
>
>
> Best regards,
>
> Daniel Cho
>
>
>
> Zhang, Chen wrote on Thu, Nov 28, 2019 at 9:26 AM:
>
>
>
> > -Original Message-
> > From: Dr. David Alan Gilbert 
> > Sent: Wednesday, November 27, 2019 6:51 PM
> > To: Daniel Cho ; Zhang, Chen
> > ; lukasstra...@web.de
> > Cc: qemu-devel@nongnu.org
> > Subject: Re: Network connection with COLO VM
> >
> > * Daniel Cho (daniel...@qnap.com) wrote:
> > > Hello everyone,
> > >
> > > Can we ssh to a COLO VM (meaning both the PVM & SVM are running)?
> > >
> >
> > Let's cc in Zhang Chen and Lukas Straub.
>
> Thanks Dave.
>
> >
> > > SSH will connect to colo VM for a while, but it will disconnect with
> > > error
> > > *client_loop: send disconnect: Broken pipe*
> > >
> > > It seems the COLO VM cannot keep a network session alive.
> >
> > > Is this a known issue?
> >
> > That sounds like the COLO proxy is getting upset; it's supposed to
> > compare packets sent by the primary and secondary and only send one to
> > the outside - you shouldn't be talking directly to the guest, but
> > always via the proxy.  See docs/colo-proxy.txt
> >
>
> Hi Daniel,
>
> I have tried ssh to a COLO guest for 8 hours and this issue did not occur.
> Please check your network/qemu configuration.
> But I found another problem that may be related to this issue: if there is
> no network communication for a period of time (maybe 10 min), the first
> message sent to the guest can be delayed (maybe 1-5 sec). I will try to
> fix it when I have time.
>
> Thanks
> Zhang Chen
>
> > Dave
> >
> > > Best regards,
> > > Daniel Cho
> > --
> > Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
>
>


Re: Network connection with COLO VM

2019-11-28 Thread Daniel Cho
Hi David,  Zhang,

Thanks for replying to my question.
We now know why this issue occurs.
As you said, the COLO VM's network needs
the colo-proxy to control packets, so the guest's
interface should have the filter set to solve the problem.

But we found another issue: when we set the
fault-tolerance feature on a guest (primary VM running,
secondary VM paused), the guest's network does not
respond to any request for a while (in our environment,
about 20~30 secs) after the secondary VM starts running.

Is this a normal situation, or a known issue?

Our test creates the primary VM and runs it for a while, then creates the
secondary VM to enable the COLO feature.

Best regards,
Daniel Cho

Zhang, Chen wrote on Thu, Nov 28, 2019 at 9:26 AM:

>
>
> > -Original Message-
> > From: Dr. David Alan Gilbert 
> > Sent: Wednesday, November 27, 2019 6:51 PM
> > To: Daniel Cho ; Zhang, Chen
> > ; lukasstra...@web.de
> > Cc: qemu-devel@nongnu.org
> > Subject: Re: Network connection with COLO VM
> >
> > * Daniel Cho (daniel...@qnap.com) wrote:
> > > Hello everyone,
> > >
> > > Can we ssh to a COLO VM (meaning both the PVM & SVM are running)?
> > >
> >
> > Let's cc in Zhang Chen and Lukas Straub.
>
> Thanks Dave.
>
> >
> > > SSH will connect to colo VM for a while, but it will disconnect with
> > > error
> > > *client_loop: send disconnect: Broken pipe*
> > >
> > > It seems the COLO VM cannot keep a network session alive.
> >
> > > Is this a known issue?
> >
> > That sounds like the COLO proxy is getting upset; it's supposed to
> > compare packets sent by the primary and secondary and only send one to
> > the outside - you shouldn't be talking directly to the guest, but
> > always via the proxy.  See docs/colo-proxy.txt
> >
>
> Hi Daniel,
>
> I have tried ssh to a COLO guest for 8 hours and this issue did not occur.
> Please check your network/qemu configuration.
> But I found another problem that may be related to this issue: if there is
> no network communication for a period of time (maybe 10 min), the first
> message sent to the guest can be delayed (maybe 1-5 sec). I will try to
> fix it when I have time.
>
> Thanks
> Zhang Chen
>
> > Dave
> >
> > > Best regards,
> > > Daniel Cho
> > --
> > Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
>
>


Network connection with COLO VM

2019-11-26 Thread Daniel Cho
Hello everyone,

Can we ssh to a COLO VM (meaning both the PVM & SVM are running)?

SSH connects to the COLO VM for a while, but then it disconnects with the
error *client_loop: send disconnect: Broken pipe*.

It seems the COLO VM cannot keep a network session alive.

Is this a known issue?

Best regards,
Daniel Cho


Re: The problems about COLO

2019-11-07 Thread Daniel Cho
Lukas Straub wrote on Thu, Nov 7, 2019 at 9:34 PM:

> On Thu, 7 Nov 2019 16:14:43 +0800
> Daniel Cho  wrote:
>
> > Hi  Lukas,
> > Thanks for your reply.
> >
> > However, when we tested question 1 with the steps below, we noticed
> > that the secondary VM's image
> > breaks while it reboots.
> > Here is the error message.
> > ---
> > [1.280299] XFS (sda1): Mounting V5 Filesystem
> > [1.428418] input: ImExPS/2 Generic Explorer Mouse as
> > /devices/platform/i8042/serio1/input/input2
> > [1.501320] XFS (sda1): Starting recovery (logdev: internal)
> > [1.504076] tsc: Refined TSC clocksource calibration: 3492.211 MHz
> > [1.505534] Switched to clocksource tsc
> > [2.031027] XFS (sda1): Internal error XFS_WANT_CORRUPTED_GOTO at line
> > 1635 of file fs/xfs/libxfs/xfs_alloc.c.  Caller
> xfs_free_extent+0xfc/0x130
> > [xfs]
> > [2.032743] CPU: 0 PID: 300 Comm: mount Not tainted
> > 3.10.0-693.11.6.el7.x86_64 #1
> > [2.033982] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS
> > rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
> > [2.035882] Call Trace:
> > [2.036494]  [] dump_stack+0x19/0x1b
> > [2.037315]  [] xfs_error_report+0x3b/0x40 [xfs]
> > [2.038150]  [] ? xfs_free_extent+0xfc/0x130 [xfs]
> > [2.039046]  [] xfs_free_ag_extent+0x20a/0x780 [xfs]
> > [2.039920]  [] xfs_free_extent+0xfc/0x130 [xfs]
> > [2.040768]  [] xfs_trans_free_extent+0x26/0x60
> [xfs]
> > [2.041642]  [] xlog_recover_process_efi+0x17e/0x1c0
> > [xfs]
> > [2.042558]  []
> > xlog_recover_process_efis.isra.30+0x77/0xe0 [xfs]
> > [2.043771]  [] xlog_recover_finish+0x21/0xb0 [xfs]
> > [2.044650]  [] xfs_log_mount_finish+0x34/0x50 [xfs]
> > [2.045518]  [] xfs_mountfs+0x5d1/0x8b0 [xfs]
> > [2.046341]  [] ?
> xfs_filestream_get_parent+0x80/0x80
> > [xfs]
> > [2.047260]  [] xfs_fs_fill_super+0x3bb/0x4d0 [xfs]
> > [2.048116]  [] mount_bdev+0x1b0/0x1f0
> > [2.048881]  [] ?
> > xfs_test_remount_options.isra.11+0x70/0x70 [xfs]
> > [2.050105]  [] xfs_fs_mount+0x15/0x20 [xfs]
> > [2.050906]  [] mount_fs+0x39/0x1b0
> > [2.051963]  [] ? __alloc_percpu+0x15/0x20
> > [2.059431]  [] vfs_kern_mount+0x67/0x110
> > [2.060283]  [] do_mount+0x233/0xaf0
> > [2.061081]  [] ? strndup_user+0x4b/0xa0
> > [2.061844]  [] SyS_mount+0x96/0xf0
> > [2.062619]  [] system_call_fastpath+0x16/0x1b
> > [2.063512] XFS (sda1): Internal error xfs_trans_cancel at line 984 of
> > file fs/xfs/xfs_trans.c.  Caller xlog_recover_process_efi+0x18e/0x1c0
> [xfs]
> > [2.065260] CPU: 0 PID: 300 Comm: mount Not tainted
> > 3.10.0-693.11.6.el7.x86_64 #1
> > [2.066489] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS
> > rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
> > [2.068023] Call Trace:
> > [2.068590]  [] dump_stack+0x19/0x1b
> > [2.069403]  [] xfs_error_report+0x3b/0x40 [xfs]
> > [2.070318]  [] ?
> xlog_recover_process_efi+0x18e/0x1c0
> > [xfs]
> > [2.071538]  [] xfs_trans_cancel+0xbd/0xe0 [xfs]
> > [2.072429]  [] xlog_recover_process_efi+0x18e/0x1c0
> > [xfs]
> > [2.073339]  []
> > xlog_recover_process_efis.isra.30+0x77/0xe0 [xfs]
> > [2.074561]  [] xlog_recover_finish+0x21/0xb0 [xfs]
> > [2.075421]  [] xfs_log_mount_finish+0x34/0x50 [xfs]
> > [2.076301]  [] xfs_mountfs+0x5d1/0x8b0 [xfs]
> > [2.077128]  [] ?
> xfs_filestream_get_parent+0x80/0x80
> > [xfs]
> > [2.078049]  [] xfs_fs_fill_super+0x3bb/0x4d0 [xfs]
> > [2.078900]  [] mount_bdev+0x1b0/0x1f0
> > [2.079667]  [] ?
> > xfs_test_remount_options.isra.11+0x70/0x70 [xfs]
> > [2.080883]  [] xfs_fs_mount+0x15/0x20 [xfs]
> > [2.081687]  [] mount_fs+0x39/0x1b0
> > [2.082457]  [] ? __alloc_percpu+0x15/0x20
> > [2.083258]  [] vfs_kern_mount+0x67/0x110
> > [2.084057]  [] do_mount+0x233/0xaf0
> > [2.084797]  [] ? strndup_user+0x4b/0xa0
> > [2.085568]  [] SyS_mount+0x96/0xf0
> > [2.086324]  [] system_call_fastpath+0x16/0x1b
> > [2.087161] XFS (sda1): xfs_do_force_shutdown(0x8) called from line
> 985
> > of file fs/xfs/xfs_trans.c.  Return address = 0xc0195966
> > [2.088795] XFS (sda1): Corruption of in-memory data detected.
> Shutting
> > down filesystem
> > [2.090273] XFS (sda1): Please umount the filesystem and rectify the
> > problem(s)

The problems about COLO

2019-10-31 Thread Daniel Cho
Hello all,
I have some questions about the COLO.
1)  Can we dynamically set the fault tolerance feature on a running VM?
In your documentation, the primary VM cannot start alone (if you start the
primary VM, the secondary VM also needs to start); that means if I want a
VM with the fault-tolerance feature, it has to be set up when we boot the
VM.

2)  If the primary VM or secondary VM breaks, can we start a third VM to
keep the fault tolerance feature?


Best regards,
Daniel Cho.