Re: [PATCH 3/7] Fix SVM hang when doing failover before PVM crash

2021-06-16 Thread lizhij...@fujitsu.com


On 17/06/2021 10:47, Lei Rao wrote:
> From: "Rao, Lei" 
>
> This patch fixes the hang shown in the following backtrace:
>  Thread 1 (Thread 0x7f34ee738d80 (LWP 11212)):
> #0 __pthread_clockjoin_ex (threadid=139847152957184, 
> thread_return=0x7f30b1febf30, clockid=<optimized out>, abstime=<optimized out>, block=<optimized out>) at pthread_join_common.c:145
>  #1 0x563401998e36 in qemu_thread_join (thread=0x563402d66610) at 
> util/qemu-thread-posix.c:587
>  #2 0x5634017a79fa in process_incoming_migration_co (opaque=0x0) at 
> migration/migration.c:502
>  #3 0x5634019b59c9 in coroutine_trampoline (i0=63395504, i1=22068) at 
> util/coroutine-ucontext.c:115
>  #4 0x7f34ef860660 in ?? () at 
> ../sysdeps/unix/sysv/linux/x86_64/__start_context.S:91 from 
> /lib/x86_64-linux-gnu/libc.so.6
>  #5 0x7f30b21ee730 in ?? ()
>  #6 0x in ?? ()
>
>  Thread 13 (Thread 0x7f30b3dff700 (LWP 11747)):
>  #0  __lll_lock_wait (futex=futex@entry=0x56340218ffa0 
> , private=0) at lowlevellock.c:52
> #1  0x7f34efa000a3 in __GI___pthread_mutex_lock (mutex=0x56340218ffa0 
> ) at ../nptl/pthread_mutex_lock.c:80
>  #2  0x563401997f99 in qemu_mutex_lock_impl (mutex=0x56340218ffa0 
> , file=0x563401b7a80e "migration/colo.c", line=806) at 
> util/qemu-thread-posix.c:78
>  #3  0x563401407144 in qemu_mutex_lock_iothread_impl 
> (file=0x563401b7a80e "migration/colo.c", line=806) at 
> /home/workspace/colo-qemu/cpus.c:1899
>  #4  0x5634017ba8e8 in colo_process_incoming_thread 
> (opaque=0x563402d664c0) at migration/colo.c:806
>  #5  0x563401998b72 in qemu_thread_start (args=0x5634039f8370) at 
> util/qemu-thread-posix.c:519
>  #6  0x7f34ef9fd609 in start_thread (arg=) at 
> pthread_create.c:477
>  #7  0x7f34ef924293 in clone () at 
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
>
>  The QEMU main thread is holding the lock:
>  (gdb) p qemu_global_mutex
> $1 = {lock = {__data = {lock = 2, __count = 0, __owner = 11212, __nusers 
> = 9, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 
> 0x0}},
>   __size = "\002\000\000\000\000\000\000\000\314+\000\000\t", '\000' 
> , __align = 2}, file = 0x563401c07e4b "util/main-loop.c", 
> line = 240,
>  initialized = true}
>
>  From the call trace, we can see this is a deadlock: the QEMU main
> thread holds the global mutex while waiting for the COLO thread to end,
> and the COLO thread wants to acquire the global mutex. So we should
> release the qemu_global_mutex before waiting for the COLO thread to end.
>
> Signed-off-by: Lei Rao 
Reviewed-by: Li Zhijian 


> ---
>   migration/migration.c | 2 ++
>   1 file changed, 2 insertions(+)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index c2c84c7..6debb8b 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -593,8 +593,10 @@ static void process_incoming_migration_co(void *opaque)
>   mis->have_colo_incoming_thread = true;
>   qemu_coroutine_yield();
>   
> +qemu_mutex_unlock_iothread();
>   /* Wait checkpoint incoming thread exit before free resource */
>   qemu_thread_join(&mis->colo_incoming_thread);
> +qemu_mutex_lock_iothread();
>   /* We hold the global iothread lock, so it is safe here */
>   colo_release_ram_cache();
>   }


Re: [PATCH v3] migration/rdma: Fix out of order wrid

2021-10-26 Thread lizhij...@fujitsu.com
ping again


On 18/10/2021 18:18, Li, Zhijian/李 智坚 wrote:
> ping
>
>
> On 27/09/2021 15:07, Li Zhijian wrote:
>> destination:
>> ../qemu/build/qemu-system-x86_64 -enable-kvm -netdev 
>> tap,id=hn0,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown -device 
>> e1000,netdev=hn0,mac=50:52:54:00:11:22 -boot c -drive 
>> if=none,file=./Fedora-rdma-server-migration.qcow2,id=drive-virtio-disk0 
>> -device 
>> virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 
>> -m 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet -monitor stdio -vga 
>> qxl -spice streaming-video=filter,port=5902,disable-ticketing -incoming 
>> rdma:192.168.22.23:
>> qemu-system-x86_64: -spice 
>> streaming-video=filter,port=5902,disable-ticketing: warning: short-form 
>> boolean option 'disable-ticketing' deprecated
>> Please use disable-ticketing=on instead
>> QEMU 6.0.50 monitor - type 'help' for more information
>> (qemu) trace-event qemu_rdma_block_for_wrid_miss on
>> (qemu) dest_init RDMA Device opened: kernel name rxe_eth0 uverbs device name 
>> uverbs2, infiniband_verbs class device path 
>> /sys/class/infiniband_verbs/uverbs2, infiniband class device path 
>> /sys/class/infiniband/rxe_eth0, transport: (2) Ethernet
>> qemu_rdma_block_for_wrid_miss A Wanted wrid CONTROL SEND (2000) but got 
>> CONTROL RECV (4000)
>>
>> source:
>> ../qemu/build/qemu-system-x86_64 -enable-kvm -netdev 
>> tap,id=hn0,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown -device 
>> e1000,netdev=hn0,mac=50:52:54:00:11:22 -boot c -drive 
>> if=none,file=./Fedora-rdma-server.qcow2,id=drive-virtio-disk0 -device 
>> virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 
>> -m 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet -monitor stdio -vga 
>> qxl -spice streaming-video=filter,port=5901,disable-ticketing -S
>> qemu-system-x86_64: -spice 
>> streaming-video=filter,port=5901,disable-ticketing: warning: short-form 
>> boolean option 'disable-ticketing' deprecated
>> Please use disable-ticketing=on instead
>> QEMU 6.0.50 monitor - type 'help' for more information
>> (qemu)
>> (qemu) trace-event qemu_rdma_block_for_wrid_miss on
>> (qemu) migrate -d rdma:192.168.22.23:
>> source_resolve_host RDMA Device opened: kernel name rxe_eth0 uverbs device 
>> name uverbs2, infiniband_verbs class device path 
>> /sys/class/infiniband_verbs/uverbs2, infiniband class device path 
>> /sys/class/infiniband/rxe_eth0, transport: (2) Ethernet
>> (qemu) qemu_rdma_block_for_wrid_miss A Wanted wrid WRITE RDMA (1) but got 
>> CONTROL RECV (4000)
>>
>> NOTE: we use soft RoCE as the rdma device.
>> [root@iaas-rpma images]# rdma link show rxe_eth0/1
>> link rxe_eth0/1 state ACTIVE physical_state LINK_UP netdev eth0
>>
>> This migration cannot complete when an out-of-order (OOO) CQ event occurs.
>> The send queue and receive queue share the same completion queue, and
>> qemu_rdma_block_for_wrid() drops the CQEs it is not interested in. But a
>> CQE dropped by qemu_rdma_block_for_wrid() may be one it wants later, in
>> which case qemu_rdma_block_for_wrid() blocks forever.
>>
>> OOO cases can occur on both the source side and the destination side, but
>> blocking forever happens only when SEND and RECV are out of order; OOO
>> between 'WRITE RDMA' and 'RECV' doesn't matter.
>>
>> Below is the OOO sequence:
>>  source destination
>> rdma_write_one()   qemu_rdma_registration_handle()
>> 1.S1: post_recv XD1: post_recv Y
>> 2.wait for recv CQ event X
>> 3.   D2: post_send X ---+
>> 4.   wait for send CQ send event X (D2) |
>> 5.recv CQ event X reaches (D2)  |
>> 6.  +-S2: post_send Y   |
>> 7.  | wait for send CQ event Y  |
>> 8.  |recv CQ event Y (S2) (drop it) |
>> 9.  +-send CQ event Y reaches (S2)  |
>> 10.  send CQ event X reaches (D2)  -+
>> 11.  wait recv CQ event Y (dropped by (8))
>>
>> Although hardware IB works fine in my hundreds of runs, the IB
>> specification doesn't guarantee the CQ order in such a case.
>>
>> Here we introduce an independent send completion queue to separate
>> ibv_post_send completions from the original mixed completion queue.
>> It lets us poll for the specific CQE we are really interested in.
>>
>> Signed-off-by: Li Zhijian 
>> ---
>> V3: rebase code, and combine 2/2 to 1/2
>> V2: Introduce send completion queue
>> ---
>>migration/rdma.c | 132 +++
>>1 file changed, 98 insertions(+), 34 deletions(-)
>>
>> diff --git a/migration/rdma.c b/migration/rdma.c
>> index 5c2d113aa

Re: [PATCH v3] migration/rdma: Fix out of order wrid

2021-10-28 Thread lizhij...@fujitsu.com


On 28/10/2021 23:17, Dr. David Alan Gilbert wrote:
> * Li Zhijian (lizhij...@cn.fujitsu.com) wrote:
>
> Apologies for taking so long.
It's okay :), thanks for your review.

>
>>   /*
>> - * Completion queue can be filled by both read and write work requests,
>> - * so must reflect the sum of both possible queue sizes.
>> + * Completion queue can be filled by read work requests.
>>*/
>> -rdma->cq = ibv_create_cq(rdma->verbs, (RDMA_SIGNALED_SEND_MAX * 3),
>> -NULL, rdma->comp_channel, 0);
>> -if (!rdma->cq) {
>> +rdma->recv_cq = ibv_create_cq(rdma->verbs, (RDMA_SIGNALED_SEND_MAX * 3),
>> +  NULL, rdma->recv_comp_channel, 0);
>> +if (!rdma->recv_cq) {
>> +error_report("failed to allocate completion queue");
> Minor: It would be good to make this different from the error below;
> e.g. 'failed to allocate receive completion queue'

Good catch, I will amend them soon.


>
>> +goto err_alloc_pd_cq;
>> +}
>> +
>> +/* create send completion channel */
>> +rdma->send_comp_channel = ibv_create_comp_channel(rdma->verbs);
>> +if (!rdma->send_comp_channel) {
>> +error_report("failed to allocate completion channel");
>> +goto err_alloc_pd_cq;
>> +}
>> +
>> +rdma->send_cq = ibv_create_cq(rdma->verbs, (RDMA_SIGNALED_SEND_MAX * 3),
>> +  NULL, rdma->send_comp_channel, 0);
>> +if (!rdma->send_cq) {
>>   error_report("failed to allocate completion queue");
>>   goto err_alloc_pd_cq;
>>   }
>> @@ -1083,11 +1098,19 @@ err_alloc_pd_cq:
>>   if (rdma->pd) {
>>   ibv_dealloc_pd(rdma->pd);
>>   }
>> -if (rdma->comp_channel) {
>> -ibv_destroy_comp_channel(rdma->comp_channel);
>> +if (rdma->recv_comp_channel) {
>> +ibv_destroy_comp_channel(rdma->recv_comp_channel);
>> +}
>> +if (rdma->send_comp_channel) {
>> +ibv_destroy_comp_channel(rdma->send_comp_channel);
>> +}
>> +if (rdma->recv_cq) {
>> +ibv_destroy_cq(rdma->recv_cq);
>> +rdma->recv_cq = NULL;
>>   }
> Don't you need to destroy the send_cq as well?

We don't need to, since send_cq is the last element we allocate; that means
send_cq will always be NULL if the code reaches here.

Thanks
Zhijian

>
> (Other than that I think it's fine)
>
> Dave
>
>


Re: [PATCH 2/2] migration: allow enabling multifd for specific protocol only

2021-07-18 Thread lizhij...@fujitsu.com
There was a typo: s/protocal/protocol


From: Li Zhijian 
Sent: July 16, 2021 15:59
To: quint...@redhat.com; dgilb...@redhat.com
Cc: qemu-devel@nongnu.org; Li, Zhijian/李 智坚
Subject: [PATCH 2/2] migration: allow enabling multifd for specific protocol only

And change the default to true so that '-incoming defer' can enable
multifd first.

Signed-off-by: Li Zhijian 
---
 migration/migration.c | 8 
 migration/multifd.c   | 2 +-
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/migration/migration.c b/migration/migration.c
index d6d48f6999b..bcc8b3bcb92 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1249,6 +1249,14 @@ static bool migrate_caps_check(bool *cap_list,
 }
 }

+/* incoming side only */
+if (runstate_check(RUN_STATE_INMIGRATE) &&
+!migrate_multifd_is_allowed() &&
+cap_list[MIGRATION_CAPABILITY_MULTIFD]) {
+error_setg(errp, "multifd is not supported by current protocol");
+return false;
+}
+
 return true;
 }

diff --git a/migration/multifd.c b/migration/multifd.c
index b3d99c79d83..372f3633eda 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -864,7 +864,7 @@ cleanup:
 multifd_new_send_channel_cleanup(p, sioc, local_err);
 }

-static bool migrate_allow_multifd;
+static bool migrate_allow_multifd = true;
 void migrate_protocal_allow_multifd(bool allow)
 {
 migrate_allow_multifd = allow;
--
2.31.1



Re: [PATCH 0/2] enable fsdax rdma migration

2021-08-15 Thread lizhij...@fujitsu.com

ping...

Hey Dave, could you help to take a look :)

Thanks
Zhijian


On 31/07/2021 22:03, Li Zhijian wrote:
> Previously, QEMU faced two problems when migrating an fsdax memory backend
> with the RDMA protocol:
> (1) ibv_reg_mr failed with Operation not supported
> (2) the requester (source) side could receive an RNR NAK.
>
> For (1), we can try to register the memory region with the ODP feature,
> which has already been implemented in some modern HCA hardware/drivers.
> For (2), IB provides an advise API to prefetch pages in a specific memory
> region. It helps the driver reduce page faults on the responder
> (destination) side during RDMA_WRITE.
>
> Li Zhijian (2):
>migration/rdma: Try to register On-Demand Paging memory region
>migration/rdma: advise prefetch write for ODP region
>
>   migration/rdma.c   | 67 --
>   migration/trace-events |  2 ++
>   2 files changed, 60 insertions(+), 9 deletions(-)
>


Re: [PATCH 2/2] migration/rdma: advise prefetch write for ODP region

2021-08-22 Thread lizhij...@fujitsu.com
Hi Marcel


On 22/08/2021 16:39, Marcel Apfelbaum wrote:
> Hi,
>
> On Sat, Jul 31, 2021 at 5:03 PM Li Zhijian  wrote:
>> A responder MR registered with ODP will send an RNR NAK back to
>> the requester when it hits a page fault.
>> -
>> ibv_poll_cq wc.status=13 RNR retry counter exceeded!
>> ibv_poll_cq wrid=WRITE RDMA!
>> -
>> ibv_advise_mr(3) helps make pages present before the actual IO is
>> conducted so that the responder page-faults as little as possible.
>>
>> Signed-off-by: Li Zhijian 
>> ---
>>   migration/rdma.c   | 40 
>>   migration/trace-events |  1 +
>>   2 files changed, 41 insertions(+)
>>
>> diff --git a/migration/rdma.c b/migration/rdma.c
>> index 8784b5f22a6..a2ad00d665f 100644
>> --- a/migration/rdma.c
>> +++ b/migration/rdma.c
>> @@ -1117,6 +1117,30 @@ static int qemu_rdma_alloc_qp(RDMAContext *rdma)
>>   return 0;
>>   }
>>
>> +/*
>> + * ibv_advise_mr to avoid RNR NAK errors as far as possible.
>> + * A responder MR registered with ODP will send an RNR NAK back to
>> + * the requester when it hits a page fault.
>> + */
>> +static void qemu_rdma_advise_prefetch_write_mr(struct ibv_pd *pd, uint64_t 
>> addr,
>> +   uint32_t len,  uint32_t lkey,
>> +   const char *name, bool wr)
>> +{
>> +int ret;
>> +int advice = wr ? IBV_ADVISE_MR_ADVICE_PREFETCH_WRITE :
>> + IBV_ADVISE_MR_ADVICE_PREFETCH;
>> +struct ibv_sge sg_list = {.lkey = lkey, .addr = addr, .length = len};
>> +
>> +ret = ibv_advise_mr(pd, advice,
>> +IB_UVERBS_ADVISE_MR_FLAG_FLUSH, &sg_list, 1);
>> +/* ignore the error */
> Following 
> https://github.com/linux-rdma/rdma-core/blob/master/libibverbs/man/ibv_advise_mr.3.md
> it looks like it is a best-effort optimization,
> I don't see any down-sides to it.
> However it seems like it is recommended to use
> IBV_ADVISE_MR_FLAG_FLUSH in order to
> increase the optimization chances.
Good catch, I will update it soon.


Thanks

>
> Anyway
>
> Reviewed-by: Marcel Apfelbaum 
>
> Thanks,
> Marcel
>
>


Re: [PATCH 1/2] migration/rdma: Try to register On-Demand Paging memory region

2021-08-22 Thread lizhij...@fujitsu.com


On 22/08/2021 16:53, Marcel Apfelbaum wrote:
> Hi
>
> On Sat, Jul 31, 2021 at 5:00 PM Li Zhijian  wrote:
>> Previously, for the fsdax mem-backend-file, registration failed with
>> Operation not supported. In this case, we can try to register it with
>> On-Demand Paging[1], like what rpma_mr_reg() does in rpma[2].
>>
>> [1]: 
>> https://community.mellanox.com/s/article/understanding-on-demand-paging--odp-x
>> [2]: http://pmem.io/rpma/manpages/v0.9.0/rpma_mr_reg.3
>> Signed-off-by: Li Zhijian 
>> ---
>>   migration/rdma.c   | 27 ++-
>>   migration/trace-events |  1 +
>>   2 files changed, 19 insertions(+), 9 deletions(-)
>>
>> diff --git a/migration/rdma.c b/migration/rdma.c
>> index 5c2d113aa94..8784b5f22a6 100644
>> --- a/migration/rdma.c
>> +++ b/migration/rdma.c
>> @@ -1123,15 +1123,21 @@ static int 
>> qemu_rdma_reg_whole_ram_blocks(RDMAContext *rdma)
>>   RDMALocalBlocks *local = &rdma->local_ram_blocks;
>>
>>   for (i = 0; i < local->nb_blocks; i++) {
>> +int access = IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE;
>> +
>> +on_demand:
>>   local->block[i].mr =
>>   ibv_reg_mr(rdma->pd,
>>   local->block[i].local_host_addr,
>> -local->block[i].length,
>> -IBV_ACCESS_LOCAL_WRITE |
>> -IBV_ACCESS_REMOTE_WRITE
>> +local->block[i].length, access
>>   );
>>   if (!local->block[i].mr) {
>> -perror("Failed to register local dest ram block!");
>> +if (!(access & IBV_ACCESS_ON_DEMAND) && errno == ENOTSUP) {
>> +access |= IBV_ACCESS_ON_DEMAND;
>> +trace_qemu_rdma_register_odp_mr(local->block[i].block_name);
>> +goto on_demand;
> Wouldn't it be better to check first if the device supports ODP ?
> Something like:
>  ret = ibv_exp_query_device(context, &dattr);
>  if (dattr.exp_device_cap_flags & IBV_EXP_DEVICE_ODP)...

Good idea !



>
> Also, I  am not (personally) too fond of the "on_demand" label usage here,
> however I will let the maintainer/others decide.
Indeed, how about just repeating the ibv_reg_mr() instead of a 'goto'?

Thanks
Zhijian



>
> Thanks,
> Marcel
>
>> +}
>> +perror("Failed to register local dest ram block!\n");
>>   break;
>>   }
>>   rdma->total_registrations++;
>> @@ -1215,15 +1221,18 @@ static int 
>> qemu_rdma_register_and_get_keys(RDMAContext *rdma,
>>*/
>>   if (!block->pmr[chunk]) {
>>   uint64_t len = chunk_end - chunk_start;
>> +int access = rkey ? IBV_ACCESS_LOCAL_WRITE | 
>> IBV_ACCESS_REMOTE_WRITE : 0;
>>
>>   trace_qemu_rdma_register_and_get_keys(len, chunk_start);
>>
>> -block->pmr[chunk] = ibv_reg_mr(rdma->pd,
>> -chunk_start, len,
>> -(rkey ? (IBV_ACCESS_LOCAL_WRITE |
>> -IBV_ACCESS_REMOTE_WRITE) : 0));
>> -
>> +on_demand:
>> +block->pmr[chunk] = ibv_reg_mr(rdma->pd, chunk_start, len, access);
>>   if (!block->pmr[chunk]) {
>> +if (!(access & IBV_ACCESS_ON_DEMAND) && errno == ENOTSUP) {
>> +access |= IBV_ACCESS_ON_DEMAND;
>> +trace_qemu_rdma_register_odp_mr(block->block_name);
>> +goto on_demand;
>> +}
>>   perror("Failed to register chunk!");
>>   fprintf(stderr, "Chunk details: block: %d chunk index %d"
>>   " start %" PRIuPTR " end %" PRIuPTR
>> diff --git a/migration/trace-events b/migration/trace-events
>> index a1c0f034ab8..5f6aa580def 100644
>> --- a/migration/trace-events
>> +++ b/migration/trace-events
>> @@ -212,6 +212,7 @@ qemu_rdma_poll_write(const char *compstr, int64_t comp, 
>> int left, uint64_t block
>>   qemu_rdma_poll_other(const char *compstr, int64_t comp, int left) "other 
>> completion %s (%" PRId64 ") received left %d"
>>   qemu_rdma_post_send_control(const char *desc) "CONTROL: sending %s.."
>>   qemu_rdma_register_and_get_keys(uint64_t len, void *start) "Registering %" 
>> PRIu64 " bytes @ %p"
>> +qemu_rdma_register_odp_mr(const char *name) "Try to register On-Demand 
>> Paging memory region: %s"
>>   qemu_rdma_registration_handle_compress(int64_t length, int index, int64_t 
>> offset) "Zapping zero chunk: %" PRId64 " bytes, index %d, offset %" PRId64
>>   qemu_rdma_registration_handle_finished(void) ""
>>   qemu_rdma_registration_handle_ram_blocks(void) ""
>> --
>> 2.31.1
>>
>>
>>
>>
>


Re: [PATCH v2 1/2] migration: allow multifd for socket protocol only

2021-08-22 Thread lizhij...@fujitsu.com
kindly ping


On 31/07/2021 22:05, Li Zhijian wrote:
> multifd with an unsupported protocol will cause a segmentation fault.
> (gdb) bt
>   #0  0x563b4a93faf8 in socket_connect (addr=0x0, errp=0x7f7f02675410) at 
> ../util/qemu-sockets.c:1190
>   #1  0x563b4a797a03 in qio_channel_socket_connect_sync 
> (ioc=0x563b4d16e8c0, addr=0x0, errp=0x7f7f02675410) at 
> ../io/channel-socket.c:145
>   #2  0x563b4a797abf in qio_channel_socket_connect_worker 
> (task=0x563b4cd86c30, opaque=0x0) at ../io/channel-socket.c:168
>   #3  0x563b4a792631 in qio_task_thread_worker (opaque=0x563b4cd86c30) at 
> ../io/task.c:124
>   #4  0x563b4a91da69 in qemu_thread_start (args=0x563b4c44bb80) at 
> ../util/qemu-thread-posix.c:541
>   #5  0x7f7fe9b5b3f9 in ?? ()
>   #6  0x in ?? ()
>
> It's enough to check migrate_multifd_is_allowed() in the multifd cleanup()
> and setup() paths, even though there are many other places that use
> migrate_use_multifd().
>
> Signed-off-by: Li Zhijian 
> ---
>   migration/migration.c |  4 
>   migration/multifd.c   | 24 ++--
>   migration/multifd.h   |  2 ++
>   3 files changed, 28 insertions(+), 2 deletions(-)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index 2d306582ebf..212314541f1 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -456,10 +456,12 @@ static void qemu_start_incoming_migration(const char 
> *uri, Error **errp)
>   {
>   const char *p = NULL;
>   
> +migrate_protocol_allow_multifd(false); /* reset it anyway */
>   qapi_event_send_migration(MIGRATION_STATUS_SETUP);
>   if (strstart(uri, "tcp:", &p) ||
>   strstart(uri, "unix:", NULL) ||
>   strstart(uri, "vsock:", NULL)) {
> +migrate_protocol_allow_multifd(true);
>   socket_start_incoming_migration(p ? p : uri, errp);
>   #ifdef CONFIG_RDMA
>   } else if (strstart(uri, "rdma:", &p)) {
> @@ -2289,9 +2291,11 @@ void qmp_migrate(const char *uri, bool has_blk, bool 
> blk,
>   }
>   }
>   
> +migrate_protocol_allow_multifd(false);
>   if (strstart(uri, "tcp:", &p) ||
>   strstart(uri, "unix:", NULL) ||
>   strstart(uri, "vsock:", NULL)) {
> +migrate_protocol_allow_multifd(true);
>   socket_start_outgoing_migration(s, p ? p : uri, &local_err);
>   #ifdef CONFIG_RDMA
>   } else if (strstart(uri, "rdma:", &p)) {
> diff --git a/migration/multifd.c b/migration/multifd.c
> index ab41590e714..4a4d16d3888 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -531,7 +531,7 @@ void multifd_save_cleanup(void)
>   {
>   int i;
>   
> -if (!migrate_use_multifd()) {
> +if (!migrate_use_multifd() || !migrate_multifd_is_allowed()) {
>   return;
>   }
>   multifd_send_terminate_threads(NULL);
> @@ -864,6 +864,17 @@ cleanup:
>   multifd_new_send_channel_cleanup(p, sioc, local_err);
>   }
>   
> +static bool migrate_allow_multifd;
> +void migrate_protocol_allow_multifd(bool allow)
> +{
> +migrate_allow_multifd = allow;
> +}
> +
> +bool migrate_multifd_is_allowed(void)
> +{
> +return migrate_allow_multifd;
> +}
> +
>   int multifd_save_setup(Error **errp)
>   {
>   int thread_count;
> @@ -874,6 +885,11 @@ int multifd_save_setup(Error **errp)
>   if (!migrate_use_multifd()) {
>   return 0;
>   }
> +if (!migrate_multifd_is_allowed()) {
> +error_setg(errp, "multifd is not supported by current protocol");
> +return -1;
> +}
> +
>   s = migrate_get_current();
>   thread_count = migrate_multifd_channels();
>   multifd_send_state = g_malloc0(sizeof(*multifd_send_state));
> @@ -967,7 +983,7 @@ int multifd_load_cleanup(Error **errp)
>   {
>   int i;
>   
> -if (!migrate_use_multifd()) {
> +if (!migrate_use_multifd() || !migrate_multifd_is_allowed()) {
>   return 0;
>   }
>   multifd_recv_terminate_threads(NULL);
> @@ -1123,6 +1139,10 @@ int multifd_load_setup(Error **errp)
>   if (!migrate_use_multifd()) {
>   return 0;
>   }
> +if (!migrate_multifd_is_allowed()) {
> +error_setg(errp, "multifd is not supported by current protocol");
> +return -1;
> +}
>   thread_count = migrate_multifd_channels();
>   multifd_recv_state = g_malloc0(sizeof(*multifd_recv_state));
>   multifd_recv_state->params = g_new0(MultiFDRecvParams, thread_count);
> diff --git a/migration/multifd.h b/migration/multifd.h
> index 8d6751f5ed8..f62a1becd0b 100644
> --- a/migration/multifd.h
> +++ b/migration/multifd.h
> @@ -13,6 +13,8 @@
>   #ifndef QEMU_MIGRATION_MULTIFD_H
>   #define QEMU_MIGRATION_MULTIFD_H
>   
> +bool migrate_multifd_is_allowed(void);
> +void migrate_protocol_allow_multifd(bool allow);
>   int multifd_save_setup(Error **errp);
>   void multifd_save_cleanup(void);
>   int multifd_load_setup(Error **errp);


Re: [PATCH v2 0/2] enable fsdax rdma migration

2021-08-23 Thread lizhij...@fujitsu.com
CCing  Marcel


On 23/08/2021 11:33, Li Zhijian wrote:
> Previously, QEMU faced two problems when migrating an fsdax memory backend
> with the RDMA protocol:
> (1) ibv_reg_mr failed with Operation not supported
> (2) the requester (source) side could receive an RNR NAK.
>
> For (1), we can try to register the memory region with the ODP feature,
> which has already been implemented in some modern HCA hardware/drivers.
> For (2), IB provides an advise API to prefetch pages in a specific memory
> region. It helps the driver reduce page faults on the responder
> (destination) side during RDMA_WRITE.
>
> CC: marcel.apfelb...@gmail.com
>
> Li Zhijian (2):
>migration/rdma: Try to register On-Demand Paging memory region
>migration/rdma: advise prefetch write for ODP region
>
>   migration/rdma.c   | 117 +
>   migration/trace-events |   2 +
>   2 files changed, 98 insertions(+), 21 deletions(-)
>


Re: [PATCH v2 1/2] migration/rdma: Try to register On-Demand Paging memory region

2021-08-23 Thread lizhij...@fujitsu.com
CCing  Marcel


On 23/08/2021 11:33, Li Zhijian wrote:
> Previously, for the fsdax mem-backend-file, registration failed with
> Operation not supported. In this case, we can try to register it with
> On-Demand Paging[1], like what rpma_mr_reg() does in rpma[2].
>
> [1]: 
> https://community.mellanox.com/s/article/understanding-on-demand-paging--odp-x
> [2]: http://pmem.io/rpma/manpages/v0.9.0/rpma_mr_reg.3
>
> CC: Marcel Apfelbaum 
> Signed-off-by: Li Zhijian 
>
> ---
> V2: add ODP sanity check and remove goto
> ---
>   migration/rdma.c   | 73 ++
>   migration/trace-events |  1 +
>   2 files changed, 54 insertions(+), 20 deletions(-)
>
> diff --git a/migration/rdma.c b/migration/rdma.c
> index 5c2d113aa94..eb80431aae2 100644
> --- a/migration/rdma.c
> +++ b/migration/rdma.c
> @@ -1117,19 +1117,47 @@ static int qemu_rdma_alloc_qp(RDMAContext *rdma)
>   return 0;
>   }
>   
> +/* Check whether On-Demand Paging is supported by the RDMA device */
> +static bool rdma_support_odp(struct ibv_context *dev)
> +{
> +struct ibv_device_attr_ex attr = {0};
> +int ret = ibv_query_device_ex(dev, NULL, &attr);
> +if (ret) {
> +return false;
> +}
> +
> +if (attr.odp_caps.general_caps & IBV_ODP_SUPPORT) {
> +return true;
> +}
> +
> +return false;
> +}
> +
>   static int qemu_rdma_reg_whole_ram_blocks(RDMAContext *rdma)
>   {
>   int i;
>   RDMALocalBlocks *local = &rdma->local_ram_blocks;
>   
>   for (i = 0; i < local->nb_blocks; i++) {
> +int access = IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE;
> +
>   local->block[i].mr =
>   ibv_reg_mr(rdma->pd,
>   local->block[i].local_host_addr,
> -local->block[i].length,
> -IBV_ACCESS_LOCAL_WRITE |
> -IBV_ACCESS_REMOTE_WRITE
> +local->block[i].length, access
>   );
> +
> +if (!local->block[i].mr &&
> +errno == ENOTSUP && rdma_support_odp(rdma->verbs)) {
> +access |= IBV_ACCESS_ON_DEMAND;
> +/* register ODP mr */
> +local->block[i].mr =
> +ibv_reg_mr(rdma->pd,
> +   local->block[i].local_host_addr,
> +   local->block[i].length, access);
> +trace_qemu_rdma_register_odp_mr(local->block[i].block_name);
> +}
> +
>   if (!local->block[i].mr) {
>   perror("Failed to register local dest ram block!");
>   break;
> @@ -1215,28 +1243,33 @@ static int 
> qemu_rdma_register_and_get_keys(RDMAContext *rdma,
>*/
>   if (!block->pmr[chunk]) {
>   uint64_t len = chunk_end - chunk_start;
> +int access = rkey ? IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE 
> :
> + 0;
>   
>   trace_qemu_rdma_register_and_get_keys(len, chunk_start);
>   
> -block->pmr[chunk] = ibv_reg_mr(rdma->pd,
> -chunk_start, len,
> -(rkey ? (IBV_ACCESS_LOCAL_WRITE |
> -IBV_ACCESS_REMOTE_WRITE) : 0));
> -
> -if (!block->pmr[chunk]) {
> -perror("Failed to register chunk!");
> -fprintf(stderr, "Chunk details: block: %d chunk index %d"
> -" start %" PRIuPTR " end %" PRIuPTR
> -" host %" PRIuPTR
> -" local %" PRIuPTR " registrations: %d\n",
> -block->index, chunk, (uintptr_t)chunk_start,
> -(uintptr_t)chunk_end, host_addr,
> -(uintptr_t)block->local_host_addr,
> -rdma->total_registrations);
> -return -1;
> +block->pmr[chunk] = ibv_reg_mr(rdma->pd, chunk_start, len, access);
> +if (!block->pmr[chunk] &&
> +errno == ENOTSUP && rdma_support_odp(rdma->verbs)) {
> +access |= IBV_ACCESS_ON_DEMAND;
> +/* register ODP mr */
> +block->pmr[chunk] = ibv_reg_mr(rdma->pd, chunk_start, len, 
> access);
> +trace_qemu_rdma_register_odp_mr(block->block_name);
>   }
> -rdma->total_registrations++;
>   }
> +if (!block->pmr[chunk]) {
> +perror("Failed to register chunk!");
> +fprintf(stderr, "Chunk details: block: %d chunk index %d"
> +" start %" PRIuPTR " end %" PRIuPTR
> +" host %" PRIuPTR
> +" local %" PRIuPTR " registrations: %d\n",
> +block->index, chunk, (uintptr_t)chunk_start,
> +(uintptr_t)chunk_end, host_addr,
> +(uintptr_t)block->local_host_addr,
> +rdma->total_registrations);
> +return -1;
> +}
> +rdma->total_registrations++;
>   
>   if (lke

Re: [PATCH v2 2/2] migration/rdma: advise prefetch write for ODP region

2021-08-23 Thread lizhij...@fujitsu.com
CCing Marcel


On 23/08/2021 11:33, Li Zhijian wrote:
> A responder MR registered with ODP will send an RNR NAK back to
> the requester when it hits a page fault.
> -
> ibv_poll_cq wc.status=13 RNR retry counter exceeded!
> ibv_poll_cq wrid=WRITE RDMA!
> -
> ibv_advise_mr(3) helps make pages present before the actual IO is
> conducted so that the responder page-faults as little as possible.
>
> Signed-off-by: Li Zhijian 
> Reviewed-by: Marcel Apfelbaum 
>
> ---
> V2: use IBV_ADVISE_MR_FLAG_FLUSH instead of IB_UVERBS_ADVISE_MR_FLAG_FLUSH
>  and add Reviewed-by tag. # Marcel
> ---
>   migration/rdma.c   | 40 
>   migration/trace-events |  1 +
>   2 files changed, 41 insertions(+)
>
> diff --git a/migration/rdma.c b/migration/rdma.c
> index eb80431aae2..6c2cc3f617c 100644
> --- a/migration/rdma.c
> +++ b/migration/rdma.c
> @@ -1133,6 +1133,30 @@ static bool rdma_support_odp(struct ibv_context *dev)
>   return false;
>   }
>   
> +/*
> + * ibv_advise_mr to avoid RNR NAK errors as far as possible.
> + * A responder MR registered with ODP will send an RNR NAK back to
> + * the requester when it hits a page fault.
> + */
> +static void qemu_rdma_advise_prefetch_mr(struct ibv_pd *pd, uint64_t addr,
> + uint32_t len,  uint32_t lkey,
> + const char *name, bool wr)
> +{
> +int ret;
> +int advice = wr ? IBV_ADVISE_MR_ADVICE_PREFETCH_WRITE :
> + IBV_ADVISE_MR_ADVICE_PREFETCH;
> +struct ibv_sge sg_list = {.lkey = lkey, .addr = addr, .length = len};
> +
> +ret = ibv_advise_mr(pd, advice,
> +IBV_ADVISE_MR_FLAG_FLUSH, &sg_list, 1);
> +/* ignore the error */
> +if (ret) {
> +trace_qemu_rdma_advise_mr(name, len, addr, strerror(errno));
> +} else {
> +trace_qemu_rdma_advise_mr(name, len, addr, "succeeded");
> +}
> +}
> +
>   static int qemu_rdma_reg_whole_ram_blocks(RDMAContext *rdma)
>   {
>   int i;
> @@ -1156,6 +1180,15 @@ static int qemu_rdma_reg_whole_ram_blocks(RDMAContext 
> *rdma)
>  local->block[i].local_host_addr,
>  local->block[i].length, access);
>   trace_qemu_rdma_register_odp_mr(local->block[i].block_name);
> +
> +if (local->block[i].mr) {
> +qemu_rdma_advise_prefetch_mr(rdma->pd,
> +
> (uintptr_t)local->block[i].local_host_addr,
> +local->block[i].length,
> +local->block[i].mr->lkey,
> +local->block[i].block_name,
> +true);
> +}
>   }
>   
>   if (!local->block[i].mr) {
> @@ -1255,6 +1288,13 @@ static int qemu_rdma_register_and_get_keys(RDMAContext 
> *rdma,
>   /* register ODP mr */
>   block->pmr[chunk] = ibv_reg_mr(rdma->pd, chunk_start, len, 
> access);
>   trace_qemu_rdma_register_odp_mr(block->block_name);
> +
> +if (block->pmr[chunk]) {
> +qemu_rdma_advise_prefetch_mr(rdma->pd, 
> (uintptr_t)chunk_start,
> +len, block->pmr[chunk]->lkey,
> +block->block_name, rkey);
> +
> +}
>   }
>   }
>   if (!block->pmr[chunk]) {
> diff --git a/migration/trace-events b/migration/trace-events
> index 5f6aa580def..a8ae163707c 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -213,6 +213,7 @@ qemu_rdma_poll_other(const char *compstr, int64_t comp, 
> int left) "other complet
>   qemu_rdma_post_send_control(const char *desc) "CONTROL: sending %s.."
>   qemu_rdma_register_and_get_keys(uint64_t len, void *start) "Registering %" 
> PRIu64 " bytes @ %p"
>   qemu_rdma_register_odp_mr(const char *name) "Try to register On-Demand 
> Paging memory region: %s"
> +qemu_rdma_advise_mr(const char *name, uint32_t len, uint64_t addr, const 
> char *res) "Try to advise block %s prefetch at %" PRIu32 "@0x%" PRIx64 ": %s"
>   qemu_rdma_registration_handle_compress(int64_t length, int index, int64_t 
> offset) "Zapping zero chunk: %" PRId64 " bytes, index %d, offset %" PRId64
>   qemu_rdma_registration_handle_finished(void) ""
>   qemu_rdma_registration_handle_ram_blocks(void) ""


Re: [PATCH] nvdimm: release the correct device list

2021-08-29 Thread lizhij...@fujitsu.com
ping


On 03/08/2021 12:00, Li, Zhijian wrote:
> ping
>
> Could anybody help review or queue this patch?
>
>
>
> On 2021/6/29 22:05, Igor Mammedov wrote:
>> On Thu, 24 Jun 2021 19:04:15 +0800
>> Li Zhijian  wrote:
>>
>>> Signed-off-by: Li Zhijian 
>> Reviewed-by: Igor Mammedov 
>>
>>> ---
>>>   hw/acpi/nvdimm.c | 12 ++--
>>>   1 file changed, 6 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
>>> index e3d5fe19392..ff317263e85 100644
>>> --- a/hw/acpi/nvdimm.c
>>> +++ b/hw/acpi/nvdimm.c
>>> @@ -355,10 +355,10 @@ nvdimm_build_structure_caps(GArray *structures, 
>>> uint32_t capabilities)
>>>     static GArray *nvdimm_build_device_structure(NVDIMMState *state)
>>>   {
>>> -    GSList *device_list = nvdimm_get_device_list();
>>> +    GSList *device_list, *list = nvdimm_get_device_list();
>>>   GArray *structures = g_array_new(false, true /* clear */, 1);
>>>   -    for (; device_list; device_list = device_list->next) {
>>> +    for (device_list = list; device_list; device_list = device_list->next) 
>>> {
>>>   DeviceState *dev = device_list->data;
>>>     /* build System Physical Address Range Structure. */
>>> @@ -373,7 +373,7 @@ static GArray 
>>> *nvdimm_build_device_structure(NVDIMMState *state)
>>>   /* build NVDIMM Control Region Structure. */
>>>   nvdimm_build_structure_dcr(structures, dev);
>>>   }
>>> -    g_slist_free(device_list);
>>> +    g_slist_free(list);
>>>     if (state->persistence) {
>>>   nvdimm_build_structure_caps(structures, state->persistence);
>>> @@ -1339,9 +1339,9 @@ static void nvdimm_build_ssdt(GArray *table_offsets, 
>>> GArray *table_data,
>>>     void nvdimm_build_srat(GArray *table_data)
>>>   {
>>> -    GSList *device_list = nvdimm_get_device_list();
>>> +    GSList *device_list, *list = nvdimm_get_device_list();
>>>   -    for (; device_list; device_list = device_list->next) {
>>> +    for (device_list = list; device_list; device_list = device_list->next) 
>>> {
>>>   AcpiSratMemoryAffinity *numamem = NULL;
>>>   DeviceState *dev = device_list->data;
>>>   Object *obj = OBJECT(dev);
>>> @@ -1356,7 +1356,7 @@ void nvdimm_build_srat(GArray *table_data)
>>>   build_srat_memory(numamem, addr, size, node,
>>>     MEM_AFFINITY_ENABLED | 
>>> MEM_AFFINITY_NON_VOLATILE);
>>>   }
>>> -    g_slist_free(device_list);
>>> +    g_slist_free(list);
>>>   }
>>>     void nvdimm_build_acpi(GArray *table_offsets, GArray *table_data,
>>
>>
>
>
>
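The pattern of the fix above can be sketched with a plain-C singly linked list (an illustrative stand-in for GSList; none of these names are the QEMU code): iterate with a cursor, but free via the saved head, since the cursor is NULL once the loop finishes — which is exactly why freeing `device_list` after advancing it was wrong.

```c
#include <assert.h>
#include <stdlib.h>

/* Minimal singly linked list standing in for GSList. */
typedef struct Node { struct Node *next; int data; } Node;

static Node *list_prepend(Node *head, int data)
{
    Node *n = malloc(sizeof(*n));
    n->data = data;
    n->next = head;
    return n;
}

static int list_free(Node *head)   /* returns number of nodes freed */
{
    int n = 0;
    while (head) {
        Node *next = head->next;
        free(head);
        head = next;
        n++;
    }
    return n;
}

static int sum_and_free(void)
{
    Node *cursor, *list = NULL;    /* 'list' keeps the head */
    int sum = 0;

    for (int i = 1; i <= 3; i++) {
        list = list_prepend(list, i);
    }
    /* The buggy version freed 'cursor' here -- but it is NULL after
     * the loop, so the whole list leaked. */
    for (cursor = list; cursor; cursor = cursor->next) {
        sum += cursor->data;
    }
    assert(list_free(list) == 3);  /* free the saved head, not the cursor */
    return sum;
}
```

This mirrors the patch's `GSList *device_list, *list = nvdimm_get_device_list();` idiom: `list` exists only so that `g_slist_free()` has the original head to free.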


Re: [PATCH 2/4] net/colo: Fix a "double free" crash to clear the conn_list

2022-03-30 Thread lizhij...@fujitsu.com

connection_track_table
+------+------+
| key1 | conn |---+
+------+------+   |
| key2 | conn |--+|
+------+------+  ||
| key3 | conn |-+||
+------+------+ |||
                |||
+-CompareState-+|||
|              |VVV
|              | +------+    +------+    +------+
| conn_list    +-> conn +----> conn +... | connx|
+--------------+ +-+--+-+    +-+--+-+    +------+
                   |  |        |  |
           +-------v-+ +------v--+ +----v----+ +----v----+
           |primary  | |secondary| |primary  | |secondary|
           |packet   | |packet   | |packet   | |packet   |
           +----+----+ +----+----+ +----+----+ +----+----+
                |           |           |           |
           +----v----+ +----v----+ +----v----+ +----v----+
           |primary  | |secondary| |primary  | |secondary|
           |packet   | |packet   | |packet   | |packet   |
           +----+----+ +----+----+ +----+----+ +----+----+
                |           |           |           |
           +----v----+ +----v----+ +----v----+ +----v----+
           |primary  | |secondary| |primary  | |secondary|
           |packet   | |packet   | |packet   | |packet   |
           +---------+ +---------+ +---------+ +---------+
  
Recall the relationships above among connection_track_table, conn_list, and
conn: both connection_track_table and conn_list reference the same conn
instances.

So before this patch, connection_get() could use-after-free/double-free a
conn: the first free was in connection_hashtable_reset(), and the second here:
221 while (!g_queue_is_empty(conn_list)) {
222 connection_destroy(g_queue_pop_head(conn_list));
223 }

I also suspect that the abort you are seeing was due precisely to the
use-after-free/double free above.
If so, it looks like it would be enough to switch to g_queue_clear(conn_list)
in the second place.

Thanks
Zhijian
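To make the suggested ownership split concrete, here is a minimal stand-in in plain C (no GLib; all names are illustrative, not the QEMU code): the hash table holds non-owning references, mirroring `g_hash_table_new_full(..., NULL)`, and conn_list is the single owner, so each conn is freed exactly once.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct Connection { int id; } Connection;

static int destroy_count;

static void connection_destroy(Connection *conn)
{
    destroy_count++;
    free(conn);
}

/* Simulates one "hashtable full" reset: returns how many times a conn
 * was freed (should equal the number of conns, i.e. no double free). */
static int reset_demo(int nconns)
{
    Connection **table = calloc(nconns, sizeof(*table)); /* the hash table */
    Connection **list = calloc(nconns, sizeof(*list));   /* conn_list */

    destroy_count = 0;
    for (int i = 0; i < nconns; i++) {
        Connection *conn = malloc(sizeof(*conn));
        conn->id = i;
        table[i] = conn;  /* non-owning: value_destroy_func == NULL */
        list[i] = conn;   /* owning reference */
    }

    /* connection_hashtable_reset(): drop the references, free nothing. */
    for (int i = 0; i < nconns; i++) {
        table[i] = NULL;
    }

    /* Clearing conn_list is the one and only place a conn is freed. */
    for (int i = 0; i < nconns; i++) {
        connection_destroy(list[i]);
    }

    free(table);
    free(list);
    return destroy_count;
}
```

With a non-NULL value destroy notify in the table, the same run would free each conn twice, which is the `free(): invalid pointer` abort reported above.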


On 28/03/2022 17:13, Zhang, Chen wrote:
>
>> -Original Message-
>> From: lizhij...@fujitsu.com 
>> Sent: Monday, March 21, 2022 11:06 AM
>> To: Zhang, Chen ; Jason Wang
>> ; lizhij...@fujitsu.com
>> Cc: qemu-dev ; Like Xu 
>> Subject: Re: [PATCH 2/4] net/colo: Fix a "double free" crash to clear the
>> conn_list
>>
>>
>>
>> On 09/03/2022 16:38, Zhang Chen wrote:
>>> We notice the QEMU may crash when the guest has too many incoming
>>> network connections with the following log:
>>>
>>> 15197@1593578622.668573:colo_proxy_main : colo proxy connection
>>> hashtable full, clear it
>>> free(): invalid pointer
>>> [1]15195 abort (core dumped)  qemu-system-x86_64 
>>>
>>> This is because we create the s->connection_track_table with
>>> g_hash_table_new_full() which is defined as:
>>>
>>> GHashTable * g_hash_table_new_full (GHashFunc hash_func,
>>>  GEqualFunc key_equal_func,
>>>  GDestroyNotify key_destroy_func,
>>>  GDestroyNotify value_destroy_func);
>>>
>>> The fourth parameter connection_destroy() will be called to free the
>>> memory allocated for all 'Connection' values in the hashtable when we
>>> call g_hash_table_remove_all() in the connection_hashtable_reset().
>>>
>>> It's unnecessary because we clear the conn_list explicitly later, and
>>> it's buggy when other agents try to call connection_get() with the
>>> same connection_track_table.
>>>
>>> Signed-off-by: Like Xu 
>>> Signed-off-by: Zhang Chen 
>>> ---
>>>net/colo-compare.c| 2 +-
>>>net/filter-rewriter.c | 2 +-
>>>2 files changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/net/colo-compare.c b/net/colo-compare.c index
>>> 62554b5b3c..ab054cfd21 100644
>>> --- a/net/colo-compare.c
>>> +++ b/

Re: [PATCH 2/4] net/colo: Fix a "double free" crash to clear the conn_list

2022-03-31 Thread lizhij...@fujitsu.com


On 31/03/2022 10:25, Zhang, Chen wrote:
>
>> -Original Message-
>> From: lizhij...@fujitsu.com 
>> Sent: Thursday, March 31, 2022 9:15 AM
>> To: Zhang, Chen ; Jason Wang
>> 
>> Cc: qemu-dev ; Like Xu 
>> Subject: Re: [PATCH 2/4] net/colo: Fix a "double free" crash to clear the
>> conn_list
>>
>>
>> connection_track_table
>> +------+------+
>> | key1 | conn |---+
>> +------+------+   |
>> | key2 | conn |--+|
>> +------+------+  ||
>> | key3 | conn |-+||
>> +------+------+ |||
>>                 |||
>> +-CompareState-+|||
>> |              |VVV
>> |              | +------+    +------+    +------+
>> | conn_list    +-> conn +----> conn +... | connx|
>> +--------------+ +-+--+-+    +-+--+-+    +------+
>>                    |  |        |  |
>>            +-------v-+ +------v--+ +----v----+ +----v----+
>>            |primary  | |secondary| |primary  | |secondary|
>>            |packet   | |packet   | |packet   | |packet   |
>>            +----+----+ +----+----+ +----+----+ +----+----+
>>                 |           |           |           |
>>            +----v----+ +----v----+ +----v----+ +----v----+
>>            |primary  | |secondary| |primary  | |secondary|
>>            |packet   | |packet   | |packet   | |packet   |
>>            +----+----+ +----+----+ +----+----+ +----+----+
>>                 |           |           |           |
>>            +----v----+ +----v----+ +----v----+ +----v----+
>>            |primary  | |secondary| |primary  | |secondary|
>>            |packet   | |packet   | |packet   | |packet   |
>>            +---------+ +---------+ +---------+ +---------+
>>
>> I recalled that we should above relationships between
>> connection_track_table conn_list and conn.
>> That means both connection_track_table and conn_list reference to the
>> same conn instance.
>>
>> So before this patch, connection_get() is possible to use-after-free/double
>> free conn. where 1st was in
>> connection_hashtable_reset() and 2nd was
>> 221 while (!g_queue_is_empty(conn_list)) {
>> 222 connection_destroy(g_queue_pop_head(conn_list));
>> 223 }
>>
>> I also doubt that your current abort was just due to above use-after-
>> free/double free.
>> If so, looks it's enough we just update to g_queue_clear(conn_list) in the 
>> 2nd
>> place.
> Make sense, but It also means the original patch works here, skip free conn 
> in connection_hashtable_reset() and do it in:
> 221 while (!g_queue_is_empty(conn_list)) {
>   222 connection_destroy(g_queue_pop_head(conn_list));
>   223 }.
> It also avoid use-after-free/double free conn.
Although there is no use-after-free here, with your approach you have to
carefully consider the other places where g_hash_table_remove_all() or
g_hash_table_destroy() is called, since conn_list should also be freed there.




> Maybe we can keep the original version to fix it?
And your commit log should be more clear.

Thanks
Zhijian

>
> Thanks
> Chen
>
>> Thanks
>> Zhijian
>>
>>
>> On 28/03/2022 17:13, Zhang, Chen wrote:
>>>> -Original Message-
>>>> From: lizhij...@fujitsu.com 
>>>> Sent: Monday, March 21, 2022 11:06 AM
>>>> To: Zhang, Chen ; Jason Wang
>>>> ; lizhij...@fujitsu.com
>>>> Cc: qemu-dev ; Like Xu
>>>> 
>>>> Subject: Re: [PATCH 2/4] net/colo: Fix a "double free" crash to clear
>>>> the conn_list
>>>>
>>>>
>>>>
>>>> On 09/03/2022 16:38, Zhang Chen wrote:
>>

Re: [PATCH 2/4] net/colo: Fix a "double free" crash to clear the conn_list

2022-03-20 Thread lizhij...@fujitsu.com


On 09/03/2022 16:38, Zhang Chen wrote:
> We notice the QEMU may crash when the guest has too many
> incoming network connections with the following log:
>
> 15197@1593578622.668573:colo_proxy_main : colo proxy connection hashtable 
> full, clear it
> free(): invalid pointer
> [1]15195 abort (core dumped)  qemu-system-x86_64 
>
> This is because we create the s->connection_track_table with
> g_hash_table_new_full() which is defined as:
>
> GHashTable * g_hash_table_new_full (GHashFunc hash_func,
> GEqualFunc key_equal_func,
> GDestroyNotify key_destroy_func,
> GDestroyNotify value_destroy_func);
>
> The fourth parameter connection_destroy() will be called to free the
> memory allocated for all 'Connection' values in the hashtable when
> we call g_hash_table_remove_all() in the connection_hashtable_reset().
>
> It's unnecessary because we clear the conn_list explicitly later,
> and it's buggy when other agents try to call connection_get()
> with the same connection_track_table.
>
> Signed-off-by: Like Xu 
> Signed-off-by: Zhang Chen 
> ---
>   net/colo-compare.c| 2 +-
>   net/filter-rewriter.c | 2 +-
>   2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/net/colo-compare.c b/net/colo-compare.c
> index 62554b5b3c..ab054cfd21 100644
> --- a/net/colo-compare.c
> +++ b/net/colo-compare.c
> @@ -1324,7 +1324,7 @@ static void colo_compare_complete(UserCreatable *uc, 
> Error **errp)
>   s->connection_track_table = g_hash_table_new_full(connection_key_hash,
> connection_key_equal,
> g_free,
> -  connection_destroy);
> +  NULL);


202 /* if not found, create a new connection and add to hash table */
203 Connection *connection_get(GHashTable *connection_track_table,
204    ConnectionKey *key,
205    GQueue *conn_list)
206 {
207 Connection *conn = g_hash_table_lookup(connection_track_table, key);
208
209 if (conn == NULL) {
210 ConnectionKey *new_key = g_memdup(key, sizeof(*key));
211
212 conn = connection_new(key);
213
214 if (g_hash_table_size(connection_track_table) > HASHTABLE_MAX_SIZE) 
{
215 trace_colo_proxy_main("colo proxy connection hashtable full,"
216   " clear it");
217 connection_hashtable_reset(connection_track_table);

197 void connection_hashtable_reset(GHashTable *connection_track_table)
198 {
199 g_hash_table_remove_all(connection_track_table);
200 }

IIUC, the subroutine above does some cleanup explicitly. Before your patch,
connection_hashtable_reset() released all keys and their values in this
hashtable. But now you remove all keys and free only the values reachable
from conn_list instead. Does that mean the other values will be leaked?


218 /*
219  * clear the conn_list
220 */
221 while (!g_queue_is_empty(conn_list)) {
222 connection_destroy(g_queue_pop_head(conn_list));
223 }
224 }
225
226 g_hash_table_insert(connection_track_table, new_key, conn);
227 }
228
229 return conn;
230 }


Thanks
Zhijian

>   
>   colo_compare_iothread(s);
>   
> diff --git a/net/filter-rewriter.c b/net/filter-rewriter.c
> index bf05023dc3..c18c4c2019 100644
> --- a/net/filter-rewriter.c
> +++ b/net/filter-rewriter.c
> @@ -383,7 +383,7 @@ static void colo_rewriter_setup(NetFilterState *nf, Error 
> **errp)
>   s->connection_track_table = g_hash_table_new_full(connection_key_hash,
> connection_key_equal,
> g_free,
> -  connection_destroy);
> +  NULL);
>   s->incoming_queue = qemu_new_net_queue(qemu_netfilter_pass_to_next, nf);
>   }
>   


Re: [PATCH 3/4] net/colo.c: No need to track conn_list for filter-rewriter

2022-03-20 Thread lizhij...@fujitsu.com


On 09/03/2022 16:38, Zhang Chen wrote:
> Filter-rewriter does not need to track connections in conn_list.
> This patch fixes the glib g_queue_is_empty assertion when a COLO guest
> keeps a lot of network connections.
>
> Signed-off-by: Zhang Chen 
LGTM.

Reviewed-by: Li Zhijian 


> ---
>   net/colo.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/colo.c b/net/colo.c
> index 1f8162f59f..694f3c93ef 100644
> --- a/net/colo.c
> +++ b/net/colo.c
> @@ -218,7 +218,7 @@ Connection *connection_get(GHashTable 
> *connection_track_table,
>   /*
>* clear the conn_list
>*/
> -while (!g_queue_is_empty(conn_list)) {
> +while (conn_list && !g_queue_is_empty(conn_list)) {
>   connection_destroy(g_queue_pop_head(conn_list));
>   }
>   }


Re: [PATCH 4/4] net/colo.c: fix segmentation fault when packet is not parsed correctly

2022-03-20 Thread lizhij...@fujitsu.com


On 09/03/2022 16:38, Zhang Chen wrote:
> When COLO use only one vnet_hdr_support parameter between
> filter-redirector and filter-mirror(or colo-compare), COLO will crash
> with segmentation fault. Back track as follow:
>
> Thread 1 "qemu-system-x86" received signal SIGSEGV, Segmentation fault.
> 0x55cb200b in eth_get_l2_hdr_length (p=0x0)
>  at /home/tao/project/COLO/colo-qemu/include/net/eth.h:296
> 296 uint16_t proto = be16_to_cpu(PKT_GET_ETH_HDR(p)->h_proto);
> (gdb) bt
> 0  0x55cb200b in eth_get_l2_hdr_length (p=0x0)
>  at /home/tao/project/COLO/colo-qemu/include/net/eth.h:296
> 1  0x55cb22b4 in parse_packet_early (pkt=0x56a44840) at
> net/colo.c:49
> 2  0x55cb2b91 in is_tcp_packet (pkt=0x56a44840) at
> net/filter-rewriter.c:63
>
> So a wrong vnet_hdr_len will cause pkt->data to become NULL.
I am not sure whether we can check this earlier. Anyway,

Reviewed-by: Li Zhijian 


> Add a check to
> raise an error, and add a trace-event to track vnet_hdr_len.
>
> Signed-off-by: Tao Xu 
> Signed-off-by: Zhang Chen 


> ---
>   net/colo.c   | 9 -
>   net/trace-events | 1 +
>   2 files changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/net/colo.c b/net/colo.c
> index 694f3c93ef..6b0ff562ad 100644
> --- a/net/colo.c
> +++ b/net/colo.c
> @@ -46,7 +46,14 @@ int parse_packet_early(Packet *pkt)
>   static const uint8_t vlan[] = {0x81, 0x00};
>   uint8_t *data = pkt->data + pkt->vnet_hdr_len;
>   uint16_t l3_proto;
> -ssize_t l2hdr_len = eth_get_l2_hdr_length(data);
> +ssize_t l2hdr_len;
> +
> +if (data == NULL) {
> +trace_colo_proxy_main_vnet_info("This packet is not parsed 
> correctly, "
> +"pkt->vnet_hdr_len", 
> pkt->vnet_hdr_len);
> +return 1;
> +}
> +l2hdr_len = eth_get_l2_hdr_length(data);
>   
>   if (pkt->size < ETH_HLEN + pkt->vnet_hdr_len) {
>   trace_colo_proxy_main("pkt->size < ETH_HLEN");
> diff --git a/net/trace-events b/net/trace-events
> index d7a17256cc..6af927b4b9 100644
> --- a/net/trace-events
> +++ b/net/trace-events
> @@ -9,6 +9,7 @@ vhost_user_event(const char *chr, int event) "chr: %s got 
> event: %d"
>   
>   # colo.c
>   colo_proxy_main(const char *chr) ": %s"
> +colo_proxy_main_vnet_info(const char *sta, int size) ": %s = %d"
>   
>   # colo-compare.c
>   colo_compare_main(const char *chr) ": %s"


Re: [PATCH v3] migration/rdma: Fix out of order wrid

2021-10-18 Thread lizhij...@fujitsu.com
ping


On 27/09/2021 15:07, Li Zhijian wrote:
> destination:
> ../qemu/build/qemu-system-x86_64 -enable-kvm -netdev 
> tap,id=hn0,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown -device 
> e1000,netdev=hn0,mac=50:52:54:00:11:22 -boot c -drive 
> if=none,file=./Fedora-rdma-server-migration.qcow2,id=drive-virtio-disk0 
> -device 
> virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -m 
> 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet -monitor stdio -vga qxl 
> -spice streaming-video=filter,port=5902,disable-ticketing -incoming 
> rdma:192.168.22.23:
> qemu-system-x86_64: -spice 
> streaming-video=filter,port=5902,disable-ticketing: warning: short-form 
> boolean option 'disable-ticketing' deprecated
> Please use disable-ticketing=on instead
> QEMU 6.0.50 monitor - type 'help' for more information
> (qemu) trace-event qemu_rdma_block_for_wrid_miss on
> (qemu) dest_init RDMA Device opened: kernel name rxe_eth0 uverbs device name 
> uverbs2, infiniband_verbs class device path 
> /sys/class/infiniband_verbs/uverbs2, infiniband class device path 
> /sys/class/infiniband/rxe_eth0, transport: (2) Ethernet
> qemu_rdma_block_for_wrid_miss A Wanted wrid CONTROL SEND (2000) but got 
> CONTROL RECV (4000)
>
> source:
> ../qemu/build/qemu-system-x86_64 -enable-kvm -netdev 
> tap,id=hn0,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown -device 
> e1000,netdev=hn0,mac=50:52:54:00:11:22 -boot c -drive 
> if=none,file=./Fedora-rdma-server.qcow2,id=drive-virtio-disk0 -device 
> virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -m 
> 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet -monitor stdio -vga qxl 
> -spice streaming-video=filter,port=5901,disable-ticketing -S
> qemu-system-x86_64: -spice 
> streaming-video=filter,port=5901,disable-ticketing: warning: short-form 
> boolean option 'disable-ticketing' deprecated
> Please use disable-ticketing=on instead
> QEMU 6.0.50 monitor - type 'help' for more information
> (qemu)
> (qemu) trace-event qemu_rdma_block_for_wrid_miss on
> (qemu) migrate -d rdma:192.168.22.23:
> source_resolve_host RDMA Device opened: kernel name rxe_eth0 uverbs device 
> name uverbs2, infiniband_verbs class device path 
> /sys/class/infiniband_verbs/uverbs2, infiniband class device path 
> /sys/class/infiniband/rxe_eth0, transport: (2) Ethernet
> (qemu) qemu_rdma_block_for_wrid_miss A Wanted wrid WRITE RDMA (1) but got 
> CONTROL RECV (4000)
>
> NOTE: we use soft RoCE as the rdma device.
> [root@iaas-rpma images]# rdma link show rxe_eth0/1
> link rxe_eth0/1 state ACTIVE physical_state LINK_UP netdev eth0
>
> This migration could not be completed when an out-of-order (OOO) CQ event occurs.
> The send queue and receive queue shared a same completion queue, and
> qemu_rdma_block_for_wrid() will drop the CQs it's not interested in. But
> the dropped CQs by qemu_rdma_block_for_wrid() could be later CQs it wants.
> So in this case, qemu_rdma_block_for_wrid() will block forever.
>
> OOO cases can occur on both the source side and the destination side, and
> blocking forever happens only when SEND and RECV are out of order. OOO between
> 'WRITE RDMA' and 'RECV' doesn't matter.
>
> Below is the OOO sequence:
> source destination
>rdma_write_one()   qemu_rdma_registration_handle()
> 1.S1: post_recv XD1: post_recv Y
> 2.wait for recv CQ event X
> 3.   D2: post_send X ---+
> 4.   wait for send CQ send event X (D2) |
> 5.recv CQ event X reaches (D2)  |
> 6.  +-S2: post_send Y   |
> 7.  | wait for send CQ event Y  |
> 8.  |recv CQ event Y (S2) (drop it) |
> 9.  +-send CQ event Y reaches (S2)  |
> 10.  send CQ event X reaches (D2)  -+
> 11.  wait recv CQ event Y (dropped by (8))
>
> Although a hardware IB device worked fine in a hundred of my runs, the IB
> specification doesn't guarantee the CQ order in such a case.
>
> Here we introduce an independent send completion queue to separate
> ibv_post_send completions from the original mixed completion queue.
> It lets us poll for exactly the CQE we are interested in.
>
> Signed-off-by: Li Zhijian 
> ---
> V3: rebase code, and combine 2/2 to 1/2
> V2: Introduce send completion queue
> ---
>   migration/rdma.c | 132 +++
>   1 file changed, 98 insertions(+), 34 deletions(-)
>
> diff --git a/migration/rdma.c b/migration/rdma.c
> index 5c2d113aa94..bb19a5afe73 100644
> --- a/migration/rdma.c
> +++ b/migration/rdma.c
> @@ -358,9 +358,11 @@ typedef struct RDMAContext {
>   struct ibv_context  *ver
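The hang described in the commit message can be reproduced with a toy model (plain C, purely illustrative; `Queue`, `wait_for`, and the event names are stand-ins, not the QEMU or libibverbs API): when two completion types share one queue and the waiter silently discards the type it is not waiting for, a later wait for the discarded type can never succeed. Separate queues avoid this by construction.

```c
#include <assert.h>

/* Toy model: completions arrive into queues; wait_for() pops until the
 * wanted type appears, discarding others -- like the pre-fix
 * qemu_rdma_block_for_wrid(). With a shared queue, a discarded
 * completion is lost. */
enum { SEND = 1, RECV = 2 };

typedef struct { int buf[8]; int head, tail; } Queue;

static void push(Queue *q, int v) { q->buf[q->tail++] = v; }

/* Returns 1 if 'wanted' was found before the queue drained, else 0. */
static int wait_for(Queue *q, int wanted)
{
    while (q->head < q->tail) {
        if (q->buf[q->head++] == wanted) {
            return 1;
        }
        /* other completion types are dropped on the floor */
    }
    return 0; /* stands in for blocking forever in the real code */
}
```

Here a `wait_for` that returns 0 corresponds to the forever-blocked wait in steps 8 and 11 of the sequence above.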

Re: [PATCH v2 01/10] Remove some duplicate trace code.

2021-03-12 Thread lizhij...@fujitsu.com


On 3/12/21 1:02 PM, leirao wrote:
> From: "Rao, Lei" 
>
> There is the same trace code in the colo_compare_packet_payload.
>
> Signed-off-by: Lei Rao 
Reviewed-by: Li Zhijian 

> ---
>   net/colo-compare.c | 13 -
>   1 file changed, 13 deletions(-)
>
> diff --git a/net/colo-compare.c b/net/colo-compare.c
> index 84db497..9e18baa 100644
> --- a/net/colo-compare.c
> +++ b/net/colo-compare.c
> @@ -590,19 +590,6 @@ static int colo_packet_compare_other(Packet *spkt, 
> Packet *ppkt)
>   uint16_t offset = ppkt->vnet_hdr_len;
>   
>   trace_colo_compare_main("compare other");
> -if (trace_event_get_state_backends(TRACE_COLO_COMPARE_IP_INFO)) {
> -char pri_ip_src[20], pri_ip_dst[20], sec_ip_src[20], sec_ip_dst[20];
> -
> -strcpy(pri_ip_src, inet_ntoa(ppkt->ip->ip_src));
> -strcpy(pri_ip_dst, inet_ntoa(ppkt->ip->ip_dst));
> -strcpy(sec_ip_src, inet_ntoa(spkt->ip->ip_src));
> -strcpy(sec_ip_dst, inet_ntoa(spkt->ip->ip_dst));
> -
> -trace_colo_compare_ip_info(ppkt->size, pri_ip_src,
> -   pri_ip_dst, spkt->size,
> -   sec_ip_src, sec_ip_dst);
> -}
> -
>   if (ppkt->size != spkt->size) {
>   trace_colo_compare_main("Other: payload size of packets are 
> different");
>   return -1;


Re: [PATCH v2 05/10] Optimize the function of packet_new

2021-03-12 Thread lizhij...@fujitsu.com


On 3/12/21 1:02 PM, leirao wrote:
> From: "Rao, Lei" 
>
> If we move the data copy outside packet_new(), then for the
> filter-rewriter module there will be one less memory copy in the
> processing of each network packet.
>
> Signed-off-by: Lei Rao 
> ---
>   net/colo-compare.c| 7 +--
>   net/colo.c| 4 ++--
>   net/colo.h| 2 +-
>   net/filter-rewriter.c | 1 -
>   4 files changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/net/colo-compare.c b/net/colo-compare.c
> index 9e18baa..8bdf5a8 100644
> --- a/net/colo-compare.c
> +++ b/net/colo-compare.c
> @@ -247,14 +247,17 @@ static int packet_enqueue(CompareState *s, int mode, 
> Connection **con)
>   ConnectionKey key;
>   Packet *pkt = NULL;
>   Connection *conn;
> +char *data = NULL;
>   int ret;
>   
>   if (mode == PRIMARY_IN) {
> -pkt = packet_new(s->pri_rs.buf,
> +data = g_memdup(s->pri_rs.buf, s->pri_rs.packet_len);
> +pkt = packet_new(data,
>s->pri_rs.packet_len,
>s->pri_rs.vnet_hdr_len);
>   } else {
> -pkt = packet_new(s->sec_rs.buf,
> +data = g_memdup(s->sec_rs.buf, s->sec_rs.packet_len);
> +pkt = packet_new(data,
>s->sec_rs.packet_len,
>s->sec_rs.vnet_hdr_len);
>   }
> diff --git a/net/colo.c b/net/colo.c
> index ef00609..08fb37e 100644
> --- a/net/colo.c
> +++ b/net/colo.c
> @@ -155,11 +155,11 @@ void connection_destroy(void *opaque)
>   g_slice_free(Connection, conn);
>   }
>   
> -Packet *packet_new(const void *data, int size, int vnet_hdr_len)
> +Packet *packet_new(void *data, int size, int vnet_hdr_len)
>   {
>   Packet *pkt = g_slice_new(Packet);
>   
> -pkt->data = g_memdup(data, size);
> +pkt->data = data;

If so, should packet_destroy() free() data that it may not have allocated itself?

Thanks
Zhijian
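The ownership question raised here, and the `packet_new_nocopy` idea proposed later in the thread, can be sketched in plain C (the names mirror the patch but this is an illustrative stand-in, not the QEMU code): a copying constructor and an ownership-taking one, with a destroy that frees the buffer in both cases, so that destroy never frees memory it doesn't own.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct Packet {
    void *data;
    int size;
} Packet;

/* Copies the caller's buffer; the caller keeps ownership of 'data'. */
static Packet *packet_new(const void *data, int size)
{
    Packet *pkt = malloc(sizeof(*pkt));
    pkt->data = malloc(size);
    memcpy(pkt->data, data, size);
    pkt->size = size;
    return pkt;
}

/* Takes ownership of a heap buffer; the caller must NOT free or reuse it. */
static Packet *packet_new_nocopy(void *data, int size)
{
    Packet *pkt = malloc(sizeof(*pkt));
    pkt->data = data;
    pkt->size = size;
    return pkt;
}

/* Always frees pkt->data -- safe for both constructors above because
 * the packet owns its buffer either way. */
static void packet_destroy(Packet *pkt)
{
    free(pkt->data);
    free(pkt);
}
```

The invariant that makes the unconditional free in `packet_destroy()` safe is that `packet_new_nocopy()` must only ever be handed a heap buffer whose ownership is being transferred.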

Re: [PATCH v2 02/10] Fix the qemu crash when guest shutdown during checkpoint

2021-03-12 Thread lizhij...@fujitsu.com


On 3/12/21 1:02 PM, leirao wrote:
> From: "Rao, Lei" 
>
> This patch fixes the following:
>  qemu-system-x86_64: invalid runstate transition: 'colo' ->'shutdown'
>  Aborted (core dumped)
>
> Signed-off-by: Lei Rao 
Reviewed-by: Li Zhijian 

> ---
>   softmmu/runstate.c | 1 +
>   1 file changed, 1 insertion(+)
>
> diff --git a/softmmu/runstate.c b/softmmu/runstate.c
> index 2874417..884f8fa 100644
> --- a/softmmu/runstate.c
> +++ b/softmmu/runstate.c
> @@ -126,6 +126,7 @@ static const RunStateTransition 
> runstate_transitions_def[] = {
>   { RUN_STATE_RESTORE_VM, RUN_STATE_PRELAUNCH },
>   
>   { RUN_STATE_COLO, RUN_STATE_RUNNING },
> +{ RUN_STATE_COLO, RUN_STATE_SHUTDOWN},
>   
>   { RUN_STATE_RUNNING, RUN_STATE_DEBUG },
>   { RUN_STATE_RUNNING, RUN_STATE_INTERNAL_ERROR },


Re: [PATCH v2 10/10] Fixed calculation error of pkt->header_size in fill_pkt_tcp_info()

2021-03-12 Thread lizhij...@fujitsu.com


On 3/12/21 1:03 PM, leirao wrote:
> From: "Rao, Lei" 
>
> The data pointer has already skipped vnet_hdr_len in
> parse_packet_early(), so we must not subtract vnet_hdr_len again
> when calculating pkt->header_size in fill_pkt_tcp_info(). Otherwise,
> it will cause network packet comparison errors and greatly increase
> the frequency of checkpoints.
>
> Signed-off-by: Lei Rao 
> Signed-off-by: Zhang Chen 
Reviewed-by: Li Zhijian 


> ---
>   net/colo-compare.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/colo-compare.c b/net/colo-compare.c
> index 06f2c28..af30490 100644
> --- a/net/colo-compare.c
> +++ b/net/colo-compare.c
> @@ -211,7 +211,7 @@ static void fill_pkt_tcp_info(void *data, uint32_t 
> *max_ack)
>   pkt->tcp_ack = ntohl(tcphd->th_ack);
>   *max_ack = *max_ack > pkt->tcp_ack ? *max_ack : pkt->tcp_ack;
>   pkt->header_size = pkt->transport_header - (uint8_t *)pkt->data
> -   + (tcphd->th_off << 2) - pkt->vnet_hdr_len;
> +   + (tcphd->th_off << 2);
>   pkt->payload_size = pkt->size - pkt->header_size;
>   pkt->seq_end = pkt->tcp_seq + pkt->payload_size;
>   pkt->flags = tcphd->th_flags;


Re: [PATCH v2 03/10] Optimize the function of filter_send

2021-03-12 Thread lizhij...@fujitsu.com


On 3/12/21 1:02 PM, leirao wrote:
> From: "Rao, Lei" 
>
> The iov_size has already been calculated in filter_send(), so we can
> directly return the size. This way there is no need to repeat the
> calculation in filter_redirector_receive_iov().
>
> Signed-off-by: Lei Rao 
Reviewed-by: Li Zhijian 

> ---
>   net/filter-mirror.c | 8 
>   1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/net/filter-mirror.c b/net/filter-mirror.c
> index f8e6500..f20240c 100644
> --- a/net/filter-mirror.c
> +++ b/net/filter-mirror.c
> @@ -88,7 +88,7 @@ static int filter_send(MirrorState *s,
>   goto err;
>   }
>   
> -return 0;
> +return size;
>   
>   err:
>   return ret < 0 ? ret : -EIO;
> @@ -159,7 +159,7 @@ static ssize_t filter_mirror_receive_iov(NetFilterState 
> *nf,
>   int ret;
>   
>   ret = filter_send(s, iov, iovcnt);
> -if (ret) {
> +if (ret < 0) {
>   error_report("filter mirror send failed(%s)", strerror(-ret));
>   }
>   
> @@ -182,10 +182,10 @@ static ssize_t 
> filter_redirector_receive_iov(NetFilterState *nf,
>   
>   if (qemu_chr_fe_backend_connected(&s->chr_out)) {
>   ret = filter_send(s, iov, iovcnt);
> -if (ret) {
> +if (ret < 0) {
>   error_report("filter redirector send failed(%s)", 
> strerror(-ret));
>   }
> -return iov_size(iov, iovcnt);
> +return ret;
>   } else {
>   return 0;
>   }


Re: [PATCH v2 08/10] Reduce the PVM stop time during Checkpoint

2021-03-12 Thread lizhij...@fujitsu.com


On 3/12/21 1:03 PM, leirao wrote:
> From: "Rao, Lei" 
>
> When flushing memory from ram cache to ram during every checkpoint
> on the secondary VM, we can copy continuous chunks of memory instead of
> 4096 bytes at a time, to reduce the VM stop time during checkpoints.
>
> Signed-off-by: Lei Rao 
> ---
>   migration/ram.c | 44 +---
>   1 file changed, 41 insertions(+), 3 deletions(-)
>
> diff --git a/migration/ram.c b/migration/ram.c
> index e795a8d..b269637 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -823,6 +823,39 @@ unsigned long migration_bitmap_find_dirty(RAMState *rs, 
> RAMBlock *rb,
>   return next;
>   }
>   
> +/*
> + * colo_bitmap_find_dirty: find contiguous dirty pages from start
> + *
> + * Returns the page offset within the memory region of the start of the
> + * contiguous dirty page run
> + *
> + * @rs: current RAM state
> + * @rb: RAMBlock where to search for dirty pages
> + * @start: page where we start the search
> + * @num: the number of contiguous dirty pages
> + */
> +static inline
> +unsigned long colo_bitmap_find_dirty(RAMState *rs, RAMBlock *rb,
> + unsigned long start, unsigned long *num)
> +{
> +unsigned long size = rb->used_length >> TARGET_PAGE_BITS;
> +unsigned long *bitmap = rb->bmap;
> +unsigned long first, next;
> +
> +if (ramblock_is_ignored(rb)) {
> +return size;
> +}
> +
> +first = find_next_bit(bitmap, size, start);
> +if (first >= size) {
> +return first;
> +}
> +next = find_next_zero_bit(bitmap, size, first + 1);
> +assert(next >= first);
> +*num = next - first;
> +return first;

The idea is outstanding.

I wonder, should it return (next - 1)?

Thanks
Zhijian


> +}
> +
>   static inline bool migration_bitmap_clear_dirty(RAMState *rs,
>   RAMBlock *rb,
>   unsigned long page)
> @@ -3669,6 +3702,8 @@ void colo_flush_ram_cache(void)
>   void *dst_host;
>   void *src_host;
>   unsigned long offset = 0;
> +unsigned long num = 0;
> +unsigned long i = 0;
>   
>   memory_global_dirty_log_sync();
>   WITH_RCU_READ_LOCK_GUARD() {
> @@ -3682,19 +3717,22 @@ void colo_flush_ram_cache(void)
>   block = QLIST_FIRST_RCU(&ram_list.blocks);
>   
>   while (block) {
> -offset = migration_bitmap_find_dirty(ram_state, block, offset);
> +offset = colo_bitmap_find_dirty(ram_state, block, offset, &num);
>   
>   if (((ram_addr_t)offset) << TARGET_PAGE_BITS
>   >= block->used_length) {
>   offset = 0;
> +num = 0;
>   block = QLIST_NEXT_RCU(block, next);
>   } else {
> -migration_bitmap_clear_dirty(ram_state, block, offset);
> +for (i = 0; i < num; i++) {
> +migration_bitmap_clear_dirty(ram_state, block, offset + 
> i);
> +}
>   dst_host = block->host
>+ (((ram_addr_t)offset) << TARGET_PAGE_BITS);
>   src_host = block->colo_cache
>+ (((ram_addr_t)offset) << TARGET_PAGE_BITS);
> -memcpy(dst_host, src_host, TARGET_PAGE_SIZE);
> +memcpy(dst_host, src_host, TARGET_PAGE_SIZE * num);
>   }
>   }
>   }
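For illustration, here is a self-contained sketch of the contiguous-run search, with plain-C stand-ins for the kernel-style `find_next_bit()`/`find_next_zero_bit()` helpers, operating on a single `unsigned long` for simplicity. It assumes the function's contract is to return the run's *first* page plus its length via `*num` (rather than the run's last page, which is what the `(next - 1)` question above is about).

```c
#include <assert.h>

/* Minimal stand-ins for the bitmap helpers used in the patch. */
static unsigned long find_next_bit(unsigned long map, unsigned long size,
                                   unsigned long start)
{
    for (unsigned long i = start; i < size; i++) {
        if (map & (1UL << i)) {
            return i;
        }
    }
    return size;
}

static unsigned long find_next_zero_bit(unsigned long map, unsigned long size,
                                        unsigned long start)
{
    for (unsigned long i = start; i < size; i++) {
        if (!(map & (1UL << i))) {
            return i;
        }
    }
    return size;
}

/* Returns the first page of a contiguous dirty run at or after 'start',
 * and the run length via *num -- the contract colo_bitmap_find_dirty()
 * appears to implement. The caller must advance by (first + *num) to
 * skip the pages already found. */
static unsigned long find_dirty_run(unsigned long map, unsigned long size,
                                    unsigned long start, unsigned long *num)
{
    unsigned long first = find_next_bit(map, size, start);
    if (first >= size) {
        return first;
    }
    unsigned long next = find_next_zero_bit(map, size, first + 1);
    *num = next - first;
    return first;
}
```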


Re: [PATCH v2 04/10] Remove migrate_set_block_enabled in checkpoint

2021-03-12 Thread lizhij...@fujitsu.com

On 3/12/21 1:02 PM, leirao wrote:
> From: "Rao, Lei" 
>
> We can detect disk migration in migrate_prepare; if disk migration
> is enabled in COLO mode, we can directly report an error, and there
> is no need to disable block migration at every checkpoint.
>
> Signed-off-by: Lei Rao 
> Signed-off-by: Zhang Chen 
Reviewed-by: Li Zhijian 

> ---
>   migration/colo.c  | 6 --
>   migration/migration.c | 4 
>   2 files changed, 4 insertions(+), 6 deletions(-)
>
> diff --git a/migration/colo.c b/migration/colo.c
> index de27662..1aaf316 100644
> --- a/migration/colo.c
> +++ b/migration/colo.c
> @@ -435,12 +435,6 @@ static int colo_do_checkpoint_transaction(MigrationState *s,
>   if (failover_get_state() != FAILOVER_STATUS_NONE) {
>   goto out;
>   }
> -
> -/* Disable block migration */
> -migrate_set_block_enabled(false, &local_err);
> -if (local_err) {
> -goto out;
> -}
>   qemu_mutex_lock_iothread();
>
>   #ifdef CONFIG_REPLICATION
> diff --git a/migration/migration.c b/migration/migration.c
> index a5ddf43..785a331 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -2221,6 +2221,10 @@ static bool migrate_prepare(MigrationState *s, bool blk, bool blk_inc,
>   }
>
>   if (blk || blk_inc) {
> +if (migrate_colo_enabled()) {
> +error_setg(errp, "No disk migration is required in COLO mode");
> +return false;
> +}
>   if (migrate_use_block() || migrate_use_block_incremental()) {
>   error_setg(errp, "Command options are incompatible with "
> "current migration capabilities");

Re: [PATCH v2 09/10] Add the function of colo_bitmap_clear_diry

2021-03-12 Thread lizhij...@fujitsu.com


On 3/12/21 1:03 PM, leirao wrote:
> From: "Rao, Lei" 
>
> When we use continuous dirty memory copy for flushing ram cache on
> secondary VM, we can also clean up the bitmap of contiguous dirty
> page memory. This also can reduce the VM stop time during checkpoint.
>
> Signed-off-by: Lei Rao 
> ---
>   migration/ram.c | 29 +
>   1 file changed, 25 insertions(+), 4 deletions(-)
>
> diff --git a/migration/ram.c b/migration/ram.c
> index b269637..008a26e 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -856,6 +856,30 @@ unsigned long colo_bitmap_find_dirty(RAMState *rs, 
> RAMBlock *rb,
>   return first;
>   }
>   
> +/**
> + * colo_bitmap_clear_dirty:when we flush ram cache to ram, we will use
> + * continuous memory copy, so we can also clean up the bitmap of contiguous
> + * dirty memory.
> + */
> +static inline bool colo_bitmap_clear_dirty(RAMState *rs,
> +   RAMBlock *rb,
> +   unsigned long start,
> +   unsigned long num)
> +{
> +bool ret;
> +unsigned long i = 0;
> +
> +qemu_mutex_lock(&rs->bitmap_mutex);
> +for (i = 0; i < num; i++) {
> +ret = test_and_clear_bit(start + i, rb->bmap);
> +if (ret) {
> +rs->migration_dirty_pages--;
> +}
> +}
> +qemu_mutex_unlock(&rs->bitmap_mutex);
> +return ret;
> +}

I'm not sure whether we should also clear the dirty_log on the KVM side, like 
memory_region_clear_dirty_bitmap() does?
Sorry about that; I have been away from qemu for a long time...

Thanks
Zhijian


Re: [PATCH v2 05/10] Optimize the function of packet_new

2021-03-12 Thread lizhij...@fujitsu.com


> +offset = colo_bitmap_find_dirty(ram_state, block, offset,
> + &num);
IIUC, this return value is passed to the next round as the start index, so you 
should skip the pages that have already been checked.


Thanks


On 3/12/21 5:56 PM, Rao, Lei wrote:
> How about defining a new function named packet_new_nocopy?
> In the comments, we can tell the caller not to release the buffer; 
> packet_destroy will release it.
>
> Thanks,
> Lei.
> -Original Message-
> From:lizhij...@fujitsu.com
> Sent: Friday, March 12, 2021 2:53 PM
> To: Rao, Lei; Zhang, 
> Chen;jasow...@redhat.com;quint...@redhat.com;dgilb...@redhat.com;pbonz...@redhat.com;lukasstra...@web.de
> Cc:qemu-devel@nongnu.org
> Subject: Re: [PATCH v2 05/10] Optimize the function of packet_new
>
>


Re: [PULL 0/7] Migration.next patches

2021-09-09 Thread lizhij...@fujitsu.com


On 10/09/2021 00:10, Juan Quintela wrote:
> "Li, Zhijian"  wrote:
>> on 2021/9/9 21:42, Peter Maydell wrote:
>>> On Thu, 9 Sept 2021 at 11:36, Juan Quintela  wrote:
>>> Fails to build, FreeBSD:
>>>
>>> ../src/migration/rdma.c:1146:23: error: use of undeclared identifier
>>> 'IBV_ADVISE_MR_ADVICE_PREFETCH_WRITE'
>>>   int advice = wr ? IBV_ADVISE_MR_ADVICE_PREFETCH_WRITE :
>>> ^
>>> ../src/migration/rdma.c:1147:18: error: use of undeclared identifier
>>> 'IBV_ADVISE_MR_ADVICE_PREFETCH'
>>>IBV_ADVISE_MR_ADVICE_PREFETCH;
>>>^
>>> ../src/migration/rdma.c:1150:11: warning: implicit declaration of
>>> function 'ibv_advise_mr' is invalid in C99
>>> [-Wimplicit-function-declaration]
>>>   ret = ibv_advise_mr(pd, advice,
>>> ^
>>> ../src/migration/rdma.c:1151:25: error: use of undeclared identifier
>>> 'IBV_ADVISE_MR_FLAG_FLUSH'
>>>   IBV_ADVISE_MR_FLAG_FLUSH, &sg_list, 1);
>>>   ^
>>>
>> it's introduced by [PULL 4/7] migration/rdma: advise prefetch write for ODP 
>> region
>> where it calls ibv_advise_mr(). I have checked the latest FreeBSD; it 
>> doesn't ship with this API.
>> May I know if only FreeBSD reports this failure? If so, I just need to 
>> filter out FreeBSD.
> Second try.  I can't see an example where they search for:
> a symbol on the header file
>and
> a function in a library
>
> so I assume that if you have the symbols, you have the function.
>
> How do you see it?
>
> Trying to compile it on vm-build-freebsd, but not being very successful
> so far.

Your patch does work! But I still followed PMM's suggestion and converted it to 
has_function as another option.
I have verified it on FreeBSD and Linux.

 From 67f386acc2092ecf6e71b8951b6af5d5b8366f80 Mon Sep 17 00:00:00 2001
From: Juan Quintela 
Date: Thu, 9 Sep 2021 17:07:17 +0200
Subject: [PATCH] rdma: test for ibv_advise_mr API

Signed-off-by: Juan Quintela 
Signed-off-by: Li Zhijian 
---
  meson.build  | 6 ++
  migration/rdma.c | 2 ++
  2 files changed, 8 insertions(+)

diff --git a/meson.build b/meson.build
index 6e4d2d80343..97406d1b79b 100644
--- a/meson.build
+++ b/meson.build
@@ -1328,6 +1328,12 @@ config_host_data.set('HAVE_COPY_FILE_RANGE', 
cc.has_function('copy_file_range'))
  config_host_data.set('HAVE_OPENPTY', cc.has_function('openpty', dependencies: 
util))
  config_host_data.set('HAVE_STRCHRNUL', cc.has_function('strchrnul'))
  config_host_data.set('HAVE_SYSTEM_FUNCTION', cc.has_function('system', 
prefix: '#include '))
+if rdma.found()
+  config_host_data.set('HAVE_IBV_ADVISE_MR',
+   cc.has_function('ibv_advise_mr',
+   args: config_host['RDMA_LIBS'].split(),
+   prefix: '#include 
'))
+endif
  
  # has_header_symbol
  config_host_data.set('CONFIG_BYTESWAP_H',
diff --git a/migration/rdma.c b/migration/rdma.c
index 6c2cc3f617c..2a3c7889b9f 100644
--- a/migration/rdma.c
+++ b/migration/rdma.c
@@ -1142,6 +1142,7 @@ static void qemu_rdma_advise_prefetch_mr(struct ibv_pd 
*pd, uint64_t addr,
   uint32_t len,  uint32_t lkey,
   const char *name, bool wr)
  {
+#ifdef HAVE_IBV_ADVISE_MR
  int ret;
  int advice = wr ? IBV_ADVISE_MR_ADVICE_PREFETCH_WRITE :
   IBV_ADVISE_MR_ADVICE_PREFETCH;
@@ -1155,6 +1156,7 @@ static void qemu_rdma_advise_prefetch_mr(struct ibv_pd 
*pd, uint64_t addr,
  } else {
  trace_qemu_rdma_advise_mr(name, len, addr, "successed");
  }
+#endif
  }
  
  static int qemu_rdma_reg_whole_ram_blocks(RDMAContext *rdma)
-- 
2.31.1




> Later, Juan.
>
>  From e954c1e0afc785a98d472201dafe75a7e7126b1d Mon Sep 17 00:00:00 2001
> From: Juan Quintela 
> Date: Thu, 9 Sep 2021 17:07:17 +0200
> Subject: [PATCH] rdma: test for ibv_advise_mr API
>
> Signed-off-by: Juan Quintela 
> ---
>   meson.build  | 3 +++
>   migration/rdma.c | 2 ++
>   2 files changed, 5 insertions(+)
>
> diff --git a/meson.build b/meson.build
> index 7e58e6279b..c2eb437df4 100644
> --- a/meson.build
> +++ b/meson.build
> @@ -1375,6 +1375,9 @@ config_host_data.set('HAVE_SIGEV_NOTIFY_THREAD_ID',
>   config_host_data.set('HAVE_STRUCT_STAT_ST_ATIM',
>cc.has_member('struct stat', 'st_atim',
>  prefix: '#include '))
> +config_host_data.set('CONFIG_RDMA_IBV_ADVISE_MR',
> + cc.has_header_symbol('infiniband/verbs.h', 
> 'IBV_ADVISE_MR_ADVICE_PREFETCH') and
> + cc.has_header_symbol('infiniband/verbs.h', 
> 'IBV_ADVISE_MR_ADVICE_PREFETCH_WRITE'))
>   
>   config_host_data.set('CONFIG_EVENTFD', cc.links('''
> #include 
> diff --git a/migration/rdma.c b/migration/rdma.c
> index 6c2cc3f617..f0d78597fb 100644
> --- a/migration/rdma.c
> +++ b/migration/rdma.c
> @@ -1142,6 +1142,7 @@ static void qemu_rdma_advise_prefetc

Re: [PULL 0/7] Migration.next patches

2021-09-09 Thread lizhij...@fujitsu.com


On 10/09/2021 13:20, Li Zhijian wrote:
>
>
> On 10/09/2021 00:10, Juan Quintela wrote:
>> "Li, Zhijian"  wrote:
>>> on 2021/9/9 21:42, Peter Maydell wrote:
 On Thu, 9 Sept 2021 at 11:36, Juan Quintela  wrote:
 Fails to build, FreeBSD:

 ../src/migration/rdma.c:1146:23: error: use of undeclared identifier
 'IBV_ADVISE_MR_ADVICE_PREFETCH_WRITE'
   int advice = wr ? IBV_ADVISE_MR_ADVICE_PREFETCH_WRITE :
     ^
 ../src/migration/rdma.c:1147:18: error: use of undeclared identifier
 'IBV_ADVISE_MR_ADVICE_PREFETCH'
    IBV_ADVISE_MR_ADVICE_PREFETCH;
    ^
 ../src/migration/rdma.c:1150:11: warning: implicit declaration of
 function 'ibv_advise_mr' is invalid in C99
 [-Wimplicit-function-declaration]
   ret = ibv_advise_mr(pd, advice,
     ^
 ../src/migration/rdma.c:1151:25: error: use of undeclared identifier
 'IBV_ADVISE_MR_FLAG_FLUSH'
   IBV_ADVISE_MR_FLAG_FLUSH, &sg_list, 1);
   ^

>>> it's introduced by [PULL 4/7] migration/rdma: advise prefetch write for ODP 
>>> region
>>> where it calls ibv_advise_mr(). I have checked the latest FreeBSD; it 
>>> doesn't ship with this API.
>>> May I know if only FreeBSD reports this failure? If so, I just need to 
>>> filter out FreeBSD.
>> Second try.  I can't see an example where they search for:
>> a symbol on the header file
>>    and
>> a function in a library
>>
>> so I assume that if you have the symbols, you have the function.
>>
>> How do you see it?
>>
>> Trying to compile it on vm-build-freebsd, but not being very successful
>> so far.

BTW: Does QEMU provide any means to set http(s)_proxy for the build VM? 
Currently, I have to
hack the code like:

-self.ssh_root_check("pkg install -y %s\n" % " ".join(self.pkgs))
+self.ssh_root_check("setenv HTTP_PROXY http://myproxy; setenv 
HTTPS_PROXY http://myproxy; pkg install -y %s\n" % " ".join(self.pkgs))


Thanks
Zhijian


>
> Your patch does work! But I still followed PMM's suggestion and converted it to 
> has_function
> as another option.
> I have verified it on FreeBSD and Linux.
>
> From 67f386acc2092ecf6e71b8951b6af5d5b8366f80 Mon Sep 17 00:00:00 2001
> From: Juan Quintela 
> Date: Thu, 9 Sep 2021 17:07:17 +0200
> Subject: [PATCH] rdma: test for ibv_advise_mr API
>
> Signed-off-by: Juan Quintela 
> Signed-off-by: Li Zhijian 
> ---
>  meson.build  | 6 ++
>  migration/rdma.c | 2 ++
>  2 files changed, 8 insertions(+)
>
> diff --git a/meson.build b/meson.build
> index 6e4d2d80343..97406d1b79b 100644
> --- a/meson.build
> +++ b/meson.build
> @@ -1328,6 +1328,12 @@ config_host_data.set('HAVE_COPY_FILE_RANGE', 
> cc.has_function('copy_file_range'))
>  config_host_data.set('HAVE_OPENPTY', cc.has_function('openpty', 
> dependencies: util))
>  config_host_data.set('HAVE_STRCHRNUL', cc.has_function('strchrnul'))
>  config_host_data.set('HAVE_SYSTEM_FUNCTION', cc.has_function('system', 
> prefix: '#include '))
> +if rdma.found()
> +  config_host_data.set('HAVE_IBV_ADVISE_MR',
> +   cc.has_function('ibv_advise_mr',
> +   args: 
> config_host['RDMA_LIBS'].split(),
> +   prefix: '#include 
> '))
> +endif
>
>  # has_header_symbol
>  config_host_data.set('CONFIG_BYTESWAP_H',
> diff --git a/migration/rdma.c b/migration/rdma.c
> index 6c2cc3f617c..2a3c7889b9f 100644
> --- a/migration/rdma.c
> +++ b/migration/rdma.c
> @@ -1142,6 +1142,7 @@ static void qemu_rdma_advise_prefetch_mr(struct ibv_pd 
> *pd, uint64_t addr,
>   uint32_t len,  uint32_t lkey,
>   const char *name, bool wr)
>  {
> +#ifdef HAVE_IBV_ADVISE_MR
>  int ret;
>  int advice = wr ? IBV_ADVISE_MR_ADVICE_PREFETCH_WRITE :
>   IBV_ADVISE_MR_ADVICE_PREFETCH;
> @@ -1155,6 +1156,7 @@ static void qemu_rdma_advise_prefetch_mr(struct ibv_pd 
> *pd, uint64_t addr,
>  } else {
>  trace_qemu_rdma_advise_mr(name, len, addr, "successed");
>  }
> +#endif
>  }
>
>  static int qemu_rdma_reg_whole_ram_blocks(RDMAContext *rdma)


Re: [PULL 0/7] Migration.next patches

2021-09-10 Thread lizhij...@fujitsu.com


On 10/09/2021 15:00, Juan Quintela wrote:
> ++ git diff-index --quiet --ignore-submodules=all HEAD --
> ++ echo HEAD
> + git archive --format tar --prefix slirp/ HEAD
> + test 0 -ne 0
> + tar --concatenate --file /tmp/kk.tar /tmp/kk.sub.WKj1o6oP/submodule.tar
> tar: Skipping to next header
> tar: Exiting with failure status due to previous errors
> + test 2 -ne 0
> + error 'failed append submodule slirp to /tmp/kk.tar'
> + printf '%s\n' 'failed append submodule slirp to /tmp/kk.tar'
> failed append submodule slirp to /tmp/kk.tar
> + exit 1
> + cleanup
> + local status=1
> + rm -rf /tmp/kk.sub.WKj1o6oP
> + test '' '!=' ''
> + exit 1
> (master)$
>
> Doing the things on the command line, the
>
>git archive --format tar --prefix slirp/ HEAD

That's weird; I have no idea about it.
It works fine for me. :)



> Creates a tar archive, so I get completely lost.
>
> I showed here fedora, but it fails exactly the same for freebsd,
> openbsd, ... and everything that I decided to build.  It fails in the
> same stage.


Re: [PATCH] nvdimm: release the correct device list

2021-09-12 Thread lizhij...@fujitsu.com

ping again




On 30/08/2021 09:04, Li Zhijian wrote:
> ping
>
>
> On 03/08/2021 12:00, Li, Zhijian wrote:
>> ping
>>
>> Could anybody help to review/queue this patch?
>>
>>
>>
>> On 2021/6/29 22:05, Igor Mammedov wrote:
>>> On Thu, 24 Jun 2021 19:04:15 +0800
>>> Li Zhijian  wrote:
>>>
 Signed-off-by: Li Zhijian 
>>> Reviewed-by: Igor Mammedov 
>>>
 ---
   hw/acpi/nvdimm.c | 12 ++--
   1 file changed, 6 insertions(+), 6 deletions(-)

 diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
 index e3d5fe19392..ff317263e85 100644
 --- a/hw/acpi/nvdimm.c
 +++ b/hw/acpi/nvdimm.c
 @@ -355,10 +355,10 @@ nvdimm_build_structure_caps(GArray *structures, 
 uint32_t capabilities)
     static GArray *nvdimm_build_device_structure(NVDIMMState *state)
   {
 -    GSList *device_list = nvdimm_get_device_list();
 +    GSList *device_list, *list = nvdimm_get_device_list();
   GArray *structures = g_array_new(false, true /* clear */, 1);
   -    for (; device_list; device_list = device_list->next) {
 +    for (device_list = list; device_list; device_list = 
 device_list->next) {
   DeviceState *dev = device_list->data;
     /* build System Physical Address Range Structure. */
 @@ -373,7 +373,7 @@ static GArray 
 *nvdimm_build_device_structure(NVDIMMState *state)
   /* build NVDIMM Control Region Structure. */
   nvdimm_build_structure_dcr(structures, dev);
   }
 -    g_slist_free(device_list);
 +    g_slist_free(list);
     if (state->persistence) {
   nvdimm_build_structure_caps(structures, state->persistence);
 @@ -1339,9 +1339,9 @@ static void nvdimm_build_ssdt(GArray *table_offsets, 
 GArray *table_data,
     void nvdimm_build_srat(GArray *table_data)
   {
 -    GSList *device_list = nvdimm_get_device_list();
 +    GSList *device_list, *list = nvdimm_get_device_list();
   -    for (; device_list; device_list = device_list->next) {
 +    for (device_list = list; device_list; device_list = 
 device_list->next) {
   AcpiSratMemoryAffinity *numamem = NULL;
   DeviceState *dev = device_list->data;
   Object *obj = OBJECT(dev);
 @@ -1356,7 +1356,7 @@ void nvdimm_build_srat(GArray *table_data)
   build_srat_memory(numamem, addr, size, node,
     MEM_AFFINITY_ENABLED | 
 MEM_AFFINITY_NON_VOLATILE);
   }
 -    g_slist_free(device_list);
 +    g_slist_free(list);
   }
     void nvdimm_build_acpi(GArray *table_offsets, GArray *table_data,
>>>
>>>
>>
>>
>>
>


Re: [PATCH] migration/rdma: Fix return-path case

2023-03-14 Thread lizhij...@fujitsu.com


On 15/03/2023 01:15, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" 
> 
> The RDMA code has return-path handling code, but it's only enabled
> if postcopy is enabled; if the 'return-path' migration capability
> is enabled, the return path is NOT setup but the core migration
> code still tries to use it and breaks.
> 
> Enable the RDMA return path if either postcopy or the return-path
> capability is enabled.
> 
> bz: https://bugzilla.redhat.com/show_bug.cgi?id=2063615
> 
> Signed-off-by: Dr. David Alan Gilbert 

LGTM.

Reviewed-by: Li Zhijian 



> ---
>   migration/rdma.c | 8 +---
>   1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/migration/rdma.c b/migration/rdma.c
> index 288eadc2d2..9d70e9885b 100644
> --- a/migration/rdma.c
> +++ b/migration/rdma.c
> @@ -3373,7 +3373,8 @@ static int qemu_rdma_accept(RDMAContext *rdma)
>* initialize the RDMAContext for return path for postcopy after first
>* connection request reached.
>*/
> -if (migrate_postcopy() && !rdma->is_return_path) {
> +if ((migrate_postcopy() || migrate_use_return_path())
> +&& !rdma->is_return_path) {
>   rdma_return_path = qemu_rdma_data_init(rdma->host_port, NULL);
>   if (rdma_return_path == NULL) {
>   rdma_ack_cm_event(cm_event);
> @@ -3455,7 +3456,8 @@ static int qemu_rdma_accept(RDMAContext *rdma)
>   }
>   
>   /* Accept the second connection request for return path */
> -if (migrate_postcopy() && !rdma->is_return_path) {
> +if ((migrate_postcopy() || migrate_use_return_path())
> +&& !rdma->is_return_path) {
>   qemu_set_fd_handler(rdma->channel->fd, 
> rdma_accept_incoming_migration,
>   NULL,
>   (void *)(intptr_t)rdma->return_path);
> @@ -4192,7 +4194,7 @@ void rdma_start_outgoing_migration(void *opaque,
>   }
>   
>   /* RDMA postcopy need a separate queue pair for return path */
> -if (migrate_postcopy()) {
> +if (migrate_postcopy() || migrate_use_return_path()) {
>   rdma_return_path = qemu_rdma_data_init(host_port, errp);
>   
>   if (rdma_return_path == NULL) {

Re: [PATCH] migration/rdma: Remove deprecated variable rdma_return_path

2023-03-16 Thread lizhij...@fujitsu.com

Not clear why it doesn't appear in the archive 
(https://lists.gnu.org/archive/html/qemu-devel/2023-03/threads.html)

nop...


On 15/03/2023 09:22, Li Zhijian wrote:
> It's no longer needed since commit
> 44bcfd45e98 ("migration/rdma: destination: create the return patch after the 
> first accept")
> 
> Signed-off-by: Li Zhijian 
> ---
>   migration/rdma.c | 3 +--
>   1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/migration/rdma.c b/migration/rdma.c
> index f5d3bbe7e9c..2bc2fcf727b 100644
> --- a/migration/rdma.c
> +++ b/migration/rdma.c
> @@ -4214,7 +4214,7 @@ static void rdma_accept_incoming_migration(void *opaque)
>   void rdma_start_incoming_migration(const char *host_port, Error **errp)
>   {
>   int ret;
> -RDMAContext *rdma, *rdma_return_path = NULL;
> +RDMAContext *rdma;
>   Error *local_err = NULL;
>   
>   trace_rdma_start_incoming_migration();
> @@ -4260,7 +4260,6 @@ err:
>   g_free(rdma->host_port);
>   }
>   g_free(rdma);
> -g_free(rdma_return_path);
>   }
>   
>   void rdma_start_outgoing_migration(void *opaque,

Re: [PATCH] net/filter: Optimize filter_send to coroutine

2021-12-24 Thread lizhij...@fujitsu.com


On 24/12/2021 10:37, Rao, Lei wrote:
> This patch is to improve the logic of QEMU main thread sleep code in
> qemu_chr_write_buffer() where it can be blocked and can't run other
> coroutines during COLO IO stress test.
>
> Our approach is to put filter_send() in a coroutine. In this way,
> filter_send() will call qemu_coroutine_yield() in qemu_co_sleep_ns(),
> so that it can be scheduled out and QEMU main thread has opportunity to
> run other tasks.
>
> Signed-off-by: Lei Rao 
> Signed-off-by: Zhang Chen 
> ---
>   net/filter-mirror.c | 67 -
>   1 file changed, 54 insertions(+), 13 deletions(-)
>
> diff --git a/net/filter-mirror.c b/net/filter-mirror.c
> index f20240cc9f..1e9f8b6216 100644
> --- a/net/filter-mirror.c
> +++ b/net/filter-mirror.c
> @@ -20,6 +20,7 @@
>   #include "chardev/char-fe.h"
>   #include "qemu/iov.h"
>   #include "qemu/sockets.h"
> +#include "block/aio-wait.h"
>   
>   #define TYPE_FILTER_MIRROR "filter-mirror"
>   typedef struct MirrorState MirrorState;
> @@ -42,20 +43,21 @@ struct MirrorState {
>   bool vnet_hdr;
>   };
>   
> -static int filter_send(MirrorState *s,
> -   const struct iovec *iov,
> -   int iovcnt)
> +typedef struct FilterSendCo {
> +MirrorState *s;
> +char *buf;
> +ssize_t size;
> +bool done;
> +int ret;
> +} FilterSendCo;
> +
> +static int _filter_send(MirrorState *s,
> +   char *buf,
> +   ssize_t size)
>   {
>   NetFilterState *nf = NETFILTER(s);
>   int ret = 0;
> -ssize_t size = 0;
>   uint32_t len = 0;
> -char *buf;
> -
> -size = iov_size(iov, iovcnt);
> -if (!size) {
> -return 0;
> -}
>   
>   len = htonl(size);
>   ret = qemu_chr_fe_write_all(&s->chr_out, (uint8_t *)&len, sizeof(len));
> @@ -80,10 +82,7 @@ static int filter_send(MirrorState *s,
>   }
>   }
>   
> -buf = g_malloc(size);
> -iov_to_buf(iov, iovcnt, 0, buf, size);
>   ret = qemu_chr_fe_write_all(&s->chr_out, (uint8_t *)buf, size);
> -g_free(buf);
>   if (ret != size) {
>   goto err;
>   }
> @@ -94,6 +93,48 @@ err:
>   return ret < 0 ? ret : -EIO;
>   }
>   
> +static void coroutine_fn filter_send_co(void *opaque)
> +{
> +FilterSendCo *data = opaque;
> +
> +data->ret = _filter_send(data->s, data->buf, data->size);
> +data->done = true;
> +g_free(data->buf);
> +aio_wait_kick();
> +}
> +
> +static int filter_send(MirrorState *s,
> +   const struct iovec *iov,
> +   int iovcnt)
> +{
> +ssize_t size = iov_size(iov, iovcnt);
> +char *buf = NULL;
> +
> +if (!size) {
> +return 0;
> +}
> +
> +buf = g_malloc(size);
> +iov_to_buf(iov, iovcnt, 0, buf, size);
> +
> +FilterSendCo data = {
> +.s = s,
> +.size = size,
> +.buf = buf,
> +.ret = 0,
> +};
> +
> +Coroutine *co = qemu_coroutine_create(filter_send_co, &data);

BTW, does qemu/old gcc complain about such a coding style?

int a;
a = foo();
int b = a;



> +qemu_coroutine_enter(co);
> +
> +while (!data.done) {
> +aio_poll(qemu_get_aio_context(), true);
> +}
> +
> +return data.ret;
> +
redundant newline

Otherwise,
Reviewed-by: Li Zhijian 



> +}
> +
>   static void redirector_to_filter(NetFilterState *nf,
>const uint8_t *buf,
>int len)


Re: [RFC PATCH 2/2] migration/rdma: Enable use of g_autoptr with struct rdma_cm_event

2021-06-02 Thread lizhij...@fujitsu.com


On 03/06/2021 01.51, Philippe Mathieu-Daudé wrote:
> Since 00f2cfbbec6 ("glib: bump min required glib library version to
> 2.48") we can use g_auto/g_autoptr to have the compiler automatically
> free an allocated variable when it goes out of scope,
Glad to know about this feature.

However, per its code, an 'ack' does much more than just free the memory;
I'm not sure g_autoptr has the ability to do the same.
not sure g_autoptr have the ability to do the same.

2212 static void ucma_complete_event(struct cma_id_private *id_priv)
2213 {
2214 pthread_mutex_lock(&id_priv->mut);
2215 id_priv->events_completed++;
2216 pthread_cond_signal(&id_priv->cond);
2217 pthread_mutex_unlock(&id_priv->mut);
2218 }
2219
2220 static void ucma_complete_mc_event(struct cma_multicast *mc)
2221 {
 pthread_mutex_lock(&mc->id_priv->mut);
2223 mc->events_completed++;
2224 pthread_cond_signal(&mc->cond);
2225 mc->id_priv->events_completed++;
2226 pthread_cond_signal(&mc->id_priv->cond);
2227 pthread_mutex_unlock(&mc->id_priv->mut);
2228 }
2229
2230 int rdma_ack_cm_event(struct rdma_cm_event *event)
2231 {
2232 struct cma_event *evt;
2233
2234 if (!event)
2235 return ERR(EINVAL);
2236
2237 evt = container_of(event, struct cma_event, event);
2238
2239 if (evt->mc)
2240 ucma_complete_mc_event(evt->mc);
2241 else
2242 ucma_complete_event(evt->id_priv);
2243 free(evt);
2244 return 0;
2245 }

Thanks
Zhijian

> removing this
> burden on the developers.
>
> Per rdma_cm(7) and rdma_ack_cm_event(3) man pages:
>
>"rdma_ack_cm_event() - Free a communication event.
>
> All events which are allocated by rdma_get_cm_event() must be
> released, there should be a one-to-one correspondence between
> successful gets and acks. This call frees the event structure
> and any memory that it references."
>
> Since the 'ack' description doesn't explicit the event is also
> released (free'd), it is safer to use the GLib g_autoptr feature.
> The G_DEFINE_AUTOPTR_CLEANUP_FUNC() macro expects a single word
> for the type name, so add a type definition to achieve this.
> Convert to use g_autoptr and remove the rdma_ack_cm_event() calls.
>
> Inspired-by: Li Zhijian 
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
> RFC: build-tested only
> ---
>   migration/rdma.c | 27 ++-
>   1 file changed, 10 insertions(+), 17 deletions(-)
>
> diff --git a/migration/rdma.c b/migration/rdma.c
> index b50ebb9183a..b703bf1b918 100644
> --- a/migration/rdma.c
> +++ b/migration/rdma.c
> @@ -38,6 +38,9 @@
>   #include "qom/object.h"
>   #include 
>   
> +typedef struct rdma_cm_event rdma_cm_event;
> +G_DEFINE_AUTOPTR_CLEANUP_FUNC(rdma_cm_event, rdma_ack_cm_event)
> +
>   /*
>* Print and error on both the Monitor and the Log file.
>*/
> @@ -939,7 +942,7 @@ static int qemu_rdma_resolve_host(RDMAContext *rdma, 
> Error **errp)
>   int ret;
>   struct rdma_addrinfo *res;
>   char port_str[16];
> -struct rdma_cm_event *cm_event;
> +g_autoptr(rdma_cm_event) cm_event = NULL;
>   char ip[40] = "unknown";
>   struct rdma_addrinfo *e;
>   
> @@ -1007,11 +1010,11 @@ route:
>   ERROR(errp, "result not equal to event_addr_resolved %s",
>   rdma_event_str(cm_event->event));
>   perror("rdma_resolve_addr");
> -rdma_ack_cm_event(cm_event);
>   ret = -EINVAL;
>   goto err_resolve_get_addr;
>   }
>   rdma_ack_cm_event(cm_event);
> +cm_event = NULL;
>   
>   /* resolve route */
>   ret = rdma_resolve_route(rdma->cm_id, RDMA_RESOLVE_TIMEOUT_MS);
> @@ -1028,11 +1031,9 @@ route:
>   if (cm_event->event != RDMA_CM_EVENT_ROUTE_RESOLVED) {
>   ERROR(errp, "result not equal to event_route_resolved: %s",
>   rdma_event_str(cm_event->event));
> -rdma_ack_cm_event(cm_event);
>   ret = -EINVAL;
>   goto err_resolve_get_addr;
>   }
> -rdma_ack_cm_event(cm_event);
>   rdma->verbs = rdma->cm_id->verbs;
>   qemu_rdma_dump_id("source_resolve_host", rdma->cm_id->verbs);
>   qemu_rdma_dump_gid("source_resolve_host", rdma->cm_id);
> @@ -1501,7 +1502,7 @@ static uint64_t qemu_rdma_poll(RDMAContext *rdma, 
> uint64_t *wr_id_out,
>*/
>   static int qemu_rdma_wait_comp_channel(RDMAContext *rdma)
>   {
> -struct rdma_cm_event *cm_event;
> +g_autoptr(rdma_cm_event) cm_event = NULL;
>   int ret = -1;
>   
>   /*
> @@ -2503,7 +2504,7 @@ static int qemu_rdma_connect(RDMAContext *rdma, Error 
> **errp, bool return_path)
> .private_data = &cap,
> .private_data_len = sizeof(cap),
>   };
> -struct rdma_cm_event *cm_event;
> +g_autoptr(rdma_cm_event) cm_event = NULL;
>   int ret;
>   
>   /*
> @@ -2544,7 +2545,6 @@ static int qemu_rdma_connect(RDMAContext *rdma, Error 
> **errp, bool return_path)
>   if (cm_event->event != RDMA_CM_EVENT_ESTABLISHED) {
>   perr

Re: [RFC PATCH 2/2] migration/rdma: Enable use of g_autoptr with struct rdma_cm_event

2021-06-03 Thread lizhij...@fujitsu.com


On 03/06/2021 17.30, Philippe Mathieu-Daudé wrote:
> On 6/3/21 3:34 AM, lizhij...@fujitsu.com wrote:
>>
>> On 03/06/2021 01.51, Philippe Mathieu-Daudé wrote:
>>> Since 00f2cfbbec6 ("glib: bump min required glib library version to
>>> 2.48") we can use g_auto/g_autoptr to have the compiler automatically
>>> free an allocated variable when it goes out of scope,
>> Glad to know about this feature.
>>
>> However, per its code, an 'ack' does much more than just free the memory;
>> I'm not sure g_autoptr has the ability to do the same.
> See
> https://developer.gnome.org/glib/stable/glib-Miscellaneous-Macros.html#G-DEFINE-AUTOPTR-CLEANUP-FUNC:CAPS
>
>Defines the appropriate cleanup function for a pointer type.
>
>The function will not be called if the variable to be cleaned
>up contains NULL.
>
>This will typically be the _free() or _unref() function for
>the given type.
>
> This does not change the code to call free(ptr), but to call the
> registered cleanup function, which is rdma_ack_cm_event().

Thanks for your explanation.

Tested-by: Li Zhijian 



>
>> 2212 static void ucma_complete_event(struct cma_id_private *id_priv)
>> 2213 {
>> 2214 pthread_mutex_lock(&id_priv->mut);
>> 2215 id_priv->events_completed++;
>> 2216 pthread_cond_signal(&id_priv->cond);
>> 2217 pthread_mutex_unlock(&id_priv->mut);
>> 2218 }
>> 2219
>> 2220 static void ucma_complete_mc_event(struct cma_multicast *mc)
>> 2221 {
>>  pthread_mutex_lock(&mc->id_priv->mut);
>> 2223 mc->events_completed++;
>> 2224 pthread_cond_signal(&mc->cond);
>> 2225 mc->id_priv->events_completed++;
>> 2226 pthread_cond_signal(&mc->id_priv->cond);
>> 2227 pthread_mutex_unlock(&mc->id_priv->mut);
>> 2228 }
>> 2229
>> 2230 int rdma_ack_cm_event(struct rdma_cm_event *event)
>> 2231 {
>> 2232 struct cma_event *evt;
>> 2233
>> 2234 if (!event)
>> 2235 return ERR(EINVAL);
>> 2236
>> 2237 evt = container_of(event, struct cma_event, event);
>> 2238
>> 2239 if (evt->mc)
>> 2240 ucma_complete_mc_event(evt->mc);
>> 2241 else
>> 2242 ucma_complete_event(evt->id_priv);
>> 2243 free(evt);
>> 2244 return 0;
>> 2245 }
>>
>> Thanks
>> Zhijian
>>
>>> removing this
>>> burden on the developers.
>>>
>>> Per rdma_cm(7) and rdma_ack_cm_event(3) man pages:
>>>
>>> "rdma_ack_cm_event() - Free a communication event.
>>>
>>>  All events which are allocated by rdma_get_cm_event() must be
>>>  released, there should be a one-to-one correspondence between
>>>  successful gets and acks. This call frees the event structure
>>>  and any memory that it references."
>>>
>>> Since the 'ack' description doesn't explicit the event is also
>>> released (free'd), it is safer to use the GLib g_autoptr feature.
>>> The G_DEFINE_AUTOPTR_CLEANUP_FUNC() macro expects a single word
>>> for the type name, so add a type definition to achieve this.
>>> Convert to use g_autoptr and remove the rdma_ack_cm_event() calls.
>>>
>>> Inspired-by: Li Zhijian 
>>> Signed-off-by: Philippe Mathieu-Daudé 
>>> ---
>>> RFC: build-tested only
>>> ---
>>>migration/rdma.c | 27 ++-
>>>1 file changed, 10 insertions(+), 17 deletions(-)
>>>
>>> diff --git a/migration/rdma.c b/migration/rdma.c
>>> index b50ebb9183a..b703bf1b918 100644
>>> --- a/migration/rdma.c
>>> +++ b/migration/rdma.c
>>> @@ -38,6 +38,9 @@
>>>#include "qom/object.h"
>>>#include 
>>>
>>> +typedef struct rdma_cm_event rdma_cm_event;
>>> +G_DEFINE_AUTOPTR_CLEANUP_FUNC(rdma_cm_event, rdma_ack_cm_event)
>>> +
>>>/*
>>> * Print and error on both the Monitor and the Log file.
>>> */
>>> @@ -939,7 +942,7 @@ static int qemu_rdma_resolve_host(RDMAContext *rdma, 
>>> Error **errp)
>>>int ret;
>>>struct rdma_addrinfo *res;
>>>char port_str[16];
>>> -struct rdma_cm_event *cm_event;
>>> +g_autoptr(rdma_cm_event) cm_event = NULL;
>>>char ip[40] = "unknown";
>>>struct rdma_addrinfo *e;
>>>
>>> @@ -1007,11 +1010,11 @@ route:
>>>ERROR(errp, "result not equal to event_addr_resolved %s",
>>>rdma_event_str(cm_event->event));
>>>perror("rdma_resolve_addr");
>>> -rdma_ack_cm_event(cm_event);
>>>ret = -EINVAL;
>>>goto err_resolve_get_addr;
>>>}


Re: [PATCH] migration/rdma: Fix cm_event used before being initialized

2021-05-13 Thread lizhij...@fujitsu.com


On 14/05/2021 01.15, Dr. David Alan Gilbert wrote:
> * Li Zhijian (lizhij...@cn.fujitsu.com) wrote:
>> A segmentation fault was triggered when I tried to abort a postcopy + rdma
>> migration,
>>
>> since rdma_ack_cm_event releases an uninitialized cm_event in this case.
>>
>> like below:
>> 2496 ret = rdma_get_cm_event(rdma->channel, &cm_event);
>> 2497 if (ret) {
>> 2498 perror("rdma_get_cm_event after rdma_connect");
>> 2499 ERROR(errp, "connecting to destination!");
>> 2500 rdma_ack_cm_event(cm_event);  <- causes segmentation fault
>> 2501 goto err_rdma_source_connect;
>> 2502 }
>>
>> Signed-off-by: Li Zhijian 
> OK, that's an easy fix then; but I wonder if we should perhaps remove
> that rdma_ack_cm_event, if it's the get_cm_event that's failed?

I also wondered. I checked the rdma_get_cm_event(3) man page, which does not 
document this,

and checked some rdma examples: some of them try to ack it[1], but some do not[2].

[1]: 
https://github.com/linux-rdma/rdma-core/blob/e381334c2915a5290565694947790d4aebaf/librdmacm/examples/mckey.c#L451
[2]: 
https://github.com/linux-rdma/rdma-core/blob/e381334c2915a5290565694947790d4aebaf/librdmacm/examples/mckey.c#L342

Thanks

>
> Still,
>
>
> Reviewed-by: Dr. David Alan Gilbert 
>
>> ---
>>   migration/rdma.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/migration/rdma.c b/migration/rdma.c
>> index 00eac34232..2dadb62aed 100644
>> --- a/migration/rdma.c
>> +++ b/migration/rdma.c
>> @@ -2466,7 +2466,7 @@ static int qemu_rdma_connect(RDMAContext *rdma, Error 
>> **errp)
>> .private_data = &cap,
>> .private_data_len = sizeof(cap),
>>   };
>> -struct rdma_cm_event *cm_event;
>> +struct rdma_cm_event *cm_event = NULL;
>>   int ret;
>>   
>>   /*
>> -- 
>> 2.30.2
>>
>>
>>


Re: [PATCH v2] block: Improve backing file validation

2021-05-17 Thread lizhij...@fujitsu.com


On 12/05/2021 23.10, Kevin Wolf wrote:
> Am 11.05.2021 um 10:35 hat Daniel P. Berrangé geschrieben:
>> On Tue, May 11, 2021 at 01:55:18PM +0800, Li Zhijian wrote:
>>>   void bdrv_img_create(const char *filename, const char *fmt,
>>>const char *base_filename, const char *base_fmt,
>>>char *options, uint64_t img_size, int flags, bool 
>>> quiet,
>>> @@ -6507,13 +6538,7 @@ void bdrv_img_create(const char *filename, const 
>>> char *fmt,
>>>   
>>>   backing_file = qemu_opt_get(opts, BLOCK_OPT_BACKING_FILE);
>>>   if (backing_file) {
>>> -if (!strcmp(filename, backing_file)) {
>>> -error_setg(errp, "Error: Trying to create an image with the "
>>> - "same filename as the backing file");
>>> -goto out;
>>> -}
>>> -if (backing_file[0] == '\0') {
>>> -error_setg(errp, "Expected backing file name, got empty 
>>> string");
>>> +if (!validate_backing_file(filename, backing_file, errp)) {
>>>   goto out;
>>>   }
>>>   }
>> Thinking about this again, this seems to be quite high in the generic block
>> layer code. As such I don't think we can assume that the backing file here
>> is actually a plain file on disk. IIUC the backing file could still be any
>> of the block drivers. Only once we get down into the protocol specific
>> drivers can be validate the type of backend.
> Yes, you definitely can't assume that filename is really a local file
> name here. It could be any other protocol supported by QEMU, or even use
> the json: pseudo-protocol.
>
>> I'm not sure what the right way to deal with that is, so perhaps Kevin or
>> Max can make a suggestion.
> Can we just keep the backing file open with write permissions unshared
> so that locking will fail for the new image?


Not sure if I have understood. In my understanding, open(2) alone cannot 
support 'open with write permissions unshared';

it has to cooperate with flock(2)/fcntl(2) to accomplish exclusive writing.


Currently, the QEMU block layer also doesn't support 'open with write 
permissions unshared', but I found something:

#define BDRV_O_NO_SHARE0x0001 /* don't share permissions */


And I have tried the changes below, expecting the block layer to fail to write the image.

@@ -6563,7 +6563,7 @@ void bdrv_img_create(const char *filename, const char 
*fmt,

assert(full_backing);

/* backing files always opened read-only */

- back_flags = flags;

+ back_flags = flags | BDRV_O_NO_SHARE;

back_flags &= ~(BDRV_O_RDWR | BDRV_O_SNAPSHOT | BDRV_O_NO_BACKING);

backing_options = qdict_new();


But in practice, the image is created successfully.

So do you mean we should implement a new flag like 'BDRV_O_NO_SHARE_WR' to 
handle this?


Thanks
Zhijian

>   Or would that error
> condition be detected too late so that the image would already be
> truncated?

>
> Kevin
>
>
>


Re: [PATCH] migration/rdma: Fix cm_event used before being initialized

2021-05-18 Thread lizhij...@fujitsu.com


On 17/05/2021 18.00, Dr. David Alan Gilbert wrote:
> * lizhij...@fujitsu.com (lizhij...@fujitsu.com) wrote:
>>
>> On 14/05/2021 01.15, Dr. David Alan Gilbert wrote:
>>> * Li Zhijian (lizhij...@cn.fujitsu.com) wrote:
>>>> A segmentation fault was triggered when i try to abort a postcopy + rdma
>>>> migration.
>>>>
>>>> since rdma_ack_cm_event releases a uninitialized cm_event in thise case.
>>>>
>>>> like below:
>>>> 2496 ret = rdma_get_cm_event(rdma->channel, &cm_event);
>>>> 2497 if (ret) {
>>>> 2498 perror("rdma_get_cm_event after rdma_connect");
>>>> 2499 ERROR(errp, "connecting to destination!");
>>>> 2500 rdma_ack_cm_event(cm_event); <<<< cause segmentation fault
>>>> 2501 goto err_rdma_source_connect;
>>>> 2502 }
>>>>
>>>> Signed-off-by: Li Zhijian 
>>> OK, that's an easy fix then; but I wonder if we should perhaps remove
>>> that rdma_ack_cm_event, if it's the get_cm_event that's failed?
>> I also wondered, i checked the man page get_cm_event(3) which has not 
>> documented
>>
>> and checked some rdma examples, some of them try to ack it[1],  but some 
>> not[2].
> I think they're actually consistent:
You are right.
I also checked the rdma_get_cm_event() code; indeed, the event is changed only if 
rdma_get_cm_event() returns 0.
So I agree to remove rdma_ack_cm_event(event) in the error path. I will update the 
patch soon.

Thanks
Zhijian




Re: [PATCH RESEND 3/4] migration/rdma: destination: create the return patch after the first accept

2021-05-20 Thread lizhij...@fujitsu.com
I should make some changes for this patch, like below:

# git diff
diff --git a/migration/rdma.c b/migration/rdma.c
index 3b228c46ebf..067ea272276 100644
--- a/migration/rdma.c
+++ b/migration/rdma.c
@@ -316,7 +316,7 @@ typedef struct RDMALocalBlocks {
  typedef struct RDMAContext {
  char *host;
  int port;
-    const char *host_port;
+    char *host_port;

  RDMAWorkRequestData wr_data[RDMA_WRID_MAX];

@@ -2393,7 +2393,9 @@ static void qemu_rdma_cleanup(RDMAContext *rdma)
  rdma->channel = NULL;
  }
  g_free(rdma->host);
+    g_free(rdma->host_port);
  rdma->host = NULL;
+    rdma->host_port = NULL;
  }


@@ -2649,7 +2651,7 @@ static void *qemu_rdma_data_init(const char *host_port, 
Error **errp)
  if (!inet_parse(addr, host_port, NULL)) {
  rdma->port = atoi(addr->port);
  rdma->host = g_strdup(addr->host);
-    rdma->host_port = host_port;
+    rdma->host_port = g_strdup(host_port);
  } else {
  ERROR(errp, "bad RDMA migration address '%s'", host_port);
  g_free(rdma);
@@ -4076,6 +4078,7 @@ err:
  error_propagate(errp, local_err);
  if (rdma) {
  g_free(rdma->host);
+    g_free(rdma->host_port);
  }
  g_free(rdma);
  g_free(rdma_return_path);


On 20/05/2021 16.11, Li Zhijian wrote:
> destination side:
> $ build/qemu-system-x86_64 -enable-kvm -netdev 
> tap,id=hn0,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown -device 
> e1000,netdev=hn0,mac=50:52:54:00:11:22 -boot c -drive 
> if=none,file=./Fedora-rdma-server-migration.qcow2,id=drive-virtio-disk0 
> -device 
> virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -m 
> 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet -monitor stdio -vga qxl 
> -spice streaming-video=filter,port=5902,disable-ticketing -incoming 
> rdma:192.168.1.10:
> (qemu) migrate_set_capability postcopy-ram on
> (qemu)
> dest_init RDMA Device opened: kernel name rocep1s0f0 uverbs device name 
> uverbs0, infiniband_verbs class device path 
> /sys/class/infiniband_verbs/uverbs0, infiniband class device path 
> /sys/class/infiniband/rocep1s0f0, transport: (2) Ethernet
> Segmentation fault (core dumped)
>
>   (gdb) bt
>   #0  qemu_rdma_accept (rdma=0x0) at ../migration/rdma.c:3272
>   #1  rdma_accept_incoming_migration (opaque=0x0) at 
> ../migration/rdma.c:3986
>   #2  0x563c9e51f02a in aio_dispatch_handler
>   (ctx=ctx@entry=0x563ca0606010, node=0x563ca12b2150) at 
> ../util/aio-posix.c:329
>   #3  0x563c9e51f752 in aio_dispatch_handlers (ctx=0x563ca0606010) at 
>  ../util/aio-posix.c:372
>   #4  aio_dispatch (ctx=0x563ca0606010) at ../util/aio-posix.c:382
>   #5  0x563c9e4f4d9e in aio_ctx_dispatch (source=,  
> callback=, user_data=)at ../util/async.c:306
>   #6  0x7fe96ef3fa9f in g_main_context_dispatch () at  
> /lib64/libglib-2.0.so.0
>   #7  0x563c9e4ffeb8 in glib_pollfds_poll () at 
> ../util/main-loop.c:231
>   #8  os_host_main_loop_wait (timeout=12188789) at ../util/main-loop.c:254
>   #9  main_loop_wait (nonblocking=nonblocking@entry=0) at 
> ../util/main-loop.c:530
>   #10 0x563c9e3c7211 in qemu_main_loop () at ../softmmu/runstate.c:725
>   #11 0x563c9dfd46fe in main (argc=, argv= out>, envp=) at ../softmmu/main.c:50
>
> The rdma return path will not be created when qemu incoming is starting
> since migrate_copy() is false at that moment, then a  NULL return path
> rdma was referenced if the user enabled postcopy later.
>
> Signed-off-by: Li Zhijian 
> ---
>   migration/rdma.c | 29 ++---
>   1 file changed, 18 insertions(+), 11 deletions(-)
>
> diff --git a/migration/rdma.c b/migration/rdma.c
> index 651534e825..3b228c46eb 100644
> --- a/migration/rdma.c
> +++ b/migration/rdma.c
> @@ -316,6 +316,7 @@ typedef struct RDMALocalBlocks {
>   typedef struct RDMAContext {
>   char *host;
>   int port;
> +const char *host_port;
>   
>   RDMAWorkRequestData wr_data[RDMA_WRID_MAX];
>   
> @@ -2648,6 +2649,7 @@ static void *qemu_rdma_data_init(const char *host_port, 
> Error **errp)
>   if (!inet_parse(addr, host_port, NULL)) {
>   rdma->port = atoi(addr->port);
>   rdma->host = g_strdup(addr->host);
> +rdma->host_port = host_port;
>   } else {
>   ERROR(errp, "bad RDMA migration address '%s'", host_port);
>   g_free(rdma);
> @@ -3276,6 +3278,7 @@ static int qemu_rdma_accept(RDMAContext *rdma)
>   .private_data = &cap,
>   .private_data_len = sizeof(cap),
>};
> +RDMAContext *rdma_return_path = NULL;
>   struct rdma_cm_event *cm_event;
>   struct ibv_context *verbs;
>   int ret = -EINVAL;
> @@ -3291,6 +3294,20 @@ static int qemu_rdma_accept(RDMAContext *rdma)
>   goto err_rdma_dest_

Re: [PATCH] docs/nvdimm: update doc

2021-07-06 Thread lizhij...@fujitsu.com

ping...


On 11/06/2021 11:41, Li Zhijian wrote:
> The prompt was updated since def835f0da ('hostmem: Don't report pmem 
> attribute if unsupported')
>
> Signed-off-by: Li Zhijian 
> ---
>   docs/nvdimm.txt | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/docs/nvdimm.txt b/docs/nvdimm.txt
> index 0aae682be3e..71cdbdf554b 100644
> --- a/docs/nvdimm.txt
> +++ b/docs/nvdimm.txt
> @@ -247,7 +247,8 @@ is built with libpmem [2] support (configured with 
> --enable-libpmem), QEMU
>   will take necessary operations to guarantee the persistence of its own 
> writes
>   to the vNVDIMM backend(e.g., in vNVDIMM label emulation and live migration).
>   If 'pmem' is 'on' while there is no libpmem support, qemu will exit and 
> report
> -a "lack of libpmem support" message to ensure the persistence is available.
> +a "lack of libpmem support" (or "Invalid parameter 'pmem'" since v6.0.0)
> +message to ensure the persistence is available.
>   For example, if we want to ensure the persistence for some backend file,
>   use the QEMU command line:
>   


Re: [PATCH] migration/rdma: prevent from double free the same mr

2021-07-08 Thread lizhij...@fujitsu.com


On 09/07/2021 03:11, Dr. David Alan Gilbert wrote:
> * Li Zhijian (lizhij...@cn.fujitsu.com) wrote:
>> backtrace:
>> '0x75f44ec2 in __ibv_dereg_mr_1_1 (mr=0x7fff1007d390) at 
>> /home/lizhijian/rdma-core/libibverbs/verbs.c:478
>> 478 void *addr  = mr->addr;
> ANy idea why it deletes the same mr twice?

It's easy to reproduce if we give an nvdimm device an fsdax 
memory-backend-file as backing, which cannot support registering an MR, like:

[root@iaas-rpma ~]# mount | grep pmem0
/dev/pmem0 on /mnt/pmem0 type ext4 (rw,relatime,seclabel,dax=always)

[root@iaas-rpma ~]# ndctl list -n namespace0.0
[
   {
     "dev":"namespace0.0",
     "mode":"fsdax",
     "map":"mem",
     "size":536870912,
     "sector_size":512,
     "blockdev":"pmem0"
   }
]


`-object 
memory-backend-file,id=mem1,share=on,mem-path=/mnt/pmem0/nv-128m.img,size=128m,pmem=on,align=2m
 -device nvdimm,memdev=mem1,id=nv1`

and then enable rdma-pin-all.

(qemu) migrate_set_capability rdma-pin-all on
(qemu)

Now QEMU has at least 2 RAM blocks, pc.ram and mem1; the latter will fail 
to register an MR:
`Failed to register local dest ram block! : Invalid argument   `

In this case, the MR of pc.ram will be deregistered twice.

Thanks
Li
>
> Dave
>
>> (gdb) bt
>>   #0  0x75f44ec2 in __ibv_dereg_mr_1_1 (mr=0x7fff1007d390) at 
>> /home/lizhijian/rdma-core/libibverbs/verbs.c:478
>>   #1  0x55891fcc in rdma_delete_block (block=, 
>> rdma=0x7fff38176010) at ../migration/rdma.c:691
>>   #2  qemu_rdma_cleanup (rdma=0x7fff38176010) at ../migration/rdma.c:2365
>>   #3  0x558925b0 in qio_channel_rdma_close_rcu (rcu=0x56b8b6c0) 
>> at ../migration/rdma.c:3073
>>   #4  0x55d652a3 in call_rcu_thread (opaque=opaque@entry=0x0) at 
>> ../util/rcu.c:281
>>   #5  0x55d5edf9 in qemu_thread_start (args=0x7fffe88bb4d0) at 
>> ../util/qemu-thread-posix.c:541
>>   #6  0x754c73f9 in start_thread () at /lib64/libpthread.so.0
>>   #7  0x753f3b03 in clone () at /lib64/libc.so.6 '
>>
>> Signed-off-by: Li Zhijian 
>> ---
>>   migration/rdma.c | 1 +
>>   1 file changed, 1 insertion(+)
>>
>> diff --git a/migration/rdma.c b/migration/rdma.c
>> index b6cc4bef4a8..0f22b8227c0 100644
>> --- a/migration/rdma.c
>> +++ b/migration/rdma.c
>> @@ -1143,6 +1143,7 @@ static int qemu_rdma_reg_whole_ram_blocks(RDMAContext 
>> *rdma)
>>   
>>   for (i--; i >= 0; i--) {
>>   ibv_dereg_mr(local->block[i].mr);
>> +local->block[i].mr = NULL;
>>   rdma->total_registrations--;
>>   }
>>   
>> -- 
>> 2.30.2
>>
>>
>>


Re: [PATCH v2 1/2] migration/rdma: Fix out of order wrid

2021-06-28 Thread lizhij...@fujitsu.com


On 25/06/2021 00:42, Dr. David Alan Gilbert wrote:
> * Li Zhijian (lizhij...@cn.fujitsu.com) wrote:
>> destination:
>> ../qemu/build/qemu-system-x86_64 -enable-kvm -netdev 
>> tap,id=hn0,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown -device 
>> e1000,netdev=hn0,mac=50:52:54:00:11:22 -boot c -drive 
>> if=none,file=./Fedora-rdma-server-migration.qcow2,id=drive-virtio-disk0 
>> -device 
>> virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 
>> -m 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet -monitor stdio -vga 
>> qxl -spice streaming-video=filter,port=5902,disable-ticketing -incoming 
>> rdma:192.168.22.23:
>> qemu-system-x86_64: -spice 
>> streaming-video=filter,port=5902,disable-ticketing: warning: short-form 
>> boolean option 'disable-ticketing' deprecated
>> Please use disable-ticketing=on instead
>> QEMU 6.0.50 monitor - type 'help' for more information
>> (qemu) trace-event qemu_rdma_block_for_wrid_miss on
>> (qemu) dest_init RDMA Device opened: kernel name rxe_eth0 uverbs device name 
>> uverbs2, infiniband_verbs class device path 
>> /sys/class/infiniband_verbs/uverbs2, infiniband class device path 
>> /sys/class/infiniband/rxe_eth0, transport: (2) Ethernet
>> qemu_rdma_block_for_wrid_miss A Wanted wrid CONTROL SEND (2000) but got 
>> CONTROL RECV (4000)
>>
>> source:
>> ../qemu/build/qemu-system-x86_64 -enable-kvm -netdev 
>> tap,id=hn0,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown -device 
>> e1000,netdev=hn0,mac=50:52:54:00:11:22 -boot c -drive 
>> if=none,file=./Fedora-rdma-server.qcow2,id=drive-virtio-disk0 -device 
>> virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 
>> -m 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet -monitor stdio -vga 
>> qxl -spice streaming-video=filter,port=5901,disable-ticketing -S
>> qemu-system-x86_64: -spice 
>> streaming-video=filter,port=5901,disable-ticketing: warning: short-form 
>> boolean option 'disable-ticketing' deprecated
>> Please use disable-ticketing=on instead
>> QEMU 6.0.50 monitor - type 'help' for more information
>> (qemu)
>> (qemu) trace-event qemu_rdma_block_for_wrid_miss on
>> (qemu) migrate -d rdma:192.168.22.23:
>> source_resolve_host RDMA Device opened: kernel name rxe_eth0 uverbs device 
>> name uverbs2, infiniband_verbs class device path 
>> /sys/class/infiniband_verbs/uverbs2, infiniband class device path 
>> /sys/class/infiniband/rxe_eth0, transport: (2) Ethernet
>> (qemu) qemu_rdma_block_for_wrid_miss A Wanted wrid WRITE RDMA (1) but got 
>> CONTROL RECV (4000)
>>
>> NOTE: soft RoCE as the rdma device.
>> [root@iaas-rpma images]# rdma link show rxe_eth0/1
>> link rxe_eth0/1 state ACTIVE physical_state LINK_UP netdev eth0
>>
>> This migration cannot be completed since out of order(OOO) CQ event occurs.
>> OOO cases will occur in both source side and destination side. And it
>> happens on only SEND and RECV are out of order. OOO between 'WRITE RDMA' and
>> 'RECV' doesn't matter.
>>
>> below the OOO sequence:
>>source destination
>>qemu_rdma_write_one()  qemu_rdma_registration_handle()
>> 1.   post_recv X post_recv Y
>> 2.   post_send X
>> 3.   wait X CQ event
>> 4.   X CQ event
>> 5.   post_send Y
>> 6.   wait Y CQ event
>> 7.   Y CQ event (dropped)
>> 8.   Y CQ event(send Y done)
>> 9.   X CQ event(send X done)
>> 10. wait Y CQ event(dropped at (7), blocks 
>> forever)
>>
>> Looks it only happens on soft RoCE rdma device in my a hundred of runs,
>> a hardware IB device works fine.
>>
>> Here we introduce a independent send completion queue to distinguish
>> ibv_post_send completion queue from the original mixed completion queue.
>> It helps us to poll the specific CQE we are really interesting in.
> Hi Li,
>OK, it's a while since I've thought this much about completion, but I
> think that's OK, however, what stops the other messages, RDMA_WRITE and
> SEND_CONTROL being out of order?

Once either the source or the destination gets an OOO wrid like the one below, 
both sides will wait for their FDs to become readable, so the migration has no 
chance to complete:
qemu_rdma_block_for_wrid_miss A Wanted wrid CONTROL SEND (2000) but got CONTROL 
RECV (4000)



>
>Could this be fixed another way; make block_for_wrid record a flag for
> WRID's it's received, and then check (and clear) that flag right at the
> start?

I intended to do so, like [1], but I think it's too tricky and hard to understand.

And I have concerns about:
- should we record an OOO for 'WRITE RDMA' and CONTROL RECV even though it 
doesn't matter in practice?
- how many ooo_wrid should we record? I have observed 2 later WRs' CQEs 
arriving earlier than the wanted one.
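For context, the separate-send-CQ approach boils down to giving the QP two completion queues instead of one, so polling for a SEND completion can never consume (and drop) a RECV CQE. A minimal libibverbs sketch, untested here since it needs an RDMA device — the capacity constants and the `verbs`/`pd`/`cm_id` variables are assumed from surrounding setup code, and error handling is omitted:

```c
/* Create one CQ for send work requests and one for receives, then
 * attach both to the QP at creation time. */
struct ibv_cq *send_cq = ibv_create_cq(verbs, SEND_CQ_DEPTH, NULL,
                                       comp_channel, 0);
struct ibv_cq *recv_cq = ibv_create_cq(verbs, RECV_CQ_DEPTH, NULL,
                                       comp_channel, 0);

struct ibv_qp_init_attr attr = {
    .send_cq = send_cq,   /* SEND / RDMA WRITE completions land here */
    .recv_cq = recv_cq,   /* RECV completions land here */
    .qp_type = IBV_QPT_RC,
    .cap = {
        .max_send_wr  = SEND_CQ_DEPTH,
        .max_recv_wr  = RECV_CQ_DEPTH,
        .max_send_sge = 1,
        .max_recv_sge = 1,
    },
};
rdma_create_qp(cm_id, pd, &attr);

/* Waiting for a SEND wrid now polls only send_cq, so a RECV CQE
 * arriving "early" cannot be mistaken for (or hide) the wanted one. */
struct ibv_wc wc;
while (ibv_poll_cq(send_cq, 1, &wc) < 1) {
    /* block on the completion channel, then re-poll */
}
```

With the two queues split this way, the ordering question between SEND and RECV completions disappears by construction, which is why the patch no longer needs to track missed wrids.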



[1]: 
https://lore.kernel.org/qemu-devel/162371118578.2358.12447251487494492434@7c66fb7bc3ab/T/#t

Thanks
Li

>
> Dave
>
>> Signed-off-by: Li

Re: [PATCH] block: Improve backing file validation

2021-05-10 Thread lizhij...@fujitsu.com

On 2021/5/10 16:41, Daniel P. Berrangé wrote:
> On Mon, May 10, 2021 at 12:30:45PM +0800, Li Zhijian wrote:
>> Image below user cases:
>> case 1:
>> ```
>> $ qemu-img create -f raw source.raw 1G
>> $ qemu-img create -f qcow2 -F raw -b source.raw ./source.raw
>> qemu-img info source.raw
>> image: source.raw
>> file format: qcow2
>> virtual size: 193K (197120 bytes)
>> disk size: 196K
>> cluster_size: 65536
>> backing file: source.raw <<
>> backing file format: raw
>> Format specific information:
>>  compat: 1.1
>>  lazy refcounts: false
>>  refcount bits: 16
>>  corrupt: false
>> ```
>>
>> case 2:
>> ```
>> $ qemu-img create -f raw source.raw 1G
>> $ ln -sf source.raw destination.qcow2
>> $ qemu-img create -f qcow2 -F raw -b source.raw ./destination.qcow2
>> qemu-img info source.raw
>> image: source.raw
>> file format: qcow2 <<
>> virtual size: 2.0G (2147483648 bytes)
>> disk size: 196K
>> cluster_size: 65536
>> backing file: source.raw
>> backing file format: raw
>> Format specific information:
>>  compat: 1.1
>>  lazy refcounts: false
>>  refcount bits: 16
>>  corrupt: false
>> ```
>> Generally, we don't expect to corrupte the source.raw anyway, while
>> actually it does.
>>
>> Here we validate the realpath of file instead the input string.
> That still won't handle the case where you use hard links
>
>$ ln source.raw destination.qcow2
>
> To properly validate the scenarios I think it is neccessary
> to ignore the filename sentirely.
>
> Instead attempt to open both files, and if successful, fstat()
> them both, and then compare the st_dev + st_ino  fields.


Sounds great, I will update it.

Thanks

Zhijian