Re: [PATCH v2 1/2] migration/rdma: Fix out of order wrid
On 25/06/2021 00:42, Dr. David Alan Gilbert wrote:
> * Li Zhijian (lizhij...@cn.fujitsu.com) wrote:
>> destination:
>> ../qemu/build/qemu-system-x86_64 -enable-kvm -netdev
>> tap,id=hn0,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown -device
>> e1000,netdev=hn0,mac=50:52:54:00:11:22 -boot c -drive
>> if=none,file=./Fedora-rdma-server-migration.qcow2,id=drive-virtio-disk0
>> -device
>> virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
>> -m 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet -monitor stdio -vga
>> qxl -spice streaming-video=filter,port=5902,disable-ticketing -incoming
>> rdma:192.168.22.23:
>> qemu-system-x86_64: -spice
>> streaming-video=filter,port=5902,disable-ticketing: warning: short-form
>> boolean option 'disable-ticketing' deprecated
>> Please use disable-ticketing=on instead
>> QEMU 6.0.50 monitor - type 'help' for more information
>> (qemu) trace-event qemu_rdma_block_for_wrid_miss on
>> (qemu) dest_init RDMA Device opened: kernel name rxe_eth0 uverbs device name
>> uverbs2, infiniband_verbs class device path
>> /sys/class/infiniband_verbs/uverbs2, infiniband class device path
>> /sys/class/infiniband/rxe_eth0, transport: (2) Ethernet
>> qemu_rdma_block_for_wrid_miss A Wanted wrid CONTROL SEND (2000) but got
>> CONTROL RECV (4000)
>>
>> source:
>> ../qemu/build/qemu-system-x86_64 -enable-kvm -netdev
>> tap,id=hn0,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown -device
>> e1000,netdev=hn0,mac=50:52:54:00:11:22 -boot c -drive
>> if=none,file=./Fedora-rdma-server.qcow2,id=drive-virtio-disk0 -device
>> virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
>> -m 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet -monitor stdio -vga
>> qxl -spice streaming-video=filter,port=5901,disable-ticketing -S
>> qemu-system-x86_64: -spice
>> streaming-video=filter,port=5901,disable-ticketing: warning: short-form
>> boolean option 'disable-ticketing' deprecated
>> Please use disable-ticketing=on instead
>> QEMU 6.0.50 monitor - type 'help' for more information
>> (qemu)
>> (qemu) trace-event qemu_rdma_block_for_wrid_miss on
>> (qemu) migrate -d rdma:192.168.22.23:
>> source_resolve_host RDMA Device opened: kernel name rxe_eth0 uverbs device
>> name uverbs2, infiniband_verbs class device path
>> /sys/class/infiniband_verbs/uverbs2, infiniband class device path
>> /sys/class/infiniband/rxe_eth0, transport: (2) Ethernet
>> (qemu) qemu_rdma_block_for_wrid_miss A Wanted wrid WRITE RDMA (1) but got
>> CONTROL RECV (4000)
>>
>> NOTE: soft RoCE is used as the RDMA device.
>> [root@iaas-rpma images]# rdma link show rxe_eth0/1
>> link rxe_eth0/1 state ACTIVE physical_state LINK_UP netdev eth0
>>
>> This migration cannot complete because an out-of-order (OOO) CQ event
>> occurs. OOO cases occur on both the source side and the destination side,
>> and only when SEND and RECV are out of order. OOO between 'WRITE RDMA' and
>> 'RECV' doesn't matter.
>>
>> The OOO sequence is shown below:
>> source destination
>> qemu_rdma_write_one() qemu_rdma_registration_handle()
>> 1. post_recv X post_recv Y
>> 2. post_send X
>> 3. wait X CQ event
>> 4. X CQ event
>> 5. post_send Y
>> 6. wait Y CQ event
>> 7. Y CQ event (dropped)
>> 8. Y CQ event(send Y done)
>> 9. X CQ event(send X done)
>> 10. wait Y CQ event(dropped at (7), blocks forever)
>>
>> In about a hundred runs, it only happens on the soft RoCE RDMA device;
>> a hardware IB device works fine.
>>
>> Here we introduce an independent send completion queue to separate
>> ibv_post_send completions from the original mixed completion queue.
>> It lets us poll for the specific CQE we are really interested in.
> Hi Li,
> OK, it's been a while since I've thought this much about completions, but
> I think that's OK; however, what stops the other messages, RDMA_WRITE and
> SEND_CONTROL, being out of order?
Once either the source or the destination hits the OOO wrid shown below, both sides wait for their FDs to become readable, so the migration has no chance of completing.

qemu_rdma_block_for_wrid_miss A Wanted wrid CONTROL SEND (2000) but got CONTROL RECV (4000)

>
> Could this be fixed another way; make block_for_wrid record a flag for
> WRIDs it's received, and then check (and clear) that flag right at the
> start?

I intended to do so, as in [1], but I think it's too tricky and hard to understand.

I also have these concerns:
- should we record an OOO between 'WRITE RDMA' and CONTROL RECV even if it doesn't matter in practice?
- how many ooo_wrid entries should we record? I have observed the CQ events of 2 later WRs arriving earlier than the wanted one.

[1]: https://lore.kernel.org/qemu-devel/162371118578.2358.12447251487494492434@7c66fb7bc3ab/T/#t

Thanks
Li

>
> Dave
>
>> Signed-off-by: Li
Re: [PATCH v2 1/2] migration/rdma: Fix out of order wrid
* Li Zhijian (lizhij...@cn.fujitsu.com) wrote:
> destination:
> ../qemu/build/qemu-system-x86_64 -enable-kvm -netdev
> tap,id=hn0,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown -device
> e1000,netdev=hn0,mac=50:52:54:00:11:22 -boot c -drive
> if=none,file=./Fedora-rdma-server-migration.qcow2,id=drive-virtio-disk0
> -device
> virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -m
> 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet -monitor stdio -vga qxl
> -spice streaming-video=filter,port=5902,disable-ticketing -incoming
> rdma:192.168.22.23:
> qemu-system-x86_64: -spice
> streaming-video=filter,port=5902,disable-ticketing: warning: short-form
> boolean option 'disable-ticketing' deprecated
> Please use disable-ticketing=on instead
> QEMU 6.0.50 monitor - type 'help' for more information
> (qemu) trace-event qemu_rdma_block_for_wrid_miss on
> (qemu) dest_init RDMA Device opened: kernel name rxe_eth0 uverbs device name
> uverbs2, infiniband_verbs class device path
> /sys/class/infiniband_verbs/uverbs2, infiniband class device path
> /sys/class/infiniband/rxe_eth0, transport: (2) Ethernet
> qemu_rdma_block_for_wrid_miss A Wanted wrid CONTROL SEND (2000) but got
> CONTROL RECV (4000)
>
> source:
> ../qemu/build/qemu-system-x86_64 -enable-kvm -netdev
> tap,id=hn0,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown -device
> e1000,netdev=hn0,mac=50:52:54:00:11:22 -boot c -drive
> if=none,file=./Fedora-rdma-server.qcow2,id=drive-virtio-disk0 -device
> virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -m
> 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet -monitor stdio -vga qxl
> -spice streaming-video=filter,port=5901,disable-ticketing -S
> qemu-system-x86_64: -spice
> streaming-video=filter,port=5901,disable-ticketing: warning: short-form
> boolean option 'disable-ticketing' deprecated
> Please use disable-ticketing=on instead
> QEMU 6.0.50 monitor - type 'help' for more information
> (qemu)
> (qemu) trace-event qemu_rdma_block_for_wrid_miss on
> (qemu) migrate -d rdma:192.168.22.23:
> source_resolve_host RDMA Device opened: kernel name rxe_eth0 uverbs device
> name uverbs2, infiniband_verbs class device path
> /sys/class/infiniband_verbs/uverbs2, infiniband class device path
> /sys/class/infiniband/rxe_eth0, transport: (2) Ethernet
> (qemu) qemu_rdma_block_for_wrid_miss A Wanted wrid WRITE RDMA (1) but got
> CONTROL RECV (4000)
>
> NOTE: soft RoCE is used as the RDMA device.
> [root@iaas-rpma images]# rdma link show rxe_eth0/1
> link rxe_eth0/1 state ACTIVE physical_state LINK_UP netdev eth0
>
> This migration cannot complete because an out-of-order (OOO) CQ event
> occurs. OOO cases occur on both the source side and the destination side,
> and only when SEND and RECV are out of order. OOO between 'WRITE RDMA' and
> 'RECV' doesn't matter.
>
> The OOO sequence is shown below:
> source destination
> qemu_rdma_write_one() qemu_rdma_registration_handle()
> 1. post_recv X post_recv Y
> 2. post_send X
> 3. wait X CQ event
> 4. X CQ event
> 5. post_send Y
> 6. wait Y CQ event
> 7. Y CQ event (dropped)
> 8. Y CQ event(send Y done)
> 9. X CQ event(send X done)
> 10. wait Y CQ event(dropped at (7), blocks forever)
>
> In about a hundred runs, it only happens on the soft RoCE RDMA device;
> a hardware IB device works fine.
>
> Here we introduce an independent send completion queue to separate
> ibv_post_send completions from the original mixed completion queue.
> It lets us poll for the specific CQE we are really interested in.

Hi Li,
OK, it's been a while since I've thought this much about completions, but I think that's OK; however, what stops the other messages, RDMA_WRITE and SEND_CONTROL, being out of order?

Could this be fixed another way; make block_for_wrid record a flag for WRIDs it's received, and then check (and clear) that flag right at the start?
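To make the flag idea concrete, here is a minimal sketch as a toy model, not a patch against migration/rdma.c. Everything in it is a hypothetical stand-in: the `wrid_kind` enum, the `wrid_tracker` struct, and the array that plays the role of the completions a real CQ poll would return. The real block_for_wrid would block on the completion channel where this model simply returns false.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical wrid kinds, loosely following the names in the trace output. */
enum wrid_kind {
    WRID_RDMA_WRITE,
    WRID_SEND_CONTROL,
    WRID_RECV_CONTROL,
    WRID_KIND_MAX
};

/* One flag per kind: set when a completion arrived while we wanted another. */
struct wrid_tracker {
    bool seen[WRID_KIND_MAX];
};

/*
 * Toy block_for_wrid: `events` stands in for completions in arrival order,
 * and *pos is the read cursor into it.
 * Returns true once `wanted` has completed, false if the stream runs dry
 * (the point where real code would block).
 */
bool block_for_wrid(struct wrid_tracker *t, enum wrid_kind wanted,
                    const enum wrid_kind *events, size_t n, size_t *pos)
{
    /* Check (and clear) the flag right at the start. */
    if (t->seen[wanted]) {
        t->seen[wanted] = false;
        return true;
    }

    while (*pos < n) {
        enum wrid_kind got = events[(*pos)++];

        if (got == wanted) {
            return true;
        }
        /* Remember the early completion instead of dropping it. */
        t->seen[got] = true;
    }
    return false;
}
```

With the arrival order RECV then SEND, waiting for SEND first consumes both events and records RECV, so a later wait for RECV returns immediately. One limitation of this sketch is that a plain flag cannot count two early completions of the same kind; a per-kind counter would be needed if that can happen.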

Dave

> Signed-off-by: Li Zhijian
> ---
> V2 Introduce send completion queue
> ---
>  migration/rdma.c | 94
>  1 file changed, 79 insertions(+), 15 deletions(-)
>
> diff --git a/migration/rdma.c b/migration/rdma.c
> index d90b29a4b51..16fe0688858 100644
> --- a/migration/rdma.c
> +++ b/migration/rdma.c
> @@ -359,8 +359,10 @@ typedef struct RDMAContext {
>      struct rdma_event_channel *channel;
>      struct ibv_qp *qp; /* queue pair */
>      struct ibv_comp_channel *comp_channel; /* completion channel */
> +    struct ibv_comp_channel *send_comp_channel; /* send completion channel */
>      struct ibv_pd *pd; /* protection domain */
>      struct ibv_cq *cq; /* completion queue */
> +    struct ibv_cq *send_cq;
[PATCH v2 1/2] migration/rdma: Fix out of order wrid
destination:
../qemu/build/qemu-system-x86_64 -enable-kvm -netdev tap,id=hn0,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown -device e1000,netdev=hn0,mac=50:52:54:00:11:22 -boot c -drive if=none,file=./Fedora-rdma-server-migration.qcow2,id=drive-virtio-disk0 -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -m 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet -monitor stdio -vga qxl -spice streaming-video=filter,port=5902,disable-ticketing -incoming rdma:192.168.22.23:
qemu-system-x86_64: -spice streaming-video=filter,port=5902,disable-ticketing: warning: short-form boolean option 'disable-ticketing' deprecated
Please use disable-ticketing=on instead
QEMU 6.0.50 monitor - type 'help' for more information
(qemu) trace-event qemu_rdma_block_for_wrid_miss on
(qemu) dest_init RDMA Device opened: kernel name rxe_eth0 uverbs device name uverbs2, infiniband_verbs class device path /sys/class/infiniband_verbs/uverbs2, infiniband class device path /sys/class/infiniband/rxe_eth0, transport: (2) Ethernet
qemu_rdma_block_for_wrid_miss A Wanted wrid CONTROL SEND (2000) but got CONTROL RECV (4000)

source:
../qemu/build/qemu-system-x86_64 -enable-kvm -netdev tap,id=hn0,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown -device e1000,netdev=hn0,mac=50:52:54:00:11:22 -boot c -drive if=none,file=./Fedora-rdma-server.qcow2,id=drive-virtio-disk0 -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -m 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet -monitor stdio -vga qxl -spice streaming-video=filter,port=5901,disable-ticketing -S
qemu-system-x86_64: -spice streaming-video=filter,port=5901,disable-ticketing: warning: short-form boolean option 'disable-ticketing' deprecated
Please use disable-ticketing=on instead
QEMU 6.0.50 monitor - type 'help' for more information
(qemu)
(qemu) trace-event qemu_rdma_block_for_wrid_miss on
(qemu) migrate -d rdma:192.168.22.23:
source_resolve_host RDMA Device opened: kernel name rxe_eth0 uverbs device name uverbs2, infiniband_verbs class device path /sys/class/infiniband_verbs/uverbs2, infiniband class device path /sys/class/infiniband/rxe_eth0, transport: (2) Ethernet
(qemu) qemu_rdma_block_for_wrid_miss A Wanted wrid WRITE RDMA (1) but got CONTROL RECV (4000)

NOTE: soft RoCE is used as the RDMA device.
[root@iaas-rpma images]# rdma link show rxe_eth0/1
link rxe_eth0/1 state ACTIVE physical_state LINK_UP netdev eth0

This migration cannot complete because an out-of-order (OOO) CQ event occurs.
OOO cases occur on both the source side and the destination side, and only
when SEND and RECV are out of order. OOO between 'WRITE RDMA' and 'RECV'
doesn't matter.

The OOO sequence is shown below:
source destination
qemu_rdma_write_one() qemu_rdma_registration_handle()
1. post_recv X post_recv Y
2. post_send X
3. wait X CQ event
4. X CQ event
5. post_send Y
6. wait Y CQ event
7. Y CQ event (dropped)
8. Y CQ event(send Y done)
9. X CQ event(send X done)
10. wait Y CQ event(dropped at (7), blocks forever)

In about a hundred runs, it only happens on the soft RoCE RDMA device;
a hardware IB device works fine.

Here we introduce an independent send completion queue to separate
ibv_post_send completions from the original mixed completion queue.
It lets us poll for the specific CQE we are really interested in.

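The effect of splitting the queues can be illustrated with a toy model. This is only a sketch under stated assumptions, not the patch itself: `toy_cq`, `cq_push`, and `cq_wait` are hypothetical helpers that mimic a completion queue with a plain array, and the wrid values 2000 and 4000 are the ones printed in the trace output above. The waiter discards any completion it was not looking for, which is the behaviour that loses the event in the OOO sequence.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { CQ_CAP = 8 };

/* Values as printed by the qemu_rdma_block_for_wrid_miss trace. */
enum wrid {
    WRID_SEND_CONTROL = 2000,
    WRID_RECV_CONTROL = 4000
};

/* Hypothetical array-backed stand-in for a completion queue. */
struct toy_cq {
    enum wrid ev[CQ_CAP];
    size_t head;    /* next completion to pop */
    size_t tail;    /* next free slot */
};

void cq_push(struct toy_cq *cq, enum wrid id)
{
    assert(cq->tail < CQ_CAP);
    cq->ev[cq->tail++] = id;
}

/*
 * Pop completions until `wanted` shows up; anything else is discarded.
 * Returns false if the queue drains first (real code would block here).
 */
bool cq_wait(struct toy_cq *cq, enum wrid wanted)
{
    while (cq->head < cq->tail) {
        if (cq->ev[cq->head++] == wanted) {
            return true;
        }
    }
    return false;
}

/* One mixed CQ: RECV arrives first and is dropped while waiting for SEND. */
bool mixed_cq_completes(void)
{
    struct toy_cq cq = {0};

    cq_push(&cq, WRID_RECV_CONTROL);
    cq_push(&cq, WRID_SEND_CONTROL);
    return cq_wait(&cq, WRID_SEND_CONTROL) &&
           cq_wait(&cq, WRID_RECV_CONTROL);
}

/* Separate send CQ: the same arrival order no longer loses the RECV. */
bool split_cq_completes(void)
{
    struct toy_cq recv_cq = {0};
    struct toy_cq send_cq = {0};

    cq_push(&recv_cq, WRID_RECV_CONTROL);
    cq_push(&send_cq, WRID_SEND_CONTROL);
    return cq_wait(&send_cq, WRID_SEND_CONTROL) &&
           cq_wait(&recv_cq, WRID_RECV_CONTROL);
}
```

In this model `mixed_cq_completes()` fails because the second wait finds an empty queue, while `split_cq_completes()` succeeds because each waiter only ever sees completions of the kind it polls for.
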
Signed-off-by: Li Zhijian
---
V2 Introduce send completion queue
---
 migration/rdma.c | 94
 1 file changed, 79 insertions(+), 15 deletions(-)

diff --git a/migration/rdma.c b/migration/rdma.c
index d90b29a4b51..16fe0688858 100644
--- a/migration/rdma.c
+++ b/migration/rdma.c
@@ -359,8 +359,10 @@ typedef struct RDMAContext {
     struct rdma_event_channel *channel;
     struct ibv_qp *qp; /* queue pair */
     struct ibv_comp_channel *comp_channel; /* completion channel */
+    struct ibv_comp_channel *send_comp_channel; /* send completion channel */
     struct ibv_pd *pd; /* protection domain */
     struct ibv_cq *cq; /* completion queue */
+    struct ibv_cq *send_cq; /* send completion queue */
 
     /*
      * If a previous write failed (perhaps because of a failed
@@ -1067,8 +1069,7 @@ static int qemu_rdma_alloc_pd_cq(RDMAContext *rdma)
     }
 
     /*
-     * Completion queue can be filled by both read and write work requests,
-     * so must reflect the sum of both possible queue sizes.
+     * Completion queue can be filled by read work requests.
      */
     rdma->cq = ibv_create_cq(rdma->verbs, (RDMA_SIGNALED_SEND_MAX * 3),
             NULL, rdma->comp_channel, 0);
@@ -1077,6 +1078,20 @@ static int qemu_rdma_al