+Gal
Gal, please comment with our findings.
Thanks!
On 5/31/2018 10:36 AM, 858585 jemmy wrote:
On Thu, May 31, 2018 at 1:33 AM, Dr. David Alan Gilbert
<dgilb...@redhat.com> wrote:
* Lidong Chen (jemmy858...@gmail.com) wrote:
If the peer qemu is crashed, the qemu_rdma_wait_comp_channel function
maybe loop forever. so we should also poll the cm event fd, and when
receive any cm event, we consider some error happened.
Signed-off-by: Lidong Chen <lidongc...@tencent.com>
I don't understand enough about the way the infiniband fd's work to
fully review this; so I'd appreciate if some one who does could
comment/add their review.
Hi Avaid:
we need your help. I also not find any document about the cq
channel event fd and
cm channel event f.
Should we set the events to G_IO_IN | G_IO_HUP | G_IO_ERR? or
G_IO_IN is enough?
pfds[0].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
pfds[1].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
Thanks.
---
migration/rdma.c | 35 ++++++++++++++++++++++++-----------
1 file changed, 24 insertions(+), 11 deletions(-)
diff --git a/migration/rdma.c b/migration/rdma.c
index 1b9e261..d611a06 100644
--- a/migration/rdma.c
+++ b/migration/rdma.c
@@ -1489,6 +1489,9 @@ static uint64_t qemu_rdma_poll(RDMAContext *rdma,
uint64_t *wr_id_out,
*/
static int qemu_rdma_wait_comp_channel(RDMAContext *rdma)
{
+ struct rdma_cm_event *cm_event;
+ int ret = -1;
+
/*
* Coroutine doesn't start until migration_fd_process_incoming()
* so don't yield unless we know we're running inside of a coroutine.
@@ -1504,25 +1507,35 @@ static int qemu_rdma_wait_comp_channel(RDMAContext
*rdma)
* But we need to be able to handle 'cancel' or an error
* without hanging forever.
*/
- while (!rdma->error_state && !rdma->received_error) {
- GPollFD pfds[1];
+ while (!rdma->error_state && !rdma->received_error) {
+ GPollFD pfds[2];
pfds[0].fd = rdma->comp_channel->fd;
pfds[0].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
+ pfds[0].revents = 0;
+
+ pfds[1].fd = rdma->channel->fd;
+ pfds[1].events = G_IO_IN | G_IO_HUP | G_IO_ERR;
+ pfds[1].revents = 0;
+
/* 0.1s timeout, should be fine for a 'cancel' */
- switch (qemu_poll_ns(pfds, 1, 100 * 1000 * 1000)) {
- case 1: /* fd active */
- return 0;
+ qemu_poll_ns(pfds, 2, 100 * 1000 * 1000);
Shouldn't we still check the return value of this; if it's negative
something has gone wrong.
I will fix this.
Thanks.
Dave
- case 0: /* Timeout, go around again */
- break;
+ if (pfds[1].revents) {
+ ret = rdma_get_cm_event(rdma->channel, &cm_event);
+ if (!ret) {
+ rdma_ack_cm_event(cm_event);
+ }
+ error_report("receive cm event while wait comp channel,"
+ "cm event is %d", cm_event->event);
- default: /* Error of some type -
- * I don't trust errno from qemu_poll_ns
- */
- error_report("%s: poll failed", __func__);
+ /* consider any rdma communication event as an error */
return -EPIPE;
}
+ if (pfds[0].revents) {
+ return 0;
+ }
+
if (migrate_get_current()->state == MIGRATION_STATUS_CANCELLING) {
/* Bail out and let the cancellation happen */
return -EPIPE;
--
1.8.3.1
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK