> When I hit a RDMA error (which happens quite frequently now at rds-stress
> exit, thanks to the fixed mr pool flushing :) I often see the RDS
> shutdown_worker getting stuck (and rmmod hangs). It's waiting for
> allocated WRs to disappear. This usually works, as all WQ entries are
> flushed out. This doesn't happen when a RDMA transfer generates a remote
> access error, and that seems to be intended according to the spec.

I don't follow this. All work requests should generate a completion
eventually, unless you do something like destroy the work queue or
overrun a CQ. So what part of the spec are you talking about here?

> I tried destroying the QP first, then we know we can pick off
> any remaining WRs still allocated. That didn't work, as the card
> seems to generate interrupts even after the QP is gone. This results
> in lots of errors on the console complaining about "Completion to
> bogus CQ".

Destroying a QP should immediately stop work processing, so no
completions should be generated once the destroy QP operation returns.
I don't see how you get the bogus CQ message in this case -- it
certainly seems like a driver bug. Unless you mean you are destroying
the CQ with a QP still attached? But that shouldn't be possible because
the CQ's usecnt should be non-zero until all attached QPs are freed.

Not sure what could be going on but it sounds bad...

> I then tried to move the QP to error state instead - this didn't
> elicit a storm of kernel messages anymore, but still I seem to get
> incoming completions.

The cleanest way to destroy a QP is to move the QP to the error state,
wait until you have seen a completion for every posted work request
(the completions generated after the transition to the error state
should have a "flush" status), and then destroy the QP.

 - R.
_______________________________________________
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
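
For illustration, the teardown order described above (error state, drain
all completions, then destroy) can be sketched as a small self-contained
C model. This is only a sketch under stated assumptions: every sim_* name
is hypothetical and is not part of the verbs API. In a real kernel
consumer the three steps would roughly correspond to ib_modify_qp() with
IB_QPS_ERR, polling via ib_poll_cq() until every posted WR has completed
(flushed ones carry IB_WC_WR_FLUSH_ERR), and ib_destroy_qp().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for QP state and completion status. */
enum sim_qp_state { SIM_QPS_RTS, SIM_QPS_ERR, SIM_QPS_DESTROYED };
enum sim_wc_status { SIM_WC_SUCCESS, SIM_WC_WR_FLUSH_ERR };

struct sim_qp {
	enum sim_qp_state state;
	size_t outstanding;	/* WRs posted but not yet completed */
	size_t completed;	/* completions seen, any status */
	size_t flushed;		/* completions seen with flush status */
};

/* Post one work request; only accepted while the QP is operational. */
static bool sim_post_send(struct sim_qp *qp)
{
	if (qp->state != SIM_QPS_RTS)
		return false;
	qp->outstanding++;
	return true;
}

/* Step 1: move the QP to the error state. */
static void sim_modify_to_error(struct sim_qp *qp)
{
	assert(qp->state != SIM_QPS_DESTROYED);
	qp->state = SIM_QPS_ERR;
}

/*
 * Reap one completion. In this simplified model every WR still
 * outstanding after the error transition completes with flush status.
 * Returns false when there is nothing to reap.
 */
static bool sim_poll_cq(struct sim_qp *qp, enum sim_wc_status *status)
{
	if (qp->state == SIM_QPS_DESTROYED || qp->outstanding == 0)
		return false;
	qp->outstanding--;
	*status = (qp->state == SIM_QPS_ERR) ? SIM_WC_WR_FLUSH_ERR
					     : SIM_WC_SUCCESS;
	return true;
}

/* Error state, drain every posted WR, and only then destroy. */
static bool sim_teardown(struct sim_qp *qp)
{
	enum sim_wc_status status;

	sim_modify_to_error(qp);

	/* Step 2: count every completion, whatever its status. */
	while (qp->outstanding > 0) {
		if (!sim_poll_cq(qp, &status))
			return false;	/* real code would wait for a CQ event */
		qp->completed++;
		if (status == SIM_WC_WR_FLUSH_ERR)
			qp->flushed++;
	}

	/* Step 3: nothing is in flight, so destroying is safe. */
	qp->state = SIM_QPS_DESTROYED;
	return true;
}
```

The point of the model is the ordering: the consumer counts completions
against what it posted and does not free anything until that count
reaches zero, rather than destroying the QP first and picking off the
leftover WRs afterwards.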