Hi Sean,

We have a user-space application built around an M (clients) x N (servers) RC connectivity pattern using librdmacm. Basically, there are N nodes, each running M client processes, and each client connects to all N servers.

Under some unknown conditions, many of the clients' connection attempts fail with an RDMA_CM_EVENT_UNREACHABLE event whose status is -ETIMEDOUT. Looking at the rdma-cm kernel code, I see that the only location which generates this event is in cma_ib_handler, when getting IB_CM_REQ_ERROR (or IB_CM_REP_ERROR).

Digging down into the CM, I see that the only place where IB_CM_REQ_ERROR is delivered is in cm_process_send_error, which is called when the status of the MAD send completion is neither success nor flush.

Digging down into the MAD code and the CM's usage of it, I see that the MAD code will call the send completion handler with the IB_WC_RESP_TIMEOUT_ERR status, and that the CM code programs the number of retries set by its consumer (rdma-cm in this case) into the MAD send buffer.

Running this over an M=8, N=4 setup, i.e. four nodes, each running one server process and eight client processes, sampling the IB CM counters before and after the job, and adding the numbers from the four nodes, we see the following:

cm_tx_msgs.req    = 395
cm_tx_retries.req = 270
cm_rx_msgs.req    = 390

cm_tx_msgs.rep    = 375
cm_tx_retries.rep = 255
cm_rx_msgs.rep    = 380

cm_tx_msgs.rtu    = 108
cm_rx_msgs.rtu    = 103

cm_tx_msgs.mra    = 540
cm_rx_msgs.mra    = 270
cm_tx_retries.mra = 270

In cm_send_handler we see that the CM TX retry counter is incremented by the number of retries reported by the MAD layer. I also see that the RDMA-CM programs the CM to do 15 retries, and the CM further programs this into the MAD send buffers.
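
To make the retry accounting concrete, here is a small Python sketch over the summed REQ/REP counters quoted above. It rests on an assumption I have not verified on this stack: that cm_tx_msgs already includes retransmissions, so first transmissions are cm_tx_msgs minus cm_tx_retries. The helper names are made up for illustration.

```python
# Summed IB CM counter deltas from the four nodes (values quoted above).
counters = {
    "req": {"tx": 395, "tx_retries": 270, "rx": 390},
    "rep": {"tx": 375, "tx_retries": 255, "rx": 380},
}

MAX_CM_RETRIES = 15  # retry count rdma-cm programs into the CM


def first_transmissions(c):
    # Assumption: cm_tx_msgs counts every transmission, retries included.
    return c["tx"] - c["tx_retries"]


def min_sends_with_retries(c, max_retries=MAX_CM_RETRIES):
    # Lower bound on how many sends were retried at least once:
    # a single send can account for at most max_retries retries.
    return -(-c["tx_retries"] // max_retries)  # ceiling division


for name, c in counters.items():
    print(name, first_transmissions(c), min_sends_with_retries(c))
```

Under that assumption this gives 125 first-time REQs and 120 first-time REPs, with at least 18 REQs and 17 REPs retried. Those lower bounds are reached only if each retried send used all 15 retries.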

From the RTU counters it's clear that at most ~100 connections got established out of 128.
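
The 128 figure and the bound on established connections follow from the topology. A quick sketch, with variable names of my own choosing and assuming each established connection sends exactly one RTU:

```python
M_CLIENTS_PER_NODE = 8
N_NODES = 4  # one server process per node

# Every client connects to every server.
total_clients = M_CLIENTS_PER_NODE * N_NODES
expected_connections = total_clients * N_NODES

rtu_tx, rtu_rx = 108, 103  # summed cm_tx_msgs.rtu / cm_rx_msgs.rtu

# Assumption: one RTU per established connection, so the larger RTU
# count is an upper bound on the number of established connections.
max_established = max(rtu_tx, rtu_rx)
failed_at_least = expected_connections - max_established

print(expected_connections, max_established, failed_at_least)
```

So at least 20 of the 128 expected connections never reached the RTU stage.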

One thing seen in the nodes' dmesg is a message from an old patch of yours, which exists in ofed1.5.3 but didn't hit (or wasn't accepted?) upstream, saying "ib_cm: calculated mra timeout 67584 > 8192, decreasing used timeout_ms". Does this provide any insight into the problem?
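
One possible reading of the 67584 figure, offered as a guess rather than something confirmed from the ofed1.5.3 source: IB timeouts are encoded as 4.096 us x 2^n, and as far as I recall the CM approximates that in milliseconds as 2^(n-8). If the logged value is the sum of two such terms, the sketch below searches for the exponent pairs that produce it. Both function names are hypothetical.

```python
def iba_timeout_to_ms(exponent):
    # Approximate 4.096 us * 2**exponent in milliseconds as 2**(exponent - 8),
    # clamping small exponents to 1 ms (assumed to mirror the CM's conversion).
    return 1 << max(exponent - 8, 0)


def split_into_two_timeouts(total_ms, max_exponent=31):
    # Find all exponent pairs (a >= b) whose approximate ms values
    # add up to total_ms.
    return [
        (a, b)
        for a in range(max_exponent + 1)
        for b in range(a + 1)
        if iba_timeout_to_ms(a) + iba_timeout_to_ms(b) == total_ms
    ]


print(split_into_two_timeouts(67584))
```

Under that assumption the only match is exponents 24 and 19, i.e. 65536 ms + 2048 ms. I have not checked which fields actually feed the calculation.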

One more piece of info: these apps don't call rdma_disconnect at all. When they are done, or if something goes wrong (e.g. that unreachable event), they simply issue rdma_destroy_id, which, when I look at the rdma-cm/cm code, gets to a CM function which sends a DREQ (if the ID is in the established state) and puts the ID in the timewait zone.

So it seems we're not losing MADs. Also, on the stack they use (that 1.5.3) the ucma backlog size is 128, but each server process gets only 32 requests (8x4), so we don't think ucma is dropping REQs for lack of backlog budget.

Or.






