better understanding rdma-cm UNREACHABLE/ETIMEDOUT scheme

Or Gerlitz Tue, 21 May 2013 08:07:40 -0700

Hi Sean,

We have a user-space application that uses librdmacm to build an M (clients) x N (servers) RC connectivity pattern. Basically, there are N nodes, each running M client processes, and each client connects to all N servers.

Under some unknown conditions, many of the clients' connection attempts fail with an RDMA_CM_EVENT_UNREACHABLE event whose status is -ETIMEDOUT. Looking at the rdma-cm kernel code, I see that the only location which generates this event is cma_ib_handler, when it gets IB_CM_REQ_ERROR (or IB_CM_REP_ERROR).

Digging down into the CM, I see that the only place where IB_CM_REQ_ERROR is delivered is cm_process_send_error, which is called when the status of the MAD send completion is neither success nor flush.

Digging down into the MAD code and the CM's usage of it, I see that the MAD code invokes the MAD send completion handler with the IB_WC_RESP_TIMEOUT_ERR status, and that the CM code programs the number of retries set by its consumer (rdma-cm in this case) into the MAD send buffer.

Running this over an M=8, N=4 setup, i.e. four nodes, each running one server process and eight client processes, sampling the IB CM counters before and after the job, and adding up the numbers from the four nodes, we see the following:


cm_tx_msgs.req    = 395
cm_tx_retries.req = 270
cm_rx_msgs.req    = 390

cm_tx_msgs.rep    = 375
cm_tx_retries.rep = 255
cm_rx_msgs.rep    = 380

cm_tx_msgs.rtu    = 108
cm_rx_msgs.rtu    = 103

cm_tx_msgs.mra    = 540
cm_rx_msgs.mra    = 270
cm_tx_retries.mra = 270
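
The before/after sampling and per-node summation described above could be scripted roughly as follows. This is only a sketch: it assumes the CM counters are exposed as one integer per file under a per-port directory such as /sys/class/infiniband_cm/<device>/<port>/<group>/<name> (that path is an assumption, not something stated in this mail), and the function names are made up for illustration.

```python
from pathlib import Path


def read_counters(port_dir):
    """Read every integer counter file found at <port_dir>/<group>/<name>.

    Returns a dict keyed as "group.name", e.g. "cm_tx_msgs.req".
    The directory layout is an assumption about how the counters are exposed.
    """
    counters = {}
    for group in sorted(Path(port_dir).iterdir()):
        if not group.is_dir():
            continue
        for entry in sorted(group.iterdir()):
            if not entry.is_file():
                continue
            try:
                counters[f"{group.name}.{entry.name}"] = int(entry.read_text().strip())
            except (OSError, ValueError):
                continue  # skip unreadable or non-numeric files
    return counters


def delta(before, after):
    """Per-counter difference between two snapshots taken on one node."""
    return {key: after[key] - before.get(key, 0) for key in after}


def sum_nodes(per_node_deltas):
    """Add up the per-node deltas, as done for the four nodes in this report."""
    total = {}
    for node_delta in per_node_deltas:
        for key, value in node_delta.items():
            total[key] = total.get(key, 0) + value
    return total


# Typical use on each node (the path is hypothetical):
#   before = read_counters("/sys/class/infiniband_cm/<device>/<port>")
#   ... run the job ...
#   after = read_counters("/sys/class/infiniband_cm/<device>/<port>")
#   node_delta = delta(before, after)
# and then sum_nodes([...]) over the deltas collected from all nodes.
```

Taking deltas per node before summing keeps counts from earlier jobs out of the totals.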

In cm_send_handler we see that the CM TX retry counter is incremented by the number of retries reported by the MAD layer. I also see that the RDMA-CM programs the CM to do 15 retries, and the CM in turn programs this into the MAD send buffers.
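
To get a feel for how long a connection attempt can sit in the MAD layer before the timeout error is reported, here is a rough sketch. It assumes one initial send plus the programmed number of retries, each waiting the same per-attempt timeout; the per-attempt timeout values below are hypothetical, since the mail does not state the one actually in use, and any timeout extension caused by a received MRA is ignored.

```python
def worst_case_wait_ms(timeout_ms, retries):
    """Upper bound on the time before a send is reported as timed out.

    Assumes one initial transmission plus `retries` retransmissions,
    each waiting `timeout_ms` for a response (a simplifying assumption).
    """
    return (retries + 1) * timeout_ms


RETRIES = 15  # from the text: rdma-cm programs the CM to do 15 retries

# Hypothetical per-attempt timeouts, for illustration only.
for timeout_ms in (1000, 4096, 8192):
    total = worst_case_wait_ms(timeout_ms, RETRIES)
    print(f"timeout_ms={timeout_ms}: up to {total} ms before the error")
```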

From the RTU counters it's clear that at most ~100 connections got established out of 128.
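
The arithmetic behind the 128 figure, and the shortfall suggested by the RTU counters, can be spelled out as below. The sketch treats one RTU as one established connection, which is an assumption made for the estimate; the function names are illustrative only.

```python
def expected_connections(nodes, clients_per_node, servers):
    """Every client process connects to every server process."""
    return nodes * clients_per_node * servers


def requests_per_server(nodes, clients_per_node):
    """Each server is contacted once by every client in the job."""
    return nodes * clients_per_node


# Setup from this report: four nodes, eight clients and one server per node.
total = expected_connections(nodes=4, clients_per_node=8, servers=4)
per_server = requests_per_server(nodes=4, clients_per_node=8)

# RTU counters summed over the four nodes, as listed above.
rtu_tx, rtu_rx = 108, 103

print("expected connections:", total)
print("connection requests per server:", per_server)
print("missing by RTUs sent:", total - rtu_tx)
print("missing by RTUs received:", total - rtu_rx)
```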

One thing seen in the nodes' dmesg is a message from an old patch of yours which exists in OFED 1.5.3 but didn't hit (or wasn't accepted?) upstream, saying "ib_cm: calculated mra timeout 67584 > 8192, decreasing used timeout_ms". Does this provide any insight into the problem?
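
As a side note on where a value like 67584 could come from: IB timeouts are encoded as 4.096 us x 2^n, and the sketch below assumes the CM approximates that in milliseconds as 1 << max(n - 8, 0). That approximation is my assumption about the ib_cm helper, and the patch that prints the message is not shown here, so the decomposition found below is only an observation, not a statement of what that patch computes.

```python
def cm_convert_to_ms(iba_time):
    """Approximate 4.096 us * 2**iba_time in milliseconds (assumed CM rounding)."""
    return 1 << max(iba_time - 8, 0)


def exact_ms(iba_time):
    """Exact value of 4.096 us * 2**iba_time, in milliseconds."""
    return 4.096e-3 * 2 ** iba_time


def decompose(total_ms, max_encoded=31):
    """Find encoded pairs (a, b), a >= b, whose approximate ms values sum to total_ms."""
    return [
        (a, b)
        for a in range(8, max_encoded + 1)
        for b in range(8, a + 1)
        if cm_convert_to_ms(a) + cm_convert_to_ms(b) == total_ms
    ]


for a, b in decompose(67584):
    print(f"67584 ms = {cm_convert_to_ms(a)} ms (encoded {a}) "
          f"+ {cm_convert_to_ms(b)} ms (encoded {b})")
    print(f"exact values: {exact_ms(a):.1f} ms and {exact_ms(b):.1f} ms")
```

Under that assumption the logged value splits into two encoded timeouts, which would be consistent with a service timeout and a path-related timeout being added together.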

One more piece of info is that this app doesn't call rdma_disconnect at all. When the processes are done, or if something goes wrong (e.g. that unreachable event), they simply call rdma_destroy_id, which, when I look at the rdma-cm/cm code, ends up in a CM function which sends a DREQ (if the ID is in the established state) and puts the ID in the timewait state.

So it seems we're not losing MADs. Also, on the stack they use (that 1.5.3) the ucma backlog size is 128, but each server process gets only 32 requests (8x4), so we don't think ucma is dropping REQs for lack of backlog budget.


Or.






