Re: better understanding rdma-cm UNREACHABLE/ETIMEDOUT scheme

2013-05-26 Thread Alex Rosenbaum
On 5/23/2013 1:31 PM, Alex Rosenbaum wrote: On 5/21/2013 6:24 PM, Hefty, Sean wrote: My first guess is that the server isn't responding to new requests. - Sean This is where we're looking now. Now testing on 17 servers with 8 clients per server. When disabling all RDMA traffic in the test we

Re: better understanding rdma-cm UNREACHABLE/ETIMEDOUT scheme

2013-05-23 Thread Alex Rosenbaum
On 5/21/2013 6:24 PM, Hefty, Sean wrote: My first guess is that the server isn't responding to new requests. - Sean This is where we're looking now. Now testing on 17 servers with 8 clients per server. When disabling all RDMA traffic in the test we get 100% RDMA connection established. So at

better understanding rdma-cm UNREACHABLE/ETIMEDOUT scheme

2013-05-21 Thread Or Gerlitz
Hi Sean, We have a user space application which is built on an M (clients) x N (servers) RC connectivity pattern using librdmacm. Basically, there are N nodes, each running M client processes, and each client connects to all N servers. So under some unknown conditions, many of the clients
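To give a sense of the scale of this pattern, here is a minimal sketch of the connection count. It assumes every client process opens exactly one RC connection to each of the N servers, as described above. The function name `connection_counts` is hypothetical, and the example values (17 nodes, 8 clients per node) are the test sizes mentioned elsewhere in this thread, assumed here to follow the same pattern.

```python
def connection_counts(n_servers: int, clients_per_node: int) -> dict:
    """Count RC connections for the M x N pattern described above.

    Assumption: each of the N nodes runs M client processes, and every
    client opens exactly one connection to each of the N servers.
    """
    total_clients = n_servers * clients_per_node
    return {
        "total_clients": total_clients,      # N * M client processes
        "per_client": n_servers,             # each client dials N servers
        "per_server": total_clients,         # each server accepts N * M requests
        "total": total_clients * n_servers,  # N * M * N connections overall
    }


counts = connection_counts(n_servers=17, clients_per_node=8)
print(counts)
```

Under these assumptions each server has to accept 136 incoming connection requests, 2312 across the whole cluster, so a listener that is slow to answer new requests would affect many clients at once.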

RE: better understanding rdma-cm UNREACHABLE/ETIMEDOUT scheme

2013-05-21 Thread Hefty, Sean
So under some unknown conditions, many of the clients' connection attempts fail with an RDMA_CM_EVENT_UNREACHABLE event whose status is -ETIMEDOUT. Looking at the rdma-cm kernel code, I see that the only location which generates this event is in cma_ib_handler when getting IB_CM_REQ_ERROR (or

Re: better understanding rdma-cm UNREACHABLE/ETIMEDOUT scheme

2013-05-21 Thread Or Gerlitz
On 21/05/2013 18:24, Hefty, Sean wrote: I don't remember this patch at all. Alex, can you please send Sean this patch

Re: better understanding rdma-cm UNREACHABLE/ETIMEDOUT scheme

2013-05-21 Thread Or Gerlitz
On Tue, May 21, 2013 at 6:24 PM, Hefty, Sean sean.he...@intel.com wrote: One thing seen in the nodes' dmesg is a message from an old patch of yours which exists in ofed1.5.3 but didn't hit (or wasn't accepted?) upstream saying ib_cm: calculated mra timeout 67584 8192, decreasing used

RE: better understanding rdma-cm UNREACHABLE/ETIMEDOUT scheme

2013-05-21 Thread Hefty, Sean
One thing seen in the nodes' dmesg is a message from an old patch of yours which exists in ofed1.5.3 but didn't hit (or wasn't accepted?) upstream saying "ib_cm: calculated mra timeout 67584 8192, decreasing used timeout_ms". Does this provide any insight into the problem? I don't