Hi,

When we try to establish a connection to an NFS/RDMA server, we get the following messages with debug enabled:

[ 2937.577657] RPC: rpcrdma_conn_upcall: established: 192.168.1.13:20049 (ep 0xffff88012f980628 event 0x9)
[ 2937.597566] RPC:       rpcrdma_conn_upcall: connected
[ 2937.597569] RPC:  6385 __rpc_wake_up_task (now 4295627490)
[ 2937.597572] RPC:  6385 disabling timer
[ 2937.597576] RPC:  6385 removed from queue ffff88012f9802f0 "xprt_pending"
[ 2937.597580] RPC:       __rpc_wake_up_task done
[ 2937.597586] RPC:  6385 sync task resuming
[ 2937.597592] rpcrdma: connection to 192.168.1.13:20049 on mlx4_0, memreg 5 slots 32 ird 4
[ 2937.597597] RPC:  6385 marshaling NULL cred ffffffffa0437c60
[ 2937.597603] RPC: 6385 using AUTH_NULL cred ffffffffa0437c60 to wrap rpc data
[ 2937.597607] RPC:       rpcrdma_ep_connect: connected
[ 2937.597611] RPC:  6385 sleep_on(queue "xprt_pending" time 4295627490)
[ 2937.597615] RPC:       xprt_rdma_connect_worker: exit
[ 2937.597620] RPC:  6385 added to queue ffff88012f9802f0 "xprt_pending"
[ 2937.597625] RPC:  6385 setting alarm for 60000 ms
[ 2937.597631] RPC:  6385 sync task going to sleep
[ 2937.597812] RPC: rpcrdma_qp_async_error_upcall: QP error 3 on device mlx4_0 ep ffff88012f980628
[ 2937.597817] RPC:  6385 __rpc_wake_up_task (now 4295627490)
[ 2937.597818] RPC:  6385 disabling timer
[ 2937.597821] RPC:  6385 removed from queue ffff88012f9802f0 "xprt_pending"
[ 2937.597824] RPC:       __rpc_wake_up_task done
[ 2937.597830] RPC: rpcrdma_event_process: event rep ffff880139eb7000 status 5 opcode FFFFFFFF length 4294936578
[ 2937.597833] RPC: rpcrdma_event_process: recv WC status 5, connection lost
[ 2937.597841] RPC:  6385 sync task resuming
[ 2937.597844] RPC:  6385 sleep_on(queue "xprt_pending" time 4295627490)
[ 2937.597846] RPC:  6385 added to queue ffff88012f9802f0 "xprt_pending"
[ 2937.597848] RPC:  6385 setting alarm for 60000 ms
[ 2937.597850] RPC:  6385 sync task going to sleep
[ 2937.598207] RPC: rpcrdma_conn_upcall: disconnected: 192.168.1.13:20049 (ep 0xffff88012f980628 event 0xa)
[ 2937.598210] RPC:       rpcrdma_conn_upcall: disconnected
[ 2937.598213] rpcrdma: connection to 192.168.1.13:20049 closed (-103)
[ 2967.547845] RPC:       xprt_rdma_connect_worker: reconnect
[ 2967.558976] RPC:       rpcrdma_ep_disconnect: after wait, disconnected
[ 2967.561651] RPC: rpcrdma_conn_upcall: 4 responder resources (1 initiator)

This keeps looping until the mount is cancelled.
Looking at the code, rpcrdma_qp_async_error_upcall is called with event=3 (IB_EVENT_QP_ACCESS_ERR) on device mlx4_0. The event is initiated from mlx4_ib_qp_event, which is receiving MLX4_EVENT_TYPE_WQ_ACCESS_ERROR.
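
For reference, the numeric codes in the trace can be decoded with a small lookup like the sketch below. The tables are partial and list only the values that appear in this trace; the numbers follow the implicit 0-based ordering of enum ib_event_type, enum ib_wc_status and enum rdma_cm_event_type as I read them in the kernel headers, and the errno value is the Linux one. The helper name decode is made up for illustration.

```python
# Partial tables: only the entries needed to read this trace.
# The kernel enums are implicitly numbered from 0, so position matters.
IB_EVENT_TYPE = {
    0: "IB_EVENT_CQ_ERR",
    1: "IB_EVENT_QP_FATAL",
    2: "IB_EVENT_QP_REQ_ERR",
    3: "IB_EVENT_QP_ACCESS_ERR",
}

IB_WC_STATUS = {
    0: "IB_WC_SUCCESS",
    5: "IB_WC_WR_FLUSH_ERR",
}

RDMA_CM_EVENT = {
    0x9: "RDMA_CM_EVENT_ESTABLISHED",
    0xA: "RDMA_CM_EVENT_DISCONNECTED",
}

# Linux errno values (the trace prints them negated).
LINUX_ERRNO = {
    103: "ECONNABORTED",
}


def decode(table, code):
    """Return the symbolic name for code, or a placeholder if not listed."""
    return table.get(code, "unknown (%d)" % code)


if __name__ == "__main__":
    print("QP error 3     ->", decode(IB_EVENT_TYPE, 3))
    print("WC status 5    ->", decode(IB_WC_STATUS, 5))
    print("CM event 0x9   ->", decode(RDMA_CM_EVENT, 0x9))
    print("CM event 0xa   ->", decode(RDMA_CM_EVENT, 0xA))
    print("closed (-103)  ->", decode(LINUX_ERRNO, 103))
```

Read this way, the receive completion with status 5 looks like a flush of the posted receive after the QP went into error, so it is probably a consequence of the access error rather than a separate failure. As far as I understand, the opcode and length fields of a failed completion are not meaningful, which would explain the FFFFFFFF and 4294936578 values.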

What could cause the mlx4 driver to be unable to access the WQ, or to raise such an interrupt? I checked the QP setup in mlx4_ib_create_qp, and it returns success.

This is SLES11SP1 - kernel 2.6.32.59-0.3

--
Goldwyn
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
