Re: [ewg] nfsrdma fails to write big file,
Mahesh Siddheshwar wrote:

Hi Tom, Vu,

Tom Tucker wrote:
Roland Dreier wrote:

+	/*
+	 * Add room for frmr register and invalidate WRs.
+	 * Requests sometimes have two chunks, and each chunk
+	 * requires a different frmr. The safest bound is
+	 * max_send_wr * 6; however, since we get send completions
+	 * and poll fast enough, it is pretty safe to use
+	 * max_send_wr * 4.
+	 */
+	ep->rep_attr.cap.max_send_wr *= 4;

Seems like a bad design if there is a possibility of work queue overflow; if you're counting on events occurring in a particular order, or on completions being handled fast enough, then your design is going to fail in some high-load situations, which I don't think you want.

Vu, would you please try the following:
- Set the multiplier to 5

While trying to test this between a Linux client and a Solaris server, I made the following changes in /usr/src/ofa_kernel-1.5.1/net/sunrpc/xprtrdma/verbs.c:

diff verbs.c.org verbs.c
653c653
<	ep->rep_attr.cap.max_send_wr *= 3;
---
>	ep->rep_attr.cap.max_send_wr *= 8;
685c685
<	ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/2 /* - 1*/;
---
>	ep->rep_cqinit = ep->rep_attr.cap.max

(I bumped the multiplier to 8) and did make install. After rebooting, I see the errors on NFS READs, as opposed to the WRITEs seen before, when I try to read a 10G file from the server.

The client is running RHEL 5.3 (2.6.18-128.el5PAE) with OFED-1.5.1-20100223-0740 bits. The client has a Sun IB HCA: SUN0070130001, MT25418, firmware 2.7.0, hw_rev = a0. The server is running Solaris based on snv_128.

rpcdebug output from the client:

==
RPC:    85 call_bind (status 0)
RPC:    85 call_connect xprt ec78d800 is connected
RPC:    85 call_transmit (status 0)
RPC:    85 xprt_prepare_transmit
RPC:    85 xprt_cwnd_limited cong = 0 cwnd = 8192
RPC:    85 rpc_xdr_encode (status 0)
RPC:    85 marshaling UNIX cred eddb4dc0
RPC:    85 using AUTH_UNIX cred eddb4dc0 to wrap rpc data
RPC:    85 xprt_transmit(164)
RPC:       rpcrdma_inline_pullup: pad 0 destp 0xf1dd1410 len 164 hdrlen 164
RPC:       rpcrdma_register_frmr_external: Using frmr ec7da920 to map 4 segments
RPC:       rpcrdma_create_chunks: write chunk elem 16...@0x38536d000:0xa601 (more)
RPC:       rpcrdma_register_frmr_external: Using frmr ec7da960 to map 1 segments
RPC:       rpcrdma_create_chunks: write chunk elem 1...@0x31dd153c:0xaa01 (last)
RPC:       rpcrdma_marshal_req: write chunk: hdrlen 68 rpclen 164 padlen 0 headerp 0xf1dd124c base 0xf1dd136c lkey 0x500
RPC:    85 xmit complete
RPC:    85 sleep_on(queue xprt_pending time 4683109)
RPC:    85 added to queue ec78d994 xprt_pending
RPC:    85 setting alarm for 6 ms
RPC:       wake_up_next(ec78d944 xprt_resend)
RPC:       wake_up_next(ec78d8f4 xprt_sending)
RPC:       rpcrdma_qp_async_error_upcall: QP error 3 on device mlx4_0 ep ec78db40
RPC:    85 __rpc_wake_up_task (now 4683110)
RPC:    85 disabling timer
RPC:    85 removed from queue ec78d994 xprt_pending
RPC:       __rpc_wake_up_task done
RPC:    85 __rpc_execute flags=0x1
RPC:    85 call_status (status -107)
RPC:    85 call_bind (status 0)
RPC:    85 call_connect xprt ec78d800 is not connected
RPC:    85 xprt_connect xprt ec78d800 is not connected
RPC:    85 sleep_on(queue xprt_pending time 4683110)
RPC:    85 added to queue ec78d994 xprt_pending
RPC:    85 setting alarm for 6 ms
RPC:       rpcrdma_event_process: event rep ec116800 status 5 opcode 80 length 2493606
RPC:       rpcrdma_event_process: recv WC status 5, connection lost
RPC:       rpcrdma_conn_upcall: disconnected: ec78dbccI4:20049 (ep 0xec78db40 event 0xa)
RPC:       rpcrdma_conn_upcall: disconnected
rpcrdma: connection to ec78dbccI4:20049 closed (-103)
RPC:       xprt_rdma_connect_worker: reconnect
==

On the server I see:

Mar  3 17:45:16 elena-ar hermon: [ID 271130 kern.notice] NOTICE: hermon0: Device Error: CQE remote access error
Mar  3 17:45:16 elena-ar nfssrv: [ID 819430 kern.notice] NOTICE: NFS: bad sendreply
Mar  3 17:45:21 elena-ar hermon: [ID 271130 kern.notice] NOTICE: hermon0: Device Error: CQE remote access error
Mar  3 17:45:21 elena-ar nfssrv: [ID 819430 kern.notice] NOTICE: NFS: bad sendreply

The remote access error is actually seen on the RDMA_WRITE. Doing some more debugging on the server with DTrace, I see that the destination address and length match the write chunk element in the Linux debug output above:

0   9385  rib_write:entry   daddr 38536d000, len 4000, hdl a601
0   9358  rib_init_sendwait:return   ff44a715d308
1   9296  rib_svc_scq_handler:return   1f7
1   9356  rib_sendwait:return   14
1   9386  rib_write:return   14
          ^^^ that is RDMA_FAILED
1  63295  xdrrdma_send_read_data:return   0
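For reference, a minimal sketch of the send-queue arithmetic being argued over (illustrative only, not the xprtrdma code; the credit count is an assumed example value): with FRMR, one RPC can consume one RDMA SEND plus a FAST_REG and a LOCAL_INV work request per chunk, and a request may carry two chunks, which is where the suggested multiplier of 5 comes from (6 leaves extra slack) versus the stock multiplier of 3.

/*
 * Illustrative only -- not the xprtrdma code.  With FRMR memory
 * registration a single RPC can consume:
 *     1 x RDMA SEND            (the request itself)
 *   + 1 x FAST_REG  per chunk  (register the FRMR)
 *   + 1 x LOCAL_INV per chunk  (invalidate it afterwards)
 * and "Requests sometimes have two chunks", hence 5 WRs per request.
 */
#include <stdio.h>

#define MAX_CHUNKS_PER_REQ 2    /* "Requests sometimes have two chunks" */
#define WRS_PER_CHUNK      2    /* FAST_REG + LOCAL_INV */

int main(void)
{
	unsigned int credits     = 32;   /* assumed RPC/RDMA credit limit, for illustration */
	unsigned int wrs_per_req = 1 + MAX_CHUNKS_PER_REQ * WRS_PER_CHUNK;
	unsigned int max_send_wr = credits * wrs_per_req;

	printf("work requests per RPC : %u\n", wrs_per_req);
	printf("max_send_wr needed    : %u (stock code allocates credits * 3 = %u)\n",
	       max_send_wr, credits * 3);
	printf("send CQ signal point  : %u (cf. rep_cqinit = max_send_wr / 2)\n",
	       max_send_wr / 2);
	return 0;
}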
Re: [ewg] nfsrdma fails to write big file,
Vu Pham wrote:

Setup:
1. Linux nfsrdma client/server with OFED-1.5.1-20100217-0600, ConnectX2 QDR HCAs fw 2.7.8-6, RHEL 5.2.
2. Solaris nfsrdma server snv_130, ConnectX QDR HCA.

Running vdbench on a 10g file, or *dd if=/dev/zero of=10g_file bs=1M count=1*, the operation fails, the connection gets dropped, and the client cannot re-establish the connection to the server. After rebooting only the client, I can mount again. It happens with both Solaris and Linux nfsrdma servers.

For the Linux client/server I run memreg=5 (FRMR); I don't see the problem with memreg=6 (global dma key).

Awesome. This is the key, I think. Thanks for the info Vu,

Tom

On the Solaris server (snv_130) we see a problem decoding a 32K write request. The client sends two read chunks (32K and 16 bytes); the server fails to do the RDMA READ on the 16-byte chunk (cqe.status = 10, i.e. IB_WC_REM_ACCESS_ERR), and therefore terminates the connection. We don't see this problem with NFS version 3 on Solaris. The Solaris server runs normal memory registration mode.

On the Linux client, I see cqe.status = 12, i.e. IB_WC_RETRY_EXC_ERR.

I added these notes in bug #1919 (bugs.openfabrics.org) to track the issue.

thanks,
-vu
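For anyone decoding the raw completion codes quoted above: status 10 and 12 correspond to the REM_ACCESS and RETRY_EXC work-completion errors, and the userspace verbs enum uses the same positions. A hedged sketch of a libibverbs poller that prints failed completions by name (the CQ is assumed to have been created by the caller; this is not part of the NFS/RDMA code):

/*
 * Hedged sketch, userspace libibverbs: drain a completion queue and
 * report failed work completions.  Status 10 maps to
 * IBV_WC_REM_ACCESS_ERR and status 12 to IBV_WC_RETRY_EXC_ERR,
 * the two values quoted in this thread.
 */
#include <stdio.h>
#include <infiniband/verbs.h>

static void drain_and_report(struct ibv_cq *cq)
{
	struct ibv_wc wc;

	while (ibv_poll_cq(cq, 1, &wc) > 0) {
		if (wc.status == IBV_WC_SUCCESS)
			continue;
		fprintf(stderr, "wr_id %llu failed: status %d (%s)\n",
			(unsigned long long)wc.wr_id,
			wc.status, ibv_wc_status_str(wc.status));
	}
}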
Re: [ewg] nfsrdma fails to write big file,
Tom,

Some more info on the problem:

1. Running with memreg=4 (FMR) I cannot reproduce the problem.
2. I also see a different error on the client:

Feb 22 12:16:55 mellanox-2 rpc.idmapd[5786]: nss_getpwnam: name 'nobody' does not map into domain 'localdomain'
Feb 22 12:16:55 mellanox-2 kernel: QP 0x70004b: WQE overflow
Feb 22 12:16:55 mellanox-2 kernel: QP 0x6c004a: WQE overflow
Feb 22 12:16:55 mellanox-2 kernel: QP 0x6c004a: WQE overflow
Feb 22 12:16:55 mellanox-2 kernel: RPC: rpcrdma_ep_post: ib_post_send returned -12 cq_init 48 cq_count 32
Feb 22 12:17:00 mellanox-2 kernel: RPC: rpcrdma_event_process: send WC status 5, vend_err F5
Feb 22 12:17:00 mellanox-2 kernel: rpcrdma: connection to 13.20.1.9:20049 closed (-103)

-vu

-----Original Message-----
From: Tom Tucker [mailto:t...@opengridcomputing.com]
Sent: Monday, February 22, 2010 10:49 AM
To: Vu Pham
Cc: linux-r...@vger.kernel.org; Mahesh Siddheshwar; ewg@lists.openfabrics.org
Subject: Re: [ewg] nfsrdma fails to write big file,

Vu Pham wrote:

Setup:
1. Linux nfsrdma client/server with OFED-1.5.1-20100217-0600, ConnectX2 QDR HCAs fw 2.7.8-6, RHEL 5.2.
2. Solaris nfsrdma server snv_130, ConnectX QDR HCA.

Running vdbench on a 10g file, or *dd if=/dev/zero of=10g_file bs=1M count=1*, the operation fails, the connection gets dropped, and the client cannot re-establish the connection to the server. After rebooting only the client, I can mount again. It happens with both Solaris and Linux nfsrdma servers.

For the Linux client/server I run memreg=5 (FRMR); I don't see the problem with memreg=6 (global dma key).

Awesome. This is the key, I think. Thanks for the info Vu,

Tom

On the Solaris server (snv_130) we see a problem decoding a 32K write request. The client sends two read chunks (32K and 16 bytes); the server fails to do the RDMA READ on the 16-byte chunk (cqe.status = 10, i.e. IB_WC_REM_ACCESS_ERR), and therefore terminates the connection. We don't see this problem with NFS version 3 on Solaris. The Solaris server runs normal memory registration mode.

On the Linux client, I see cqe.status = 12, i.e. IB_WC_RETRY_EXC_ERR.

I added these notes in bug #1919 (bugs.openfabrics.org) to track the issue.

thanks,
-vu
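The "QP 0x...: WQE overflow" messages and ib_post_send returning -12 (ENOMEM) earlier in this message are the send queue running out of slots, i.e. exactly the overflow risk raised earlier in the thread. Below is a hedged sketch of the alternative, explicit flow control instead of relying on completions being reaped "fast enough"; the names send_ctx, SQ_DEPTH and post_or_defer are invented for this example and do not exist in xprtrdma.

/*
 * Hedged sketch, userspace verbs: track in-flight send WRs against the
 * depth the QP was created with, and defer posting when the queue is
 * full, rather than assuming completions arrive in time.
 */
#include <errno.h>
#include <infiniband/verbs.h>

#define SQ_DEPTH 160   /* must match qp_init_attr.cap.max_send_wr (example value) */

struct send_ctx {
	struct ibv_qp *qp;
	unsigned int   inflight;   /* decremented as send completions are polled */
};

static int post_or_defer(struct send_ctx *ctx, struct ibv_send_wr *wr)
{
	struct ibv_send_wr *bad;

	if (ctx->inflight >= SQ_DEPTH)
		return -EAGAIN;   /* caller queues the WR and retries after polling the CQ */

	if (ibv_post_send(ctx->qp, wr, &bad))
		return -EIO;      /* the provider may still reject, cf. the -12 seen above */

	ctx->inflight++;
	return 0;
}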