Re: [ewg] nfsrdma fails to write big file,
Tom Tucker wrote:
Mahesh Siddheshwar wrote:
Hi Tom, Vu,

Tom Tucker wrote:
Roland Dreier wrote:
+		/*
+		 * Add room for frmr register and invalidate WRs.
+		 * Requests sometimes have two chunks; each chunk
+		 * requires a different frmr. The safest
+		 * WRs required are max_send_wr * 6; however, if we
+		 * get send completions and poll fast enough, it
+		 * is pretty safe to have max_send_wr * 4.
+		 */
+		ep->rep_attr.cap.max_send_wr *= 4;

Seems like a bad design if there is a possibility of work queue overflow; if you're counting on events occurring in a particular order or completions being handled fast enough, then your design is going to fail in some high-load situations, which I don't think you want.

Vu,

Would you please try the following:

- Set the multiplier to 5

While trying to test this between a Linux client and a Solaris server, I made the following changes in /usr/src/ofa_kernel-1.5.1/net/sunrpc/xprtrdma/verbs.c:

diff verbs.c.org verbs.c
653c653
< 		ep->rep_attr.cap.max_send_wr *= 3;
---
> 		ep->rep_attr.cap.max_send_wr *= 8;
685c685
< 	ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/2 /* - 1*/;
---
> 	ep->rep_cqinit = ep->rep_attr.cap.max

(I bumped it to 8) and did make install. On reboot I see the errors on NFS READs, as opposed to the WRITEs seen before, when I try to read a 10G file from the server.

The client is running RHEL 5.3 (2.6.18-128.el5PAE) with OFED-1.5.1-20100223-0740 bits. The client has a Sun IB HCA: SUN0070130001, MT25418, 2.7.0 firmware, hw_rev = a0. The server is running Solaris based on snv_128.

rpcdebug output from the client:
==
RPC: 85 call_bind (status 0)
RPC: 85 call_connect xprt ec78d800 is connected
RPC: 85 call_transmit (status 0)
RPC: 85 xprt_prepare_transmit
RPC: 85 xprt_cwnd_limited cong = 0 cwnd = 8192
RPC: 85 rpc_xdr_encode (status 0)
RPC: 85 marshaling UNIX cred eddb4dc0
RPC: 85 using AUTH_UNIX cred eddb4dc0 to wrap rpc data
RPC: 85 xprt_transmit(164)
RPC: rpcrdma_inline_pullup: pad 0 destp 0xf1dd1410 len 164 hdrlen 164
RPC: rpcrdma_register_frmr_external: Using frmr ec7da920 to map 4 segments
RPC: rpcrdma_create_chunks: write chunk elem 16...@0x38536d000:0xa601 (more)
RPC: rpcrdma_register_frmr_external: Using frmr ec7da960 to map 1 segments
RPC: rpcrdma_create_chunks: write chunk elem 1...@0x31dd153c:0xaa01 (last)
RPC: rpcrdma_marshal_req: write chunk: hdrlen 68 rpclen 164 padlen 0 headerp 0xf1dd124c base 0xf1dd136c lkey 0x500
RPC: 85 xmit complete
RPC: 85 sleep_on(queue xprt_pending time 4683109)
RPC: 85 added to queue ec78d994 xprt_pending
RPC: 85 setting alarm for 6 ms
RPC: wake_up_next(ec78d944 xprt_resend)
RPC: wake_up_next(ec78d8f4 xprt_sending)
RPC: rpcrdma_qp_async_error_upcall: QP error 3 on device mlx4_0 ep ec78db40
RPC: 85 __rpc_wake_up_task (now 4683110)
RPC: 85 disabling timer
RPC: 85 removed from queue ec78d994 xprt_pending
RPC: __rpc_wake_up_task done
RPC: 85 __rpc_execute flags=0x1
RPC: 85 call_status (status -107)
RPC: 85 call_bind (status 0)
RPC: 85 call_connect xprt ec78d800 is not connected
RPC: 85 xprt_connect xprt ec78d800 is not connected
RPC: 85 sleep_on(queue xprt_pending time 4683110)
RPC: 85 added to queue ec78d994 xprt_pending
RPC: 85 setting alarm for 6 ms
RPC: rpcrdma_event_process: event rep ec116800 status 5 opcode 80 length 2493606
RPC: rpcrdma_event_process: recv WC status 5, connection lost
RPC: rpcrdma_conn_upcall: disconnected: ec78dbccI4:20049 (ep 0xec78db40 event 0xa)
RPC: rpcrdma_conn_upcall: disconnected
rpcrdma: connection to ec78dbccI4:20049 closed (-103)
RPC: xprt_rdma_connect_worker: reconnect
==

On the server I see:

Mar 3 17:45:16 elena-ar hermon: [ID 271130 kern.notice] NOTICE: hermon0: Device Error: CQE remote access error
Mar 3 17:45:16 elena-ar nfssrv: [ID 819430 kern.notice] NOTICE: NFS: bad sendreply
Mar 3 17:45:21 elena-ar hermon: [ID 271130 kern.notice] NOTICE: hermon0: Device Error: CQE remote access error
Mar 3 17:45:21 elena-ar nfssrv: [ID 819430 kern.notice] NOTICE: NFS: bad sendreply

The remote access error is actually seen on RDMA_WRITE. Doing some more debug on the server with DTrace, I see that the destination address and length match the write chunk element in the Linux debug output above.

0  9385                rib_write:entry daddr 38536d000, len 4000, hdl a601
0  9358      rib_init_sendwait:return ff44a715d308
1  9296    rib_svc_scq_handler:return 1f7
1  9356           rib_sendwait:return 14
1  9386              rib_write:return 14   ^^^ that is RDMA_FAILED in
1 63295 xdrrdma_send_read_data:return 0
1  5969           xdr_READ3res:return
1  5969           xdr_READ3res:return
Re: [ewg] nfsrdma fails to write big file,
Hi Tom, Vu,

Tom Tucker wrote:
Roland Dreier wrote:
+		/*
+		 * Add room for frmr register and invalidate WRs.
+		 * Requests sometimes have two chunks; each chunk
+		 * requires a different frmr. The safest
+		 * WRs required are max_send_wr * 6; however, if we
+		 * get send completions and poll fast enough, it
+		 * is pretty safe to have max_send_wr * 4.
+		 */
+		ep->rep_attr.cap.max_send_wr *= 4;

Seems like a bad design if there is a possibility of work queue overflow; if you're counting on events occurring in a particular order or completions being handled fast enough, then your design is going to fail in some high-load situations, which I don't think you want.

Vu,

Would you please try the following:

- Set the multiplier to 5

While trying to test this between a Linux client and a Solaris server, I made the following changes in /usr/src/ofa_kernel-1.5.1/net/sunrpc/xprtrdma/verbs.c:

diff verbs.c.org verbs.c
653c653
< 		ep->rep_attr.cap.max_send_wr *= 3;
---
> 		ep->rep_attr.cap.max_send_wr *= 8;
685c685
< 	ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/2 /* - 1*/;
---
> 	ep->rep_cqinit = ep->rep_attr.cap.max

(I bumped it to 8) and did make install. On reboot I see the errors on NFS READs, as opposed to the WRITEs seen before, when I try to read a 10G file from the server.

The client is running RHEL 5.3 (2.6.18-128.el5PAE) with OFED-1.5.1-20100223-0740 bits. The client has a Sun IB HCA: SUN0070130001, MT25418, 2.7.0 firmware, hw_rev = a0. The server is running Solaris based on snv_128.

rpcdebug output from the client:
==
RPC: 85 call_bind (status 0)
RPC: 85 call_connect xprt ec78d800 is connected
RPC: 85 call_transmit (status 0)
RPC: 85 xprt_prepare_transmit
RPC: 85 xprt_cwnd_limited cong = 0 cwnd = 8192
RPC: 85 rpc_xdr_encode (status 0)
RPC: 85 marshaling UNIX cred eddb4dc0
RPC: 85 using AUTH_UNIX cred eddb4dc0 to wrap rpc data
RPC: 85 xprt_transmit(164)
RPC: rpcrdma_inline_pullup: pad 0 destp 0xf1dd1410 len 164 hdrlen 164
RPC: rpcrdma_register_frmr_external: Using frmr ec7da920 to map 4 segments
RPC: rpcrdma_create_chunks: write chunk elem 16...@0x38536d000:0xa601 (more)
RPC: rpcrdma_register_frmr_external: Using frmr ec7da960 to map 1 segments
RPC: rpcrdma_create_chunks: write chunk elem 1...@0x31dd153c:0xaa01 (last)
RPC: rpcrdma_marshal_req: write chunk: hdrlen 68 rpclen 164 padlen 0 headerp 0xf1dd124c base 0xf1dd136c lkey 0x500
RPC: 85 xmit complete
RPC: 85 sleep_on(queue xprt_pending time 4683109)
RPC: 85 added to queue ec78d994 xprt_pending
RPC: 85 setting alarm for 6 ms
RPC: wake_up_next(ec78d944 xprt_resend)
RPC: wake_up_next(ec78d8f4 xprt_sending)
RPC: rpcrdma_qp_async_error_upcall: QP error 3 on device mlx4_0 ep ec78db40
RPC: 85 __rpc_wake_up_task (now 4683110)
RPC: 85 disabling timer
RPC: 85 removed from queue ec78d994 xprt_pending
RPC: __rpc_wake_up_task done
RPC: 85 __rpc_execute flags=0x1
RPC: 85 call_status (status -107)
RPC: 85 call_bind (status 0)
RPC: 85 call_connect xprt ec78d800 is not connected
RPC: 85 xprt_connect xprt ec78d800 is not connected
RPC: 85 sleep_on(queue xprt_pending time 4683110)
RPC: 85 added to queue ec78d994 xprt_pending
RPC: 85 setting alarm for 6 ms
RPC: rpcrdma_event_process: event rep ec116800 status 5 opcode 80 length 2493606
RPC: rpcrdma_event_process: recv WC status 5, connection lost
RPC: rpcrdma_conn_upcall: disconnected: ec78dbccI4:20049 (ep 0xec78db40 event 0xa)
RPC: rpcrdma_conn_upcall: disconnected
rpcrdma: connection to ec78dbccI4:20049 closed (-103)
RPC: xprt_rdma_connect_worker: reconnect
==

On the server I see:

Mar 3 17:45:16 elena-ar hermon: [ID 271130 kern.notice] NOTICE: hermon0: Device Error: CQE remote access error
Mar 3 17:45:16 elena-ar nfssrv: [ID 819430 kern.notice] NOTICE: NFS: bad sendreply
Mar 3 17:45:21 elena-ar hermon: [ID 271130 kern.notice] NOTICE: hermon0: Device Error: CQE remote access error
Mar 3 17:45:21 elena-ar nfssrv: [ID 819430 kern.notice] NOTICE: NFS: bad sendreply

The remote access error is actually seen on RDMA_WRITE. Doing some more debug on the server with DTrace, I see that the destination address and length match the write chunk element in the Linux debug output above.

0  9385                rib_write:entry daddr 38536d000, len 4000, hdl a601
0  9358      rib_init_sendwait:return ff44a715d308
1  9296    rib_svc_scq_handler:return 1f7
1  9356           rib_sendwait:return 14
1  9386              rib_write:return 14   ^^^ that is RDMA_FAILED in
1 63295 xdrrdma_send_read_data:return 0
1  5969           xdr_READ3res:return
1  5969           xdr_READ3res:return 0

Is this a variation of the previously
Re: [ewg] nfsrdma fails to write big file,
Roland:

I'll put together a patch based on 5, with a comment that indicates why I think 5 is the number. Since Vu has verified this behaviorally as well, I'm comfortable that our understanding of the code is sound. I'm on the road right now, so it won't be until tomorrow though.

Thanks,
Tom

Vu Pham wrote:
-----Original Message-----
From: Tom Tucker [mailto:t...@opengridcomputing.com]
Sent: Saturday, February 27, 2010 8:23 PM
To: Vu Pham
Cc: Roland Dreier; linux-rdma@vger.kernel.org; Mahesh Siddheshwar; e...@lists.openfabrics.org
Subject: Re: [ewg] nfsrdma fails to write big file,

Roland Dreier wrote:
+		/*
+		 * Add room for frmr register and invalidate WRs.
+		 * Requests sometimes have two chunks; each chunk
+		 * requires a different frmr. The safest
+		 * WRs required are max_send_wr * 6; however, if we
+		 * get send completions and poll fast enough, it
+		 * is pretty safe to have max_send_wr * 4.
+		 */
+		ep->rep_attr.cap.max_send_wr *= 4;

Seems like a bad design if there is a possibility of work queue overflow; if you're counting on events occurring in a particular order or completions being handled fast enough, then your design is going to fail in some high-load situations, which I don't think you want.

Vu,

Would you please try the following:

- Set the multiplier to 5
- Set the number of buffer credits small, as follows:
  echo 4 > /proc/sys/sunrpc/rdma_slot_table_entries
- Rerun your test and see if you can reproduce the problem?

I did the above and was unable to reproduce, but I would like to see if you can, to convince ourselves that 5 is the right number.

Tom,

I did the above and cannot reproduce it either. I think 5 is the right number; however, we should optimize it later.

-vu
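For reference, a minimal sketch of what a multiplier-of-5 hunk in net/sunrpc/xprtrdma/verbs.c might look like, based on the code fragments quoted in this thread. The comment wording and surrounding context are assumptions, not the patch Tom eventually posted:

	ep->rep_attr.cap.max_send_wr = cdata->max_requests;
	switch (ia->ri_memreg_strategy) {
	case RPCRDMA_FRMR:
		/*
		 * Each request can need a fast-register WR and a
		 * local-invalidate WR for the head iov and for the
		 * page list (2 x 2 = 4), plus the SEND of the RPC
		 * itself: a practical worst case of 5 send WRs.
		 */
		ep->rep_attr.cap.max_send_wr *= 5;
		if (ep->rep_attr.cap.max_send_wr > devattr.max_qp_wr)
			return -EINVAL;
		break;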
Re: [ewg] nfsrdma fails to write big file,
Roland Dreier wrote:
+		/*
+		 * Add room for frmr register and invalidate WRs.
+		 * Requests sometimes have two chunks; each chunk
+		 * requires a different frmr. The safest
+		 * WRs required are max_send_wr * 6; however, if we
+		 * get send completions and poll fast enough, it
+		 * is pretty safe to have max_send_wr * 4.
+		 */
+		ep->rep_attr.cap.max_send_wr *= 4;

Seems like a bad design if there is a possibility of work queue overflow; if you're counting on events occurring in a particular order or completions being handled fast enough, then your design is going to fail in some high-load situations, which I don't think you want.

Vu,

Would you please try the following:

- Set the multiplier to 5
- Set the number of buffer credits small, as follows:
  echo 4 > /proc/sys/sunrpc/rdma_slot_table_entries
- Rerun your test and see if you can reproduce the problem?

I did the above and was unable to reproduce, but I would like to see if you can, to convince ourselves that 5 is the right number.

Thanks,
Tom

- R.
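To see why the suggested test (4 slot-table entries, multiplier 5) is a meaningful check, here is a small standalone C sketch of the send-queue budget. It assumes the accounting discussed in this thread: at most slot_table_entries RPCs are outstanding at once, and each can hold up to 5 send WRs (two fast-register, two invalidate, one send) until its WRs are reaped. The numbers are illustrative, not measurements.

#include <stdio.h>

/* Worst-case send WRs one RPC can occupy before its WRs are reaped:
 * fastreg + invalidate for the head iov, fastreg + invalidate for the
 * page list, plus the SEND of the RPC itself. */
#define WRS_PER_REQUEST 5

int main(void)
{
    int slots = 4;                      /* echo 4 > /proc/sys/sunrpc/rdma_slot_table_entries */
    int multipliers[] = { 3, 4, 5 };

    for (int i = 0; i < 3; i++) {
        int max_send_wr = slots * multipliers[i];   /* send queue size chosen by verbs.c */
        int worst_case  = slots * WRS_PER_REQUEST;  /* WRs the client may actually post */
        printf("multiplier %d: send queue %2d, worst case %2d -> %s\n",
               multipliers[i], max_send_wr, worst_case,
               worst_case > max_send_wr ? "can overflow" : "fits");
    }
    return 0;
}

With multiplier 3 or 4 the worst case (20 WRs) exceeds the queue; only multiplier 5 covers it, which is what the test above is meant to confirm or refute.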
Re: [ewg] nfsrdma fails to write big file,
Vu Pham wrote:
Tom,

Did you make any changes to get bonnie++, dd of a 10G file, and vdbench to finish when run concurrently?

No I did not, but my disk subsystem is pretty slow, so it might be that I just don't have fast enough storage.

I keep hitting the WQE overflow error below. I saw that most of the requests have two chunks (a 32K chunk and a some-bytes chunk); each chunk requires an frmr + invalidate WRs. However, you set ep->rep_attr.cap.max_send_wr = cdata->max_requests and then, for the frmr case, you do ep->rep_attr.cap.max_send_wr *= 3; which is not enough. Moreover, you also set ep->rep_cqinit = max_send_wr/2 for the send completion signal, which causes the WQE overflow to happen faster.

After applying the following patch, I have vdbench, dd, and a copy of the 10g_file running overnight.

-vu

--- ofa_kernel-1.5.1.orig/net/sunrpc/xprtrdma/verbs.c	2010-02-24 10:41:22.0 -0800
+++ ofa_kernel-1.5.1/net/sunrpc/xprtrdma/verbs.c	2010-02-24 10:03:18.0 -0800
@@ -649,8 +654,15 @@
 	ep->rep_attr.cap.max_send_wr = cdata->max_requests;
 	switch (ia->ri_memreg_strategy) {
 	case RPCRDMA_FRMR:
-		/* Add room for frmr register and invalidate WRs */
-		ep->rep_attr.cap.max_send_wr *= 3;
+		/*
+		 * Add room for frmr register and invalidate WRs.
+		 * Requests sometimes have two chunks; each chunk
+		 * requires a different frmr. The safest
+		 * WRs required are max_send_wr * 6; however, if we
+		 * get send completions and poll fast enough, it
+		 * is pretty safe to have max_send_wr * 4.
+		 */
+		ep->rep_attr.cap.max_send_wr *= 4;
 		if (ep->rep_attr.cap.max_send_wr > devattr.max_qp_wr)
 			return -EINVAL;
 		break;
@@ -682,7 +694,8 @@
 		ep->rep_attr.cap.max_recv_sge);
 
 	/* set trigger for requesting send completion */
-	ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/2 /* - 1*/;
+	ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/4;
+
 	switch (ia->ri_memreg_strategy) {
 	case RPCRDMA_MEMWINDOWS_ASYNC:
 	case RPCRDMA_MEMWINDOWS:

Erf. This is client code. I'll take a look at this and see if I can understand what Talpey was up to.

Tom

-----Original Message-----
From: ewg-boun...@lists.openfabrics.org [mailto:ewg-boun...@lists.openfabrics.org] On Behalf Of Vu Pham
Sent: Monday, February 22, 2010 12:23 PM
To: Tom Tucker
Cc: linux-rdma@vger.kernel.org; Mahesh Siddheshwar; e...@lists.openfabrics.org
Subject: Re: [ewg] nfsrdma fails to write big file,

Tom,

Some more info on the problem:
1. Running with memreg=4 (FMR) I cannot reproduce the problem.
2. I also see a different error on the client:

Feb 22 12:16:55 mellanox-2 rpc.idmapd[5786]: nss_getpwnam: name 'nobody' does not map into domain 'localdomain'
Feb 22 12:16:55 mellanox-2 kernel: QP 0x70004b: WQE overflow
Feb 22 12:16:55 mellanox-2 kernel: QP 0x6c004a: WQE overflow
Feb 22 12:16:55 mellanox-2 kernel: QP 0x6c004a: WQE overflow
Feb 22 12:16:55 mellanox-2 kernel: RPC: rpcrdma_ep_post: ib_post_send returned -12 cq_init 48 cq_count 32
Feb 22 12:17:00 mellanox-2 kernel: RPC: rpcrdma_event_process: send WC status 5, vend_err F5
Feb 22 12:17:00 mellanox-2 kernel: rpcrdma: connection to 13.20.1.9:20049 closed (-103)

-vu

-----Original Message-----
From: Tom Tucker [mailto:t...@opengridcomputing.com]
Sent: Monday, February 22, 2010 10:49 AM
To: Vu Pham
Cc: linux-rdma@vger.kernel.org; Mahesh Siddheshwar; e...@lists.openfabrics.org
Subject: Re: [ewg] nfsrdma fails to write big file,

Vu Pham wrote:
Setup:
1. Linux nfsrdma client/server with OFED-1.5.1-20100217-0600, ConnectX2 QDR HCAs fw 2.7.8-6, RHEL 5.2.
2. Solaris nfsrdma server snv 130, ConnectX QDR HCA.

Running vdbench on a 10g file, or dd if=/dev/zero of=10g_file bs=1M count=1, the operation fails, the connection gets dropped, and the client cannot re-establish a connection to the server. After rebooting only the client, I can mount again. It happens with both Solaris and Linux nfsrdma servers. For the Linux client/server, I run memreg=5 (FRMR); I don't see the problem with memreg=6 (global dma key).

Awesome. This is the key I think. Thanks for the info Vu,
Tom

On the Solaris server snv 130, we see a problem decoding a write request of 32K. The client sends two read chunks (32K and 16-byte); the server fails to do the RDMA read on the 16-byte chunk (cqe.status = 10, i.e. IB_WC_REM_ACCESS_ERR); therefore, the server terminates the connection. We don't see this problem with NFS version 3 on Solaris. The Solaris server runs normal memory registration mode. On the Linux client, I see cqe.status = 12, i.e. IB_WC_RETRY_EXC_ERR.

I added these notes in bug #1919 (bugs.openfabrics.org) to track the issue.

thanks,
-vu
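The rep_cqinit change above matters because xprtrdma signals only every Nth send; unsignaled WRs keep occupying send-queue slots until a later signaled completion is polled. Below is a standalone sketch of that countdown pattern; the structure and field names here are illustrative only (the real code in verbs.c uses rep_cqinit/rep_cqcount and the IB_SEND_SIGNALED flag):

#include <stdbool.h>

/* Illustrative countdown: with cqinit = max_send_wr/2 a long run of sends
 * goes unsignaled, so their WQEs cannot be reclaimed and the queue fills;
 * cqinit = max_send_wr/4 signals (and therefore reaps) twice as often. */
struct send_signal_ctr {
	int cqinit;	/* threshold: max_send_wr/2 before the patch, /4 after */
	int cqcount;	/* sends remaining until the next signaled WR */
};

static bool signal_this_send(struct send_signal_ctr *c)
{
	if (--c->cqcount <= 0) {
		c->cqcount = c->cqinit;
		return true;	/* set IB_SEND_SIGNALED: a CQE is generated, and
				 * polling it lets the client account for this WR
				 * and every unsignaled WR posted before it */
	}
	return false;		/* unsignaled: consumes a send-queue slot that is
				 * not reclaimed until a later signaled completion */
}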
Re: [ewg] nfsrdma fails to write big file,
Vu,

Are you changing any of the default settings? For example rsize/wsize, etc... I'd like to reproduce this problem if I can.

Thanks,
Tom

Vu Pham wrote:
Tom,

Did you make any changes to get bonnie++, dd of a 10G file, and vdbench to finish when run concurrently? I keep hitting the WQE overflow error below.

I saw that most of the requests have two chunks (a 32K chunk and a some-bytes chunk); each chunk requires an frmr + invalidate WRs. However, you set ep->rep_attr.cap.max_send_wr = cdata->max_requests and then, for the frmr case, you do ep->rep_attr.cap.max_send_wr *= 3; which is not enough. Moreover, you also set ep->rep_cqinit = max_send_wr/2 for the send completion signal, which causes the WQE overflow to happen faster.

After applying the following patch, I have vdbench, dd, and a copy of the 10g_file running overnight.

-vu

--- ofa_kernel-1.5.1.orig/net/sunrpc/xprtrdma/verbs.c	2010-02-24 10:41:22.0 -0800
+++ ofa_kernel-1.5.1/net/sunrpc/xprtrdma/verbs.c	2010-02-24 10:03:18.0 -0800
@@ -649,8 +654,15 @@
 	ep->rep_attr.cap.max_send_wr = cdata->max_requests;
 	switch (ia->ri_memreg_strategy) {
 	case RPCRDMA_FRMR:
-		/* Add room for frmr register and invalidate WRs */
-		ep->rep_attr.cap.max_send_wr *= 3;
+		/*
+		 * Add room for frmr register and invalidate WRs.
+		 * Requests sometimes have two chunks; each chunk
+		 * requires a different frmr. The safest
+		 * WRs required are max_send_wr * 6; however, if we
+		 * get send completions and poll fast enough, it
+		 * is pretty safe to have max_send_wr * 4.
+		 */
+		ep->rep_attr.cap.max_send_wr *= 4;
 		if (ep->rep_attr.cap.max_send_wr > devattr.max_qp_wr)
 			return -EINVAL;
 		break;
@@ -682,7 +694,8 @@
 		ep->rep_attr.cap.max_recv_sge);
 
 	/* set trigger for requesting send completion */
-	ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/2 /* - 1*/;
+	ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/4;
+
 	switch (ia->ri_memreg_strategy) {
 	case RPCRDMA_MEMWINDOWS_ASYNC:
 	case RPCRDMA_MEMWINDOWS:

-----Original Message-----
From: ewg-boun...@lists.openfabrics.org [mailto:ewg-boun...@lists.openfabrics.org] On Behalf Of Vu Pham
Sent: Monday, February 22, 2010 12:23 PM
To: Tom Tucker
Cc: linux-rdma@vger.kernel.org; Mahesh Siddheshwar; e...@lists.openfabrics.org
Subject: Re: [ewg] nfsrdma fails to write big file,

Tom,

Some more info on the problem:
1. Running with memreg=4 (FMR) I cannot reproduce the problem.
2. I also see a different error on the client:

Feb 22 12:16:55 mellanox-2 rpc.idmapd[5786]: nss_getpwnam: name 'nobody' does not map into domain 'localdomain'
Feb 22 12:16:55 mellanox-2 kernel: QP 0x70004b: WQE overflow
Feb 22 12:16:55 mellanox-2 kernel: QP 0x6c004a: WQE overflow
Feb 22 12:16:55 mellanox-2 kernel: QP 0x6c004a: WQE overflow
Feb 22 12:16:55 mellanox-2 kernel: RPC: rpcrdma_ep_post: ib_post_send returned -12 cq_init 48 cq_count 32
Feb 22 12:17:00 mellanox-2 kernel: RPC: rpcrdma_event_process: send WC status 5, vend_err F5
Feb 22 12:17:00 mellanox-2 kernel: rpcrdma: connection to 13.20.1.9:20049 closed (-103)

-vu

-----Original Message-----
From: Tom Tucker [mailto:t...@opengridcomputing.com]
Sent: Monday, February 22, 2010 10:49 AM
To: Vu Pham
Cc: linux-rdma@vger.kernel.org; Mahesh Siddheshwar; e...@lists.openfabrics.org
Subject: Re: [ewg] nfsrdma fails to write big file,

Vu Pham wrote:
Setup:
1. Linux nfsrdma client/server with OFED-1.5.1-20100217-0600, ConnectX2 QDR HCAs fw 2.7.8-6, RHEL 5.2.
2. Solaris nfsrdma server snv 130, ConnectX QDR HCA.

Running vdbench on a 10g file, or dd if=/dev/zero of=10g_file bs=1M count=1, the operation fails, the connection gets dropped, and the client cannot re-establish a connection to the server. After rebooting only the client, I can mount again. It happens with both Solaris and Linux nfsrdma servers. For the Linux client/server, I run memreg=5 (FRMR); I don't see the problem with memreg=6 (global dma key).

Awesome. This is the key I think. Thanks for the info Vu,
Tom

On the Solaris server snv 130, we see a problem decoding a write request of 32K. The client sends two read chunks (32K and 16-byte); the server fails to do the RDMA read on the 16-byte chunk (cqe.status = 10, i.e. IB_WC_REM_ACCESS_ERR); therefore, the server terminates the connection. We don't see this problem with NFS version 3 on Solaris. The Solaris server runs normal memory registration mode. On the Linux client, I see cqe.status = 12, i.e. IB_WC_RETRY_EXC_ERR.

I added these notes in bug #1919 (bugs.openfabrics.org) to track the issue.

thanks,
-vu
Re: [ewg] nfsrdma fails to write big file,
Vu,

Based on the mapping code, it looks to me like the worst case is RPCRDMA_MAX_SEGS * 2 + 1 as the multiplier. However, I think that in practice, due to the way the iovs are built, the actual max is 5 (an frmr for the head + one for the pagelist, plus invalidates for the same, plus one for the send itself). Why did you think the max was 6?

Thanks,
Tom

Tom Tucker wrote:
Vu,

Are you changing any of the default settings? For example rsize/wsize, etc... I'd like to reproduce this problem if I can.

Thanks,
Tom

Vu Pham wrote:
Tom,

Did you make any changes to get bonnie++, dd of a 10G file, and vdbench to finish when run concurrently? I keep hitting the WQE overflow error below.

I saw that most of the requests have two chunks (a 32K chunk and a some-bytes chunk); each chunk requires an frmr + invalidate WRs. However, you set ep->rep_attr.cap.max_send_wr = cdata->max_requests and then, for the frmr case, you do ep->rep_attr.cap.max_send_wr *= 3; which is not enough. Moreover, you also set ep->rep_cqinit = max_send_wr/2 for the send completion signal, which causes the WQE overflow to happen faster.

After applying the following patch, I have vdbench, dd, and a copy of the 10g_file running overnight.

-vu

--- ofa_kernel-1.5.1.orig/net/sunrpc/xprtrdma/verbs.c	2010-02-24 10:41:22.0 -0800
+++ ofa_kernel-1.5.1/net/sunrpc/xprtrdma/verbs.c	2010-02-24 10:03:18.0 -0800
@@ -649,8 +654,15 @@
 	ep->rep_attr.cap.max_send_wr = cdata->max_requests;
 	switch (ia->ri_memreg_strategy) {
 	case RPCRDMA_FRMR:
-		/* Add room for frmr register and invalidate WRs */
-		ep->rep_attr.cap.max_send_wr *= 3;
+		/*
+		 * Add room for frmr register and invalidate WRs.
+		 * Requests sometimes have two chunks; each chunk
+		 * requires a different frmr. The safest
+		 * WRs required are max_send_wr * 6; however, if we
+		 * get send completions and poll fast enough, it
+		 * is pretty safe to have max_send_wr * 4.
+		 */
+		ep->rep_attr.cap.max_send_wr *= 4;
 		if (ep->rep_attr.cap.max_send_wr > devattr.max_qp_wr)
 			return -EINVAL;
 		break;
@@ -682,7 +694,8 @@
 		ep->rep_attr.cap.max_recv_sge);
 
 	/* set trigger for requesting send completion */
-	ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/2 /* - 1*/;
+	ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/4;
+
 	switch (ia->ri_memreg_strategy) {
 	case RPCRDMA_MEMWINDOWS_ASYNC:
 	case RPCRDMA_MEMWINDOWS:

-----Original Message-----
From: ewg-boun...@lists.openfabrics.org [mailto:ewg-boun...@lists.openfabrics.org] On Behalf Of Vu Pham
Sent: Monday, February 22, 2010 12:23 PM
To: Tom Tucker
Cc: linux-rdma@vger.kernel.org; Mahesh Siddheshwar; e...@lists.openfabrics.org
Subject: Re: [ewg] nfsrdma fails to write big file,

Tom,

Some more info on the problem:
1. Running with memreg=4 (FMR) I cannot reproduce the problem.
2. I also see a different error on the client:

Feb 22 12:16:55 mellanox-2 rpc.idmapd[5786]: nss_getpwnam: name 'nobody' does not map into domain 'localdomain'
Feb 22 12:16:55 mellanox-2 kernel: QP 0x70004b: WQE overflow
Feb 22 12:16:55 mellanox-2 kernel: QP 0x6c004a: WQE overflow
Feb 22 12:16:55 mellanox-2 kernel: QP 0x6c004a: WQE overflow
Feb 22 12:16:55 mellanox-2 kernel: RPC: rpcrdma_ep_post: ib_post_send returned -12 cq_init 48 cq_count 32
Feb 22 12:17:00 mellanox-2 kernel: RPC: rpcrdma_event_process: send WC status 5, vend_err F5
Feb 22 12:17:00 mellanox-2 kernel: rpcrdma: connection to 13.20.1.9:20049 closed (-103)

-vu

-----Original Message-----
From: Tom Tucker [mailto:t...@opengridcomputing.com]
Sent: Monday, February 22, 2010 10:49 AM
To: Vu Pham
Cc: linux-rdma@vger.kernel.org; Mahesh Siddheshwar; e...@lists.openfabrics.org
Subject: Re: [ewg] nfsrdma fails to write big file,

Vu Pham wrote:
Setup:
1. Linux nfsrdma client/server with OFED-1.5.1-20100217-0600, ConnectX2 QDR HCAs fw 2.7.8-6, RHEL 5.2.
2. Solaris nfsrdma server snv 130, ConnectX QDR HCA.

Running vdbench on a 10g file, or dd if=/dev/zero of=10g_file bs=1M count=1, the operation fails, the connection gets dropped, and the client cannot re-establish a connection to the server. After rebooting only the client, I can mount again. It happens with both Solaris and Linux nfsrdma servers. For the Linux client/server, I run memreg=5 (FRMR); I don't see the problem with memreg=6 (global dma key).

Awesome. This is the key I think. Thanks for the info Vu,
Tom

On the Solaris server snv 130, we see a problem decoding a write request of 32K. The client sends two read chunks (32K and 16-byte); the server fails to do the RDMA read on the 16-byte chunk (cqe.status = 10 ie
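Tom's two bounds can be written down directly. A tiny standalone C sketch of the arithmetic; the RPCRDMA_MAX_SEGS value used here is a placeholder for illustration, not the constant from the xprtrdma headers:

#include <stdio.h>

#define RPCRDMA_MAX_SEGS 8	/* placeholder; see the xprtrdma headers for the real value */

int main(void)
{
	/* Theoretical bound: a fastreg + invalidate pair per segment, plus one SEND. */
	int theoretical = RPCRDMA_MAX_SEGS * 2 + 1;

	/* Practical bound from the way the iovs are built: one chunk for the head,
	 * one for the page list -> 2 fastreg + 2 invalidate + 1 SEND = 5. */
	int practical = 2 * 2 + 1;

	printf("theoretical worst case: %d send WRs per request\n", theoretical);
	printf("practical worst case:   %d send WRs per request\n", practical);
	return 0;
}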
Re: [ewg] nfsrdma fails to write big file,
Vu,

I ran the number of slots down to 8 (echo 8 > rdma_slot_table_entries) and I can reproduce the issue now. I'm going to try setting the allocation multiple to 5 and see if I can't prove to myself and Roland that we've accurately computed the correct factor.

I think that overall a better solution might be a different credit system; however, that's a much more substantial change than we can tackle at this point.

Tom

Tom Tucker wrote:
Vu,

Based on the mapping code, it looks to me like the worst case is RPCRDMA_MAX_SEGS * 2 + 1 as the multiplier. However, I think that in practice, due to the way the iovs are built, the actual max is 5 (an frmr for the head + one for the pagelist, plus invalidates for the same, plus one for the send itself). Why did you think the max was 6?

Thanks,
Tom

Tom Tucker wrote:
Vu,

Are you changing any of the default settings? For example rsize/wsize, etc... I'd like to reproduce this problem if I can.

Thanks,
Tom

Vu Pham wrote:
Tom,

Did you make any changes to get bonnie++, dd of a 10G file, and vdbench to finish when run concurrently? I keep hitting the WQE overflow error below.

I saw that most of the requests have two chunks (a 32K chunk and a some-bytes chunk); each chunk requires an frmr + invalidate WRs. However, you set ep->rep_attr.cap.max_send_wr = cdata->max_requests and then, for the frmr case, you do ep->rep_attr.cap.max_send_wr *= 3; which is not enough. Moreover, you also set ep->rep_cqinit = max_send_wr/2 for the send completion signal, which causes the WQE overflow to happen faster.

After applying the following patch, I have vdbench, dd, and a copy of the 10g_file running overnight.

-vu

--- ofa_kernel-1.5.1.orig/net/sunrpc/xprtrdma/verbs.c	2010-02-24 10:41:22.0 -0800
+++ ofa_kernel-1.5.1/net/sunrpc/xprtrdma/verbs.c	2010-02-24 10:03:18.0 -0800
@@ -649,8 +654,15 @@
 	ep->rep_attr.cap.max_send_wr = cdata->max_requests;
 	switch (ia->ri_memreg_strategy) {
 	case RPCRDMA_FRMR:
-		/* Add room for frmr register and invalidate WRs */
-		ep->rep_attr.cap.max_send_wr *= 3;
+		/*
+		 * Add room for frmr register and invalidate WRs.
+		 * Requests sometimes have two chunks; each chunk
+		 * requires a different frmr. The safest
+		 * WRs required are max_send_wr * 6; however, if we
+		 * get send completions and poll fast enough, it
+		 * is pretty safe to have max_send_wr * 4.
+		 */
+		ep->rep_attr.cap.max_send_wr *= 4;
 		if (ep->rep_attr.cap.max_send_wr > devattr.max_qp_wr)
 			return -EINVAL;
 		break;
@@ -682,7 +694,8 @@
 		ep->rep_attr.cap.max_recv_sge);
 
 	/* set trigger for requesting send completion */
-	ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/2 /* - 1*/;
+	ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/4;
+
 	switch (ia->ri_memreg_strategy) {
 	case RPCRDMA_MEMWINDOWS_ASYNC:
 	case RPCRDMA_MEMWINDOWS:

-----Original Message-----
From: ewg-boun...@lists.openfabrics.org [mailto:ewg-boun...@lists.openfabrics.org] On Behalf Of Vu Pham
Sent: Monday, February 22, 2010 12:23 PM
To: Tom Tucker
Cc: linux-rdma@vger.kernel.org; Mahesh Siddheshwar; e...@lists.openfabrics.org
Subject: Re: [ewg] nfsrdma fails to write big file,

Tom,

Some more info on the problem:
1. Running with memreg=4 (FMR) I cannot reproduce the problem.
2. I also see a different error on the client:

Feb 22 12:16:55 mellanox-2 rpc.idmapd[5786]: nss_getpwnam: name 'nobody' does not map into domain 'localdomain'
Feb 22 12:16:55 mellanox-2 kernel: QP 0x70004b: WQE overflow
Feb 22 12:16:55 mellanox-2 kernel: QP 0x6c004a: WQE overflow
Feb 22 12:16:55 mellanox-2 kernel: QP 0x6c004a: WQE overflow
Feb 22 12:16:55 mellanox-2 kernel: RPC: rpcrdma_ep_post: ib_post_send returned -12 cq_init 48 cq_count 32
Feb 22 12:17:00 mellanox-2 kernel: RPC: rpcrdma_event_process: send WC status 5, vend_err F5
Feb 22 12:17:00 mellanox-2 kernel: rpcrdma: connection to 13.20.1.9:20049 closed (-103)

-vu

-----Original Message-----
From: Tom Tucker [mailto:t...@opengridcomputing.com]
Sent: Monday, February 22, 2010 10:49 AM
To: Vu Pham
Cc: linux-rdma@vger.kernel.org; Mahesh Siddheshwar; e...@lists.openfabrics.org
Subject: Re: [ewg] nfsrdma fails to write big file,

Vu Pham wrote:
Setup:
1. Linux nfsrdma client/server with OFED-1.5.1-20100217-0600, ConnectX2 QDR HCAs fw 2.7.8-6, RHEL 5.2.
2. Solaris nfsrdma server snv 130, ConnectX QDR HCA.

Running vdbench on a 10g file, or dd if=/dev/zero of=10g_file bs=1M count=1, the operation fails, the connection gets dropped, and the client cannot re-establish a connection to the server. After rebooting only the client, I can mount again. It happens with both Solaris and Linux nfsrdma servers. For linux client/server
Re: [ewg] nfsrdma fails to write big file,
Vu Pham wrote:
Setup:
1. Linux nfsrdma client/server with OFED-1.5.1-20100217-0600, ConnectX2 QDR HCAs fw 2.7.8-6, RHEL 5.2.
2. Solaris nfsrdma server snv 130, ConnectX QDR HCA.

Running vdbench on a 10g file, or dd if=/dev/zero of=10g_file bs=1M count=1, the operation fails, the connection gets dropped, and the client cannot re-establish a connection to the server. After rebooting only the client, I can mount again. It happens with both Solaris and Linux nfsrdma servers. For the Linux client/server, I run memreg=5 (FRMR); I don't see the problem with memreg=6 (global dma key).

Awesome. This is the key I think. Thanks for the info Vu,
Tom

On the Solaris server snv 130, we see a problem decoding a write request of 32K. The client sends two read chunks (32K and 16-byte); the server fails to do the RDMA read on the 16-byte chunk (cqe.status = 10, i.e. IB_WC_REM_ACCESS_ERR); therefore, the server terminates the connection. We don't see this problem with NFS version 3 on Solaris. The Solaris server runs normal memory registration mode. On the Linux client, I see cqe.status = 12, i.e. IB_WC_RETRY_EXC_ERR.

I added these notes in bug #1919 (bugs.openfabrics.org) to track the issue.

thanks,
-vu
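For readers wondering what the memreg numbers above refer to: they select the client's memory-registration strategy (an xprtrdma module parameter). A hedged sketch of the mapping, using only the values quoted in this thread; RPCRDMA_FRMR appears in the quoted code, while the other enumerator names and values here are assumptions:

/* Values taken from this thread: 4 = FMR, 5 = FRMR, 6 = global DMA key. */
enum rpcrdma_memreg_strategy {
	RPCRDMA_MTHCAFMR	= 4,	/* FMR: problem not reproducible */
	RPCRDMA_FRMR		= 5,	/* FRMR: WQE overflow / remote access errors */
	RPCRDMA_ALLPHYSICAL	= 6,	/* global DMA key: problem not reproducible */
};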
RE: [ewg] nfsrdma fails to write big file,
Tom,

Some more info on the problem:
1. Running with memreg=4 (FMR) I cannot reproduce the problem.
2. I also see a different error on the client:

Feb 22 12:16:55 mellanox-2 rpc.idmapd[5786]: nss_getpwnam: name 'nobody' does not map into domain 'localdomain'
Feb 22 12:16:55 mellanox-2 kernel: QP 0x70004b: WQE overflow
Feb 22 12:16:55 mellanox-2 kernel: QP 0x6c004a: WQE overflow
Feb 22 12:16:55 mellanox-2 kernel: QP 0x6c004a: WQE overflow
Feb 22 12:16:55 mellanox-2 kernel: RPC: rpcrdma_ep_post: ib_post_send returned -12 cq_init 48 cq_count 32
Feb 22 12:17:00 mellanox-2 kernel: RPC: rpcrdma_event_process: send WC status 5, vend_err F5
Feb 22 12:17:00 mellanox-2 kernel: rpcrdma: connection to 13.20.1.9:20049 closed (-103)

-vu

-----Original Message-----
From: Tom Tucker [mailto:t...@opengridcomputing.com]
Sent: Monday, February 22, 2010 10:49 AM
To: Vu Pham
Cc: linux-rdma@vger.kernel.org; Mahesh Siddheshwar; e...@lists.openfabrics.org
Subject: Re: [ewg] nfsrdma fails to write big file,

Vu Pham wrote:
Setup:
1. Linux nfsrdma client/server with OFED-1.5.1-20100217-0600, ConnectX2 QDR HCAs fw 2.7.8-6, RHEL 5.2.
2. Solaris nfsrdma server snv 130, ConnectX QDR HCA.

Running vdbench on a 10g file, or dd if=/dev/zero of=10g_file bs=1M count=1, the operation fails, the connection gets dropped, and the client cannot re-establish a connection to the server. After rebooting only the client, I can mount again. It happens with both Solaris and Linux nfsrdma servers. For the Linux client/server, I run memreg=5 (FRMR); I don't see the problem with memreg=6 (global dma key).

Awesome. This is the key I think. Thanks for the info Vu,
Tom

On the Solaris server snv 130, we see a problem decoding a write request of 32K. The client sends two read chunks (32K and 16-byte); the server fails to do the RDMA read on the 16-byte chunk (cqe.status = 10, i.e. IB_WC_REM_ACCESS_ERR); therefore, the server terminates the connection. We don't see this problem with NFS version 3 on Solaris. The Solaris server runs normal memory registration mode. On the Linux client, I see cqe.status = 12, i.e. IB_WC_RETRY_EXC_ERR.

I added these notes in bug #1919 (bugs.openfabrics.org) to track the issue.

thanks,
-vu