Re: [ewg] Making a new ofed_kernel-1.5.1.tgz for OFED-1.5.1.tgz
Hi Vlad,

On 9/12/10 5:34 AM, Vladimir Sokolovsky wrote:

On 09/08/2010 01:41 AM, Tom Tucker wrote:

Hi Vlad, I'm trying to test an update to the kernel in the context of OFED 1.5.1++. I've got everything 'packaged' and working to the point where I need to create an updated ofa_kernel-1.5.1-OFED.1.5.1.src.rpm. I can use makedist to create backport .tgz files (ofa_kernel.tgz for plain, ofa_kernel-$backport.tgz, etc.) but can't figure out how to build the actual source RPM. I presume this is an RPM built from ofa_kernel.tgz plus a twizzled ofed_scripts/ofa_kernel.spec file with @VERSION@ etc. replaced with 1.5.1, and so on. Can you tell me how to do this? Thanks, Tom

Hi Tom, You can use the OFED-1.5.1/docs/ofed_patch.sh script to add a patch to an existing ofa_kernel source RPM.

Great, that's what I did.

Alternatively, you may replace @VERSION@ with 1.5.1 and @RELEASE@ with OFED.1.5.1 (or anything else), tar the directory, and run "rpmbuild -ts" on the tar file to create the source RPM.

Ok, that's fine. I was reluctant to do that because I assumed that there was some tool that did that for you and that there would be some other set of things needed that weren't so obvious. Thanks! Tom

Regards, Vladimir

___ ewg mailing list ewg@lists.openfabrics.org http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
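Vladimir's second recipe (substitute the placeholders, tar, `rpmbuild -ts`) can be sketched in shell. This is an illustrative sketch, not OFED's own tooling: the two-line spec fragment below is a stand-in for the real ofed_scripts/ofa_kernel.spec, and the final rpmbuild step is left commented because it needs an actual build host.

```shell
ver=1.5.1
rel=OFED.1.5.1

work=$(mktemp -d)
src="$work/ofa_kernel-$ver"
mkdir -p "$src/ofed_scripts"

# Stand-in for the template spec shipped in ofed_scripts/ (real file has more).
cat > "$src/ofed_scripts/ofa_kernel.spec" <<'EOF'
Version: @VERSION@
Release: @RELEASE@
EOF

# Expand the @VERSION@ / @RELEASE@ placeholders.
spec="$src/ofed_scripts/ofa_kernel.spec"
sed -e "s/@VERSION@/$ver/g" -e "s/@RELEASE@/$rel/g" "$spec" > "$spec.new" &&
    mv "$spec.new" "$spec"

cat "$spec"

# Tar the directory; on a build host the last step would emit the .src.rpm:
tar -C "$work" -czf "$work/ofa_kernel-$ver.tgz" "ofa_kernel-$ver"
# rpmbuild -ts "$work/ofa_kernel-$ver.tgz"
```

The ofed_patch.sh route is still the easier path when all you need is to add a patch to an existing source RPM; the sketch above is the manual fallback.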
Re: [ewg] OFED bugs and 1.5.1 GA release
Tziporet Koren wrote:

Hi Vu, Steve, Jeff, Tom,

These are the last major bugs open for 1.5.1. Please reply to me whether these are really high priority and when you expect to have the fix. I wish to have the 1.5.1 GA release this week if possible.

1964 blo sw...@opengridcomputing.com   cxgb3 fails openmpi branding
1961 cri t...@opengridcomputing.com    [OFED-1.5.1-NFSoverRDMA] - System hits kernel panic whil...
1976 maj jsquy...@cisco.com            errors running IMB over openmpi-1.4.1
1922 maj t...@opengridcomputing.com    Errors during stress on NFSoRDMA
1979 maj t...@opengridcomputing.com    2 different mounted directories appear to be the same one
1980 maj t...@opengridcomputing.com    failure after nfs stop on NFSoRDMA target
1981 maj t...@opengridcomputing.com    openibd hangs upon restart with mounted NFSoRDMA volume
1978 maj v...@mellanox.com             Kernel Panic when unloading ib_srp

Also - Tom - do you expect NFS-RDMA to be GA this week, or should we say it's in beta and continue improving it for the next 1.5.2 release?

I think it's still Beta. It is fundamentally a more invasive and risky install than the other components because it significantly changes the core NFS implementation and affects non-RDMA mounts for TCP and UDP.

Tziporet
[ewg] RC4 build failure on FC12
Has anyone seen this?

Install rds-tools RPM:
Running rpm -iv /root/OFED-1.5.1-rc4/RPMS/fedora-release-11-1.noarch/x86_64/rds-tools-1.5-1.x86_64.rpm
Build ibutils RPM
Running rpmbuild --rebuild --define '_topdir /var/tmp//OFED_topdir' --define 'dist %{nil}' --target x86_64 --define '_prefix /usr' --define '_exec_prefix /usr' --define '_sysconfdir /etc' --define '_usr /usr' --define 'build_ibmgtsim 1' --define '__arch_install_post %{nil}' --define 'configure_options --with-osm=/usr ' /root/OFED-1.5.1-rc4/SRPMS/ibutils-1.5.4-1.src.rpm
Failed to build ibutils RPM
See /tmp/OFED.18913.logs/ibutils.rpmbuild.log

[r...@shuttle1 OFED-1.5.1-rc4]# tail -50 /tmp/OFED.18913.logs/ibutils.rpmbuild.log
...
ibmssh_wrap.cpp:40796: warning: deprecated conversion from string constant to 'char*'
ibmssh_wrap.cpp:40796: warning: deprecated conversion from string constant to 'char*'
if g++ -DHAVE_CONFIG_H -I. -I. -I.. -I/var/tmp/OFED_topdir/BUILD/ibutils-1.5.4/ibdm/ibdm -I/usr/include -I-I/var/tmp/OFED_topdir/BUILD/ibutils-1.5.4/ibdm/ibdm -I/usr/include -I./../../ibdm/ibdm -I/usr/include/infiniband -I/usr/include -DOSM_VENDOR_INTF_OPENIB -DOSM_BUILD_OPENIB -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1 -O2 -Wall -I/usr/include/infiniband -I/usr/include -DOSM_VENDOR_INTF_OPENIB -DOSM_BUILD_OPENIB -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1 -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -MT sma.o -MD -MP -MF .deps/sma.Tpo -c -o sma.o sma.cpp; \
then mv -f .deps/sma.Tpo .deps/sma.Po; else rm -f .deps/sma.Tpo; exit 1; fi
sma.cpp: In static member function 'static void* SMATimer::timerRun(void*)':
sma.cpp:134: warning: no return statement in function returning non-void
sma.cpp: In member function 'int IBMSSma::nodeDescMad(ibms_mad_msg_t)':
sma.cpp:511: error: invalid conversion from 'const char*' to 'char*'
sma.cpp: In member function 'int IBMSSma::setPortInfoSwExtPort(ibms_mad_msg_t, ibms_mad_msg_t, uint8_t, ib_port_info_t, int)':
sma.cpp:1278: warning: suggest parentheses around arithmetic in operand of '|'
make[3]: *** [sma.o] Error 1
make[3]: Leaving directory `/var/tmp/OFED_topdir/BUILD/ibutils-1.5.4/ibmgtsim/src'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/var/tmp/OFED_topdir/BUILD/ibutils-1.5.4/ibmgtsim'
make[1]: *** [all] Error 2
make[1]: Leaving directory `/var/tmp/OFED_topdir/BUILD/ibutils-1.5.4/ibmgtsim'
make: *** [all-recursive] Error 1
error: Bad exit status from /var/tmp/rpm-tmp.bRz5D4 (%build)

RPM build errors:
user vlad does not exist - using root
group vlad does not exist - using root
user vlad does not exist - using root
group vlad does not exist - using root
Bad exit status from /var/tmp/rpm-tmp.bRz5D4 (%build)
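The hard stop here is the sma.cpp:511 "invalid conversion from 'const char*' to 'char*'" error: FC12's g++ 4.4 treats that old const-incorrect idiom as an error where earlier compilers only warned. One possible workaround — an untested assumption on my part, not an official fix — is to rebuild the SRPM with -fpermissive appended to the optflags so g++ demotes the error back to a warning:

```shell
# Sketch only: build the rpmbuild command with relaxed const-correctness.
# The SRPM path is the one from the failure log above; verify the resulting
# packages before trusting them.
srpm=/root/OFED-1.5.1-rc4/SRPMS/ibutils-1.5.4-1.src.rpm

# Start from the distro's default optflags when rpm is available.
base=$(rpm --eval '%{optflags}' 2>/dev/null) || base='-O2 -g'
optflags="$base -fpermissive"

cmd="rpmbuild --rebuild --target x86_64 --define 'optflags $optflags' $srpm"
echo "$cmd"
# eval "$cmd"    # uncomment on the build host
```

The cleaner long-term fix is of course patching sma.cpp to use const char* where it only reads the string; -fpermissive just papers over the whole file.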
[ewg] [GIT PULL ofed-1.5] bug fix for 1919
Vlad: Please pull from: ssh://boo...@sofa.openfabrics.org/home/boomer/scm/ofed_kernel ofed_1_5

Thanks, Tom
[ewg] Build Broken?
I'm having an issue with cma.c when running makedist.sh. It looks like EL5.5 is broken. Does anyone else have this problem?

Thanks, Tom
Re: [ewg] nfsrdma fails to write big file,
Mahesh Siddheshwar wrote:

Hi Tom, Vu,

Tom Tucker wrote:

Roland Dreier wrote:

+ /*
+  * Add room for frmr register and invalidate WRs.
+  * Requests sometimes have two chunks; each chunk
+  * requires a different frmr. The safest number of
+  * WRs required is max_send_wr * 6; however, since we
+  * get send completions and poll fast enough, it
+  * is pretty safe to have max_send_wr * 4.
+  */
+ ep->rep_attr.cap.max_send_wr *= 4;

Seems like a bad design if there is a possibility of work queue overflow; if you're counting on events occurring in a particular order or completions being handled fast enough, then your design is going to fail in some high-load situations, which I don't think you want.

Vu, would you please try the following: - Set the multiplier to 5

While trying to test this between a Linux client and Solaris server, I made the following changes in /usr/src/ofa_kernel-1.5.1/net/sunrpc/xprtrdma/verbs.c:

diff verbs.c.org verbs.c
653c653
< ep->rep_attr.cap.max_send_wr *= 3;
---
> ep->rep_attr.cap.max_send_wr *= 8;
685c685
< ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/2 /* - 1*/;
---
> ep->rep_cqinit = ep->rep_attr.cap.max

(I bumped it to 8.) I then did make install. On reboot I see the errors on NFS READs, as opposed to WRITEs as seen before, when I try to read a 10G file from the server.

The client is running RHEL 5.3 (2.6.18-128.el5PAE) with OFED-1.5.1-20100223-0740 bits. The client has a Sun IB HCA: SUN0070130001, MT25418, 2.7.0 firmware, hw_rev = a0. The server is running Solaris based on snv_128.
rpcdebug output from the client:
==
RPC:    85 call_bind (status 0)
RPC:    85 call_connect xprt ec78d800 is connected
RPC:    85 call_transmit (status 0)
RPC:    85 xprt_prepare_transmit
RPC:    85 xprt_cwnd_limited cong = 0 cwnd = 8192
RPC:    85 rpc_xdr_encode (status 0)
RPC:    85 marshaling UNIX cred eddb4dc0
RPC:    85 using AUTH_UNIX cred eddb4dc0 to wrap rpc data
RPC:    85 xprt_transmit(164)
RPC:       rpcrdma_inline_pullup: pad 0 destp 0xf1dd1410 len 164 hdrlen 164
RPC:       rpcrdma_register_frmr_external: Using frmr ec7da920 to map 4 segments
RPC:       rpcrdma_create_chunks: write chunk elem 16...@0x38536d000:0xa601 (more)
RPC:       rpcrdma_register_frmr_external: Using frmr ec7da960 to map 1 segments
RPC:       rpcrdma_create_chunks: write chunk elem 1...@0x31dd153c:0xaa01 (last)
RPC:       rpcrdma_marshal_req: write chunk: hdrlen 68 rpclen 164 padlen 0 headerp 0xf1dd124c base 0xf1dd136c lkey 0x500
RPC:    85 xmit complete
RPC:    85 sleep_on(queue xprt_pending time 4683109)
RPC:    85 added to queue ec78d994 xprt_pending
RPC:    85 setting alarm for 6 ms
RPC:       wake_up_next(ec78d944 xprt_resend)
RPC:       wake_up_next(ec78d8f4 xprt_sending)
RPC:       rpcrdma_qp_async_error_upcall: QP error 3 on device mlx4_0 ep ec78db40
RPC:    85 __rpc_wake_up_task (now 4683110)
RPC:    85 disabling timer
RPC:    85 removed from queue ec78d994 xprt_pending
RPC:       __rpc_wake_up_task done
RPC:    85 __rpc_execute flags=0x1
RPC:    85 call_status (status -107)
RPC:    85 call_bind (status 0)
RPC:    85 call_connect xprt ec78d800 is not connected
RPC:    85 xprt_connect xprt ec78d800 is not connected
RPC:    85 sleep_on(queue xprt_pending time 4683110)
RPC:    85 added to queue ec78d994 xprt_pending
RPC:    85 setting alarm for 6 ms
RPC:       rpcrdma_event_process: event rep ec116800 status 5 opcode 80 length 2493606
RPC:       rpcrdma_event_process: recv WC status 5, connection lost
RPC:       rpcrdma_conn_upcall: disconnected: ec78dbccI4:20049 (ep 0xec78db40 event 0xa)
RPC:       rpcrdma_conn_upcall: disconnected
rpcrdma: connection to ec78dbccI4:20049 closed (-103)
RPC:       xprt_rdma_connect_worker: reconnect
==

On the server I see:
Mar 3 17:45:16 elena-ar hermon: [ID 271130 kern.notice] NOTICE: hermon0: Device Error: CQE remote access error
Mar 3 17:45:16 elena-ar nfssrv: [ID 819430 kern.notice] NOTICE: NFS: bad sendreply
Mar 3 17:45:21 elena-ar hermon: [ID 271130 kern.notice] NOTICE: hermon0: Device Error: CQE remote access error
Mar 3 17:45:21 elena-ar nfssrv: [ID 819430 kern.notice] NOTICE: NFS: bad sendreply

The remote access error is actually seen on RDMA_WRITE. Doing some more debug on the server with DTrace, I see that the destination address and length match the write chunk element in the Linux debug output above.

0  9385 rib_write:entry daddr 38536d000, len 4000, hdl a601
0  9358 rib_init_sendwait:return ff44a715d308
1  9296 rib_svc_scq_handler:return 1f7
1  9356 rib_sendwait:return 14
1  9386 rib_write:return 14   ^^^ that is RDMA_FAILED
1 63295 xdrrdma_send_read_data:return 0
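The multiplier debate in this thread (4 vs. 5 vs. 6 vs. 8) is just worst-case send-queue accounting. A back-of-envelope sketch, under one plausible reading of the quoted comment — the slot count is illustrative, and the per-RPC breakdown assumes two chunks, each needing its own FRMR register and invalidate, plus the RPC send itself:

```shell
slots=32          # concurrent RPC credits (illustrative; matches "slots 32" logs)
chunks=2          # worst case per the quoted comment
wrs_per_chunk=2   # one FRMR register + one invalidate per chunk (assumed)
send=1            # the RPC send WR itself

per_rpc=$((chunks * wrs_per_chunk + send))
echo "WRs per RPC: $per_rpc"
echo "max_send_wr >= $((slots * per_rpc))"
```

Under that reading a multiplier of 4 under-provisions by one WR per in-flight RPC, so correctness then depends on completions being polled fast enough — which is exactly Roland's objection to the design.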
Re: [ewg] nfsrdma fails to write big file,
Vu Pham wrote:

Setup:
1. Linux nfsrdma client/server with OFED-1.5.1-20100217-0600, ConnectX2 QDR HCAs fw 2.7.8-6, RHEL 5.2.
2. Solaris nfsrdma server snv_130, ConnectX QDR HCA.

Running vdbench on a 10g file or "dd if=/dev/zero of=10g_file bs=1M count=1", the operation fails, the connection gets dropped, and the client cannot re-establish a connection to the server. After rebooting only the client, I can mount again. It happens with both Solaris and Linux nfsrdma servers. For the Linux client/server, I run memreg=5 (FRMR); I don't see the problem with memreg=6 (global dma key).

Awesome. This is the key, I think. Thanks for the info Vu,

Tom

On the Solaris server snv_130, we see a problem decoding a write request of 32K. The client sends two read chunks (32K + 16-byte); the server fails to do the RDMA read on the 16-byte chunk (cqe.status = 10, i.e. IB_WC_REM_ACCESS_ERROR); therefore, the server terminates the connection. We don't see this problem with NFS version 3 on Solaris. The Solaris server runs normal memory registration mode.

On the Linux client, I see cqe.status = 12, i.e. IB_WC_RETRY_EXC_ERR.

I added these notes in bug #1919 (bugs.openfabrics.org) to track the issue.

thanks,
-vu
Re: [ewg] MLX4 Strangeness
Hi Tziporet:

Here is a trace with the data for the WR failing with status 12. The vendor error is 129.

Feb 17 12:27:33 vic10 kernel: rpcrdma_event_process:154 wr_id status 12 opcode 0 vendor_err 129 byte_len 0 qp 81002a13ec00 ex src_qp wc_flags, 0 pkey_index
Feb 17 12:27:33 vic10 kernel: rpcrdma_event_process:154 wr_id 81002878d800 status 5 opcode 0 vendor_err 244 byte_len 0 qp 81002a13ec00 ex src_qp wc_flags, 0 pkey_index
Feb 17 12:27:33 vic10 kernel: rpcrdma_event_process:167 wr_id 81002878d800 status 5 opcode 0 vendor_err 244 byte_len 0 qp 81002a13ec00 ex src_qp wc_flags, 0 pkey_index

Any thoughts?

Tom

Tom Tucker wrote:

Tom Tucker wrote:

Tziporet Koren wrote:

On 2/15/2010 10:24 PM, Tom Tucker wrote:

Hello,

I am seeing some very strange behavior on my MLX4 adapters running 2.7 firmware and the latest OFED 1.5.1. Two systems are involved; each has a dual-ported MTHCA DDR adapter and an MLX4 adapter. The scenario starts with NFSRDMA stress testing between the two systems running bonnie++ and iozone concurrently. The test completes and there is no issue. Then 6 minutes pass and the server times out the connection and shuts down the RC connection to the client. From this point on, using the RDMA CM, a new RC QP can be brought up and moved to RTS; however, the first RDMA_SEND to the NFS server system fails with IB_WC_RETRY_EXC_ERR.

I have confirmed:
- that arp completed successfully and the neighbor entries are populated on both the client and server
- that the QPs are in the RTS state on both the client and server
- that there are RECV WRs posted to the RQ on the server and they did not error out
- that no RECV WR completed successfully or in error on the server
- that there are SEND WRs posted to the QP on the client
- that the client-side SEND WR fails with error 12 as mentioned above

I have also confirmed the following with a different application (i.e. rping):

server# rping -s
client# rping -c -a 192.168.80.129

fails with the exact same error, i.e.

client# rping -c -a 192.168.80.129
cq completion failed status 12
wait for RDMA_WRITE_ADV state 10
client DISCONNECT EVENT...

However, if I run rping the other way, it works fine, that is,

client# rping -s
server# rping -c -a 192.168.80.135

It runs without error until I stop it. Does anyone have any ideas on how I might debug this?

Tom

What is the vendor syndrome error when you get a completion with error?

Feb 16 15:08:29 vic10 kernel: rpcrdma: connection to 192.168.80.129:20049 closed (-103)
Feb 16 15:51:27 vic10 kernel: rpcrdma: connection to 192.168.80.129:20049 on mlx4_0, memreg 5 slots 32 ird 16
Feb 16 15:52:01 vic10 kernel: rpcrdma_event_process:160 wr_id 81002879a000 status 5 opcode 0 vendor_err 244 byte_len 0 qp 81003c9e3200 ex src_qp wc_flags, 0 pkey_index
Feb 16 15:52:06 vic10 kernel: rpcrdma: connection to 192.168.80.129:20049 closed (-103)
Feb 16 15:52:06 vic10 kernel: rpcrdma: connection to 192.168.80.129:20049 on mlx4_0, memreg 5 slots 32 ird 16
Feb 16 15:52:40 vic10 kernel: rpcrdma_event_process:160 wr_id 81002879a000 status 5 opcode 0 vendor_err 244 byte_len 0 qp 81002f2d8400 ex src_qp wc_flags, 0 pkey_index
Repeat forever

So the vendor err is 244.

Please ignore this. This log skips the failing WR (:-\). I need to do another trace.

Does the issue occur only on the ConnectX cards (mlx4) or also on the InfiniHost cards (mthca)?

Tziporet
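When chasing these, it helps to reduce the kernel log to just the completion-status pairs so the one failing WR stands out from the flush noise. A small sketch — the line format is taken from the rpcrdma_event_process messages quoted in this thread, and the function name is mine:

```shell
# Extract (status, vendor_err) pairs from rpcrdma completion-error lines,
# e.g. from `dmesg | parse_wc_errors`.
parse_wc_errors() {
    awk '/rpcrdma_event_process/ {
        for (i = 1; i <= NF; i++) {
            if ($i == "status")     s = $(i + 1)
            if ($i == "vendor_err") v = $(i + 1)
        }
        print "status=" s " vendor_err=" v
    }'
}

# Demo on two lines from the trace above:
parse_wc_errors <<'EOF'
Feb 17 12:27:33 vic10 kernel: rpcrdma_event_process:154 wr_id status 12 opcode 0 vendor_err 129 byte_len 0
Feb 17 12:27:33 vic10 kernel: rpcrdma_event_process:154 wr_id 81002878d800 status 5 opcode 0 vendor_err 244 byte_len 0
EOF
```

Status 12 is IB_WC_RETRY_EXC_ERR and status 5 is IB_WC_WR_FLUSH_ERR, so the first entry is the interesting one; the flushed WRs (vendor_err 244) are collateral from the QP moving to the error state.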
Re: [ewg] MLX4 Strangeness
Tziporet Koren wrote:

On 2/15/2010 10:24 PM, Tom Tucker wrote:

Hello,

I am seeing some very strange behavior on my MLX4 adapters running 2.7 firmware and the latest OFED 1.5.1. Two systems are involved; each has a dual-ported MTHCA DDR adapter and an MLX4 adapter. The scenario starts with NFSRDMA stress testing between the two systems running bonnie++ and iozone concurrently. The test completes and there is no issue. Then 6 minutes pass and the server times out the connection and shuts down the RC connection to the client. From this point on, using the RDMA CM, a new RC QP can be brought up and moved to RTS; however, the first RDMA_SEND to the NFS server system fails with IB_WC_RETRY_EXC_ERR.

I have confirmed:
- that arp completed successfully and the neighbor entries are populated on both the client and server
- that the QPs are in the RTS state on both the client and server
- that there are RECV WRs posted to the RQ on the server and they did not error out
- that no RECV WR completed successfully or in error on the server
- that there are SEND WRs posted to the QP on the client
- that the client-side SEND WR fails with error 12 as mentioned above

I have also confirmed the following with a different application (i.e. rping):

server# rping -s
client# rping -c -a 192.168.80.129

fails with the exact same error, i.e.

client# rping -c -a 192.168.80.129
cq completion failed status 12
wait for RDMA_WRITE_ADV state 10
client DISCONNECT EVENT...

However, if I run rping the other way, it works fine, that is,

client# rping -s
server# rping -c -a 192.168.80.135

It runs without error until I stop it. Does anyone have any ideas on how I might debug this?

Tom

What is the vendor syndrome error when you get a completion with error?

Hang on... compiling

Does the issue occur only on the ConnectX cards (mlx4) or also on the InfiniHost cards (mthca)?

Only the MLX4 cards.

Tziporet
Re: [ewg] MLX4 Strangeness
Tziporet Koren wrote:

On 2/15/2010 10:24 PM, Tom Tucker wrote:

Hello,

I am seeing some very strange behavior on my MLX4 adapters running 2.7 firmware and the latest OFED 1.5.1. Two systems are involved; each has a dual-ported MTHCA DDR adapter and an MLX4 adapter. The scenario starts with NFSRDMA stress testing between the two systems running bonnie++ and iozone concurrently. The test completes and there is no issue. Then 6 minutes pass and the server times out the connection and shuts down the RC connection to the client. From this point on, using the RDMA CM, a new RC QP can be brought up and moved to RTS; however, the first RDMA_SEND to the NFS server system fails with IB_WC_RETRY_EXC_ERR.

I have confirmed:
- that arp completed successfully and the neighbor entries are populated on both the client and server
- that the QPs are in the RTS state on both the client and server
- that there are RECV WRs posted to the RQ on the server and they did not error out
- that no RECV WR completed successfully or in error on the server
- that there are SEND WRs posted to the QP on the client
- that the client-side SEND WR fails with error 12 as mentioned above

I have also confirmed the following with a different application (i.e. rping):

server# rping -s
client# rping -c -a 192.168.80.129

fails with the exact same error, i.e.

client# rping -c -a 192.168.80.129
cq completion failed status 12
wait for RDMA_WRITE_ADV state 10
client DISCONNECT EVENT...

However, if I run rping the other way, it works fine, that is,

client# rping -s
server# rping -c -a 192.168.80.135

It runs without error until I stop it. Does anyone have any ideas on how I might debug this?

Tom

What is the vendor syndrome error when you get a completion with error?

Feb 16 15:08:29 vic10 kernel: rpcrdma: connection to 192.168.80.129:20049 closed (-103)
Feb 16 15:51:27 vic10 kernel: rpcrdma: connection to 192.168.80.129:20049 on mlx4_0, memreg 5 slots 32 ird 16
Feb 16 15:52:01 vic10 kernel: rpcrdma_event_process:160 wr_id 81002879a000 status 5 opcode 0 vendor_err 244 byte_len 0 qp 81003c9e3200 ex src_qp wc_flags, 0 pkey_index
Feb 16 15:52:06 vic10 kernel: rpcrdma: connection to 192.168.80.129:20049 closed (-103)
Feb 16 15:52:06 vic10 kernel: rpcrdma: connection to 192.168.80.129:20049 on mlx4_0, memreg 5 slots 32 ird 16
Feb 16 15:52:40 vic10 kernel: rpcrdma_event_process:160 wr_id 81002879a000 status 5 opcode 0 vendor_err 244 byte_len 0 qp 81002f2d8400 ex src_qp wc_flags, 0 pkey_index
Repeat forever

So the vendor err is 244.

Does the issue occur only on the ConnectX cards (mlx4) or also on the InfiniHost cards (mthca)?

Tziporet
Re: [ewg] MLX4 Strangeness
More info...

Rebooting the client and trying to reconnect to a server that has not been rebooted fails in the same way. It must be an issue with the server. I see no completions on the server, nor any indication that an RDMA_SEND was incoming. Is there some way to dump adapter state or otherwise see if there was traffic on the wire?

Tom

Tom Tucker wrote:

Tom Tucker wrote:

Tziporet Koren wrote:

On 2/15/2010 10:24 PM, Tom Tucker wrote:

Hello,

I am seeing some very strange behavior on my MLX4 adapters running 2.7 firmware and the latest OFED 1.5.1. Two systems are involved; each has a dual-ported MTHCA DDR adapter and an MLX4 adapter. The scenario starts with NFSRDMA stress testing between the two systems running bonnie++ and iozone concurrently. The test completes and there is no issue. Then 6 minutes pass and the server times out the connection and shuts down the RC connection to the client. From this point on, using the RDMA CM, a new RC QP can be brought up and moved to RTS; however, the first RDMA_SEND to the NFS server system fails with IB_WC_RETRY_EXC_ERR.

I have confirmed:
- that arp completed successfully and the neighbor entries are populated on both the client and server
- that the QPs are in the RTS state on both the client and server
- that there are RECV WRs posted to the RQ on the server and they did not error out
- that no RECV WR completed successfully or in error on the server
- that there are SEND WRs posted to the QP on the client
- that the client-side SEND WR fails with error 12 as mentioned above

I have also confirmed the following with a different application (i.e. rping):

server# rping -s
client# rping -c -a 192.168.80.129

fails with the exact same error, i.e.

client# rping -c -a 192.168.80.129
cq completion failed status 12
wait for RDMA_WRITE_ADV state 10
client DISCONNECT EVENT...

However, if I run rping the other way, it works fine, that is,

client# rping -s
server# rping -c -a 192.168.80.135

It runs without error until I stop it. Does anyone have any ideas on how I might debug this?

Tom

What is the vendor syndrome error when you get a completion with error?

Feb 16 15:08:29 vic10 kernel: rpcrdma: connection to 192.168.80.129:20049 closed (-103)
Feb 16 15:51:27 vic10 kernel: rpcrdma: connection to 192.168.80.129:20049 on mlx4_0, memreg 5 slots 32 ird 16
Feb 16 15:52:01 vic10 kernel: rpcrdma_event_process:160 wr_id 81002879a000 status 5 opcode 0 vendor_err 244 byte_len 0 qp 81003c9e3200 ex src_qp wc_flags, 0 pkey_index
Feb 16 15:52:06 vic10 kernel: rpcrdma: connection to 192.168.80.129:20049 closed (-103)
Feb 16 15:52:06 vic10 kernel: rpcrdma: connection to 192.168.80.129:20049 on mlx4_0, memreg 5 slots 32 ird 16
Feb 16 15:52:40 vic10 kernel: rpcrdma_event_process:160 wr_id 81002879a000 status 5 opcode 0 vendor_err 244 byte_len 0 qp 81002f2d8400 ex src_qp wc_flags, 0 pkey_index
Repeat forever

So the vendor err is 244.

Please ignore this. This log skips the failing WR (:-\). I need to do another trace.

Does the issue occur only on the ConnectX cards (mlx4) or also on the InfiniHost cards (mthca)?

Tziporet
Re: [ewg] [GIT PULL ofed-1.5] nfsrdma fixes
Tziporet Koren wrote:

On 2/9/2010 11:22 PM, Tom Tucker wrote:

Hi Vlad: I have made updates to the nfsrdma patch files. We put them in Steve's tree just for now, until I get my tree all set up. Please pull from ssh://sw...@sofa.openfabrics.org/home/swise/scm/ofed_kernel.git ofed_1_5

Tom

Please also move the bugzilla bugs you have fixed to the fixed state.

Ok, once I'm sure I fixed them.

Thanks
Tziporet
[ewg] RE: [ofa-general] OFED 1.3 Alpha release is available
On Tue, 2007-10-16 at 17:46 -0700, Scott Weitzenkamp (sweitzen) wrote:

3. IPoIB
o Stateless offloads
o NAPI is enabled by default

How does one measure these changes using tools like netperf or iperf? Do I need a specific HCA type?

4. SDP - these are not yet in the alpha release
o Keep-alive
o Asynch IO
o Send Zero Copy

If it didn't make it into alpha, perhaps it should not go into 1.3, so we can hold the release date better?

Whatever happened to NFS RDMA?

The SVC transport switch and SVC-UDP/TCP/RDMA transport drivers are targeted for 2.6.25. To track this activity, see [EMAIL PROTECTED]

Scott