Re: [ewg] Making a new ofed_kernel-1.5.1.tgz for OFED-1.5.1.tgz

2010-09-12 Thread Tom Tucker
  Hi Vlad,

On 9/12/10 5:34 AM, Vladimir Sokolovsky wrote:
 On 09/08/2010 01:41 AM, Tom Tucker wrote:
 Hi Vlad,

 I'm trying to test an update to the kernel in the context of OFED
 1.5.1++. I've got
 everything 'packaged' and working to the point where I need to create an
 updated ofa_kernel-1.5.1-OFED.1.5.1..src.rpm

 I can use makedist to create the backport .tgz files (ofa_kernel.tgz,
 ofa_kernel-$backport.tgz, etc.) but can't figure out how to build the actual
 src RPM. I presume this is an RPM built from ofa_kernel.tgz plus a twizzled
 ofed_scripts/ofa_kernel.spec file with @VERSION@ etc. replaced with 1.5.1,
 and so on.

 Can you tell me how to do this?

 Thanks,
 Tom



 Hi Tom,
 You can use the OFED-1.5.1/docs/ofed_patch.sh script to add a patch to an
 existing ofa_kernel source RPM.

Great, that's what I did.
 Alternatively, you may replace @VERSION@ by 1.5.1 and @RELEASE@ by OFED.1.5.1
 (or anything else), tar the directory, and run rpmbuild -ts <tar file> to
 create the source RPM.

Ok, that's fine. I was reluctant to do that because I assumed that there 
was some tool that did that for you and that there would be some other 
set of things needed that weren't so obvious.

Thanks!
Tom



 Regards,
 Vladimir
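
For the archive, the manual route Vlad describes comes down to something like
the following sketch. It assumes the stock OFED 1.5.1 tree layout (spec
template in ofed_scripts/ofa_kernel.spec) and an arbitrary release string;
adjust names and paths to taste.

cd ofa_kernel-1.5.1
sed -e 's/@VERSION@/1.5.1/g' -e 's/@RELEASE@/OFED.1.5.1/g' \
    ofed_scripts/ofa_kernel.spec > ofa_kernel.spec
cd ..
tar czf ofa_kernel-1.5.1.tgz ofa_kernel-1.5.1
# -ts picks up the spec it finds inside the tarball and emits the
# .src.rpm under %_topdir/SRPMS
rpmbuild -ts ofa_kernel-1.5.1.tgz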



Re: [ewg] OFED bugs and 1.5.1 GA release

2010-03-16 Thread Tom Tucker
Tziporet Koren wrote:
 Hi Vu, Steve, Jeff, Tom

 These are the last major bugs open for 1.5.1.
 Please let me know whether these are really high priority and when you expect
 to have the fixes.
 I would like to have the 1.5.1 GA release this week if possible.

 1964  blocker   sw...@opengridcomputing.com  cxgb3 fails openmpi branding
 1961  critical  t...@opengridcomputing.com   [OFED-1.5.1 - NFSoverRDMA] - System hits kernel panic whil...
 1976  major     jsquy...@cisco.com           errors running IMB over openmpi-1.4.1
 1922  major     t...@opengridcomputing.com   Errors during stress on NFSoRDMA
 1979  major     t...@opengridcomputing.com   2 different mounted directories appear to be the same one
 1980  major     t...@opengridcomputing.com   failure after nfs stop on NFSoRDMA target
 1981  major     t...@opengridcomputing.com   openibd hangs upon restart with mounted NFSoRDMA volume
 1978  major     v...@mellanox.com            Kernel Panic when unloading ib_srp


 Also - Tom - do you expect NFS-RDMA to be GA this week, or should we say
 it's in beta and continue improving it for the next 1.5.2 release?

   
I think it's still Beta. It is fundamentally a more invasive and risky 
install than the other components because it significantly changes the 
core NFS implementation and affects non-RDMA mounts for TCP and UDP.


 Tziporet




[ewg] RC4 build failure on FC12

2010-03-11 Thread Tom Tucker
Has anyone seen this?


Install rds-tools RPM:
Running rpm -iv  
/root/OFED-1.5.1-rc4/RPMS/fedora-release-11-1.noarch/x86_64/rds-tools-1.5-1.x86_64.rpm
Build ibutils RPM
Running  rpmbuild --rebuild  --define '_topdir /var/tmp//OFED_topdir' --define 
'dist %{nil}' --target x86_64 --define '_prefix /usr' --define '_exec_prefix 
/usr' --define '_sysconfdir /etc' --define '_usr /usr' --define 'build_ibmgtsim 
1' --define '__arch_install_post %{nil}' --define 'configure_options  
--with-osm=/usr ' /root/OFED-1.5.1-rc4/SRPMS/ibutils-1.5.4-1.src.rpm
Failed to build ibutils RPM 
See /tmp/OFED.18913.logs/ibutils.rpmbuild.log 

[r...@shuttle1 OFED-1.5.1-rc4]# tail -50 
/tmp/OFED.18913.logs/ibutils.rpmbuild.log
...
ibmssh_wrap.cpp:40796: warning: deprecated conversion from string constant to 
'char*'
ibmssh_wrap.cpp:40796: warning: deprecated conversion from string constant to 
'char*'
if g++ -DHAVE_CONFIG_H -I. -I. -I.. 
-I/var/tmp/OFED_topdir/BUILD/ibutils-1.5.4/ibdm/ibdm -I/usr/include 
-I-I/var/tmp/OFED_topdir/BUILD/ibutils-1.5.4/ibdm/ibdm -I/usr/include 
-I./../../ibdm/ibdm -I/usr/include/infiniband -I/usr/include  
-DOSM_VENDOR_INTF_OPENIB  -DOSM_BUILD_OPENIB -D_XOPEN_SOURCE=600 
-D_BSD_SOURCE=1  -O2 -Wall -I/usr/include/infiniband -I/usr/include  
-DOSM_VENDOR_INTF_OPENIB  -DOSM_BUILD_OPENIB -D_XOPEN_SOURCE=600 
-D_BSD_SOURCE=1 -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions 
-fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -MT sma.o -MD 
-MP -MF .deps/sma.Tpo -c -o sma.o sma.cpp; \
then mv -f .deps/sma.Tpo .deps/sma.Po; else rm -f .deps/sma.Tpo; 
exit 1; fi
sma.cpp: In static member function 'static void* SMATimer::timerRun(void*)':
sma.cpp:134: warning: no return statement in function returning non-void
sma.cpp: In member function 'int IBMSSma::nodeDescMad(ibms_mad_msg_t)':
sma.cpp:511: error: invalid conversion from 'const char*' to 'char*'
sma.cpp: In member function 'int IBMSSma::setPortInfoSwExtPort(ibms_mad_msg_t, 
ibms_mad_msg_t, uint8_t, ib_port_info_t, int)':
sma.cpp:1278: warning: suggest parentheses around arithmetic in operand of '|'
make[3]: *** [sma.o] Error 1
make[3]: Leaving directory 
`/var/tmp/OFED_topdir/BUILD/ibutils-1.5.4/ibmgtsim/src'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/var/tmp/OFED_topdir/BUILD/ibutils-1.5.4/ibmgtsim'
make[1]: *** [all] Error 2
make[1]: Leaving directory `/var/tmp/OFED_topdir/BUILD/ibutils-1.5.4/ibmgtsim'
make: *** [all-recursive] Error 1
error: Bad exit status from /var/tmp/rpm-tmp.bRz5D4 (%build)


RPM build errors:
user vlad does not exist - using root
group vlad does not exist - using root
user vlad does not exist - using root
group vlad does not exist - using root
Bad exit status from /var/tmp/rpm-tmp.bRz5D4 (%build)



[ewg] [GIT PULL ofed-1.5] bug fix for 1919

2010-03-09 Thread Tom Tucker
Vlad:

Please pull from:

ssh://boo...@sofa.openfabrics.org/home/boomer/scm/ofed_kernel ofed_1_5

Thanks,
Tom
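
For reference, on the receiving side this corresponds to a plain branch pull
(URL copied verbatim from the request above, including the archive's address
elision):

git pull ssh://boo...@sofa.openfabrics.org/home/boomer/scm/ofed_kernel ofed_1_5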


[ewg] Build Broken?

2010-03-04 Thread Tom Tucker

I'm having an issue with cma.c when running makedist.sh. It looks like 
EL5.5 is broken.

Does anyone else have this problem?

Thanks,
Tom



Re: [ewg] nfsrdma fails to write big file,

2010-03-03 Thread Tom Tucker
Mahesh Siddheshwar wrote:
 Hi Tom, Vu,

 Tom Tucker wrote:
 Roland Dreier wrote:
  +       /*
  +        * Add room for frmr register and invalidate WRs.
  +        * Requests sometimes have two chunks, each chunk
  +        * requires to have different frmr. The safest
  +        * WRs required are max_send_wr * 6; however, we
  +        * get send completions and poll fast enough, it
  +        * is pretty safe to have max_send_wr * 4.
  +        */
  +       ep->rep_attr.cap.max_send_wr *= 4;

 Seems like a bad design if there is a possibility of work queue
 overflow; if you're counting on events occurring in a particular order
 or completions being handled fast enough, then your design is 
 going to
 fail in some high load situations, which I don't think you want.   

 Vu,

 Would you please try the following:

 - Set the multiplier to 5
 While trying to test this between a Linux client and a Solaris server,
 I made the following changes in
 /usr/src/ofa_kernel-1.5.1/net/sunrpc/xprtrdma/verbs.c:

 diff verbs.c.org verbs.c
 653c653
 <       ep->rep_attr.cap.max_send_wr *= 3;
 ---
 >       ep->rep_attr.cap.max_send_wr *= 8;
 685c685
 <       ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/2 /*  - 1*/;
 ---
 >       ep->rep_cqinit = ep->rep_attr.cap.max

 (I bumped it to 8)

 did make install.
 On reboot I see the errors on NFS READs as opposed to WRITEs
 as seen before, when I try to read a 10G file from the server.

 The client is running: RHEL 5.3 (2.6.18-128.el5PAE) with
 OFED-1.5.1-20100223-0740 bits. The client has a Sun IB
 HCA: SUN0070130001, MT25418, 2.7.0 firmware, hw_rev = a0.
 The server is running Solaris based on snv_128.
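
 The trace below was captured with the kernel's RPC debug flags turned on;
 for anyone reproducing it, toggling them looks roughly like this (stock
 rpcdebug from nfs-utils, no assumed flags beyond "all"):

 rpcdebug -m rpc -s all    # enable all RPC debug flags (covers the xprtrdma messages below)
 # ... reproduce the failing read ...
 rpcdebug -m rpc -c all    # clear the flags again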

 rpcdebug output from the client:

 ==
 RPC:85 call_bind (status 0)
 RPC:85 call_connect xprt ec78d800 is connected
 RPC:85 call_transmit (status 0)
 RPC:85 xprt_prepare_transmit
 RPC:85 xprt_cwnd_limited cong = 0 cwnd = 8192
 RPC:85 rpc_xdr_encode (status 0)
 RPC:85 marshaling UNIX cred eddb4dc0
 RPC:85 using AUTH_UNIX cred eddb4dc0 to wrap rpc data
 RPC:85 xprt_transmit(164)
 RPC:   rpcrdma_inline_pullup: pad 0 destp 0xf1dd1410 len 164 
 hdrlen 164
 RPC:   rpcrdma_register_frmr_external: Using frmr ec7da920 to map 
 4 segments
 RPC:   rpcrdma_create_chunks: write chunk elem 
 16...@0x38536d000:0xa601 (more)
 RPC:   rpcrdma_register_frmr_external: Using frmr ec7da960 to map 
 1 segments
 RPC:   rpcrdma_create_chunks: write chunk elem 
 1...@0x31dd153c:0xaa01 (last)
 RPC:   rpcrdma_marshal_req: write chunk: hdrlen 68 rpclen 164 
 padlen 0 headerp 0xf1dd124c base 0xf1dd136c lkey 0x500
 RPC:85 xmit complete
 RPC:85 sleep_on(queue xprt_pending time 4683109)
 RPC:85 added to queue ec78d994 xprt_pending
 RPC:85 setting alarm for 6 ms
 RPC:   wake_up_next(ec78d944 xprt_resend)
 RPC:   wake_up_next(ec78d8f4 xprt_sending)
 RPC:   rpcrdma_qp_async_error_upcall: QP error 3 on device mlx4_0 
 ep ec78db40
 RPC:85 __rpc_wake_up_task (now 4683110)
 RPC:85 disabling timer
 RPC:85 removed from queue ec78d994 xprt_pending
 RPC:   __rpc_wake_up_task done
 RPC:85 __rpc_execute flags=0x1
 RPC:85 call_status (status -107)
 RPC:85 call_bind (status 0)
 RPC:85 call_connect xprt ec78d800 is not connected
 RPC:85 xprt_connect xprt ec78d800 is not connected
 RPC:85 sleep_on(queue xprt_pending time 4683110)
 RPC:85 added to queue ec78d994 xprt_pending
 RPC:85 setting alarm for 6 ms
 RPC:   rpcrdma_event_process: event rep ec116800 status 5 opcode 
 80 length 2493606
 RPC:   rpcrdma_event_process: recv WC status 5, connection lost
 RPC:   rpcrdma_conn_upcall: disconnected: ec78dbccI4:20049 (ep 
 0xec78db40 event 0xa)
 RPC:   rpcrdma_conn_upcall: disconnected
 rpcrdma: connection to ec78dbccI4:20049 closed (-103)
 RPC:   xprt_rdma_connect_worker: reconnect
 ==

 On the server I see:

 Mar  3 17:45:16 elena-ar hermon: [ID 271130 kern.notice] NOTICE: 
 hermon0: Device Error: CQE remote access error
 Mar  3 17:45:16 elena-ar nfssrv: [ID 819430 kern.notice] NOTICE: NFS: 
 bad sendreply
 Mar  3 17:45:21 elena-ar hermon: [ID 271130 kern.notice] NOTICE: 
 hermon0: Device Error: CQE remote access error
 Mar  3 17:45:21 elena-ar nfssrv: [ID 819430 kern.notice] NOTICE: NFS: 
 bad sendreply

 The remote access error is actually seen on RDMA_WRITE.
 Doing some more debug on the server with DTrace, I see that
 the destination address and length match the write chunk
 element in the Linux debug output above.


  0   9385  rib_write:entry daddr 38536d000, len 4000, 
 hdl a601
  0   9358 rib_init_sendwait:return ff44a715d308
  1   9296   rib_svc_scq_handler:return 1f7
  1   9356  rib_sendwait:return 14
  1   9386 rib_write:return 14

 ^^^ that is RDMA_FAILED in
  1  63295xdrrdma_send_read_data:return 0
  1

Re: [ewg] nfsrdma fails to write big file,

2010-02-22 Thread Tom Tucker
Vu Pham wrote:
 Setup: 
 1. linux nfsrdma client/server with OFED-1.5.1-20100217-0600, ConnectX2
 QDR HCAs fw 2.7.8-6, RHEL 5.2.
 2. Solaris nfsrdma server svn 130, ConnectX QDR HCA.


 Running vdbench on a 10g file or *dd if=/dev/zero of=10g_file bs=1M
 count=1*, the operation fails, the connection gets dropped, and the client
 cannot re-establish the connection to the server.
 After rebooting only the client, I can mount again.

 It happens with both Solaris and Linux nfsrdma servers.

 For the Linux client/server, I run memreg=5 (FRMR); I don't see the problem
 with memreg=6 (global DMA key).

   

Awesome. This is the key I think.

Thanks for the info Vu,
Tom


 On the Solaris server (snv 130), we see a problem decoding a 32K write
 request. The client sends two read chunks (32K and 16-byte); the server fails
 to do the RDMA read on the 16-byte chunk (cqe.status = 10, i.e.
 IB_WC_REM_ACCESS_ERR); therefore, the server terminates the connection. We
 don't see this problem with NFS version 3 on Solaris. The Solaris server runs
 normal memory registration mode.

 On the Linux client, I see cqe.status = 12, i.e. IB_WC_RETRY_EXC_ERR.

 I added these notes in bug #1919 (bugs.openfabrics.org) to track the
 issue.

 thanks,
 -vu
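
 The memreg values above are xprtrdma's registration strategies (5 = FRMR,
 6 = the global-DMA-key mode Vu mentions). How that knob is exposed depends on
 the build; the sketch below assumes the 1.5-era xprtrdma exports it as a
 sunrpc sysctl on debug-enabled kernels, so check the exact name on your own
 system before relying on it:

 sysctl -a 2>/dev/null | grep sunrpc.rdma    # list whatever rdma tunables this build exposes
 sysctl -w sunrpc.rdma_memreg_strategy=6     # assumed name: make new mounts use mode 6 instead of FRMR (5)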



Re: [ewg] MLX4 Strangeness

2010-02-17 Thread Tom Tucker
Hi Tziporet:

Here is a trace with the data for the WR failing with status 12. The vendor
error is 129.

Feb 17 12:27:33 vic10 kernel: rpcrdma_event_process:154 wr_id 
 status 12 opcode 0 vendor_err 129 byte_len 0 qp 
81002a13ec00 ex  src_qp  wc_flags, 0 pkey_index
Feb 17 12:27:33 vic10 kernel: rpcrdma_event_process:154 wr_id 
81002878d800 status 5 opcode 0 vendor_err 244 byte_len 0 qp 
81002a13ec00 ex  src_qp  wc_flags, 0 pkey_index
Feb 17 12:27:33 vic10 kernel: rpcrdma_event_process:167 wr_id 
81002878d800 status 5 opcode 0 vendor_err 244 byte_len 0 qp 
81002a13ec00 ex  src_qp  wc_flags, 0 pkey_index

Any thoughts?
Tom

Tom Tucker wrote:
 Tom Tucker wrote:
 Tziporet Koren wrote:
 On 2/15/2010 10:24 PM, Tom Tucker wrote:
  
 Hello,

 I am seeing some very strange behavior on my MLX4 adapters running 2.7
 firmware and the latest OFED 1.5.1. Two systems are involved and each
 has dual-ported MTHCA DDR and MLX4 adapters.

 The scenario starts with NFSRDMA stress testing between the two 
 systems
 running bonnie++ and iozone concurrently. The test completes and there
 is no issue. Then 6 minutes pass and the server times out the
 connection and shuts down the RC connection to the client.

   From this point on, using the RDMA CM, a new RC QP can be brought up
 and moved to RTS, however, the first RDMA_SEND to the NFS SERVER 
 system
 fails with IB_WC_RETRY_EXC_ERR. I have confirmed:

 - that arp completed successfully and the neighbor entries are
 populated on both the client and server
 - that the QP are in the RTS state on both the client and server
 - that there are RECV WR posted to the RQ on the server and they 
 did not
 error out
 - that no RECV WR completed successfully or in error on the server
 - that there are SEND WR posted to the QP on the client
 - the client side SEND_WR fails with error 12 as mentioned above

 I have also confirmed the following with a different application (i.e.
 rping):

 server# rping -s
 client# rping -c -a 192.168.80.129

 fails with the exact same error, i.e.
 client# rping -c -a 192.168.80.129
 cq completion failed status 12
 wait for RDMA_WRITE_ADV state 10
 client DISCONNECT EVENT...

 However, if I run rping the other way, it works fine, that is,

 client# rping -s
 server# rping -c -a 192.168.80.135

 It runs without error until I stop it.

 Does anyone have any ideas on how I might debug this?



 Tom
 What is the vendor syndrome error when you get a completion with error?

   
 Feb 16 15:08:29 vic10 kernel: rpcrdma: connection to 
 192.168.80.129:20049 closed (-103)
 Feb 16 15:51:27 vic10 kernel: rpcrdma: connection to 
 192.168.80.129:20049 on mlx4_0, memreg 5 slots 32 ird 16
 Feb 16 15:52:01 vic10 kernel: rpcrdma_event_process:160 wr_id 
 81002879a000 status 5 opcode 0 vendor_err 244 byte_len 0 qp 
 81003c9e3200 ex  src_qp  wc_flags, 0 pkey_index
 Feb 16 15:52:06 vic10 kernel: rpcrdma: connection to 
 192.168.80.129:20049 closed (-103)
 Feb 16 15:52:06 vic10 kernel: rpcrdma: connection to 
 192.168.80.129:20049 on mlx4_0, memreg 5 slots 32 ird 16
 Feb 16 15:52:40 vic10 kernel: rpcrdma_event_process:160 wr_id 
 81002879a000 status 5 opcode 0 vendor_err 244 byte_len 0 qp 
 81002f2d8400 ex  src_qp  wc_flags, 0 pkey_index

 Repeat forever

 So the vendor err is 244.


 Please ignore this. This log skips the failing WR (:-\). I need to do 
 another trace.



 Does the issue occur only on the ConnectX cards (mlx4) or also on
 the InfiniHost cards (mthca)?

 Tziporet








Re: [ewg] MLX4 Strangeness

2010-02-16 Thread Tom Tucker
Tziporet Koren wrote:
 On 2/15/2010 10:24 PM, Tom Tucker wrote:
   
 Hello,

 I am seeing some very strange behavior on my MLX4 adapters running 2.7
 firmware and the latest OFED 1.5.1. Two systems are involved and each
 has dual-ported MTHCA DDR and MLX4 adapters.

 The scenario starts with NFSRDMA stress testing between the two systems
 running bonnie++ and iozone concurrently. The test completes and there
 is no issue. Then 6 minutes pass and the server times out the
 connection and shuts down the RC connection to the client.

   From this point on, using the RDMA CM, a new RC QP can be brought up
 and moved to RTS, however, the first RDMA_SEND to the NFS SERVER system
 fails with IB_WC_RETRY_EXC_ERR. I have confirmed:

 - that arp completed successfully and the neighbor entries are
 populated on both the client and server
 - that the QP are in the RTS state on both the client and server
 - that there are RECV WR posted to the RQ on the server and they did not
 error out
 - that no RECV WR completed successfully or in error on the server
 - that there are SEND WR posted to the QP on the client
 - the client side SEND_WR fails with error 12 as mentioned above

 I have also confirmed the following with a different application (i.e.
 rping):

 server# rping -s
 client# rping -c -a 192.168.80.129

 fails with the exact same error, i.e.
 client# rping -c -a 192.168.80.129
 cq completion failed status 12
 wait for RDMA_WRITE_ADV state 10
 client DISCONNECT EVENT...

 However, if I run rping the other way, it works fine, that is,

 client# rping -s
 server# rping -c -a 192.168.80.135

 It runs without error until I stop it.

 Does anyone have any ideas on how I might debug this?



 
 Tom
 What is the vendor syndrome error when you get a completion with error?

   
Hang on... compiling
 Does the issue occur only on the ConnectX cards (mlx4) or also on the
 InfiniHost cards (mthca)?

   

Only the MLX4 cards.

 Tziporet




Re: [ewg] MLX4 Strangeness

2010-02-16 Thread Tom Tucker
Tziporet Koren wrote:
 On 2/15/2010 10:24 PM, Tom Tucker wrote:
   
 Hello,

 I am seeing some very strange behavior on my MLX4 adapters running 2.7
 firmware and the latest OFED 1.5.1. Two systems are involved and each
 has dual-ported MTHCA DDR and MLX4 adapters.

 The scenario starts with NFSRDMA stress testing between the two systems
 running bonnie++ and iozone concurrently. The test completes and there
 is no issue. Then 6 minutes pass and the server times out the
 connection and shuts down the RC connection to the client.

   From this point on, using the RDMA CM, a new RC QP can be brought up
 and moved to RTS, however, the first RDMA_SEND to the NFS SERVER system
 fails with IB_WC_RETRY_EXC_ERR. I have confirmed:

 - that arp completed successfully and the neighbor entries are
 populated on both the client and server
 - that the QP are in the RTS state on both the client and server
 - that there are RECV WR posted to the RQ on the server and they did not
 error out
 - that no RECV WR completed successfully or in error on the server
 - that there are SEND WR posted to the QP on the client
 - the client side SEND_WR fails with error 12 as mentioned above

 I have also confirmed the following with a different application (i.e.
 rping):

 server# rping -s
 client# rping -c -a 192.168.80.129

 fails with the exact same error, i.e.
 client# rping -c -a 192.168.80.129
 cq completion failed status 12
 wait for RDMA_WRITE_ADV state 10
 client DISCONNECT EVENT...

 However, if I run rping the other way, it works fine, that is,

 client# rping -s
 server# rping -c -a 192.168.80.135

 It runs without error until I stop it.

 Does anyone have any ideas on how I might debug this?



 
 Tom
 What is the vendor syndrome error when you get a completion with error?

   
Feb 16 15:08:29 vic10 kernel: rpcrdma: connection to 
192.168.80.129:20049 closed (-103)
Feb 16 15:51:27 vic10 kernel: rpcrdma: connection to 
192.168.80.129:20049 on mlx4_0, memreg 5 slots 32 ird 16
Feb 16 15:52:01 vic10 kernel: rpcrdma_event_process:160 wr_id 
81002879a000 status 5 opcode 0 vendor_err 244 byte_len 0 qp 
81003c9e3200 ex  src_qp  wc_flags, 0 pkey_index
Feb 16 15:52:06 vic10 kernel: rpcrdma: connection to 
192.168.80.129:20049 closed (-103)
Feb 16 15:52:06 vic10 kernel: rpcrdma: connection to 
192.168.80.129:20049 on mlx4_0, memreg 5 slots 32 ird 16
Feb 16 15:52:40 vic10 kernel: rpcrdma_event_process:160 wr_id 
81002879a000 status 5 opcode 0 vendor_err 244 byte_len 0 qp 
81002f2d8400 ex  src_qp  wc_flags, 0 pkey_index

Repeat forever

So the vendor err is 244.

 Does the issue occur only on the ConnectX cards (mlx4) or also on the
 InfiniHost cards (mthca)?

 Tziporet




Re: [ewg] MLX4 Strangeness

2010-02-16 Thread Tom Tucker

More info...

Rebooting the client and trying to reconnect to a server that has not been
rebooted fails in the same way.

It must be an issue with the server. I see no completions on the server 
or any indication that an RDMA_SEND was incoming. Is there some way to 
dump adapter state or otherwise see if there was traffic on the wire?

Tom
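
One low-tech way to answer the wire-traffic question (a general suggestion,
not from the original thread): snapshot the HCA's per-port counters on the
server before and after the failing send. The device and port names below
(mlx4_0, port 1) are assumptions; adjust them to the actual topology.

cat /sys/class/infiniband/mlx4_0/ports/1/counters/port_rcv_packets
cat /sys/class/infiniband/mlx4_0/ports/1/counters/port_xmit_packets
# or, with infiniband-diags installed, dump the full PortCounters set:
perfquery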


Tom Tucker wrote:
 Tom Tucker wrote:
 Tziporet Koren wrote:
 On 2/15/2010 10:24 PM, Tom Tucker wrote:
  
 Hello,

 I am seeing some very strange behavior on my MLX4 adapters running 2.7
 firmware and the latest OFED 1.5.1. Two systems are involved and each
 has dual-ported MTHCA DDR and MLX4 adapters.

 The scenario starts with NFSRDMA stress testing between the two 
 systems
 running bonnie++ and iozone concurrently. The test completes and there
 is no issue. Then 6 minutes pass and the server times out the
 connection and shuts down the RC connection to the client.

   From this point on, using the RDMA CM, a new RC QP can be brought up
 and moved to RTS, however, the first RDMA_SEND to the NFS SERVER 
 system
 fails with IB_WC_RETRY_EXC_ERR. I have confirmed:

 - that arp completed successfully and the neighbor entries are
 populated on both the client and server
 - that the QP are in the RTS state on both the client and server
 - that there are RECV WR posted to the RQ on the server and they 
 did not
 error out
 - that no RECV WR completed successfully or in error on the server
 - that there are SEND WR posted to the QP on the client
 - the client side SEND_WR fails with error 12 as mentioned above

 I have also confirmed the following with a different application (i.e.
 rping):

 server# rping -s
 client# rping -c -a 192.168.80.129

 fails with the exact same error, i.e.
 client# rping -c -a 192.168.80.129
 cq completion failed status 12
 wait for RDMA_WRITE_ADV state 10
 client DISCONNECT EVENT...

 However, if I run rping the other way, it works fine, that is,

 client# rping -s
 server# rping -c -a 192.168.80.135

 It runs without error until I stop it.

 Does anyone have any ideas on how I might debug this?



 Tom
 What is the vendor syndrome error when you get a completion with error?

   
 Feb 16 15:08:29 vic10 kernel: rpcrdma: connection to 
 192.168.80.129:20049 closed (-103)
 Feb 16 15:51:27 vic10 kernel: rpcrdma: connection to 
 192.168.80.129:20049 on mlx4_0, memreg 5 slots 32 ird 16
 Feb 16 15:52:01 vic10 kernel: rpcrdma_event_process:160 wr_id 
 81002879a000 status 5 opcode 0 vendor_err 244 byte_len 0 qp 
 81003c9e3200 ex  src_qp  wc_flags, 0 pkey_index
 Feb 16 15:52:06 vic10 kernel: rpcrdma: connection to 
 192.168.80.129:20049 closed (-103)
 Feb 16 15:52:06 vic10 kernel: rpcrdma: connection to 
 192.168.80.129:20049 on mlx4_0, memreg 5 slots 32 ird 16
 Feb 16 15:52:40 vic10 kernel: rpcrdma_event_process:160 wr_id 
 81002879a000 status 5 opcode 0 vendor_err 244 byte_len 0 qp 
 81002f2d8400 ex  src_qp  wc_flags, 0 pkey_index

 Repeat forever

 So the vendor err is 244.


 Please ignore this. This log skips the failing WR (:-\). I need to do 
 another trace.



 Does the issue occur only on the ConnectX cards (mlx4) or also on
 the InfiniHost cards (mthca)?

 Tziporet








Re: [ewg] [GIT PULL ofed-1.5] nfsrdma fixes

2010-02-09 Thread Tom Tucker
Tziporet Koren wrote:
 On 2/9/2010 11:22 PM, Tom Tucker wrote:
 Hi Vlad:

 I have made updates to the nfsrdma patch files. We put them in Steve's
 tree just for now, until I get my tree all set up. Please pull from
 ssh://sw...@sofa.openfabrics.org/home/swise/scm/ofed_kernel.git ofed_1_5


 Tom
 Please also move the bugzilla bugs you have fixed to the fixed state.


Ok, once I'm sure I fixed them.

 Thanks
 Tziporet



[ewg] RE: [ofa-general] OFED 1.3 Alpha release is available

2007-10-16 Thread Tom Tucker
On Tue, 2007-10-16 at 17:46 -0700, Scott Weitzenkamp (sweitzen) wrote:
  3. IPoIB
 o Stateless offloads
 o NAPI is enabled by default
 
 How does one measure these changes using tools like netperf or iperf?
 Do I need a specific HCA type?
 
  4. SDP - these are not yet in the alpha release
 o Keep-alive
 o Asynch IO
 o Send Zero Copy
 
 If it didn't make it into alpha, perhaps it should not go into 1.3, so
 we can hold the release date better?
 
 Whatever happened to NFS RDMA?

The SVC transport switch and SVC-UDP/TCP/RDMA transport drivers are
targeted for 2.6.25. To track this activity, see
[EMAIL PROTECTED] 
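
On Scott's measurement question above: the IPoIB stateless offloads and the
NAPI change mostly show up as throughput, latency, and CPU-utilization
differences over the IPoIB interface, so a netperf run along these lines is
one hedged way to see them (the address is a placeholder; how much the
offloads help depends on what the HCA driver implements):

netperf -H 192.168.80.2 -t TCP_STREAM -l 30 -c -C   # bulk throughput with local/remote CPU utilization
netperf -H 192.168.80.2 -t TCP_RR -l 30 -c -C       # request/response latency, where NAPI behaviour shows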

 
 Scott 
