Re: SRP initiator and iSER initiator performance

2010-03-03 Thread Bart Van Assche
On Wed, Mar 3, 2010 at 9:23 PM, Vladislav Bolkhovitin  wrote:
> Bart Van Assche, on 03/01/2010 11:38 PM wrote:
>>
>> On Mon, Mar 1, 2010 at 9:12 PM, Vladislav Bolkhovitin wrote:
>>
>>    [ ... ]
>>    It's good if my impression was wrong. But you've got suspiciously
>>    low IOPS numbers; on your hardware you should get much more. It seems
>>    you hit a bottleneck on the initiator somewhere above the driver
>>    level (fio? sg engine? IRQ or context switch counts?), so your
>>    results might not really be related to the topic. Oprofile and
>>    lockstat output could shed more light on this.
>>
>>
>> The number of IOPS I obtained is really high considering that I used the
>> sg I/O engine. This means that no buffering has been used and none of the
>> I/O requests were combined into larger requests. I chose the sg I/O engine
>> on purpose in order to bypass the block layer. I was not interested in
>> record IOPS numbers but in a test where most of the time is spent in the SRP
>> / iSER initiator instead of the block layer.
>
> 116K IOPS isn't high, it's pretty low for QDR IB. [ ... ]

It looks like it's time for you to familiarize yourself with the
difference between CPU-bound, I/O-bound and memory-bound. It is
essential to understand this terminology before commenting on
performance tests.

Bart.


Re: IPoIB issues

2010-03-03 Thread Josh England
I've applied the patch and initial testing has not produced any
transmit timeout errors.  I'll be doing some heavier testing in the
next couple days, but it looks good so far.  Thanks for the quick
turn-around!

-JE

On Wed, Mar 3, 2010 at 4:29 AM, Eli Cohen  wrote:
> I just posted a patch which might fix your problem. Please try it and
> let us know if it fixed anything.
>
> On Tue, Mar 02, 2010 at 01:54:09PM -0800, Josh England wrote:
>> Hello,
>>
>> I've been running into several issues using IPoIB.  The 2 primary uses
>> are for read-only NFS to the clients (over TCP) and access to an
>> ethernet-connected parallel filesystem (Panasas) through router nodes
>> passing IPoIB<-->10GbE.
>>
>> All nodes are running CentOS 5.3 and OFED 1.4.2, although I have played
>> with OFED 1.5 and seen similar results.  Client nodes mount their NFS root
>> from boot servers via IPoIB with a ratio of 80:1.  The boot servers are the
>> ones that seem to have issues.  The fabric itself consists of ~1000 nodes
>> interconnected such that there is 2:1 oversubscription within any single rack,
>> and 20:1 oversubscription between racks (through the core switch).  I
>> don't know how much the oversubscription comes into play here as I can
>> reproduce the error within a single rack.
>>
>> In datagram mode, I see errors on the boot servers of the form.
>>
>> ib0: post_send failed
>> ib0: post_send failed
>> ib0: post_send failed
>>
>>
>> When using connected mode, I hit a different error:
>>
>> NETDEV WATCHDOG: ib0: transmit timed out
>> ib0: transmit timeout: latency 1999 msecs
>> ib0: queue stopped 1, tx_head 2154042680, tx_tail 2154039464
>> NETDEV WATCHDOG: ib0: transmit timed out
>> ib0: transmit timeout: latency 2999 msecs
>> ib0: queue stopped 1, tx_head 2154042680, tx_tail 2154039464
>> ...
>> ...
>> NETDEV WATCHDOG: ib0: transmit timed out
>> ib0: transmit timeout: latency 61824999 msecs
>> ib0: queue stopped 1, tx_head 2154042680, tx_tail 2154039464
>>
>>
>> The errors seem to hit only after NFS comes into play.  Once it
>> starts, the NETDEV WATCHDOG messages continue until I run
>> 'ifconfig ib0 down up'.  I've tried tuning send_queue_size and
>> recv_queue_size on both sides, the txqueuelen of the ib0 interface, the
>> NFS rsize/wsize.  None of it seems to help greatly.  Does anyone have
>> any ideas about what I can do to try to fix these problems?
>>
>> -JE


[PATCH] Dimension port order file support

2010-03-03 Thread Dale Purdy


Provide a means to specify, on a per-switch basis, the mapping (order)
between switch ports and dimensions for Dimension Order Routing.  This
allows the DOR routing engine to be used when the cabling is not
properly aligned for DOR, either initially or for an upgrade.

Signed-off-by: Dale Purdy 
---
 opensm/include/opensm/osm_subnet.h |1 +
 opensm/include/opensm/osm_switch.h |   30 +
 opensm/man/opensm.8.in |   31 +++--
 opensm/opensm/main.c   |   13 -
 opensm/opensm/osm_subnet.c |7 ++
 opensm/opensm/osm_switch.c |2 +-
 opensm/opensm/osm_ucast_mgr.c  |  120 
 7 files changed, 195 insertions(+), 9 deletions(-)

diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h
index 3970e98..e4e298e 100644
--- a/opensm/include/opensm/osm_subnet.h
+++ b/opensm/include/opensm/osm_subnet.h
@@ -186,6 +186,7 @@ typedef struct osm_subn_opt {
uint16_t console_port;
char *port_prof_ignore_file;
char *hop_weights_file;
+   char *dimn_ports_file;
boolean_t port_profile_switch_nodes;
boolean_t sweep_on_trap;
char *routing_engine_names;
diff --git a/opensm/include/opensm/osm_switch.h b/opensm/include/opensm/osm_switch.h
index cb6e5ac..1c6807e 100644
--- a/opensm/include/opensm/osm_switch.h
+++ b/opensm/include/opensm/osm_switch.h
@@ -100,6 +100,7 @@ typedef struct osm_switch {
uint16_t num_hops;
uint8_t **hops;
osm_port_profile_t *p_prof;
+   uint8_t *dimn_ports;
uint8_t *lft;
uint8_t *new_lft;
uint16_t lft_size;
@@ -871,6 +872,35 @@ static inline uint8_t osm_switch_get_mft_max_position(IN osm_switch_t * p_sw)
 * RETURN VALUE
 */

+/f* OpenSM: Switch/osm_switch_get_dimn_port
+* NAME
+*  osm_switch_get_dimn_port
+*
+* DESCRIPTION
+*   Get the routing ordered port
+*
+* SYNOPSIS
+*/
+static inline uint8_t osm_switch_get_dimn_port(IN const osm_switch_t * p_sw,
+  IN uint8_t port_num)
+{
+   CL_ASSERT(p_sw);
+   if (p_sw->dimn_ports == NULL)
+   return port_num;
+   return p_sw->dimn_ports[port_num];
+}
+/*
+* PARAMETERS
+*  p_sw
+*  [in] Pointer to the switch object.
+*
+*  port_num
+*  [in] Port number in the switch
+*
+* RETURN VALUES
+*  Returns the port number ordered for routing purposes.
+*/
+
 /f* OpenSM: Switch/osm_switch_recommend_path
 * NAME
 *  osm_switch_recommend_path
diff --git a/opensm/man/opensm.8.in b/opensm/man/opensm.8.in
index 7aca8f9..9053611 100644
--- a/opensm/man/opensm.8.in
+++ b/opensm/man/opensm.8.in
@@ -37,6 +37,7 @@ opensm \- InfiniBand subnet manager and administration (SM/SA)
 [\-console-port ]
 [\-i(gnore-guids) ]
 [\-w | \-\-hop_weights_file ]
+[\-O | \-\-dimn_ports_file ]
 [\-f  | \-\-log_file  ]
 [\-L | \-\-log_limit ] [\-e(rase_log_file)]
 [\-P(config)  ]
@@ -273,6 +274,16 @@ factor of 1.  Lines starting with # are comments.  Weights affect only the
 output route from the port, so many useful configurations will require weights
 to be specified in pairs.
 .TP
+\fB\-O\fR, \fB\-\-dimn_ports_file\fR 
+This option provides a mapping between hypercube dimensions and ports
+on a per switch basis for the DOR routing engine.  The file consists
+of lines containing a switch node GUID (specified as a 64 bit hex
+number, with leading 0x) followed by a list of non-zero port numbers,
+separated by spaces, one switch per line.  The order for the port
+numbers is in one to one correspondence to the dimensions.  Ports not
+listed on a line are assigned to the remaining dimensions, in port
+order.  Anything after a # is a comment.
+.TP
 \fB\-x\fR, \fB\-\-honor_guid2lid\fR
 This option forces OpenSM to honor the guid2lid file,
 when it comes out of Standby state, if such file exists
@@ -969,17 +980,20 @@ algorithm and so uses shortest paths.  Instead of spreading traffic
 out across different paths with the same shortest distance, it chooses
 among the available shortest paths based on an ordering of dimensions.
 Each port must be consistently cabled to represent a hypercube
-dimension or a mesh dimension.  Paths are grown from a destination
-back to a source using the lowest dimension (port) of available paths
-at each step.  This provides the ordering necessary to avoid deadlock.
+dimension or a mesh dimension.  Alternatively, the -O option can be
+used to assign a custom mapping between the ports on a given switch,
+and the associated dimension.  Paths are grown from a destination back
+to a source using the lowest dimension (port) of available paths at
+each step.  This provides the ordering necessary to avoid deadlock.
 When there are multiple links between any two switches, they still
 represent only one dimension and traffic is balanced across them
 unless port equalization is turned off.  In the case of hypercubes,
 the same port must be used

Re: [ewg] nfsrdma fails to write big file,

2010-03-03 Thread Mahesh Siddheshwar

Hi Tom, Vu,

Tom Tucker wrote:

Roland Dreier wrote:
 > +   /*
 > +    * Add room for frmr register and invalidate WRs
 > +    * Requests sometimes have two chunks, each chunk
 > +    * requires to have different frmr. The safest
 > +    * WRs required are max_send_wr * 6; however, we
 > +    * get send completions and poll fast enough, it
 > +    * is pretty safe to have max_send_wr * 4.
 > +    */

 > +   ep->rep_attr.cap.max_send_wr *= 4;

Seems like a bad design if there is a possibility of work queue
overflow; if you're counting on events occurring in a particular order
or completions being handled "fast enough", then your design is going to
fail in some high load situations, which I don't think you want.   


Vu,

Would you please try the following:

- Set the multiplier to 5

While trying to test this between a Linux client and Solaris server,
I made the following changes in:
/usr/src/ofa_kernel-1.5.1/net/sunrpc/xprtrdma/verbs.c

diff verbs.c.org verbs.c
653c653
<   ep->rep_attr.cap.max_send_wr *= 3;
---
>   ep->rep_attr.cap.max_send_wr *= 8;
685c685
<   ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/2 /*  - 1*/;
---
>   ep->rep_cqinit = ep->rep_attr.cap.max

(I bumped it to 8)

did make install. 


On reboot I see the errors on NFS READs as opposed to WRITEs
as seen before, when I try to read a 10G file from the server.

The client is running: RHEL 5.3 (2.6.18-128.el5PAE) with
OFED-1.5.1-20100223-0740 bits. The client has a Sun IB
HCA: SUN0070130001, MT25418, 2.7.0 firmware, hw_rev = a0.
The server is running Solaris based on snv_128.

rpcdebug output from the client:

==
RPC:85 call_bind (status 0)
RPC:85 call_connect xprt ec78d800 is connected
RPC:85 call_transmit (status 0)
RPC:85 xprt_prepare_transmit
RPC:85 xprt_cwnd_limited cong = 0 cwnd = 8192
RPC:85 rpc_xdr_encode (status 0)
RPC:85 marshaling UNIX cred eddb4dc0
RPC:85 using AUTH_UNIX cred eddb4dc0 to wrap rpc data
RPC:85 xprt_transmit(164)
RPC:   rpcrdma_inline_pullup: pad 0 destp 0xf1dd1410 len 164 hdrlen 164
RPC:   rpcrdma_register_frmr_external: Using frmr ec7da920 to map 4 
segments
RPC:   rpcrdma_create_chunks: write chunk elem 
16...@0x38536d000:0xa601 (more)
RPC:   rpcrdma_register_frmr_external: Using frmr ec7da960 to map 1 
segments
RPC:   rpcrdma_create_chunks: write chunk elem 1...@0x31dd153c:0xaa01 
(last)
RPC:   rpcrdma_marshal_req: write chunk: hdrlen 68 rpclen 164 padlen 
0 headerp 0xf1dd124c base 0xf1dd136c lkey 0x500

RPC:85 xmit complete
RPC:85 sleep_on(queue "xprt_pending" time 4683109)
RPC:85 added to queue ec78d994 "xprt_pending"
RPC:85 setting alarm for 6 ms
RPC:   wake_up_next(ec78d944 "xprt_resend")
RPC:   wake_up_next(ec78d8f4 "xprt_sending")
RPC:   rpcrdma_qp_async_error_upcall: QP error 3 on device mlx4_0 ep 
ec78db40

RPC:85 __rpc_wake_up_task (now 4683110)
RPC:85 disabling timer
RPC:85 removed from queue ec78d994 "xprt_pending"
RPC:   __rpc_wake_up_task done
RPC:85 __rpc_execute flags=0x1
RPC:85 call_status (status -107)
RPC:85 call_bind (status 0)
RPC:85 call_connect xprt ec78d800 is not connected
RPC:85 xprt_connect xprt ec78d800 is not connected
RPC:85 sleep_on(queue "xprt_pending" time 4683110)
RPC:85 added to queue ec78d994 "xprt_pending"
RPC:85 setting alarm for 6 ms
RPC:   rpcrdma_event_process: event rep ec116800 status 5 opcode 80 
length 2493606

RPC:   rpcrdma_event_process: recv WC status 5, connection lost
RPC:   rpcrdma_conn_upcall: disconnected: ec78dbccI4:20049 (ep 
0xec78db40 event 0xa)

RPC:   rpcrdma_conn_upcall: disconnected
rpcrdma: connection to ec78dbccI4:20049 closed (-103)
RPC:   xprt_rdma_connect_worker: reconnect
==

On the server I see:

Mar  3 17:45:16 elena-ar hermon: [ID 271130 kern.notice] NOTICE: 
hermon0: Device Error: CQE remote access error
Mar  3 17:45:16 elena-ar nfssrv: [ID 819430 kern.notice] NOTICE: NFS: 
bad sendreply
Mar  3 17:45:21 elena-ar hermon: [ID 271130 kern.notice] NOTICE: 
hermon0: Device Error: CQE remote access error
Mar  3 17:45:21 elena-ar nfssrv: [ID 819430 kern.notice] NOTICE: NFS: 
bad sendreply


The remote access error is actually seen on RDMA_WRITE.
Doing some more debug on the server with DTrace, I see that
the destination address and length matches the write chunk
element in the Linux debug output above.


 0   9385  rib_write:entry daddr 38536d000, len 4000, 
hdl a601

 0   9358 rib_init_sendwait:return ff44a715d308
 1   9296   rib_svc_scq_handler:return 1f7
 1   9356  rib_sendwait:return 14
 1   9386 rib_write:return 14

^^^ that is RDMA_FAILED in 


 1  63295xdrrdma_send_read_data:return 0
 1   5969  xdr_READ3res:return
 1   5969  xdr_READ3res:return 0

Is 

Re: Bug in OFED1.5 ib_ipoib: IPv6 doesn't interoperate between RHEL4 and other Linux distros

2010-03-03 Thread David J. Wilder
Mike-

A number of fixes related to IPv6 address resolution by rdma_cm went
into OFED 1.5.1 that may be related to this.  You may want to test 1.5.1
and see if it resolves your issue.  As I recall, link-local addresses are
treated differently than assigned IPv6 addresses, so you might want to try
using an assigned address.

Dave..

On Wed, 2010-03-03 at 11:00 -0600, Mike Heinz wrote:
> One of my testers reported this; has anyone else seen it? Given two RHEL4
> hosts, IPV6 works correctly (according to the testers), but IPV6 does not
> work between hosts running RHEL4u8 and hosts running SLES10 or RHEL5.
> 
> Given an RHEL5 host:
> 
> [r...@homer ~]# ifconfig ib0
> ib0   Link encap:InfiniBand  HWaddr 
> 80:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00  
>   inet addr:172.21.33.208  Bcast:172.21.33.255  Mask:255.255.255.0
>   inet6 addr: fe80::206:6a00:a000:707f/64 Scope:Link
>   UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
>   RX packets:17 errors:0 dropped:0 overruns:0 frame:0
>   TX packets:9 errors:0 dropped:24 overruns:0 carrier:0
>   collisions:0 txqueuelen:256 
> 
> And an RHEL4 host:
> 
> [r...@apu ~]# ifconfig ib0
> ib0   Link encap:UNSPEC  HWaddr 
> 80-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00  
>   inet addr:172.21.33.210  Bcast:172.21.33.255  Mask:255.255.255.0
>   inet6 addr: fe80::206:6a00:a000:6ca8/64 Scope:Link
>   UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
>   RX packets:15 errors:0 dropped:0 overruns:0 frame:0
>   TX packets:26 errors:0 dropped:0 overruns:0 carrier:0
>   collisions:0 txqueuelen:256 
>   RX bytes:1092 (1.0 KiB)  TX bytes:2176 (2.1 KiB)
> 
> pinging over ipv4 works:
> 
> [r...@homer ~]# ping 172.21.33.210
> PING 172.21.33.210 (172.21.33.210) 56(84) bytes of data.
> 64 bytes from 172.21.33.210: icmp_seq=1 ttl=64 time=0.064 ms
> 64 bytes from 172.21.33.210: icmp_seq=2 ttl=64 time=0.035 ms
> 64 bytes from 172.21.33.210: icmp_seq=3 ttl=64 time=0.024 ms
> 64 bytes from 172.21.33.210: icmp_seq=4 ttl=64 time=0.026 ms
> 
> [r...@apu ~]# ping 172.21.33.208
> PING 172.21.33.208 (172.21.33.208) 56(84) bytes of data.
> 64 bytes from 172.21.33.208: icmp_seq=0 ttl=64 time=0.053 ms
> 64 bytes from 172.21.33.208: icmp_seq=1 ttl=64 time=0.026 ms
> 64 bytes from 172.21.33.208: icmp_seq=2 ttl=64 time=0.026 ms
> 64 bytes from 172.21.33.208: icmp_seq=3 ttl=64 time=0.027 ms
> 64 bytes from 172.21.33.208: icmp_seq=4 ttl=64 time=0.025 ms
> 
> However, pinging over ipv6 fails:
> 
> [r...@homer ~]# ping6 -I ib0 fe80::206:6a00:a000:6ca8 PING 
> fe80::206:6a00:a000:6ca8(fe80::206:6a00:a000:6ca8) from 
> fe80::206:6a00:a000:707f ib0: 56 data bytes From fe80::206:6a00:a000:707f 
> icmp_seq=1 Destination unreachable: Address unreachable From 
> fe80::206:6a00:a000:707f icmp_seq=2 Destination unreachable: Address 
> unreachable From fe80::206:6a00:a000:707f icmp_seq=3 Destination unreachable: 
> Address unreachable
> 
> But pinging over ipv6 works between rhel5 boxes:
> 
> [r...@homer ~]# ping6 -I ib0 fe80::206:6a00:a000:7d5e PING 
> fe80::206:6a00:a000:7d5e(fe80::206:6a00:a000:7d5e) from 
> fe80::206:6a00:a000:707f ib0: 56 data bytes
> 64 bytes from fe80::206:6a00:a000:7d5e: icmp_seq=0 ttl=64 time=1.72 ms
> 64 bytes from fe80::206:6a00:a000:7d5e: icmp_seq=1 ttl=64 time=0.063 ms
> 64 bytes from fe80::206:6a00:a000:7d5e: icmp_seq=2 ttl=64 time=0.033 ms
> 64 bytes from fe80::206:6a00:a000:7d5e: icmp_seq=3 ttl=64 time=0.044 ms
> 
> Similarly, the RHEL5 host can ping IPV6 to a SLES10 host:
> 
> [r...@homer ~]# ping6 -I ib0 fe80::206:6a00:a000:6cc1 PING 
> fe80::206:6a00:a000:6cc1(fe80::206:6a00:a000:6cc1) from 
> fe80::206:6a00:a000:707f ib0: 56 data bytes
> 64 bytes from fe80::206:6a00:a000:6cc1: icmp_seq=0 ttl=64 time=1.91 ms
> 64 bytes from fe80::206:6a00:a000:6cc1: icmp_seq=1 ttl=64 time=0.048 ms
> 
> Any ideas? Has anyone seen this? 
> 
> Interestingly, the SM shows that the ping packet is leaving the RHEL4 
> machine, but it appears to have a garbled address in it:
> 
> Wed Mar  3 11:58:07 2010: fm0_sm(13922): ERROR[sareader]: SA: sa_PathRecord: 
> gidprefix in Dest Gid 0x0006:6a00a0006ca8 of PATH request 
> from Lid 0x9 does not match SM's(0xfe80)
> 
> What makes this interesting is that the actual GID of the RHEL4 box is 
> fe80:00066a00a0006ca8 - so it looks like the GID in the ping 
> packet is shifted 16 bits.
> 



Re: SRP initiator and iSER initiator performance

2010-03-03 Thread Vladislav Bolkhovitin

Bart Van Assche, on 03/01/2010 11:38 PM wrote:
On Mon, Mar 1, 2010 at 9:12 PM, Vladislav Bolkhovitin wrote:


[ ... ]
It's good if my impression was wrong. But you've got suspiciously
low IOPS numbers; on your hardware you should get much more. It seems
you hit a bottleneck on the initiator somewhere above the driver
level (fio? sg engine? IRQ or context switch counts?), so your
results might not really be related to the topic. Oprofile and
lockstat output could shed more light on this.


The number of IOPS I obtained is really high considering that I used the 
sg I/O engine. This means that no buffering has been used and none of 
the I/O requests were combined into larger requests. I chose the sg I/O 
engine on purpose in order to bypass the block layer. I was not 
interested in record IOPS numbers but in a test where most of the time 
is spent in the SRP / iSER initiator instead of the block layer.


116K IOPS isn't high, it's pretty low for QDR IB. Even 4Gbps FC can
outperform it. Remember, Microsoft has managed to get 1 million IOPS
from 10GbE, but your card should be much faster. This is why I have
strong suspicions that the test is incorrect.


Let's estimate how much your IB card can achieve. It has 1us latency on
1-byte packets, so it can perform at least 1 million ops/sec. This is a
conservative estimate, because (1) if the card has a multi-core setup,
this number can be several times bigger, and (2) it includes data
transfers. On the other hand, your card can read data at 2.9GB/s. If we
assume that transferring a 512B packet has 100% overhead (also a
conservative assumption, because I can't believe that such a low-latency
HPC interconnect has such a huge data transfer overhead), that gives
2.9GB/s / (512B * 2) = ~2.9 million IOPS. So your IB hardware should be
capable of at least 1 million I/O transfers per second, which is about
10 times more than you measured.


So, you definitely need to find the bottleneck. I would start by
checking:


1. fio itself may not be implemented efficiently enough. This can be
checked using the null ioengine (see the example command after this list).


2. You may have only one outstanding command at a time (queue depth 1).
You can check this during the test either using iostat on the initiator,
or (better) on the SCST target in the /proc/scsi_tgt/sessions and
/proc/scsi_tgt/sgv files.


3. The sg engine may be used by fio in indirect mode, i.e. it transfers
data between user and kernel space using a data copy. This can be checked
by looking at fio's sources or using oprofile.
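
For checks 1 and 2, something along these lines may help (the option names
are from fio's documentation; the exact values are only placeholders).
Running with the null ioengine shows the IOPS ceiling of fio itself, with
no driver or hardware in the path:

fio --name=nulltest --ioengine=null --rw=randread --bs=512 --iodepth=1 \
    --size=1G --runtime=30 --time_based

and 'iostat -x 1' on the initiator during the real run shows the effective
queue depth in the avgqu-sz column.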


Vlad


Re: SRP initiator and iSER initiator performance

2010-03-03 Thread Vladislav Bolkhovitin

Bart Van Assche, on 03/02/2010 09:59 AM wrote:

On Mon, Mar 1, 2010 at 9:12 PM, Vladislav Bolkhovitin  wrote:

[ ... ]
It's good if my impression was wrong. But you've got suspiciously low IOPS
numbers; on your hardware you should get much more. It seems you hit a
bottleneck on the initiator somewhere above the driver level (fio? sg
engine? IRQ or context switch counts?), so your results might not really be
related to the topic. Oprofile and lockstat output could shed more light on
this.


You didn't understand the purpose of the test. My goal was not to
achieve record IOPS numbers but to stress the SRP and iSER initiators
as much as possible. I chose the sg I/O engine in order to bypass the
block layer.


No, Bart, I understood your purpose very well. I'll illustrate my point
with an example. Suppose we want to compare a Ferrari and a Toyota
Corolla. The only track we can use has a 60 km/h speed limit, and we
drive strictly within that limit. Would we get a correct comparison of
the cars' capabilities, or would we only be comparing their speedometers'
errors? If the Toyota's speedometer lets it stay closer to the limit, the
Toyota can beat the Ferrari. But would it also win with a 180 km/h limit?
Or with no speed limit at all?


The same applies to our topic. We can consider your experiment correct
only if the bottleneck is the driver or hardware, which isn't likely.


Vlad


[PATCH] RDMA/nes: clear stall bit before destroying nic qp

2010-03-03 Thread Chien Tung

Clear the stall bit to drop any incoming packets while destroying
nic qp.  This will prevent a chip resource leak.

Signed-off-by: Chien Tung 
---
 drivers/infiniband/hw/nes/nes_hw.c |8 
 drivers/infiniband/hw/nes/nes_hw.h |1 +
 2 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/nes/nes_hw.c b/drivers/infiniband/hw/nes/nes_hw.c
index ce7f538..9250755 100644
--- a/drivers/infiniband/hw/nes/nes_hw.c
+++ b/drivers/infiniband/hw/nes/nes_hw.c
@@ -1899,9 +1899,14 @@ void nes_destroy_nic_qp(struct nes_vnic *nesvnic)
u16  wqe_fragment_index;
u64 wqe_frag;
u32 cqp_head;
+   u32 wqm_cfg0;
unsigned long flags;
int ret;
 
+   /* clear wqe stall before destroying NIC QP */
+   wqm_cfg0 = nes_read_indexed(nesdev, NES_IDX_WQM_CONFIG0);
+   nes_write_indexed(nesdev, NES_IDX_WQM_CONFIG0, wqm_cfg0 & 0x7FFF);
+
/* Free remaining NIC receive buffers */
while (nesvnic->nic.rq_head != nesvnic->nic.rq_tail) {
nic_rqe   = &nesvnic->nic.rq_vbase[nesvnic->nic.rq_tail];
@@ -2020,6 +2025,9 @@ void nes_destroy_nic_qp(struct nes_vnic *nesvnic)
 
pci_free_consistent(nesdev->pcidev, nesvnic->nic_mem_size, 
nesvnic->nic_vbase,
nesvnic->nic_pbase);
+
+   /* restore old wqm_cfg0 value */
+   nes_write_indexed(nesdev, NES_IDX_WQM_CONFIG0, wqm_cfg0);
 }
 
 /**
diff --git a/drivers/infiniband/hw/nes/nes_hw.h b/drivers/infiniband/hw/nes/nes_hw.h
index 9b1e7f8..bbbfe9f 100644
--- a/drivers/infiniband/hw/nes/nes_hw.h
+++ b/drivers/infiniband/hw/nes/nes_hw.h
@@ -160,6 +160,7 @@ enum indexed_regs {
NES_IDX_ENDNODE0_NSTAT_TX_OCTETS_HI = 0x7004,
NES_IDX_ENDNODE0_NSTAT_TX_FRAMES_LO = 0x7008,
NES_IDX_ENDNODE0_NSTAT_TX_FRAMES_HI = 0x700c,
+   NES_IDX_WQM_CONFIG0 = 0x5000,
NES_IDX_WQM_CONFIG1 = 0x5004,
NES_IDX_CM_CONFIG = 0x5100,
NES_IDX_NIC_LOGPORT_TO_PHYPORT = 0x6000,
-- 
1.6.4.2



[PATCH] RDMA/nes: fix CX4 link problem in back-to-back configuration

2010-03-03 Thread Chien Tung

commit 09124e1913cf2140941f60ab4fdf8576e1e8fd8d took out
too much code and broke CX4 link detection in back-to-back
configuration.  This will put back the code that does the link
check.

Signed-off-by: Chien Tung 
---
 drivers/infiniband/hw/nes/nes_nic.c |   30 --
 1 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/nes/nes_nic.c b/drivers/infiniband/hw/nes/nes_nic.c
index 7dd6ce6..a60efee 100644
--- a/drivers/infiniband/hw/nes/nes_nic.c
+++ b/drivers/infiniband/hw/nes/nes_nic.c
@@ -1582,7 +1582,6 @@ struct net_device *nes_netdev_init(struct nes_device *nesdev,
struct nes_vnic *nesvnic;
struct net_device *netdev;
struct nic_qp_map *curr_qp_map;
-   u32 u32temp;
u8 phy_type = nesdev->nesadapter->phy_type[nesdev->mac_index];
 
netdev = alloc_etherdev(sizeof(struct nes_vnic));
@@ -1694,6 +1693,10 @@ struct net_device *nes_netdev_init(struct nes_device *nesdev,
 ((phy_type == NES_PHY_TYPE_PUMA_1G) &&
  (((PCI_FUNC(nesdev->pcidev->devfn) == 1) && (nesdev->mac_index == 
2)) ||
   ((PCI_FUNC(nesdev->pcidev->devfn) == 2) && (nesdev->mac_index == 
1)) {
+   u32 u32temp;
+   u32 link_mask;
+   u32 link_val;
+
u32temp = nes_read_indexed(nesdev, 
NES_IDX_PHY_PCS_CONTROL_STATUS0 +
(0x200 * (nesdev->mac_index & 1)));
if (phy_type != NES_PHY_TYPE_PUMA_1G) {
@@ -1702,13 +1705,36 @@ struct net_device *nes_netdev_init(struct nes_device *nesdev,
(0x200 * (nesdev->mac_index & 1)), u32temp);
}
 
+   /* Check and set linkup here.  This is for back to back */
+   /* configuration where second port won't get link interrupt */
+   switch (phy_type) {
+   case NES_PHY_TYPE_PUMA_1G:
+   if (nesdev->mac_index < 2) {
+   link_mask = 0x0101;
+   link_val = 0x0101;
+   } else {
+   link_mask = 0x0202;
+   link_val = 0x0202;
+   }
+   break;
+   default:
+   link_mask = 0x0f1f;
+   link_val = 0x0f0f;
+   break;
+   }
+
+   u32temp = nes_read_indexed(nesdev,
+  NES_IDX_PHY_PCS_CONTROL_STATUS0 +
+  (0x200 * (nesdev->mac_index & 1)));
+   if ((u32temp & link_mask) == link_val)
+   nesvnic->linkup = 1;
+
/* clear the MAC interrupt status, assumes direct logical to 
physical mapping */
u32temp = nes_read_indexed(nesdev, NES_IDX_MAC_INT_STATUS + 
(0x200 * nesdev->mac_index));
nes_debug(NES_DBG_INIT, "Phy interrupt status = 0x%X.\n", 
u32temp);
nes_write_indexed(nesdev, NES_IDX_MAC_INT_STATUS + (0x200 * 
nesdev->mac_index), u32temp);
 
nes_init_phy(nesdev);
-
}
 
return netdev;
-- 
1.6.4.2



Bug in OFED1.5 ib_ipoib: IPv6 doesn't interoperate between RHEL4 and other Linux distros

2010-03-03 Thread Mike Heinz
One of my testers reported this; has anyone else seen it? Given two RHEL4
hosts, IPV6 works correctly (according to the testers), but IPV6 does not work
between hosts running RHEL4u8 and hosts running SLES10 or RHEL5.

Given an RHEL5 host:

[r...@homer ~]# ifconfig ib0
ib0   Link encap:InfiniBand  HWaddr 
80:00:04:04:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00  
  inet addr:172.21.33.208  Bcast:172.21.33.255  Mask:255.255.255.0
  inet6 addr: fe80::206:6a00:a000:707f/64 Scope:Link
  UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
  RX packets:17 errors:0 dropped:0 overruns:0 frame:0
  TX packets:9 errors:0 dropped:24 overruns:0 carrier:0
  collisions:0 txqueuelen:256 

And an RHEL4 host:

[r...@apu ~]# ifconfig ib0
ib0   Link encap:UNSPEC  HWaddr 
80-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00  
  inet addr:172.21.33.210  Bcast:172.21.33.255  Mask:255.255.255.0
  inet6 addr: fe80::206:6a00:a000:6ca8/64 Scope:Link
  UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
  RX packets:15 errors:0 dropped:0 overruns:0 frame:0
  TX packets:26 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:256 
  RX bytes:1092 (1.0 KiB)  TX bytes:2176 (2.1 KiB)

pinging over ipv4 works:

[r...@homer ~]# ping 172.21.33.210
PING 172.21.33.210 (172.21.33.210) 56(84) bytes of data.
64 bytes from 172.21.33.210: icmp_seq=1 ttl=64 time=0.064 ms
64 bytes from 172.21.33.210: icmp_seq=2 ttl=64 time=0.035 ms
64 bytes from 172.21.33.210: icmp_seq=3 ttl=64 time=0.024 ms
64 bytes from 172.21.33.210: icmp_seq=4 ttl=64 time=0.026 ms

[r...@apu ~]# ping 172.21.33.208
PING 172.21.33.208 (172.21.33.208) 56(84) bytes of data.
64 bytes from 172.21.33.208: icmp_seq=0 ttl=64 time=0.053 ms
64 bytes from 172.21.33.208: icmp_seq=1 ttl=64 time=0.026 ms
64 bytes from 172.21.33.208: icmp_seq=2 ttl=64 time=0.026 ms
64 bytes from 172.21.33.208: icmp_seq=3 ttl=64 time=0.027 ms
64 bytes from 172.21.33.208: icmp_seq=4 ttl=64 time=0.025 ms

However, pinging over ipv6 fails:

[r...@homer ~]# ping6 -I ib0 fe80::206:6a00:a000:6ca8 PING 
fe80::206:6a00:a000:6ca8(fe80::206:6a00:a000:6ca8) from 
fe80::206:6a00:a000:707f ib0: 56 data bytes From fe80::206:6a00:a000:707f 
icmp_seq=1 Destination unreachable: Address unreachable From 
fe80::206:6a00:a000:707f icmp_seq=2 Destination unreachable: Address 
unreachable From fe80::206:6a00:a000:707f icmp_seq=3 Destination unreachable: 
Address unreachable

But pinging over ipv6 works between rhel5 boxes:

[r...@homer ~]# ping6 -I ib0 fe80::206:6a00:a000:7d5e PING 
fe80::206:6a00:a000:7d5e(fe80::206:6a00:a000:7d5e) from 
fe80::206:6a00:a000:707f ib0: 56 data bytes
64 bytes from fe80::206:6a00:a000:7d5e: icmp_seq=0 ttl=64 time=1.72 ms
64 bytes from fe80::206:6a00:a000:7d5e: icmp_seq=1 ttl=64 time=0.063 ms
64 bytes from fe80::206:6a00:a000:7d5e: icmp_seq=2 ttl=64 time=0.033 ms
64 bytes from fe80::206:6a00:a000:7d5e: icmp_seq=3 ttl=64 time=0.044 ms

Similarly, the RHEL5 host can ping IPV6 to a SLES10 host:

[r...@homer ~]# ping6 -I ib0 fe80::206:6a00:a000:6cc1 PING 
fe80::206:6a00:a000:6cc1(fe80::206:6a00:a000:6cc1) from 
fe80::206:6a00:a000:707f ib0: 56 data bytes
64 bytes from fe80::206:6a00:a000:6cc1: icmp_seq=0 ttl=64 time=1.91 ms
64 bytes from fe80::206:6a00:a000:6cc1: icmp_seq=1 ttl=64 time=0.048 ms

Any ideas? Has anyone seen this? 

Interestingly, the SM shows that the ping packet is leaving the RHEL4 machine, 
but it appears to have a garbled address in it:

Wed Mar  3 11:58:07 2010: fm0_sm(13922): ERROR[sareader]: SA: sa_PathRecord: 
gidprefix in Dest Gid 0x0006:6a00a0006ca8 of PATH request from 
Lid 0x9 does not match SM's(0xfe80)

What makes this interesting is that the actual GID of the RHEL4 box is 
fe80:00066a00a0006ca8 - so it looks like the GID in the ping packet 
is shifted 16 bits.



Re: Setting QP attributes with RDMA CM

2010-03-03 Thread Or Gerlitz
Todd Strader wrote:
> I'm using the RDMA CM to set up a QP and I'm trying to figure out if I
> can suggest QP attributes to it before it transitions through all the states

see rdma_connect(3)
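
For example, here is a minimal, untested sketch against librdmacm (the
retry values are only placeholders) that suggests a larger rnr_retry
through the connection parameters instead of touching the QP afterwards:

#include <stdio.h>
#include <string.h>
#include <rdma/rdma_cma.h>

/* Pass the RNR retry hint in the connection parameters; the CM still
 * drives the QP through its state transitions itself. */
static int connect_with_rnr_retry(struct rdma_cm_id *id)
{
	struct rdma_conn_param param;

	memset(&param, 0, sizeof(param));
	param.responder_resources = 1;
	param.initiator_depth = 1;
	param.retry_count = 7;
	param.rnr_retry_count = 7;	/* 7 means "retry forever" */

	if (rdma_connect(id, &param)) {
		perror("rdma_connect");
		return -1;
	}
	return 0;
}

The passive side can pass the same hint through the rdma_conn_param
argument of rdma_accept().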

Or.


Setting QP attributes with RDMA CM

2010-03-03 Thread Todd Strader

Hi,

I'm using the RDMA CM to set up a QP and I'm trying to figure out if I 
can suggest QP attributes to it before it transitions through all the 
states.  My QP is coming up with rnr_retry = 0, and I'd like to set it 
higher.  Is there any way to do this besides going back through RTR -> RTS?


Also, if this isn't the right place for this discussion, can someone 
point me in the right direction?  The openfabrics.org lists seem to be dead.


Thanks.

Todd


Re: rnfs: rq_respages pointer is bad

2010-03-03 Thread David J. Wilder

On Mon, 2010-03-01 at 21:35 -0600, Tom Tucker wrote:
> Hi David:
> 
> That looks like a bug to me and it looks like what you propose is the 
> correct fix. My only reservation is that if you are correct then how did 
> this work at all without data corruption for large writes on x86_64?

The size of the page array is determined by the page size (and some other
parameters).  I am using 64k pages; that may be the key.  I don't know
what page size was used on x86_64, but we may have been lucky and not
fallen off the end of the array, thus avoiding the problem.
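
For reference, the array in question is rq_pages[] in struct svc_rqst,
sized by RPCSVC_MAXPAGES (include/linux/sunrpc/svc.h).  Quoting from
memory for a 2.6.32-era tree, so treat the exact constants as approximate:

#define RPCSVC_MAXPAYLOAD	(1*1024*1024u)
/* payload pages plus a couple of extra pages for the RPC header and tail */
#define RPCSVC_MAXPAGES		((RPCSVC_MAXPAYLOAD+PAGE_SIZE-1)/PAGE_SIZE + 2 + 1)

struct svc_rqst {
	...
	struct page *	rq_pages[RPCSVC_MAXPAGES];
	...
};

With 64k pages that is only about 19 entries, versus roughly 259 with
4k pages.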

> 
> I'm on the road right now, so I can't dig too deep until Wednesday, but 
> at this point your analysis looks correct to me.
> 
> Tom
> 
> 
> David J. Wilder wrote:
> > Tom
> >
> > I have been chasing an rnfs related Oops in svc_process().  I have found
> > the source of the Oops but I am not sure of my fix.  I am seeing the
> > problem on ppc64, kernel 2.6.32, I have not tried other arch yet.
> >
> > The source of the problem is in rdma_read_complete(), I am finding that
> > rqstp->rq_respages is set to point past the end of the rqstp->rq_pages
> > page list.  This results in a NULL reference in svc_process() when
> > passing rq_respages[0] to page_address().
> >
> > In rdma_read_complete() we are using rqstp->rq_arg.pages as the base of
> > the page list then indexing by page_no, however rq_arg.pages is not
> > pointing to the start of the list so rq_respages ends up pointing to:
> >
> > rqstp->rq_pages[(head->count+1) + head->hdr_count]
> >
> > In my case, it ends up pointing one past the end of the list by one.
> >
> > Here is the change I made.
> >
> > static int rdma_read_complete(struct svc_rqst *rqstp,
> >   struct svc_rdma_op_ctxt *head)
> > {
> > int page_no;
> > int ret;
> >
> > BUG_ON(!head);
> >
> > /* Copy RPC pages */
> > for (page_no = 0; page_no < head->count; page_no++) {
> > put_page(rqstp->rq_pages[page_no]);
> > rqstp->rq_pages[page_no] = head->pages[page_no];
> > }
> > /* Point rq_arg.pages past header */
> > rqstp->rq_arg.pages = &rqstp->rq_pages[head->hdr_count];
> > rqstp->rq_arg.page_len = head->arg.page_len;
> > rqstp->rq_arg.page_base = head->arg.page_base;
> >
> > /* rq_respages starts after the last arg page */
> > -   rqstp->rq_respages = &rqstp->rq_arg.pages[page_no];
> > +   rqstp->rq_respages = &rqstp->rq_pages[page_no];
> > .
> > .
> > .
> >
> > The change works for me, but I am not sure it is safe to assume the
> > rqstp->rq_pages[head->count] will always point to the last arg page.
> >
> > Dave.
> >   
> 



[PATCH] infiniband-diags perfquery too noisy.

2010-03-03 Thread Mike Heinz
When perfquery is run against fabrics that do not support PortXmitWait, it
emits this warning for every port:

ibwarn: [23225] dump_perfcounters: PortXmitWait not indicated so ignore this
counter

When running ibcheckerrors on a large fabric, this leads to a flood of
warnings.

The proposed patch reduces the warning to a verbose message and, on fabrics 
that do not support PortXmitWait, it suppresses the output of the XmitWait 
attribute.

I've also included this patch in PR 1970 on the openfabrics bugzilla.



perfquery.patch
Description: perfquery.patch


[PATCH] RDMA/cxgb3: wait at least one schedule cycle during device removal.

2010-03-03 Thread Steve Wise
During a hot-plug LLD removal event or an EEH error event, iw_cxgb3
must ensure that any threads concurrently executing in a cxgb3 exported
function return from that function before iw_cxgb3 returns from
its event processing. Do this by calling synchronize_net().

Signed-off-by: Steve Wise 
---

 drivers/infiniband/hw/cxgb3/iwch.c |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch.c b/drivers/infiniband/hw/cxgb3/iwch.c
index ee1d8b4..63f975f 100644
--- a/drivers/infiniband/hw/cxgb3/iwch.c
+++ b/drivers/infiniband/hw/cxgb3/iwch.c
@@ -189,6 +189,7 @@ static void close_rnic_dev(struct t3cdev *tdev)
list_for_each_entry_safe(dev, tmp, &dev_list, entry) {
if (dev->rdev.t3cdev_p == tdev) {
dev->rdev.flags = CXIO_ERROR_FATAL;
+   synchronize_net();
cancel_delayed_work_sync(&dev->db_drop_task);
list_del(&dev->entry);
iwch_unregister_device(dev);
@@ -217,6 +218,7 @@ static void iwch_event_handler(struct t3cdev *tdev, u32 evt, u32 port_id)
switch (evt) {
case OFFLOAD_STATUS_DOWN: {
rdev->flags = CXIO_ERROR_FATAL;
+   synchronize_net();
event.event  = IB_EVENT_DEVICE_FATAL;
dispatch = 1;
break;



Re: IPoIB issues

2010-03-03 Thread Eli Cohen
I just posted a patch which might fix your problem. Please try it and
let us know if it fixed anything.

On Tue, Mar 02, 2010 at 01:54:09PM -0800, Josh England wrote:
> Hello,
> 
> I've been running into several issues using IPoIB.  The 2 primary uses
> are for read-only NFS to the clients (over TCP) and access to an
> ethernet-connected parallel filesystem (Panasas) through router nodes
> passing IPoIB<-->10GbE.
> 
> All nodes are running CentOS 5.3 and OFED 1.4.2, although I have played
> with OFED 1.5 and seen similar results.  Client nodes mount their NFS root
> from boot servers via IPoIB with a ratio of 80:1.  The boot servers are the
> ones that seem to have issues.  The fabric itself consists of ~1000 nodes
> interconnected such that there is 2:1 oversubscription within any single rack,
> and 20:1 oversubscription between racks (through the core switch).  I
> don't know how much the oversubscription comes into play here as I can
> reproduce the error within a single rack.
> 
> In datagram mode, I see errors on the boot servers of the form.
> 
> ib0: post_send failed
> ib0: post_send failed
> ib0: post_send failed
> 
> 
> When using connected mode, I hit a different error:
> 
> NETDEV WATCHDOG: ib0: transmit timed out
> ib0: transmit timeout: latency 1999 msecs
> ib0: queue stopped 1, tx_head 2154042680, tx_tail 2154039464
> NETDEV WATCHDOG: ib0: transmit timed out
> ib0: transmit timeout: latency 2999 msecs
> ib0: queue stopped 1, tx_head 2154042680, tx_tail 2154039464
> ...
> ...
> NETDEV WATCHDOG: ib0: transmit timed out
> ib0: transmit timeout: latency 61824999 msecs
> ib0: queue stopped 1, tx_head 2154042680, tx_tail 2154039464
> 
> 
> The errors seem to hit only after NFS comes into play.  Once it
> starts, the NETDEV WATCHDOG messages continue until I run
> 'ifconfig ib0 down up'.  I've tried tuning send_queue_size and
> recv_queue_size on both sides, the txqueuelen of the ib0 interface, the
> NFS rsize/wsize.  None of it seems to help greatly.  Does anyone have
> any ideas about what I can do to try to fix these problems?
> 
> -JE


[PATCH] ipoib: Fix lockup of the tx queue

2010-03-03 Thread Eli Cohen
The ipoib UD QP reports send completions to priv->send_cq, which is generally
unarmed; it only gets armed when the number of outstanding send requests
(i.e. those for which a completion has not been polled yet) reaches the size
of the tx queue. This arming (done using ib_req_notify_cq()) is done only in
the send path for the UD QP. However, when sending CM packets, the net queue
may be stopped for the same reasons, but no measures are taken to recover the
UD path from a lockup.
Consider this scenario: a host sends a high rate of both CM and UD packets.
Suppose also that the tx queue length is N. If at some time the number of
outstanding UD packets is more than N/2 and the overall number of outstanding
packets is N-1, and CM now sends a packet making the number of outstanding
packets equal to N, the tx queue will be stopped. When all the CM packets
complete, the number of outstanding packets will still be higher than N/2, so
the tx queue will not be re-enabled.
Fix this by calling ib_req_notify_cq() when the queue is stopped in the CM
path.

Signed-off-by: Eli Cohen 
---
 drivers/infiniband/ulp/ipoib/ipoib_cm.c |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index 30bdf42..f8302c2 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -752,6 +752,8 @@ void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_
if (++priv->tx_outstanding == ipoib_sendq_size) {
ipoib_dbg(priv, "TX ring 0x%x full, stopping kernel net 
queue\n",
  tx->qp->qp_num);
+   if (ib_req_notify_cq(priv->send_cq, IB_CQ_NEXT_COMP))
+   ipoib_warn(priv, "request notify on send CQ 
failed\n");
netif_stop_queue(dev);
}
}
-- 
1.7.0



opensm/main.c: force stdout to be line-buffered

2010-03-03 Thread Yevgeny Kliteynik
When stdout is assigned to a terminal, it is line-buffered.
But when opensm's stdout is redirected to a file, stdout
becomes block-buffered, which means that '\n' won't cause
the buffer to be flushed.

Force stdout to always be line-buffered.

Signed-off-by: Yevgeny Kliteynik 
---
 opensm/opensm/main.c |3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/opensm/opensm/main.c b/opensm/opensm/main.c
index f9a33af..5ea65dd 100644
--- a/opensm/opensm/main.c
+++ b/opensm/opensm/main.c
@@ -613,6 +613,9 @@ int main(int argc, char *argv[])
{NULL, 0, NULL, 0}  /* Required at the end of the array */
};

+   /* force stdout to be line-buffered */
+   setlinebuf(stdout);
+
/* Make sure that the opensm and complib were compiled using
   same modes (debug/free) */
if (osm_is_debug() != cl_is_debug()) {
-- 
1.5.1.4
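
As a side note (not part of the patch): setlinebuf() is a BSD extension;
the same effect can be obtained with the standard setvbuf() call, e.g.:

#include <stdio.h>

int main(void)
{
	/* same effect as setlinebuf(stdout), but plain ISO C */
	setvbuf(stdout, NULL, _IOLBF, 0);
	printf("line-buffered even when redirected to a file\n");
	return 0;
}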
