Re: [ewg] OFED 3.2 - Errors with admin-rdma script and make on x86

2012-03-20 Thread Pradeep Satyanarayana
We have experimented with various configuration options and the build does
complete, depending on what config options are chosen. So, yes we do get
the kernel modules built.
Basic question: how does one get the corresponding user-level libraries and
scripts that go with OFED-3.2? Are they still in the works, or is one
expected to pull them from a different tree? In other words, where does one
get the rest of the stuff?

Thanks
Pradeep
prad...@us.ibm.com

___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


[ewg] Errors building OFED-3.2 on ppc64

2012-03-06 Thread Pradeep Satyanarayana

Here are the steps that we followed and the errors encountered:

/home/user/compat-rdma # git clone git://git.openfabrics.org/compat-rdma/linux-3.2.git
/home/user/compat-rdma # git clone git://git.openfabrics.org/compat-rdma/compat.git
/home/user/compat-rdma # git clone git://git.openfabrics.org/compat-rdma/compat-rdma.git

/home/user/compat-rdma # export GIT_TREE=/home/user/linux-3.2/
/home/user/compat-rdma # export GIT_COMPAT_TREE=/home/user/compat
/home/user/compat-rdma # ./scripts/admin_rdma.sh
You said to use git tree at: /home/user/linux-3.2/ for linux-next
You said to use git tree at: /home/user/compat for compat
mkdir -p ./Documentation
cp -a /home/user/linux-3.2//Documentation/infiniband/ ./Documentation
mkdir -p ./drivers/base
cp -a /home/user/linux-3.2//drivers/base/attribute_container.c ./drivers/base
mkdir -p ./drivers/base
cp -a /home/user/linux-3.2//drivers/base/transport_class.c ./drivers/base
mkdir -p ./drivers
cp -a /home/user/linux-3.2//drivers/infiniband/ ./drivers
mkdir -p ./drivers/net/ethernet/chelsio
cp -a /home/user/linux-3.2//drivers/net/ethernet/chelsio/cxgb3/ ./drivers/net/ethernet/chelsio
mkdir -p ./drivers/net/ethernet/chelsio
cp -a /home/user/linux-3.2//drivers/net/ethernet/chelsio/cxgb4/ ./drivers/net/ethernet/chelsio
mkdir -p ./drivers/net/ethernet/mellanox
cp -a /home/user/linux-3.2//drivers/net/ethernet/mellanox/mlx4/ ./drivers/net/ethernet/mellanox
mkdir -p ./drivers/scsi
cp -a /home/user/linux-3.2//drivers/scsi/iscsi_tcp.c ./drivers/scsi
mkdir -p ./drivers/scsi
cp -a /home/user/linux-3.2//drivers/scsi/iscsi_tcp.h ./drivers/scsi
mkdir -p ./drivers/scsi
cp -a /home/user/linux-3.2//drivers/scsi/libiscsi.c ./drivers/scsi
mkdir -p ./drivers/scsi
cp -a /home/user/linux-3.2//drivers/scsi/scsi_transport_iscsi.c ./drivers/scsi
mkdir -p ./fs
cp -a /home/user/linux-3.2//fs/exportfs/ ./fs
mkdir -p ./fs
cp -a /home/user/linux-3.2//fs/lockd/ ./fs
mkdir -p ./fs
cp -a /home/user/linux-3.2//fs/nfs/ ./fs
mkdir -p ./fs
cp -a /home/user/linux-3.2//fs/nfs_common/ ./fs
mkdir -p ./fs
cp -a /home/user/linux-3.2//fs/nfsd/ ./fs
mkdir -p ./include/linux
cp -a /home/user/linux-3.2//include/linux/mlx4/ ./include/linux
mkdir -p ./include
cp -a /home/user/linux-3.2//include/rdma/ ./include
mkdir -p ./include/scsi
cp -a /home/user/linux-3.2//include/scsi/srp.h ./include/scsi
mkdir -p ./kernel
cp -a /home/user/linux-3.2//kernel/kfifo.c ./kernel
mkdir -p ./lib
cp -a /home/user/linux-3.2//lib/klist.c ./lib
mkdir -p ./net
cp -a /home/user/linux-3.2//net/rds/ ./net
mkdir -p ./include/linux
cp -a /home/user/linux-3.2//include/linux/rds.h ./include/linux
mkdir -p ./net
cp -a /home/user/linux-3.2//net/sunrpc/ ./net
Copying /home/user/compat/ files...
cp: cannot stat `/home/user/compat/include/scsi/*': No such file or directory
fatal: cannot describe '6343adaefb8a7a21d9ae1a018159f90f482cac61'
Updated from local tree: /home/user/linux-3.2/
Origin remote URL: git://git.openfabrics.org/compat-rdma/linux-3.2.git
fatal: cannot describe '55379a99264a53afa3cfc82594ee14b5513b0141'

compat-rdma code metrics

423399 - Total upstream lines of code being pulled

Base tree: linux-3.2.git
Base tree version:
compat-rdma release:
/home/user/compat-rdma #
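
Since the script takes its inputs from GIT_TREE and GIT_COMPAT_TREE, one quick sanity check before running it is to confirm that both variables point at actual checkouts. The following is only a sketch: the function name check_tree is hypothetical and is not part of compat-rdma, and it makes no claim about what causes the errors above.

```shell
#!/bin/sh
# Hypothetical pre-flight check (not part of compat-rdma): confirm that a
# variable such as GIT_TREE names an existing directory that contains a
# .git entry before running scripts/admin_rdma.sh.
check_tree() {
    name=$1
    path=$2
    if [ -z "$path" ]; then
        echo "$name is not set"
        return 1
    fi
    if [ ! -d "$path" ]; then
        echo "$name: $path is not a directory"
        return 1
    fi
    if [ ! -e "$path/.git" ]; then
        echo "$name: $path does not look like a git checkout"
        return 1
    fi
    echo "$name: $path looks OK"
}

# Usage:
#   check_tree GIT_TREE "$GIT_TREE" && check_tree GIT_COMPAT_TREE "$GIT_COMPAT_TREE"
```

If both checks pass, the remaining messages would have to come from the contents of the trees rather than from the paths.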

Thanks
Pradeep
prad...@us.ibm.com



[ewg] tcp_mem settings in /sbin/ib_ipoib_sysctl

2010-12-02 Thread Pradeep Satyanarayana

Some customers have seen strange behavior with the tcp_mem settings
in /sbin/ib_ipoib_sysctl. The problems are similar to what has been reported
at:

http://lists.linbit.com/pipermail/drbd-user/2009-September/012711.html

Essentially, a setting of:
/sbin/sysctl -q -w net.ipv4.tcp_mem="16777216 16777216 16777216"

chews up so much memory that it starves other applications (including file
systems) and deadlocks the system. I am not sure in which OFED release this
crept in, but I do understand that it is a performance tweak to help IPoIB CM.
What may not have been considered is that this vector specifies pages (not
bytes), as shown:

tcp_mem - vector of 3 INTEGERs: min, pressure, max
    min: below this number of pages TCP is not bothered about its
    memory appetite.

    pressure: when amount of memory allocated by TCP exceeds this number
    of pages, TCP moderates its memory consumption and enters memory
    pressure mode, which is exited when memory consumption falls
    under min.

    max: number of pages allowed for queueing by all TCP sockets.

    Defaults are calculated at boot time from amount of available
    memory.

In effect, with this setting TCP will not moderate its memory consumption
below 16M pages * 4K (if the page size is 4K), i.e. 64GB! This may be more
than the RAM available on smaller systems. And what about the case when the
page size is, say, 64K (1024GB before TCP starts to moderate)?
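
The arithmetic can be checked with a short shell sketch. The function names tcp_mem_bytes and tcp_mem_gib are hypothetical and used only for illustration; they multiply a page count by a page size in bytes. On a live system the page size would come from `getconf PAGESIZE`.

```shell
#!/bin/sh
# Hypothetical helper (not part of OFED): convert a tcp_mem value, which is
# expressed in pages, into bytes for a given page size in bytes.
tcp_mem_bytes() {
    pages=$1
    page_size=$2
    echo $((pages * page_size))
}

# Same conversion expressed in GiB (2^30 bytes), using integer division.
tcp_mem_gib() {
    echo $(( $(tcp_mem_bytes "$1" "$2") / 1073741824 ))
}

# The value set by /sbin/ib_ipoib_sysctl, with 4K and 64K pages.
tcp_mem_gib 16777216 4096     # prints 64
tcp_mem_gib 16777216 65536    # prints 1024
```

For the running system, `tcp_mem_gib 16777216 "$(getconf PAGESIZE)"` gives the same figure for whatever page size is in use.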

Can we consider removing the tcp_mem settings from the /sbin/ib_ipoib_sysctl
file for OFED-1.5.3? As mentioned above, defaults are calculated at boot time
based on the memory available and should be good enough for most uses.

Thanks
Pradeep
prad...@us.ibm.com

Re: [ewg] EWG meeting agenda

2010-08-30 Thread Pradeep Satyanarayana

uDAPL users confirmed that the fix is available and bug# 2026 is now
closed.

Thanks
Pradeep
prad...@us.ibm.com

From: Tziporet Koren tzipo...@mellanox.co.il
To: ewg@lists.openfabrics.org
Date: 08/30/2010 09:06 AM
Subject: [ewg] EWG meeting agenda
Sent by: ewg-boun...@lists.openfabrics.org

OFED meeting agenda:
OFED 1.5.2 RC5 was released today

Updated schedule:
OFED-1.5.2-rc6 - September 6.
OFED-1.5.2 GA - September 13.

Bugs list:

2075  blo  sas...@voltaire.com             Applications built against OFED 1.5 will not run on OFED ...
767   maj  sw...@opengridcomputing.com     backport Kernels that don't build in genalloc cause c...
2026  maj  arlin.r.da...@intel.com         What's the current target release date for OFED 1.5.2 ?
2063  maj  yevge...@mellanox.co.il         mlx4_en: When disabling both nic ports, and enable one - ...
2077  maj  johann.geo...@qlogic.com        qperf from OFED-1.5.2-rc2.tgz is not running for RDMA spe...
2084  maj  perki...@cse.ohio-state.edu     OFED-1.5.2-20100711-0600- IMB test fails to start on MVAP...
2096  maj  i...@dev.mellanox.co.il         ib-send_* and ib_write_* on ia64 does not work on OFED1.5...
2097  maj  andy.gro...@oracle.com          [ OFED-1.5.2-20100810-0600 ] - rds-ping  rds-stress test...
2104  maj  krzysztof.sprzaczkow...@int...  bonding doesn't detect port failure with NetEffect

Tziporet


[ewg] RoCE supplement to IB Spec

2010-04-27 Thread Pradeep Satyanarayana
Is it the intent that the RoCE implementation in OFED-1.5.1 corresponds to the 
RoCE supplement to the IB spec dated April 6th 2010,
excepting bugs of course? Or are there known deviations from the spec?

Pradeep



Re: [ewg] How to monitor traffic with IBoE/RoCEE?

2010-03-26 Thread Pradeep Satyanarayana
Tom Ammon wrote:
 A widely-supported method for collecting statistics from ethernet
 switching devices is SNMP. As Woody said, you wouldn't manage your
 ethernet network any differently than you do now. In addition many many
 tools have been written to make use of SNMP counters, so you won't have
 a hard time finding tools, both commercial and open source.
 
 Tom
 
 On 03/26/2010 11:29 AM, Woodruff, Robert J wrote:
 Pradeep wrote,

   
 With IBoE/RoCEE, the traditional SM in IB clusters is not needed.
 Most of the current
 IB tools rely on the SM and PM to get packet and error statistics and
 so on. These
 won't be applicable with IBoE/RoCEE. netstat will have no value since
 the kernel
 has been bypassed. So, how does one monitor traffic in such a cluster?
  
 Some managed Ethernet switches have the ability to log into them and
 get information on packets transmitted/received on each port and errors
 that are detected. However, these are typically proprietary. I am not
 sure about the Mellanox card.

 So I guess the answer is that you manage RoCEE and iWARP clusters just
 like you would any Ethernet cluster, with existing management tools that
 are available for managing Ethernet.

 woody

True, I suspected that would be the case. However, that means we are using
a different management model than with IB clusters. Hence, I was curious if
Mellanox had any tools to get say some hardware counters without going to a
different management model.

Thanks
Pradeep



[ewg] How to monitor traffic with IBoE/RoCEE?

2010-03-25 Thread Pradeep Satyanarayana
With IBoE/RoCEE, the traditional SM in IB clusters is not needed. Most of the
current IB tools rely on the SM and PM to get packet and error statistics and
so on. These won't be applicable with IBoE/RoCEE. netstat will have no value
since the kernel has been bypassed. So, how does one monitor traffic in such a
cluster?

The possibilities that I can think of are to get information from the switches,
or to use special tools (if any exist) that get information from the adapter
ports themselves. Are such tools available with ConnectX adapters? How would
one get cluster-wide traffic information when such a cluster is deployed?

Thanks
Pradeep



[ewg] IB errors with openMPI

2010-02-21 Thread Pradeep Satyanarayana
We are trying to run Open MPI with OFED-1.5 on the 2.6.31-rt11-preempt-rt kernel
and see the following errors:

[[45393,1],8][../../../../../ompi/mca/btl/openib/btl_openib_component.c:2951:handle_wc]
from elm3b107 to: elm3b17 error polling HP CQ with status WORK REQUEST FLUSHED
ERROR status number 5 for wr_id 1289846528 opcode -1782678528  vendor error 244
qp_idx 0

At this point I looked at the mlx4 diag counters and saw some non-zero values.
Since we were attempting a series of runs, we don't know when the counters
increased from 0. Do these counters have any correlation to the above MPI error?

[r...@elm3b17 diag_counters]# pwd
/sys/class/infiniband/mlx4_0/diag_counters
[r...@elm3b17 diag_counters]#

[r...@elm3b17 diag_counters]# cat rq_num_rnr
19
[r...@elm3b17 diag_counters]# cat rq_num_wrfe 
2009
[r...@elm3b17 diag_counters]# cat sq_num_tree 
12
[r...@elm3b17 diag_counters]# cat sq_num_wrfe
12
[r...@elm3b17 diag_counters]#

Similarly, here are the counters on elm3b107:

[r...@elm3b107 diag_counters]# cat rq_num_wrfe
5156
[r...@elm3b107 diag_counters]# cat sq_num_rnr
18
[r...@elm3b107 diag_counters]# cat sq_num_tree
20
[r...@elm3b107 diag_counters]# cat sq_num_wrfe
20
[r...@elm3b107 diag_counters]#


We are using ConnectX dual-port DDR HCAs (FW version 2.6). What does the vendor
error 244 mean? Any suggestions on how to debug this further?
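
Because the counters were only read after the failure, one way to see when they move is to snapshot the whole directory before and after each run. This is a minimal sketch, assuming the counters are plain one-value text files as the `cat` output above suggests; the function name snapshot_counters and the snapshot file names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print "name value" for every readable regular file in
# a counter directory, sorted by name, so that two snapshots can be diffed.
snapshot_counters() {
    dir=$1
    for f in "$dir"/*; do
        # Skip anything that is not a readable regular file.
        [ -f "$f" ] && [ -r "$f" ] || continue
        printf '%s %s\n' "$(basename "$f")" "$(cat "$f")"
    done | sort
}

# Usage (path taken from the session above):
#   snapshot_counters /sys/class/infiniband/mlx4_0/diag_counters > before.txt
#   ... run the MPI job ...
#   snapshot_counters /sys/class/infiniband/mlx4_0/diag_counters > after.txt
#   diff before.txt after.txt
```

Taking a snapshot around each individual run would show which run the increments belong to, which is the information missing above.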

Thanks
Pradeep



Re: [ewg] [PATCH] [for-2.6.33] rdma/cm: revert associating an RDMA device when binding to loopback

2010-02-09 Thread Pradeep Satyanarayana
Steve Wise wrote:
 This patch works.  It also backports cleanly to ofed-1.5.1/RH5.3.
 
 Acked-by: Steve Wise sw...@opengridcomputing.com
 
 Steve.
Steve, was this tested against both iWARP and IB?

Thanks
Pradeep



Re: [ewg] rdma/cm: disallow loopback address for iwarp devices

2010-02-08 Thread Pradeep Satyanarayana
Steve Wise wrote:
 Sean, can you try openmpi?  It fails for me, and yet ucmatose succeeds. 
 I don't understand the difference yet...
 
 
 Sean Hefty wrote:
 On my OFED 1.4.1 RHEL4u6 systems, rdma_bind_addr() fails when
 attempting to
 bind to 127.0.0.1 per the email I sent Friday:

http://www.spinics.net/lists/linux-rdma/msg02568.html
 

 This is what I see over IB on 2.6.26, with a couple extra prints added to
 cmatose:

 cst-lin1:/home/mshefty/librdmacm# examples/ucmatose -b 127.0.0.1
 cmatose: starting server
 src addr 0x17f
 rdma_bind_addr: 0

 so we're missing something else.


Hi Steve,

I am attempting to duplicate the problem that you reported with today's OFED
build (on SLES11, if that matters). I have rarely used Open MPI, so suggestions
would be helpful. Here is what I see:

elm3b199:/usr/lib # /usr/mpi/gcc/openmpi-1.4.1/bin/mpirun -np 2 --bynode --mca btl_openib_cpc_include rdmacm ring
--
mpirun was unable to launch the specified application as it could not find an executable:

Executable: ring
Node: elm3b199

while attempting to start process rank 0.
--
elm3b199:/usr/lib #

Incidentally tvflash did not build (this is a ppc64 machine).

Thanks
Pradeep



Re: [ewg] rdma/cm: disallow loopback address for iwarp devices

2010-02-08 Thread Pradeep Satyanarayana
Jeff Squyres wrote:
 On Feb 8, 2010, at 7:30 PM, Pradeep Satyanarayana wrote:
 
 elm3b199:/usr/lib # /usr/mpi/gcc/openmpi-1.4.1/bin/mpirun -np 2 --bynode 
 --mca btl_openib_cpc_include rdmacm ring
 --
 mpirun was unable to launch the specified application as it could not find 
 an executable:

 Executable: ring
 Node: elm3b199

 while attempting to start process rank 0.
 --
 elm3b199:/usr/lib #
 
 Is there an executable named ring either in your $PATH or in /usr/lib?
 
 Open MPI is telling you it can't find an executable named ring.

Hi Jeff,

No, there is none. I got this command from one of the mails in the thread. What 
should I use instead?
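
Jeff's question can be answered from the shell before launching the job. The sketch below checks whether a program name resolves through $PATH or exists as an executable in the current directory; the function name find_program is hypothetical and is not part of Open MPI.

```shell
#!/bin/sh
# Hypothetical helper: print where a program name resolves to, either via
# $PATH or as an executable file in the current directory. Returns non-zero
# and prints a message on stderr when nothing is found.
find_program() {
    prog=$1
    if command -v "$prog" >/dev/null 2>&1; then
        command -v "$prog"
    elif [ -x "./$prog" ] && [ ! -d "./$prog" ]; then
        echo "./$prog"
    else
        echo "not found: $prog" >&2
        return 1
    fi
}

# Usage:
#   find_program ring || echo "build or locate the test program first"
```

If the check fails, the program named on the mpirun command line has to be built or given by its full path.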

Thanks
Pradeep




Re: [ewg] Re: [PATCH 7/9] rdma/cm: fix loopback address support

2009-11-23 Thread Pradeep Satyanarayana


ewg-boun...@lists.openfabrics.org wrote on 11/22/2009 02:36:32 AM:

 Pradeep Satyanarayana wrote:
  Roland Dreier wrote:
 
  Thanks... in any case I applied all 9 of the patches in this series.
  Thanks for pulling all this together.
 
  Sean, Thanks a lot for pulling it all together. Can we consider
 including this
  into OFED-1.5 too?
 
  Pradeep
 
 
 
 Pradeep
 If someone will send a patch to Vlad we can add it to OFED 1.5

Tziporet, Sure, we will do it.

Thanks
Pradeep
prad...@us.ibm.com

[ewg] Re: [PATCH 7/9] rdma/cm: fix loopback address support

2009-11-19 Thread Pradeep Satyanarayana
Roland Dreier wrote:
 Thanks... in any case I applied all 9 of the patches in this series.
 Thanks for pulling all this together.
Sean, Thanks a lot for pulling it all together. Can we consider including this
into OFED-1.5 too?

Pradeep



Re: [ewg] Crash in bonding

2009-11-04 Thread Pradeep Satyanarayana
Tziporet Koren wrote:
 Pradeep Satyanarayana wrote:
 This crash was originally reported against Rhel5.4. However, one can
 recreate this crash quite easily in OFED-1.5 too.

   
 Can you open a bugzilla too?

I have opened bug# 1821

https://bugs.openfabrics.org/show_bug.cgi?id=1821

Pradeep




Re: [ewg] Crash in bonding

2009-11-03 Thread Pradeep Satyanarayana
Shiri Franchi wrote:
 Hi,
 
 I tried to reproduce on RH5 up4 with ping and iperf and it did not
 happen.
 Are you sure you used modprobe -t ib_ipoib or maybe modprobe -r
 bonding?
 
 Thanks,
 Shiri

Hi Shiri,

I used modprobe -r ib_ipoib. One of the pre-requisites to create
this crash is to do ifdown ib0 and ifdown ib1 while traffic is 
flowing through the bond master, and then remove the ipoib module.

If you simply remove the ipoib module, without doing the ifdowns the
crash does not occur.

Pradeep



Re: [ewg] Crash in bonding

2009-11-03 Thread Pradeep Satyanarayana
Shiri Franchi wrote:
 Hi,
 
 I did it exactly as you described:
 1. ifdown ib0
 2. ifdown ib1
 3. modprobe -r ib_ipoib
 
 And it did not reproduce.

It is a race that you may not be recreating. I have tried
this across different HCAs and platforms (x86_64, ppc64). 
Seems to recreate almost at will for me.

Pradeep



[ewg] Crash in bonding

2009-11-02 Thread Pradeep Satyanarayana
This crash was originally reported against RHEL 5.4. However, one can recreate
it quite easily in OFED-1.5 too.
The steps to recreate the crash are as follows:

1. Run traffic (I used ping) on the IB interfaces through the bond master
2. ifdown ib0
3. ifdown ib1
4. modprobe -r ib_ipoib
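
The steps above can be written down as a small script for repeated attempts. This sketch only prints the commands and executes nothing; the function name print_repro_steps is hypothetical, and the peer address is an assumed parameter, since no address is given in this report.

```shell
#!/bin/sh
# Hypothetical helper: print the reproduction sequence as shell commands.
# Nothing is executed here. The argument is an address that is reached
# through the bond master (an assumption; none is named in the report).
print_repro_steps() {
    peer=$1
    printf '%s\n' \
        "ping $peer > /dev/null &" \
        "ifdown ib0" \
        "ifdown ib1" \
        "modprobe -r ib_ipoib"
}

# Usage: review the output first, then run it as root on a disposable
# test machine, for example:
#   print_repro_steps 192.0.2.1 | sh -x
```

Printing rather than executing is deliberate, because the sequence is expected to crash an affected kernel.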

Quite often, the crash stack trace seen is as follows:

ID: 0  TASK: 81087fc11820  CPU: 13  COMMAND: swapper
 #0 [81010ff07ab0] crash_kexec at 800ac5b9
 #1 [81010ff07b70] __die at 80065127
 #2 [81010ff07bb0] do_page_fault at 80066da7
 #3 [81010ff07ca0] error_exit at 8005dde9
 #4 [81010ff07d58] neigh_connected_output at 8022cb87
 #5 [81010ff07d88] ip_output at 800320ac
 #6 [81010ff07db8] ip_queue_xmit at 8003464d
 #7 [81010ff07e78] tcp_transmit_skb at 80021d73
 #8 [81010ff07ec8] tcp_retransmit_skb at 80250ccd
 #9 [81010ff07f08] tcp_write_timer at 80252652
#10 [81010ff07f28] run_timer_softirq at 800968be
#11 [81010ff07f58] __do_softirq at 8001235a
#12 [81010ff07f88] call_softirq at 8005e2fc
#13 [81010ff07fa0] do_softirq at 8006cb14
#14 [81010ff07fb0] apic_timer_interrupt at 8005dc8e
--- IRQ stack ---
#15 [81010ff03e48] apic_timer_interrupt at 8005dc8e
[exception RIP: mwait_idle+54]
RIP: 800571f4  RSP: 81010ff03ef0  RFLAGS: 0246
RAX:   RBX: 000d  RCX: 
RDX:   RSI: 0001  RDI: 80301698
RBP: 81087fc11a10   R8: 81010ff02000   R9: 0032
R10: 81048e0cc4f0  R11: 8103ebafcd18  R12: 05f33f4d
R13: 0d12e63d7223  R14: 81047fe797a0  R15: 81087fc11820
ORIG_RAX: ff10  CS: 0010  SS: 0018
#16 [81010ff03ef0] cpu_idle at 8004939e



I was able to set up some break points and the analysis follows.

cpu 0x1 stopped at breakpoint 0x1 (d0ec4214 .bond_release+0x0/0x4d0 [bonding])
mflr    r0
enter ? for help
1:mon t
[link register   ] d0ecdf80 .bonding_store_slaves+0x304/0x3f0 [bonding]
[cfd97b00] d0ecdf70 .bonding_store_slaves+0x2f4/0x3f0 [bonding] (unreliable)
[cfd97bd0] c029a660 .class_device_attr_store+0x44/0x60
[cfd97c40] c015df9c .sysfs_write_file+0x134/0x1b8
[cfd97cf0] c00f8ec4 .vfs_write+0x118/0x200
[cfd97d90] c00f9634 .sys_write+0x4c/0x8c
[cfd97e30] c00086a4 syscall_exit+0x0/0x40
--- Exception: c00 (System Call) at 0ff11138
SP (ffd1f300) is in userspace

I did some basic sanity checks and confirmed that we hit a couple of
breakpoints, that the bond master was indeed bond0 as expected, and that the
slave device being released was ib1. After the breakpoints, we crashed:


Faulting instruction address: 0xc034bddc
cpu 0x1: Vector: 300 (Data Access) at [c000e025b2b0]
pc: c034bddc: .neigh_resolve_output+0x28c/0x34c
lr: c034bdc0: .neigh_resolve_output+0x270/0x34c
sp: c000e025b530
   msr: 80009032
   dar: d0c6fe58
 dsisr: 4000
  current = 0xc000e25f1aa0
  paca= 0xc053e280
pid   = 3591, comm = ping
enter ? for help
1:mon e
cpu 0x1: Vector: 300 (Data Access) at [c000e025b2b0]
pc: c034bddc: .neigh_resolve_output+0x28c/0x34c
lr: c034bdc0: .neigh_resolve_output+0x270/0x34c
sp: c000e025b530
   msr: 80009032
   dar: d0c6fe58
 dsisr: 4000
  current = 0xc000e25f1aa0
  paca= 0xc053e280
pid   = 3591, comm = ping
1:mon t
[c000e025b5e0] c0376934 .ip_output+0x358/0x3c0
[c000e025b670] c0374a04 .ip_push_pending_frames+0x440/0x558
[c000e025b720] c0397f10 .raw_sendmsg+0x770/0x860
[c000e025b860] c03a24f8 .inet_sendmsg+0x7c/0xa8
[c000e025b900] c033031c .sock_sendmsg+0x114/0x1b8
[c000e025bb00] c0331878 .sys_sendmsg+0x218/0x2ac
[c000e025bd20] c0356314 .compat_sys_sendmsg+0x14/0x28
[c000e025bd90] c0357914 .compat_sys_socketcall+0x1e4/0x214
[c000e025be30] c00086a4 syscall_exit+0x0/0x40
--- Exception: c00 (System Call) at 07f03c98
SP (ffb6e570) is in userspace
1:mon

I looked at the skb and confirmed that this was indeed against bond0.

One thing is apparent at this point: ping is still transmitting through bond0
even though bond_release() for ib1 (and of course ib0) occurred well before the
crash.

This is the reason for the crash. Any suggestions as to how to fix this?

Pradeep





Re: [ewg] EWG/OFED meeting minutes for Sep 21, 09

2009-09-21 Thread Pradeep Satyanarayana


ewg-boun...@lists.openfabrics.org wrote on 09/21/2009 09:30:45 AM:

 EWG/OFED the meeting minutes for Sep 21, 2009

 Meeting summary:
 
 * This was a short meeting on OFED 1.5 status
 * We wish to have OFED 1.5 RC1 this week
 * Must for RC1 is to resolve all compilation issues


 Compilation that must be resolved:
 1. PPC64 fails on SLES11 on NFS/RDMA

Tziporet,

We created two bugs

For Sles11 build failure:
https://bugs.openfabrics.org/show_bug.cgi?id=1746

And for Rhel5.4

https://bugs.openfabrics.org/show_bug.cgi?id=1747

Both these failures were on ppc64 machines.

As an aside, do we continue using the ewg mailing list for OFED issues?
Are there any plans to move to a new mailing list similar to linux-rdma?

Thanks!

Pradeep
prad...@us.ibm.com

Re: [ewg] ipv6 support in rping

2009-09-17 Thread Pradeep Satyanarayana
Tziporet, Vlad,

Who will be able to help us with this? Need to include the correct level of
librdmacm.
Is it reasonable to expect that this will get done before the next beta
release?

Thanks!

Pradeep
prad...@us.ibm.com

ewg-boun...@lists.openfabrics.org wrote on 09/17/2009 08:54:42 AM:


 On Thu, 2009-09-17 at 08:10 +0300, Or Gerlitz wrote:
  David J. Wilder wrote:
   I am not finding support for ipv6 in rping in the 1.5 beta.
   What is the story for ipv6 support?  Is it supported by librdma and
   missing in rping? Is ipv6 in rping planed?
 
  rping supports IPv6 since last year, see the below commit
 
  Or.
 
   commit 267c28a2f03b8fb63fa9907badd4130c710a1305
   Author: Aleksey Senin aleks...@voltaire.com
   Date:   Thu Aug 14 08:01:58 2008 -0700
  
   rping: add ipv6 support
  
   Signed-off-by: Aleksey Senin aleks...@voltaire.com
   Signed-off-by: Sean Hefty sean.he...@intel.com

 Humm,  that explains it..
 lidrdma in 1.5 contains librdmacm-1.0.8.tar.gz  dated 31-Jul-2008 two
 weeks before the change was checked in.  librdma in 1.5 needs to be
 updated.

 from rping.c in the 1.5 source.

 static int get_addr(char *dst, struct sockaddr_in *addr)
 {
     struct addrinfo *res;
     int ret;

     ret = getaddrinfo(dst, NULL, NULL, &res);
     if (ret) {
         printf("getaddrinfo failed - invalid hostname or IP address\n");
         return ret;
     }

     if (res->ai_family != PF_INET) {
         ret = -1;
         goto out;
     }

     *addr = *(struct sockaddr_in *) res->ai_addr;
 out:
     freeaddrinfo(res);
     return ret;
 }




Re: [ewg] OFED 1.5 beta status

2009-09-14 Thread Pradeep Satyanarayana

I am still seeing the following problem trying to install today's OFED-1.5
build on SLES11 (ppc64):

  gcc -m64
-Wp,-MD,/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/net/sunrpc/.svc.o.d
-nostdinc -isystem /usr/lib64/gcc/powerpc64-suse-linux/4.3/include
-D__KERNEL__ \
-D__OFED_BUILD__ \
-include include/linux/autoconf.h \
-include /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/linux/autoconf.h
\
-I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/kernel_addons/backport/2.6.27_sles11/include/
 \
 \
 \
-I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include \
-I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/debug \
-I/usr/local/include/scst \
-I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/ulp/srpt \
-I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/net/cxgb3 \
-Iinclude \
-Iinclude2 -I/usr/src/linux-2.6.27.19-5/include \
-I/usr/src/linux-2.6.27.19-5/arch/powerpc/include \
 -Iarch/powerpc  -I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/net/sunrpc
-Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing
-fno-common -Werror-implicit-function-declaration -Os -msoft-float -pipe
-I/usr/src/linux-2.6.27.19-5/arch/powerpc -Iarch/powerpc -mminimal-toc
-mtraceback=none -mcall-aixdesc -mcpu=power4 -mtune=cell -mno-altivec
-mno-spe -funit-at-a-time -mno-string -Wa,-maltivec -fno-stack-protector
-fomit-frame-pointer -g -Wdeclaration-after-statement -Wno-pointer-sign
-DMODULE -DKBUILD_STR(s)=#s -DKBUILD_BASENAME=KBUILD_STR(svc)
-DKBUILD_MODNAME=KBUILD_STR(sunrpc) -DDEBUG_HASH=27 -DDEBUG_HASH2=2
-c
-o /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/net/sunrpc/.tmp_svc.o 
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/net/sunrpc/svc.c
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/net/sunrpc/svc.c: In function ‘svc_pool_map_set_cpumask’:
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/net/sunrpc/svc.c:320: error: implicit declaration of function ‘_node_to_cpumask_ptr’
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/net/sunrpc/svc.c:320: warning: passing argument 2 of ‘set_cpus_allowed_ptr’ makes pointer from integer without a cast
make[5]: *** [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/net/sunrpc/svc.o] Error 1
make[4]: *** [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/net/sunrpc] Error 2
make[3]: *** [_module_/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5] Error 2
make[2]: *** [sub-make] Error 2
make[1]: *** [all] Error 2
make[1]: Leaving directory `/usr/src/linux-2.6.27.19-5-obj/ppc64/ppc64'
make: *** [kernel] Error 2
error: Bad exit status from /var/tmp/rpm-tmp.81065 (%build)


We did get past the mthca build failure, but this error occurs when I try
to install all packages.

However, an HPC installation succeeded.

Thanks

Pradeep
prad...@us.ibm.com


   
 Jack Morgenstein  
 ja...@dev.mellan 
 ox.co.il, To
 Sent by:  Alexander Schmidt   
 ewg-boun...@lists al...@linux.vnet.ibm.com  
 .openfabrics.org   cc
   Hoang-Nam Nguyen
   hngu...@linux.vnet.ibm.com,   
 09/12/2009 11:46  Christoph Raisch
 PMrai...@de.ibm.com, Stefan Roscher
   stefan.rosc...@de.ibm.com,
   ewg@lists.openfabrics.org   
   Subject
   Re: [ewg] OFED 1.5 beta status  
   
   
   
   
   
   




On Thursday 10 September 2009 19:11, Alexander Schmidt wrote:

 The following change fixes the issue for me, and it did not break other
parts of
 the stack, could someone review this?

 Thanks

 Index: ofa_kernel-1.5/kernel_addons/backport/2.6.27_sles11/include/linux/cpumask.h
 ===================================================================
 --- ofa_kernel-1.5.orig/kernel_addons/backport/2.6.27_sles11/include/linux/cpumask.h
 +++ ofa_kernel-1.5/kernel_addons/backport/2.6.27_sles11/include/linux/cpumask.h
 @@ -3,7 +3,6 @@

  #include_next <linux/cpumask.h>
  #include <asm/percpu.h>
 -#include <asm/topology.h>

  #define cpumask_of(cpu)   (get_cpu_mask(cpu))
  #define cpumask_of_node(node) (_node_to_cpumask_ptr(node))


Alex,

Thanks for sending the fix!
-Jack

Re: [ewg] OFED 1.5 beta status

2009-09-10 Thread Pradeep Satyanarayana


ewg-boun...@lists.openfabrics.org wrote on 09/10/2009 01:23:08 AM:

 On Wed, 9 Sep 2009 16:47:14 +0300
 Tziporet Koren tzipo...@mellanox.co.il wrote:
 Hi,

  Hi,
  I wish to update all that we plan to release OFED 1.5 beta tomorrow
 
  I know it's a week later than we planned, but we waited until all
  modules at least pass compilation on all supported OSes before we
  publish the beta

 the mthca driver does not compile yet on SLES11 on powerpc. I've already
 reported this twice.

 -I/usr/src/linux-2.6.27.23-0.1/arch/powerpc/include \
  -Iarch/powerpc \
  -I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/hw/mthca \
  -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing \
  -fno-common -Werror-implicit-function-declaration -fwrapv -Os \
  -msoft-float -pipe -I/usr/src/linux-2.6.27.23-0.1/arch/powerpc \
  -Iarch/powerpc -mminimal-toc -mtraceback=none -mcall-aixdesc \
  -mcpu=power4 -mtune=cell -mno-altivec -mno-spe -funit-at-a-time \
  -mno-string -Wa,-maltivec -fno-stack-protector -fomit-frame-pointer -g \
  -Wdeclaration-after-statement -Wno-pointer-sign -fwrapv -DMODULE \
  -DKBUILD_STR(s)=#s -DKBUILD_BASENAME=KBUILD_STR(mthca_eq) \
  -DKBUILD_MODNAME=KBUILD_STR(ib_mthca) -DDEBUG_HASH=48 -DDEBUG_HASH2=63 \
  -c -o /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/hw/mthca/.tmp_mthca_eq.o \
  /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/hw/mthca/mthca_eq.c
 In file included from /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/kernel_addons/backport/2.6.27_sles11/include/linux/cpumask.h:6,
  from /usr/src/linux-2.6.27.23-0.1/include/linux/interrupt.h:9,
  from /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/hw/mthca/mthca_eq.c:35:
 include2/asm/topology.h:75: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘dump_numa_cpu_topology’

I looked at the pre-processor output (used --save-temps). It looks like
__init is indeed not recognized. In the pre-processor output file,
linux/init.h is included after the occurrence of dump_numa_cpu_topology.
There must have been some changes to header file inclusion causing this
failure.
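
The ordering problem described above can be shown in isolation. The following is only a minimal userspace sketch: DEMO_INIT and dump_topology_demo are made-up stand-ins for __init and dump_numa_cpu_topology, and the attribute used is not what the kernel's __init expands to. It shows the working order, with the macro defined before the declaration that uses it.

```c
/* Hypothetical stand-in for the kernel's __init annotation. In the kernel
 * the real macro comes from linux/init.h; this one is for illustration. */
#define DEMO_INIT __attribute__((unused))

/* Because DEMO_INIT is defined above, the preprocessor replaces it before
 * the compiler parses this definition. If the #define appeared after this
 * point instead, the compiler would see an unknown identifier between the
 * return type and the function name and stop with an
 * "expected '=', ',', ';', 'asm' or '__attribute__' before ..." error. */
static int DEMO_INIT dump_topology_demo(void)
{
    return 0;
}
```

Moving the #define below the function reproduces the same class of error as in the log, which is consistent with linux/init.h being pulled in too late.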

Pradeep
prad...@us.ibm.com

Re: [ewg] OFED-1.5 and RDMA_CM support for IPv6

2009-07-20 Thread Pradeep Satyanarayana
Tziporet Koren wrote:
 Pradeep Satyanarayana wrote:
 Since the RDMA_CM with support for IPv6 was dropped from OFED-1.4,
 (and is now upstream) can one expect that it will be in OFED-1.5?


   
 If it's in 2.6.30 then we already have it
 If it's in 2.6.31 we will need to take the code
 
 Can you let me know?
Hi Tziporet,

I see it in 2.6.30, so it should be in OFED-1.5 then. Thanks!

Pradeep




[ewg] Re: OFED bug#1616

2009-05-19 Thread Pradeep Satyanarayana
As mentioned on the OFED conference call, I downloaded yesterday's build
and did confirm that the bug was fixed by running the Connectathon tests on
a couple of ppc64 machines.
The tests ran to completion without any problems.

Steve, Jon, Thanks for your help in resolving the issues.

Pradeep
prad...@us.ibm.com

Re: [ewg] ibv_devinfo on 1.4.1-rc4 shows only 1 port on HCA, two HCAs installed

2009-05-04 Thread Pradeep Satyanarayana

Hello Mike,

Could this be a firmware issue? I presume you are using ConnectX; is that
correct?

We have not seen this with Rhel5.3/OFED-1.4.1 rc4 in our tests.

Pradeep
prad...@us.ibm.com


   
From: Mike Aho/Rochester/IBM@IBMUS
Sent by: ewg-boun...@lists.openfabrics.org
Date: 05/04/2009 12:06 PM
To: ewg@lists.openfabrics.org
Subject: [ewg] ibv_devinfo on 1.4.1-rc4 shows only 1 port on HCA, two HCAs installed





We have 1.4.1-rc4 installed on RHEL 5.  We have two Mellanox DDR two-port
cards installed.  ibv_devinfo only shows the info on the first port of each
HCA when run.  Please advise if this is a change in how ibv_devinfo works.
If not, any info I need to send along as a bugzilla?

Thanks.

Mike Aho


[ewg] Bonding fail over not working

2008-10-20 Thread Pradeep Satyanarayana
I downloaded a recent version of Roland's git tree and tried IPoIB bonding.
Fail over does not seem to be working at all. I have tried OFED 1.3.2 on a
Rhel5 derivative and that (fail over) worked as expected.

Is this a known issue? Given that OFED 1.4 will be in sync with the mainline
kernel, is this an issue to be addressed in OFED 1.4 too? Has anyone else
tried this out recently? My impression is that all bonding patches were
already upstream.

Pradeep



Re: [ewg] Re: [PATCH]IPOIB/CM fix for bug# 906 -OFED-1.3

2008-02-18 Thread Pradeep Satyanarayana
Eli Cohen wrote:
 On Sun, 2008-02-17 at 11:21 +0200, Or Gerlitz wrote:
 Thanks, this sheds more light on the solution but I still can not 
 understand how can the upstream code live without the QPs getting 
 destroyed? or the bug exist also there? if yes, I would recommend to 
 reshape the change-log to the extent explaining well the problem and 
 solution and then resend to Roland.

 Pradeep, I think it's your call.

Hello Eli, I have already submitted this patch to mainline. I will follow
up with Roland to get this merged there.

Pradeep




Re: [ewg] Re: [PATCH]IPOIB/CM fix for bug# 906 -OFED-1.3

2008-02-18 Thread Pradeep Satyanarayana
Roland Dreier wrote:
   Hello Eli, I have already submitted this patch to mainline. I will follow
   up with Roland to get this merged there.
 
 I didn't see the submission... can you resend?
Roland,

Here is the link to the patch sent previously:

http://lists.openfabrics.org/pipermail/general/2008-February/046463.html

Pradeep



Re: [ewg] Re: [PATCH]IPOIB/CM fix for bug# 906 -OFED-1.3

2008-02-18 Thread Pradeep Satyanarayana
Roland Dreier wrote:
   Here is the link to the patch sent previously:
 
   http://lists.openfabrics.org/pipermail/general/2008-February/046463.html
 
 OK, applied, although that link points to an HTML-mangled version of
 the patch, and I also had to figure out why we needed that change and
 write the patch description myself.

Roland,

Thanks for your help.

Pradeep



[ewg] [PATCH] IPOIB/CM Increase retry counts for OFED-1.3

2008-02-12 Thread Pradeep Satyanarayana
This patch changes the retry counts to small non-zero values. This helps
interoperability between ehca and mthca. Without this patch I had seen send
completion errors.

Or Gerlitz has started a thread on the general mailing list and the complete
discussion will be available there. This is the second part of the patch
submitted yesterday and is split up as per Eli's request.

Signed-off-by: Pradeep Satyanarayana [EMAIL PROTECTED]
---

--- ofa_kernel-1.3_a/drivers/infiniband/ulp/ipoib/ipoib_cm.c    2008-02-12 17:46:03.0 -0500
+++ ofa_kernel-1.3_b/drivers/infiniband/ulp/ipoib/ipoib_cm.c    2008-02-12 17:46:58.0 -0500
@@ -1016,8 +1016,8 @@ static int ipoib_cm_send_req(struct net_
        req.responder_resources        = 4;
        req.remote_cm_response_timeout = 20;
        req.local_cm_response_timeout  = 20;
-       req.retry_count                = 0; /* RFC draft warns against retries */
-       req.rnr_retry_count            = 0; /* RFC draft warns against retries */
+       req.retry_count                = 3;
+       req.rnr_retry_count            = 3;
        req.max_cm_retries             = 15;
        req.srq                        = ipoib_cm_has_srq(dev);
        return ib_send_cm_req(id, &req);



[ewg] [PATCH]IPOIB/CM fix for bug# 906 -OFED-1.3

2008-02-12 Thread Pradeep Satyanarayana
This patch fixes the failure to destroy the ipoib rx QP
(https://bugs.openfabrics.org/show_bug.cgi?id=906).
It resolves the usecnt issue reported previously on ehca and allows the qp
to be destroyed.

As per Eli's request, I am splitting up the patches. This is the first portion
of yesterday's patch.
Tested on ppc64 machines with ehca and mthca.

Signed-off-by: Pradeep Satyanarayana [EMAIL PROTECTED]
---

--- ofa_kernel-1.3_a/drivers/infiniband/ulp/ipoib/ipoib_cm.c    2008-02-11 14:28:47.0 -0500
+++ ofa_kernel-1.3_b/drivers/infiniband/ulp/ipoib/ipoib_cm.c    2008-02-12 17:44:07.0 -0500
@@ -883,9 +883,9 @@ void ipoib_cm_dev_stop(struct net_device
                        /*
                         * assume the HW is wedged and just free up everything.
                         */
-                       list_splice_init(&priv->cm.rx_flush_list, &list);
-                       list_splice_init(&priv->cm.rx_error_list, &list);
-                       list_splice_init(&priv->cm.rx_drain_list, &list);
+                       list_splice_init(&priv->cm.rx_flush_list, &priv->cm.rx_reap_list);
+                       list_splice_init(&priv->cm.rx_error_list, &priv->cm.rx_reap_list);
+                       list_splice_init(&priv->cm.rx_drain_list, &priv->cm.rx_reap_list);
                        break;
                }
                spin_unlock_irq(&priv->lock);
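
For readers unfamiliar with list_splice_init(), the change above redirects the entries of three lists onto a single reap list. The following is a minimal userspace sketch of that operation, not the kernel's list.h. The helper names mirror the kernel ones for readability, while demo_reap_count and the local list names are hypothetical.

```c
#include <stddef.h>

/* Userspace stand-in for the kernel's circular, doubly linked list_head. */
struct list_head {
    struct list_head *next, *prev;
};

static void list_init(struct list_head *h)
{
    h->next = h;
    h->prev = h;
}

static int list_is_empty(const struct list_head *h)
{
    return h->next == h;
}

static void list_add_tail(struct list_head *n, struct list_head *h)
{
    n->prev = h->prev;
    n->next = h;
    h->prev->next = n;
    h->prev = n;
}

/* Move every entry of src to the front of dst, then leave src empty. */
static void list_splice_init(struct list_head *src, struct list_head *dst)
{
    struct list_head *first, *last, *at;

    if (list_is_empty(src))
        return;
    first = src->next;
    last = src->prev;
    at = dst->next;
    first->prev = dst;
    dst->next = first;
    last->next = at;
    at->prev = last;
    list_init(src);
}

static size_t list_length(const struct list_head *h)
{
    size_t n = 0;
    const struct list_head *p;

    for (p = h->next; p != h; p = p->next)
        n++;
    return n;
}

/* Hypothetical demo: three source lists are drained into one reap list. */
static size_t demo_reap_count(void)
{
    struct list_head flush, error, drain, reap;
    struct list_head a, b, c;

    list_init(&flush);
    list_init(&error);
    list_init(&drain);
    list_init(&reap);
    list_add_tail(&a, &flush);
    list_add_tail(&b, &flush);
    list_add_tail(&c, &error);

    list_splice_init(&flush, &reap);
    list_splice_init(&error, &reap);
    list_splice_init(&drain, &reap); /* empty source: nothing to move */

    return list_length(&reap);
}
```

After the three splices every entry sits on the one list that a later cleanup pass would walk, and the source lists are empty.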



Re: [ewg] Re: [PATCH] IPOIB/CM fixes for issues seen in OFED-1.3

2008-02-12 Thread Pradeep Satyanarayana
Or Gerlitz wrote:
 Eli Cohen wrote:
 could you send as distinct patches according to what they fix?
 
 Pradeep Satyanarayana wrote:
 2. Change retry counts to small values. This helps interoperability
 between ehca and mthca.
 
 Indeed, I sent a note on that now to the general list, lets discuss it
 there since its an architectural issue which need to be addressed
 correctly.

Hello Or,

Sure that is a good idea. I had proposed that (more than 6 months ago)
on the general list but there was no response. This may be a good time
to restart the conversation there.

Pradeep



[ewg] Re: [PATCH] IPOIB/CM fixes for issues seen in OFED-1.3

2008-02-12 Thread Pradeep Satyanarayana
Eli Cohen wrote:
 Pradeep,
 
 could you send as distinct patches according to what they fix?
 
 Thanks.
Hello Eli,

Sure I will do that. And I will drop the change due to the UD split CQ.

Pradeep



Re: [ofa-general] Re: [ewg] OFED 1.3 rc4 update

2008-02-07 Thread Pradeep Satyanarayana
Eli Cohen wrote:
 This problem was seen on a ehca that supports SRQ.

 
 Please reply how many scatter entries does ehca support when working
 in SRQ mode? Also any piece of info I might need to try and mimic ehca
 behaviour on Mellanox devices. I will appreciate if you can repeat the
 exact sequence of actions you do to reproduce this.

Hello Eli,

Ehca supports fewer than 16 s/g entries, so the srq patch addresses that
issue.
The sequence of steps that I followed for the touch test:
1. On a freshly booted system, configure ib0 and assign an IP address
2. Switch to connected mode and change mtu
3. ping remote ib interface (already in CM mode)
4. modprobe -r ib_ehca

I see a series of cascading failures in /var/log/messages, starting with
the issue of not being able to destroy the cq (specifically rcq).

Pradeep



Re: [ofa-general] Re: [ewg] OFED 1.3 rc4 update

2008-02-06 Thread Pradeep Satyanarayana
Tziporet Koren wrote:
 Shirley Ma wrote:

 Thanks Tziporet. We will test it right after it's out.

   
 You can start using the latest build -
 http://www.openfabrics.org/builds/ofed-1.3/OFED-1.3-20080206-0751.tgz
 
 Tziporet
 

I have downloaded today's build mentioned above. I am still seeing the issue
of ib_destroy_cq() failing for the rcq mentioned yesterday.

Here are the steps that I follow:

1. On a freshly booted system, configure ib0
2. Switch to connected mode (on an HCA that supports SRQ)
3. ping the remote interface
4. modprobe -r ib_ehca
5. I see the failures about ib_destroy_cq() failing and the
cascading failures following that (srq and pd cannot be destroyed)
6. If I try a modprobe ib_ehca I get the error "Cannot allocate memory".
This also suggests that something is chewing up tons of memory. I realize
that the qp and associated pd were not freed, so some memory is lost. However,
this system has 8 GB of memory.

Pradeep



Re: [ofa-general] Re: [ewg] OFED 1.3 rc4 update

2008-02-06 Thread Pradeep Satyanarayana
Pradeep Satyanarayana wrote:
 Tziporet Koren wrote:
 Shirley Ma wrote:
 Thanks Tziporet. We will test it right after it's out.

   
 You can start using the latest build -
 http://www.openfabrics.org/builds/ofed-1.3/OFED-1.3-20080206-0751.tgz

 Tziporet

 
 I have downloaded today's build mentioned above. I am still seeing the issue
 of failing ib_destroy_cq() for the rcq mentioned yesterday.
 
 Here are the steps that I follow:
 
 1. On a freshly booted system configure ib0
 2. Switch to connected mode ( on HCA that supports SRQ)
 3. ping remote interface
 4. modprobe -r ib_ehca
 5. I see the failures about ib_destroy_cq() failing and the
 cascading failures following that (srq and pd cannot be destroyed)

ib_destroy_qp() fails because the refcnt is not zero. On my
system it was set to 2.
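
The general pattern behind such a failure can be sketched as follows. This is not the verbs or ehca code: demo_obj, demo_destroy and demo_teardown are hypothetical names, and the sketch only assumes that a destroy call refuses to proceed while a use count is non-zero. On Linux EBUSY is 16, which matches the -16 reported for ib_destroy_srq earlier in this thread.

```c
#include <errno.h>

/* Hypothetical stand-in for an object (QP, CQ, SRQ, PD) with a use count. */
struct demo_obj {
    int usecnt;     /* number of users still referring to this object */
    int destroyed;  /* set to 1 once teardown has actually happened */
};

/* Refuse to destroy an object that is still in use. */
static int demo_destroy(struct demo_obj *obj)
{
    if (obj->usecnt != 0)
        return -EBUSY;
    obj->destroyed = 1;
    return 0;
}

/* Returns the result of the first (refused) attempt; *second gets the retry. */
static int demo_teardown(int *second)
{
    struct demo_obj obj = { 2, 0 };  /* use count of 2, as observed above */
    int first = demo_destroy(&obj);

    obj.usecnt = 0;                  /* both users drop their references */
    *second = demo_destroy(&obj);
    return first;
}
```

In this model, everything that depends on the refused object also stays allocated, which is one way a single failed destroy can turn into the cascade of srq and pd failures listed in the steps.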

Pradeep



[ewg] Re: [ofa-general] Oops with today's OFED 1.3

2008-02-05 Thread Pradeep Satyanarayana
Pradeep Satyanarayana wrote:
 Eli Cohen wrote:
 Pradeep,
 Can you check if this is resolved?

 On 2/4/08, Pradeep Satyanarayana [EMAIL PROTECTED] wrote:
 I pulled today's (Feb 4th) OFED build and saw the following Oops while
 touch testing on ehca1 on a 2.6.24 kernel.

 
 snip
 
 
 NIP [d0299ca8] .ipoib_cm_dev_init+0x440/0x63c [ib_ipoib]
 LR [d0299a70] .ipoib_cm_dev_init+0x208/0x63c [ib_ipoib]
 Call Trace:
 [c001cc85f630] [d0299a70] .ipoib_cm_dev_init+0x208/0x63c [ib_ipoib] (unreliable)
 [c001cc85f7d0] [d0297f4c] .ipoib_transport_dev_init+0x120/0x458 [ib_ipoib]
 [c001cc85f930] [d029463c] .ipoib_ib_dev_init+0x44/0xb8 [ib_ipoib]
 [c001cc85f9c0] [d02902ec] .ipoib_dev_init+0xe0/0x138 [ib_ipoib]
 [c001cc85fa60] [d0290544] .ipoib_add_one+0x200/0x424 [ib_ipoib]
 [c001cc85fb20] [d01610e4] .ib_register_client+0x94/0xf4 [ib_core]
 [c001cc85fbb0] [d029dcac] .ipoib_init_module+0x1f8/0x246c [ib_ipoib]
 [c001cc85fc70] [c00905f0] .sys_init_module+0x176c/0x187c
 [c001cc85fe30] [c000852c] syscall_exit+0x0/0x40
 Instruction dump:
 801f0f20 3b60 2f80 409d0040 e81f0f30 e97f04f0 7b6926e4 395b0001
 7d5b07b4 7c080214 816b0018 7d290214 9169002c 6000 6000 6000
 
 Hello Eli,
 
 Yes, this particular issue has been solved. However, I do see some other
 issues.
 
 I am seeing some new messages (not seen previously) in dmesg relating to
 ib_cq_destroy() (on ehca):
 
 ib0: ib_cq_destroy failed
 ib_destroy_srq failed: -16
 ib_dealloc_pd failed
 
 This happens after some network tests and an rmmod of ib_ehca.
 
 At this point my guess is that this has to do with the split CQ patch. I have
 not had enough cycles to state that with absolute certainty. Can you please
 take a look too?
 
 Pradeep
 

I looked at this some more. This error occurs because ib_cq_destroy() for rcq
failed. After that there are a series of cascading failures.

Pradeep



Re: [ewg] [Fwd: Re: non SRQ patch for OFED 1.3] -need some help

2008-02-03 Thread Pradeep Satyanarayana

 Pradeep, Shir
 We tried to apply this patch for OFED 1.3 and it breaks some of the
 backports.
 Please use the makedist script on the ofa server (there is an
 explanation in the developers Wiki) and fix this so we can try to
 apply it
 Vlad will help you later today too

 Thanks,
 Tziporet

 

 ___
 ewg mailing list
 ewg@lists.openfabrics.org
 http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
 Pradeep,
 I added your patch (kernel_patches/fixes/ipoib_0200_non_srq.patch) and
 fixed the backport issue (ipoib_0100_to_2.6.21.patch).
 Please check if ofed_1_3/linux-2.6.git ofed_kernel is ok.

Hello Vladimir,

I downloaded it and tried it on a 2.6.24 kernel (Sles10Sp2b1) and
it compiled fine. I touch tested it and looks okay too. Thanks for
your help.

However, when I tried on a 2.6.16.57-0.9-ppc64 (Sles10sp2b1) after
running ofed_scripts/configure, the make failed as follows:

drivers/infiniband/core/addr.c: In function ‘addr_arp_recv’:
drivers/infiniband/core/addr.c:359: error: ‘struct sk_buff’ has no member named ‘nh’

This seems to be coming from the addr_1_netevents_revert_to_2_6_17.patch
patch, which is completely unrelated to this patch.

Is there a place where the steps in the build process are completely described?
The Wiki at:
https://wiki.openfabrics.org/tiki-index.php?page=HOWTO%20Build%20OFED-1.3
is probably missing a few steps. It would help greatly if you could describe
all the steps. Why is it that we see differing results?

Pradeep



[ewg] [Fwd: Re: non SRQ patch for OFED 1.3] -need some help

2008-02-01 Thread Pradeep Satyanarayana
I tried running ofed_scripts/ofed_makedist.sh before and after copying my patch
to kernel_patches/fixes. In both cases makedist.sh seems to complete without
errors and creates the tar.gz files for the various kernels.

In short I am unable to reproduce the problem that Tziporet mentions. Any tips
or pointers to resolve this issue would be appreciated. Thanks!

Pradeep
---BeginMessage---

Pradeep Satyanarayana wrote:

Some HCAs like ehca do not natively support srq. This patch would enable IPoIB
CM for such HCAs. This patch has been accepted into Roland's for-2.6.25 git
tree for about 3 months now.


Please consider including this patch into OFED 1.3.


  

Pradeep,
We tried to apply this patch for OFED 1.3 and it breaks some of the
backports.
Please use the makedist script on the ofa server (there is an
explanation in the developers Wiki) and fix this so we can try to apply it

Vlad will help you later today too

Thanks,
Tziporet

---End Message---

[ewg] non SRQ patch for OFED 1.3

2008-01-24 Thread Pradeep Satyanarayana
Some HCAs like ehca do not natively support srq. This patch would enable IPoIB
CM for such HCAs. This patch has been accepted into Roland's for-2.6.25 git
tree for about 3 months now.

Please consider including this patch into OFED 1.3.

Signed-off-by: Pradeep Satyanarayana [EMAIL PROTECTED]
---

--- a/drivers/infiniband/ulp/ipoib/ipoib.h      2008-01-23 13:29:06.0 -0800
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h      2008-01-23 16:03:33.0 -0800
@@ -69,6 +69,7 @@ enum {
        IPOIB_TX_RING_SIZE        = 64,
        IPOIB_MAX_QUEUE_SIZE      = 8192,
        IPOIB_MIN_QUEUE_SIZE      = 2,
+       IPOIB_CM_MAX_CONN_QP      = 4096,
 
        IPOIB_NUM_WC              = 4,
 
@@ -188,10 +189,13 @@ enum ipoib_cm_state {
 struct ipoib_cm_rx {
        struct ib_cm_id        *id;
        struct ib_qp           *qp;
+       struct ipoib_cm_rx_buf *rx_ring;
        struct list_head        list;
        struct net_device      *dev;
        unsigned long           jiffies;
        enum ipoib_cm_state     state;
+       int                     index;
+       int                     recv_count;
 };
 
 struct ipoib_cm_tx {
@@ -234,6 +238,7 @@ struct ipoib_cm_dev_priv {
        struct ib_wc            ibwc[IPOIB_NUM_WC];
        struct ib_sge           rx_sge[IPOIB_CM_RX_SG];
        struct ib_recv_wr       rx_wr;
+       int                     nonsrq_conn_qp;
        int                     max_cm_mtu;
        int                     num_frags;
 };
@@ -463,6 +468,8 @@ void ipoib_drain_cq(struct net_device *d
 /* We don't support UC connections at the moment */
 #define IPOIB_CM_SUPPORTED(ha)   (ha[0] & (IPOIB_FLAGS_RC))
 
+extern int ipoib_max_conn_qp;
+
 static inline int ipoib_cm_admin_enabled(struct net_device *dev)
 {
        struct ipoib_dev_priv *priv = netdev_priv(dev);
@@ -493,6 +500,12 @@ static inline void ipoib_cm_set(struct i
        neigh->cm = tx;
 }
 
+static inline int ipoib_cm_has_srq(struct net_device *dev)
+{
+       struct ipoib_dev_priv *priv = netdev_priv(dev);
+       return !!priv->cm.srq;
+}
+
 void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_tx *tx);
 int ipoib_cm_dev_open(struct net_device *dev);
 void ipoib_cm_dev_stop(struct net_device *dev);
@@ -510,6 +523,8 @@ void ipoib_cm_handle_tx_wc(struct net_de
 
 struct ipoib_cm_tx;
 
+#define ipoib_max_conn_qp 0
+
 static inline int ipoib_cm_admin_enabled(struct net_device *dev)
 {
        return 0;
@@ -535,6 +550,11 @@ static inline void ipoib_cm_set(struct i
 {
 }
 
+static inline int ipoib_cm_has_srq(struct net_device *dev)
+{
+       return 0;
+}
+
 static inline
 void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_tx *tx)
 {
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c   2008-01-23 13:29:06.0 -0800
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c   2008-01-23 16:46:47.0 -0800
@@ -39,6 +39,13 @@
 #include <linux/icmpv6.h>
 #include <linux/delay.h>
 
+int ipoib_max_conn_qp = 128;
+
+module_param_named(max_nonsrq_conn_qp, ipoib_max_conn_qp, int, 0444);
+MODULE_PARM_DESC(max_nonsrq_conn_qp,
+                "Max number of connected-mode QPs per interface "
+                "(applied only if shared receive queue is not available)");
+
 #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG_DATA
 static int data_debug_level;
 
@@ -81,7 +88,7 @@ static void ipoib_cm_dma_unmap_rx(struct
                ib_dma_unmap_single(priv->ca, mapping[i + 1], PAGE_SIZE, DMA_FROM_DEVICE);
 }
 
-static int ipoib_cm_post_receive(struct net_device *dev, int id)
+static int ipoib_cm_post_receive_srq(struct net_device *dev, int id)
 {
        struct ipoib_dev_priv *priv = netdev_priv(dev);
        struct ib_recv_wr *bad_wr;
@@ -104,7 +111,33 @@ static int ipoib_cm_post_receive(struct 
        return ret;
 }
 
-static struct sk_buff *ipoib_cm_alloc_rx_skb(struct net_device *dev, int id, int frags,
+static int ipoib_cm_post_receive_nonsrq(struct net_device *dev,
+                                       struct ipoib_cm_rx *rx, int id)
+{
+       struct ipoib_dev_priv *priv = netdev_priv(dev);
+       struct ib_recv_wr *bad_wr;
+       int i, ret;
+
+       priv->cm.rx_wr.wr_id = id | IPOIB_OP_CM | IPOIB_OP_RECV;
+
+       for (i = 0; i < IPOIB_CM_RX_SG; ++i)
+               priv->cm.rx_sge[i].addr = rx->rx_ring[id].mapping[i];
+
+       ret = ib_post_recv(rx->qp, &priv->cm.rx_wr, &bad_wr);
+       if (unlikely(ret)) {
+               ipoib_warn(priv, "post recv failed for buf %d (%d)\n", id, ret);
+               ipoib_cm_dma_unmap_rx(priv, IPOIB_CM_RX_SG - 1,
+                                     rx->rx_ring[id].mapping);
+               dev_kfree_skb_any(rx->rx_ring[id].skb);
+               rx->rx_ring[id].skb = NULL;
+       }
+
+       return ret;
+}
+
+static struct sk_buff *ipoib_cm_alloc_rx_skb(struct net_device *dev,
+                                            struct ipoib_cm_rx_buf *rx_ring,
+                                            int id, int frags,
                                             u64 mapping[IPOIB_CM_RX_SG

[ewg] Question regarding non srq patch for OFED 1.3

2008-01-22 Thread Pradeep Satyanarayana
Some HCAs like ehca do not natively support srq. In order to enable IPoIB CM
for such HCAs, I have developed a nonsrq patch. This patch has been accepted
into Roland's for-2.6.25 git tree for about 3 months now.

I am working on porting that to OFED 1.3 and it will take me at least several
days to finish the port and test it. Is there a date by which I need to
complete it for its inclusion into OFED 1.3?

Pradeep



[ewg] Re: [PATCH] IPoIB/CM Enable SRQ support for HCAs with les than 16 s/g entries (in OFED 1.3)

2008-01-21 Thread Pradeep Satyanarayana
Pradeep Satyanarayana wrote:
 Some HCAs like ehca2 support fewer than 16 SG entries. Currently IPoIB/CM
 implicitly assumes all HCAs will support 16 SG entries of 4K pages for 64K
 MTUs. This patch removes that restriction.
 
 This patch continues to use order 0 allocations and enables implementation of
 connected mode on such HCAs with smaller MTUs. HCAs having the capability to
 support 16 SG entries are left untouched.
 
 A version of this patch has been integrated into Roland's for-2.6.25 git tree
 for a couple of weeks. Here is a back ported version of that patch (for OFED
 1.3). Please consider for inclusion into OFED 1.3.
 
 This patch addresses bug# 728:
 https://bugs.openfabrics.org/show_bug.cgi?id=728
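
The arithmetic behind the quoted description can be sketched as below. This is only an illustration under the assumption that each SG entry maps exactly one page (order 0 allocation). demo_max_rx_bytes is a hypothetical helper, the count of 4 SG entries is a made-up example rather than a real ehca2 limit, and the actual patch may reserve part of the buffer for headers.

```c
#include <stddef.h>

/* Largest receive buffer when each scatter/gather entry maps one page. */
static size_t demo_max_rx_bytes(size_t num_sge, size_t page_size)
{
    return num_sge * page_size;
}
```

With 16 entries of 4K pages this gives 65536 bytes, the 64K case mentioned above. A device limited to, say, 4 entries would be capped at 16384 bytes, hence the smaller MTU.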

I am not sure if this eluded your attention. Would it be possible to merge this
patch into OFED 1.3? Patch was inlined in a previous mail. Would you like me to
resend?

Pradeep
 



[ewg] ConnectX support

2007-08-01 Thread Pradeep Satyanarayana
I saw some specific known issues and limitations wrt ConnectX in OFED 
1.2c. Is ConnectX officially supported in OFED 1.2c, or will that be OFED 
1.3?

Pradeep
[EMAIL PROTECTED]