Re: [ewg] OFED 3.2 - Errors with admin-rdma script and make on x86
We have experimented with various configuration options, and the build does complete, depending on which config options are chosen. So yes, we do get the kernel modules built.

Basic question: how does one get the corresponding user-level libraries and scripts that go with OFED-3.2? Is that still in the works, or is one expected to pull from a different tree, i.e. where does one get the rest of the stuff?

Thanks
Pradeep
prad...@us.ibm.com
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
[ewg] Errors building OFED-3.2 on ppc64
Here are the steps that we followed and the errors encountered:

/home/user/compat-rdma # git clone git://git.openfabrics.org/compat-rdma/linux-3.2.git
/home/user/compat-rdma # git clone git://git.openfabrics.org/compat-rdma/compat.git
/home/user/compat-rdma # git clone git://git.openfabrics.org/compat-rdma/compat-rdma.git
/home/user/compat-rdma # export GIT_TREE=/home/user/linux-3.2/
/home/user/compat-rdma # export GIT_COMPAT_TREE=/home/user/compat
/home/user/compat-rdma # ./scripts/admin_rdma.sh
You said to use git tree at: /home/user/linux-3.2/ for linux-next
You said to use git tree at: /home/user/compat for compat
mkdir -p ./Documentation
cp -a /home/user/linux-3.2//Documentation/infiniband/ ./Documentation
mkdir -p ./drivers/base
cp -a /home/user/linux-3.2//drivers/base/attribute_container.c ./drivers/base
mkdir -p ./drivers/base
cp -a /home/user/linux-3.2//drivers/base/transport_class.c ./drivers/base
mkdir -p ./drivers
cp -a /home/user/linux-3.2//drivers/infiniband/ ./drivers
mkdir -p ./drivers/net/ethernet/chelsio
cp -a /home/user/linux-3.2//drivers/net/ethernet/chelsio/cxgb3/ ./drivers/net/ethernet/chelsio
mkdir -p ./drivers/net/ethernet/chelsio
cp -a /home/user/linux-3.2//drivers/net/ethernet/chelsio/cxgb4/ ./drivers/net/ethernet/chelsio
mkdir -p ./drivers/net/ethernet/mellanox
cp -a /home/user/linux-3.2//drivers/net/ethernet/mellanox/mlx4/ ./drivers/net/ethernet/mellanox
mkdir -p ./drivers/scsi
cp -a /home/user/linux-3.2//drivers/scsi/iscsi_tcp.c ./drivers/scsi
mkdir -p ./drivers/scsi
cp -a /home/user/linux-3.2//drivers/scsi/iscsi_tcp.h ./drivers/scsi
mkdir -p ./drivers/scsi
cp -a /home/user/linux-3.2//drivers/scsi/libiscsi.c ./drivers/scsi
mkdir -p ./drivers/scsi
cp -a /home/user/linux-3.2//drivers/scsi/scsi_transport_iscsi.c ./drivers/scsi
mkdir -p ./fs
cp -a /home/user/linux-3.2//fs/exportfs/ ./fs
mkdir -p ./fs
cp -a /home/user/linux-3.2//fs/lockd/ ./fs
mkdir -p ./fs
cp -a /home/user/linux-3.2//fs/nfs/ ./fs
mkdir -p ./fs
cp -a /home/user/linux-3.2//fs/nfs_common/ ./fs
mkdir -p ./fs
cp -a /home/user/linux-3.2//fs/nfsd/ ./fs
mkdir -p ./include/linux
cp -a /home/user/linux-3.2//include/linux/mlx4/ ./include/linux
mkdir -p ./include
cp -a /home/user/linux-3.2//include/rdma/ ./include
mkdir -p ./include/scsi
cp -a /home/user/linux-3.2//include/scsi/srp.h ./include/scsi
mkdir -p ./kernel
cp -a /home/user/linux-3.2//kernel/kfifo.c ./kernel
mkdir -p ./lib
cp -a /home/user/linux-3.2//lib/klist.c ./lib
mkdir -p ./net
cp -a /home/user/linux-3.2//net/rds/ ./net
mkdir -p ./include/linux
cp -a /home/user/linux-3.2//include/linux/rds.h ./include/linux
mkdir -p ./net
cp -a /home/user/linux-3.2//net/sunrpc/ ./net
Copying /home/user/compat/ files...
cp: cannot stat `/home/user/compat/include/scsi/*': No such file or directory
fatal: cannot describe '6343adaefb8a7a21d9ae1a018159f90f482cac61'
Updated from local tree: /home/user/linux-3.2/
Origin remote URL: git://git.openfabrics.org/compat-rdma/linux-3.2.git
fatal: cannot describe '55379a99264a53afa3cfc82594ee14b5513b0141'
compat-rdma code metrics
423399 - Total upstream lines of code being pulled
Base tree: linux-3.2.git
Base tree version:
compat-rdma release:
/home/user/compat-rdma #

Thanks
Pradeep
prad...@us.ibm.com
[ewg] tcp_mem settings in /sbin/ib_ipoib_sysctl
Some customers have seen strange behaviors with the tcp_mem settings in /sbin/ib_ipoib_sysctl. The problems are similar to what has been reported at:

http://lists.linbit.com/pipermail/drbd-user/2009-September/012711.html

Essentially, a setting of:

/sbin/sysctl -q -w net.ipv4.tcp_mem="16777216 16777216 16777216"

chews up so much memory that it starves other applications (including file systems) and deadlocks the system. I am not sure in which OFED release this crept in, but I do understand that it is a performance tweak to help IPoIB CM.

What may not have been considered is that this vector specifies pages (not bytes), as shown:

tcp_mem - vector of 3 INTEGERs: min, pressure, max
    min: below this number of pages TCP is not bothered about its memory appetite.
    pressure: when amount of memory allocated by TCP exceeds this number of pages, TCP moderates its memory consumption and enters memory pressure mode, which is exited when memory consumption falls under min.
    max: number of pages allowed for queueing by all TCP sockets.
    Defaults are calculated at boot time from amount of available memory.

In effect (with this setting) TCP will not moderate its memory consumption below 16M*4K (if page size = 4K), i.e. 64GB! This may be more than the RAM available on smaller systems, and what about the case when the page size is, say, 64K (1024GB before TCP starts to moderate)?

Can we consider removing the tcp_mem settings from the /sbin/ib_ipoib_sysctl file for OFED-1.5.3? As mentioned above, defaults are calculated at boot time based on the memory available and should be good enough for most uses.

Thanks
Pradeep
prad...@us.ibm.com
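The 64GB and 1024GB figures above can be checked with a few lines of arithmetic. The Python below is only a sketch, assuming binary units (1GB taken as 2^30 bytes); the helper name tcp_mem_bytes is made up for this illustration and is not part of any OFED script or kernel interface.

```python
# Illustrative only: converts a tcp_mem value (counted in pages) to bytes.
# tcp_mem_bytes is a hypothetical helper, not part of OFED or the kernel.

GIB = 2 ** 30  # assumption: "GB" in the mail means 2**30 bytes


def tcp_mem_bytes(pages: int, page_size: int) -> int:
    """Return the memory, in bytes, covered by `pages` pages of `page_size` bytes."""
    return pages * page_size


if __name__ == "__main__":
    setting = 16777216  # the value quoted from /sbin/ib_ipoib_sysctl in this mail
    for page_size in (4 * 1024, 64 * 1024):
        gib = tcp_mem_bytes(setting, page_size) // GIB
        print(f"page size {page_size // 1024}K: {gib} GB")
```

On a live system the actual page size can be read with os.sysconf("SC_PAGE_SIZE") rather than assumed.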
Re: [ewg] EWG meeting agenda
uDAPL users confirmed that the fix is available and bug# 2026 is now closed.

Thanks
Pradeep
prad...@us.ibm.com

From: Tziporet Koren tzipo...@mellanox.co.il
To: ewg@lists.openfabrics.org ewg@lists.openfabrics.org
Date: 08/30/2010 09:06 AM
Subject: [ewg] EWG meeting agenda
Sent by: ewg-boun...@lists.openfabrics.org

OFED meeting agenda:

OFED 1.5.2 RC5 was released today

Updated schedule:
OFED-1.5.2-rc6 - September 6.
OFED-1.5.2 GA - September 13.

Bugs list:
2075 blo sas...@voltaire.com Applications built against OFED 1.5 will not run on OFED ...
767 maj sw...@opengridcomputing.com backport Kernels that don't build in genalloc cause c...
2026 maj arlin.r.da...@intel.com What's the current target release date for OFED 1.5.2 ?
2063 maj yevge...@mellanox.co.il mlx4_en: When disabling both nic ports, and enable one - ...
2077 maj johann.geo...@qlogic.com qperf from OFED-1.5.2-rc2.tgz is not running for RDMA spe...
2084 maj perki...@cse.ohio-state.edu OFED-1.5.2-20100711-0600- IMB test fails to start on MVAP...
2096 maj i...@dev.mellanox.co.il ib-send_* and ib_write_* on ia64 does not work on OFED1.5...
2097 maj andy.gro...@oracle.com [ OFED-1.5.2-20100810-0600 ] - rds-ping rds-stress test...
2104 maj krzysztof.sprzaczkow...@int... bonding doesn't detect port failure with NetEffect

Tziporet
[ewg] RoCE supplement to IB Spec
Is it the intent that the RoCE implementation in OFED-1.5.1 corresponds to the RoCE supplement to the IB spec dated April 6th 2010, excepting bugs of course? Or are there known deviations from the spec?

Pradeep
Re: [ewg] How to monitor traffic with IBoE/RoCEE?
Tom Ammon wrote:

A widely-supported method for collecting statistics from ethernet switching devices is SNMP. As Woody said, you wouldn't manage your ethernet network any differently than you do now. In addition, many tools have been written to make use of SNMP counters, so you won't have a hard time finding tools, both commercial and open source.

Tom

On 03/26/2010 11:29 AM, Woodruff, Robert J wrote:

Pradeep wrote,

With IBoE/RoCEE, the traditional SM in IB clusters is not needed. Most of the current IB tools rely on the SM and PM to get packet and error statistics and so on. These won't be applicable with IBoE/RoCEE. netstat will have no value since the kernel has been bypassed. So, how does one monitor traffic in such a cluster?

Some managed Ethernet switches have the ability to log into them and get information on packets transmitted/received on each port and errors that are detected. However, these are typically proprietary. I am not sure about the Mellanox card. So I guess the answer is that you manage RoCEE and iWarp clusters just like you would any Ethernet cluster, with existing management tools that are available for managing Ethernet.

woody

True, I suspected that would be the case. However, that means we are using a different management model than with IB clusters. Hence, I was curious whether Mellanox had any tools to get, say, some hardware counters without going to a different management model.

Thanks
Pradeep
[ewg] How to monitor traffic with IBoE/RoCEE?
With IBoE/RoCEE, the traditional SM in IB clusters is not needed. Most of the current IB tools rely on the SM and PM to get packet and error statistics and so on. These won't be applicable with IBoE/RoCEE. netstat will have no value since the kernel has been bypassed. So, how does one monitor traffic in such a cluster?

The possibilities that I can think of are to get information from the switches or, if there are some special tools, to get information from the adapter ports themselves. Are such tools available with ConnectX adapters? How would one get cluster-wide traffic information when such a cluster is deployed?

Thanks
Pradeep
[ewg] IB errors with openMPI
We are trying to run openMPI with OFED-1.5 on the 2.6.31-rt11-preempt-rt kernel and see the following errors:

[[45393,1],8][../../../../../ompi/mca/btl/openib/btl_openib_component.c:2951:handle_wc] from elm3b107 to: elm3b17 error polling HP CQ with status WORK REQUEST FLUSHED ERROR status number 5 for wr_id 1289846528 opcode -1782678528 vendor error 244 qp_idx 0

At this point I looked at the mlx4 diag counters and saw some non-zero values. Since we were attempting a series of runs, we don't know when the counters increased from 0. Do these counters have any correlation to the above MPI error?

[r...@elm3b17 diag_counters]# pwd
/sys/class/infiniband/mlx4_0/diag_counters
[r...@elm3b17 diag_counters]#
[r...@elm3b17 diag_counters]# cat rq_num_rnr
19
[r...@elm3b17 diag_counters]# cat rq_num_wrfe
2009
[r...@elm3b17 diag_counters]# cat sq_num_tree
12
[r...@elm3b17 diag_counters]# cat sq_num_wrfe
12
[r...@elm3b17 diag_counters]#

Similarly, here are the counters on elm3b107:

[r...@elm3b107 diag_counters]# cat rq_num_wrfe
5156
[r...@elm3b107 diag_counters]# cat sq_num_rnr
18
[r...@elm3b107 diag_counters]# cat sq_num_tree
20
[r...@elm3b107 diag_counters]# cat sq_num_wrfe
20
[r...@elm3b107 diag_counters]#

We are using ConnectX dual port DDR HCAs (FW version 2.6). What does the vendor error 244 mean? Any suggestions to debug this further?

Thanks
Pradeep
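Since it is not known when the counters moved away from 0, one option is to snapshot the whole diag_counters directory before and after each run and compare. The Python below is a minimal sketch of that idea, not an existing OFED tool; the function names read_counters and changed are made up for this illustration, and it assumes each counter file holds a single integer, as the cat output above suggests.

```python
# Illustrative only: snapshot every counter file in a sysfs-style directory
# and report which counters moved between two snapshots.
from pathlib import Path


def read_counters(directory):
    """Return {file name: integer value} for each readable integer counter file."""
    counters = {}
    for entry in sorted(Path(directory).iterdir()):
        if not entry.is_file():
            continue
        try:
            counters[entry.name] = int(entry.read_text().strip())
        except (OSError, ValueError):
            continue  # skip unreadable or non-numeric entries
    return counters


def changed(before, after):
    """Return {name: delta} for counters whose value differs between snapshots."""
    return {name: after[name] - before.get(name, 0)
            for name in after if after[name] != before.get(name, 0)}


if __name__ == "__main__":
    # Path taken from the mail above; it exists only with the mlx4 driver loaded.
    path = "/sys/class/infiniband/mlx4_0/diag_counters"
    if Path(path).is_dir():
        for name, value in read_counters(path).items():
            print(name, value)
```

Calling read_counters() before a run and again after it, then passing both results to changed(), narrows down which run made a given counter increase.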
Re: [ewg] [PATCH] [for-2.6.33] rdma/cm: revert associating an RDMA device when binding to loopback
Steve Wise wrote:

This patch works. It also backports cleanly to ofed-1.5.1/RH5.3.

Acked-by: Steve Wise sw...@opengridcomputing.com

Steve.

Steve,

Was this tested against both iWARP and IB?

Thanks
Pradeep
Re: [ewg] rdma/cm: disallow loopback address for iwarp devices
Steve Wise wrote:

Sean, can you try openmpi? It fails for me, and yet ucmatose succeeds. I don't understand the difference yet...

Sean Hefty wrote:

On my OFED 1.4.1 RHEL4u6 systems, rdma_bind_addr() fails when attempting to bind to 127.0.0.1 per the email I sent Friday:

http://www.spinics.net/lists/linux-rdma/msg02568.html

This is what I see over IB on 2.6.26, with a couple extra prints added to cmatose:

cst-lin1:/home/mshefty/librdmacm# examples/ucmatose -b 127.0.0.1
cmatose: starting server
src addr 0x17f
rdma_bind_addr: 0

so we're missing something else.

Hi Steve,

I am attempting to duplicate the problem that you reported with today's OFED build (on Sles11, if that matters). I have rarely used openMPI, so suggestions would be helpful. Here is what I see:

elm3b199:/usr/lib # /usr/mpi/gcc/openmpi-1.4.1/bin/mpirun -np 2 --bynode --mca btl_openib_cpc_include rdmacm ring
--
mpirun was unable to launch the specified application as it could not find an executable:

Executable: ring
Node: elm3b199

while attempting to start process rank 0.
--
elm3b199:/usr/lib #

Incidentally, tvflash did not build (this is a ppc64 machine).

Thanks
Pradeep
Re: [ewg] rdma/cm: disallow loopback address for iwarp devices
Jeff Squyres wrote:

On Feb 8, 2010, at 7:30 PM, Pradeep Satyanarayana wrote:

elm3b199:/usr/lib # /usr/mpi/gcc/openmpi-1.4.1/bin/mpirun -np 2 --bynode --mca btl_openib_cpc_include rdmacm ring
--
mpirun was unable to launch the specified application as it could not find an executable:

Executable: ring
Node: elm3b199

while attempting to start process rank 0.
--
elm3b199:/usr/lib #

Is there an executable named ring either in your $PATH or in /usr/lib? Open MPI is telling you it can't find an executable named ring.

Hi Jeff,

No, there is none. I got this command from one of the mails in the thread. What should I use instead?

Thanks
Pradeep
Re: [ewg] Re: [PATCH 7/9] rdma/cm: fix loopback address support
ewg-boun...@lists.openfabrics.org wrote on 11/22/2009 02:36:32 AM:

Pradeep Satyanarayana wrote:

Roland Dreier wrote:

Thanks... in any case I applied all 9 of the patches in this series. Thanks for pulling all this together.

Sean,

Thanks a lot for pulling it all together. Can we consider including this into OFED-1.5 too?

Pradeep

Pradeep

If someone will send a patch to Vlad we can add it to OFED 1.5

Tziporet,

Sure, we will do it.

Thanks
Pradeep
prad...@us.ibm.com
[ewg] Re: [PATCH 7/9] rdma/cm: fix loopback address support
Roland Dreier wrote:

Thanks... in any case I applied all 9 of the patches in this series. Thanks for pulling all this together.

Sean,

Thanks a lot for pulling it all together. Can we consider including this into OFED-1.5 too?

Pradeep
Re: [ewg] Crash in bonding
Tziporet Koren wrote:

Pradeep Satyanarayana wrote:

This crash was originally reported against Rhel5.4. However, one can recreate this crash quite easily in OFED-1.5 too.

Can you open a bugzilla too?

I have opened bug# 1821:

https://bugs.openfabrics.org/show_bug.cgi?id=1821

Pradeep
Re: [ewg] Crash in bonding
Shiri Franchi wrote:

Hi,

I tried to reproduce on RH5 up4 with ping and iperf and it did not happen. Are you sure you used modprobe -t ib_ipoib or maybe modprobe -r bonding?

Thanks,
Shiri

Hi Shiri,

I used modprobe -r ib_ipoib. One of the prerequisites to create this crash is to do ifdown ib0 and ifdown ib1 while traffic is flowing through the bond master, and then remove the ipoib module. If you simply remove the ipoib module without doing the ifdowns, the crash does not occur.

Pradeep
Re: [ewg] Crash in bonding
Shiri Franchi wrote:

Hi,

I did it exactly as you described:
1. ifdown ib0
2. ifdown ib1
3. modprobe -r ib_ipoib
And it did not reproduce.

It is a race that you may not be recreating. I have tried this across different HCAs and platforms (x86_64, ppc64). It seems to recreate almost at will for me.

Pradeep
[ewg] Crash in bonding
This crash was originally reported against Rhel5.4. However, one can recreate this crash quite easily in OFED-1.5 too. The steps to recreate the crash are as follows:

1. Run traffic (I used ping) on the IB interfaces through the bond master
2. ifdown ib0
3. ifdown ib1
4. modprobe -r ib_ipoib

Quite often, the crash stack trace seen is as follows:

ID: 0 TASK: 81087fc11820 CPU: 13 COMMAND: swapper
#0 [81010ff07ab0] crash_kexec at 800ac5b9
#1 [81010ff07b70] __die at 80065127
#2 [81010ff07bb0] do_page_fault at 80066da7
#3 [81010ff07ca0] error_exit at 8005dde9
#4 [81010ff07d58] neigh_connected_output at 8022cb87
#5 [81010ff07d88] ip_output at 800320ac
#6 [81010ff07db8] ip_queue_xmit at 8003464d
#7 [81010ff07e78] tcp_transmit_skb at 80021d73
#8 [81010ff07ec8] tcp_retransmit_skb at 80250ccd
#9 [81010ff07f08] tcp_write_timer at 80252652
#10 [81010ff07f28] run_timer_softirq at 800968be
#11 [81010ff07f58] __do_softirq at 8001235a
#12 [81010ff07f88] call_softirq at 8005e2fc
#13 [81010ff07fa0] do_softirq at 8006cb14
#14 [81010ff07fb0] apic_timer_interrupt at 8005dc8e
--- IRQ stack ---
#15 [81010ff03e48] apic_timer_interrupt at 8005dc8e
    [exception RIP: mwait_idle+54]
    RIP: 800571f4 RSP: 81010ff03ef0 RFLAGS: 0246
    RAX: RBX: 000d RCX:
    RDX: RSI: 0001 RDI: 80301698
    RBP: 81087fc11a10 R8: 81010ff02000 R9: 0032
    R10: 81048e0cc4f0 R11: 8103ebafcd18 R12: 05f33f4d
    R13: 0d12e63d7223 R14: 81047fe797a0 R15: 81087fc11820
    ORIG_RAX: ff10 CS: 0010 SS: 0018
#16 [81010ff03ef0] cpu_idle at 8004939e

I was able to set up some break points and the analysis follows.

cpu 0x1 stopped at breakpoint 0x1 (d0ec4214 .bond_release+0x0/0x4d0 [bonding])
mflr r0
enter ? for help
1:mon t
[link register ] d0ecdf80 .bonding_store_slaves+0x304/0x3f0 [bonding]
[cfd97b00] d0ecdf70 .bonding_store_slaves+0x2f4/0x3f0 [bonding] (unreliable)
[cfd97bd0] c029a660 .class_device_attr_store+0x44/0x60
[cfd97c40] c015df9c .sysfs_write_file+0x134/0x1b8
[cfd97cf0] c00f8ec4 .vfs_write+0x118/0x200
[cfd97d90] c00f9634 .sys_write+0x4c/0x8c
[cfd97e30] c00086a4 syscall_exit+0x0/0x40
--- Exception: c00 (System Call) at 0ff11138
SP (ffd1f300) is in userspace

Did some basic sanity checks and confirmed that we hit a couple of breakpoints, that the bond master was indeed bond0 as expected, and that the slave device being released was ib1. After the breakpoints, we crashed:

Faulting instruction address: 0xc034bddc
cpu 0x1: Vector: 300 (Data Access) at [c000e025b2b0]
    pc: c034bddc: .neigh_resolve_output+0x28c/0x34c
    lr: c034bdc0: .neigh_resolve_output+0x270/0x34c
    sp: c000e025b530
    msr: 80009032
    dar: d0c6fe58
    dsisr: 4000
    current = 0xc000e25f1aa0
    paca = 0xc053e280
    pid = 3591, comm = ping
enter ? for help
1:mon e
cpu 0x1: Vector: 300 (Data Access) at [c000e025b2b0]
    pc: c034bddc: .neigh_resolve_output+0x28c/0x34c
    lr: c034bdc0: .neigh_resolve_output+0x270/0x34c
    sp: c000e025b530
    msr: 80009032
    dar: d0c6fe58
    dsisr: 4000
    current = 0xc000e25f1aa0
    paca = 0xc053e280
    pid = 3591, comm = ping
1:mon t
[c000e025b5e0] c0376934 .ip_output+0x358/0x3c0
[c000e025b670] c0374a04 .ip_push_pending_frames+0x440/0x558
[c000e025b720] c0397f10 .raw_sendmsg+0x770/0x860
[c000e025b860] c03a24f8 .inet_sendmsg+0x7c/0xa8
[c000e025b900] c033031c .sock_sendmsg+0x114/0x1b8
[c000e025bb00] c0331878 .sys_sendmsg+0x218/0x2ac
[c000e025bd20] c0356314 .compat_sys_sendmsg+0x14/0x28
[c000e025bd90] c0357914 .compat_sys_socketcall+0x1e4/0x214
[c000e025be30] c00086a4 syscall_exit+0x0/0x40
--- Exception: c00 (System Call) at 07f03c98
SP (ffb6e570) is in userspace
1:mon

I looked at the skb and confirmed that this was indeed against bond0. One thing is apparent at this point: ping is continuing even though bond_release() for ib1 (and of course ib0) occurred way back! This is the reason for the crash. Any suggestions as to how to fix this?

Pradeep
Re: [ewg] EWG/OFED meeting minutes for Sep 21, 09
ewg-boun...@lists.openfabrics.org wrote on 09/21/2009 09:30:45 AM:

EWG/OFED meeting minutes for Sep 21, 2009

Meeting summary:
* This was a short meeting on OFED 1.5 status
* We wish to have OFED 1.5 RC1 this week
* Must for RC1 is to resolve all compilation issues

Compilation that must be resolved:
1. PPC64 fails on SLES11 on NFS/RDMA

Tziporet,

We created two bugs.
For the Sles11 build failure: https://bugs.openfabrics.org/show_bug.cgi?id=1746
And for Rhel5.4: https://bugs.openfabrics.org/show_bug.cgi?id=1747

Both these failures were on ppc64 machines.

As an aside, do we continue using the ewg mailing list for OFED issues? Are there any plans to move to a new mailing list similar to linux-rdma?

Thanks!
Pradeep
prad...@us.ibm.com
Re: [ewg] ipv6 support in rping
Tziporet, Vlad,

Who will be able to help us with this? We need to include the correct level of librdmacm. Is it reasonable to expect that this will get done before the next beta release?

Thanks!
Pradeep
prad...@us.ibm.com

ewg-boun...@lists.openfabrics.org wrote on 09/17/2009 08:54:42 AM:

On Thu, 2009-09-17 at 08:10 +0300, Or Gerlitz wrote:

David J. Wilder wrote:

I am not finding support for ipv6 in rping in the 1.5 beta. What is the story for ipv6 support? Is it supported by librdma and missing in rping? Is ipv6 in rping planned?

rping supports IPv6 since last year, see the below commit

Or.

commit 267c28a2f03b8fb63fa9907badd4130c710a1305
Author: Aleksey Senin aleks...@voltaire.com
Date: Thu Aug 14 08:01:58 2008 -0700

    rping: add ipv6 support

    Signed-off-by: Aleksey Senin aleks...@voltaire.com
    Signed-off-by: Sean Hefty sean.he...@intel.com

Hmm, that explains it. librdma in 1.5 contains librdmacm-1.0.8.tar.gz dated 31-Jul-2008, two weeks before the change was checked in. librdma in 1.5 needs to be updated.

From rping.c in the 1.5 source:

static int get_addr(char *dst, struct sockaddr_in *addr)
{
        struct addrinfo *res;
        int ret;

        ret = getaddrinfo(dst, NULL, NULL, &res);
        if (ret) {
                printf("getaddrinfo failed - invalid hostname or IP address\n");
                return ret;
        }

        if (res->ai_family != PF_INET) {
                ret = -1;
                goto out;
        }

        *addr = *(struct sockaddr_in *) res->ai_addr;
out:
        freeaddrinfo(res);
        return ret;
}
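The quoted get_addr() rejects any getaddrinfo() result whose family is not PF_INET, which is why IPv6 addresses fail with that level of rping. As a point of comparison only, the Python sketch below shows a family-agnostic lookup through the same getaddrinfo() interface. It is not the librdmacm change from the commit above, and the function name resolve is made up for this illustration.

```python
# Illustrative only: family-agnostic name resolution, in contrast to the
# PF_INET-only check in the quoted get_addr(). Not the librdmacm fix itself.
import socket


def resolve(host, flags=0):
    """Return (family, address) of the first getaddrinfo() result for `host`."""
    results = socket.getaddrinfo(host, None, socket.AF_UNSPEC,
                                 socket.SOCK_STREAM, 0, flags)
    family, _type, _proto, _canon, sockaddr = results[0]
    return family, sockaddr[0]


if __name__ == "__main__":
    # AI_NUMERICHOST keeps the example offline: literals only, no DNS lookup.
    for literal in ("127.0.0.1", "::1"):
        family, address = resolve(literal, socket.AI_NUMERICHOST)
        print(literal, family.name, address)
```

Passing AF_UNSPEC leaves the choice of family to the resolver, so the caller has to be prepared to handle both sockaddr_in and sockaddr_in6 style results.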
Re: [ewg] OFED 1.5 beta status
I am still seeing the following problem trying to install today's OFED-1.5 build on Sles11 (ppc64):

gcc -m64 -Wp,-MD,/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/net/sunrpc/.svc.o.d -nostdinc -isystem /usr/lib64/gcc/powerpc64-suse-linux/4.3/include -D__KERNEL__ \
-D__OFED_BUILD__ \
-include include/linux/autoconf.h \
-include /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include/linux/autoconf.h \
-I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/kernel_addons/backport/2.6.27_sles11/include/ \
-I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/include \
-I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/debug \
-I/usr/local/include/scst \
-I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/ulp/srpt \
-I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/net/cxgb3 \
-Iinclude \
-Iinclude2 -I/usr/src/linux-2.6.27.19-5/include \
-I/usr/src/linux-2.6.27.19-5/arch/powerpc/include \
-Iarch/powerpc -I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/net/sunrpc -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -Werror-implicit-function-declaration -Os -msoft-float -pipe -I/usr/src/linux-2.6.27.19-5/arch/powerpc -Iarch/powerpc -mminimal-toc -mtraceback=none -mcall-aixdesc -mcpu=power4 -mtune=cell -mno-altivec -mno-spe -funit-at-a-time -mno-string -Wa,-maltivec -fno-stack-protector -fomit-frame-pointer -g -Wdeclaration-after-statement -Wno-pointer-sign -DMODULE -DKBUILD_STR(s)=#s -DKBUILD_BASENAME=KBUILD_STR(svc) -DKBUILD_MODNAME=KBUILD_STR(sunrpc) -DDEBUG_HASH=27 -DDEBUG_HASH2=2 -c -o /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/net/sunrpc/.tmp_svc.o /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/net/sunrpc/svc.c
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/net/sunrpc/svc.c: In function ‘svc_pool_map_set_cpumask’:
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/net/sunrpc/svc.c:320: error: implicit declaration of function ‘_node_to_cpumask_ptr’
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/net/sunrpc/svc.c:320: warning: passing argument 2 of ‘set_cpus_allowed_ptr’ makes pointer from integer without a cast
make[5]: *** [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/net/sunrpc/svc.o] Error 1
make[4]: *** [/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/net/sunrpc] Error 2
make[3]: *** [_module_/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5] Error 2
make[2]: *** [sub-make] Error 2
make[1]: *** [all] Error 2
make[1]: Leaving directory `/usr/src/linux-2.6.27.19-5-obj/ppc64/ppc64'
make: *** [kernel] Error 2
error: Bad exit status from /var/tmp/rpm-tmp.81065 (%build)

We did get past the mthca build failure, but this error occurs when I try to install all packages. However, an HPC installation succeeded.

Thanks
Pradeep
prad...@us.ibm.com

From: Jack Morgenstein ja...@dev.mellanox.co.il
Sent by: ewg-boun...@lists.openfabrics.org
Date: 09/12/2009 11:46 PM
To: Alexander Schmidt al...@linux.vnet.ibm.com
cc: Hoang-Nam Nguyen hngu...@linux.vnet.ibm.com, Christoph Raisch rai...@de.ibm.com, Stefan Roscher stefan.rosc...@de.ibm.com, ewg@lists.openfabrics.org
Subject: Re: [ewg] OFED 1.5 beta status

On Thursday 10 September 2009 19:11, Alexander Schmidt wrote:

The following change fixes the issue for me, and it did not break other parts of the stack, could someone review this? Thanks

Index: ofa_kernel-1.5/kernel_addons/backport/2.6.27_sles11/include/linux/cpumask.h
===
--- ofa_kernel-1.5.orig/kernel_addons/backport/2.6.27_sles11/include/linux/cpumask.h
+++ ofa_kernel-1.5/kernel_addons/backport/2.6.27_sles11/include/linux/cpumask.h
@@ -3,7 +3,6 @@
 #include_next <linux/cpumask.h>
 #include <asm/percpu.h>
-#include <asm/topology.h>
 #define cpumask_of(cpu) (get_cpu_mask(cpu))
 #define cpumask_of_node(node) (_node_to_cpumask_ptr(node))

Alex,

Thanks for sending the fix!

-Jack
Re: [ewg] OFED 1.5 beta status
ewg-boun...@lists.openfabrics.org wrote on 09/10/2009 01:23:08 AM:

On Wed, 9 Sep 2009 16:47:14 +0300 Tziporet Koren tzipo...@mellanox.co.il wrote:

Hi, I wish to update all that we plan to release OFED 1.5 beta tomorrow. I know it's a week later than we planned, but we waited until all modules at least pass compilation on all supported OSes before publishing the beta.

Hi, the mthca driver does not compile yet on SLES11 on powerpc. I've already reported this twice.

-I/usr/src/linux-2.6.27.23-0.1/arch/powerpc/include \
-Iarch/powerpc -I/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/hw/mthca \
-Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common \
-Werror-implicit-function-declaration -fwrapv -Os -msoft-float -pipe \
-I/usr/src/linux-2.6.27.23-0.1/arch/powerpc -Iarch/powerpc -mminimal-toc \
-mtraceback=none -mcall-aixdesc -mcpu=power4 -mtune=cell -mno-altivec -mno-spe \
-funit-at-a-time -mno-string -Wa,-maltivec -fno-stack-protector -fomit-frame-pointer -g \
-Wdeclaration-after-statement -Wno-pointer-sign -fwrapv -DMODULE -DKBUILD_STR(s)=#s \
-DKBUILD_BASENAME=KBUILD_STR(mthca_eq) -DKBUILD_MODNAME=KBUILD_STR(ib_mthca) \
-DDEBUG_HASH=48 -DDEBUG_HASH2=63 -c \
-o /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/hw/mthca/.tmp_mthca_eq.o \
/var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/hw/mthca/mthca_eq.c

In file included from /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/kernel_addons/backport/2.6.27_sles11/include/linux/cpumask.h:6,
from /usr/src/linux-2.6.27.23-0.1/include/linux/interrupt.h:9,
from /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.5/drivers/infiniband/hw/mthca/mthca_eq.c:35:
include2/asm/topology.h:75: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘dump_numa_cpu_topology’

I looked at the pre-processor output (used --save-temps). It looks like __init is indeed not recognized: in the pre-processor output file, linux/init.h is included after the occurrence of dump_numa_cpu_topology, so __init is still undefined when that declaration is parsed. Some change to the header file inclusion order must be causing this failure.

Pradeep prad...@us.ibm.com
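To illustrate why inclusion order matters here, the following is a minimal user-space sketch, not the kernel code. DEMO_INIT and demo_dump_topology are hypothetical names standing in for __init and dump_numa_cpu_topology, and DEMO_INIT expands to nothing, whereas the kernel's __init is assumed to expand to a section attribute.

```c
#include <assert.h>

/*
 * Hypothetical stand-in for the kernel's __init annotation.
 * It must be defined before the first declaration that uses it.
 */
#define DEMO_INIT

/*
 * Because DEMO_INIT is already defined, the preprocessor removes it and
 * the compiler sees an ordinary function definition.
 *
 * If the #define came only after this point, gcc would see two plain
 * identifiers in a row ("DEMO_INIT demo_dump_topology") and report an
 * error of the same form as the one in the build log:
 *   expected '=', ',', ';', 'asm' or '__attribute__'
 *   before 'demo_dump_topology'
 */
static int DEMO_INIT demo_dump_topology(void)
{
	return 0;
}
```

Moving the `#define DEMO_INIT` line below the function reproduces the class of failure; this is only an analogy for the header ordering problem described above.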
Re: [ewg] OFED-1.5 and RDMA_CM support for IPv6
Tziporet Koren wrote:

Pradeep Satyanarayana wrote:
Since the RDMA_CM with support for IPv6 was dropped from OFED-1.4 (and is now upstream), can one expect that it will be in OFED-1.5?

If it's in 2.6.30 then we already have it. If it's in 2.6.31 we will need to take the code. Can you let me know?

Hi Tziporet, I see it in 2.6.30, so it should be in OFED-1.5 then. Thanks!

Pradeep
[ewg] Re: OFED bug#1616
As mentioned on the OFED conference call, I downloaded yesterday's build and did confirm that the bug was fixed by running the Connectathon tests on a couple of ppc64 machines. The tests ran to completion without any problems. Steve, Jon, Thanks for your help in resolving the issues. Pradeep prad...@us.ibm.com
Re: [ewg] ibv_devinfo on 1.4.1-rc4 shows only 1 port on HCA, two HCAs installed
Hello Mike, Could this be a firmware issue? I presume you are using ConnectX, is that correct? We have not seen this with RHEL 5.3/OFED-1.4.1 rc4 in our tests.

Pradeep prad...@us.ibm.com

Mike Aho/Rochester/IBM wrote on 05/04/2009 12:06 PM:

We have 1.4.1-rc4 installed on RHEL 5. We have two Mellanox DDR two-port cards installed. ibv_devinfo only shows the info on the first port of each HCA when run. Please advise if this is a change in how ibv_devinfo works. If not, any info I need to send along as a bugzilla? Thanks. Mike Aho
[ewg] Bonding fail over not working
I downloaded a recent version of Roland's git tree and tried IPoIB bonding. Failover does not seem to be working at all. I have tried OFED 1.3.2 on a RHEL 5 derivative, and failover worked as expected there. Is this a known issue? Given that OFED 1.4 will be in sync with the mainline kernel, is this an issue to be addressed in OFED 1.4 too? Has anyone else tried this out recently? My impression is that all bonding patches were already upstream. Pradeep
Re: [ewg] Re: [PATCH]IPOIB/CM fix for bug# 906 -OFED-1.3
Eli Cohen wrote:

On Sun, 2008-02-17 at 11:21 +0200, Or Gerlitz wrote:
Thanks, this sheds more light on the solution, but I still cannot understand how the upstream code can live without the QPs getting destroyed. Or does the bug exist there also? If yes, I would recommend reshaping the change-log so that it explains the problem and solution well, and then resending to Roland.

Pradeep, I think it's your call.

Hello Eli, I have already submitted this patch to mainline. I will follow up with Roland to get this merged there.

Pradeep
Re: [ewg] Re: [PATCH]IPOIB/CM fix for bug# 906 -OFED-1.3
Roland Dreier wrote:

Hello Eli, I have already submitted this patch to mainline. I will follow up with Roland to get this merged there.

I didn't see the submission... can you resend?

Roland, Here is the link to the patch sent previously: http://lists.openfabrics.org/pipermail/general/2008-February/046463.html

Pradeep
Re: [ewg] Re: [PATCH]IPOIB/CM fix for bug# 906 -OFED-1.3
Roland Dreier wrote:

Here is the link to the patch sent previously: http://lists.openfabrics.org/pipermail/general/2008-February/046463.html

OK, applied, although that link points to an HTML-mangled version of the patch, and I also had to figure out why we needed that change and write the patch description myself.

Roland, Thanks for your help.

Pradeep
[ewg] [PATCH] IPOIB/CM Increase retry counts for OFED-1.3
This patch changes the retry counts to small values. This helps interoperability between ehca and mthca. Without this patch I had seen send completion errors. Or Gerlitz has started a thread on the general mailing list and the complete discussion will be available there. This is the second part of the patch submitted yesterday and is split up as per Eli's request.

Signed-off-by: Pradeep Satyanarayana [EMAIL PROTECTED]
---
--- ofa_kernel-1.3_a/drivers/infiniband/ulp/ipoib/ipoib_cm.c	2008-02-12 17:46:03.0 -0500
+++ ofa_kernel-1.3_b/drivers/infiniband/ulp/ipoib/ipoib_cm.c	2008-02-12 17:46:58.0 -0500
@@ -1016,8 +1016,8 @@ static int ipoib_cm_send_req(struct net_
 	req.responder_resources = 4;
 	req.remote_cm_response_timeout = 20;
 	req.local_cm_response_timeout = 20;
-	req.retry_count = 0; /* RFC draft warns against retries */
-	req.rnr_retry_count = 0; /* RFC draft warns against retries */
+	req.retry_count = 3;
+	req.rnr_retry_count = 3;
 	req.max_cm_retries = 15;
 	req.srq = ipoib_cm_has_srq(dev);
 	return ib_send_cm_req(id, &req);
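As a rough illustration of the change, here is a self-contained sketch that sets both retry fields with a range check. The struct and function names are hypothetical and are not the ib_cm API; the upper bound of 7 assumes that both counts are carried in 3-bit fields, which is my understanding of the CM REQ format rather than something stated in this thread.

```c
#include <assert.h>

/* Assumption: both retry counts are 3-bit fields, so valid values are 0..7. */
#define DEMO_RETRY_MAX 7

/* Hypothetical subset of a connection request; not the kernel structure. */
struct demo_cm_req {
	unsigned int retry_count;
	unsigned int rnr_retry_count;
};

/* Store the retry counts, rejecting values that would not fit the field. */
static int demo_set_retries(struct demo_cm_req *req,
			    unsigned int retry, unsigned int rnr_retry)
{
	if (retry > DEMO_RETRY_MAX || rnr_retry > DEMO_RETRY_MAX)
		return -1;
	req->retry_count = retry;
	req->rnr_retry_count = rnr_retry;
	return 0;
}
```

With the values from the patch, `demo_set_retries(&req, 3, 3)` succeeds, while a value such as 8 is rejected.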
[ewg] [PATCH]IPOIB/CM fix for bug# 906 -OFED-1.3
This patch fixes "fail to destroy ipoib rx QP" (https://bugs.openfabrics.org/show_bug.cgi?id=906). Hence the usecnt issue reported previously on ehca is solved, and the qp can now be destroyed. As per Eli's request, I am splitting up the patches. This is the first portion of yesterday's patch. Tested on ppc64 machines with ehca and mthca.

Signed-off-by: Pradeep Satyanarayana [EMAIL PROTECTED]
---
--- ofa_kernel-1.3_a/drivers/infiniband/ulp/ipoib/ipoib_cm.c	2008-02-11 14:28:47.0 -0500
+++ ofa_kernel-1.3_b/drivers/infiniband/ulp/ipoib/ipoib_cm.c	2008-02-12 17:44:07.0 -0500
@@ -883,9 +883,9 @@ void ipoib_cm_dev_stop(struct net_device
 			/*
 			 * assume the HW is wedged and just free up everything.
 			 */
-			list_splice_init(&priv->cm.rx_flush_list, &list);
-			list_splice_init(&priv->cm.rx_error_list, &list);
-			list_splice_init(&priv->cm.rx_drain_list, &list);
+			list_splice_init(&priv->cm.rx_flush_list, &priv->cm.rx_reap_list);
+			list_splice_init(&priv->cm.rx_error_list, &priv->cm.rx_reap_list);
+			list_splice_init(&priv->cm.rx_drain_list, &priv->cm.rx_reap_list);
 			break;
 		}
 		spin_unlock_irq(&priv->lock);
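The effect of the three list_splice_init() calls can be sketched in user space with a minimal circular doubly linked list. This is an illustration only: the demo_* names are hypothetical, the list code is a simplified imitation of the kernel's list helpers, and the idea that entries on the reap list are the ones that later get their QPs destroyed is taken from the patch description rather than shown here.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modelled loosely on the kernel's. */
struct demo_list {
	struct demo_list *next, *prev;
};

static void demo_list_init(struct demo_list *h)
{
	h->next = h;
	h->prev = h;
}

static int demo_list_empty(const struct demo_list *h)
{
	return h->next == h;
}

static void demo_list_add_tail(struct demo_list *n, struct demo_list *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

/* Move every entry of src to the front of dst, then leave src empty. */
static void demo_list_splice_init(struct demo_list *src, struct demo_list *dst)
{
	struct demo_list *first, *last, *at;

	if (demo_list_empty(src))
		return;
	first = src->next;
	last = src->prev;
	at = dst->next;
	first->prev = dst;
	dst->next = first;
	last->next = at;
	at->prev = last;
	demo_list_init(src);
}

static size_t demo_list_count(const struct demo_list *h)
{
	size_t n = 0;
	const struct demo_list *p;

	for (p = h->next; p != h; p = p->next)
		n++;
	return n;
}

/*
 * Mimic the stop path: one pending entry on each of the flush, error and
 * drain lists is moved onto a single reap list. Returns the number of
 * entries that end up on the reap list.
 */
static size_t demo_stop_path(void)
{
	struct demo_list flush, error, drain, reap;
	struct demo_list a, b, c;

	demo_list_init(&flush);
	demo_list_init(&error);
	demo_list_init(&drain);
	demo_list_init(&reap);
	demo_list_add_tail(&a, &flush);
	demo_list_add_tail(&b, &error);
	demo_list_add_tail(&c, &drain);

	demo_list_splice_init(&flush, &reap);
	demo_list_splice_init(&error, &reap);
	demo_list_splice_init(&drain, &reap);

	return demo_list_count(&reap);
}
```

After the splices, all three source lists are empty and every entry sits on the one list that the cleanup code is expected to walk.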
Re: [ewg] Re: [PATCH] IPOIB/CM fixes for issues seen in OFED-1.3
Or Gerlitz wrote:

Eli Cohen wrote:
could you send as distinct patches according to what they fix?

Pradeep Satyanarayana wrote:
2. Change retry counts to small values. This helps interoperability between ehca and mthca.

Indeed, I sent a note on that now to the general list. Let's discuss it there, since it's an architectural issue which needs to be addressed correctly.

Hello Or, Sure, that is a good idea. I had proposed that (more than 6 months ago) on the general list but there was no response. This may be a good time to restart the conversation there.

Pradeep
[ewg] Re: [PATCH] IPOIB/CM fixes for issues seen in OFED-1.3
Eli Cohen wrote:

Pradeep, could you send as distinct patches according to what they fix? Thanks.

Hello Eli, Sure, I will do that. And I will drop the change due to the UD split CQ.

Pradeep
Re: [ofa-general] Re: [ewg] OFED 1.3 rc4 update
Eli Cohen wrote:

This problem was seen on an ehca that supports SRQ. Please reply: how many scatter entries does ehca support when working in SRQ mode? Also, any piece of info I might need to try and mimic ehca behaviour on Mellanox devices. I would appreciate it if you could repeat the exact sequence of actions you do to reproduce this.

Hello Eli, Ehca supports fewer than 16 s/g entries, hence the srq patch addresses that issue. The sequence of steps that I followed for the touch test:

1. On a freshly booted system, configure ib0 and assign an IP address
2. Switch to connected mode and change mtu
3. ping remote ib interface (already in CM mode)
4. modprobe -r ib_ehca

I see a series of cascading failures in /var/log/messages, starting with the issue of not being able to destroy the cq (specifically rcq).

Pradeep
Re: [ofa-general] Re: [ewg] OFED 1.3 rc4 update
Tziporet Koren wrote:

Shirley Ma wrote:
Thanks Tziporet. We will test it right after it's out.

You can start using the latest build: http://www.openfabrics.org/builds/ofed-1.3/OFED-1.3-20080206-0751.tgz

Tziporet

I have downloaded today's build mentioned above. I am still seeing the issue of failing ib_destroy_cq() for the rcq mentioned yesterday. Here are the steps that I follow:

1. On a freshly booted system configure ib0
2. Switch to connected mode (on HCA that supports SRQ)
3. ping remote interface
4. modprobe -r ib_ehca
5. I see the failures about ib_destroy_cq() failing and the cascading failures following that (srq and pd cannot be destroyed)
6. If I try a modprobe ib_ehca I get the error "Cannot allocate memory"

This also means something is chewing tons of memory. I realize that the qp and associated pd were not freed, so some memory is lost. However, this system has 8 GB of memory.

Pradeep
Re: [ofa-general] Re: [ewg] OFED 1.3 rc4 update
Pradeep Satyanarayana wrote:

Tziporet Koren wrote:

Shirley Ma wrote:
Thanks Tziporet. We will test it right after it's out.

You can start using the latest build: http://www.openfabrics.org/builds/ofed-1.3/OFED-1.3-20080206-0751.tgz

Tziporet

I have downloaded today's build mentioned above. I am still seeing the issue of failing ib_destroy_cq() for the rcq mentioned yesterday. Here are the steps that I follow:

1. On a freshly booted system configure ib0
2. Switch to connected mode (on HCA that supports SRQ)
3. ping remote interface
4. modprobe -r ib_ehca
5. I see the failures about ib_destroy_cq() failing and the cascading failures following that (srq and pd cannot be destroyed)

The ib_destroy_qp() fails because the refcnt is not zero. On my system it was set to 2.

Pradeep
[ewg] Re: [ofa-general] Oops with today's OFED 1.3
Pradeep Satyanarayana wrote:

Eli Cohen wrote:
Pradeep, Can you check if this is resolved?

On 2/4/08, Pradeep Satyanarayana [EMAIL PROTECTED] wrote:
I pulled today's (Feb 4th) OFED build and saw the following Oops while touch testing on ehca1 on a 2.6.24 kernel.

snip

NIP [d0299ca8] .ipoib_cm_dev_init+0x440/0x63c [ib_ipoib]
LR [d0299a70] .ipoib_cm_dev_init+0x208/0x63c [ib_ipoib]
Call Trace:
[c001cc85f630] [d0299a70] .ipoib_cm_dev_init+0x208/0x63c [ib_ipoib] (unreliable)
[c001cc85f7d0] [d0297f4c] .ipoib_transport_dev_init+0x120/0x458 [ib_ipoib]
[c001cc85f930] [d029463c] .ipoib_ib_dev_init+0x44/0xb8 [ib_ipoib]
[c001cc85f9c0] [d02902ec] .ipoib_dev_init+0xe0/0x138 [ib_ipoib]
[c001cc85fa60] [d0290544] .ipoib_add_one+0x200/0x424 [ib_ipoib]
[c001cc85fb20] [d01610e4] .ib_register_client+0x94/0xf4 [ib_core]
[c001cc85fbb0] [d029dcac] .ipoib_init_module+0x1f8/0x246c [ib_ipoib]
[c001cc85fc70] [c00905f0] .sys_init_module+0x176c/0x187c
[c001cc85fe30] [c000852c] syscall_exit+0x0/0x40
Instruction dump:
801f0f20 3b60 2f80 409d0040 e81f0f30 e97f04f0 7b6926e4 395b0001
7d5b07b4 7c080214 816b0018 7d290214 9169002c 6000 6000 6000

Hello Eli, Yes, this particular issue has been solved. However, I do see some other issues. I am seeing some new messages (not seen previously) in dmesg relating to ib_cq_destroy() (on ehca):

ib0: ib_cq_destroy failed
ib_destroy_srq failed: -16
ib_dealloc_pd failed

This happens after some network tests and an rmmod of ib_ehca. At this point my guess is that this has to do with the split CQ patch. I have not had enough cycles to state that with absolute certainty. Can you please take a look too?

Pradeep

I looked at this some more. This error occurs because ib_cq_destroy() for rcq failed. After that there is a series of cascading failures.

Pradeep
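The cascade can be pictured with a small reference-counting sketch. It assumes, as the -16 in the log suggests (EBUSY is 16 on Linux), that a destroy call refuses to proceed while other objects still hold references. The demo_* types and functions are hypothetical and are not the ib_core implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical completion queue with a use count of attached QPs. */
struct demo_cq {
	int usecnt;
};

/* Hypothetical queue pair that holds a reference on its receive CQ. */
struct demo_qp {
	struct demo_cq *recv_cq;
};

static void demo_create_qp(struct demo_qp *qp, struct demo_cq *cq)
{
	qp->recv_cq = cq;
	cq->usecnt++;
}

static void demo_destroy_qp(struct demo_qp *qp)
{
	qp->recv_cq->usecnt--;
	qp->recv_cq = NULL;
}

/* Refuse to destroy a CQ that is still referenced by some QP. */
static int demo_destroy_cq(struct demo_cq *cq)
{
	if (cq->usecnt != 0)
		return -EBUSY;
	return 0;
}

/*
 * Tear down one CQ and one QP. When leak_qp is non-zero the QP is never
 * destroyed, so the CQ destroy fails, which is the start of the cascade.
 */
static int demo_teardown(int leak_qp)
{
	struct demo_cq cq = { 0 };
	struct demo_qp qp;

	demo_create_qp(&qp, &cq);
	if (!leak_qp)
		demo_destroy_qp(&qp);
	return demo_destroy_cq(&cq);
}
```

In this model a single QP that is never destroyed is enough to make the CQ destroy fail, and anything that depends on the CQ going away would then fail in turn.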
Re: [ewg] [Fwd: Re: non SRQ patch for OFED 1.3] -need some help
Pradeep, Shir

We tried to apply this patch for OFED 1.3 and it breaks some of the backports. Please use the makedist script on the ofa server (there is an explanation in the developers Wiki) and fix this so we can try to apply it. Vlad will help you later today too.

Thanks, Tziporet

Pradeep, I added your patch (kernel_patches/fixes/ipoib_0200_non_srq.patch) and fixed the backport issue (ipoib_0100_to_2.6.21.patch). Please check if ofed_1_3/linux-2.6.git ofed_kernel is ok.

Hello Vladimir, I downloaded it and tried it on a 2.6.24 kernel (Sles10Sp2b1) and it compiled fine. I touch tested it and it looks okay too. Thanks for your help. However, when I tried on a 2.6.16.57-0.9-ppc64 (Sles10sp2b1), after running ofed_scripts/configure, the make failed as follows:

drivers/infiniband/core/addr.c: In function ‘addr_arp_recv’:
drivers/infiniband/core/addr.c:359: error: ‘struct sk_buff’ has no member named ‘nh’

This seems to be coming from the addr_1_netevents_revert_to_2_6_17.patch patch, which is completely unrelated to this patch. Is there a place where the steps in the build process are completely described? The Wiki at https://wiki.openfabrics.org/tiki-index.php?page=HOWTO%20Build%20OFED-1.3 is probably missing a few steps. It would help greatly if you could describe all the steps. Why is it that we see differing results?

Pradeep
[ewg] [Fwd: Re: non SRQ patch for OFED 1.3] -need some help
I tried running ofed_scripts/ofed_makedist.sh before and after copying my patch to kernel_patches/fixes. In both cases makedist.sh seems to complete without errors and creates the tar.gz files for the various kernels. In short, I am unable to reproduce the problem that Tziporet mentions. Any tips or pointers to resolve this issue would be appreciated. Thanks!

Pradeep

---BeginMessage---

Pradeep Satyanarayana wrote:
Some HCAs like ehca do not natively support srq. This patch would enable IPoIB CM for such HCAs. This patch has been accepted into Roland's for-2.6.25 git tree for about 3 months now. Please consider including this patch into OFED 1.3.

Pradeep, We tried to apply this patch for OFED 1.3 and it breaks some of the backports. Please use the makedist script on the ofa server (there is an explanation in the developers Wiki) and fix this so we can try to apply it. Vlad will help you later today too.

Thanks, Tziporet

---End Message---
[ewg] non SRQ patch for OFED 1.3
Some HCAs like ehca do not natively support srq. This patch would enable IPoIB CM for such HCAs. This patch has been accepted into Roland's for-2.6.25 git tree for about 3 months now. Please consider including this patch into OFED 1.3.

Signed-off-by: Pradeep Satyanarayana [EMAIL PROTECTED]
---
--- a/drivers/infiniband/ulp/ipoib/ipoib.h 2008-01-23 13:29:06.0 -0800
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h 2008-01-23 16:03:33.0 -0800
@@ -69,6 +69,7 @@ enum {
 	IPOIB_TX_RING_SIZE	= 64,
 	IPOIB_MAX_QUEUE_SIZE	= 8192,
 	IPOIB_MIN_QUEUE_SIZE	= 2,
+	IPOIB_CM_MAX_CONN_QP	= 4096,
 
 	IPOIB_NUM_WC		= 4,
 
@@ -188,10 +189,13 @@ enum ipoib_cm_state {
 struct ipoib_cm_rx {
 	struct ib_cm_id		*id;
 	struct ib_qp		*qp;
+	struct ipoib_cm_rx_buf	*rx_ring;
 	struct list_head	list;
 	struct net_device	*dev;
 	unsigned long		jiffies;
 	enum ipoib_cm_state	state;
+	int			index;
+	int			recv_count;
 };
 
 struct ipoib_cm_tx {
@@ -234,6 +238,7 @@ struct ipoib_cm_dev_priv {
 	struct ib_wc		ibwc[IPOIB_NUM_WC];
 	struct ib_sge		rx_sge[IPOIB_CM_RX_SG];
 	struct ib_recv_wr	rx_wr;
+	int			nonsrq_conn_qp;
 	int			max_cm_mtu;
 	int			num_frags;
 };
@@ -463,6 +468,8 @@ void ipoib_drain_cq(struct net_device *d
 /* We don't support UC connections at the moment */
 #define IPOIB_CM_SUPPORTED(ha) (ha[0] & (IPOIB_FLAGS_RC))
 
+extern int ipoib_max_conn_qp;
+
 static inline int ipoib_cm_admin_enabled(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
@@ -493,6 +500,12 @@ static inline void ipoib_cm_set(struct i
 	neigh->cm = tx;
 }
 
+static inline int ipoib_cm_has_srq(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	return !!priv->cm.srq;
+}
+
 void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_tx *tx);
 int ipoib_cm_dev_open(struct net_device *dev);
 void ipoib_cm_dev_stop(struct net_device *dev);
@@ -510,6 +523,8 @@ void ipoib_cm_handle_tx_wc(struct net_de
 struct ipoib_cm_tx;
 
+#define ipoib_max_conn_qp 0
+
 static inline int ipoib_cm_admin_enabled(struct net_device *dev)
 {
 	return 0;
@@ -535,6 +550,11 @@ static inline void ipoib_cm_set(struct i
 {
 }
 
+static inline int ipoib_cm_has_srq(struct net_device *dev)
+{
+	return 0;
+}
+
 static inline void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_tx *tx)
 {
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2008-01-23 13:29:06.0 -0800
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2008-01-23 16:46:47.0 -0800
@@ -39,6 +39,13 @@
 #include <linux/icmpv6.h>
 #include <linux/delay.h>
 
+int ipoib_max_conn_qp = 128;
+
+module_param_named(max_nonsrq_conn_qp, ipoib_max_conn_qp, int, 0444);
+MODULE_PARM_DESC(max_nonsrq_conn_qp,
+		 "Max number of connected-mode QPs per interface "
+		 "(applied only if shared receive queue is not available)");
+
 #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG_DATA
 static int data_debug_level;
 
@@ -81,7 +88,7 @@ static void ipoib_cm_dma_unmap_rx(struct
 		ib_dma_unmap_single(priv->ca, mapping[i + 1], PAGE_SIZE, DMA_FROM_DEVICE);
 }
 
-static int ipoib_cm_post_receive(struct net_device *dev, int id)
+static int ipoib_cm_post_receive_srq(struct net_device *dev, int id)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ib_recv_wr *bad_wr;
@@ -104,7 +111,33 @@ static int ipoib_cm_post_receive(struct
 	return ret;
 }
 
-static struct sk_buff *ipoib_cm_alloc_rx_skb(struct net_device *dev, int id, int frags,
+static int ipoib_cm_post_receive_nonsrq(struct net_device *dev,
+					struct ipoib_cm_rx *rx, int id)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ib_recv_wr *bad_wr;
+	int i, ret;
+
+	priv->cm.rx_wr.wr_id = id | IPOIB_OP_CM | IPOIB_OP_RECV;
+
+	for (i = 0; i < IPOIB_CM_RX_SG; ++i)
+		priv->cm.rx_sge[i].addr = rx->rx_ring[id].mapping[i];
+
+	ret = ib_post_recv(rx->qp, &priv->cm.rx_wr, &bad_wr);
+	if (unlikely(ret)) {
+		ipoib_warn(priv, "post recv failed for buf %d (%d)\n", id, ret);
+		ipoib_cm_dma_unmap_rx(priv, IPOIB_CM_RX_SG - 1,
+				      rx->rx_ring[id].mapping);
+		dev_kfree_skb_any(rx->rx_ring[id].skb);
+		rx->rx_ring[id].skb = NULL;
+	}
+
+	return ret;
+}
+
+static struct sk_buff *ipoib_cm_alloc_rx_skb(struct net_device *dev,
+					     struct ipoib_cm_rx_buf *rx_ring,
+					     int id, int frags,
 					     u64 mapping[IPOIB_CM_RX_SG
[ewg] Question regarding non srq patch for OFED 1.3
Some HCAs like ehca do not natively support srq. In order to enable IPoIB CM for such HCAs, I have developed a nonsrq patch. This patch has been accepted into Roland's for-2.6.25 git tree for about 3 months now. I am working on porting that to OFED 1.3 and it will take me at least several days to finish the port and test it. Is there a date by which I need to complete it for its inclusion into OFED 1.3? Pradeep
[ewg] Re: [PATCH] IPoIB/CM Enable SRQ support for HCAs with les than 16 s/g entries (in OFED 1.3)
Pradeep Satyanarayana wrote:

Some HCAs like ehca2 support fewer than 16 SG entries. Currently IPoIB/CM implicitly assumes all HCAs will support 16 SG entries of 4K pages for 64K MTUs. This patch removes that restriction. This patch continues to use order 0 allocations and enables implementation of connected mode on such HCAs with smaller MTUs. HCAs having the capability to support 16 SG entries are left untouched. A version of this patch has been integrated into Roland's for-2.6.25 git tree for a couple of weeks. Here is a back ported version of that patch (for OFED 1.3). Please consider for inclusion into OFED 1.3. This patch addresses bug# 728: https://bugs.openfabrics.org/show_bug.cgi?id=728

I am not sure if this eluded your attention. Would it be possible to merge this patch into OFED 1.3? The patch was inlined in a previous mail. Would you like me to resend?

Pradeep
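The arithmetic behind the 16-entry assumption can be written out as a short sketch. It is an illustration only: the function names are hypothetical, the 4K page size is taken from the text above, and any header room the driver reserves inside the buffer is ignored, so the numbers are upper bounds rather than the exact MTU the patch computes.

```c
#include <assert.h>

/* Page size assumed in the discussion above (4K pages). */
#define DEMO_PAGE_SIZE 4096u

/* Largest receive buffer that num_sge page-sized SG entries can describe. */
static unsigned int demo_max_rx_bytes(unsigned int num_sge)
{
	return num_sge * DEMO_PAGE_SIZE;
}

/* Number of page-sized fragments needed to hold a buffer of this size. */
static unsigned int demo_frags_for(unsigned int bytes)
{
	return (bytes + DEMO_PAGE_SIZE - 1) / DEMO_PAGE_SIZE;
}
```

Sixteen entries give 16 * 4096 = 65536 bytes, which matches the 64K case; an HCA limited to, say, 8 entries could describe at most 32768 bytes and would therefore need a smaller connected-mode MTU.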
[ewg] ConnectX support
I saw some specific known issues and limitations wrt ConnectX in OFED 1.2c. Is ConnectX officially supported in OFED 1.2c, or will that be OFED 1.3? Pradeep [EMAIL PROTECTED]