Re: rdma provider module references

2010-12-16 Thread Steve Wise

On 12/15/2010 11:09 AM, Roland Dreier wrote:

I notice that if I have a user rdma application running that has an
rdma connection using iw_cxgb3, then the iw_cxgb3 module reference
count is bumped and thus it cannot be unloaded.  However when I have
an NFSRDMA connection that utilizes iw_cxgb3, the module reference
count is not bumped, and iw_cxgb3 can erroneously be unloaded while
the NFSRDMA connection is still active, causing a crash.

What is supposed to happen is that as the HW driver is unloading, it
calls ib_unregister_device() first, and this calls each client's
.remove() method to have it release everything related to that device.

However I guess NFS/RDMA is behind the RDMA CM, which is supposed to
handle device removal.  In that code it seems to end up in
cma_process_remove(), which appears at first glance to do the right
things to destroy all connections etc.

The idea is that RDMA devices should be like net devices, ie you can
remove them even if they're in use -- things should just clean up,
rather than blocking the module removal.  The uverbs case is a bit of a
hack because we don't have a way to handle revoking the mmap regions
etc yet.

What goes wrong with NFS/RDMA in this scheme?  It looks like it should work.



Here's one stack.  From this I assume the offload connection was still active 
after iw_cxgb3 was unloaded...

Call Trace:
IRQ  [80037136] kref_get+0x38/0x3d
 [885fb5b1] :iw_cxgb3:sched+0x17/0x49
 [8824cf37] :cxgb3:process_rx+0x37/0x8b
 [8824a3e7] :cxgb3:process_responses+0xc09/0xc63
 [8824ac65] :cxgb3:napi_rx_handler+0x36/0xa4
 [8000c88a] net_rx_action+0xac/0x1e0
 [8824ac15] :cxgb3:t3_sge_intr_msix_napi+0x173/0x18d
 [80012409] __do_softirq+0x89/0x133
 [8005f2fc] call_softirq+0x1c/0x28
 [8006dba8] do_softirq+0x2c/0x85
 [8006da30] do_IRQ+0xec/0xf5
 [800575d0] mwait_idle+0x0/0x4a
 [8005e615] ret_from_intr+0x0/0xa
EOI  [80057606] mwait_idle+0x36/0x4a
 [800497be] cpu_idle+0x95/0xb8
 [80078997] start_secondary+0x498/0x4a7

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: rdma provider module references

2010-12-16 Thread Steve Wise

However I guess NFS/RDMA is behind the RDMA CM, which is supposed to

handle device removal.  In that code it seems to end up in
cma_process_remove(), which appears at first glance to do the right
things to destroy all connections etc.



Function cma_process_remove() calls cma_remove_id_dev() for each cm_id bound to the device being removed.  Function 
cma_remove_id_dev() calls the event handler function for each cm_id and passes an RDMA_CM_EVENT_DEVICE_REMOVAL event.  
The NFSRDMA server marks the RPC transport as XPT_CLOSE, but doesn't immediately destroy the cm_id in the event handler 
function.  This is in net/sunrpc/xprtrdma/svc_rdma_transport.c / rdma_cma_handler().  That's the issue, methinks.  Each 
RDMA kernel user must destroy all of its resources in the event handler function itself; these cannot be scheduled or 
deferred in any way given the current design.



Steve.


RE: Intel NetEffect NE020 E10G81GP use

2010-12-16 Thread Andrea Gozzelino
Hi,

I wrote about RDMA over TCP/IP (iWARP).
Today I installed mvapich2-1.6rc1 and openmpi-1.4.3 on my machine.

I don't know much about MPI code.
I just tried to run the hello file from the examples folder.

[r...@redigo-01 examples]# ls
connectivity_c.c hello_c.c hello_f77.f Makefile README ring_cxx.cc
ring_f90.f90
hello hello_cxx.cc hello_f90.f90 Makefile.include ring_c.c ring_f77.f
[r...@redigo-01 examples]# pwd
/root/openmpi-1.4.3/examples

COMMAND LINE:
A) [r...@redigo-01 examples]# mpiexec -n 6
/root/openmpi-1.4.3/examples/cpi
problem with execution of /root/openmpi-1.4.3/examples/cpi on
redigo-01.lnl.infn.it: [Errno 2] No such file or directory

B) [r...@redigo-01 examples]# mpirun rsh redigo-01.lnl.infn.it_59240
hello
redigo-01.lnl.infn.it_59240: host unknown
redigo-01.lnl.infn.it_59240: host unknown
trying normal rsh (/usr/bin/rsh)
redigo-01.lnl.infn.it_59240: Unknown host


I wrote the mod.conf file (as root) and put it with the correct
permissions in the /etc directory.

I don't understand where my error is.

Could you help me with these first steps in MPI?

Thank you very much.
Regards,
Andrea



On Dec 15, 2010 10:45 PM, Jaszcza, Andrzej andrzej.jasz...@intel.com
wrote:

 Hi Andrea,
 
 I think you meant RDMA over TCP/IP (iWARP) and not RDMA over Converged
 Ethernet (RoCE), as Intel NetEffect cards do not support the latter.
 You may safely run any MPI benchmark using MVAPICH2, Open MPI,
 or Intel MPI. There is more information on iWARP, Intel NetEffect
 cards, and their possible uses on the Intel website:
 http://www.intel.com/technology/comms/iwarp/index.htm.
 
 Regards,
 Andrzej
 
 -Original Message-
 From: linux-rdma-ow...@vger.kernel.org
 [mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Andrea
 Gozzelino
 Sent: Wednesday, December 15, 2010 11:08 AM
 To: linux-rdma@vger.kernel.org
 Subject: Intel NetEffect NE020 E10G81GP use
 Importance: High
 
 Hi all,
 
 I have tested two Intel NetEffect NE020 E10G81GP cards with RDMA over
 Ethernet; the latency and bandwidth results are OK.
 Are there other protocols supported by the cards that offer the same
 performance (TCP offload, storage, MPI)?
 Sockets Direct Protocol (SDP) does not reach that performance.
 Any suggestions for using the cards at high performance?
 What are the benefits of using these cards in daily storage and
 data-movement operations?
 
 Thank you.
 Best regards,
 
 Andrea Gozzelino
 
 INFN - Laboratori Nazionali di Legnaro(LNL)
 Viale dell'Universita' 2 -I-35020 - Legnaro (PD)- ITALIA
 Office: E-101
 Tel: +39 049 8068346
 Fax: +39 049 641925
 Mail: andrea.gozzel...@lnl.infn.it
 Cell: +39 3488245552  
 





RE: Intel NetEffect NE020 E10G81GP use

2010-12-16 Thread Tung, Chien Tin
 [r...@redigo-01 examples]# ls
 connectivity_c.c hello_c.c hello_f77.f Makefile README ring_cxx.cc
 ring_f90.f90
 hello hello_cxx.cc hello_f90.f90 Makefile.include ring_c.c ring_f77.f
 [r...@redigo-01 examples]# pwd
 /root/openmpi-1.4.3/examples
 
 COMMAND LINE:
 A) [r...@redigo-01 examples]# mpiexec -n 6
 /root/openmpi-1.4.3/examples/cpi
 problem with execution of /root/openmpi-1.4.3/examples/cpi on
 redigo-01.lnl.infn.it: [Errno 2] No such file or directory

You don't have cpi in the /root/openmpi-1.4.3/examples directory.  Try running 
hello instead.
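For example, building and running the C hello program from that directory might look like this (assuming Open MPI's wrapper compiler is on $PATH; these are the standard commands, not taken from the thread):

```shell
cd /root/openmpi-1.4.3/examples
mpicc -o hello_c hello_c.c    # build the C example with Open MPI's wrapper
mpirun -np 6 ./hello_c        # launch 6 ranks on the local host
```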

Chien




ibv_get_cq_event blocking forever after successful ibv_post_send...

2010-12-16 Thread 塞尔鱼dwyane
 Mine is a simple client/server mode program.
 The client keeps sending messages to the server; after a certain number
 of sends, the server just blocks in ibv_get_cq_event.
 The client receives no reply and keeps waiting.  If I manually stop the
 client, the server gets a completion with wc.status == IBV_WC_WR_FLUSH_ERR.
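No code was posted, but one common cause of exactly this symptom is arming the completion queue only once: ibv_req_notify_cq() must be called again after every event, or ibv_get_cq_event() blocks forever once the single armed notification has fired. The usual loop (error handling trimmed; requires libibverbs and a configured channel/CQ) looks like:

```c
#include <infiniband/verbs.h>

static int cq_event_loop(struct ibv_comp_channel *channel, struct ibv_cq *cq)
{
    struct ibv_cq *ev_cq;
    void *ev_ctx;
    struct ibv_wc wc;

    if (ibv_req_notify_cq(cq, 0))          /* arm BEFORE the first wait */
        return -1;

    for (;;) {
        if (ibv_get_cq_event(channel, &ev_cq, &ev_ctx))
            return -1;
        ibv_ack_cq_events(ev_cq, 1);
        if (ibv_req_notify_cq(ev_cq, 0))   /* re-arm for the NEXT event */
            return -1;
        /* Drain after re-arming: completions that arrive between the
         * last poll and the re-arm would otherwise be missed. */
        while (ibv_poll_cq(ev_cq, 1, &wc) > 0) {
            /* handle wc here; always check wc.status */
        }
    }
}
```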