Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

2007-10-10 Thread Sean Hefty
The hack to use a socket and bind it to claim the port was just for 
demonstrating the idea.  The correct solution, IMO, is to enhance the 
core low level 4-tuple allocation services to be more generic (eg: not 
be tied to a struct sock).  Then the host tcp stack and the host rdma 
stack can allocate TCP/iWARP ports/4tuples from this common exported 
service and share the port space.  This allocation service could also be 
used by other deep adapters like iscsi adapters if needed.


Since iWarp runs on top of TCP, the port space is really the same. 
FWIW, I agree that this proposal is the correct solution to support iWarp.
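
To make the shared-allocator idea concrete, here is a minimal userspace sketch.  It is only an illustration under stated assumptions: port_space, port_reserve() and port_release() are hypothetical names, not kernel interfaces, and a real service would track full 4-tuples rather than bare port numbers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define PORT_COUNT 65536

/* Hypothetical owner tags: which stack holds a port in the shared space. */
enum port_owner { OWNER_NONE = 0, OWNER_HOST_TCP, OWNER_RDMA };

/* One table shared by every consumer, not tied to any socket structure. */
struct port_space {
    unsigned char owner[PORT_COUNT];
};

static void port_space_init(struct port_space *ps)
{
    assert(ps != NULL);
    memset(ps->owner, OWNER_NONE, sizeof ps->owner);
}

/* Reserve a port for one consumer; fails if any consumer already holds it. */
static bool port_reserve(struct port_space *ps, unsigned short port,
                         enum port_owner who)
{
    if (who == OWNER_NONE || ps->owner[port] != OWNER_NONE)
        return false;
    ps->owner[port] = (unsigned char)who;
    return true;
}

/* Release a port; only the consumer that reserved it may release it. */
static bool port_release(struct port_space *ps, unsigned short port,
                         enum port_owner who)
{
    if (who == OWNER_NONE || ps->owner[port] != who)
        return false;
    ps->owner[port] = OWNER_NONE;
    return true;
}
```

Because both stacks go through the same table, a port reserved by the RDMA side cannot also be handed to the host TCP side, which is the conflict the proposal is meant to prevent.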


- Sean
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24

2007-10-02 Thread Sean Hefty
>Umm... this is a difficult situation for me to merge the changes then.
>We're changing the CM retry behavior blind here.  How do we know that
>the MRA changes don't make the scalability issue worse?

What's currently upstream doesn't work for Intel MPI on our larger clusters.
The connection requests time out on the active side before the passive side can
respond.

The OFED release works because it provides a kernel patch to make the timeout a
module parameter.  I'm trying to avoid adding a module parameter, and the MRA is
designed for this situation.

I tested this by simulating a slow passive side responder, and it worked as
expected for those tests.  Using an MRA does add another MAD to the CM exchange,
which is why it is sent only after seeing a duplicate request.  Alternatively,
we can take the OFED module parameter patch.
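
The "send an MRA only after seeing a duplicate request" rule can be modelled roughly as below.  This is a sketch, not the IB CM code: pending_req, req_action and on_req_received() are hypothetical names, and ignoring later duplicates once an MRA has been sent is a choice made here purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical actions the passive side can take when a REQ arrives. */
enum req_action { ACT_DELIVER_TO_APP, ACT_SEND_MRA, ACT_IGNORE };

/* Hypothetical per-connection state kept on the passive side. */
struct pending_req {
    bool seen;      /* a REQ for this connection has already arrived */
    bool mra_sent;  /* an MRA has already been sent for it */
};

static enum req_action on_req_received(struct pending_req *req)
{
    assert(req != NULL);
    if (!req->seen) {
        /* First REQ: hand it to the application, send nothing extra. */
        req->seen = true;
        return ACT_DELIVER_TO_APP;
    }
    if (!req->mra_sent) {
        /* Duplicate REQ: the active side is retrying, so ask it to wait. */
        req->mra_sent = true;
        return ACT_SEND_MRA;
    }
    return ACT_IGNORE;
}
```

A connection that is answered before the active side retries never pays for the extra MAD.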

- Sean


Re: [ofa-general] [PATCH v3] iw_cxgb3: Support "iwarp-only" interfaces to avoid 4-tuple conflicts.

2007-09-28 Thread Sean Hefty

Kanevsky, Arkady wrote:

Exactly,
it forces the burden on administrator.
And one will be forced to try one mount for iWARP and it does not
work issue another one TCP or UDP if it fails.
Yack!

And server will need to listen on different IP address and simple
* will not work since it will need to listen in two different domains.


The server already has to call listen twice.  Once for the rdma_cm and 
once for sockets.  Similarly on the client side, connect must be made 
over rdma_cm or sockets.  I really don't see any impact on the 
application for this approach.


We just end up separating the port space based on networking addresses, 
rather than keeping the problem at the transport level.  If you have an 
alternate approach that will be accepted upstream, feel free to post it.


- Sean


RE: [ofa-general] [PATCH v3] iw_cxgb3: Support "iwarp-only" interfaces to avoid 4-tuple conflicts.

2007-09-27 Thread Sean Hefty
>It is ok to block while holding a mutex, yes?

It's okay, I just didn't try to trace through the code to see if it ever tries
to acquire the same mutex in the thread that needs to signal the event.

- Sean


RE: [ofa-general] [PATCH v3] iw_cxgb3: Support "iwarp-only" interfaces to avoid 4-tuple conflicts.

2007-09-27 Thread Sean Hefty
>What is the model on how client connects, say for iSCSI,
>when client and server both support, iWARP and 10GbE or 1GbE,
>and would like to setup "most" performant "connection" for ULP?

For the "most" performant connection, the ULP would use IB, and all these
problems go away.  :)

This proposal is for each iwarp interface to have its own IP address.  Clients
would need an iwarp usable address of the server and would connect using
rdma_connect().  If that call (or rdma_resolve_addr/route) fails, the client
could try connecting using sockets, aoi, or some other interface.  I don't see
that Steve's proposal changes anything from the client's perspective.
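
The client-side fallback described above might look roughly like this.  It is a sketch under assumptions: struct transport, connect_fn and connect_with_fallback() are hypothetical, and the two stub callbacks only stand in for a real rdma_cm attempt and a real sockets attempt.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical connect callback: returns 0 on success, nonzero on failure. */
typedef int (*connect_fn)(const char *server_addr);

struct transport {
    const char *name;
    connect_fn connect;
};

/*
 * Try each transport in order of preference and return the first one that
 * connects, or NULL if all of them fail.
 */
static const struct transport *connect_with_fallback(
    const struct transport *transports, size_t count,
    const char *server_addr)
{
    size_t i;

    assert(transports != NULL || count == 0);
    for (i = 0; i < count; i++) {
        if (transports[i].connect(server_addr) == 0)
            return &transports[i];
    }
    return NULL;
}

/* Stubs for illustration: the RDMA attempt fails, the socket attempt works. */
static int stub_rdma_connect(const char *addr) { (void)addr; return -1; }
static int stub_socket_connect(const char *addr) { (void)addr; return 0; }
```

Listing the RDMA transport first means the client prefers it whenever the server address is usable for iwarp and drops back to sockets otherwise.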

- Sean


Re: [ofa-general] [PATCH v3] iw_cxgb3: Support "iwarp-only" interfaces to avoid 4-tuple conflicts.

2007-09-27 Thread Sean Hefty

The sysadmin creates "for iwarp use only" alias interfaces of the form
"devname:iw*" where devname is the native interface name (eg eth0) for the
iwarp netdev device.  The alias label can be anything starting with "iw".
The "iw" immediately after the ':' is the key used by the iw_cxgb3 driver.
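
For illustration, the label rule described above can be checked with a few lines of C.  This is a userspace sketch, not the driver's is_iwarp_label(), whose body is not shown here; label_is_iwarp() is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Return true if an interface label has the form "devname:iw*", that is,
 * the alias part right after the ':' starts with "iw".
 */
static bool label_is_iwarp(const char *label)
{
    const char *colon;

    assert(label != NULL);
    colon = strchr(label, ':');
    if (colon == NULL)
        return false;   /* no alias part, e.g. plain "eth0" */
    return strncmp(colon + 1, "iw", 2) == 0;
}
```

With this rule "eth0:iw1" and "eth0:iw" match, while "eth0", "eth0:1" and "iw0" do not.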


I'm still not sure about this, but haven't come up with anything better 
myself.  And if there's a good chance of other rnic's needing the same 
support, I'd rather see the common code separated out, even if just 
encapsulated within this module for easy re-use.


As for the code, I have a couple of questions about whether deadlock and 
a race condition are possible, plus a few minor comments.



+static void insert_ifa(struct iwch_dev *rnicp, struct in_ifaddr *ifa)
+{
+   struct iwch_addrlist *addr;
+
+   addr = kmalloc(sizeof *addr, GFP_KERNEL);
+   if (!addr) {
+   printk(KERN_ERR MOD "%s - failed to alloc memory!\n",
+  __FUNCTION__);
+   return;
+   }
+   addr->ifa = ifa;
+   mutex_lock(&rnicp->mutex);
+   list_add_tail(&addr->entry, &rnicp->addrlist);
+   mutex_unlock(&rnicp->mutex);
+}


Should this return success/failure?


+static int nb_callback(struct notifier_block *self, unsigned long event,
+  void *ctx)
+{
+   struct in_ifaddr *ifa = ctx;
+   struct iwch_dev *rnicp = container_of(self, struct iwch_dev, nb);
+
+   PDBG("%s rnicp %p event %lx\n", __FUNCTION__, rnicp, event);
+
+   switch (event) {
+   case NETDEV_UP:
+   if (netdev_is_ours(rnicp, ifa->ifa_dev->dev) &&
+   is_iwarp_label(ifa->ifa_label)) {
+   PDBG("%s label %s addr 0x%x added\n",
+   __FUNCTION__, ifa->ifa_label, ifa->ifa_address);
+   insert_ifa(rnicp, ifa);
+   iwch_listeners_add_addr(rnicp, ifa->ifa_address);


If insert_ifa() fails, what will iwch_listeners_add_addr() do?  (I'm not 
easily seeing the relationship between the address list and the listen 
list at this point.)
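
To illustrate the concern, here is a userspace sketch of an insert that reports failure, with a caller that skips the follow-up step when it fails.  The types and names (addr_entry, addr_list_insert(), handle_addr_up()) are hypothetical stand-ins, and the listener counter merely models whatever the add-listeners step would do.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Simplified stand-ins for the driver's address list. */
struct addr_entry {
    unsigned int address;
    struct addr_entry *next;
};

struct addr_list {
    struct addr_entry *head;
    size_t count;
};

/* Returns 0 on success or -ENOMEM if the entry could not be allocated. */
static int addr_list_insert(struct addr_list *list, unsigned int address)
{
    struct addr_entry *entry;

    assert(list != NULL);
    entry = malloc(sizeof *entry);
    if (!entry)
        return -ENOMEM;
    entry->address = address;
    entry->next = list->head;
    list->head = entry;
    list->count++;
    return 0;
}

/* Only run the follow-up step when the insert actually succeeded. */
static int handle_addr_up(struct addr_list *list, unsigned int address,
                          unsigned int *listeners_added)
{
    int ret = addr_list_insert(list, address);

    if (ret)
        return ret;
    (*listeners_added)++;
    return 0;
}

static void addr_list_free(struct addr_list *list)
{
    while (list->head) {
        struct addr_entry *next = list->head->next;

        free(list->head);
        list->head = next;
    }
    list->count = 0;
}
```

Propagating the error keeps the address list and the listen list from drifting apart when an allocation fails.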



+   }
+   break;
+   case NETDEV_DOWN:
+   if (netdev_is_ours(rnicp, ifa->ifa_dev->dev) &&
+   is_iwarp_label(ifa->ifa_label)) {
+   PDBG("%s label %s addr 0x%x deleted\n",
+   __FUNCTION__, ifa->ifa_label, ifa->ifa_address);
+   iwch_listeners_del_addr(rnicp, ifa->ifa_address);
+   remove_ifa(rnicp, ifa);
+   }
+   break;
+   default:
+   break;
+   }
+   return 0;
+}
+
+static void delete_addrlist(struct iwch_dev *rnicp)
+{
+   struct iwch_addrlist *addr, *tmp;
+
+   mutex_lock(&rnicp->mutex);
+   list_for_each_entry_safe(addr, tmp, &rnicp->addrlist, entry) {
+   list_del(&addr->entry);
+   kfree(addr);
+   }
+   mutex_unlock(&rnicp->mutex);
+}
+
+static void populate_addrlist(struct iwch_dev *rnicp)
+{
+   int i;
+   struct in_device *indev;
+
+   for (i = 0; i < rnicp->rdev.port_info.nports; i++) {
+   indev = in_dev_get(rnicp->rdev.port_info.lldevs[i]);
+   if (!indev)
+   continue;
+   for_ifa(indev)
+   if (is_iwarp_label(ifa->ifa_label)) {
+   PDBG("%s label %s addr 0x%x added\n",
+__FUNCTION__, ifa->ifa_label,
+ifa->ifa_address);
+   insert_ifa(rnicp, ifa);
+   }
+   endfor_ifa(indev);
+   }
+}
+
 static void rnic_init(struct iwch_dev *rnicp)
 {
PDBG("%s iwch_dev %p\n", __FUNCTION__,  rnicp);
@@ -70,6 +187,12 @@ static void rnic_init(struct iwch_dev *r
idr_init(&rnicp->qpidr);
idr_init(&rnicp->mmidr);
spin_lock_init(&rnicp->lock);
+   INIT_LIST_HEAD(&rnicp->addrlist);
+   INIT_LIST_HEAD(&rnicp->listen_eps);
+   mutex_init(&rnicp->mutex);
+   rnicp->nb.notifier_call = nb_callback;
+   populate_addrlist(rnicp);
+   register_inetaddr_notifier(&rnicp->nb);
 
 	rnicp->attr.vendor_id = 0x168;

rnicp->attr.vendor_part_id = 7;
@@ -148,6 +271,8 @@ static void close_rnic_dev(struct t3cdev
mutex_lock(&dev_mutex);
list_for_each_entry_safe(dev, tmp, &dev_list, entry) {
if (dev->rdev.t3cdev_p == tdev) {
+   unregister_inetaddr_notifier(&dev->nb);
+   delete_addrlist(dev);
list_del(&dev->entry);
iwch_unregister_device(dev);
cxio_rdev_close(&dev->rdev);
diff --git a/drivers/infiniband/hw/cxgb3/iwch.h b/drivers/infiniband/hw/cxgb3/iwch.h
index caf4e60..7fa0a47 100644
--- a/drivers/infiniband/hw/cxgb3/iwch.h
+++ b/drivers/infiniband/hw/cxgb3/iwch.h
@@ -36,

Re: [ofa-general] [PATCH] RDMA/CMA: Implement rdma_resolve_ip retry enhancement.

2007-09-19 Thread Sean Hefty

If an application is calling rdma_resolve_ip() and a status of -ENODATA is 
returned from addr_resolve_local/remote(), the timeout mechanism waits until 
the application's timeout occurs before rechecking the address resolution 
status; the application will wait until its full timeout occurs.  This case is 
seen when the work thread call to process_req() is made before the arp packet 
is processed.


I don't understand the issue.  process_req() is invoked whenever a 
network event occurs, which rechecks all pending requests.



This patch is in addition to Steve Wise's neigh_event_send patch to initiate 
neighbour discovery sent on 9/12/2007.


This patch looks unrelated to Steve's patch.  Can you clarify the 
relationship?


- Sean


Re: [ofa-general] Re: [PATCH v2] iw_cxgb3: Support "iwarp-only" interfaces to avoid 4-tuple conflicts.

2007-09-17 Thread Sean Hefty

+addr = kmalloc(sizeof *addr, GFP_KERNEL);


As a small nitpick: this wants to be sizeof(struct in_ifaddr)


See chapter 14 of CodingStyle document.  kmalloc(sizeof *addr... is correct.
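
A small userspace illustration of why the sizeof *ptr form is preferred: the allocation size follows the pointer's declared type, so it cannot drift if that type is later changed.  The struct and alloc_entry() are hypothetical, and calloc() stands in for kmalloc().

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical structure used only for this example. */
struct large_entry {
    int id;
    char payload[256];
};

/*
 * sizeof *entry is computed from the declared type of entry, so changing
 * that declaration automatically changes the allocation size as well.
 */
static struct large_entry *alloc_entry(void)
{
    struct large_entry *entry;

    entry = calloc(1, sizeof *entry);
    return entry;
}
```

Writing the struct name inside sizeof instead would compile just as well with the wrong type, which is exactly the mistake the idiom guards against.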

- Sean


RE: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24

2007-09-14 Thread Sean Hefty
>OK -- just to make sure I'm understanding what you're saying: have you
>confirmed that your proposed patches actually fix the issue?

Not directly.  I cannot easily test kernel patches on our larger, production
clusters.  We've seen the issue with specific applications on 512 and 1024
cores, but I've only been able to test the patch on a 48-core cluster.  I have
verified that it successfully increases the timeout to where it *should* work,
but cannot absolutely confirm that it will fix the problem.  I'm unlikely to
know that until the production clusters move to an OFED release (1.3?)
containing this patch.

- Sean


Re: [ofa-general] [PATCH v2] iw_cxgb3: Support "iwarp-only" interfaces to avoid 4-tuple conflicts.

2007-09-13 Thread Sean Hefty

The iWARP driver must translate all listens on address 0.0.0.0 to the
set of rdma-only ip addresses for the device in question.  This prevents
incoming connect requests to the TCP ipaddresses from going up the
rdma stack.
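
The wildcard translation described above can be sketched as follows.  This is an illustration only: expand_listen_addr() is a hypothetical helper, addresses are plain 32-bit values, and dropping a specific address that is not in the rdma-only set is an assumption of this sketch rather than something stated in the patch description.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WILDCARD_ADDR 0u   /* 0.0.0.0 */

/*
 * Expand a requested listen address into the concrete addresses to listen
 * on.  A wildcard expands to every rdma-only address; a specific address is
 * kept only if it is one of the rdma-only addresses.  Returns the number of
 * addresses written to out (at most out_cap).
 */
static size_t expand_listen_addr(uint32_t requested,
                                 const uint32_t *rdma_addrs, size_t n_rdma,
                                 uint32_t *out, size_t out_cap)
{
    size_t i, n = 0;

    assert(out != NULL || out_cap == 0);
    for (i = 0; i < n_rdma && n < out_cap; i++) {
        if (requested == WILDCARD_ADDR || requested == rdma_addrs[i])
            out[n++] = rdma_addrs[i];
    }
    return n;
}
```

Since the wildcard never reaches the device as 0.0.0.0, connect requests aimed at the ordinary TCP addresses are not matched by an RDMA listen.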


I've only given this a high level review at this point, and while the 
patch looks okay on first pass, is there a way to move some of this 
functionality to either the rdma_cm or iw_cm?  I don't like the idea of 
every iwarp driver having to implement address/listen list maintenance. 
 I may have some ideas after re-examining it.



Implementation Details:


There are a couple of areas that I made a note to look at in more detail 
(because I didn't understand everything that was happening), but I did 
have one minor nit - most uses of list_del_init can just be list_del.


- Sean


RE: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24

2007-09-13 Thread Sean Hefty
> - My user_mad P_Key index support patch.  I'll test the ioctl to
>   change to the new mode and merge this I guess, since Hal and Sean
>   have tested this out.

I can give this patch a reviewed-by: too, and I will also try to review a couple
of the pending ipoib patches.

> - Sean's QoS changes.  These look fine at first glance, and I just
>   plan to understand the backwards compatibility story (ie how this
>   works with an old SM) and merge.  Anyone who objects let me know.

The new QoS fields fall into fields that are currently reserved, which should be
ignored by an older SM.  I've only tested this against openSM however.

> - Sean's IB CM MRA interface changes.  Don't know at this point.  It
>   seems OK but I'm not clear on what if any real-world improvement
>   this gives us.

This patch was generated in response to an Intel MPI issue.  We've seen MPI take
several minutes to respond to a connection request during the middle of large
application runs.  When this happens, the active side times out the connection.
In OFED, we added module parameters to adjust the rdma_cm connection timeout on
the active side, but I believe that sending an MRA from the passive side is a
better solution.

- Sean


RE: [PATCH] RDMA/CMA: Use neigh_event_send() to initiate neighbour discovery.

2007-09-12 Thread Sean Hefty
>RDMA/CMA: Use neigh_event_send() to initiate neighbour discovery.
>
>Calling arp_send() to initiate neighbour discovery (ND) doesn't do the
>full ND protocol.  Namely, it doesn't handle retransmitting the arp
>request if it is dropped. The function neigh_event_send() does all this.
>Without doing full ND, rdma address resolution fails in the presence of
>dropped arp bcast packets.
>
>Signed-off-by: Steve Wise <[EMAIL PROTECTED]>

Acked-by: Sean Hefty <[EMAIL PROTECTED]>

Roland - can you please queue this up for 2.6.24?


RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

2007-08-19 Thread Sean Hefty
>Just be realistic and accept that RDMA is a point in time solution,
>and like any other such technology takes flexibility away from users.

All technologies are just point in time solutions.  While management is
important, shouldn't the customers decide how important it is relative to their
problems?  Whether some future technology will be better matters little if a
problem needs to be solved today.

>If you can't see that this is the future, you have my condolences.
>Because frankly, the signs are all around that this is where things
>are going.

Adding a bazillion cores to a processor doesn't do a thing to help memory
bandwidth.

Millions of Infiniband ports are in operation today.  Over 25% of the top 500
supercomputers use Infiniband.  The formation of the OpenFabrics Alliance was
pushed and has been continuously funded by an RDMA customer - the US National
Labs.  RDMA technologies are backed by Cisco, IBM, Intel, QLogic, Sun, Voltaire,
Mellanox, NetApp, AMD, Dell, HP, Oracle, Unisys, Emulex, Hitachi, NEC, Fujitsu,
LSI, SGI, Sandia, and at least two dozen other companies.  IDC expects
Infiniband adapter revenue to triple between 2006 and 2011, and switch revenue
to increase six-fold (combined revenues of 1 billion).

Customers see real benefits using channel based architectures.  Do all customers
need it?  Of course not.  Is it a niche?  Yes, but I would say that about any
10+ gig network.  That doesn't mean that it hasn't become essential for some
customers.

- Sean


RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

2007-08-15 Thread Sean Hefty
>It's not about being a niche.  It's about creating a maintainable
>software net stack that has predictable behavior.
>
>Needing to reach out of the RDMA sandbox and reserve net stack resources
>away from itself travels a path we've consistently avoided.

We need to ensure that we're also creating a maintainable kernel.  RDMA doesn't
use sockets, but that doesn't mean it's not part of the networking support
provided by the Linux kernel.  Making blanket statements that RDMA should stay
within a sandbox is equivalent to saying that RDMA should duplicate any network
related functionality that it might need.

>>> I will NACK any patch that opens up sockets to eat up ports or
>>> anything stupid like that.
>
>Ditto for me as well.

I agree that using a socket is the wrong approach, but my guess is that it was
suggested as a possibility because of the attempt to keep RDMA in its 'sandbox'.
The iWarp architecture implements RDMA over TCP; it just doesn't use sockets.
The Linux network stack doesn't easily support this possibility.  Are there any
reasonable ways to enable this to the degree necessary for iWarp?

- Sean


Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

2007-08-09 Thread Sean Hefty

How about we just remove the RDMA stack altogether?  I am not at all
kidding.  If you guys can't stay in your sand box and need to cause
problems for the normal network stack, it's unacceptable.  We were
told all along that if RDMA went into the tree none of this kind of
stuff would be an issue.


There are currently two RDMA solutions available.  Each solution has 
different requirements and uses the normal network stack differently. 
Infiniband uses its own transport.  iWarp runs over TCP.


We have tried to leverage the existing infrastructure where it makes sense.


After TCP port reservation, what's next?  It seems an at least
bi-monthly event that the RDMA folks need to put their fingers
into something else in the normal networking stack.  No more.


Currently, the RDMA stack uses its own port space.  This causes a 
problem for iWarp, and is what Steve is looking for a solution for.  I'm 
not an iWarp guru, so I don't know what options exist.  Can iWarp use 
its own address family?  Identify specific IP addresses for iWarp use? 
Restrict iWarp to specific port numbers?  Let the app control the 
correct operation?  I don't know.


Steve merely defined a problem and suggested a possible solution.  He's 
looking for constructive help trying to solve the problem.


- Sean


Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

2007-08-09 Thread Sean Hefty

Steve Wise wrote:

Any more comments?


Does anyone have ideas on how to reserve the port space without using a 
struct socket?


- Sean


Re: [openib-general] [PATCH v4 2/2] iWARP Core Changes.

2006-08-03 Thread Sean Hefty

Steve Wise wrote:

diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c
index d294bbc..83f84ef 100644
--- a/drivers/infiniband/core/addr.c
+++ b/drivers/infiniband/core/addr.c
@@ -32,6 +32,7 @@ #include 
 #include 
 #include 
 #include 
+#include 


File is included 3 lines up.


diff --git a/drivers/infiniband/core/cache.c b/drivers/infiniband/core/cache.c
index e05ca2c..061858c 100644
 #include 
 #include 
 #include 
-#include  /* INIT_WORK, schedule_work(), flush_scheduled_work() */


I'm guessing that the include isn't currently needed, since none of the other 
changes to the file should have removed its dependency.  Should this be put into 
a separate patch?


+static int iw_conn_req_handler(struct iw_cm_id *cm_id, 
+			   struct iw_cm_event *iw_event)

+{
+   struct rdma_cm_id *new_cm_id;
+   struct rdma_id_private *listen_id, *conn_id;
+   struct sockaddr_in *sin;
+   struct net_device *dev = NULL;
+   int ret;
+
+   listen_id = cm_id->context;
+   atomic_inc(&listen_id->dev_remove);
+   if (!cma_comp(listen_id, CMA_LISTEN)) {
+   ret = -ECONNABORTED;
+   goto out;
+   }
+
+   /* Create a new RDMA id for the new IW CM ID */
+	new_cm_id = rdma_create_id(listen_id->id.event_handler, 
+   listen_id->id.context,

+  RDMA_PS_TCP);
+   if (!new_cm_id) {
+   ret = -ENOMEM;
+   goto out;
+   }
+   conn_id = container_of(new_cm_id, struct rdma_id_private, id);
+   atomic_inc(&conn_id->dev_remove);


This is not released in error cases.  See below.


+   conn_id->state = CMA_CONNECT;
+
+   dev = ip_dev_find(iw_event->local_addr.sin_addr.s_addr);
+   if (!dev) {
+   ret = -EADDRNOTAVAIL;


cma_release_remove(conn_id);


+   rdma_destroy_id(new_cm_id);
+   goto out;
+   }
+   ret = rdma_copy_addr(&conn_id->id.route.addr.dev_addr, dev, NULL);
+   if (ret) {


cma_release_remove(conn_id);


+   rdma_destroy_id(new_cm_id);
+   goto out;
+   }
+
+   ret = cma_acquire_dev(conn_id);
+   if (ret) {


cma_release_remove(conn_id);


+   rdma_destroy_id(new_cm_id);
+   goto out;
+   }
+
+   conn_id->cm_id.iw = cm_id;
+   cm_id->context = conn_id;
+   cm_id->cm_handler = cma_iw_handler;
+
+   sin = (struct sockaddr_in *) &new_cm_id->route.addr.src_addr;
+   *sin = iw_event->local_addr;
+   sin = (struct sockaddr_in *) &new_cm_id->route.addr.dst_addr;
+   *sin = iw_event->remote_addr;
+
+   ret = cma_notify_user(conn_id, RDMA_CM_EVENT_CONNECT_REQUEST, 0,
+ iw_event->private_data,
+ iw_event->private_data_len);
+   if (ret) {
+   /* User wants to destroy the CM ID */
+   conn_id->cm_id.iw = NULL;
+   cma_exch(conn_id, CMA_DESTROYING);
+   cma_release_remove(conn_id);
+   rdma_destroy_id(&conn_id->id);
+   }
+
+out:
+   if (!dev)
+   dev_put(dev);


Shouldn't this be: if (dev)?
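
As a sketch of the pattern being asked for, the userspace example below takes a reference, runs a step that may fail, and drops the reference on every exit path.  ref_obj, obj_get(), obj_put() and handle_request() are hypothetical stand-ins for the net_device reference handling, not the CMA code itself.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical reference-counted object standing in for a net_device. */
struct ref_obj {
    int refs;
};

static struct ref_obj *obj_get(struct ref_obj *obj)
{
    if (obj)
        obj->refs++;
    return obj;
}

static void obj_put(struct ref_obj *obj)
{
    assert(obj != NULL && obj->refs > 0);
    obj->refs--;
}

/*
 * Look up an object (which takes a reference), run a step that may fail,
 * and drop the reference on every exit path.  The check at "out" is
 * "if (dev)": the put must run only when the lookup actually succeeded.
 */
static int handle_request(struct ref_obj *candidate, int step_result)
{
    struct ref_obj *dev;
    int ret;

    dev = obj_get(candidate);   /* NULL models a failed lookup */
    if (!dev) {
        ret = -EADDRNOTAVAIL;
        goto out;
    }
    ret = step_result;          /* models a later step that may fail */
out:
    if (dev)
        obj_put(dev);
    return ret;
}
```

With the inverted test, the success path would leak a reference and the failed-lookup path would call the put routine on a NULL pointer.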


+   cma_release_remove(listen_id);
+   return ret;
+}
@@ -1357,8 +1552,8 @@ static int cma_resolve_loopback(struct r
ib_addr_set_dgid(&id_priv->id.route.addr.dev_addr, &gid);
 
 	if (cma_zero_addr(&id_priv->id.route.addr.src_addr)) {

-   src_in = (struct sockaddr_in *)&id_priv->id.route.addr.src_addr;
-   dst_in = (struct sockaddr_in *)&id_priv->id.route.addr.dst_addr;
+   src_in = (struct sockaddr_in *) &id_priv->id.route.addr.src_addr;
+   dst_in = (struct sockaddr_in *) &id_priv->id.route.addr.dst_addr;


trivial spacing change only

+static inline void iw_addr_get_sgid(struct rdma_dev_addr* rda, 
+union ib_gid *gid)

+{
+   memcpy(gid, rda->src_dev_addr, sizeof *gid);
+}
+
+static inline union ib_gid* iw_addr_get_dgid(struct rdma_dev_addr* rda)
+{
+   return (union ib_gid *) rda->dst_dev_addr;
+}


Minor personal nit: for consistency with the rest of the file, can you use 
dev_addr in place of rda?


- Sean


RE: [openib-general] [PATCH v2 1/2] iWARP Connection Manager.

2006-06-13 Thread Sean Hefty
>> Er...no. It will lose this event. Depending on the event...the carnage
>> varies. We'll take a look at this.
>>
>
>This behavior is consistent with the Infiniband CM (see
>drivers/infiniband/core/cm.c function cm_recv_handler()).  But I think
>we should at least log an error because a lost event will usually stall
>the rdma connection.

I believe that there's a difference here.  For the Infiniband CM, an allocation
error behaves the same as if the received MAD were lost or dropped.  Since MADs
are unreliable anyway, it's not so much that an IB CM event gets lost, as it
doesn't ever occur.  A remote CM should retry the send, which hopefully allows
the connection to make forward progress.

- Sean


Re: [PATCH 1/2] iWARP Connection Manager.

2006-06-01 Thread Sean Hefty

Steve Wise wrote:

+int iw_cm_disconnect(struct iw_cm_id *cm_id, int abrupt)
+{
+   struct iwcm_id_private *cm_id_priv;
+   unsigned long flags;
+   int ret = 0;
+
+   cm_id_priv = container_of(cm_id, struct iwcm_id_private, id);
+   /* Wait if we're currently in a connect or accept downcall */
+	wait_event(cm_id_priv->connect_wait, 
+		   !test_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags));


Am I understanding this check correctly?  You're checking to see if the user has 
called iw_cm_disconnect() at the same time that they called iw_cm_connect() or 
iw_cm_accept().  Are connect / accept blocking, or are you just waiting for an 
event?



The CM must wait for the low level provider to finish a connect() or
accept() operation before telling the low level provider to disconnect
via modifying the iwarp QP.  Regardless of whether they block, this
disconnect can happen concurrently with the connect/accept so we need to
hold the disconnect until the connect/accept completes.



+EXPORT_SYMBOL(iw_cm_disconnect);
+static void destroy_cm_id(struct iw_cm_id *cm_id)
+{
+   struct iwcm_id_private *cm_id_priv;
+   unsigned long flags;
+   int ret;
+
+   cm_id_priv = container_of(cm_id, struct iwcm_id_private, id);
+   /* Wait if we're currently in a connect or accept downcall. A
+* listening endpoint should never block here. */
+	wait_event(cm_id_priv->connect_wait, 
+		   !test_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags));


Same question/comment as above.




Same answer.  


There's a difference between trying to handle the user calling 
disconnect/destroy at the same time a call to accept/connect is active, versus 
the user calling disconnect/destroy after accept/connect have returned.  In the 
latter case, I think you're fine.  In the first case, this is allowing a user to 
call destroy at the same time that they're calling accept/connect. 
Additionally, there's no guarantee that the F_CONNECT_WAIT flag has been set by 
accept/connect by the time disconnect/destroy tests it.


- Sean


Re: [PATCH 1/2] iWARP Connection Manager.

2006-05-31 Thread Sean Hefty

Steve Wise wrote:
+/* 
+ * Release a reference on cm_id. If the last reference is being removed

+ * and iw_destroy_cm_id is waiting, wake up the waiting thread.
+ */
+static int iwcm_deref_id(struct iwcm_id_private *cm_id_priv)
+{
+   int ret = 0;
+
+   BUG_ON(atomic_read(&cm_id_priv->refcount)==0);
+   if (atomic_dec_and_test(&cm_id_priv->refcount)) {
+   BUG_ON(!list_empty(&cm_id_priv->work_list));
+   if (waitqueue_active(&cm_id_priv->destroy_wait)) {
+   BUG_ON(cm_id_priv->state != IW_CM_STATE_DESTROYING);
+   BUG_ON(test_bit(IWCM_F_CALLBACK_DESTROY,
+   &cm_id_priv->flags));
+   ret = 1;
+   wake_up(&cm_id_priv->destroy_wait);


We recently changed the RDMA CM, IB CM, and a couple of other modules from using 
wait objects to completions.   This avoids a race condition between decrementing 
the reference count, which allows destruction to proceed, and calling wake_up on 
a freed cm_id.  My guess is that you may need to do the same.


Can you also explain the use of the return value here?  It's ignored below in 
rem_ref() and destroy_cm_id().



+static void add_ref(struct iw_cm_id *cm_id)
+{
+   struct iwcm_id_private *cm_id_priv;
+   cm_id_priv = container_of(cm_id, struct iwcm_id_private, id);
+   atomic_inc(&cm_id_priv->refcount);
+}
+
+static void rem_ref(struct iw_cm_id *cm_id)
+{
+   struct iwcm_id_private *cm_id_priv;
+   cm_id_priv = container_of(cm_id, struct iwcm_id_private, id);
+   iwcm_deref_id(cm_id_priv);
+}
+


+/* 
+ * CM_ID <-- CLOSING

+ *
+ * Block if a passive or active connection is currenlty being processed. Then
+ * process the event as follows:
+ * - If we are ESTABLISHED, move to CLOSING and modify the QP state
+ *   based on the abrupt flag 
+ * - If the connection is already in the CLOSING or IDLE state, the peer is
+ *   disconnecting concurrently with us and we've already seen the 
+ *   DISCONNECT event -- ignore the request and return 0

+ * - Disconnect on a listening endpoint returns -EINVAL
+ */
+int iw_cm_disconnect(struct iw_cm_id *cm_id, int abrupt)
+{
+   struct iwcm_id_private *cm_id_priv;
+   unsigned long flags;
+   int ret = 0;
+
+   cm_id_priv = container_of(cm_id, struct iwcm_id_private, id);
+   /* Wait if we're currently in a connect or accept downcall */
+	wait_event(cm_id_priv->connect_wait, 
+		   !test_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags));


Am I understanding this check correctly?  You're checking to see if the user has 
called iw_cm_disconnect() at the same time that they called iw_cm_connect() or 
iw_cm_accept().  Are connect / accept blocking, or are you just waiting for an 
event?



+
+   spin_lock_irqsave(&cm_id_priv->lock, flags);
+   switch (cm_id_priv->state) {
+   case IW_CM_STATE_ESTABLISHED:
+   cm_id_priv->state = IW_CM_STATE_CLOSING;
+   spin_unlock_irqrestore(&cm_id_priv->lock, flags);
+   if (cm_id_priv->qp)  { /* QP could be  for user-mode 
client */
+   if (abrupt)
+   ret = iwcm_modify_qp_err(cm_id_priv->qp);
+   else
+   ret = iwcm_modify_qp_sqd(cm_id_priv->qp);
+			/* 
+			 * If both sides are disconnecting the QP could

+* already be in ERR or SQD states
+*/
+   ret = 0;
+   }
+   else
+   ret = -EINVAL;
+   break;
+   case IW_CM_STATE_LISTEN:
+   spin_unlock_irqrestore(&cm_id_priv->lock, flags);
+   ret = -EINVAL;
+   break;
+   case IW_CM_STATE_CLOSING:
+   /* remote peer closed first */
+   case IW_CM_STATE_IDLE:  
+   /* accept or connect returned !0 */
+   spin_unlock_irqrestore(&cm_id_priv->lock, flags);
+   break;
+   case IW_CM_STATE_CONN_RECV:
+		/* 
+		 * App called disconnect before/without calling accept after

+* connect_request event delivered.
+*/
+   spin_unlock_irqrestore(&cm_id_priv->lock, flags);
+   break;
+   case IW_CM_STATE_CONN_SENT:
+   /* Can only get here if wait above fails */
+   default:
+   BUG_ON(1);
+   }
+
+   return ret;
+}
+EXPORT_SYMBOL(iw_cm_disconnect);
+static void destroy_cm_id(struct iw_cm_id *cm_id)
+{
+   struct iwcm_id_private *cm_id_priv;
+   unsigned long flags;
+   int ret;
+
+   cm_id_priv = container_of(cm_id, struct iwcm_id_private, id);
+   /* Wait if we're currently in a connect or accept downcall. A
+* listening endpoint should never block here. */
+	wait_event(cm_id_priv->connect_wait, 
+		   !test_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags));


Same question/comment as abo

Re: [PATCH 2/2] iWARP Core Changes.

2006-05-31 Thread Sean Hefty

Mainly nits...

Steve Wise wrote:

-static int copy_addr(struct rdma_dev_addr *dev_addr, struct net_device *dev,
+int copy_addr(struct rdma_dev_addr *dev_addr, struct net_device *dev,
 unsigned char *dst_dev_addr)


Might want to rename this to something like rdma_copy_addr if you're going to 
export it.



+static int cma_iw_handler(struct iw_cm_id *iw_id, struct iw_cm_event *iw_event)
+{
+   struct rdma_id_private *id_priv = iw_id->context;
+   enum rdma_cm_event_type event = 0;
+   struct sockaddr_in *sin;
+   int ret = 0;
+
+   atomic_inc(&id_priv->dev_remove);
+
+   switch (iw_event->event) {
+   case IW_CM_EVENT_CLOSE:
+   event = RDMA_CM_EVENT_DISCONNECTED;
+   break;
+   case IW_CM_EVENT_CONNECT_REPLY:
+   sin = (struct sockaddr_in*)&id_priv->id.route.addr.src_addr;
+   *sin = iw_event->local_addr;
+   sin = (struct sockaddr_in*)&id_priv->id.route.addr.dst_addr;


spacing nit - (struct sockaddr_in *) &id_priv->...


+struct net_device *ip_dev_find(u32 ip);


Just include header file with definition.


+   sin = (struct sockaddr_in*)&new_cm_id->route.addr.src_addr;
+   *sin = iw_event->local_addr;
+   sin = (struct sockaddr_in*)&new_cm_id->route.addr.dst_addr;


same spacing nit...  appears in a couple other places as well.


+static inline union ib_gid* iw_addr_get_sgid(struct rdma_dev_addr* rda)
+{
+   return (union ib_gid*)rda->src_dev_addr;
+}
+
+static inline union ib_gid* iw_addr_get_dgid(struct rdma_dev_addr* rda)
+{
+   return (union ib_gid*)rda->dst_dev_addr;
+}


spacing nits


+struct iw_cm_verbs;
 struct ib_device {
 	struct device *dma_device;
 
@@ -846,6 +873,8 @@ struct ib_device {
 
 	u32   flags;
 
+	struct iw_cm_verbs*   iwcm;

+


'*' placement nit

- Sean


Re: [openib-general] Re: [PATCH 4/6 v2] IB: address translation to map IP to IB addresses (GIDs)

2006-03-21 Thread Sean Hefty

Roland Dreier wrote:

 > +struct workqueue_struct *rdma_wq;
 > +EXPORT_SYMBOL(rdma_wq);

Sean, I don't think I saw an answer when I asked you this before.  Why
is ib_addr exporting a workqueue?  Is there some sort of ordering
constraint that is forcing other modules to go through the same
workqueue for things?

This seems like a very fragile internal thing to be exposing, and I'm
wondering if there's a better way to handle it.


I responded in a different thread, but here's what I wrote:

"This is simply an attempt to reduce/combine work queues used by the Infiniband 
code.  This keeps the threading a little simpler in the rdma_cm, since all 
callbacks are invoked using the same work queue.  (I'm also using this with the 
local SA/multicast code, but that's not ready for merging.)"


There's no specific ordering constraint that's required.  We're just ending up 
with several Infiniband modules creating their own work queues (ib_mad, ib_cm, 
ib_addr, rdma_cm, plus a couple more in modules under development), and this is 
an attempt to reduce that.  If having separate work queues would work better, 
there shouldn't be anything that prevents this.


- Sean


RE: [PATCH 5/6 v2] IB: IP address based RDMA connection manager

2006-03-13 Thread Sean Hefty
> > +static void cma_detach_from_dev(struct rdma_id_private *id_priv)
> > +{
> > +   list_del(&id_priv->list);
> > +   if (atomic_dec_and_test(&id_priv->cma_dev->refcount))
> > +   wake_up(&id_priv->cma_dev->wait);
> > +   id_priv->cma_dev = NULL;
> > +}
>
>doesn't need to do atomic_dec_and_test(), because it is never dropping
>the last reference to id_priv (and in fact if it was, the last line
>would be a use-after-free bug).

It's dropping the reference on cma_dev, as opposed to id_priv.
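
To make the distinction concrete, here is a minimal userspace sketch of the same pattern, using C11 atomics in place of the kernel's atomic_t.  The names dev_ctx, id_ctx, id_attach and id_detach are hypothetical stand-ins, not the structures from the patch, and a boolean flag stands in for wake_up().  The count that reaches zero belongs to the device object; the id only clears its own pointer and stays alive.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; not the kernel's struct cma_device or
 * struct rdma_id_private. */
struct dev_ctx {
	atomic_int refcount;	/* references held on the device */
	bool woken;		/* stands in for wake_up(&cma_dev->wait) */
};

struct id_ctx {
	struct dev_ctx *dev;	/* the id's pointer to its device, or NULL */
};

/* Take one device reference on behalf of the id. */
void id_attach(struct id_ctx *id, struct dev_ctx *dev)
{
	atomic_fetch_add(&dev->refcount, 1);
	id->dev = dev;
}

/* Same shape as cma_detach_from_dev(): drop the device reference, wake a
 * waiter if it was the last one, then clear the id's pointer. */
void id_detach(struct id_ctx *id)
{
	/* atomic_fetch_sub returns the old value, so 1 means we reached zero. */
	if (atomic_fetch_sub(&id->dev->refcount, 1) == 1)
		id->dev->woken = true;
	id->dev = NULL;		/* the id itself is still alive here */
}
```

With two ids attached to one device, detaching the first leaves the flag clear and detaching the second sets it, while both id objects remain valid throughout.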

- Sean



RE: [PATCH 4/6 v2] IB: address translation to map IP to IB addresses (GIDs)

2006-03-10 Thread Sean Hefty
>The ib_addr module depends on CONFIG_INET, because it uses symbols
>like arp_tbl, which are only exported if INET is enabled.
>
>I fixed this up by creating a new (non-user-visible) config symbol to
>control when ib_addr is built -- I put the following diff on top of
>your patch in my tree:

Thanks!
-Sean



[PATCH 5/6 v2] IB: IP address based RDMA connection manager

2006-03-06 Thread Sean Hefty
Kernel mode connection management agent over Infiniband that connects based
on IP addresses.  The agent defines a generic RDMA connection abstraction
to support clients wanting to connect over different RDMA devices.

Agent also handles RDMA device hotplug events on behalf of clients.

Signed-off-by: Sean Hefty <[EMAIL PROTECTED]>

---

diff -uprN -X linux-2.6.git/Documentation/dontdiff 
linux-2.6.git/drivers/infiniband/core/cma.c 
linux-2.6.ib/drivers/infiniband/core/cma.c
--- linux-2.6.git/drivers/infiniband/core/cma.c 1969-12-31 16:00:00.0 
-0800
+++ linux-2.6.ib/drivers/infiniband/core/cma.c  2006-01-16 16:17:34.0 
-0800
@@ -0,0 +1,1639 @@
+/*
+ * Copyright (c) 2005 Voltaire Inc.  All rights reserved.
+ * Copyright (c) 2002-2005, Network Appliance, Inc. All rights reserved.
+ * Copyright (c) 1999-2005, Mellanox Technologies, Inc. All rights reserved.
+ * Copyright (c) 2005 Intel Corporation.  All rights reserved.
+ *
+ * This Software is licensed under one of the following licenses:
+ *
+ * 1) under the terms of the "Common Public License 1.0" a copy of which is
+ *available from the Open Source Initiative, see
+ *http://www.opensource.org/licenses/cpl.php.
+ *
+ * 2) under the terms of the "The BSD License" a copy of which is
+ *available from the Open Source Initiative, see
+ *http://www.opensource.org/licenses/bsd-license.php.
+ *
+ * 3) under the terms of the "GNU General Public License (GPL) Version 2" a
+ *copy of which is available from the Open Source Initiative, see
+ *http://www.opensource.org/licenses/gpl-license.php.
+ *
+ * Licensee has the right to choose one of the above licenses.
+ *
+ * Redistributions of source code must retain the above copyright
+ * notice and one of the license notices.
+ *
+ * Redistributions in binary form must reproduce both the above copyright
+ * notice, one of the license notices in the documentation
+ * and/or other materials provided with the distribution.
+ *
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+MODULE_AUTHOR("Sean Hefty");
+MODULE_DESCRIPTION("Generic RDMA CM Agent");
+MODULE_LICENSE("Dual BSD/GPL");
+
+#define CMA_CM_RESPONSE_TIMEOUT 20
+#define CMA_MAX_CM_RETRIES 3
+
+static void cma_add_one(struct ib_device *device);
+static void cma_remove_one(struct ib_device *device);
+
+static struct ib_client cma_client = {
+   .name   = "cma",
+   .add    = cma_add_one,
+   .remove = cma_remove_one
+};
+
+static LIST_HEAD(dev_list);
+static LIST_HEAD(listen_any_list);
+static DEFINE_MUTEX(lock);
+
+struct cma_device {
+   struct list_head    list;
+   struct ib_device    *device;
+   __be64              node_guid;
+   wait_queue_head_t   wait;
+   atomic_t            refcount;
+   struct list_head    id_list;
+};
+
+enum cma_state {
+   CMA_IDLE,
+   CMA_ADDR_QUERY,
+   CMA_ADDR_RESOLVED,
+   CMA_ROUTE_QUERY,
+   CMA_ROUTE_RESOLVED,
+   CMA_CONNECT,
+   CMA_ADDR_BOUND,
+   CMA_LISTEN,
+   CMA_DEVICE_REMOVAL,
+   CMA_DESTROYING
+};
+
+/*
+ * Device removal can occur at anytime, so we need extra handling to
+ * serialize notifying the user of device removal with other callbacks.
+ * We do this by disabling removal notification while a callback is in process,
+ * and reporting it after the callback completes.
+ */
+struct rdma_id_private {
+   struct rdma_cm_id   id;
+
+   struct list_head    list;
+   struct list_head    listen_list;
+   struct cma_device   *cma_dev;
+
+   enum cma_state      state;
+   spinlock_t          lock;
+   wait_queue_head_t   wait;
+   atomic_t            refcount;
+   wait_queue_head_t   wait_remove;
+   atomic_t            dev_remove;
+
+   int                 backlog;
+   int                 timeout_ms;
+   struct ib_sa_query  *query;
+   int                 query_id;
+   struct ib_cm_id     *cm_id;
+
+   u32                 seq_num;
+   u32                 qp_num;
+   enum ib_qp_type     qp_type;
+   u8                  srq;
+};
+
+struct cma_work {
+   struct work_struct  work;
+   struct rdma_id_private  *id;
+};
+
+union cma_ip_addr {
+   struct in6_addr ip6;
+   struct {
+   __u32 pad[3];
+   __u32 addr;
+   } ip4;
+};
+
+struct cma_hdr {
+   u8 cma_version;
+   u8 ip_version;  /* IP version: 7:4 */
+   __u16 port;
+   union cma_ip_addr src_addr;
+   union cma_ip_addr dst_addr;
+};
+
+struct sdp_hh {
+   u8 sdp_version;
+   u8 ip_version;  /* IP version: 7:4 */
+   u8 sdp_specific1[10];
+   __u16 port;
+   __u16 sdp_specific2;
+   union cma_ip_addr src_addr;
+   union cma_ip_addr dst_addr;
+};
+
+#define CMA_VERSION 0x00
+#de

[PATCH 6/6 v2] IB: userspace support for RDMA connection manager

2006-03-06 Thread Sean Hefty
Kernel component necessary to support the userspace RDMA connection management
library.

Signed-off-by: Sean Hefty <[EMAIL PROTECTED]>

---

Discussion on the list suggested giving the userspace interface more time to
develop, which seems reasonable.

diff -uprN -X linux-2.6.git/Documentation/dontdiff 
linux-2.6.git/drivers/infiniband/core/Makefile 
linux-2.6.ib/drivers/infiniband/core/Makefile
--- linux-2.6.git/drivers/infiniband/core/Makefile  2006-01-16 
16:58:58.0 -0800
+++ linux-2.6.ib/drivers/infiniband/core/Makefile   2006-01-16 
16:55:25.0 -0800
@@ -1,5 +1,5 @@
 obj-$(CONFIG_INFINIBAND) +=ib_core.o ib_mad.o ib_sa.o \
-   ib_cm.o ib_addr.o rdma_cm.o
+   ib_cm.o ib_addr.o rdma_cm.o rdma_ucm.o
 obj-$(CONFIG_INFINIBAND_USER_MAD) +=   ib_umad.o
 obj-$(CONFIG_INFINIBAND_USER_ACCESS) +=ib_uverbs.o ib_ucm.o
 
@@ -14,6 +14,8 @@ ib_cm-y :=cm.o
 
 rdma_cm-y :=   cma.o
 
+rdma_ucm-y :=  ucma.o
+
 ib_addr-y :=   addr.o
 
 ib_umad-y :=   user_mad.o
diff -uprN -X linux-2.6.git/Documentation/dontdiff 
linux-2.6.git/drivers/infiniband/core/ucma.c 
linux-2.6.ib/drivers/infiniband/core/ucma.c
--- linux-2.6.git/drivers/infiniband/core/ucma.c1969-12-31 
16:00:00.0 -0800
+++ linux-2.6.ib/drivers/infiniband/core/ucma.c 2006-01-16 16:54:31.0 
-0800
@@ -0,0 +1,788 @@
+/*
+ * Copyright (c) 2005 Intel Corporation.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *  - Redistributions of source code must retain the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer.
+ *
+ *  - Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer in the documentation and/or other materials
+ * provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+
+MODULE_AUTHOR("Sean Hefty");
+MODULE_DESCRIPTION("RDMA Userspace Connection Manager Access");
+MODULE_LICENSE("Dual BSD/GPL");
+
+enum {
+   UCMA_MAX_BACKLOG = 128
+};
+
+struct ucma_file {
+   struct mutex        file_mutex;
+   struct file         *filp;
+   struct list_head    ctxs;
+   struct list_head    events;
+   wait_queue_head_t   poll_wait;
+};
+
+struct ucma_context {
+   int                 id;
+   wait_queue_head_t   wait;
+   atomic_t            ref;
+   int                 events_reported;
+   int                 backlog;
+
+   struct ucma_file    *file;
+   struct rdma_cm_id   *cm_id;
+   __u64               uid;
+
+   struct list_head    events;    /* list of pending events. */
+   struct list_head    file_list; /* member in file ctx list */
+};
+
+struct ucma_event {
+   struct ucma_context *ctx;
+   struct list_head    file_list; /* member in file event list */
+   struct list_head    ctx_list;  /* member in ctx event list */
+   struct rdma_cm_id   *cm_id;
+   struct rdma_ucm_event_resp resp;
+};
+
+static DEFINE_MUTEX(ctx_mutex);
+static DEFINE_IDR(ctx_idr);
+
+static struct ucma_context* ucma_get_ctx(struct ucma_file *file, int id)
+{
+   struct ucma_context *ctx;
+
+   mutex_lock(&ctx_mutex);
+   ctx = idr_find(&ctx_idr, id);
+   if (!ctx)
+   ctx = ERR_PTR(-ENOENT);
+   else if (ctx->file != file)
+   ctx = ERR_PTR(-EINVAL);
+   else
+   atomic_inc(&ctx->ref);
+   mutex_unlock(&ctx_mutex);
+
+   return ctx;
+}
+
+static void ucma_put_ctx(struct ucma_context *ctx)
+{
+   if (atomic_dec_and_test(&ctx->ref))
+   wake_up(&ctx->wait);
+}
+
+static void ucma_cleanup_events(struct 

[PATCH 4/6 v2] IB: address translation to map IP to IB addresses (GIDs)

2006-03-06 Thread Sean Hefty
Add an address translation service that maps IP addresses to Infiniband
GID addresses using IPoIB.

Signed-off-by: Sean Hefty <[EMAIL PROTECTED]>

---

This should be the correct patch.  The only difference between this and the
mis-post is the use of mutex_lock/unlock in place of up/down.


diff -uprN -X linux-2.6.git/Documentation/dontdiff 
linux-2.6.git/drivers/infiniband/core/addr.c 
linux-2.6.ib/drivers/infiniband/core/addr.c
--- linux-2.6.git/drivers/infiniband/core/addr.c1969-12-31 
16:00:00.0 -0800
+++ linux-2.6.ib/drivers/infiniband/core/addr.c 2006-01-16 16:14:24.0 
-0800
@@ -0,0 +1,356 @@
+/*
+ * Copyright (c) 2005 Voltaire Inc.  All rights reserved.
+ * Copyright (c) 2002-2005, Network Appliance, Inc. All rights reserved.
+ * Copyright (c) 1999-2005, Mellanox Technologies, Inc. All rights reserved.
+ * Copyright (c) 2005 Intel Corporation.  All rights reserved.
+ *
+ * This Software is licensed under one of the following licenses:
+ *
+ * 1) under the terms of the "Common Public License 1.0" a copy of which is
+ *available from the Open Source Initiative, see
+ *http://www.opensource.org/licenses/cpl.php.
+ *
+ * 2) under the terms of the "The BSD License" a copy of which is
+ *available from the Open Source Initiative, see
+ *http://www.opensource.org/licenses/bsd-license.php.
+ *
+ * 3) under the terms of the "GNU General Public License (GPL) Version 2" a
+ *copy of which is available from the Open Source Initiative, see
+ *http://www.opensource.org/licenses/gpl-license.php.
+ *
+ * Licensee has the right to choose one of the above licenses.
+ *
+ * Redistributions of source code must retain the above copyright
+ * notice and one of the license notices.
+ *
+ * Redistributions in binary form must reproduce both the above copyright
+ * notice, one of the license notices in the documentation
+ * and/or other materials provided with the distribution.
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+MODULE_AUTHOR("Sean Hefty");
+MODULE_DESCRIPTION("IB Address Translation");
+MODULE_LICENSE("Dual BSD/GPL");
+
+struct addr_req {
+   struct list_head list;
+   struct sockaddr src_addr;
+   struct sockaddr dst_addr;
+   struct rdma_dev_addr *addr;
+   void *context;
+   void (*callback)(int status, struct sockaddr *src_addr,
+struct rdma_dev_addr *addr, void *context);
+   unsigned long timeout;
+   int status;
+};
+
+static void process_req(void *data);
+
+static DEFINE_MUTEX(lock);
+static LIST_HEAD(req_list);
+static DECLARE_WORK(work, process_req, NULL);
+struct workqueue_struct *rdma_wq;
+EXPORT_SYMBOL(rdma_wq);
+
+static int copy_addr(struct rdma_dev_addr *dev_addr, struct net_device *dev,
+unsigned char *dst_dev_addr)
+{
+   switch (dev->type) {
+   case ARPHRD_INFINIBAND:
+   dev_addr->dev_type = IB_NODE_CA;
+   break;
+   default:
+   return -EADDRNOTAVAIL;
+   }
+
+   memcpy(dev_addr->src_dev_addr, dev->dev_addr, MAX_ADDR_LEN);
+   memcpy(dev_addr->broadcast, dev->broadcast, MAX_ADDR_LEN);
+   if (dst_dev_addr)
+   memcpy(dev_addr->dst_dev_addr, dst_dev_addr, MAX_ADDR_LEN);
+   return 0;
+}
+
+int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr)
+{
+   struct net_device *dev;
+   u32 ip = ((struct sockaddr_in *) addr)->sin_addr.s_addr;
+   int ret;
+
+   dev = ip_dev_find(ip);
+   if (!dev)
+   return -EADDRNOTAVAIL;
+
+   ret = copy_addr(dev_addr, dev, NULL);
+   dev_put(dev);
+   return ret;
+}
+EXPORT_SYMBOL(rdma_translate_ip);
+
+static void set_timeout(unsigned long time)
+{
+   unsigned long delay;
+
+   cancel_delayed_work(&work);
+
+   delay = time - jiffies;
+   if ((long)delay <= 0)
+   delay = 1;
+
+   queue_delayed_work(rdma_wq, &work, delay);
+}
+
+static void queue_req(struct addr_req *req)
+{
+   struct addr_req *temp_req;
+
+   mutex_lock(&lock);
+   list_for_each_entry_reverse(temp_req, &req_list, list) {
+   if (time_after(req->timeout, temp_req->timeout))
+   break;
+   }
+
+   list_add(&req->list, &temp_req->list);
+
+   if (req_list.next == &req->list)
+   set_timeout(req->timeout);
+   mutex_unlock(&lock);
+}
+
+static void addr_send_arp(struct sockaddr_in *dst_in)
+{
+   struct rtable *rt;
+   struct flowi fl;
+   u32 dst_ip = dst_in->sin_addr.s_addr;
+
+   memset(&fl, 0, sizeof fl);
+   fl.nl_u.ip4_u.daddr = dst_ip;
+   if (ip_route_output_key(&rt, &fl))
+   return;
+
+   arp_send(ARPOP_REQUEST, ETH_P_ARP, rt->rt_gateway, rt->idev->dev,
+rt->r

[PATCH 2/6 v2] IB: match connection requests based on private data

2006-03-06 Thread Sean Hefty
Extend matching connection requests to listens in the Infiniband CM to include
private data checks.

This allows applications to listen on the same service identifier, with private
data directing the request to the appropriate application.

Signed-off-by: Sean Hefty <[EMAIL PROTECTED]>

---

This should be the correct patch that incorporates feedback from the initial
submission.  Sorry about the mis-post.
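
For readers who want to try the matching rule outside the kernel, the following is a minimal userspace sketch of the masked comparison, not the code from this patch.  COMPARE_SIZE is an illustrative value rather than the kernel's IB_CM_COMPARE_SIZE, struct compare_data is a hypothetical stand-in for struct ib_cm_compare_data, and the loop works byte by byte instead of by unsigned long.  It assumes the listener's data is already zero wherever its mask is zero.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define COMPARE_SIZE 8	/* illustrative only; not IB_CM_COMPARE_SIZE */

/* Hypothetical stand-in for struct ib_cm_compare_data. */
struct compare_data {
	uint8_t data[COMPARE_SIZE];	/* expected bytes, pre-masked */
	uint8_t mask[COMPARE_SIZE];	/* which bits of the request matter */
};

/* dst = src & mask, one byte at a time. */
void mask_copy(uint8_t *dst, const uint8_t *src, const uint8_t *mask)
{
	size_t i;

	for (i = 0; i < COMPARE_SIZE; i++)
		dst[i] = src[i] & mask[i];
}

/* Returns 0 on a match, like memcmp().  A listener without compare data
 * (NULL) matches every request. */
int compare_private_data(const uint8_t *private_data,
			 const struct compare_data *listen)
{
	uint8_t masked[COMPARE_SIZE];

	if (!listen)
		return 0;

	mask_copy(masked, private_data, listen->mask);
	return memcmp(masked, listen->data, COMPARE_SIZE);
}
```

A listener whose mask covers only the first byte matches any request with that first byte, whatever follows it, which is how two listeners can share one service identifier.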


diff -uprN -X linux-2.6.git/Documentation/dontdiff 
linux-2.6.git/drivers/infiniband/core/cm.c 
linux-2.6.ib/drivers/infiniband/core/cm.c
--- linux-2.6.git/drivers/infiniband/core/cm.c  2006-01-16 10:25:26.0 
-0800
+++ linux-2.6.ib/drivers/infiniband/core/cm.c   2006-01-16 16:03:35.0 
-0800
@@ -32,7 +32,7 @@
  * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
  * SOFTWARE.
  *
- * $Id: cm.c 2821 2005-07-08 17:07:28Z sean.hefty $
+ * $Id: cm.c 4311 2005-12-05 18:42:01Z sean.hefty $
  */
 #include 
 #include 
@@ -130,6 +130,7 @@ struct cm_id_private {
/* todo: use alternate port on send failure */
struct cm_av av;
struct cm_av alt_av;
+   struct ib_cm_compare_data *compare_data;
 
void *private_data;
__be64 tid;
@@ -355,6 +356,41 @@ static struct cm_id_private * cm_acquire
return cm_id_priv;
 }
 
+static void cm_mask_copy(u8 *dst, u8 *src, u8 *mask)
+{
+   int i;
+
+   for (i = 0; i < IB_CM_COMPARE_SIZE / sizeof(unsigned long); i++)
+   ((unsigned long *) dst)[i] = ((unsigned long *) src)[i] &
+((unsigned long *) mask)[i];
+}
+
+static int cm_compare_data(struct ib_cm_compare_data *src_data,
+  struct ib_cm_compare_data *dst_data)
+{
+   u8 src[IB_CM_COMPARE_SIZE];
+   u8 dst[IB_CM_COMPARE_SIZE];
+
+   if (!src_data || !dst_data)
+   return 0;
+   
+   cm_mask_copy(src, src_data->data, dst_data->mask);
+   cm_mask_copy(dst, dst_data->data, src_data->mask);
+   return memcmp(src, dst, IB_CM_COMPARE_SIZE);
+}
+
+static int cm_compare_private_data(u8 *private_data,
+  struct ib_cm_compare_data *dst_data)
+{
+   u8 src[IB_CM_COMPARE_SIZE];
+
+   if (!dst_data)
+   return 0;
+   
+   cm_mask_copy(src, private_data, dst_data->mask);
+   return memcmp(src, dst_data->data, IB_CM_COMPARE_SIZE);
+}
+
 static struct cm_id_private * cm_insert_listen(struct cm_id_private *cm_id_priv)
 {
struct rb_node **link = &cm.listen_service_table.rb_node;
@@ -362,14 +397,18 @@ static struct cm_id_private * cm_insert_
struct cm_id_private *cur_cm_id_priv;
__be64 service_id = cm_id_priv->id.service_id;
__be64 service_mask = cm_id_priv->id.service_mask;
+   int data_cmp;
 
while (*link) {
parent = *link;
cur_cm_id_priv = rb_entry(parent, struct cm_id_private,
  service_node);
+   data_cmp = cm_compare_data(cm_id_priv->compare_data,
+  cur_cm_id_priv->compare_data);
if ((cur_cm_id_priv->id.service_mask & service_id) ==
(service_mask & cur_cm_id_priv->id.service_id) &&
-   (cm_id_priv->id.device == cur_cm_id_priv->id.device))
+   (cm_id_priv->id.device == cur_cm_id_priv->id.device) &&
+   !data_cmp)
return cur_cm_id_priv;
 
if (cm_id_priv->id.device < cur_cm_id_priv->id.device)
@@ -378,6 +417,10 @@ static struct cm_id_private * cm_insert_
link = &(*link)->rb_right;
else if (service_id < cur_cm_id_priv->id.service_id)
link = &(*link)->rb_left;
+   else if (service_id > cur_cm_id_priv->id.service_id)
+   link = &(*link)->rb_right;
+   else if (data_cmp < 0)
+   link = &(*link)->rb_left;
else
link = &(*link)->rb_right;
}
@@ -387,16 +430,20 @@ static struct cm_id_private * cm_insert_
 }
 
 static struct cm_id_private * cm_find_listen(struct ib_device *device,
-__be64 service_id)
+__be64 service_id,
+u8 *private_data)
 {
struct rb_node *node = cm.listen_service_table.rb_node;
struct cm_id_private *cm_id_priv;
+   int data_cmp;
 
while (node) {
cm_id_priv = rb_entry(node, struct cm_id_private, service_node);
+   data_cmp = cm_compare_private_data(private_data,
+  cm_id_priv->compare_data);
i

Re: [openib-general] [PATCH 2/6] IB: match connection requests based on private data

2006-03-06 Thread Sean Hefty

Sean Hefty wrote:

+static void cm_mask_compare_data(u8 *dst, u8 *src, u8 *mask)
+{
+   int i;
+
+   for (i = 0; i < IB_CM_PRIVATE_DATA_COMPARE_SIZE; i++)
+   dst[i] = src[i] & mask[i];
+}
+
+static int cm_compare_data(struct ib_cm_private_data_compare *src_data,
+  struct ib_cm_private_data_compare *dst_data)
+{
+   u8 src[IB_CM_PRIVATE_DATA_COMPARE_SIZE];
+   u8 dst[IB_CM_PRIVATE_DATA_COMPARE_SIZE];


Ugh.  I sent the wrong patch series.  This was the original set of patches, 
before any feedback was incorporated.  I will need to resend patches 2, 4, 5, 
and 6.  Sorry about this.


- Sean


Re: [openib-general] Re: [PATCH 6/6] IB: userspace support for RDMA connection manager

2006-03-06 Thread Sean Hefty

Roland Dreier wrote:

 > +struct rdma_ucm_query_route_resp {
 > + __u64 node_guid;
 > + struct ib_user_path_rec ib_route[2];
 > + struct sockaddr_in6 src_addr;
 > + struct sockaddr_in6 dst_addr;
 > + __u32 num_paths;
 > + __u8 port_num;
 > + __u8 reserved[3];
 > +};

Is there a 32-bit/64-bit compatibility problem here?  From a quick
look, struct sockaddr_in6 is not 8-byte aligned.


Unless I miscounted, they should be aligned.  ib_user_path_rec is defined near 
the end of patch 1/6.


+struct ib_user_path_rec {
+   __u8    dgid[16];
+   __u8    sgid[16];
+   __be16  dlid;
+   __be16  slid;
+   __u32   raw_traffic;
+   __be32  flow_label;
+   __u32   reversible;
+   __u32   mtu;
+   __be16  pkey;
+   __u8    hop_limit;
+   __u8    traffic_class;
+   __u8    numb_path;
+   __u8    sl;
+   __u8    mtu_selector;
+   __u8    rate_selector;
+   __u8    rate;
+   __u8    packet_life_time_selector;
+   __u8    packet_life_time;
+   __u8    preference;
+};
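
One way to check the count rather than tally it by hand is to let the compiler do it.  The sketch below mirrors the two structures in userspace, with uintN_t standing in for the kernel's __uN and __beN types and struct sockaddr_in6 taken from the C library, so it is an approximation of the kernel layout, not the kernel headers themselves.  It only proves anything for the ABI it is compiled for; the 32-bit case would need a separate 32-bit build (for example with -m32 where the toolchain supports it).

```c
#include <stddef.h>
#include <stdint.h>
#include <netinet/in.h>		/* struct sockaddr_in6 */

/* Userspace mirror of the quoted struct; uintN_t stands in for __uN/__beN. */
struct ib_user_path_rec {
	uint8_t  dgid[16];
	uint8_t  sgid[16];
	uint16_t dlid;
	uint16_t slid;
	uint32_t raw_traffic;
	uint32_t flow_label;
	uint32_t reversible;
	uint32_t mtu;
	uint16_t pkey;
	uint8_t  hop_limit;
	uint8_t  traffic_class;
	uint8_t  numb_path;
	uint8_t  sl;
	uint8_t  mtu_selector;
	uint8_t  rate_selector;
	uint8_t  rate;
	uint8_t  packet_life_time_selector;
	uint8_t  packet_life_time;
	uint8_t  preference;
};

struct rdma_ucm_query_route_resp {
	uint64_t node_guid;
	struct ib_user_path_rec ib_route[2];
	struct sockaddr_in6 src_addr;
	struct sockaddr_in6 dst_addr;
	uint32_t num_paths;
	uint8_t  port_num;
	uint8_t  reserved[3];
};

/* Sum of the member sizes; equals sizeof only if no padding was inserted. */
#define ROUTE_RESP_PACKED_SIZE \
	(sizeof(uint64_t) + 2 * sizeof(struct ib_user_path_rec) + \
	 2 * sizeof(struct sockaddr_in6) + sizeof(uint32_t) + 1 + 3)

_Static_assert(sizeof(struct ib_user_path_rec) == 64,
	       "path record is not 64 bytes");
_Static_assert(offsetof(struct rdma_ucm_query_route_resp, src_addr) == 136,
	       "src_addr does not start at offset 136");
_Static_assert(sizeof(struct rdma_ucm_query_route_resp) ==
	       ROUTE_RESP_PACKED_SIZE,
	       "compiler inserted padding in the response");
```

If the file compiles, the path record is 64 bytes and the response carries no compiler-inserted padding on that target, so the field offsets do not depend on where the compiler would otherwise have padded.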

- Sean



Re: [openib-general] Re: [PATCH 6/6] IB: userspace support for RDMA connection manager

2006-03-06 Thread Sean Hefty

Roland Dreier wrote:

On the other hand I think it would be good to let this userspace
interface cook a little more, say in -mm.


I think that this makes sense.

- Sean


Re: [openib-general] RE: [PATCH 2/6] IB: match connection requests based on private data

2006-03-06 Thread Sean Hefty

Caitlin Bestler wrote:

The term "private data" is intended to convey the
intent that the data is private to the application
layer and is opaque to middleware and the network.


The private data area is for the use of whatever client resides above the 
Infiniband CM only.  There is no assumption about whether that client is 
middleware or an application.



By what mechanism does the listening application
delegate how much of the private data for use by
the CM for sub-dividing a listen? What does an
application do if it wishes to retain full ownership
of the private data?


An application that interfaces directly with the Infiniband CM always retains 
full control of any private data.  Applications that interface to middleware are 
restricted by the limitations of that middleware layer.


- Sean


[PATCH 5/6] IB: IP address based RDMA connection manager

2006-03-06 Thread Sean Hefty
Kernel mode connection management agent over Infiniband that connects based
on IP addresses.  The agent defines a generic RDMA connection abstraction
to support clients wanting to connect over different RDMA devices.

Agent also handles RDMA device hotplug events on behalf of clients.

Signed-off-by: Sean Hefty <[EMAIL PROTECTED]>

---

diff -uprN -X linux-2.6.git/Documentation/dontdiff 
linux-2.6.git/drivers/infiniband/core/cma.c 
linux-2.6.ib/drivers/infiniband/core/cma.c
--- linux-2.6.git/drivers/infiniband/core/cma.c 1969-12-31 16:00:00.0 
-0800
+++ linux-2.6.ib/drivers/infiniband/core/cma.c  2006-01-16 16:17:34.0 
-0800
@@ -0,0 +1,1639 @@
+/*
+ * Copyright (c) 2005 Voltaire Inc.  All rights reserved.
+ * Copyright (c) 2002-2005, Network Appliance, Inc. All rights reserved.
+ * Copyright (c) 1999-2005, Mellanox Technologies, Inc. All rights reserved.
+ * Copyright (c) 2005 Intel Corporation.  All rights reserved.
+ *
+ * This Software is licensed under one of the following licenses:
+ *
+ * 1) under the terms of the "Common Public License 1.0" a copy of which is
+ *available from the Open Source Initiative, see
+ *http://www.opensource.org/licenses/cpl.php.
+ *
+ * 2) under the terms of the "The BSD License" a copy of which is
+ *available from the Open Source Initiative, see
+ *http://www.opensource.org/licenses/bsd-license.php.
+ *
+ * 3) under the terms of the "GNU General Public License (GPL) Version 2" a
+ *copy of which is available from the Open Source Initiative, see
+ *http://www.opensource.org/licenses/gpl-license.php.
+ *
+ * Licensee has the right to choose one of the above licenses.
+ *
+ * Redistributions of source code must retain the above copyright
+ * notice and one of the license notices.
+ *
+ * Redistributions in binary form must reproduce both the above copyright
+ * notice, one of the license notices in the documentation
+ * and/or other materials provided with the distribution.
+ *
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+MODULE_AUTHOR("Guy German");
+MODULE_DESCRIPTION("Generic RDMA CM Agent");
+MODULE_LICENSE("Dual BSD/GPL");
+
+#define CMA_CM_RESPONSE_TIMEOUT 20
+#define CMA_MAX_CM_RETRIES 3
+
+static void cma_add_one(struct ib_device *device);
+static void cma_remove_one(struct ib_device *device);
+
+static struct ib_client cma_client = {
+   .name   = "cma",
+   .add    = cma_add_one,
+   .remove = cma_remove_one
+};
+
+static LIST_HEAD(dev_list);
+static LIST_HEAD(listen_any_list);
+static DECLARE_MUTEX(mutex);
+
+struct cma_device {
+   struct list_head    list;
+   struct ib_device    *device;
+   __be64              node_guid;
+   wait_queue_head_t   wait;
+   atomic_t            refcount;
+   struct list_head    id_list;
+};
+
+enum cma_state {
+   CMA_IDLE,
+   CMA_ADDR_QUERY,
+   CMA_ADDR_RESOLVED,
+   CMA_ROUTE_QUERY,
+   CMA_ROUTE_RESOLVED,
+   CMA_CONNECT,
+   CMA_ADDR_BOUND,
+   CMA_LISTEN,
+   CMA_DEVICE_REMOVAL,
+   CMA_DESTROYING
+};
+
+/*
+ * Device removal can occur at anytime, so we need extra handling to
+ * serialize notifying the user of device removal with other callbacks.
+ * We do this by disabling removal notification while a callback is in process,
+ * and reporting it after the callback completes.
+ */
+struct rdma_id_private {
+   struct rdma_cm_id   id;
+
+   struct list_head    list;
+   struct list_head    listen_list;
+   struct cma_device   *cma_dev;
+
+   enum cma_state      state;
+   spinlock_t          lock;
+   wait_queue_head_t   wait;
+   atomic_t            refcount;
+   wait_queue_head_t   wait_remove;
+   atomic_t            dev_remove;
+
+   int                 backlog;
+   int                 timeout_ms;
+   struct ib_sa_query  *query;
+   int                 query_id;
+   struct ib_cm_id     *cm_id;
+
+   u32                 seq_num;
+   u32                 qp_num;
+   enum ib_qp_type     qp_type;
+   u8                  srq;
+};
+
+struct cma_work {
+   struct work_struct  work;
+   struct rdma_id_private  *id;
+};
+
+union cma_ip_addr {
+   struct in6_addr ip6;
+   struct {
+   __u32 pad[3];
+   __u32 addr;
+   } ip4;
+};
+
+struct cma_hdr {
+   u8 cma_version;
+   u8 ip_version;  /* IP version: 7:4 */
+   __u16 port;
+   union cma_ip_addr src_addr;
+   union cma_ip_addr dst_addr;
+};
+
+struct sdp_hh {
+   u8 sdp_version;
+   u8 ip_version;  /* IP version: 7:4 */
+   u8 sdp_specific1[10];
+   __u16 port;
+   __u16 sdp_specific2;
+   union cma_ip_addr src_addr;
+   union cma_ip_addr dst_addr;
+};
+
+#define CMA_VERSION 0x10
+#de

[PATCH 6/6] IB: userspace support for RDMA connection manager

2006-03-06 Thread Sean Hefty
Kernel component necessary to support the userspace RDMA connection management
library.

Signed-off-by: Sean Hefty <[EMAIL PROTECTED]>

---

diff -uprN -X linux-2.6.git/Documentation/dontdiff 
linux-2.6.git/drivers/infiniband/core/Makefile 
linux-2.6.ib/drivers/infiniband/core/Makefile
--- linux-2.6.git/drivers/infiniband/core/Makefile  2006-01-16 
16:58:58.0 -0800
+++ linux-2.6.ib/drivers/infiniband/core/Makefile   2006-01-16 
16:55:25.0 -0800
@@ -1,5 +1,5 @@
 obj-$(CONFIG_INFINIBAND) +=ib_core.o ib_mad.o ib_sa.o \
-   ib_cm.o ib_addr.o rdma_cm.o
+   ib_cm.o ib_addr.o rdma_cm.o rdma_ucm.o
 obj-$(CONFIG_INFINIBAND_USER_MAD) +=   ib_umad.o
 obj-$(CONFIG_INFINIBAND_USER_ACCESS) +=ib_uverbs.o ib_ucm.o
 
@@ -14,6 +14,8 @@ ib_cm-y :=cm.o
 
 rdma_cm-y :=   cma.o
 
+rdma_ucm-y :=  ucma.o
+
 ib_addr-y :=   addr.o
 
 ib_umad-y :=   user_mad.o
diff -uprN -X linux-2.6.git/Documentation/dontdiff 
linux-2.6.git/drivers/infiniband/core/ucma.c 
linux-2.6.ib/drivers/infiniband/core/ucma.c
--- linux-2.6.git/drivers/infiniband/core/ucma.c1969-12-31 
16:00:00.0 -0800
+++ linux-2.6.ib/drivers/infiniband/core/ucma.c 2006-01-16 16:54:31.0 
-0800
@@ -0,0 +1,788 @@
+/*
+ * Copyright (c) 2005 Intel Corporation.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *  - Redistributions of source code must retain the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer.
+ *
+ *  - Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer in the documentation and/or other materials
+ * provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+
+MODULE_AUTHOR("Sean Hefty");
+MODULE_DESCRIPTION("RDMA Userspace Connection Manager Access");
+MODULE_LICENSE("Dual BSD/GPL");
+
+enum {
+   UCMA_MAX_BACKLOG    = 128
+};
+
+struct ucma_file {
+   struct semaphore    mutex;
+   struct file         *filp;
+   struct list_head    ctxs;
+   struct list_head    events;
+   wait_queue_head_t   poll_wait;
+};
+
+struct ucma_context {
+   int                 id;
+   wait_queue_head_t   wait;
+   atomic_t            ref;
+   int                 events_reported;
+   int                 backlog;
+
+   struct ucma_file    *file;
+   struct rdma_cm_id   *cm_id;
+   __u64               uid;
+
+   struct list_head    events;    /* list of pending events. */
+   struct list_head    file_list; /* member in file ctx list */
+};
+
+struct ucma_event {
+   struct ucma_context *ctx;
+   struct list_head    file_list; /* member in file event list */
+   struct list_head    ctx_list;  /* member in ctx event list */
+   struct rdma_cm_id   *cm_id;
+   struct rdma_ucm_event_resp resp;
+};
+
+static DECLARE_MUTEX(ctx_mutex);
+static DEFINE_IDR(ctx_idr);
+
+static struct ucma_context* ucma_get_ctx(struct ucma_file *file, int id)
+{
+   struct ucma_context *ctx;
+
+   down(&ctx_mutex);
+   ctx = idr_find(&ctx_idr, id);
+   if (!ctx)
+   ctx = ERR_PTR(-ENOENT);
+   else if (ctx->file != file)
+   ctx = ERR_PTR(-EINVAL);
+   else
+   atomic_inc(&ctx->ref);
+   up(&ctx_mutex);
+
+   return ctx;
+}
+
+static void ucma_put_ctx(struct ucma_context *ctx)
+{
+   if (atomic_dec_and_test(&ctx->ref))
+   wake_up(&ctx->wait);
+}
+
+static void ucma_cleanup_events(struct ucma_context *ctx)
+{
+   struct ucma_event *uevent;
+
+   down(&ctx->file->mutex);
+   list_del(&ctx->

[PATCH 4/6] IB: address translation to map IP to IB addresses (GIDs)

2006-03-06 Thread Sean Hefty
Add an address translation service that maps IP addresses to Infiniband
GID addresses using IPoIB.

Signed-off-by: Sean Hefty <[EMAIL PROTECTED]>

---
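
For readers unfamiliar with why IPoIB makes this mapping possible: an IPoIB link-layer (hardware) address is 20 bytes long, and by the IPoIB specification the last 16 of those bytes are the port GID, preceded by one flags byte and a 24-bit queue pair number. The following is a minimal userspace sketch of pulling the GID and QPN out of such an address. The helper names and constants are hypothetical and do not appear in the patch; the layout is taken from the IPoIB specification, not from this code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define IPOIB_HW_ADDR_LEN 20	/* IPoIB link-layer address length */
#define IPOIB_GID_OFFSET   4	/* 1 flags byte + 3 QPN bytes precede the GID */
#define GID_LEN           16

/* Hypothetical helper: copy the GID out of an IPoIB hardware address. */
static void ipoib_hw_addr_to_gid(const uint8_t hw_addr[IPOIB_HW_ADDR_LEN],
				 uint8_t gid[GID_LEN])
{
	assert(hw_addr != NULL && gid != NULL);
	memcpy(gid, hw_addr + IPOIB_GID_OFFSET, GID_LEN);
}

/* Hypothetical helper: extract the 24-bit queue pair number. */
static uint32_t ipoib_hw_addr_to_qpn(const uint8_t hw_addr[IPOIB_HW_ADDR_LEN])
{
	return ((uint32_t)hw_addr[1] << 16) |
	       ((uint32_t)hw_addr[2] << 8) |
	        (uint32_t)hw_addr[3];
}
```

Under that assumption, the dev_addr and broadcast buffers that copy_addr() fills from the net_device already contain the source and broadcast GIDs; no separate lookup is needed once ARP has resolved the destination hardware address.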

diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/drivers/infiniband/core/addr.c linux-2.6.ib/drivers/infiniband/core/addr.c
--- linux-2.6.git/drivers/infiniband/core/addr.c    1969-12-31 16:00:00.0 -0800
+++ linux-2.6.ib/drivers/infiniband/core/addr.c 2006-01-16 16:14:24.0 -0800
@@ -0,0 +1,356 @@
+/*
+ * Copyright (c) 2005 Voltaire Inc.  All rights reserved.
+ * Copyright (c) 2002-2005, Network Appliance, Inc. All rights reserved.
+ * Copyright (c) 1999-2005, Mellanox Technologies, Inc. All rights reserved.
+ * Copyright (c) 2005 Intel Corporation.  All rights reserved.
+ *
+ * This Software is licensed under one of the following licenses:
+ *
+ * 1) under the terms of the "Common Public License 1.0" a copy of which is
+ *available from the Open Source Initiative, see
+ *http://www.opensource.org/licenses/cpl.php.
+ *
+ * 2) under the terms of the "The BSD License" a copy of which is
+ *available from the Open Source Initiative, see
+ *http://www.opensource.org/licenses/bsd-license.php.
+ *
+ * 3) under the terms of the "GNU General Public License (GPL) Version 2" a
+ *copy of which is available from the Open Source Initiative, see
+ *http://www.opensource.org/licenses/gpl-license.php.
+ *
+ * Licensee has the right to choose one of the above licenses.
+ *
+ * Redistributions of source code must retain the above copyright
+ * notice and one of the license notices.
+ *
+ * Redistributions in binary form must reproduce both the above copyright
+ * notice, one of the license notices in the documentation
+ * and/or other materials provided with the distribution.
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+MODULE_AUTHOR("Sean Hefty");
+MODULE_DESCRIPTION("IB Address Translation");
+MODULE_LICENSE("Dual BSD/GPL");
+
+struct addr_req {
+   struct list_head list;
+   struct sockaddr src_addr;
+   struct sockaddr dst_addr;
+   struct rdma_dev_addr *addr;
+   void *context;
+   void (*callback)(int status, struct sockaddr *src_addr,
+struct rdma_dev_addr *addr, void *context);
+   unsigned long timeout;
+   int status;
+};
+
+static void process_req(void *data);
+
+static DECLARE_MUTEX(mutex);
+static LIST_HEAD(req_list);
+static DECLARE_WORK(work, process_req, NULL);
+struct workqueue_struct *rdma_wq;
+EXPORT_SYMBOL(rdma_wq);
+
+static int copy_addr(struct rdma_dev_addr *dev_addr, struct net_device *dev,
+unsigned char *dst_dev_addr)
+{
+   switch (dev->type) {
+   case ARPHRD_INFINIBAND:
+   dev_addr->dev_type = IB_NODE_CA;
+   break;
+   default:
+   return -EADDRNOTAVAIL;
+   }
+
+   memcpy(dev_addr->src_dev_addr, dev->dev_addr, MAX_ADDR_LEN);
+   memcpy(dev_addr->broadcast, dev->broadcast, MAX_ADDR_LEN);
+   if (dst_dev_addr)
+   memcpy(dev_addr->dst_dev_addr, dst_dev_addr, MAX_ADDR_LEN);
+   return 0;
+}
+
+int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr)
+{
+   struct net_device *dev;
+   u32 ip = ((struct sockaddr_in *) addr)->sin_addr.s_addr;
+   int ret;
+
+   dev = ip_dev_find(ip);
+   if (!dev)
+   return -EADDRNOTAVAIL;
+
+   ret = copy_addr(dev_addr, dev, NULL);
+   dev_put(dev);
+   return ret;
+}
+EXPORT_SYMBOL(rdma_translate_ip);
+
+static void set_timeout(unsigned long time)
+{
+   unsigned long delay;
+
+   cancel_delayed_work(&work);
+
+   delay = time - jiffies;
+   if ((long)delay <= 0)
+   delay = 1;
+
+   queue_delayed_work(rdma_wq, &work, delay);
+}
+
+static void queue_req(struct addr_req *req)
+{
+   struct addr_req *temp_req;
+
+   down(&mutex);
+   list_for_each_entry_reverse(temp_req, &req_list, list) {
+   if (time_after(req->timeout, temp_req->timeout))
+   break;
+   }
+
+   list_add(&req->list, &temp_req->list);
+
+   if (req_list.next == &req->list)
+   set_timeout(req->timeout);
+   up(&mutex);
+}
+
+static void addr_send_arp(struct sockaddr_in *dst_in)
+{
+   struct rtable *rt;
+   struct flowi fl;
+   u32 dst_ip = dst_in->sin_addr.s_addr;
+
+   memset(&fl, 0, sizeof fl);
+   fl.nl_u.ip4_u.daddr = dst_ip;
+   if (ip_route_output_key(&rt, &fl))
+   return;
+
+   arp_send(ARPOP_REQUEST, ETH_P_ARP, rt->rt_gateway, rt->idev->dev,
+rt->rt_src, NULL, rt->idev->dev->dev_addr, NULL);
+   ip_rt_put(rt);
+}
+
+static int addr_resolve_remote(struct sockaddr_in *src_in,
+  

[PATCH 3/6] net/IB: export ip_dev_find

2006-03-06 Thread Sean Hefty
Export ip_dev_find to allow locating a net_device given an IP address.

Signed-off-by: Sean Hefty <[EMAIL PROTECTED]>

---

diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/net/ipv4/fib_frontend.c linux-2.6.ib/net/ipv4/fib_frontend.c
--- linux-2.6.git/net/ipv4/fib_frontend.c   2006-01-16 10:28:29.0 -0800
+++ linux-2.6.ib/net/ipv4/fib_frontend.c    2006-01-16 16:14:24.0 -0800
@@ -666,4 +666,5 @@ void __init ip_fib_init(void)
 }
 
 EXPORT_SYMBOL(inet_addr_type);
+EXPORT_SYMBOL(ip_dev_find);
 EXPORT_SYMBOL(ip_rt_ioctl);





[PATCH 2/6] IB: match connection requests based on private data

2006-03-06 Thread Sean Hefty
Extend matching connection requests to listens in the Infiniband CM to include
private data checks.

This allows applications to listen on the same service identifier, with private
data directing the request to the appropriate application.

Signed-off-by: Sean Hefty <[EMAIL PROTECTED]>

---
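
To illustrate the matching rule described above, here is a minimal userspace sketch of how an incoming request's private data could be checked against one listen: the request bytes are ANDed with the listen's mask and the result is compared with the listen's expected data. COMPARE_SIZE, struct compare_data and match_private_data are hypothetical names for this sketch; the patch itself uses IB_CM_PRIVATE_DATA_COMPARE_SIZE and struct ib_cm_private_data_compare, whose definitions are not shown in this message.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical size for this sketch only. */
#define COMPARE_SIZE 8

struct compare_data {
	uint8_t data[COMPARE_SIZE];	/* expected bytes, already masked */
	uint8_t mask[COMPARE_SIZE];	/* set bits select the bits that matter */
};

/*
 * Returns 0 when the request's private data matches the listen, or when
 * the listen carries no compare data at all (it accepts every request).
 */
static int match_private_data(const uint8_t *private_data,
			      const struct compare_data *listen)
{
	uint8_t masked[COMPARE_SIZE];
	int i;

	assert(private_data != NULL);
	if (!listen)
		return 0;

	for (i = 0; i < COMPARE_SIZE; i++)
		masked[i] = private_data[i] & listen->mask[i];
	return memcmp(masked, listen->data, COMPARE_SIZE);
}
```

With this rule, two applications can listen on the same service ID as long as their masked data differ, and a listen without compare data acts as a catch-all for that service ID.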

diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/drivers/infiniband/core/cm.c linux-2.6.ib/drivers/infiniband/core/cm.c
--- linux-2.6.git/drivers/infiniband/core/cm.c  2006-01-16 10:25:26.0 -0800
+++ linux-2.6.ib/drivers/infiniband/core/cm.c   2006-01-16 16:03:35.0 -0800
@@ -32,7 +32,7 @@
  * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
  * SOFTWARE.
  *
- * $Id: cm.c 2821 2005-07-08 17:07:28Z sean.hefty $
+ * $Id: cm.c 4311 2005-12-05 18:42:01Z sean.hefty $
  */
 #include 
 #include 
@@ -130,6 +130,7 @@ struct cm_id_private {
/* todo: use alternate port on send failure */
struct cm_av av;
struct cm_av alt_av;
+   struct ib_cm_private_data_compare *compare_data;
 
void *private_data;
__be64 tid;
@@ -355,6 +356,40 @@ static struct cm_id_private * cm_acquire
return cm_id_priv;
 }
 
+static void cm_mask_compare_data(u8 *dst, u8 *src, u8 *mask)
+{
+   int i;
+
+   for (i = 0; i < IB_CM_PRIVATE_DATA_COMPARE_SIZE; i++)
+   dst[i] = src[i] & mask[i];
+}
+
+static int cm_compare_data(struct ib_cm_private_data_compare *src_data,
+  struct ib_cm_private_data_compare *dst_data)
+{
+   u8 src[IB_CM_PRIVATE_DATA_COMPARE_SIZE];
+   u8 dst[IB_CM_PRIVATE_DATA_COMPARE_SIZE];
+
+   if (!src_data || !dst_data)
+   return 0;
+   
+   cm_mask_compare_data(src, src_data->data, dst_data->mask);
+   cm_mask_compare_data(dst, dst_data->data, src_data->mask);
+   return memcmp(src, dst, IB_CM_PRIVATE_DATA_COMPARE_SIZE);
+}
+
+static int cm_compare_private_data(u8 *private_data,
+  struct ib_cm_private_data_compare *dst_data)
+{
+   u8 src[IB_CM_PRIVATE_DATA_COMPARE_SIZE];
+
+   if (!dst_data)
+   return 0;
+   
+   cm_mask_compare_data(src, private_data, dst_data->mask);
+   return memcmp(src, dst_data->data, IB_CM_PRIVATE_DATA_COMPARE_SIZE);
+}
+
 static struct cm_id_private * cm_insert_listen(struct cm_id_private *cm_id_priv)
 {
struct rb_node **link = &cm.listen_service_table.rb_node;
@@ -362,14 +397,18 @@ static struct cm_id_private * cm_insert_
struct cm_id_private *cur_cm_id_priv;
__be64 service_id = cm_id_priv->id.service_id;
__be64 service_mask = cm_id_priv->id.service_mask;
+   int data_cmp;
 
while (*link) {
parent = *link;
cur_cm_id_priv = rb_entry(parent, struct cm_id_private,
  service_node);
+   data_cmp = cm_compare_data(cm_id_priv->compare_data,
+  cur_cm_id_priv->compare_data);
if ((cur_cm_id_priv->id.service_mask & service_id) ==
(service_mask & cur_cm_id_priv->id.service_id) &&
-   (cm_id_priv->id.device == cur_cm_id_priv->id.device))
+   (cm_id_priv->id.device == cur_cm_id_priv->id.device) &&
+   !data_cmp)
return cur_cm_id_priv;
 
if (cm_id_priv->id.device < cur_cm_id_priv->id.device)
@@ -378,6 +417,10 @@ static struct cm_id_private * cm_insert_
link = &(*link)->rb_right;
else if (service_id < cur_cm_id_priv->id.service_id)
link = &(*link)->rb_left;
+   else if (service_id > cur_cm_id_priv->id.service_id)
+   link = &(*link)->rb_right;
+   else if (data_cmp < 0)
+   link = &(*link)->rb_left;
else
link = &(*link)->rb_right;
}
@@ -387,16 +430,20 @@ static struct cm_id_private * cm_insert_
 }
 
 static struct cm_id_private * cm_find_listen(struct ib_device *device,
-__be64 service_id)
+__be64 service_id,
+u8 *private_data)
 {
struct rb_node *node = cm.listen_service_table.rb_node;
struct cm_id_private *cm_id_priv;
+   int data_cmp;
 
while (node) {
cm_id_priv = rb_entry(node, struct cm_id_private, service_node);
+   data_cmp = cm_compare_private_data(private_data,
+  cm_id_priv->compare_data);
if ((cm_id_priv->id.service_mask & service_id) ==
 cm_id_priv->id.service_id &&

[PATCH 1/6] IB: common handling for marshalling parameters to/from userspace

2006-03-06 Thread Sean Hefty

Provide common handling for marshalling data between userspace clients
and kernel mode Infiniband drivers.

Signed-off-by: Sean Hefty <[EMAIL PROTECTED]>

---
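
As a rough picture of what "marshalling" means here, the sketch below copies a kernel-style path record into a flat, fixed-width layout suitable for handing to userspace. Both structures are hypothetical, trimmed-down stand-ins (only a few of the fields named in the removed ib_ucm_event_path_get() are kept), and copy_path_rec_to_user is an illustrative name, not the helper this patch adds.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, trimmed-down stand-in for the kernel-side path record. */
struct k_path_rec {
	union { uint8_t raw[16]; } dgid;
	union { uint8_t raw[16]; } sgid;
	uint16_t dlid;
	uint16_t slid;
	uint16_t pkey;
	uint8_t  sl;
};

/* Hypothetical userspace layout: flat arrays and fixed-width fields. */
struct u_path_rec {
	uint8_t  dgid[16];
	uint8_t  sgid[16];
	uint16_t dlid;
	uint16_t slid;
	uint16_t pkey;
	uint8_t  sl;
	uint8_t  reserved;
};

static void copy_path_rec_to_user(struct u_path_rec *dst,
				  const struct k_path_rec *src)
{
	assert(dst != NULL && src != NULL);

	/* Zero first so reserved bytes never carry stale kernel data. */
	memset(dst, 0, sizeof *dst);

	/* Size each copy by the whole destination array, not one element. */
	memcpy(dst->dgid, src->dgid.raw, sizeof dst->dgid);
	memcpy(dst->sgid, src->sgid.raw, sizeof dst->sgid);

	dst->dlid = src->dlid;
	dst->slid = src->slid;
	dst->pkey = src->pkey;
	dst->sl   = src->sl;
}
```

Keeping one such routine per structure in a shared file means every userspace-facing module converts the structure the same way, which is the duplication this patch removes from ucm.c.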

diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/drivers/infiniband/core/Makefile linux-2.6.ib/drivers/infiniband/core/Makefile
--- linux-2.6.git/drivers/infiniband/core/Makefile  2006-01-16 10:25:27.0 -0800
+++ linux-2.6.ib/drivers/infiniband/core/Makefile   2006-01-16 15:34:15.0 -0800
@@ -16,4 +16,5 @@ ib_umad-y :=  user_mad.o
 
 ib_ucm-y :=ucm.o
 
-ib_uverbs-y := uverbs_main.o uverbs_cmd.o uverbs_mem.o
+ib_uverbs-y := uverbs_main.o uverbs_cmd.o uverbs_mem.o \
+   uverbs_marshall.o
diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/drivers/infiniband/core/ucm.c linux-2.6.ib/drivers/infiniband/core/ucm.c
--- linux-2.6.git/drivers/infiniband/core/ucm.c 2006-01-16 10:25:26.0 -0800
+++ linux-2.6.ib/drivers/infiniband/core/ucm.c  2006-01-16 15:34:15.0 -0800
@@ -30,7 +30,7 @@
  * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
  * SOFTWARE.
  *
- * $Id: ucm.c 2594 2005-06-13 19:46:02Z libor $
+ * $Id: ucm.c 4311 2005-12-05 18:42:01Z sean.hefty $
  */
 #include 
 #include 
@@ -48,6 +48,7 @@
 
 #include 
 #include 
+#include 
 
 MODULE_AUTHOR("Libor Michalek");
 MODULE_DESCRIPTION("InfiniBand userspace Connection Manager access");
@@ -203,36 +204,6 @@ error:
return NULL;
 }
 
-static void ib_ucm_event_path_get(struct ib_ucm_path_rec *upath,
- struct ib_sa_path_rec  *kpath)
-{
-   if (!kpath || !upath)
-   return;
-
-   memcpy(upath->dgid, kpath->dgid.raw, sizeof *upath->dgid);
-   memcpy(upath->sgid, kpath->sgid.raw, sizeof *upath->sgid);
-
-   upath->dlid             = kpath->dlid;
-   upath->slid             = kpath->slid;
-   upath->raw_traffic      = kpath->raw_traffic;
-   upath->flow_label       = kpath->flow_label;
-   upath->hop_limit        = kpath->hop_limit;
-   upath->traffic_class    = kpath->traffic_class;
-   upath->reversible       = kpath->reversible;
-   upath->numb_path        = kpath->numb_path;
-   upath->pkey             = kpath->pkey;
-   upath->sl               = kpath->sl;
-   upath->mtu_selector     = kpath->mtu_selector;
-   upath->mtu              = kpath->mtu;
-   upath->rate_selector    = kpath->rate_selector;
-   upath->rate             = kpath->rate;
-   upath->packet_life_time = kpath->packet_life_time;
-   upath->preference       = kpath->preference;
-
-   upath->packet_life_time_selector =
-   kpath->packet_life_time_selector;
-}
-
 static void ib_ucm_event_req_get(struct ib_ucm_req_event_resp *ureq,
 struct ib_cm_req_event_param *kreq)
 {
@@ -251,8 +222,10 @@ static void ib_ucm_event_req_get(struct 
ureq->srq= kreq->srq;
ureq->port   = kreq->port;
 
-   ib_ucm_event_path_get(&ureq->primary_path, kreq->primary_path);
-   ib_ucm_event_path_get(&ureq->alternate_path, kreq->alternate_path);
+   ib_copy_path_rec_to_user(&ureq->primary_path, kreq->primary_path);
+   if (kreq->alternate_path)
+   ib_copy_path_rec_to_user(&ureq->alternate_path,
+kreq->alternate_path);
 }
 
 static void ib_ucm_event_rep_get(struct ib_ucm_rep_event_resp *urep,
@@ -322,8 +295,8 @@ static int ib_ucm_event_process(struct i
info  = evt->param.rej_rcvd.ari;
break;
case IB_CM_LAP_RECEIVED:
-   ib_ucm_event_path_get(&uvt->resp.u.lap_resp.path,
- evt->param.lap_rcvd.alternate_path);
+   ib_copy_path_rec_to_user(&uvt->resp.u.lap_resp.path,
+evt->param.lap_rcvd.alternate_path);
uvt->data_len = IB_CM_LAP_PRIVATE_DATA_SIZE;
uvt->resp.present = IB_UCM_PRES_ALTERNATE;
break;
@@ -635,65 +608,11 @@ static ssize_t ib_ucm_attr_id(struct ib_
return result;
 }
 
-static void ib_ucm_copy_ah_attr(struct ib_ucm_ah_attr *dest_attr,
-   struct ib_ah_attr *src_attr)
-{
-   memcpy(dest_attr->grh_dgid, src_attr->grh.dgid.raw,
-  sizeof src_attr->grh.dgid);
-   dest_attr->grh_flow_label = src_attr->grh.flow_label;
-   dest_attr->grh_sgid_index = src_attr->grh.sgid_index;
-   dest_attr->grh_hop_limit = src_attr->grh.hop_limit;
-   dest_attr->grh_traffic_class = src_att

[PATCH 3/5] export of ip_dev_find as part of Infiniband connection abstraction

2006-03-03 Thread Sean Hefty
I wanted to make doubly sure that this didn't get lost in the patch series:
ip_dev_find() is re-exported.  Its use is shown below.

- Sean

>+int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr)
>+{
>+  struct net_device *dev;
>+  u32 ip = ((struct sockaddr_in *) addr)->sin_addr.s_addr;
>+  int ret;
>+
>+  dev = ip_dev_find(ip);
>+  if (!dev)
>+  return -EADDRNOTAVAIL;
>+
>+  ret = copy_addr(dev_addr, dev, NULL);
>+  dev_put(dev);
>+  return ret;
>+}

{snip}

>+static int addr_resolve_local(struct sockaddr_in *src_in,
>+struct sockaddr_in *dst_in,
>+struct rdma_dev_addr *addr)
>+{
>+  struct net_device *dev;
>+  u32 src_ip = src_in->sin_addr.s_addr;
>+  u32 dst_ip = dst_in->sin_addr.s_addr;
>+  int ret;
>+
>+  dev = ip_dev_find(dst_ip);
>+  if (!dev)
>+  return -EADDRNOTAVAIL;
>+
>+  if (!src_ip) {
>+  src_in->sin_family = dst_in->sin_family;
>+  src_in->sin_addr.s_addr = dst_ip;
>+  ret = copy_addr(addr, dev, dev->dev_addr);
>+  } else {
>+  ret = rdma_translate_ip((struct sockaddr *)src_in, addr);
>+  if (!ret)
>+  memcpy(addr->dst_dev_addr, dev->dev_addr, MAX_ADDR_LEN);
>+  }
>+
>+  dev_put(dev);
>+  return ret;
>+}

{snip}

>diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/net/ipv4/fib_frontend.c linux-2.6.ib/net/ipv4/fib_frontend.c
>--- linux-2.6.git/net/ipv4/fib_frontend.c  2006-01-16 10:28:29.0 -0800
>+++ linux-2.6.ib/net/ipv4/fib_frontend.c   2006-01-16 16:14:24.0 -0800
>@@ -666,4 +666,5 @@ void __init ip_fib_init(void)
> }
>
> EXPORT_SYMBOL(inet_addr_type);
>+EXPORT_SYMBOL(ip_dev_find);
> EXPORT_SYMBOL(ip_rt_ioctl);



[PATCH 0/5] Infiniband: connection abstraction

2006-03-03 Thread Sean Hefty
>Here's an updated version of these patches based on feedback.   (The license
>did not change and continues to match that of the other Infiniband code.)
>Please consider for inclusion in 2.6.17.

This is just a ping for any more feedback on this patch series, so that I can
respond to any requests before 2.6.17 opens up.

I can resubmit the patches if necessary.

Thanks,
Sean



[PATCH 5/5] Infiniband: connection abstraction

2006-02-01 Thread Sean Hefty
This patch adds the kernel component to support the userspace Infiniband/RDMA
connection agent library.

Signed-off-by: Sean Hefty <[EMAIL PROTECTED]>

---

diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/drivers/infiniband/core/Makefile linux-2.6.ib/drivers/infiniband/core/Makefile
--- linux-2.6.git/drivers/infiniband/core/Makefile  2006-01-16 16:58:58.0 -0800
+++ linux-2.6.ib/drivers/infiniband/core/Makefile   2006-01-16 16:55:25.0 -0800
@@ -1,5 +1,5 @@
 obj-$(CONFIG_INFINIBAND) +=ib_core.o ib_mad.o ib_sa.o \
-   ib_cm.o ib_addr.o rdma_cm.o
+   ib_cm.o ib_addr.o rdma_cm.o rdma_ucm.o
 obj-$(CONFIG_INFINIBAND_USER_MAD) +=   ib_umad.o
 obj-$(CONFIG_INFINIBAND_USER_ACCESS) +=ib_uverbs.o ib_ucm.o
 
@@ -14,6 +14,8 @@ ib_cm-y :=cm.o
 
 rdma_cm-y :=   cma.o
 
+rdma_ucm-y :=  ucma.o
+
 ib_addr-y :=   addr.o
 
 ib_umad-y :=   user_mad.o
diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/drivers/infiniband/core/ucma.c linux-2.6.ib/drivers/infiniband/core/ucma.c
--- linux-2.6.git/drivers/infiniband/core/ucma.c    1969-12-31 16:00:00.0 -0800
+++ linux-2.6.ib/drivers/infiniband/core/ucma.c 2006-01-16 16:54:31.0 -0800
@@ -0,0 +1,788 @@
+/*
+ * Copyright (c) 2005 Intel Corporation.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *  - Redistributions of source code must retain the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer.
+ *
+ *  - Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer in the documentation and/or other materials
+ * provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+
+MODULE_AUTHOR("Sean Hefty");
+MODULE_DESCRIPTION("RDMA Userspace Connection Manager Access");
+MODULE_LICENSE("Dual BSD/GPL");
+
+enum {
+   UCMA_MAX_BACKLOG    = 128
+};
+
+struct ucma_file {
+   struct mutex        file_mutex;
+   struct file         *filp;
+   struct list_head    ctxs;
+   struct list_head    events;
+   wait_queue_head_t   poll_wait;
+};
+
+struct ucma_context {
+   int                 id;
+   wait_queue_head_t   wait;
+   atomic_t            ref;
+   int                 events_reported;
+   int                 backlog;
+
+   struct ucma_file    *file;
+   struct rdma_cm_id   *cm_id;
+   __u64               uid;
+
+   struct list_head    events;    /* list of pending events. */
+   struct list_head    file_list; /* member in file ctx list */
+};
+
+struct ucma_event {
+   struct ucma_context *ctx;
+   struct list_head    file_list; /* member in file event list */
+   struct list_head    ctx_list;  /* member in ctx event list */
+   struct rdma_cm_id   *cm_id;
+   struct rdma_ucm_event_resp resp;
+};
+
+static DEFINE_MUTEX(ctx_mutex);
+static DEFINE_IDR(ctx_idr);
+
+static struct ucma_context* ucma_get_ctx(struct ucma_file *file, int id)
+{
+   struct ucma_context *ctx;
+
+   mutex_lock(&ctx_mutex);
+   ctx = idr_find(&ctx_idr, id);
+   if (!ctx)
+   ctx = ERR_PTR(-ENOENT);
+   else if (ctx->file != file)
+   ctx = ERR_PTR(-EINVAL);
+   else
+   atomic_inc(&ctx->ref);
+   mutex_unlock(&ctx_mutex);
+
+   return ctx;
+}
+
+static void ucma_put_ctx(struct ucma_context *ctx)
+{
+   if (atomic_dec_and_test(&ctx->ref))
+   wake_up(&ctx->wait);
+}
+
+static void ucma_cleanup_events(struct ucma_context *ctx)
+{
+   struct ucma_event *uevent;
+
+   mutex_lock(&ctx->file

[PATCH 4/5] Infiniband: connection abstraction

2006-02-01 Thread Sean Hefty
The following patch implements a kernel mode connection management agent
over Infiniband that connects based on IP addresses.

The agent defines a generic RDMA connection abstraction to support clients
wanting to connect over different RDMA devices.

It also handles RDMA device hotplug events on behalf of clients.

Signed-off-by: Sean Hefty <[EMAIL PROTECTED]>

---
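
Because the agent connects by IP address, the connection request has to carry the addresses in its private data; this patch does so with struct cma_hdr, whose ip_version byte is commented "IP version: 7:4". Reading that comment as "the IP version lives in bits 7 through 4" (an assumption of this sketch), the accessors could look like the following. struct hdr_prefix and both helper names are hypothetical and mirror only the first two bytes of the header.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the two single-byte fields at the start of struct cma_hdr. */
struct hdr_prefix {
	uint8_t cma_version;
	uint8_t ip_version;	/* IP version: 7:4 */
};

/* Read the IP version (4 or 6) from the upper nibble. */
static uint8_t hdr_get_ip_ver(const struct hdr_prefix *hdr)
{
	return hdr->ip_version >> 4;
}

/* Store the IP version in the upper nibble, preserving the lower one. */
static void hdr_set_ip_ver(struct hdr_prefix *hdr, uint8_t ip_ver)
{
	assert(ip_ver <= 0xf);
	hdr->ip_version = (uint8_t)((ip_ver << 4) | (hdr->ip_version & 0x0f));
}
```

Preserving the lower nibble matters if those bits are used for something else, as the "7:4" notation suggests; the sketch makes no claim about what they hold.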

diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/drivers/infiniband/core/cma.c linux-2.6.ib/drivers/infiniband/core/cma.c
--- linux-2.6.git/drivers/infiniband/core/cma.c 1969-12-31 16:00:00.0 -0800
+++ linux-2.6.ib/drivers/infiniband/core/cma.c  2006-01-16 16:17:34.0 -0800
@@ -0,0 +1,1639 @@
+/*
+ * Copyright (c) 2005 Voltaire Inc.  All rights reserved.
+ * Copyright (c) 2002-2005, Network Appliance, Inc. All rights reserved.
+ * Copyright (c) 1999-2005, Mellanox Technologies, Inc. All rights reserved.
+ * Copyright (c) 2005 Intel Corporation.  All rights reserved.
+ *
+ * This Software is licensed under one of the following licenses:
+ *
+ * 1) under the terms of the "Common Public License 1.0" a copy of which is
+ *available from the Open Source Initiative, see
+ *http://www.opensource.org/licenses/cpl.php.
+ *
+ * 2) under the terms of the "The BSD License" a copy of which is
+ *available from the Open Source Initiative, see
+ *http://www.opensource.org/licenses/bsd-license.php.
+ *
+ * 3) under the terms of the "GNU General Public License (GPL) Version 2" a
+ *copy of which is available from the Open Source Initiative, see
+ *http://www.opensource.org/licenses/gpl-license.php.
+ *
+ * Licensee has the right to choose one of the above licenses.
+ *
+ * Redistributions of source code must retain the above copyright
+ * notice and one of the license notices.
+ *
+ * Redistributions in binary form must reproduce both the above copyright
+ * notice, one of the license notices in the documentation
+ * and/or other materials provided with the distribution.
+ *
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+MODULE_AUTHOR("Sean Hefty");
+MODULE_DESCRIPTION("Generic RDMA CM Agent");
+MODULE_LICENSE("Dual BSD/GPL");
+
+#define CMA_CM_RESPONSE_TIMEOUT 20
+#define CMA_MAX_CM_RETRIES 3
+
+static void cma_add_one(struct ib_device *device);
+static void cma_remove_one(struct ib_device *device);
+
+static struct ib_client cma_client = {
+   .name   = "cma",
+   .add= cma_add_one,
+   .remove = cma_remove_one
+};
+
+static LIST_HEAD(dev_list);
+static LIST_HEAD(listen_any_list);
+static DEFINE_MUTEX(lock);
+
+struct cma_device {
+   struct list_head    list;
+   struct ib_device    *device;
+   __be64              node_guid;
+   wait_queue_head_t   wait;
+   atomic_t            refcount;
+   struct list_head    id_list;
+};
+
+enum cma_state {
+   CMA_IDLE,
+   CMA_ADDR_QUERY,
+   CMA_ADDR_RESOLVED,
+   CMA_ROUTE_QUERY,
+   CMA_ROUTE_RESOLVED,
+   CMA_CONNECT,
+   CMA_ADDR_BOUND,
+   CMA_LISTEN,
+   CMA_DEVICE_REMOVAL,
+   CMA_DESTROYING
+};
+
+/*
+ * Device removal can occur at anytime, so we need extra handling to
+ * serialize notifying the user of device removal with other callbacks.
+ * We do this by disabling removal notification while a callback is in process,
+ * and reporting it after the callback completes.
+ */
+struct rdma_id_private {
+   struct rdma_cm_id   id;
+
+   struct list_head    list;
+   struct list_head    listen_list;
+   struct cma_device   *cma_dev;
+
+   enum cma_state      state;
+   spinlock_t          lock;
+   wait_queue_head_t   wait;
+   atomic_t            refcount;
+   wait_queue_head_t   wait_remove;
+   atomic_t            dev_remove;
+
+   int                 backlog;
+   int                 timeout_ms;
+   struct ib_sa_query  *query;
+   int                 query_id;
+   struct ib_cm_id     *cm_id;
+
+   u32                 seq_num;
+   u32                 qp_num;
+   enum ib_qp_type     qp_type;
+   u8                  srq;
+};
+
+struct cma_work {
+   struct work_struct  work;
+   struct rdma_id_private  *id;
+};
+
+union cma_ip_addr {
+   struct in6_addr ip6;
+   struct {
+   __u32 pad[3];
+   __u32 addr;
+   } ip4;
+};
+
+struct cma_hdr {
+   u8 cma_version;
+   u8 ip_version;  /* IP version: 7:4 */
+   __u16 port;
+   union cma_ip_addr src_addr;
+   union cma_ip_addr dst_addr;
+};
+
+struct sdp_hh {
+   u8 sdp_version;
+   u8 ip_version;  /* IP version: 7:4 */
+   u8 sdp_specific1[10];
+   __u16 port;
+   __u16 sdp_specific2;
+   union cma_ip_addr src_addr;
+   union cma_ip_addr dst_addr;
+};

[PATCH 3/5] Infiniband: connection abstraction

2006-02-01 Thread Sean Hefty
The following provides an address translation service that maps IP addresses
to Infiniband addresses (GIDs) using IPoIB.

This patch exports ip_dev_find() to locate a net_device given an IP address.

Signed-off-by: Sean Hefty <[EMAIL PROTECTED]>

---
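
The translation service keeps its pending requests ordered by timeout and compares deadlines with time_after(), which stays correct when the jiffies counter wraps. The sketch below shows that ordering rule in userspace C on a plain array instead of a kernel list. deadline_after and insert_deadline are hypothetical names, and the comparison assumes any two deadlines are less than half the counter range apart.

```c
#include <assert.h>
#include <stddef.h>

/* Wraparound-safe "a is later than b", in the spirit of time_after(). */
static int deadline_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/*
 * Insert deadline into sorted[0..n), keeping earliest-first order.
 * The caller guarantees room for n + 1 entries.  Returns the index used.
 */
static size_t insert_deadline(unsigned long *sorted, size_t n,
			      unsigned long deadline)
{
	size_t i = n;

	assert(sorted != NULL);

	/* Scan from the tail, as queue_req() does, shifting later entries. */
	while (i > 0 && !deadline_after(deadline, sorted[i - 1])) {
		sorted[i] = sorted[i - 1];
		i--;
	}
	sorted[i] = deadline;
	return i;
}
```

When the new entry lands at index 0 it has become the earliest deadline, which corresponds to the case where queue_req() re-arms the delayed work through set_timeout().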

diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/drivers/infiniband/core/addr.c linux-2.6.ib/drivers/infiniband/core/addr.c
--- linux-2.6.git/drivers/infiniband/core/addr.c    1969-12-31 16:00:00.0 -0800
+++ linux-2.6.ib/drivers/infiniband/core/addr.c 2006-01-16 16:14:24.0 -0800
@@ -0,0 +1,356 @@
+/*
+ * Copyright (c) 2005 Voltaire Inc.  All rights reserved.
+ * Copyright (c) 2002-2005, Network Appliance, Inc. All rights reserved.
+ * Copyright (c) 1999-2005, Mellanox Technologies, Inc. All rights reserved.
+ * Copyright (c) 2005 Intel Corporation.  All rights reserved.
+ *
+ * This Software is licensed under one of the following licenses:
+ *
+ * 1) under the terms of the "Common Public License 1.0" a copy of which is
+ *available from the Open Source Initiative, see
+ *http://www.opensource.org/licenses/cpl.php.
+ *
+ * 2) under the terms of the "The BSD License" a copy of which is
+ *available from the Open Source Initiative, see
+ *http://www.opensource.org/licenses/bsd-license.php.
+ *
+ * 3) under the terms of the "GNU General Public License (GPL) Version 2" a
+ *copy of which is available from the Open Source Initiative, see
+ *http://www.opensource.org/licenses/gpl-license.php.
+ *
+ * Licensee has the right to choose one of the above licenses.
+ *
+ * Redistributions of source code must retain the above copyright
+ * notice and one of the license notices.
+ *
+ * Redistributions in binary form must reproduce both the above copyright
+ * notice, one of the license notices in the documentation
+ * and/or other materials provided with the distribution.
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+MODULE_AUTHOR("Sean Hefty");
+MODULE_DESCRIPTION("IB Address Translation");
+MODULE_LICENSE("Dual BSD/GPL");
+
+struct addr_req {
+   struct list_head list;
+   struct sockaddr src_addr;
+   struct sockaddr dst_addr;
+   struct rdma_dev_addr *addr;
+   void *context;
+   void (*callback)(int status, struct sockaddr *src_addr,
+struct rdma_dev_addr *addr, void *context);
+   unsigned long timeout;
+   int status;
+};
+
+static void process_req(void *data);
+
+static DEFINE_MUTEX(lock);
+static LIST_HEAD(req_list);
+static DECLARE_WORK(work, process_req, NULL);
+struct workqueue_struct *rdma_wq;
+EXPORT_SYMBOL(rdma_wq);
+
+static int copy_addr(struct rdma_dev_addr *dev_addr, struct net_device *dev,
+unsigned char *dst_dev_addr)
+{
+   switch (dev->type) {
+   case ARPHRD_INFINIBAND:
+   dev_addr->dev_type = IB_NODE_CA;
+   break;
+   default:
+   return -EADDRNOTAVAIL;
+   }
+
+   memcpy(dev_addr->src_dev_addr, dev->dev_addr, MAX_ADDR_LEN);
+   memcpy(dev_addr->broadcast, dev->broadcast, MAX_ADDR_LEN);
+   if (dst_dev_addr)
+   memcpy(dev_addr->dst_dev_addr, dst_dev_addr, MAX_ADDR_LEN);
+   return 0;
+}
+
+int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr)
+{
+   struct net_device *dev;
+   u32 ip = ((struct sockaddr_in *) addr)->sin_addr.s_addr;
+   int ret;
+
+   dev = ip_dev_find(ip);
+   if (!dev)
+   return -EADDRNOTAVAIL;
+
+   ret = copy_addr(dev_addr, dev, NULL);
+   dev_put(dev);
+   return ret;
+}
+EXPORT_SYMBOL(rdma_translate_ip);
+
+static void set_timeout(unsigned long time)
+{
+   unsigned long delay;
+
+   cancel_delayed_work(&work);
+
+   delay = time - jiffies;
+   if ((long)delay <= 0)
+   delay = 1;
+
+   queue_delayed_work(rdma_wq, &work, delay);
+}
+
+static void queue_req(struct addr_req *req)
+{
+   struct addr_req *temp_req;
+
+   mutex_lock(&lock);
+   list_for_each_entry_reverse(temp_req, &req_list, list) {
+   if (time_after(req->timeout, temp_req->timeout))
+   break;
+   }
+
+   list_add(&req->list, &temp_req->list);
+
+   if (req_list.next == &req->list)
+   set_timeout(req->timeout);
+   mutex_unlock(&lock);
+}
+
+static void addr_send_arp(struct sockaddr_in *dst_in)
+{
+   struct rtable *rt;
+   struct flowi fl;
+   u32 dst_ip = dst_in->sin_addr.s_addr;
+
+   memset(&fl, 0, sizeof fl);
+   fl.nl_u.ip4_u.daddr = dst_ip;
+   if (ip_route_output_key(&rt, &fl))
+   return;
+
+   arp_send(ARPOP_REQUEST, ETH_P_ARP, rt->rt_gateway, rt->idev->dev,
+rt->rt_src, NULL, rt->

[PATCH 2/5] Infiniband: connection abstraction

2006-02-01 Thread Sean Hefty
The following patch extends matching connection requests to listens in the
Infiniband CM to include private data.

Signed-off-by: Sean Hefty <[EMAIL PROTECTED]>

---

diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/drivers/infiniband/core/cm.c linux-2.6.ib/drivers/infiniband/core/cm.c
--- linux-2.6.git/drivers/infiniband/core/cm.c  2006-01-16 10:25:26.0 -0800
+++ linux-2.6.ib/drivers/infiniband/core/cm.c   2006-01-16 16:03:35.0 -0800
@@ -32,7 +32,7 @@
  * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
  * SOFTWARE.
  *
- * $Id: cm.c 2821 2005-07-08 17:07:28Z sean.hefty $
+ * $Id: cm.c 4311 2005-12-05 18:42:01Z sean.hefty $
  */
 #include 
 #include 
@@ -130,6 +130,7 @@ struct cm_id_private {
/* todo: use alternate port on send failure */
struct cm_av av;
struct cm_av alt_av;
+   struct ib_cm_compare_data *compare_data;
 
void *private_data;
__be64 tid;
@@ -355,6 +356,41 @@ static struct cm_id_private * cm_acquire
return cm_id_priv;
 }
 
+static void cm_mask_copy(u8 *dst, u8 *src, u8 *mask)
+{
+   int i;
+
+   for (i = 0; i < IB_CM_COMPARE_SIZE / sizeof(unsigned long); i++)
+       ((unsigned long *) dst)[i] = ((unsigned long *) src)[i] &
+                                    ((unsigned long *) mask)[i];
+}
+
+static int cm_compare_data(struct ib_cm_compare_data *src_data,
+                           struct ib_cm_compare_data *dst_data)
+{
+   u8 src[IB_CM_COMPARE_SIZE];
+   u8 dst[IB_CM_COMPARE_SIZE];
+
+   if (!src_data || !dst_data)
+       return 0;
+
+   cm_mask_copy(src, src_data->data, dst_data->mask);
+   cm_mask_copy(dst, dst_data->data, src_data->mask);
+   return memcmp(src, dst, IB_CM_COMPARE_SIZE);
+}
+
+static int cm_compare_private_data(u8 *private_data,
+                                   struct ib_cm_compare_data *dst_data)
+{
+   u8 src[IB_CM_COMPARE_SIZE];
+
+   if (!dst_data)
+       return 0;
+
+   cm_mask_copy(src, private_data, dst_data->mask);
+   return memcmp(src, dst_data->data, IB_CM_COMPARE_SIZE);
+}
+
 static struct cm_id_private * cm_insert_listen(struct cm_id_private 
*cm_id_priv)
 {
struct rb_node **link = &cm.listen_service_table.rb_node;
@@ -362,14 +397,18 @@ static struct cm_id_private * cm_insert_
struct cm_id_private *cur_cm_id_priv;
__be64 service_id = cm_id_priv->id.service_id;
__be64 service_mask = cm_id_priv->id.service_mask;
+   int data_cmp;
 
while (*link) {
parent = *link;
cur_cm_id_priv = rb_entry(parent, struct cm_id_private,
  service_node);
+   data_cmp = cm_compare_data(cm_id_priv->compare_data,
+  cur_cm_id_priv->compare_data);
if ((cur_cm_id_priv->id.service_mask & service_id) ==
(service_mask & cur_cm_id_priv->id.service_id) &&
-   (cm_id_priv->id.device == cur_cm_id_priv->id.device))
+   (cm_id_priv->id.device == cur_cm_id_priv->id.device) &&
+   !data_cmp)
return cur_cm_id_priv;
 
if (cm_id_priv->id.device < cur_cm_id_priv->id.device)
@@ -378,6 +417,10 @@ static struct cm_id_private * cm_insert_
link = &(*link)->rb_right;
else if (service_id < cur_cm_id_priv->id.service_id)
link = &(*link)->rb_left;
+   else if (service_id > cur_cm_id_priv->id.service_id)
+   link = &(*link)->rb_right;
+   else if (data_cmp < 0)
+   link = &(*link)->rb_left;
else
link = &(*link)->rb_right;
}
@@ -387,16 +430,20 @@ static struct cm_id_private * cm_insert_
 }
 
 static struct cm_id_private * cm_find_listen(struct ib_device *device,
-__be64 service_id)
+__be64 service_id,
+u8 *private_data)
 {
struct rb_node *node = cm.listen_service_table.rb_node;
struct cm_id_private *cm_id_priv;
+   int data_cmp;
 
while (node) {
cm_id_priv = rb_entry(node, struct cm_id_private, service_node);
+   data_cmp = cm_compare_private_data(private_data,
+  cm_id_priv->compare_data);
if ((cm_id_priv->id.service_mask & service_id) ==
 cm_id_priv->id.service_id &&
-   (cm_id_priv->id.device == device))
+   (cm_id_priv->id.device == device) && !data_cmp)
   

[PATCH 1/5] Infiniband: connection abstraction

2006-02-01 Thread Sean Hefty
The following patch provides common handling for marshalling data between
userspace clients and kernel mode Infiniband drivers.

Signed-off-by: Sean Hefty <[EMAIL PROTECTED]>

---

diff -uprN -X linux-2.6.git/Documentation/dontdiff 
linux-2.6.git/drivers/infiniband/core/Makefile 
linux-2.6.ib/drivers/infiniband/core/Makefile
--- linux-2.6.git/drivers/infiniband/core/Makefile  2006-01-16 
10:25:27.0 -0800
+++ linux-2.6.ib/drivers/infiniband/core/Makefile   2006-01-16 
15:34:15.0 -0800
@@ -16,4 +16,5 @@ ib_umad-y :=  user_mad.o
 
 ib_ucm-y :=ucm.o
 
-ib_uverbs-y := uverbs_main.o uverbs_cmd.o uverbs_mem.o
+ib_uverbs-y := uverbs_main.o uverbs_cmd.o uverbs_mem.o \
+   uverbs_marshall.o
diff -uprN -X linux-2.6.git/Documentation/dontdiff 
linux-2.6.git/drivers/infiniband/core/ucm.c 
linux-2.6.ib/drivers/infiniband/core/ucm.c
--- linux-2.6.git/drivers/infiniband/core/ucm.c 2006-01-16 10:25:26.0 
-0800
+++ linux-2.6.ib/drivers/infiniband/core/ucm.c  2006-01-16 15:34:15.0 
-0800
@@ -30,7 +30,7 @@
  * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
  * SOFTWARE.
  *
- * $Id: ucm.c 2594 2005-06-13 19:46:02Z libor $
+ * $Id: ucm.c 4311 2005-12-05 18:42:01Z sean.hefty $
  */
 #include 
 #include 
@@ -48,6 +48,7 @@
 
 #include 
 #include 
+#include 
 
 MODULE_AUTHOR("Libor Michalek");
 MODULE_DESCRIPTION("InfiniBand userspace Connection Manager access");
@@ -203,36 +204,6 @@ error:
return NULL;
 }
 
-static void ib_ucm_event_path_get(struct ib_ucm_path_rec *upath,
- struct ib_sa_path_rec  *kpath)
-{
-   if (!kpath || !upath)
-   return;
-
-   memcpy(upath->dgid, kpath->dgid.raw, sizeof *upath->dgid);
-   memcpy(upath->sgid, kpath->sgid.raw, sizeof *upath->sgid);
-
-   upath->dlid = kpath->dlid;
-   upath->slid = kpath->slid;
-   upath->raw_traffic  = kpath->raw_traffic;
-   upath->flow_label   = kpath->flow_label;
-   upath->hop_limit= kpath->hop_limit;
-   upath->traffic_class= kpath->traffic_class;
-   upath->reversible   = kpath->reversible;
-   upath->numb_path= kpath->numb_path;
-   upath->pkey = kpath->pkey;
-   upath->sl   = kpath->sl;
-   upath->mtu_selector = kpath->mtu_selector;
-   upath->mtu  = kpath->mtu;
-   upath->rate_selector= kpath->rate_selector;
-   upath->rate = kpath->rate;
-   upath->packet_life_time = kpath->packet_life_time;
-   upath->preference   = kpath->preference;
-
-   upath->packet_life_time_selector =
-   kpath->packet_life_time_selector;
-}
-
 static void ib_ucm_event_req_get(struct ib_ucm_req_event_resp *ureq,
 struct ib_cm_req_event_param *kreq)
 {
@@ -251,8 +222,10 @@ static void ib_ucm_event_req_get(struct 
ureq->srq= kreq->srq;
ureq->port   = kreq->port;
 
-   ib_ucm_event_path_get(&ureq->primary_path, kreq->primary_path);
-   ib_ucm_event_path_get(&ureq->alternate_path, kreq->alternate_path);
+   ib_copy_path_rec_to_user(&ureq->primary_path, kreq->primary_path);
+   if (kreq->alternate_path)
+   ib_copy_path_rec_to_user(&ureq->alternate_path,
+kreq->alternate_path);
 }
 
 static void ib_ucm_event_rep_get(struct ib_ucm_rep_event_resp *urep,
@@ -322,8 +295,8 @@ static int ib_ucm_event_process(struct i
info  = evt->param.rej_rcvd.ari;
break;
case IB_CM_LAP_RECEIVED:
-   ib_ucm_event_path_get(&uvt->resp.u.lap_resp.path,
- evt->param.lap_rcvd.alternate_path);
+   ib_copy_path_rec_to_user(&uvt->resp.u.lap_resp.path,
+evt->param.lap_rcvd.alternate_path);
uvt->data_len = IB_CM_LAP_PRIVATE_DATA_SIZE;
uvt->resp.present = IB_UCM_PRES_ALTERNATE;
break;
@@ -635,65 +608,11 @@ static ssize_t ib_ucm_attr_id(struct ib_
return result;
 }
 
-static void ib_ucm_copy_ah_attr(struct ib_ucm_ah_attr *dest_attr,
-   struct ib_ah_attr *src_attr)
-{
-   memcpy(dest_attr->grh_dgid, src_attr->grh.dgid.raw,
-  sizeof src_attr->grh.dgid);
-   dest_attr->grh_flow_label = src_attr->grh.flow_label;
-   dest_attr->grh_sgid_index = src_attr->grh.sgid_index;
-   dest_attr->grh_hop_limit = src_attr->grh.hop_limit;
-   dest_attr->grh_

[PATCH 0/5] Infiniband: connection abstraction

2006-02-01 Thread Sean Hefty
Here's an updated version of these patches based on feedback.  (The license
did not change and continues to match that of the other Infiniband code.)
Please consider for inclusion in 2.6.17.

The following set of patches defines a connection abstraction for Infiniband and
other RDMA devices, and serves several purposes:

* It implements a connection protocol over Infiniband based on IP addressing.
This greatly simplifies clients wishing to establish connections over
Infiniband.

* It defines a connection abstraction that works over multiple RDMA devices.
The submitted implementation targets Infiniband, but has been tested over other
RDMA devices as well.

* It handles RDMA device insertion and removal on behalf of its clients.

The changes have been broken into 5 separate patches.  The basic purpose of each
patch is:

1. Provide common handling for marshalling data between userspace clients and
kernel mode Infiniband drivers.

2. Extend the Infiniband CM to include private data comparisons as part of its
connection request matching process.

3. Provide an address translation service that maps IP addresses to Infiniband
addresses (GIDs).  This patch touches outside of the Infiniband core, so I'm
including the netdev mailing list.

4. Implement the kernel mode RDMA connection management agent.

5. Implement the userspace RDMA connection management agent kernel support
module.

Please copy the openib-general mailing list on any replies.

Thanks,
Sean



Re: [openib-general] [PATCH 5/5] [RFC] Infiniband: connection abstraction

2006-01-18 Thread Sean Hefty

Roland Dreier wrote:

 > + UCMA_MAX_BACKLOG= 128

Is there any reason that we might want to make this a tunable?  Maybe
as a module parameter that's writable in sysfs...


There's no reason not to make this tunable.

- Sean


Re: [openib-general] RE: [PATCH 2/5] [RFC] Infiniband: connection abstraction

2006-01-18 Thread Sean Hefty

Grant Grundler wrote:

Is this code going to get invoked very often?


In practice, it would be invoked when matching any listen requests 
originating from the CMA (RDMA connection abstraction).


Hrm... I'm not sure how to translate your answer into a workload.
e.g. which netperf or netpipe test would exercise this a lot?
Or would it take something like MPI or specweb/ttcp?


The code will be invoked at least once for every connection that is established.


e.g something like:
	for (i = 0; i < IB_CM_PRIVATE_DATA_COMPARE_SIZE/sizeof(unsigned long); i++)
		((unsigned long *)dst)[i] = ((unsigned long *)src)[i]
				& ((unsigned long *)mask)[i];


Yes - something like this should work.  Thanks.



Do you need a patch?
I can submit one but it will be untested.


I will incorporate the change with the next set of updates.  Someone else 
pointed out that I'd need to make sure that there won't be any alignment issues.


- Sean
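
On the alignment concern mentioned above, here is a minimal userspace sketch of one way to mask a word at a time without assuming that the u8 buffers are long-aligned.  It is not the code from the patch: the name mask_copy and the COMPARE_SIZE constant are hypothetical stand-ins, and the 64-byte size is taken from the discussion in this thread.  The word loads and stores go through memcpy() rather than pointer casts.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed size, based on the 64-byte figure quoted in the thread. */
#define COMPARE_SIZE 64

/*
 * Hypothetical helper: dst = src & mask, one unsigned long at a time.
 * memcpy() moves each word in and out of local variables, so none of
 * the three buffers has to be aligned for unsigned long access.
 */
static void mask_copy(uint8_t *dst, const uint8_t *src, const uint8_t *mask)
{
	unsigned long s, m;
	size_t i;

	/* The word loop only covers the buffer if the size divides evenly. */
	assert(COMPARE_SIZE % sizeof(unsigned long) == 0);

	for (i = 0; i < COMPARE_SIZE; i += sizeof(unsigned long)) {
		memcpy(&s, src + i, sizeof s);
		memcpy(&m, mask + i, sizeof m);
		s &= m;
		memcpy(dst + i, &s, sizeof s);
	}
}
```

Compilers typically turn these fixed-size memcpy() calls into plain loads and stores, so the sketch should cost about the same as the cast-based loop on architectures that tolerate unaligned access.  The other option is to guarantee alignment of the buffers themselves, which is a property of the structure layout and cannot be shown in isolation here.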


Re: [openib-general] RE: [PATCH 2/5] [RFC] Infiniband: connection abstraction

2006-01-18 Thread Sean Hefty

Grant Grundler wrote:

+static void cm_mask_compare_data(u8 *dst, u8 *src, u8 *mask)
+{
+   int i;
+
+   for (i = 0; i < IB_CM_PRIVATE_DATA_COMPARE_SIZE; i++)
+       dst[i] = src[i] & mask[i];
+}


Is this code going to get invoked very often?


In practice, it would be invoked when matching any listen requests originating 
from the CMA (RDMA connection abstraction).



If so, can the mask operation use a "native" size since
IB_CM_PRIVATE_DATA_COMPARE_SIZE is hard coded to 64 byte?

e.g something like:
	for (i = 0; i < IB_CM_PRIVATE_DATA_COMPARE_SIZE/sizeof(unsigned long); i++)
		((unsigned long *)dst)[i] = ((unsigned long *)src)[i]
				& ((unsigned long *)mask)[i];


Yes - something like this should work.  Thanks.

- Sean


RE: [PATCH 2/5] [RFC] Infiniband: connection abstraction

2006-01-17 Thread Sean Hefty
>> +static void cm_mask_compare_data(u8 *dst, u8 *src, u8 *mask)
>
>static void cm_mask_compare_data(u8 *dst, const u8 *src, u8 *mask)
>
>but I would rename it to cm_mask_copy since it doesn't really do a compare.

I'll change this.  The function is masking the "data to use in the comparison",
but I can see the confusion.

>> +static int cm_compare_data(struct ib_cm_private_data_compare *src_data,
>> +   struct ib_cm_private_data_compare *dst_data)
>
>static int cm_compare_data(const struct ib_cm_private_data_compare *src,
>  const struct ib_cm_private_data_compare *dst)
>Your data type names are getting too long 

I'll fix.

Thanks for the comments.

- Sean
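
To illustrate what "masking the data to use in the comparison" means in the exchange above, here is a small userspace sketch of matching a request's private data against a listen's data/mask pair.  It follows the shape of the functions quoted in this thread, but the names (struct compare_data, mask_copy, compare_private_data) and the 64-byte COMPARE_SIZE are hypothetical stand-ins, not the kernel definitions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define COMPARE_SIZE 64	/* assumed size, for illustration only */

/* Hypothetical stand-in for the per-listen compare structure. */
struct compare_data {
	uint8_t data[COMPARE_SIZE];	/* expected bytes (already masked) */
	uint8_t mask[COMPARE_SIZE];	/* which bits take part in the match */
};

/* Copy src into dst, keeping only the bits selected by mask. */
static void mask_copy(uint8_t *dst, const uint8_t *src, const uint8_t *mask)
{
	size_t i;

	for (i = 0; i < COMPARE_SIZE; i++)
		dst[i] = src[i] & mask[i];
}

/*
 * Returns 0 when the request's private data matches the listen.
 * A listen without compare data (NULL) matches every request.
 */
static int compare_private_data(const uint8_t *private_data,
				const struct compare_data *listen)
{
	uint8_t masked[COMPARE_SIZE];

	if (!listen)
		return 0;

	mask_copy(masked, private_data, listen->mask);
	return memcmp(masked, listen->data, COMPARE_SIZE);
}
```

For example, a listen with mask[0] = 0xF0 and data[0] = 0xA0 (all other bytes zero) matches any request whose first byte has 0xA in its high nibble, whatever the remaining 63 bytes contain.  Note that the sketch only matches if the listen's data is itself zero outside the mask.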



[PATCH 5/5] [RFC] Infiniband: connection abstraction

2006-01-17 Thread Sean Hefty
This patch adds the kernel component to support the userspace Infiniband/RDMA
connection agent library.

Signed-off-by: Sean Hefty <[EMAIL PROTECTED]>

---

diff -uprN -X linux-2.6.git/Documentation/dontdiff 
linux-2.6.git/drivers/infiniband/core/Makefile 
linux-2.6.ib/drivers/infiniband/core/Makefile
--- linux-2.6.git/drivers/infiniband/core/Makefile  2006-01-16 
16:58:58.0 -0800
+++ linux-2.6.ib/drivers/infiniband/core/Makefile   2006-01-16 
16:55:25.0 -0800
@@ -1,5 +1,5 @@
 obj-$(CONFIG_INFINIBAND) +=ib_core.o ib_mad.o ib_sa.o \
-   ib_cm.o ib_addr.o rdma_cm.o
+   ib_cm.o ib_addr.o rdma_cm.o rdma_ucm.o
 obj-$(CONFIG_INFINIBAND_USER_MAD) +=   ib_umad.o
 obj-$(CONFIG_INFINIBAND_USER_ACCESS) +=ib_uverbs.o ib_ucm.o
 
@@ -14,6 +14,8 @@ ib_cm-y :=cm.o
 
 rdma_cm-y :=   cma.o
 
+rdma_ucm-y :=  ucma.o
+
 ib_addr-y :=   addr.o
 
 ib_umad-y :=   user_mad.o
diff -uprN -X linux-2.6.git/Documentation/dontdiff 
linux-2.6.git/drivers/infiniband/core/ucma.c 
linux-2.6.ib/drivers/infiniband/core/ucma.c
--- linux-2.6.git/drivers/infiniband/core/ucma.c1969-12-31 
16:00:00.0 -0800
+++ linux-2.6.ib/drivers/infiniband/core/ucma.c 2006-01-16 16:54:31.0 
-0800
@@ -0,0 +1,788 @@
+/*
+ * Copyright (c) 2005 Intel Corporation.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *  - Redistributions of source code must retain the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer.
+ *
+ *  - Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer in the documentation and/or other materials
+ * provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+
+MODULE_AUTHOR("Sean Hefty");
+MODULE_DESCRIPTION("RDMA Userspace Connection Manager Access");
+MODULE_LICENSE("Dual BSD/GPL");
+
+enum {
+   UCMA_MAX_BACKLOG    = 128
+};
+
+struct ucma_file {
+   struct semaphore    mutex;
+   struct file         *filp;
+   struct list_head    ctxs;
+   struct list_head    events;
+   wait_queue_head_t   poll_wait;
+};
+
+struct ucma_context {
+   int                 id;
+   wait_queue_head_t   wait;
+   atomic_t            ref;
+   int                 events_reported;
+   int                 backlog;
+
+   struct ucma_file    *file;
+   struct rdma_cm_id   *cm_id;
+   __u64               uid;
+
+   struct list_head    events;    /* list of pending events. */
+   struct list_head    file_list; /* member in file ctx list */
+};
+
+struct ucma_event {
+   struct ucma_context *ctx;
+   struct list_head    file_list; /* member in file event list */
+   struct list_head    ctx_list;  /* member in ctx event list */
+   struct rdma_cm_id   *cm_id;
+   struct rdma_ucm_event_resp resp;
+};
+
+static DECLARE_MUTEX(ctx_mutex);
+static DEFINE_IDR(ctx_idr);
+
+static struct ucma_context* ucma_get_ctx(struct ucma_file *file, int id)
+{
+   struct ucma_context *ctx;
+
+   down(&ctx_mutex);
+   ctx = idr_find(&ctx_idr, id);
+   if (!ctx)
+       ctx = ERR_PTR(-ENOENT);
+   else if (ctx->file != file)
+       ctx = ERR_PTR(-EINVAL);
+   else
+       atomic_inc(&ctx->ref);
+   up(&ctx_mutex);
+
+   return ctx;
+}
+
+static void ucma_put_ctx(struct ucma_context *ctx)
+{
+   if (atomic_dec_and_test(&ctx->ref))
+       wake_up(&ctx->wait);
+}
+
+static void ucma_cleanup_events(struct ucma_context *ctx)
+{
+   struct ucma_event *uevent;
+
+   down(&ctx->file->mutex);
+   lis

[PATCH 4/5] [RFC] Infiniband: connection abstraction

2006-01-17 Thread Sean Hefty
The following patch implements a kernel mode connection management agent
over Infiniband that connects based on IP addresses.

The agent defines a generic RDMA connection abstraction to support clients
wanting to connect over different RDMA devices.

It also handles RDMA device hotplug events on behalf of clients.

Signed-off-by: Sean Hefty <[EMAIL PROTECTED]>

---

diff -uprN -X linux-2.6.git/Documentation/dontdiff 
linux-2.6.git/drivers/infiniband/core/cma.c 
linux-2.6.ib/drivers/infiniband/core/cma.c
--- linux-2.6.git/drivers/infiniband/core/cma.c 1969-12-31 16:00:00.0 
-0800
+++ linux-2.6.ib/drivers/infiniband/core/cma.c  2006-01-16 16:17:34.0 
-0800
@@ -0,0 +1,1639 @@
+/*
+ * Copyright (c) 2005 Voltaire Inc.  All rights reserved.
+ * Copyright (c) 2002-2005, Network Appliance, Inc. All rights reserved.
+ * Copyright (c) 1999-2005, Mellanox Technologies, Inc. All rights reserved.
+ * Copyright (c) 2005 Intel Corporation.  All rights reserved.
+ *
+ * This Software is licensed under one of the following licenses:
+ *
+ * 1) under the terms of the "Common Public License 1.0" a copy of which is
+ *available from the Open Source Initiative, see
+ *http://www.opensource.org/licenses/cpl.php.
+ *
+ * 2) under the terms of the "The BSD License" a copy of which is
+ *available from the Open Source Initiative, see
+ *http://www.opensource.org/licenses/bsd-license.php.
+ *
+ * 3) under the terms of the "GNU General Public License (GPL) Version 2" a
+ *copy of which is available from the Open Source Initiative, see
+ *http://www.opensource.org/licenses/gpl-license.php.
+ *
+ * Licensee has the right to choose one of the above licenses.
+ *
+ * Redistributions of source code must retain the above copyright
+ * notice and one of the license notices.
+ *
+ * Redistributions in binary form must reproduce both the above copyright
+ * notice, one of the license notices in the documentation
+ * and/or other materials provided with the distribution.
+ *
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+MODULE_AUTHOR("Guy German");
+MODULE_DESCRIPTION("Generic RDMA CM Agent");
+MODULE_LICENSE("Dual BSD/GPL");
+
+#define CMA_CM_RESPONSE_TIMEOUT 20
+#define CMA_MAX_CM_RETRIES 3
+
+static void cma_add_one(struct ib_device *device);
+static void cma_remove_one(struct ib_device *device);
+
+static struct ib_client cma_client = {
+   .name   = "cma",
+   .add= cma_add_one,
+   .remove = cma_remove_one
+};
+
+static LIST_HEAD(dev_list);
+static LIST_HEAD(listen_any_list);
+static DECLARE_MUTEX(mutex);
+
+struct cma_device {
+   struct list_head    list;
+   struct ib_device    *device;
+   __be64              node_guid;
+   wait_queue_head_t   wait;
+   atomic_t            refcount;
+   struct list_head    id_list;
+};
+
+enum cma_state {
+   CMA_IDLE,
+   CMA_ADDR_QUERY,
+   CMA_ADDR_RESOLVED,
+   CMA_ROUTE_QUERY,
+   CMA_ROUTE_RESOLVED,
+   CMA_CONNECT,
+   CMA_ADDR_BOUND,
+   CMA_LISTEN,
+   CMA_DEVICE_REMOVAL,
+   CMA_DESTROYING
+};
+
+/*
+ * Device removal can occur at anytime, so we need extra handling to
+ * serialize notifying the user of device removal with other callbacks.
+ * We do this by disabling removal notification while a callback is in process,
+ * and reporting it after the callback completes.
+ */
+struct rdma_id_private {
+   struct rdma_cm_id   id;
+
+   struct list_head    list;
+   struct list_head    listen_list;
+   struct cma_device   *cma_dev;
+
+   enum cma_state      state;
+   spinlock_t          lock;
+   wait_queue_head_t   wait;
+   atomic_t            refcount;
+   wait_queue_head_t   wait_remove;
+   atomic_t            dev_remove;
+
+   int                 backlog;
+   int                 timeout_ms;
+   struct ib_sa_query  *query;
+   int                 query_id;
+   struct ib_cm_id     *cm_id;
+
+   u32                 seq_num;
+   u32                 qp_num;
+   enum ib_qp_type     qp_type;
+   u8                  srq;
+};
+
+struct cma_work {
+   struct work_struct  work;
+   struct rdma_id_private  *id;
+};
+
+union cma_ip_addr {
+   struct in6_addr ip6;
+   struct {
+   __u32 pad[3];
+   __u32 addr;
+   } ip4;
+};
+
+struct cma_hdr {
+   u8 cma_version;
+   u8 ip_version;  /* IP version: 7:4 */
+   __u16 port;
+   union cma_ip_addr src_addr;
+   union cma_ip_addr dst_addr;
+};
+
+struct sdp_hh {
+   u8 sdp_version;
+   u8 ip_version;  /* IP version: 7:4 */
+   u8 sdp_specific1[10];
+   __u16 port;
+   __u16 sdp_specific2;
+   union cma_ip_addr src_addr;
+   union cma_ip_addr dst_addr;
+};

[PATCH 3/5] [RFC] Infiniband: connection abstraction

2006-01-17 Thread Sean Hefty
The following provides an address translation service that maps IP addresses
to Infiniband addresses (GIDs) using IPoIB.

Signed-off-by: Sean Hefty <[EMAIL PROTECTED]>

---

diff -uprN -X linux-2.6.git/Documentation/dontdiff 
linux-2.6.git/drivers/infiniband/core/addr.c 
linux-2.6.ib/drivers/infiniband/core/addr.c
--- linux-2.6.git/drivers/infiniband/core/addr.c1969-12-31 
16:00:00.0 -0800
+++ linux-2.6.ib/drivers/infiniband/core/addr.c 2006-01-16 16:14:24.0 
-0800
@@ -0,0 +1,356 @@
+/*
+ * Copyright (c) 2005 Voltaire Inc.  All rights reserved.
+ * Copyright (c) 2002-2005, Network Appliance, Inc. All rights reserved.
+ * Copyright (c) 1999-2005, Mellanox Technologies, Inc. All rights reserved.
+ * Copyright (c) 2005 Intel Corporation.  All rights reserved.
+ *
+ * This Software is licensed under one of the following licenses:
+ *
+ * 1) under the terms of the "Common Public License 1.0" a copy of which is
+ *available from the Open Source Initiative, see
+ *http://www.opensource.org/licenses/cpl.php.
+ *
+ * 2) under the terms of the "The BSD License" a copy of which is
+ *available from the Open Source Initiative, see
+ *http://www.opensource.org/licenses/bsd-license.php.
+ *
+ * 3) under the terms of the "GNU General Public License (GPL) Version 2" a
+ *copy of which is available from the Open Source Initiative, see
+ *http://www.opensource.org/licenses/gpl-license.php.
+ *
+ * Licensee has the right to choose one of the above licenses.
+ *
+ * Redistributions of source code must retain the above copyright
+ * notice and one of the license notices.
+ *
+ * Redistributions in binary form must reproduce both the above copyright
+ * notice, one of the license notices in the documentation
+ * and/or other materials provided with the distribution.
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+MODULE_AUTHOR("Sean Hefty");
+MODULE_DESCRIPTION("IB Address Translation");
+MODULE_LICENSE("Dual BSD/GPL");
+
+struct addr_req {
+   struct list_head list;
+   struct sockaddr src_addr;
+   struct sockaddr dst_addr;
+   struct rdma_dev_addr *addr;
+   void *context;
+   void (*callback)(int status, struct sockaddr *src_addr,
+struct rdma_dev_addr *addr, void *context);
+   unsigned long timeout;
+   int status;
+};
+
+static void process_req(void *data);
+
+static DECLARE_MUTEX(mutex);
+static LIST_HEAD(req_list);
+static DECLARE_WORK(work, process_req, NULL);
+struct workqueue_struct *rdma_wq;
+EXPORT_SYMBOL(rdma_wq);
+
+static int copy_addr(struct rdma_dev_addr *dev_addr, struct net_device *dev,
+                     unsigned char *dst_dev_addr)
+{
+   switch (dev->type) {
+   case ARPHRD_INFINIBAND:
+       dev_addr->dev_type = IB_NODE_CA;
+       break;
+   default:
+       return -EADDRNOTAVAIL;
+   }
+
+   memcpy(dev_addr->src_dev_addr, dev->dev_addr, MAX_ADDR_LEN);
+   memcpy(dev_addr->broadcast, dev->broadcast, MAX_ADDR_LEN);
+   if (dst_dev_addr)
+       memcpy(dev_addr->dst_dev_addr, dst_dev_addr, MAX_ADDR_LEN);
+   return 0;
+}
+
+int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr)
+{
+   struct net_device *dev;
+   u32 ip = ((struct sockaddr_in *) addr)->sin_addr.s_addr;
+   int ret;
+
+   dev = ip_dev_find(ip);
+   if (!dev)
+       return -EADDRNOTAVAIL;
+
+   ret = copy_addr(dev_addr, dev, NULL);
+   dev_put(dev);
+   return ret;
+}
+EXPORT_SYMBOL(rdma_translate_ip);
+
+static void set_timeout(unsigned long time)
+{
+   unsigned long delay;
+
+   cancel_delayed_work(&work);
+
+   delay = time - jiffies;
+   if ((long)delay <= 0)
+       delay = 1;
+
+   queue_delayed_work(rdma_wq, &work, delay);
+}
+
+static void queue_req(struct addr_req *req)
+{
+   struct addr_req *temp_req;
+
+   down(&mutex);
+   list_for_each_entry_reverse(temp_req, &req_list, list) {
+       if (time_after(req->timeout, temp_req->timeout))
+           break;
+   }
+
+   list_add(&req->list, &temp_req->list);
+
+   if (req_list.next == &req->list)
+       set_timeout(req->timeout);
+   up(&mutex);
+}
+
+static void addr_send_arp(struct sockaddr_in *dst_in)
+{
+   struct rtable *rt;
+   struct flowi fl;
+   u32 dst_ip = dst_in->sin_addr.s_addr;
+
+   memset(&fl, 0, sizeof fl);
+   fl.nl_u.ip4_u.daddr = dst_ip;
+   if (ip_route_output_key(&rt, &fl))
+       return;
+
+   arp_send(ARPOP_REQUEST, ETH_P_ARP, rt->rt_gateway, rt->idev->dev,
+            rt->rt_src, NULL, rt->idev->dev->dev_addr, NULL);
+   ip_rt_put(rt);
+}
+
+static int addr_resolve_remote(struct sockadd

RE: [PATCH 2/5] [RFC] Infiniband: connection abstraction

2006-01-17 Thread Sean Hefty
The following patch extends matching connection requests to listens in the
Infiniband CM to include private data.

Signed-off-by: Sean Hefty <[EMAIL PROTECTED]>

---

diff -uprN -X linux-2.6.git/Documentation/dontdiff 
linux-2.6.git/drivers/infiniband/core/cm.c 
linux-2.6.ib/drivers/infiniband/core/cm.c
--- linux-2.6.git/drivers/infiniband/core/cm.c  2006-01-16 10:25:26.0 
-0800
+++ linux-2.6.ib/drivers/infiniband/core/cm.c   2006-01-16 16:03:35.0 
-0800
@@ -32,7 +32,7 @@
  * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
  * SOFTWARE.
  *
- * $Id: cm.c 2821 2005-07-08 17:07:28Z sean.hefty $
+ * $Id: cm.c 4311 2005-12-05 18:42:01Z sean.hefty $
  */
 #include 
 #include 
@@ -130,6 +130,7 @@ struct cm_id_private {
/* todo: use alternate port on send failure */
struct cm_av av;
struct cm_av alt_av;
+   struct ib_cm_private_data_compare *compare_data;
 
void *private_data;
__be64 tid;
@@ -355,6 +356,40 @@ static struct cm_id_private * cm_acquire
return cm_id_priv;
 }
 
+static void cm_mask_compare_data(u8 *dst, u8 *src, u8 *mask)
+{
+   int i;
+
+   for (i = 0; i < IB_CM_PRIVATE_DATA_COMPARE_SIZE; i++)
+       dst[i] = src[i] & mask[i];
+}
+
+static int cm_compare_data(struct ib_cm_private_data_compare *src_data,
+                           struct ib_cm_private_data_compare *dst_data)
+{
+   u8 src[IB_CM_PRIVATE_DATA_COMPARE_SIZE];
+   u8 dst[IB_CM_PRIVATE_DATA_COMPARE_SIZE];
+
+   if (!src_data || !dst_data)
+       return 0;
+
+   cm_mask_compare_data(src, src_data->data, dst_data->mask);
+   cm_mask_compare_data(dst, dst_data->data, src_data->mask);
+   return memcmp(src, dst, IB_CM_PRIVATE_DATA_COMPARE_SIZE);
+}
+
+static int cm_compare_private_data(u8 *private_data,
+                                   struct ib_cm_private_data_compare *dst_data)
+{
+   u8 src[IB_CM_PRIVATE_DATA_COMPARE_SIZE];
+
+   if (!dst_data)
+       return 0;
+
+   cm_mask_compare_data(src, private_data, dst_data->mask);
+   return memcmp(src, dst_data->data, IB_CM_PRIVATE_DATA_COMPARE_SIZE);
+}
+
 static struct cm_id_private * cm_insert_listen(struct cm_id_private 
*cm_id_priv)
 {
struct rb_node **link = &cm.listen_service_table.rb_node;
@@ -362,14 +397,18 @@ static struct cm_id_private * cm_insert_
struct cm_id_private *cur_cm_id_priv;
__be64 service_id = cm_id_priv->id.service_id;
__be64 service_mask = cm_id_priv->id.service_mask;
+   int data_cmp;
 
while (*link) {
parent = *link;
cur_cm_id_priv = rb_entry(parent, struct cm_id_private,
  service_node);
+   data_cmp = cm_compare_data(cm_id_priv->compare_data,
+  cur_cm_id_priv->compare_data);
if ((cur_cm_id_priv->id.service_mask & service_id) ==
(service_mask & cur_cm_id_priv->id.service_id) &&
-   (cm_id_priv->id.device == cur_cm_id_priv->id.device))
+   (cm_id_priv->id.device == cur_cm_id_priv->id.device) &&
+   !data_cmp)
return cur_cm_id_priv;
 
if (cm_id_priv->id.device < cur_cm_id_priv->id.device)
@@ -378,6 +417,10 @@ static struct cm_id_private * cm_insert_
link = &(*link)->rb_right;
else if (service_id < cur_cm_id_priv->id.service_id)
link = &(*link)->rb_left;
+   else if (service_id > cur_cm_id_priv->id.service_id)
+   link = &(*link)->rb_right;
+   else if (data_cmp < 0)
+   link = &(*link)->rb_left;
else
link = &(*link)->rb_right;
}
@@ -387,16 +430,20 @@ static struct cm_id_private * cm_insert_
 }
 
 static struct cm_id_private * cm_find_listen(struct ib_device *device,
-__be64 service_id)
+__be64 service_id,
+u8 *private_data)
 {
struct rb_node *node = cm.listen_service_table.rb_node;
struct cm_id_private *cm_id_priv;
+   int data_cmp;
 
while (node) {
cm_id_priv = rb_entry(node, struct cm_id_private, service_node);
+   data_cmp = cm_compare_private_data(private_data,
+  cm_id_priv->compare_data);
if ((cm_id_priv->id.service_mask & service_id) ==
 cm_id_priv->id.service_id &&
-   (cm_id_priv->id.device == device))
+   (cm_id_priv->id.device == devic

[PATCH 1/5] [RFC] Infiniband: connection abstraction

2006-01-17 Thread Sean Hefty
The following patch provides common handling for marshalling data between
userspace clients and kernel mode Infiniband drivers.

Signed-off-by: Sean Hefty <[EMAIL PROTECTED]>

---

diff -uprN -X linux-2.6.git/Documentation/dontdiff 
linux-2.6.git/drivers/infiniband/core/Makefile 
linux-2.6.ib/drivers/infiniband/core/Makefile
--- linux-2.6.git/drivers/infiniband/core/Makefile  2006-01-16 
10:25:27.0 -0800
+++ linux-2.6.ib/drivers/infiniband/core/Makefile   2006-01-16 
15:34:15.0 -0800
@@ -16,4 +16,5 @@ ib_umad-y :=  user_mad.o
 
 ib_ucm-y :=ucm.o
 
-ib_uverbs-y := uverbs_main.o uverbs_cmd.o uverbs_mem.o
+ib_uverbs-y := uverbs_main.o uverbs_cmd.o uverbs_mem.o \
+   uverbs_marshall.o
diff -uprN -X linux-2.6.git/Documentation/dontdiff 
linux-2.6.git/drivers/infiniband/core/ucm.c 
linux-2.6.ib/drivers/infiniband/core/ucm.c
--- linux-2.6.git/drivers/infiniband/core/ucm.c 2006-01-16 10:25:26.0 
-0800
+++ linux-2.6.ib/drivers/infiniband/core/ucm.c  2006-01-16 15:34:15.0 
-0800
@@ -30,7 +30,7 @@
  * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
  * SOFTWARE.
  *
- * $Id: ucm.c 2594 2005-06-13 19:46:02Z libor $
+ * $Id: ucm.c 4311 2005-12-05 18:42:01Z sean.hefty $
  */
 #include 
 #include 
@@ -48,6 +48,7 @@
 
 #include 
 #include 
+#include 
 
 MODULE_AUTHOR("Libor Michalek");
 MODULE_DESCRIPTION("InfiniBand userspace Connection Manager access");
@@ -203,36 +204,6 @@ error:
return NULL;
 }
 
-static void ib_ucm_event_path_get(struct ib_ucm_path_rec *upath,
- struct ib_sa_path_rec  *kpath)
-{
-   if (!kpath || !upath)
-   return;
-
-   memcpy(upath->dgid, kpath->dgid.raw, sizeof *upath->dgid);
-   memcpy(upath->sgid, kpath->sgid.raw, sizeof *upath->sgid);
-
-   upath->dlid = kpath->dlid;
-   upath->slid = kpath->slid;
-   upath->raw_traffic  = kpath->raw_traffic;
-   upath->flow_label   = kpath->flow_label;
-   upath->hop_limit= kpath->hop_limit;
-   upath->traffic_class= kpath->traffic_class;
-   upath->reversible   = kpath->reversible;
-   upath->numb_path= kpath->numb_path;
-   upath->pkey = kpath->pkey;
-   upath->sl   = kpath->sl;
-   upath->mtu_selector = kpath->mtu_selector;
-   upath->mtu  = kpath->mtu;
-   upath->rate_selector= kpath->rate_selector;
-   upath->rate = kpath->rate;
-   upath->packet_life_time = kpath->packet_life_time;
-   upath->preference   = kpath->preference;
-
-   upath->packet_life_time_selector =
-   kpath->packet_life_time_selector;
-}
-
 static void ib_ucm_event_req_get(struct ib_ucm_req_event_resp *ureq,
 struct ib_cm_req_event_param *kreq)
 {
@@ -251,8 +222,10 @@ static void ib_ucm_event_req_get(struct 
ureq->srq= kreq->srq;
ureq->port   = kreq->port;
 
-   ib_ucm_event_path_get(&ureq->primary_path, kreq->primary_path);
-   ib_ucm_event_path_get(&ureq->alternate_path, kreq->alternate_path);
+   ib_copy_path_rec_to_user(&ureq->primary_path, kreq->primary_path);
+   if (kreq->alternate_path)
+   ib_copy_path_rec_to_user(&ureq->alternate_path,
+kreq->alternate_path);
 }
 
 static void ib_ucm_event_rep_get(struct ib_ucm_rep_event_resp *urep,
@@ -322,8 +295,8 @@ static int ib_ucm_event_process(struct i
info  = evt->param.rej_rcvd.ari;
break;
case IB_CM_LAP_RECEIVED:
-   ib_ucm_event_path_get(&uvt->resp.u.lap_resp.path,
- evt->param.lap_rcvd.alternate_path);
+   ib_copy_path_rec_to_user(&uvt->resp.u.lap_resp.path,
+evt->param.lap_rcvd.alternate_path);
uvt->data_len = IB_CM_LAP_PRIVATE_DATA_SIZE;
uvt->resp.present = IB_UCM_PRES_ALTERNATE;
break;
@@ -635,65 +608,11 @@ static ssize_t ib_ucm_attr_id(struct ib_
return result;
 }
 
-static void ib_ucm_copy_ah_attr(struct ib_ucm_ah_attr *dest_attr,
-   struct ib_ah_attr *src_attr)
-{
-   memcpy(dest_attr->grh_dgid, src_attr->grh.dgid.raw,
-  sizeof src_attr->grh.dgid);
-   dest_attr->grh_flow_label = src_attr->grh.flow_label;
-   dest_attr->grh_sgid_index = src_attr->grh.sgid_index;
-   dest_attr->grh_hop_limit = src_attr->grh.hop_limit;
-   dest_attr->grh_

[PATCH 0/5] [RFC] Infiniband: connection abstraction

2006-01-17 Thread Sean Hefty
The following set of patches defines a connection abstraction for InfiniBand and
other RDMA devices.  It serves several purposes:

* It implements a connection protocol over InfiniBand based on IP addressing.
This greatly simplifies connection establishment for clients using InfiniBand.

* It defines a connection abstraction that works over multiple RDMA devices.
The submitted implementation targets InfiniBand, but has been tested over other
RDMA devices as well.

* It handles RDMA device insertion and removal on behalf of its clients.

The changes have been broken into 5 separate patches.  The basic purpose of each
patch is:

1. Provide common handling for marshalling data between userspace clients and
kernel-mode InfiniBand drivers.

2. Extend the InfiniBand CM to include private data comparisons as part of its
connection request matching process.

3. Provide an address translation service that maps IP addresses to InfiniBand
addresses (GIDs).  This patch touches code outside of the InfiniBand core, so
I'm including the netdev mailing list.

4. Implement the kernel-mode RDMA connection management agent.

5. Implement the kernel support module for the userspace RDMA connection
management agent.
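
To illustrate what patch 2 means by private data comparison, here is a minimal
standalone sketch of masked matching against a connection request's private
data.  It is not taken from the patch: the names pdata_filter, pdata_match, and
PDATA_CMP_SIZE are hypothetical, and the use of a per-listener data/mask pair is
an assumption about how such a comparison could be expressed.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical size of the compared prefix; not a value from the patch. */
#define PDATA_CMP_SIZE 8

/* Hypothetical per-listener filter (illustrative layout only). */
struct pdata_filter {
	uint8_t data[PDATA_CMP_SIZE];	/* expected private data bytes */
	uint8_t mask[PDATA_CMP_SIZE];	/* set bits select what must match */
};

/*
 * Return 1 if the first PDATA_CMP_SIZE bytes of pdata equal the filter's
 * data in every bit position selected by the mask, 0 otherwise.
 * Private data shorter than the compared prefix never matches.
 */
static int pdata_match(const struct pdata_filter *f,
		       const uint8_t *pdata, size_t len)
{
	size_t i;

	if (len < PDATA_CMP_SIZE)
		return 0;

	for (i = 0; i < PDATA_CMP_SIZE; i++) {
		/* XOR exposes differing bits; the mask keeps the ones we care about. */
		if ((pdata[i] ^ f->data[i]) & f->mask[i])
			return 0;
	}
	return 1;
}
```

Under this assumption, a listener would supply the data/mask pair when it
registers, and an incoming request would be handed to the first listener whose
service ID matches and for which pdata_match() returns 1.  An all-zero mask
accepts any private data, which corresponds to matching on service ID alone.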

Please copy the openib-general mailing list on any replies.

Thanks,
Sean
