Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.
The hack to use a socket and bind it to claim the port was just for demostrating the idea. The correct solution, IMO, is to enhance the core low level 4-tuple allocation services to be more generic (eg: not be tied to a struct sock). Then the host tcp stack and the host rdma stack can allocate TCP/iWARP ports/4tuples from this common exported service and share the port space. This allocation service could also be used by other deep adapters like iscsi adapters if needed. Since iWarp runs on top of TCP, the port space is really the same. FWIW, I agree that this proposal is the correct solution to support iWarp. - Sean - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24
>Umm... this is a difficult situation for me to merge the changes then. >We're changing the CM retry behavior blind here. How do we know that >the MRA changes don't make the scalability issue worse? What's currently upstream doesn't work for Intel MPI on our larger clusters. The connection requests time out on the active side before the passive side can respond. The OFED release works because it provides a kernel patch to make the timeout a module parameter. I'm trying to avoid adding a module parameter, and the MRA is designed for this situation. I tested this by simulating a slow passive side responder, and it worked as expected for those tests. Using an MRA does add another MAD to the CM exchange, which is why it is sent only after seeing a duplicate request. Alternatively, we can take the OFED module parameter patch. - Sean - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ofa-general] [PATCH v3] iw_cxgb3: Support"iwarp-only"interfacesto avoid 4-tuple conflicts.
Kanevsky, Arkady wrote: Exactly, it forces the burden on administrator. And one will be forced to try one mount for iWARP and it does not work issue another one TCP or UDP if it fails. Yack! And server will need to listen on different IP address and simple * will not work since it will need to listen in two different domains. The server already has to call listen twice. Once for the rdma_cm and once for sockets. Similarly on the client side, connect must be made over rdma_cm or sockets. I really don't see any impact on the application for this approach. We just end up separating the port space based on networking addresses, rather than keeping the problem at the transport level. If you have an alternate approach that will be accepted upstream, feel free to post it. - Sean - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [ofa-general] [PATCH v3] iw_cxgb3: Support "iwarp-only" interfaces to avoid 4-tuple conflicts.
>It is ok to block while holding a mutex, yes? It's okay, I just didn't try to trace through the code to see if it ever tries to acquire the same mutex in the thread that needs to signal the event. - Sean - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [ofa-general] [PATCH v3] iw_cxgb3: Support "iwarp-only"interfacesto avoid 4-tuple conflicts.
>What is the model on how client connects, say for iSCSI, >when client and server both support, iWARP and 10GbE or 1GbE, >and would like to setup "most" performant "connection" for ULP? For the "most" performance connection, the ULP would use IB, and all these problems go away. :) This proposal is for each iwarp interface to have its own IP address. Clients would need an iwarp usable address of the server and would connect using rdma_connect(). If that call (or rdma_resolve_addr/route) fails, the client could try connecting using sockets, aoi, or some other interface. I don't see that Steve's proposal changes anything from the client's perspective. - Sean - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ofa-general] [PATCH v3] iw_cxgb3: Support "iwarp-only" interfaces to avoid 4-tuple conflicts.
The sysadmin creates "for iwarp use only" alias interfaces of the form "devname:iw*" where devname is the native interface name (eg eth0) for the iwarp netdev device. The alias label can be anything starting with "iw". The "iw" immediately after the ':' is the key used by the iw_cxgb3 driver. I'm still not sure about this, but haven't come up with anything better myself. And if there's a good chance of other rnic's needing the same support, I'd rather see the common code separated out, even if just encapsulated within this module for easy re-use. As for the code, I have a couple of questions about whether deadlock and a race condition are possible, plus a few minor comments. +static void insert_ifa(struct iwch_dev *rnicp, struct in_ifaddr *ifa) +{ + struct iwch_addrlist *addr; + + addr = kmalloc(sizeof *addr, GFP_KERNEL); + if (!addr) { + printk(KERN_ERR MOD "%s - failed to alloc memory!\n", + __FUNCTION__); + return; + } + addr->ifa = ifa; + mutex_lock(&rnicp->mutex); + list_add_tail(&addr->entry, &rnicp->addrlist); + mutex_unlock(&rnicp->mutex); +} Should this return success/failure? +static int nb_callback(struct notifier_block *self, unsigned long event, + void *ctx) +{ + struct in_ifaddr *ifa = ctx; + struct iwch_dev *rnicp = container_of(self, struct iwch_dev, nb); + + PDBG("%s rnicp %p event %lx\n", __FUNCTION__, rnicp, event); + + switch (event) { + case NETDEV_UP: + if (netdev_is_ours(rnicp, ifa->ifa_dev->dev) && + is_iwarp_label(ifa->ifa_label)) { + PDBG("%s label %s addr 0x%x added\n", + __FUNCTION__, ifa->ifa_label, ifa->ifa_address); + insert_ifa(rnicp, ifa); + iwch_listeners_add_addr(rnicp, ifa->ifa_address); If insert_ifa() fails, what will iwch_listeners_add_addr() do? (I'm not easily seeing the relationship between the address list and the listen list at this point.) + } + break; + case NETDEV_DOWN: + if (netdev_is_ours(rnicp, ifa->ifa_dev->dev) && + is_iwarp_label(ifa->ifa_label)) { + PDBG("%s label %s addr 0x%x deleted\n", + __FUNCTION__, ifa->ifa_label, ifa->ifa_address); + iwch_listeners_del_addr(rnicp, ifa->ifa_address); + remove_ifa(rnicp, ifa); + } + break; + default: + break; + } + return 0; +} + +static void delete_addrlist(struct iwch_dev *rnicp) +{ + struct iwch_addrlist *addr, *tmp; + + mutex_lock(&rnicp->mutex); + list_for_each_entry_safe(addr, tmp, &rnicp->addrlist, entry) { + list_del(&addr->entry); + kfree(addr); + } + mutex_unlock(&rnicp->mutex); +} + +static void populate_addrlist(struct iwch_dev *rnicp) +{ + int i; + struct in_device *indev; + + for (i = 0; i < rnicp->rdev.port_info.nports; i++) { + indev = in_dev_get(rnicp->rdev.port_info.lldevs[i]); + if (!indev) + continue; + for_ifa(indev) + if (is_iwarp_label(ifa->ifa_label)) { + PDBG("%s label %s addr 0x%x added\n", +__FUNCTION__, ifa->ifa_label, +ifa->ifa_address); + insert_ifa(rnicp, ifa); + } + endfor_ifa(indev); + } +} + static void rnic_init(struct iwch_dev *rnicp) { PDBG("%s iwch_dev %p\n", __FUNCTION__, rnicp); @@ -70,6 +187,12 @@ static void rnic_init(struct iwch_dev *r idr_init(&rnicp->qpidr); idr_init(&rnicp->mmidr); spin_lock_init(&rnicp->lock); + INIT_LIST_HEAD(&rnicp->addrlist); + INIT_LIST_HEAD(&rnicp->listen_eps); + mutex_init(&rnicp->mutex); + rnicp->nb.notifier_call = nb_callback; + populate_addrlist(rnicp); + register_inetaddr_notifier(&rnicp->nb); rnicp->attr.vendor_id = 0x168; rnicp->attr.vendor_part_id = 7; @@ -148,6 +271,8 @@ static void close_rnic_dev(struct t3cdev mutex_lock(&dev_mutex); list_for_each_entry_safe(dev, tmp, &dev_list, entry) { if (dev->rdev.t3cdev_p == tdev) { + unregister_inetaddr_notifier(&dev->nb); + delete_addrlist(dev); list_del(&dev->entry); iwch_unregister_device(dev); cxio_rdev_close(&dev->rdev); diff --git a/drivers/infiniband/hw/cxgb3/iwch.h b/drivers/infiniband/hw/cxgb3/iwch.h index caf4e60..7fa0a47 100644 --- a/drivers/infiniband/hw/cxgb3/iwch.h +++ b/drivers/infiniband/hw/cxgb3/iwch.h @@ -36,
Re: [ofa-general] [PATCH] RDMA/CMA: Implement rdma_resolve_ip retry enhancement.
If an application is calling rdma_resolve_ip() and a status of -ENODATA is returned from addr_resolve_local/remote(), the timeout mechanism waits until the application's timeout occurs before rechecking the address resolution status; the application will wait until it's full timeout occurs. This case is seen when the work thread call to process_req() is made before the arp packet is processed. I don't understand the issue. process_req() is invoked whenever a network event occurs, which rechecks all pending requests. This patch is in addition to Steve Wise's neigh_event_send patch to initiate neighbour discovery sent on 9/12/2007. This patch looks unrelated to Steve's patch. Can you clarify the relationship? - Sean - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ofa-general] Re: [PATCH v2] iw_cxgb3: Support "iwarp-only" interfaces to avoid 4-tuple conflicts.
+addr = kmalloc(sizeof *addr, GFP_KERNEL); As a small nitpick: this wants to be sizeof(struct in_ifaddr) See chapter 14 of CodingStyle document. kmalloc(sizeof *addr... is correct. - Sean - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24
>OK -- just to make sure I'm understanding what you're saying: have you >confirmed that your proposed patches actually fix the issue? Not directly. I cannot easily test kernel patches on our larger, production clusters. We've seen the issue with specific applications on 512 and 1024 cores, but I've only been able to test the patch on a 48-core cluster. I have verified that it successfully increases the timeout to where it *should* work, but cannot absolutely confirm that it will fix the problem. I'm unlikely to know that until the production clusters move to an OFED release (1.3?) containing this patch. - Sean - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ofa-general] [PATCH v2] iw_cxgb3: Support "iwarp-only" interfaces to avoid 4-tuple conflicts.
The iWARP driver must translate all listens on address 0.0.0.0 to the set of rdma-only ip addresses for the device in question. This prevents incoming connect requests to the TCP ipaddresses from going up the rdma stack. I've only given this a high level review at this point, and while the patch looks okay on first pass, is there a way to move some of this functionality to either the rdma_cm or iw_cm? I don't like the idea of every iwarp driver having to implement address/listen list maintenance. I may have some ideas after re-examining it. Implementation Details: There are a couple of areas that I made a note to look at in more detail (because I didn't understand everything that was happening), but I did have one minor nit - most uses of list_del_init can just be list_del. - Sean - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [ofa-general] InfiniBand/RDMA merge plans for 2.6.24
> - My user_mad P_Key index support patch. I'll test the ioctl to > change to the new mode and merge this I guess, since Hal and Sean > have tested this out. I can give this patch a reviewed-by: too, and I will also try to review a couple of the pending ipoib patches. > - Sean's QoS changes. These look fine at first glance, and I just > plan to understand the backwards compatibility story (ie how this > works with an old SM) and merge. Anyone who objects let me know. The new QoS fields fall into fields that are currently reserved, which should be ignored by an older SM. I've only tested this against openSM however. > - Sean's IB CM MRA interface changes. Don't know at this point. It > seems OK but I'm not clear on what if any real-world improvement > this gives us. This patch was generated in response to an Intel MPI issue. We've seen MPI take several minutes to respond to a connection request during the middle of large application runs. When this happens, the active side times out the connection. In OFED, we added module parameters to adjust the rdma_cm connection timeout on the active side, but I believe that sending an MRA from the passive side is a better solution. - Sean - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH] RDMA/CMA: Use neigh_event_send() to initiate neighbour discovery.
>RDMA/CMA: Use neigh_event_send() to initiate neighbour discovery. > >Calling arp_send() to initiate neighbour discovery (ND) doesn't do the >full ND protocol. Namely, it doesn't handle retransmitting the arp >request if it is dropped. The function neigh_event_send() does all this. >Without doing full ND, rdma address resolution fails in the presence of >dropped arp bcast packets. > >Signed-off-by: Steve Wise <[EMAIL PROTECTED]> Acked-by: Sean Hefty <[EMAIL PROTECTED]> Roland - can you please queue this up for 2.6.24? - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP portsfrom the host TCP port space.
>Just be realistic and accept that RDMA is a point in time solution, >and like any other such technology takes flexibility away from users. All technologies are just point in time solutions. While management is important, shouldn't the customers decide how important it is relative to their problems? Whether some future technology will be better matters little if a problem needs to be solved today. >If you can't see that this is the future, you have my condolences. >Because frankly, the signs are all around that this is where things >are going. Adding a bazillion cores to a processor doesn't do a thing to help memory bandwidth. Millions of Infiniband ports are in operation today. Over 25% of the top 500 supercomputers use Infiniband. The formation of the OpenFabrics Alliance was pushed and has been continuously funded by an RDMA customer - the US National Labs. RDMA technologies are backed by Cisco, IBM, Intel, QLogic, Sun, Voltaire, Mellanox, NetApp, AMD, Dell, HP, Oracle, Unisys, Emulex, Hitachi, NEC, Fujitsu, LSI, SGI, Sandia, and at least two dozen other companies. IDC expects Infiniband adapter revenue to triple between 2006 and 2011, and switch revenue to increase six-fold (combined revenues of 1 billion). Customers see real benefits using channel based architectures. Do all customers need it? Of course not. Is it a niche? Yes, but I would say that about any 10+ gig network. That doesn't mean that it hasn't become essential for some customers. - Sean - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP portsfrom the host TCP port space.
>It's not about being a niche. It's about creating a maintainable >software net stack that has predictable behavior. > >Needing to reach out of the RDMA sandbox and reserve net stack resources >away from itself travels a path we've consistently avoided. We need to ensure that we're also creating a maintainable kernel. RDMA doesn't use sockets, but that doesn't mean it's not part of the networking support provided by the Linux kernel. Making blanket statements that RDMA should stay within a sandbox is equivalent to saying that RDMA should duplicate any network related functionality that it might need. >>> I will NACK any patch that opens up sockets to eat up ports or >>> anything stupid like that. > >Ditto for me as well. I agree that using a socket is the wrong approach, but my guess is that it was suggested as a possibility because of the attempt to keep RDMA in its 'sandbox'. The iWarp architecture implements RDMA over TCP; it just doesn't use sockets. The Linux network stack doesn't easily support this possibility. Are there any reasonable ways to enable this to the degree necessary for iWarp? - Sean - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.
How about we just remove the RDMA stack altogether? I am not at all kidding. If you guys can't stay in your sand box and need to cause problems for the normal network stack, it's unacceptable. We were told all along the if RDMA went into the tree none of this kind of stuff would be an issue. There are currently two RDMA solutions available. Each solution has different requirements and uses the normal network stack differently. Infiniband uses its own transport. iWarp runs over TCP. We have tried to leverage the existing infrastructure where it makes sense. After TCP port reservation, what's next? It seems an at least bi-monthly event that the RDMA folks need to put their fingers into something else in the normal networking stack. No more. Currently, the RDMA stack uses its own port space. This causes a problem for iWarp, and is what Steve is looking for a solution for. I'm not an iWarp guru, so I don't know what options exist. Can iWarp use its own address family? Identify specific IP addresses for iWarp use? Restrict iWarp to specific port numbers? Let the app control the correct operation? I don't know. Steve merely defined a problem and suggested a possible solution. He's looking for constructive help trying to solve the problem. - Sean - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.
Steve Wise wrote: Any more comments? Does anyone have ideas on how to reserve the port space without using a struct socket? - Sean - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [openib-general] [PATCH v4 2/2] iWARP Core Changes.
Steve Wise wrote: diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c index d294bbc..83f84ef 100644 --- a/drivers/infiniband/core/addr.c +++ b/drivers/infiniband/core/addr.c @@ -32,6 +32,7 @@ #include #include #include #include +#include File is included 3 lines up. diff --git a/drivers/infiniband/core/cache.c b/drivers/infiniband/core/cache.c index e05ca2c..061858c 100644 #include #include #include -#include /* INIT_WORK, schedule_work(), flush_scheduled_work() */ I'm guessing that the include isn't currently needed, since none of the other changes to the file should have removed its dependency. Should this be put into a separate patch? +static int iw_conn_req_handler(struct iw_cm_id *cm_id, + struct iw_cm_event *iw_event) +{ + struct rdma_cm_id *new_cm_id; + struct rdma_id_private *listen_id, *conn_id; + struct sockaddr_in *sin; + struct net_device *dev = NULL; + int ret; + + listen_id = cm_id->context; + atomic_inc(&listen_id->dev_remove); + if (!cma_comp(listen_id, CMA_LISTEN)) { + ret = -ECONNABORTED; + goto out; + } + + /* Create a new RDMA id for the new IW CM ID */ + new_cm_id = rdma_create_id(listen_id->id.event_handler, + listen_id->id.context, + RDMA_PS_TCP); + if (!new_cm_id) { + ret = -ENOMEM; + goto out; + } + conn_id = container_of(new_cm_id, struct rdma_id_private, id); + atomic_inc(&conn_id->dev_remove); This is not released in error cases. See below. + conn_id->state = CMA_CONNECT; + + dev = ip_dev_find(iw_event->local_addr.sin_addr.s_addr); + if (!dev) { + ret = -EADDRNOTAVAIL; cma_release_remove(conn_id); + rdma_destroy_id(new_cm_id); + goto out; + } + ret = rdma_copy_addr(&conn_id->id.route.addr.dev_addr, dev, NULL); + if (ret) { cma_release_remove(conn_id); + rdma_destroy_id(new_cm_id); + goto out; + } + + ret = cma_acquire_dev(conn_id); + if (ret) { cma_release_remove(conn_id); + rdma_destroy_id(new_cm_id); + goto out; + } + + conn_id->cm_id.iw = cm_id; + cm_id->context = conn_id; + cm_id->cm_handler = cma_iw_handler; + + sin = (struct sockaddr_in *) &new_cm_id->route.addr.src_addr; + *sin = iw_event->local_addr; + sin = (struct sockaddr_in *) &new_cm_id->route.addr.dst_addr; + *sin = iw_event->remote_addr; + + ret = cma_notify_user(conn_id, RDMA_CM_EVENT_CONNECT_REQUEST, 0, + iw_event->private_data, + iw_event->private_data_len); + if (ret) { + /* User wants to destroy the CM ID */ + conn_id->cm_id.iw = NULL; + cma_exch(conn_id, CMA_DESTROYING); + cma_release_remove(conn_id); + rdma_destroy_id(&conn_id->id); + } + +out: + if (!dev) + dev_put(dev); Shouldn't this be: if (dev)? + cma_release_remove(listen_id); + return ret; +} @@ -1357,8 +1552,8 @@ static int cma_resolve_loopback(struct r ib_addr_set_dgid(&id_priv->id.route.addr.dev_addr, &gid); if (cma_zero_addr(&id_priv->id.route.addr.src_addr)) { - src_in = (struct sockaddr_in *)&id_priv->id.route.addr.src_addr; - dst_in = (struct sockaddr_in *)&id_priv->id.route.addr.dst_addr; + src_in = (struct sockaddr_in *) &id_priv->id.route.addr.src_addr; + dst_in = (struct sockaddr_in *) &id_priv->id.route.addr.dst_addr; trivial spacing change only +static inline void iw_addr_get_sgid(struct rdma_dev_addr* rda, +union ib_gid *gid) +{ + memcpy(gid, rda->src_dev_addr, sizeof *gid); +} + +static inline union ib_gid* iw_addr_get_dgid(struct rdma_dev_addr* rda) +{ + return (union ib_gid *) rda->dst_dev_addr; +} Minor personal nit: for consistency with the rest of the file, can you use dev_addr in place of rda? - Sean - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [openib-general] [PATCH v2 1/2] iWARP Connection Manager.
>> Er...no. It will lose this event. Depending on the event...the carnage >> varies. We'll take a look at this. >> > >This behavior is consistent with the Infiniband CM (see >drivers/infiniband/core/cm.c function cm_recv_handler()). But I think >we should at least log an error because a lost event will usually stall >the rdma connection. I believe that there's a difference here. For the Infiniband CM, an allocation error behaves the same as if the received MAD were lost or dropped. Since MADs are unreliable anyway, it's not so much that an IB CM event gets lost, as it doesn't ever occur. A remote CM should retry the send, which hopefully allows the connection to make forward progress. - Sean - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] iWARP Connection Manager.
Steve Wise wrote: +int iw_cm_disconnect(struct iw_cm_id *cm_id, int abrupt) +{ + struct iwcm_id_private *cm_id_priv; + unsigned long flags; + int ret = 0; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + /* Wait if we're currently in a connect or accept downcall */ + wait_event(cm_id_priv->connect_wait, + !test_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags)); Am I understanding this check correctly? You're checking to see if the user has called iw_cm_disconnect() at the same time that they called iw_cm_connect() or iw_cm_accept(). Are connect / accept blocking, or are you just waiting for an event? The CM must wait for the low level provider to finish a connect() or accept() operation before telling the low level provider to disconnect via modifying the iwarp QP. Regardless of whether they block, this disconnect can happen concurrently with the connect/accept so we need to hold the disconnect until the connect/accept completes. +EXPORT_SYMBOL(iw_cm_disconnect); +static void destroy_cm_id(struct iw_cm_id *cm_id) +{ + struct iwcm_id_private *cm_id_priv; + unsigned long flags; + int ret; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + /* Wait if we're currently in a connect or accept downcall. A +* listening endpoint should never block here. */ + wait_event(cm_id_priv->connect_wait, + !test_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags)); Same question/comment as above. Same answer. There's a difference between trying to handle the user calling disconnect/destroy at the same time a call to accept/connect is active, versus the user calling disconnect/destroy after accept/connect have returned. In the latter case, I think you're fine. In the first case, this is allowing a user to call destroy at the same time that they're calling accept/connect. Additionally, there's no guarantee that the F_CONNECT_WAIT flag has been set by accept/connect by the time disconnect/destroy tests it. - Sean - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] iWARP Connection Manager.
Steve Wise wrote: +/* + * Release a reference on cm_id. If the last reference is being removed + * and iw_destroy_cm_id is waiting, wake up the waiting thread. + */ +static int iwcm_deref_id(struct iwcm_id_private *cm_id_priv) +{ + int ret = 0; + + BUG_ON(atomic_read(&cm_id_priv->refcount)==0); + if (atomic_dec_and_test(&cm_id_priv->refcount)) { + BUG_ON(!list_empty(&cm_id_priv->work_list)); + if (waitqueue_active(&cm_id_priv->destroy_wait)) { + BUG_ON(cm_id_priv->state != IW_CM_STATE_DESTROYING); + BUG_ON(test_bit(IWCM_F_CALLBACK_DESTROY, + &cm_id_priv->flags)); + ret = 1; + wake_up(&cm_id_priv->destroy_wait); We recently changed the RDMA CM, IB CM, and a couple of other modules from using wait objects to completions. This avoids a race condition between decrementing the reference count, which allows destruction to proceed, and calling wake_up on a freed cm_id. My guess is that you may need to do the same. Can you also explain the use of the return value here? It's ignored below in rem_ref() and destroy_cm_id(). +static void add_ref(struct iw_cm_id *cm_id) +{ + struct iwcm_id_private *cm_id_priv; + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + atomic_inc(&cm_id_priv->refcount); +} + +static void rem_ref(struct iw_cm_id *cm_id) +{ + struct iwcm_id_private *cm_id_priv; + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + iwcm_deref_id(cm_id_priv); +} + +/* + * CM_ID <-- CLOSING + * + * Block if a passive or active connection is currenlty being processed. Then + * process the event as follows: + * - If we are ESTABLISHED, move to CLOSING and modify the QP state + * based on the abrupt flag + * - If the connection is already in the CLOSING or IDLE state, the peer is + * disconnecting concurrently with us and we've already seen the + * DISCONNECT event -- ignore the request and return 0 + * - Disconnect on a listening endpoint returns -EINVAL + */ +int iw_cm_disconnect(struct iw_cm_id *cm_id, int abrupt) +{ + struct iwcm_id_private *cm_id_priv; + unsigned long flags; + int ret = 0; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + /* Wait if we're currently in a connect or accept downcall */ + wait_event(cm_id_priv->connect_wait, + !test_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags)); Am I understanding this check correctly? You're checking to see if the user has called iw_cm_disconnect() at the same time that they called iw_cm_connect() or iw_cm_accept(). Are connect / accept blocking, or are you just waiting for an event? + + spin_lock_irqsave(&cm_id_priv->lock, flags); + switch (cm_id_priv->state) { + case IW_CM_STATE_ESTABLISHED: + cm_id_priv->state = IW_CM_STATE_CLOSING; + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + if (cm_id_priv->qp) { /* QP could be for user-mode client */ + if (abrupt) + ret = iwcm_modify_qp_err(cm_id_priv->qp); + else + ret = iwcm_modify_qp_sqd(cm_id_priv->qp); + /* + * If both sides are disconnecting the QP could +* already be in ERR or SQD states +*/ + ret = 0; + } + else + ret = -EINVAL; + break; + case IW_CM_STATE_LISTEN: + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + ret = -EINVAL; + break; + case IW_CM_STATE_CLOSING: + /* remote peer closed first */ + case IW_CM_STATE_IDLE: + /* accept or connect returned !0 */ + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + break; + case IW_CM_STATE_CONN_RECV: + /* + * App called disconnect before/without calling accept after +* connect_request event delivered. +*/ + spin_unlock_irqrestore(&cm_id_priv->lock, flags); + break; + case IW_CM_STATE_CONN_SENT: + /* Can only get here if wait above fails */ + default: + BUG_ON(1); + } + + return ret; +} +EXPORT_SYMBOL(iw_cm_disconnect); +static void destroy_cm_id(struct iw_cm_id *cm_id) +{ + struct iwcm_id_private *cm_id_priv; + unsigned long flags; + int ret; + + cm_id_priv = container_of(cm_id, struct iwcm_id_private, id); + /* Wait if we're currently in a connect or accept downcall. A +* listening endpoint should never block here. */ + wait_event(cm_id_priv->connect_wait, + !test_bit(IWCM_F_CONNECT_WAIT, &cm_id_priv->flags)); Same question/comment as abo
Re: [PATCH 2/2] iWARP Core Changes.
Mainly nits... Steve Wise wrote: -static int copy_addr(struct rdma_dev_addr *dev_addr, struct net_device *dev, +int copy_addr(struct rdma_dev_addr *dev_addr, struct net_device *dev, unsigned char *dst_dev_addr) Might want to rename this to something like rdma_copy_addr if you're going to export it. +static int cma_iw_handler(struct iw_cm_id *iw_id, struct iw_cm_event *iw_event) +{ + struct rdma_id_private *id_priv = iw_id->context; + enum rdma_cm_event_type event = 0; + struct sockaddr_in *sin; + int ret = 0; + + atomic_inc(&id_priv->dev_remove); + + switch (iw_event->event) { + case IW_CM_EVENT_CLOSE: + event = RDMA_CM_EVENT_DISCONNECTED; + break; + case IW_CM_EVENT_CONNECT_REPLY: + sin = (struct sockaddr_in*)&id_priv->id.route.addr.src_addr; + *sin = iw_event->local_addr; + sin = (struct sockaddr_in*)&id_priv->id.route.addr.dst_addr; spacing nit - (struct sockaddr_in *) &id_priv->... +struct net_device *ip_dev_find(u32 ip); Just include header file with definition. + sin = (struct sockaddr_in*)&new_cm_id->route.addr.src_addr; + *sin = iw_event->local_addr; + sin = (struct sockaddr_in*)&new_cm_id->route.addr.dst_addr; same spacing nit... appears in a couple other places as well. +static inline union ib_gid* iw_addr_get_sgid(struct rdma_dev_addr* rda) +{ + return (union ib_gid*)rda->src_dev_addr; +} + +static inline union ib_gid* iw_addr_get_dgid(struct rdma_dev_addr* rda) +{ + return (union ib_gid*)rda->dst_dev_addr; +} spacing nits +struct iw_cm_verbs; struct ib_device { struct device*dma_device; @@ -846,6 +873,8 @@ struct ib_device { u32 flags; + struct iw_cm_verbs* iwcm; + '*' placement nit - Sean - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [openib-general] Re: [PATCH 4/6 v2] IB: address translation to map IP toIB addresses (GIDs)
Roland Dreier wrote: > +struct workqueue_struct *rdma_wq; > +EXPORT_SYMBOL(rdma_wq); Sean, I don't think I saw an answer when I asked you this before. Why is ib_addr exporting a workqueue? Is there some sort of ordering constraint that is forcing other modules to go through the same workqueue for things? This seems like a very fragile internal thing to be exposing, and I'm wondering if there's a better way to handle it. I responded in a different thread, but here's what I wrote: "This is simply an attempt to reduce/combine work queues used by the Infiniband code. This keeps the threading a little simpler in the rdma_cm, since all callbacks are invoked using the same work queue. (I'm also using this with the local SA/multicast code, but that's not ready for merging.)" There's no specific ordering constraint that's required. We're just ending up with several Infiniband modules creating their own work queues (ib_mad, ib_cm, ib_addr, rdma_cm, plus a couple more in modules under development), and this is an attempt to reduce that. If having separate work queues would work better, there shouldn't be anything that prevents this. - Sean - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 5/6 v2] IB: IP address based RDMA connection manager
> > +static void cma_detach_from_dev(struct rdma_id_private *id_priv) > > +{ > > + list_del(&id_priv->list); > > + if (atomic_dec_and_test(&id_priv->cma_dev->refcount)) > > + wake_up(&id_priv->cma_dev->wait); > > + id_priv->cma_dev = NULL; > > +} > >doesn't need to do atomic_dec_and_test(), because it is never dropping >the last reference to id_priv (and in fact if it was, the last line >would be a use-after-free bug). It's dropping the reference on cma_dev, as opposed to id_priv. - Sean - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 4/6 v2] IB: address translation to map IP toIB addresses (GIDs)
>The ib_addr module depends on CONFIG_INET, because it uses symbols >like arp_tbl, which are only exported if INET is enabled. > >I fixed this up by creating a new (non-user-visible) config symbol to >control when ib_addr is built -- I put the following diff on top of >your patch in my tree: Thanks! -Sean - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 5/6 v2] IB: IP address based RDMA connection manager
Kernel mode connection management agent over Infiniband that connects based on IP addresses. The agent defines a generic RDMA connection abstraction to support clients wanting to connect over different RDMA devices. Agent also handles RDMA device hotplug events on behalf of clients. Signed-off-by: Sean Hefty <[EMAIL PROTECTED]> --- diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/drivers/infiniband/core/cma.c linux-2.6.ib/drivers/infiniband/core/cma.c --- linux-2.6.git/drivers/infiniband/core/cma.c 1969-12-31 16:00:00.0 -0800 +++ linux-2.6.ib/drivers/infiniband/core/cma.c 2006-01-16 16:17:34.0 -0800 @@ -0,0 +1,1639 @@ +/* + * Copyright (c) 2005 Voltaire Inc. All rights reserved. + * Copyright (c) 2002-2005, Network Appliance, Inc. All rights reserved. + * Copyright (c) 1999-2005, Mellanox Technologies, Inc. All rights reserved. + * Copyright (c) 2005 Intel Corporation. All rights reserved. + * + * This Software is licensed under one of the following licenses: + * + * 1) under the terms of the "Common Public License 1.0" a copy of which is + *available from the Open Source Initiative, see + *http://www.opensource.org/licenses/cpl.php. + * + * 2) under the terms of the "The BSD License" a copy of which is + *available from the Open Source Initiative, see + *http://www.opensource.org/licenses/bsd-license.php. + * + * 3) under the terms of the "GNU General Public License (GPL) Version 2" a + *copy of which is available from the Open Source Initiative, see + *http://www.opensource.org/licenses/gpl-license.php. + * + * Licensee has the right to choose one of the above licenses. + * + * Redistributions of source code must retain the above copyright + * notice and one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, one of the license notices in the documentation + * and/or other materials provided with the distribution. + * + */ +#include +#include +#include +#include +#include +#include +#include + +MODULE_AUTHOR("Sean Hefty"); +MODULE_DESCRIPTION("Generic RDMA CM Agent"); +MODULE_LICENSE("Dual BSD/GPL"); + +#define CMA_CM_RESPONSE_TIMEOUT 20 +#define CMA_MAX_CM_RETRIES 3 + +static void cma_add_one(struct ib_device *device); +static void cma_remove_one(struct ib_device *device); + +static struct ib_client cma_client = { + .name = "cma", + .add= cma_add_one, + .remove = cma_remove_one +}; + +static LIST_HEAD(dev_list); +static LIST_HEAD(listen_any_list); +static DEFINE_MUTEX(lock); + +struct cma_device { + struct list_headlist; + struct ib_device*device; + __be64 node_guid; + wait_queue_head_t wait; + atomic_trefcount; + struct list_headid_list; +}; + +enum cma_state { + CMA_IDLE, + CMA_ADDR_QUERY, + CMA_ADDR_RESOLVED, + CMA_ROUTE_QUERY, + CMA_ROUTE_RESOLVED, + CMA_CONNECT, + CMA_ADDR_BOUND, + CMA_LISTEN, + CMA_DEVICE_REMOVAL, + CMA_DESTROYING +}; + +/* + * Device removal can occur at anytime, so we need extra handling to + * serialize notifying the user of device removal with other callbacks. + * We do this by disabling removal notification while a callback is in process, + * and reporting it after the callback completes. + */ +struct rdma_id_private { + struct rdma_cm_id id; + + struct list_headlist; + struct list_headlisten_list; + struct cma_device *cma_dev; + + enum cma_state state; + spinlock_t lock; + wait_queue_head_t wait; + atomic_trefcount; + wait_queue_head_t wait_remove; + atomic_tdev_remove; + + int backlog; + int timeout_ms; + struct ib_sa_query *query; + int query_id; + struct ib_cm_id *cm_id; + + u32 seq_num; + u32 qp_num; + enum ib_qp_type qp_type; + u8 srq; +}; + +struct cma_work { + struct work_struct work; + struct rdma_id_private *id; +}; + +union cma_ip_addr { + struct in6_addr ip6; + struct { + __u32 pad[3]; + __u32 addr; + } ip4; +}; + +struct cma_hdr { + u8 cma_version; + u8 ip_version; /* IP version: 7:4 */ + __u16 port; + union cma_ip_addr src_addr; + union cma_ip_addr dst_addr; +}; + +struct sdp_hh { + u8 sdp_version; + u8 ip_version; /* IP version: 7:4 */ + u8 sdp_specific1[10]; + __u16 port; + __u16 sdp_specific2; + union cma_ip_addr src_addr; + union cma_ip_addr dst_addr; +}; + +#define CMA_VERSION 0x00 +#de
[PATCH 6/6 v2] IB: userspace support for RDMA connection manager
Kernel component necessary to support the userspace RDMA connection management library. Signed-off-by: Sean Hefty <[EMAIL PROTECTED]> --- Discussion on the list suggested giving the userspace interface more time to develop, which seems reasonable. diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/drivers/infiniband/core/Makefile linux-2.6.ib/drivers/infiniband/core/Makefile --- linux-2.6.git/drivers/infiniband/core/Makefile 2006-01-16 16:58:58.0 -0800 +++ linux-2.6.ib/drivers/infiniband/core/Makefile 2006-01-16 16:55:25.0 -0800 @@ -1,5 +1,5 @@ obj-$(CONFIG_INFINIBAND) +=ib_core.o ib_mad.o ib_sa.o \ - ib_cm.o ib_addr.o rdma_cm.o + ib_cm.o ib_addr.o rdma_cm.o rdma_ucm.o obj-$(CONFIG_INFINIBAND_USER_MAD) += ib_umad.o obj-$(CONFIG_INFINIBAND_USER_ACCESS) +=ib_uverbs.o ib_ucm.o @@ -14,6 +14,8 @@ ib_cm-y :=cm.o rdma_cm-y := cma.o +rdma_ucm-y := ucma.o + ib_addr-y := addr.o ib_umad-y := user_mad.o diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/drivers/infiniband/core/ucma.c linux-2.6.ib/drivers/infiniband/core/ucma.c --- linux-2.6.git/drivers/infiniband/core/ucma.c1969-12-31 16:00:00.0 -0800 +++ linux-2.6.ib/drivers/infiniband/core/ucma.c 2006-01-16 16:54:31.0 -0800 @@ -0,0 +1,788 @@ +/* + * Copyright (c) 2005 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include +#include +#include +#include + +#include +#include +#include + +MODULE_AUTHOR("Sean Hefty"); +MODULE_DESCRIPTION("RDMA Userspace Connection Manager Access"); +MODULE_LICENSE("Dual BSD/GPL"); + +enum { + UCMA_MAX_BACKLOG= 128 +}; + +struct ucma_file { + struct mutexfile_mutex; + struct file *filp; + struct list_headctxs; + struct list_headevents; + wait_queue_head_t poll_wait; +}; + +struct ucma_context { + int id; + wait_queue_head_t wait; + atomic_tref; + int events_reported; + int backlog; + + struct ucma_file*file; + struct rdma_cm_id *cm_id; + __u64 uid; + + struct list_headevents;/* list of pending events. */ + struct list_headfile_list; /* member in file ctx list */ +}; + +struct ucma_event { + struct ucma_context *ctx; + struct list_headfile_list; /* member in file event list */ + struct list_headctx_list; /* member in ctx event list */ + struct rdma_cm_id *cm_id; + struct rdma_ucm_event_resp resp; +}; + +static DEFINE_MUTEX(ctx_mutex); +static DEFINE_IDR(ctx_idr); + +static struct ucma_context* ucma_get_ctx(struct ucma_file *file, int id) +{ + struct ucma_context *ctx; + + mutex_lock(&ctx_mutex); + ctx = idr_find(&ctx_idr, id); + if (!ctx) + ctx = ERR_PTR(-ENOENT); + else if (ctx->file != file) + ctx = ERR_PTR(-EINVAL); + else + atomic_inc(&ctx->ref); + mutex_unlock(&ctx_mutex); + + return ctx; +} + +static void ucma_put_ctx(struct ucma_context *ctx) +{ + if (atomic_dec_and_test(&ctx->ref)) + wake_up(&ctx->wait); +} + +static void ucma_cleanup_events(struct
[PATCH 4/6 v2] IB: address translation to map IP toIB addresses (GIDs)
Add an address translation service that maps IP addresses to Infiniband GID addresses using IPoIB. Signed-off-by: Sean Hefty <[EMAIL PROTECTED]> --- This should be the correct patch. The only difference between this and the mis-post is the use of mutex_lock/unlock in place of up/down. diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/drivers/infiniband/core/addr.c linux-2.6.ib/drivers/infiniband/core/addr.c --- linux-2.6.git/drivers/infiniband/core/addr.c1969-12-31 16:00:00.0 -0800 +++ linux-2.6.ib/drivers/infiniband/core/addr.c 2006-01-16 16:14:24.0 -0800 @@ -0,0 +1,356 @@ +/* + * Copyright (c) 2005 Voltaire Inc. All rights reserved. + * Copyright (c) 2002-2005, Network Appliance, Inc. All rights reserved. + * Copyright (c) 1999-2005, Mellanox Technologies, Inc. All rights reserved. + * Copyright (c) 2005 Intel Corporation. All rights reserved. + * + * This Software is licensed under one of the following licenses: + * + * 1) under the terms of the "Common Public License 1.0" a copy of which is + *available from the Open Source Initiative, see + *http://www.opensource.org/licenses/cpl.php. + * + * 2) under the terms of the "The BSD License" a copy of which is + *available from the Open Source Initiative, see + *http://www.opensource.org/licenses/bsd-license.php. + * + * 3) under the terms of the "GNU General Public License (GPL) Version 2" a + *copy of which is available from the Open Source Initiative, see + *http://www.opensource.org/licenses/gpl-license.php. + * + * Licensee has the right to choose one of the above licenses. + * + * Redistributions of source code must retain the above copyright + * notice and one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, one of the license notices in the documentation + * and/or other materials provided with the distribution. + */ +#include +#include +#include +#include +#include +#include + +MODULE_AUTHOR("Sean Hefty"); +MODULE_DESCRIPTION("IB Address Translation"); +MODULE_LICENSE("Dual BSD/GPL"); + +struct addr_req { + struct list_head list; + struct sockaddr src_addr; + struct sockaddr dst_addr; + struct rdma_dev_addr *addr; + void *context; + void (*callback)(int status, struct sockaddr *src_addr, +struct rdma_dev_addr *addr, void *context); + unsigned long timeout; + int status; +}; + +static void process_req(void *data); + +static DEFINE_MUTEX(lock); +static LIST_HEAD(req_list); +static DECLARE_WORK(work, process_req, NULL); +struct workqueue_struct *rdma_wq; +EXPORT_SYMBOL(rdma_wq); + +static int copy_addr(struct rdma_dev_addr *dev_addr, struct net_device *dev, +unsigned char *dst_dev_addr) +{ + switch (dev->type) { + case ARPHRD_INFINIBAND: + dev_addr->dev_type = IB_NODE_CA; + break; + default: + return -EADDRNOTAVAIL; + } + + memcpy(dev_addr->src_dev_addr, dev->dev_addr, MAX_ADDR_LEN); + memcpy(dev_addr->broadcast, dev->broadcast, MAX_ADDR_LEN); + if (dst_dev_addr) + memcpy(dev_addr->dst_dev_addr, dst_dev_addr, MAX_ADDR_LEN); + return 0; +} + +int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr) +{ + struct net_device *dev; + u32 ip = ((struct sockaddr_in *) addr)->sin_addr.s_addr; + int ret; + + dev = ip_dev_find(ip); + if (!dev) + return -EADDRNOTAVAIL; + + ret = copy_addr(dev_addr, dev, NULL); + dev_put(dev); + return ret; +} +EXPORT_SYMBOL(rdma_translate_ip); + +static void set_timeout(unsigned long time) +{ + unsigned long delay; + + cancel_delayed_work(&work); + + delay = time - jiffies; + if ((long)delay <= 0) + delay = 1; + + queue_delayed_work(rdma_wq, &work, delay); +} + +static void queue_req(struct addr_req *req) +{ + struct addr_req *temp_req; + + mutex_lock(&lock); + list_for_each_entry_reverse(temp_req, &req_list, list) { + if (time_after(req->timeout, temp_req->timeout)) + break; + } + + list_add(&req->list, &temp_req->list); + + if (req_list.next == &req->list) + set_timeout(req->timeout); + mutex_unlock(&lock); +} + +static void addr_send_arp(struct sockaddr_in *dst_in) +{ + struct rtable *rt; + struct flowi fl; + u32 dst_ip = dst_in->sin_addr.s_addr; + + memset(&fl, 0, sizeof fl); + fl.nl_u.ip4_u.daddr = dst_ip; + if (ip_route_output_key(&rt, &fl)) + return; + + arp_send(ARPOP_REQUEST, ETH_P_ARP, rt->rt_gateway, rt->idev->dev, +rt->r
[PATCH 2/6 v2] IB: match connection requests based on private data
Extend matching connection requests to listens in the Infiniband CM to include private data checks. This allows applications to listen on the same service identifier, with private data directing the request to the appropriate application. Signed-off-by: Sean Hefty <[EMAIL PROTECTED]> --- This should be the correct patch that incorporates feedback from the initial submission. Sorry about the mis-post. diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/drivers/infiniband/core/cm.c linux-2.6.ib/drivers/infiniband/core/cm.c --- linux-2.6.git/drivers/infiniband/core/cm.c 2006-01-16 10:25:26.0 -0800 +++ linux-2.6.ib/drivers/infiniband/core/cm.c 2006-01-16 16:03:35.0 -0800 @@ -32,7 +32,7 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: cm.c 2821 2005-07-08 17:07:28Z sean.hefty $ + * $Id: cm.c 4311 2005-12-05 18:42:01Z sean.hefty $ */ #include #include @@ -130,6 +130,7 @@ struct cm_id_private { /* todo: use alternate port on send failure */ struct cm_av av; struct cm_av alt_av; + struct ib_cm_compare_data *compare_data; void *private_data; __be64 tid; @@ -355,6 +356,41 @@ static struct cm_id_private * cm_acquire return cm_id_priv; } +static void cm_mask_copy(u8 *dst, u8 *src, u8 *mask) +{ + int i; + + for (i = 0; i < IB_CM_COMPARE_SIZE / sizeof(unsigned long); i++) + ((unsigned long *) dst)[i] = ((unsigned long *) src)[i] & +((unsigned long *) mask)[i]; +} + +static int cm_compare_data(struct ib_cm_compare_data *src_data, + struct ib_cm_compare_data *dst_data) +{ + u8 src[IB_CM_COMPARE_SIZE]; + u8 dst[IB_CM_COMPARE_SIZE]; + + if (!src_data || !dst_data) + return 0; + + cm_mask_copy(src, src_data->data, dst_data->mask); + cm_mask_copy(dst, dst_data->data, src_data->mask); + return memcmp(src, dst, IB_CM_COMPARE_SIZE); +} + +static int cm_compare_private_data(u8 *private_data, + struct ib_cm_compare_data *dst_data) +{ + u8 src[IB_CM_COMPARE_SIZE]; + + if (!dst_data) + return 0; + + cm_mask_copy(src, private_data, dst_data->mask); + return memcmp(src, dst_data->data, IB_CM_COMPARE_SIZE); +} + static struct cm_id_private * cm_insert_listen(struct cm_id_private *cm_id_priv) { struct rb_node **link = &cm.listen_service_table.rb_node; @@ -362,14 +397,18 @@ static struct cm_id_private * cm_insert_ struct cm_id_private *cur_cm_id_priv; __be64 service_id = cm_id_priv->id.service_id; __be64 service_mask = cm_id_priv->id.service_mask; + int data_cmp; while (*link) { parent = *link; cur_cm_id_priv = rb_entry(parent, struct cm_id_private, service_node); + data_cmp = cm_compare_data(cm_id_priv->compare_data, + cur_cm_id_priv->compare_data); if ((cur_cm_id_priv->id.service_mask & service_id) == (service_mask & cur_cm_id_priv->id.service_id) && - (cm_id_priv->id.device == cur_cm_id_priv->id.device)) + (cm_id_priv->id.device == cur_cm_id_priv->id.device) && + !data_cmp) return cur_cm_id_priv; if (cm_id_priv->id.device < cur_cm_id_priv->id.device) @@ -378,6 +417,10 @@ static struct cm_id_private * cm_insert_ link = &(*link)->rb_right; else if (service_id < cur_cm_id_priv->id.service_id) link = &(*link)->rb_left; + else if (service_id > cur_cm_id_priv->id.service_id) + link = &(*link)->rb_right; + else if (data_cmp < 0) + link = &(*link)->rb_left; else link = &(*link)->rb_right; } @@ -387,16 +430,20 @@ static struct cm_id_private * cm_insert_ } static struct cm_id_private * cm_find_listen(struct ib_device *device, -__be64 service_id) +__be64 service_id, +u8 *private_data) { struct rb_node *node = cm.listen_service_table.rb_node; struct cm_id_private *cm_id_priv; + int data_cmp; while (node) { cm_id_priv = rb_entry(node, struct cm_id_private, service_node); + data_cmp = cm_compare_private_data(private_data, + cm_id_priv->compare_data); i
Re: [openib-general] [PATCH 2/6] IB: match connection requests based on private data
Sean Hefty wrote: +static void cm_mask_compare_data(u8 *dst, u8 *src, u8 *mask) +{ + int i; + + for (i = 0; i < IB_CM_PRIVATE_DATA_COMPARE_SIZE; i++) + dst[i] = src[i] & mask[i]; +} + +static int cm_compare_data(struct ib_cm_private_data_compare *src_data, + struct ib_cm_private_data_compare *dst_data) +{ + u8 src[IB_CM_PRIVATE_DATA_COMPARE_SIZE]; + u8 dst[IB_CM_PRIVATE_DATA_COMPARE_SIZE]; Ugh. I sent the wrong patch series. This was the original set of patches, before any feedback was incorporated. I will need to resend patches 2, 4, 5, and 6. Sorry about this. - Sean - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [openib-general] Re: [PATCH 6/6] IB: userspace support for RDMA connection manager
Roland Dreier wrote: > +struct rdma_ucm_query_route_resp { > + __u64 node_guid; > + struct ib_user_path_rec ib_route[2]; > + struct sockaddr_in6 src_addr; > + struct sockaddr_in6 dst_addr; > + __u32 num_paths; > + __u8 port_num; > + __u8 reserved[3]; > +}; Is there a 32-bit/64-bit compatibility problem here? From a quick look, struct sockaddr_in6 is not 8-byte aligned. Unless I miss counted, they should be aligned. ib_user_path_rec is defined near the end of patch 1/6. +struct ib_user_path_rec { + __u8dgid[16]; + __u8sgid[16]; + __be16 dlid; + __be16 slid; + __u32 raw_traffic; + __be32 flow_label; + __u32 reversible; + __u32 mtu; + __be16 pkey; + __u8hop_limit; + __u8traffic_class; + __u8numb_path; + __u8sl; + __u8mtu_selector; + __u8rate_selector; + __u8rate; + __u8packet_life_time_selector; + __u8packet_life_time; + __u8preference; +}; - Sean - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [openib-general] Re: [PATCH 6/6] IB: userspace support for RDMA connection manager
Roland Dreier wrote: On the other hand I think it would be good to let this userspace interface cook a little more, say in -mm. I think that this makes sense. - Sean - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [openib-general] RE: [PATCH 2/6] IB: match connection requests based on private data
Caitlin Bestler wrote: The term "private data" is intended to convey the intent that the data is private to the application layer and is opaque to middleware and the network. The private data area is for the use of whatever client resides above the Infiniband CM only. There is no assumption about whether that client is middleware or an application. By what mechanism does the listening application delegate how much of the private data for use by the CM for sub-dividing a listen? What does an application do if it wishes to retain full ownership of the private data? An application that interfaces directly with the Infiniband CM always retains full control of any private data. Applications that interface to middleware are restricted by the limitations of that middleware layer. - Sean - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 5/6] IB: IP address based RDMA connection manager
Kernel mode connection management agent over Infiniband that connects based on IP addresses. The agent defines a generic RDMA connection abstraction to support clients wanting to connect over different RDMA devices. Agent also handles RDMA device hotplug events on behalf of clients. Signed-off-by: Sean Hefty <[EMAIL PROTECTED]> --- diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/drivers/infiniband/core/cma.c linux-2.6.ib/drivers/infiniband/core/cma.c --- linux-2.6.git/drivers/infiniband/core/cma.c 1969-12-31 16:00:00.0 -0800 +++ linux-2.6.ib/drivers/infiniband/core/cma.c 2006-01-16 16:17:34.0 -0800 @@ -0,0 +1,1639 @@ +/* + * Copyright (c) 2005 Voltaire Inc. All rights reserved. + * Copyright (c) 2002-2005, Network Appliance, Inc. All rights reserved. + * Copyright (c) 1999-2005, Mellanox Technologies, Inc. All rights reserved. + * Copyright (c) 2005 Intel Corporation. All rights reserved. + * + * This Software is licensed under one of the following licenses: + * + * 1) under the terms of the "Common Public License 1.0" a copy of which is + *available from the Open Source Initiative, see + *http://www.opensource.org/licenses/cpl.php. + * + * 2) under the terms of the "The BSD License" a copy of which is + *available from the Open Source Initiative, see + *http://www.opensource.org/licenses/bsd-license.php. + * + * 3) under the terms of the "GNU General Public License (GPL) Version 2" a + *copy of which is available from the Open Source Initiative, see + *http://www.opensource.org/licenses/gpl-license.php. + * + * Licensee has the right to choose one of the above licenses. + * + * Redistributions of source code must retain the above copyright + * notice and one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, one of the license notices in the documentation + * and/or other materials provided with the distribution. + * + */ +#include +#include +#include +#include +#include +#include +#include + +MODULE_AUTHOR("Guy German"); +MODULE_DESCRIPTION("Generic RDMA CM Agent"); +MODULE_LICENSE("Dual BSD/GPL"); + +#define CMA_CM_RESPONSE_TIMEOUT 20 +#define CMA_MAX_CM_RETRIES 3 + +static void cma_add_one(struct ib_device *device); +static void cma_remove_one(struct ib_device *device); + +static struct ib_client cma_client = { + .name = "cma", + .add= cma_add_one, + .remove = cma_remove_one +}; + +static LIST_HEAD(dev_list); +static LIST_HEAD(listen_any_list); +static DECLARE_MUTEX(mutex); + +struct cma_device { + struct list_headlist; + struct ib_device*device; + __be64 node_guid; + wait_queue_head_t wait; + atomic_trefcount; + struct list_headid_list; +}; + +enum cma_state { + CMA_IDLE, + CMA_ADDR_QUERY, + CMA_ADDR_RESOLVED, + CMA_ROUTE_QUERY, + CMA_ROUTE_RESOLVED, + CMA_CONNECT, + CMA_ADDR_BOUND, + CMA_LISTEN, + CMA_DEVICE_REMOVAL, + CMA_DESTROYING +}; + +/* + * Device removal can occur at anytime, so we need extra handling to + * serialize notifying the user of device removal with other callbacks. + * We do this by disabling removal notification while a callback is in process, + * and reporting it after the callback completes. + */ +struct rdma_id_private { + struct rdma_cm_id id; + + struct list_headlist; + struct list_headlisten_list; + struct cma_device *cma_dev; + + enum cma_state state; + spinlock_t lock; + wait_queue_head_t wait; + atomic_trefcount; + wait_queue_head_t wait_remove; + atomic_tdev_remove; + + int backlog; + int timeout_ms; + struct ib_sa_query *query; + int query_id; + struct ib_cm_id *cm_id; + + u32 seq_num; + u32 qp_num; + enum ib_qp_type qp_type; + u8 srq; +}; + +struct cma_work { + struct work_struct work; + struct rdma_id_private *id; +}; + +union cma_ip_addr { + struct in6_addr ip6; + struct { + __u32 pad[3]; + __u32 addr; + } ip4; +}; + +struct cma_hdr { + u8 cma_version; + u8 ip_version; /* IP version: 7:4 */ + __u16 port; + union cma_ip_addr src_addr; + union cma_ip_addr dst_addr; +}; + +struct sdp_hh { + u8 sdp_version; + u8 ip_version; /* IP version: 7:4 */ + u8 sdp_specific1[10]; + __u16 port; + __u16 sdp_specific2; + union cma_ip_addr src_addr; + union cma_ip_addr dst_addr; +}; + +#define CMA_VERSION 0x10 +#de
[PATCH 6/6] IB: userspace support for RDMA connection manager
Kernel component necessary to support the userspace RDMA connection management library. Signed-off-by: Sean Hefty <[EMAIL PROTECTED]> --- diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/drivers/infiniband/core/Makefile linux-2.6.ib/drivers/infiniband/core/Makefile --- linux-2.6.git/drivers/infiniband/core/Makefile 2006-01-16 16:58:58.0 -0800 +++ linux-2.6.ib/drivers/infiniband/core/Makefile 2006-01-16 16:55:25.0 -0800 @@ -1,5 +1,5 @@ obj-$(CONFIG_INFINIBAND) +=ib_core.o ib_mad.o ib_sa.o \ - ib_cm.o ib_addr.o rdma_cm.o + ib_cm.o ib_addr.o rdma_cm.o rdma_ucm.o obj-$(CONFIG_INFINIBAND_USER_MAD) += ib_umad.o obj-$(CONFIG_INFINIBAND_USER_ACCESS) +=ib_uverbs.o ib_ucm.o @@ -14,6 +14,8 @@ ib_cm-y :=cm.o rdma_cm-y := cma.o +rdma_ucm-y := ucma.o + ib_addr-y := addr.o ib_umad-y := user_mad.o diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/drivers/infiniband/core/ucma.c linux-2.6.ib/drivers/infiniband/core/ucma.c --- linux-2.6.git/drivers/infiniband/core/ucma.c1969-12-31 16:00:00.0 -0800 +++ linux-2.6.ib/drivers/infiniband/core/ucma.c 2006-01-16 16:54:31.0 -0800 @@ -0,0 +1,788 @@ +/* + * Copyright (c) 2005 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include +#include +#include +#include + +#include +#include +#include + +MODULE_AUTHOR("Sean Hefty"); +MODULE_DESCRIPTION("RDMA Userspace Connection Manager Access"); +MODULE_LICENSE("Dual BSD/GPL"); + +enum { + UCMA_MAX_BACKLOG= 128 +}; + +struct ucma_file { + struct semaphoremutex; + struct file *filp; + struct list_headctxs; + struct list_headevents; + wait_queue_head_t poll_wait; +}; + +struct ucma_context { + int id; + wait_queue_head_t wait; + atomic_tref; + int events_reported; + int backlog; + + struct ucma_file*file; + struct rdma_cm_id *cm_id; + __u64 uid; + + struct list_headevents;/* list of pending events. */ + struct list_headfile_list; /* member in file ctx list */ +}; + +struct ucma_event { + struct ucma_context *ctx; + struct list_headfile_list; /* member in file event list */ + struct list_headctx_list; /* member in ctx event list */ + struct rdma_cm_id *cm_id; + struct rdma_ucm_event_resp resp; +}; + +static DECLARE_MUTEX(ctx_mutex); +static DEFINE_IDR(ctx_idr); + +static struct ucma_context* ucma_get_ctx(struct ucma_file *file, int id) +{ + struct ucma_context *ctx; + + down(&ctx_mutex); + ctx = idr_find(&ctx_idr, id); + if (!ctx) + ctx = ERR_PTR(-ENOENT); + else if (ctx->file != file) + ctx = ERR_PTR(-EINVAL); + else + atomic_inc(&ctx->ref); + up(&ctx_mutex); + + return ctx; +} + +static void ucma_put_ctx(struct ucma_context *ctx) +{ + if (atomic_dec_and_test(&ctx->ref)) + wake_up(&ctx->wait); +} + +static void ucma_cleanup_events(struct ucma_context *ctx) +{ + struct ucma_event *uevent; + + down(&ctx->file->mutex); + list_del(&ctx-&g
[PATCH 4/6] IB: address translation to map IP to IB addresses (GIDs)
Add an address translation service that maps IP addresses to Infiniband GID addresses using IPoIB. Signed-off-by: Sean Hefty <[EMAIL PROTECTED]> --- diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/drivers/infiniband/core/addr.c linux-2.6.ib/drivers/infiniband/core/addr.c --- linux-2.6.git/drivers/infiniband/core/addr.c1969-12-31 16:00:00.0 -0800 +++ linux-2.6.ib/drivers/infiniband/core/addr.c 2006-01-16 16:14:24.0 -0800 @@ -0,0 +1,356 @@ +/* + * Copyright (c) 2005 Voltaire Inc. All rights reserved. + * Copyright (c) 2002-2005, Network Appliance, Inc. All rights reserved. + * Copyright (c) 1999-2005, Mellanox Technologies, Inc. All rights reserved. + * Copyright (c) 2005 Intel Corporation. All rights reserved. + * + * This Software is licensed under one of the following licenses: + * + * 1) under the terms of the "Common Public License 1.0" a copy of which is + *available from the Open Source Initiative, see + *http://www.opensource.org/licenses/cpl.php. + * + * 2) under the terms of the "The BSD License" a copy of which is + *available from the Open Source Initiative, see + *http://www.opensource.org/licenses/bsd-license.php. + * + * 3) under the terms of the "GNU General Public License (GPL) Version 2" a + *copy of which is available from the Open Source Initiative, see + *http://www.opensource.org/licenses/gpl-license.php. + * + * Licensee has the right to choose one of the above licenses. + * + * Redistributions of source code must retain the above copyright + * notice and one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, one of the license notices in the documentation + * and/or other materials provided with the distribution. + */ +#include +#include +#include +#include +#include +#include + +MODULE_AUTHOR("Sean Hefty"); +MODULE_DESCRIPTION("IB Address Translation"); +MODULE_LICENSE("Dual BSD/GPL"); + +struct addr_req { + struct list_head list; + struct sockaddr src_addr; + struct sockaddr dst_addr; + struct rdma_dev_addr *addr; + void *context; + void (*callback)(int status, struct sockaddr *src_addr, +struct rdma_dev_addr *addr, void *context); + unsigned long timeout; + int status; +}; + +static void process_req(void *data); + +static DECLARE_MUTEX(mutex); +static LIST_HEAD(req_list); +static DECLARE_WORK(work, process_req, NULL); +struct workqueue_struct *rdma_wq; +EXPORT_SYMBOL(rdma_wq); + +static int copy_addr(struct rdma_dev_addr *dev_addr, struct net_device *dev, +unsigned char *dst_dev_addr) +{ + switch (dev->type) { + case ARPHRD_INFINIBAND: + dev_addr->dev_type = IB_NODE_CA; + break; + default: + return -EADDRNOTAVAIL; + } + + memcpy(dev_addr->src_dev_addr, dev->dev_addr, MAX_ADDR_LEN); + memcpy(dev_addr->broadcast, dev->broadcast, MAX_ADDR_LEN); + if (dst_dev_addr) + memcpy(dev_addr->dst_dev_addr, dst_dev_addr, MAX_ADDR_LEN); + return 0; +} + +int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr) +{ + struct net_device *dev; + u32 ip = ((struct sockaddr_in *) addr)->sin_addr.s_addr; + int ret; + + dev = ip_dev_find(ip); + if (!dev) + return -EADDRNOTAVAIL; + + ret = copy_addr(dev_addr, dev, NULL); + dev_put(dev); + return ret; +} +EXPORT_SYMBOL(rdma_translate_ip); + +static void set_timeout(unsigned long time) +{ + unsigned long delay; + + cancel_delayed_work(&work); + + delay = time - jiffies; + if ((long)delay <= 0) + delay = 1; + + queue_delayed_work(rdma_wq, &work, delay); +} + +static void queue_req(struct addr_req *req) +{ + struct addr_req *temp_req; + + down(&mutex); + list_for_each_entry_reverse(temp_req, &req_list, list) { + if (time_after(req->timeout, temp_req->timeout)) + break; + } + + list_add(&req->list, &temp_req->list); + + if (req_list.next == &req->list) + set_timeout(req->timeout); + up(&mutex); +} + +static void addr_send_arp(struct sockaddr_in *dst_in) +{ + struct rtable *rt; + struct flowi fl; + u32 dst_ip = dst_in->sin_addr.s_addr; + + memset(&fl, 0, sizeof fl); + fl.nl_u.ip4_u.daddr = dst_ip; + if (ip_route_output_key(&rt, &fl)) + return; + + arp_send(ARPOP_REQUEST, ETH_P_ARP, rt->rt_gateway, rt->idev->dev, +rt->rt_src, NULL, rt->idev->dev->dev_addr, NULL); + ip_rt_put(rt); +} + +static int addr_resolve_remote(struct sockaddr_in *src_in, +
[PATCH 3/6] net/IB: export ip_dev_find
Export ip_dev_find to allow locating a net_device given an IP address. Signed-off-by: Sean Hefty <[EMAIL PROTECTED]> --- diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/net/ipv4/fib_frontend.c linux-2.6.ib/net/ipv4/fib_frontend.c --- linux-2.6.git/net/ipv4/fib_frontend.c 2006-01-16 10:28:29.0 -0800 +++ linux-2.6.ib/net/ipv4/fib_frontend.c2006-01-16 16:14:24.0 -0800 @@ -666,4 +666,5 @@ void __init ip_fib_init(void) } EXPORT_SYMBOL(inet_addr_type); +EXPORT_SYMBOL(ip_dev_find); EXPORT_SYMBOL(ip_rt_ioctl); - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/6] IB: match connection requests based on private data
Extend matching connection requests to listens in the Infiniband CM to include private data checks. This allows applications to listen on the same service identifier, with private data directing the request to the appropriate application. Signed-off-by: Sean Hefty <[EMAIL PROTECTED]> --- diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/drivers/infiniband/core/cm.c linux-2.6.ib/drivers/infiniband/core/cm.c --- linux-2.6.git/drivers/infiniband/core/cm.c 2006-01-16 10:25:26.0 -0800 +++ linux-2.6.ib/drivers/infiniband/core/cm.c 2006-01-16 16:03:35.0 -0800 @@ -32,7 +32,7 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: cm.c 2821 2005-07-08 17:07:28Z sean.hefty $ + * $Id: cm.c 4311 2005-12-05 18:42:01Z sean.hefty $ */ #include #include @@ -130,6 +130,7 @@ struct cm_id_private { /* todo: use alternate port on send failure */ struct cm_av av; struct cm_av alt_av; + struct ib_cm_private_data_compare *compare_data; void *private_data; __be64 tid; @@ -355,6 +356,40 @@ static struct cm_id_private * cm_acquire return cm_id_priv; } +static void cm_mask_compare_data(u8 *dst, u8 *src, u8 *mask) +{ + int i; + + for (i = 0; i < IB_CM_PRIVATE_DATA_COMPARE_SIZE; i++) + dst[i] = src[i] & mask[i]; +} + +static int cm_compare_data(struct ib_cm_private_data_compare *src_data, + struct ib_cm_private_data_compare *dst_data) +{ + u8 src[IB_CM_PRIVATE_DATA_COMPARE_SIZE]; + u8 dst[IB_CM_PRIVATE_DATA_COMPARE_SIZE]; + + if (!src_data || !dst_data) + return 0; + + cm_mask_compare_data(src, src_data->data, dst_data->mask); + cm_mask_compare_data(dst, dst_data->data, src_data->mask); + return memcmp(src, dst, IB_CM_PRIVATE_DATA_COMPARE_SIZE); +} + +static int cm_compare_private_data(u8 *private_data, + struct ib_cm_private_data_compare *dst_data) +{ + u8 src[IB_CM_PRIVATE_DATA_COMPARE_SIZE]; + + if (!dst_data) + return 0; + + cm_mask_compare_data(src, private_data, dst_data->mask); + return memcmp(src, dst_data->data, IB_CM_PRIVATE_DATA_COMPARE_SIZE); +} + static struct cm_id_private * cm_insert_listen(struct cm_id_private *cm_id_priv) { struct rb_node **link = &cm.listen_service_table.rb_node; @@ -362,14 +397,18 @@ static struct cm_id_private * cm_insert_ struct cm_id_private *cur_cm_id_priv; __be64 service_id = cm_id_priv->id.service_id; __be64 service_mask = cm_id_priv->id.service_mask; + int data_cmp; while (*link) { parent = *link; cur_cm_id_priv = rb_entry(parent, struct cm_id_private, service_node); + data_cmp = cm_compare_data(cm_id_priv->compare_data, + cur_cm_id_priv->compare_data); if ((cur_cm_id_priv->id.service_mask & service_id) == (service_mask & cur_cm_id_priv->id.service_id) && - (cm_id_priv->id.device == cur_cm_id_priv->id.device)) + (cm_id_priv->id.device == cur_cm_id_priv->id.device) && + !data_cmp) return cur_cm_id_priv; if (cm_id_priv->id.device < cur_cm_id_priv->id.device) @@ -378,6 +417,10 @@ static struct cm_id_private * cm_insert_ link = &(*link)->rb_right; else if (service_id < cur_cm_id_priv->id.service_id) link = &(*link)->rb_left; + else if (service_id > cur_cm_id_priv->id.service_id) + link = &(*link)->rb_right; + else if (data_cmp < 0) + link = &(*link)->rb_left; else link = &(*link)->rb_right; } @@ -387,16 +430,20 @@ static struct cm_id_private * cm_insert_ } static struct cm_id_private * cm_find_listen(struct ib_device *device, -__be64 service_id) +__be64 service_id, +u8 *private_data) { struct rb_node *node = cm.listen_service_table.rb_node; struct cm_id_private *cm_id_priv; + int data_cmp; while (node) { cm_id_priv = rb_entry(node, struct cm_id_private, service_node); + data_cmp = cm_compare_private_data(private_data, + cm_id_priv->compare_data); if ((cm_id_priv->id.service_mask & service_id) == cm_id_priv->id.service_id &&
[PATCH 1/6] IB: common handling for marshalling parameters to/from userspace
Provide common handling for marshalling data between userspace clients and kernel mode Infiniband drivers. Signed-off-by: Sean Hefty <[EMAIL PROTECTED]> --- diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/drivers/infiniband/core/Makefile linux-2.6.ib/drivers/infiniband/core/Makefile --- linux-2.6.git/drivers/infiniband/core/Makefile 2006-01-16 10:25:27.0 -0800 +++ linux-2.6.ib/drivers/infiniband/core/Makefile 2006-01-16 15:34:15.0 -0800 @@ -16,4 +16,5 @@ ib_umad-y := user_mad.o ib_ucm-y :=ucm.o -ib_uverbs-y := uverbs_main.o uverbs_cmd.o uverbs_mem.o +ib_uverbs-y := uverbs_main.o uverbs_cmd.o uverbs_mem.o \ + uverbs_marshall.o diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/drivers/infiniband/core/ucm.c linux-2.6.ib/drivers/infiniband/core/ucm.c --- linux-2.6.git/drivers/infiniband/core/ucm.c 2006-01-16 10:25:26.0 -0800 +++ linux-2.6.ib/drivers/infiniband/core/ucm.c 2006-01-16 15:34:15.0 -0800 @@ -30,7 +30,7 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: ucm.c 2594 2005-06-13 19:46:02Z libor $ + * $Id: ucm.c 4311 2005-12-05 18:42:01Z sean.hefty $ */ #include #include @@ -48,6 +48,7 @@ #include #include +#include MODULE_AUTHOR("Libor Michalek"); MODULE_DESCRIPTION("InfiniBand userspace Connection Manager access"); @@ -203,36 +204,6 @@ error: return NULL; } -static void ib_ucm_event_path_get(struct ib_ucm_path_rec *upath, - struct ib_sa_path_rec *kpath) -{ - if (!kpath || !upath) - return; - - memcpy(upath->dgid, kpath->dgid.raw, sizeof *upath->dgid); - memcpy(upath->sgid, kpath->sgid.raw, sizeof *upath->sgid); - - upath->dlid = kpath->dlid; - upath->slid = kpath->slid; - upath->raw_traffic = kpath->raw_traffic; - upath->flow_label = kpath->flow_label; - upath->hop_limit= kpath->hop_limit; - upath->traffic_class= kpath->traffic_class; - upath->reversible = kpath->reversible; - upath->numb_path= kpath->numb_path; - upath->pkey = kpath->pkey; - upath->sl = kpath->sl; - upath->mtu_selector = kpath->mtu_selector; - upath->mtu = kpath->mtu; - upath->rate_selector= kpath->rate_selector; - upath->rate = kpath->rate; - upath->packet_life_time = kpath->packet_life_time; - upath->preference = kpath->preference; - - upath->packet_life_time_selector = - kpath->packet_life_time_selector; -} - static void ib_ucm_event_req_get(struct ib_ucm_req_event_resp *ureq, struct ib_cm_req_event_param *kreq) { @@ -251,8 +222,10 @@ static void ib_ucm_event_req_get(struct ureq->srq= kreq->srq; ureq->port = kreq->port; - ib_ucm_event_path_get(&ureq->primary_path, kreq->primary_path); - ib_ucm_event_path_get(&ureq->alternate_path, kreq->alternate_path); + ib_copy_path_rec_to_user(&ureq->primary_path, kreq->primary_path); + if (kreq->alternate_path) + ib_copy_path_rec_to_user(&ureq->alternate_path, +kreq->alternate_path); } static void ib_ucm_event_rep_get(struct ib_ucm_rep_event_resp *urep, @@ -322,8 +295,8 @@ static int ib_ucm_event_process(struct i info = evt->param.rej_rcvd.ari; break; case IB_CM_LAP_RECEIVED: - ib_ucm_event_path_get(&uvt->resp.u.lap_resp.path, - evt->param.lap_rcvd.alternate_path); + ib_copy_path_rec_to_user(&uvt->resp.u.lap_resp.path, +evt->param.lap_rcvd.alternate_path); uvt->data_len = IB_CM_LAP_PRIVATE_DATA_SIZE; uvt->resp.present = IB_UCM_PRES_ALTERNATE; break; @@ -635,65 +608,11 @@ static ssize_t ib_ucm_attr_id(struct ib_ return result; } -static void ib_ucm_copy_ah_attr(struct ib_ucm_ah_attr *dest_attr, - struct ib_ah_attr *src_attr) -{ - memcpy(dest_attr->grh_dgid, src_attr->grh.dgid.raw, - sizeof src_attr->grh.dgid); - dest_attr->grh_flow_label = src_attr->grh.flow_label; - dest_attr->grh_sgid_index = src_attr->grh.sgid_index; - dest_attr->grh_hop_limit = src_attr->grh.hop_limit; - dest_attr->grh_traffic_class = src_att
[PATCH 3/5] export of ip_dev_find as part of Infiniband connection abstraction
I wanted to make doubly sure that this didn't get lost in the patch series, but ip_dev_find() is re-exported. The use is shown below. - Sean >+int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr) >+{ >+ struct net_device *dev; >+ u32 ip = ((struct sockaddr_in *) addr)->sin_addr.s_addr; >+ int ret; >+ >+ dev = ip_dev_find(ip); >+ if (!dev) >+ return -EADDRNOTAVAIL; >+ >+ ret = copy_addr(dev_addr, dev, NULL); >+ dev_put(dev); >+ return ret; >+} {snip} >+static int addr_resolve_local(struct sockaddr_in *src_in, >+struct sockaddr_in *dst_in, >+struct rdma_dev_addr *addr) >+{ >+ struct net_device *dev; >+ u32 src_ip = src_in->sin_addr.s_addr; >+ u32 dst_ip = dst_in->sin_addr.s_addr; >+ int ret; >+ >+ dev = ip_dev_find(dst_ip); >+ if (!dev) >+ return -EADDRNOTAVAIL; >+ >+ if (!src_ip) { >+ src_in->sin_family = dst_in->sin_family; >+ src_in->sin_addr.s_addr = dst_ip; >+ ret = copy_addr(addr, dev, dev->dev_addr); >+ } else { >+ ret = rdma_translate_ip((struct sockaddr *)src_in, addr); >+ if (!ret) >+ memcpy(addr->dst_dev_addr, dev->dev_addr, MAX_ADDR_LEN); >+ } >+ >+ dev_put(dev); >+ return ret; >+} {snip} >diff -uprN -X linux-2.6.git/Documentation/dontdiff >linux-2.6.git/net/ipv4/fib_frontend.c >linux-2.6.ib/net/ipv4/fib_frontend.c >--- linux-2.6.git/net/ipv4/fib_frontend.c 2006-01-16 10:28:29.0 -0800 >+++ linux-2.6.ib/net/ipv4/fib_frontend.c 2006-01-16 16:14:24.0 -0800 >@@ -666,4 +666,5 @@ void __init ip_fib_init(void) > } > > EXPORT_SYMBOL(inet_addr_type); >+EXPORT_SYMBOL(ip_dev_find); > EXPORT_SYMBOL(ip_rt_ioctl); - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/5] Infiniband: connection abstraction
>Here's an updated version of these patches based on feedback. (The license >did not change and continues to match that of the other Infiniband code.) >Please consider for inclusion in 2.6.17. This is just a ping for anymore feedback to this patch series, so that I can respond to any requests before 2.6.17 opens up. I can resubmit the patches if necessary. Thanks, Sean - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 5/5] Infiniband: connection abstraction
This patch adds the kernel component to support the userspace Infiniband/RDMA connection agent library. Signed-off-by: Sean Hefty <[EMAIL PROTECTED]> --- diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/drivers/infiniband/core/Makefile linux-2.6.ib/drivers/infiniband/core/Makefile --- linux-2.6.git/drivers/infiniband/core/Makefile 2006-01-16 16:58:58.0 -0800 +++ linux-2.6.ib/drivers/infiniband/core/Makefile 2006-01-16 16:55:25.0 -0800 @@ -1,5 +1,5 @@ obj-$(CONFIG_INFINIBAND) +=ib_core.o ib_mad.o ib_sa.o \ - ib_cm.o ib_addr.o rdma_cm.o + ib_cm.o ib_addr.o rdma_cm.o rdma_ucm.o obj-$(CONFIG_INFINIBAND_USER_MAD) += ib_umad.o obj-$(CONFIG_INFINIBAND_USER_ACCESS) +=ib_uverbs.o ib_ucm.o @@ -14,6 +14,8 @@ ib_cm-y :=cm.o rdma_cm-y := cma.o +rdma_ucm-y := ucma.o + ib_addr-y := addr.o ib_umad-y := user_mad.o diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/drivers/infiniband/core/ucma.c linux-2.6.ib/drivers/infiniband/core/ucma.c --- linux-2.6.git/drivers/infiniband/core/ucma.c1969-12-31 16:00:00.0 -0800 +++ linux-2.6.ib/drivers/infiniband/core/ucma.c 2006-01-16 16:54:31.0 -0800 @@ -0,0 +1,788 @@ +/* + * Copyright (c) 2005 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include +#include +#include +#include + +#include +#include +#include + +MODULE_AUTHOR("Sean Hefty"); +MODULE_DESCRIPTION("RDMA Userspace Connection Manager Access"); +MODULE_LICENSE("Dual BSD/GPL"); + +enum { + UCMA_MAX_BACKLOG= 128 +}; + +struct ucma_file { + struct mutexfile_mutex; + struct file *filp; + struct list_headctxs; + struct list_headevents; + wait_queue_head_t poll_wait; +}; + +struct ucma_context { + int id; + wait_queue_head_t wait; + atomic_tref; + int events_reported; + int backlog; + + struct ucma_file*file; + struct rdma_cm_id *cm_id; + __u64 uid; + + struct list_headevents;/* list of pending events. */ + struct list_headfile_list; /* member in file ctx list */ +}; + +struct ucma_event { + struct ucma_context *ctx; + struct list_headfile_list; /* member in file event list */ + struct list_headctx_list; /* member in ctx event list */ + struct rdma_cm_id *cm_id; + struct rdma_ucm_event_resp resp; +}; + +static DEFINE_MUTEX(ctx_mutex); +static DEFINE_IDR(ctx_idr); + +static struct ucma_context* ucma_get_ctx(struct ucma_file *file, int id) +{ + struct ucma_context *ctx; + + mutex_lock(&ctx_mutex); + ctx = idr_find(&ctx_idr, id); + if (!ctx) + ctx = ERR_PTR(-ENOENT); + else if (ctx->file != file) + ctx = ERR_PTR(-EINVAL); + else + atomic_inc(&ctx->ref); + mutex_unlock(&ctx_mutex); + + return ctx; +} + +static void ucma_put_ctx(struct ucma_context *ctx) +{ + if (atomic_dec_and_test(&ctx->ref)) + wake_up(&ctx->wait); +} + +static void ucma_cleanup_events(struct ucma_context *ctx) +{ + struct ucma_event *uevent; + + mutex_lock(&ctx->file
[PATCH 4/5] Infiniband: connection abstraction
The following patch implements a kernel mode connection management agent over Infiniband that connects based on IP addresses. The agent defines a generic RDMA connection abstraction to support clients wanting to connect over different RDMA devices. It also handles RDMA device hotplug events on behalf of clients. Signed-off-by: Sean Hefty <[EMAIL PROTECTED]> --- diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/drivers/infiniband/core/cma.c linux-2.6.ib/drivers/infiniband/core/cma.c --- linux-2.6.git/drivers/infiniband/core/cma.c 1969-12-31 16:00:00.0 -0800 +++ linux-2.6.ib/drivers/infiniband/core/cma.c 2006-01-16 16:17:34.0 -0800 @@ -0,0 +1,1639 @@ +/* + * Copyright (c) 2005 Voltaire Inc. All rights reserved. + * Copyright (c) 2002-2005, Network Appliance, Inc. All rights reserved. + * Copyright (c) 1999-2005, Mellanox Technologies, Inc. All rights reserved. + * Copyright (c) 2005 Intel Corporation. All rights reserved. + * + * This Software is licensed under one of the following licenses: + * + * 1) under the terms of the "Common Public License 1.0" a copy of which is + *available from the Open Source Initiative, see + *http://www.opensource.org/licenses/cpl.php. + * + * 2) under the terms of the "The BSD License" a copy of which is + *available from the Open Source Initiative, see + *http://www.opensource.org/licenses/bsd-license.php. + * + * 3) under the terms of the "GNU General Public License (GPL) Version 2" a + *copy of which is available from the Open Source Initiative, see + *http://www.opensource.org/licenses/gpl-license.php. + * + * Licensee has the right to choose one of the above licenses. + * + * Redistributions of source code must retain the above copyright + * notice and one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, one of the license notices in the documentation + * and/or other materials provided with the distribution. + * + */ +#include +#include +#include +#include +#include +#include +#include + +MODULE_AUTHOR("Sean Hefty"); +MODULE_DESCRIPTION("Generic RDMA CM Agent"); +MODULE_LICENSE("Dual BSD/GPL"); + +#define CMA_CM_RESPONSE_TIMEOUT 20 +#define CMA_MAX_CM_RETRIES 3 + +static void cma_add_one(struct ib_device *device); +static void cma_remove_one(struct ib_device *device); + +static struct ib_client cma_client = { + .name = "cma", + .add= cma_add_one, + .remove = cma_remove_one +}; + +static LIST_HEAD(dev_list); +static LIST_HEAD(listen_any_list); +static DEFINE_MUTEX(lock); + +struct cma_device { + struct list_headlist; + struct ib_device*device; + __be64 node_guid; + wait_queue_head_t wait; + atomic_trefcount; + struct list_headid_list; +}; + +enum cma_state { + CMA_IDLE, + CMA_ADDR_QUERY, + CMA_ADDR_RESOLVED, + CMA_ROUTE_QUERY, + CMA_ROUTE_RESOLVED, + CMA_CONNECT, + CMA_ADDR_BOUND, + CMA_LISTEN, + CMA_DEVICE_REMOVAL, + CMA_DESTROYING +}; + +/* + * Device removal can occur at anytime, so we need extra handling to + * serialize notifying the user of device removal with other callbacks. + * We do this by disabling removal notification while a callback is in process, + * and reporting it after the callback completes. + */ +struct rdma_id_private { + struct rdma_cm_id id; + + struct list_headlist; + struct list_headlisten_list; + struct cma_device *cma_dev; + + enum cma_state state; + spinlock_t lock; + wait_queue_head_t wait; + atomic_trefcount; + wait_queue_head_t wait_remove; + atomic_tdev_remove; + + int backlog; + int timeout_ms; + struct ib_sa_query *query; + int query_id; + struct ib_cm_id *cm_id; + + u32 seq_num; + u32 qp_num; + enum ib_qp_type qp_type; + u8 srq; +}; + +struct cma_work { + struct work_struct work; + struct rdma_id_private *id; +}; + +union cma_ip_addr { + struct in6_addr ip6; + struct { + __u32 pad[3]; + __u32 addr; + } ip4; +}; + +struct cma_hdr { + u8 cma_version; + u8 ip_version; /* IP version: 7:4 */ + __u16 port; + union cma_ip_addr src_addr; + union cma_ip_addr dst_addr; +}; + +struct sdp_hh { + u8 sdp_version; + u8 ip_version; /* IP version: 7:4 */ + u8 sdp_specific1[10]; + __u16 port; + __u16 sdp_specific2; + union cma_ip_addr src_addr; + union cma_ip_addr dst_addr; +};
[PATCH 3/5] Infiniband: connection abstraction
The following provides an address translation service that maps IP addresses to Infiniband addresses (GIDs) using IPoIB. This patch exports ip_dev_find() to locate a net_device given an IP address. Signed-off-by: Sean Hefty <[EMAIL PROTECTED]> --- diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/drivers/infiniband/core/addr.c linux-2.6.ib/drivers/infiniband/core/addr.c --- linux-2.6.git/drivers/infiniband/core/addr.c1969-12-31 16:00:00.0 -0800 +++ linux-2.6.ib/drivers/infiniband/core/addr.c 2006-01-16 16:14:24.0 -0800 @@ -0,0 +1,356 @@ +/* + * Copyright (c) 2005 Voltaire Inc. All rights reserved. + * Copyright (c) 2002-2005, Network Appliance, Inc. All rights reserved. + * Copyright (c) 1999-2005, Mellanox Technologies, Inc. All rights reserved. + * Copyright (c) 2005 Intel Corporation. All rights reserved. + * + * This Software is licensed under one of the following licenses: + * + * 1) under the terms of the "Common Public License 1.0" a copy of which is + *available from the Open Source Initiative, see + *http://www.opensource.org/licenses/cpl.php. + * + * 2) under the terms of the "The BSD License" a copy of which is + *available from the Open Source Initiative, see + *http://www.opensource.org/licenses/bsd-license.php. + * + * 3) under the terms of the "GNU General Public License (GPL) Version 2" a + *copy of which is available from the Open Source Initiative, see + *http://www.opensource.org/licenses/gpl-license.php. + * + * Licensee has the right to choose one of the above licenses. + * + * Redistributions of source code must retain the above copyright + * notice and one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, one of the license notices in the documentation + * and/or other materials provided with the distribution. + */ +#include +#include +#include +#include +#include +#include + +MODULE_AUTHOR("Sean Hefty"); +MODULE_DESCRIPTION("IB Address Translation"); +MODULE_LICENSE("Dual BSD/GPL"); + +struct addr_req { + struct list_head list; + struct sockaddr src_addr; + struct sockaddr dst_addr; + struct rdma_dev_addr *addr; + void *context; + void (*callback)(int status, struct sockaddr *src_addr, +struct rdma_dev_addr *addr, void *context); + unsigned long timeout; + int status; +}; + +static void process_req(void *data); + +static DEFINE_MUTEX(lock); +static LIST_HEAD(req_list); +static DECLARE_WORK(work, process_req, NULL); +struct workqueue_struct *rdma_wq; +EXPORT_SYMBOL(rdma_wq); + +static int copy_addr(struct rdma_dev_addr *dev_addr, struct net_device *dev, +unsigned char *dst_dev_addr) +{ + switch (dev->type) { + case ARPHRD_INFINIBAND: + dev_addr->dev_type = IB_NODE_CA; + break; + default: + return -EADDRNOTAVAIL; + } + + memcpy(dev_addr->src_dev_addr, dev->dev_addr, MAX_ADDR_LEN); + memcpy(dev_addr->broadcast, dev->broadcast, MAX_ADDR_LEN); + if (dst_dev_addr) + memcpy(dev_addr->dst_dev_addr, dst_dev_addr, MAX_ADDR_LEN); + return 0; +} + +int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr) +{ + struct net_device *dev; + u32 ip = ((struct sockaddr_in *) addr)->sin_addr.s_addr; + int ret; + + dev = ip_dev_find(ip); + if (!dev) + return -EADDRNOTAVAIL; + + ret = copy_addr(dev_addr, dev, NULL); + dev_put(dev); + return ret; +} +EXPORT_SYMBOL(rdma_translate_ip); + +static void set_timeout(unsigned long time) +{ + unsigned long delay; + + cancel_delayed_work(&work); + + delay = time - jiffies; + if ((long)delay <= 0) + delay = 1; + + queue_delayed_work(rdma_wq, &work, delay); +} + +static void queue_req(struct addr_req *req) +{ + struct addr_req *temp_req; + + mutex_lock(&lock); + list_for_each_entry_reverse(temp_req, &req_list, list) { + if (time_after(req->timeout, temp_req->timeout)) + break; + } + + list_add(&req->list, &temp_req->list); + + if (req_list.next == &req->list) + set_timeout(req->timeout); + mutex_unlock(&lock); +} + +static void addr_send_arp(struct sockaddr_in *dst_in) +{ + struct rtable *rt; + struct flowi fl; + u32 dst_ip = dst_in->sin_addr.s_addr; + + memset(&fl, 0, sizeof fl); + fl.nl_u.ip4_u.daddr = dst_ip; + if (ip_route_output_key(&rt, &fl)) + return; + + arp_send(ARPOP_REQUEST, ETH_P_ARP, rt->rt_gateway, rt->idev->dev, +rt->rt_src, NULL, rt->
[PATCH 2/5] Infiniband: connection abstraction
The following patch extends matching connection requests to listens in the Infiniband CM to include private data. Signed-off-by: Sean Hefty <[EMAIL PROTECTED]> --- diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/drivers/infiniband/core/cm.c linux-2.6.ib/drivers/infiniband/core/cm.c --- linux-2.6.git/drivers/infiniband/core/cm.c 2006-01-16 10:25:26.0 -0800 +++ linux-2.6.ib/drivers/infiniband/core/cm.c 2006-01-16 16:03:35.0 -0800 @@ -32,7 +32,7 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: cm.c 2821 2005-07-08 17:07:28Z sean.hefty $ + * $Id: cm.c 4311 2005-12-05 18:42:01Z sean.hefty $ */ #include #include @@ -130,6 +130,7 @@ struct cm_id_private { /* todo: use alternate port on send failure */ struct cm_av av; struct cm_av alt_av; + struct ib_cm_compare_data *compare_data; void *private_data; __be64 tid; @@ -355,6 +356,41 @@ static struct cm_id_private * cm_acquire return cm_id_priv; } +static void cm_mask_copy(u8 *dst, u8 *src, u8 *mask) +{ + int i; + + for (i = 0; i < IB_CM_COMPARE_SIZE / sizeof(unsigned long); i++) + ((unsigned long *) dst)[i] = ((unsigned long *) src)[i] & +((unsigned long *) mask)[i]; +} + +static int cm_compare_data(struct ib_cm_compare_data *src_data, + struct ib_cm_compare_data *dst_data) +{ + u8 src[IB_CM_COMPARE_SIZE]; + u8 dst[IB_CM_COMPARE_SIZE]; + + if (!src_data || !dst_data) + return 0; + + cm_mask_copy(src, src_data->data, dst_data->mask); + cm_mask_copy(dst, dst_data->data, src_data->mask); + return memcmp(src, dst, IB_CM_COMPARE_SIZE); +} + +static int cm_compare_private_data(u8 *private_data, + struct ib_cm_compare_data *dst_data) +{ + u8 src[IB_CM_COMPARE_SIZE]; + + if (!dst_data) + return 0; + + cm_mask_copy(src, private_data, dst_data->mask); + return memcmp(src, dst_data->data, IB_CM_COMPARE_SIZE); +} + static struct cm_id_private * cm_insert_listen(struct cm_id_private *cm_id_priv) { struct rb_node **link = &cm.listen_service_table.rb_node; @@ -362,14 +397,18 @@ static struct cm_id_private * cm_insert_ struct cm_id_private *cur_cm_id_priv; __be64 service_id = cm_id_priv->id.service_id; __be64 service_mask = cm_id_priv->id.service_mask; + int data_cmp; while (*link) { parent = *link; cur_cm_id_priv = rb_entry(parent, struct cm_id_private, service_node); + data_cmp = cm_compare_data(cm_id_priv->compare_data, + cur_cm_id_priv->compare_data); if ((cur_cm_id_priv->id.service_mask & service_id) == (service_mask & cur_cm_id_priv->id.service_id) && - (cm_id_priv->id.device == cur_cm_id_priv->id.device)) + (cm_id_priv->id.device == cur_cm_id_priv->id.device) && + !data_cmp) return cur_cm_id_priv; if (cm_id_priv->id.device < cur_cm_id_priv->id.device) @@ -378,6 +417,10 @@ static struct cm_id_private * cm_insert_ link = &(*link)->rb_right; else if (service_id < cur_cm_id_priv->id.service_id) link = &(*link)->rb_left; + else if (service_id > cur_cm_id_priv->id.service_id) + link = &(*link)->rb_right; + else if (data_cmp < 0) + link = &(*link)->rb_left; else link = &(*link)->rb_right; } @@ -387,16 +430,20 @@ static struct cm_id_private * cm_insert_ } static struct cm_id_private * cm_find_listen(struct ib_device *device, -__be64 service_id) +__be64 service_id, +u8 *private_data) { struct rb_node *node = cm.listen_service_table.rb_node; struct cm_id_private *cm_id_priv; + int data_cmp; while (node) { cm_id_priv = rb_entry(node, struct cm_id_private, service_node); + data_cmp = cm_compare_private_data(private_data, + cm_id_priv->compare_data); if ((cm_id_priv->id.service_mask & service_id) == cm_id_priv->id.service_id && - (cm_id_priv->id.device == device)) + (cm_id_priv->id.device == device) && !data_cmp)
[PATCH 1/5] Infiniband: connection abstraction
The following patch provides common handling for marshalling data between Userspace clients and kernel mode Infiniband drivers. Signed-off-by: Sean Hefty <[EMAIL PROTECTED]> --- diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/drivers/infiniband/core/Makefile linux-2.6.ib/drivers/infiniband/core/Makefile --- linux-2.6.git/drivers/infiniband/core/Makefile 2006-01-16 10:25:27.0 -0800 +++ linux-2.6.ib/drivers/infiniband/core/Makefile 2006-01-16 15:34:15.0 -0800 @@ -16,4 +16,5 @@ ib_umad-y := user_mad.o ib_ucm-y :=ucm.o -ib_uverbs-y := uverbs_main.o uverbs_cmd.o uverbs_mem.o +ib_uverbs-y := uverbs_main.o uverbs_cmd.o uverbs_mem.o \ + uverbs_marshall.o diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/drivers/infiniband/core/ucm.c linux-2.6.ib/drivers/infiniband/core/ucm.c --- linux-2.6.git/drivers/infiniband/core/ucm.c 2006-01-16 10:25:26.0 -0800 +++ linux-2.6.ib/drivers/infiniband/core/ucm.c 2006-01-16 15:34:15.0 -0800 @@ -30,7 +30,7 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: ucm.c 2594 2005-06-13 19:46:02Z libor $ + * $Id: ucm.c 4311 2005-12-05 18:42:01Z sean.hefty $ */ #include #include @@ -48,6 +48,7 @@ #include #include +#include MODULE_AUTHOR("Libor Michalek"); MODULE_DESCRIPTION("InfiniBand userspace Connection Manager access"); @@ -203,36 +204,6 @@ error: return NULL; } -static void ib_ucm_event_path_get(struct ib_ucm_path_rec *upath, - struct ib_sa_path_rec *kpath) -{ - if (!kpath || !upath) - return; - - memcpy(upath->dgid, kpath->dgid.raw, sizeof *upath->dgid); - memcpy(upath->sgid, kpath->sgid.raw, sizeof *upath->sgid); - - upath->dlid = kpath->dlid; - upath->slid = kpath->slid; - upath->raw_traffic = kpath->raw_traffic; - upath->flow_label = kpath->flow_label; - upath->hop_limit= kpath->hop_limit; - upath->traffic_class= kpath->traffic_class; - upath->reversible = kpath->reversible; - upath->numb_path= kpath->numb_path; - upath->pkey = kpath->pkey; - upath->sl = kpath->sl; - upath->mtu_selector = kpath->mtu_selector; - upath->mtu = kpath->mtu; - upath->rate_selector= kpath->rate_selector; - upath->rate = kpath->rate; - upath->packet_life_time = kpath->packet_life_time; - upath->preference = kpath->preference; - - upath->packet_life_time_selector = - kpath->packet_life_time_selector; -} - static void ib_ucm_event_req_get(struct ib_ucm_req_event_resp *ureq, struct ib_cm_req_event_param *kreq) { @@ -251,8 +222,10 @@ static void ib_ucm_event_req_get(struct ureq->srq= kreq->srq; ureq->port = kreq->port; - ib_ucm_event_path_get(&ureq->primary_path, kreq->primary_path); - ib_ucm_event_path_get(&ureq->alternate_path, kreq->alternate_path); + ib_copy_path_rec_to_user(&ureq->primary_path, kreq->primary_path); + if (kreq->alternate_path) + ib_copy_path_rec_to_user(&ureq->alternate_path, +kreq->alternate_path); } static void ib_ucm_event_rep_get(struct ib_ucm_rep_event_resp *urep, @@ -322,8 +295,8 @@ static int ib_ucm_event_process(struct i info = evt->param.rej_rcvd.ari; break; case IB_CM_LAP_RECEIVED: - ib_ucm_event_path_get(&uvt->resp.u.lap_resp.path, - evt->param.lap_rcvd.alternate_path); + ib_copy_path_rec_to_user(&uvt->resp.u.lap_resp.path, +evt->param.lap_rcvd.alternate_path); uvt->data_len = IB_CM_LAP_PRIVATE_DATA_SIZE; uvt->resp.present = IB_UCM_PRES_ALTERNATE; break; @@ -635,65 +608,11 @@ static ssize_t ib_ucm_attr_id(struct ib_ return result; } -static void ib_ucm_copy_ah_attr(struct ib_ucm_ah_attr *dest_attr, - struct ib_ah_attr *src_attr) -{ - memcpy(dest_attr->grh_dgid, src_attr->grh.dgid.raw, - sizeof src_attr->grh.dgid); - dest_attr->grh_flow_label = src_attr->grh.flow_label; - dest_attr->grh_sgid_index = src_attr->grh.sgid_index; - dest_attr->grh_hop_limit = src_attr->grh.hop_limit; - dest_attr->grh_
[PATCH 0/5] Infiniband: connection abstraction
Here's an updated version of these patches based on feedback. (The license did not change and continues to match that of the other Infiniband code.) Please consider for inclusion in 2.6.17. The following set of patches defines a connection abstraction for Infiniband and other RDMA devices, and serves several purposes: * It implements a connection protocol over Infiniband based on IP addressing. This greatly simplifies clients wishing to establish connections over Infiniband. * It defines a connection abstraction that works over multiple RDMA devices. The submitted implementation targets Infiniband, but has been tested over other RDMA devices as well. * It handles RDMA device insertion and removal on behalf of its clients. The changes have been broken into 5 separate patches. The basic purpose of each patch is: 1. Provide common handling for marshalling data between userspace clients and kernel mode Infiniband drivers. 2. Extend the Infiniband CM to include private data comparisons as part of its connection request matching process. 3. Provide an address translation service that maps IP addresses to Infiniband addresses (GIDs). This patch touches outside of the Infiniband core, so I'm including the netdev mailing list. 4. Implement the kernel mode RDMA connection management agent. 5. Implement the userspace RDMA connection management agent kernel support module. Please copy the openib-general mailing list on any replies. Thanks, Sean - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [openib-general] [PATCH 5/5] [RFC] Infiniband: connection abstraction
Roland Dreier wrote: > + UCMA_MAX_BACKLOG= 128 Is there any reason that we might want to make this a tunable? Maybe as a module parameter that's writable in sysfs... There's no reason not to make this tunable. - Sean - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [openib-general] RE: [PATCH 2/5] [RFC] Infiniband: connection abstraction
Grant Grundler wrote: Is this code going to get invoked very often? In practice, it would be invoked when matching any listen requests originating from the CMA (RDMA connection abstraction). hrm..I'm not sure how to translate your answer into a workload. e.g. which netperf or netpipe test would excercise this alot? Or would it take something like MPI or specweb/ttcp? The code will be invoked at least once for every connection that is established. e.g something like: for (i = 0; i < IB_CM_PRIVATE_DATA_COMPARE_SIZE/sizeof(unsigned long); i++) ((unsigned long *)dst)[i] = ((unsigned long *)src)[i] & ((unsigned long *)mask)[i]; Yes - something like this should work. Thanks. Do you need a patch? I can submit one but it will be untested. I will incorporate the change with the next set of updates. Someone else pointed out that I'd need to make sure that there won't be any alignment issues. - Sean - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [openib-general] RE: [PATCH 2/5] [RFC] Infiniband: connection abstraction
Grant Grundler wrote: +static void cm_mask_compare_data(u8 *dst, u8 *src, u8 *mask) +{ + int i; + + for (i = 0; i < IB_CM_PRIVATE_DATA_COMPARE_SIZE; i++) + dst[i] = src[i] & mask[i]; +} Is this code going to get invoked very often? In practice, it would be invoked when matching any listen requests originating from the CMA (RDMA connection abstraction). If so, can the mask operation use a "native" size since IB_CM_PRIVATE_DATA_COMPARE_SIZE is hard coded to 64 byte? e.g something like: for (i = 0; i < IB_CM_PRIVATE_DATA_COMPARE_SIZE/sizeof(unsigned long); i++) ((unsigned long *)dst)[i] = ((unsigned long *)src)[i] & ((unsigned long *)mask)[i]; Yes - something like this should work. Thanks. - Sean - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 2/5] [RFC] Infiniband: connection abstraction
>> +static void cm_mask_compare_data(u8 *dst, u8 *src, u8 *mask) > >static void cm_mask_compare_data(u8 *dst, const u8 *src, u8 *mask) > >but I would rename it to cm_mask_copy since it doesn't really do a compare. I'll change this. The function is masking the "data to use in the comparison", but I can see the confusion. >> +static int cm_compare_data(struct ib_cm_private_data_compare *src_data, >> + struct ib_cm_private_data_compare *dst_data) > >static int cm_compare_data(const struct ib_cm_private_data_compare *src, > cosnt struct ib_cm_private_data_compare *dst) >Your data type names are getting too long I'll fix. Thanks for the comments. - Sean - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 5/5] [RFC] Infiniband: connection abstraction
This patch adds the kernel component to support the userspace Infiniband/RDMA connection agent library. Signed-off-by: Sean Hefty <[EMAIL PROTECTED]> --- diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/drivers/infiniband/core/Makefile linux-2.6.ib/drivers/infiniband/core/Makefile --- linux-2.6.git/drivers/infiniband/core/Makefile 2006-01-16 16:58:58.0 -0800 +++ linux-2.6.ib/drivers/infiniband/core/Makefile 2006-01-16 16:55:25.0 -0800 @@ -1,5 +1,5 @@ obj-$(CONFIG_INFINIBAND) +=ib_core.o ib_mad.o ib_sa.o \ - ib_cm.o ib_addr.o rdma_cm.o + ib_cm.o ib_addr.o rdma_cm.o rdma_ucm.o obj-$(CONFIG_INFINIBAND_USER_MAD) += ib_umad.o obj-$(CONFIG_INFINIBAND_USER_ACCESS) +=ib_uverbs.o ib_ucm.o @@ -14,6 +14,8 @@ ib_cm-y :=cm.o rdma_cm-y := cma.o +rdma_ucm-y := ucma.o + ib_addr-y := addr.o ib_umad-y := user_mad.o diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/drivers/infiniband/core/ucma.c linux-2.6.ib/drivers/infiniband/core/ucma.c --- linux-2.6.git/drivers/infiniband/core/ucma.c1969-12-31 16:00:00.0 -0800 +++ linux-2.6.ib/drivers/infiniband/core/ucma.c 2006-01-16 16:54:31.0 -0800 @@ -0,0 +1,788 @@ +/* + * Copyright (c) 2005 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + * copyright notice, this list of conditions and the following + * disclaimer. + * + * - Redistributions in binary form must reproduce the above + * copyright notice, this list of conditions and the following + * disclaimer in the documentation and/or other materials + * provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#include +#include +#include +#include +#include + +#include +#include +#include + +MODULE_AUTHOR("Sean Hefty"); +MODULE_DESCRIPTION("RDMA Userspace Connection Manager Access"); +MODULE_LICENSE("Dual BSD/GPL"); + +enum { + UCMA_MAX_BACKLOG= 128 +}; + +struct ucma_file { + struct semaphoremutex; + struct file *filp; + struct list_headctxs; + struct list_headevents; + wait_queue_head_t poll_wait; +}; + +struct ucma_context { + int id; + wait_queue_head_t wait; + atomic_tref; + int events_reported; + int backlog; + + struct ucma_file*file; + struct rdma_cm_id *cm_id; + __u64 uid; + + struct list_headevents;/* list of pending events. */ + struct list_headfile_list; /* member in file ctx list */ +}; + +struct ucma_event { + struct ucma_context *ctx; + struct list_headfile_list; /* member in file event list */ + struct list_headctx_list; /* member in ctx event list */ + struct rdma_cm_id *cm_id; + struct rdma_ucm_event_resp resp; +}; + +static DECLARE_MUTEX(ctx_mutex); +static DEFINE_IDR(ctx_idr); + +static struct ucma_context* ucma_get_ctx(struct ucma_file *file, int id) +{ + struct ucma_context *ctx; + + down(&ctx_mutex); + ctx = idr_find(&ctx_idr, id); + if (!ctx) + ctx = ERR_PTR(-ENOENT); + else if (ctx->file != file) + ctx = ERR_PTR(-EINVAL); + else + atomic_inc(&ctx->ref); + up(&ctx_mutex); + + return ctx; +} + +static void ucma_put_ctx(struct ucma_context *ctx) +{ + if (atomic_dec_and_test(&ctx->ref)) + wake_up(&ctx->wait); +} + +static void ucma_cleanup_events(struct ucma_context *ctx) +{ + struct ucma_event *uevent; + + down(&ctx->file->mutex); + lis
[PATCH 4/5] [RFC] Infiniband: connection abstraction
The following patch implements a kernel mode connection management agent over Infiniband that connects based on IP addresses. The agent defines a generic RDMA connection abstraction to support clients wanting to connect over different RDMA devices. It also handles RDMA device hotplug events on behalf of clients. - Signed-off-by: Sean Hefty <[EMAIL PROTECTED]> --- diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/drivers/infiniband/core/cma.c linux-2.6.ib/drivers/infiniband/core/cma.c --- linux-2.6.git/drivers/infiniband/core/cma.c 1969-12-31 16:00:00.0 -0800 +++ linux-2.6.ib/drivers/infiniband/core/cma.c 2006-01-16 16:17:34.0 -0800 @@ -0,0 +1,1639 @@ +/* + * Copyright (c) 2005 Voltaire Inc. All rights reserved. + * Copyright (c) 2002-2005, Network Appliance, Inc. All rights reserved. + * Copyright (c) 1999-2005, Mellanox Technologies, Inc. All rights reserved. + * Copyright (c) 2005 Intel Corporation. All rights reserved. + * + * This Software is licensed under one of the following licenses: + * + * 1) under the terms of the "Common Public License 1.0" a copy of which is + *available from the Open Source Initiative, see + *http://www.opensource.org/licenses/cpl.php. + * + * 2) under the terms of the "The BSD License" a copy of which is + *available from the Open Source Initiative, see + *http://www.opensource.org/licenses/bsd-license.php. + * + * 3) under the terms of the "GNU General Public License (GPL) Version 2" a + *copy of which is available from the Open Source Initiative, see + *http://www.opensource.org/licenses/gpl-license.php. + * + * Licensee has the right to choose one of the above licenses. + * + * Redistributions of source code must retain the above copyright + * notice and one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, one of the license notices in the documentation + * and/or other materials provided with the distribution. + * + */ +#include +#include +#include +#include +#include +#include +#include + +MODULE_AUTHOR("Guy German"); +MODULE_DESCRIPTION("Generic RDMA CM Agent"); +MODULE_LICENSE("Dual BSD/GPL"); + +#define CMA_CM_RESPONSE_TIMEOUT 20 +#define CMA_MAX_CM_RETRIES 3 + +static void cma_add_one(struct ib_device *device); +static void cma_remove_one(struct ib_device *device); + +static struct ib_client cma_client = { + .name = "cma", + .add= cma_add_one, + .remove = cma_remove_one +}; + +static LIST_HEAD(dev_list); +static LIST_HEAD(listen_any_list); +static DECLARE_MUTEX(mutex); + +struct cma_device { + struct list_headlist; + struct ib_device*device; + __be64 node_guid; + wait_queue_head_t wait; + atomic_trefcount; + struct list_headid_list; +}; + +enum cma_state { + CMA_IDLE, + CMA_ADDR_QUERY, + CMA_ADDR_RESOLVED, + CMA_ROUTE_QUERY, + CMA_ROUTE_RESOLVED, + CMA_CONNECT, + CMA_ADDR_BOUND, + CMA_LISTEN, + CMA_DEVICE_REMOVAL, + CMA_DESTROYING +}; + +/* + * Device removal can occur at anytime, so we need extra handling to + * serialize notifying the user of device removal with other callbacks. + * We do this by disabling removal notification while a callback is in process, + * and reporting it after the callback completes. + */ +struct rdma_id_private { + struct rdma_cm_id id; + + struct list_headlist; + struct list_headlisten_list; + struct cma_device *cma_dev; + + enum cma_state state; + spinlock_t lock; + wait_queue_head_t wait; + atomic_trefcount; + wait_queue_head_t wait_remove; + atomic_tdev_remove; + + int backlog; + int timeout_ms; + struct ib_sa_query *query; + int query_id; + struct ib_cm_id *cm_id; + + u32 seq_num; + u32 qp_num; + enum ib_qp_type qp_type; + u8 srq; +}; + +struct cma_work { + struct work_struct work; + struct rdma_id_private *id; +}; + +union cma_ip_addr { + struct in6_addr ip6; + struct { + __u32 pad[3]; + __u32 addr; + } ip4; +}; + +struct cma_hdr { + u8 cma_version; + u8 ip_version; /* IP version: 7:4 */ + __u16 port; + union cma_ip_addr src_addr; + union cma_ip_addr dst_addr; +}; + +struct sdp_hh { + u8 sdp_version; + u8 ip_version; /* IP version: 7:4 */ + u8 sdp_specific1[10]; + __u16 port; + __u16 sdp_specific2; + union cma_ip_addr src_addr; + union cma_ip_addr dst_addr; +};
[PATCH 3/5] [RFC] Infiniband: connection abstraction
The following provides an address translation service that maps IP addresses to Infiniband addresses (GIDs) using IPoIB. Signed-off-by: Sean Hefty <[EMAIL PROTECTED]> --- diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/drivers/infiniband/core/addr.c linux-2.6.ib/drivers/infiniband/core/addr.c --- linux-2.6.git/drivers/infiniband/core/addr.c1969-12-31 16:00:00.0 -0800 +++ linux-2.6.ib/drivers/infiniband/core/addr.c 2006-01-16 16:14:24.0 -0800 @@ -0,0 +1,356 @@ +/* + * Copyright (c) 2005 Voltaire Inc. All rights reserved. + * Copyright (c) 2002-2005, Network Appliance, Inc. All rights reserved. + * Copyright (c) 1999-2005, Mellanox Technologies, Inc. All rights reserved. + * Copyright (c) 2005 Intel Corporation. All rights reserved. + * + * This Software is licensed under one of the following licenses: + * + * 1) under the terms of the "Common Public License 1.0" a copy of which is + *available from the Open Source Initiative, see + *http://www.opensource.org/licenses/cpl.php. + * + * 2) under the terms of the "The BSD License" a copy of which is + *available from the Open Source Initiative, see + *http://www.opensource.org/licenses/bsd-license.php. + * + * 3) under the terms of the "GNU General Public License (GPL) Version 2" a + *copy of which is available from the Open Source Initiative, see + *http://www.opensource.org/licenses/gpl-license.php. + * + * Licensee has the right to choose one of the above licenses. + * + * Redistributions of source code must retain the above copyright + * notice and one of the license notices. + * + * Redistributions in binary form must reproduce both the above copyright + * notice, one of the license notices in the documentation + * and/or other materials provided with the distribution. + */ +#include +#include +#include +#include +#include +#include + +MODULE_AUTHOR("Sean Hefty"); +MODULE_DESCRIPTION("IB Address Translation"); +MODULE_LICENSE("Dual BSD/GPL"); + +struct addr_req { + struct list_head list; + struct sockaddr src_addr; + struct sockaddr dst_addr; + struct rdma_dev_addr *addr; + void *context; + void (*callback)(int status, struct sockaddr *src_addr, +struct rdma_dev_addr *addr, void *context); + unsigned long timeout; + int status; +}; + +static void process_req(void *data); + +static DECLARE_MUTEX(mutex); +static LIST_HEAD(req_list); +static DECLARE_WORK(work, process_req, NULL); +struct workqueue_struct *rdma_wq; +EXPORT_SYMBOL(rdma_wq); + +static int copy_addr(struct rdma_dev_addr *dev_addr, struct net_device *dev, +unsigned char *dst_dev_addr) +{ + switch (dev->type) { + case ARPHRD_INFINIBAND: + dev_addr->dev_type = IB_NODE_CA; + break; + default: + return -EADDRNOTAVAIL; + } + + memcpy(dev_addr->src_dev_addr, dev->dev_addr, MAX_ADDR_LEN); + memcpy(dev_addr->broadcast, dev->broadcast, MAX_ADDR_LEN); + if (dst_dev_addr) + memcpy(dev_addr->dst_dev_addr, dst_dev_addr, MAX_ADDR_LEN); + return 0; +} + +int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr) +{ + struct net_device *dev; + u32 ip = ((struct sockaddr_in *) addr)->sin_addr.s_addr; + int ret; + + dev = ip_dev_find(ip); + if (!dev) + return -EADDRNOTAVAIL; + + ret = copy_addr(dev_addr, dev, NULL); + dev_put(dev); + return ret; +} +EXPORT_SYMBOL(rdma_translate_ip); + +static void set_timeout(unsigned long time) +{ + unsigned long delay; + + cancel_delayed_work(&work); + + delay = time - jiffies; + if ((long)delay <= 0) + delay = 1; + + queue_delayed_work(rdma_wq, &work, delay); +} + +static void queue_req(struct addr_req *req) +{ + struct addr_req *temp_req; + + down(&mutex); + list_for_each_entry_reverse(temp_req, &req_list, list) { + if (time_after(req->timeout, temp_req->timeout)) + break; + } + + list_add(&req->list, &temp_req->list); + + if (req_list.next == &req->list) + set_timeout(req->timeout); + up(&mutex); +} + +static void addr_send_arp(struct sockaddr_in *dst_in) +{ + struct rtable *rt; + struct flowi fl; + u32 dst_ip = dst_in->sin_addr.s_addr; + + memset(&fl, 0, sizeof fl); + fl.nl_u.ip4_u.daddr = dst_ip; + if (ip_route_output_key(&rt, &fl)) + return; + + arp_send(ARPOP_REQUEST, ETH_P_ARP, rt->rt_gateway, rt->idev->dev, +rt->rt_src, NULL, rt->idev->dev->dev_addr, NULL); + ip_rt_put(rt); +} + +static int addr_resolve_remote(struct sockadd
RE: [PATCH 2/5] [RFC] Infiniband: connection abstraction
The following patch extends matching connection requests to listens in the Infiniband CM to include private data. Signed-off-by: Sean Hefty <[EMAIL PROTECTED]> --- diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/drivers/infiniband/core/cm.c linux-2.6.ib/drivers/infiniband/core/cm.c --- linux-2.6.git/drivers/infiniband/core/cm.c 2006-01-16 10:25:26.0 -0800 +++ linux-2.6.ib/drivers/infiniband/core/cm.c 2006-01-16 16:03:35.0 -0800 @@ -32,7 +32,7 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: cm.c 2821 2005-07-08 17:07:28Z sean.hefty $ + * $Id: cm.c 4311 2005-12-05 18:42:01Z sean.hefty $ */ #include #include @@ -130,6 +130,7 @@ struct cm_id_private { /* todo: use alternate port on send failure */ struct cm_av av; struct cm_av alt_av; + struct ib_cm_private_data_compare *compare_data; void *private_data; __be64 tid; @@ -355,6 +356,40 @@ static struct cm_id_private * cm_acquire return cm_id_priv; } +static void cm_mask_compare_data(u8 *dst, u8 *src, u8 *mask) +{ + int i; + + for (i = 0; i < IB_CM_PRIVATE_DATA_COMPARE_SIZE; i++) + dst[i] = src[i] & mask[i]; +} + +static int cm_compare_data(struct ib_cm_private_data_compare *src_data, + struct ib_cm_private_data_compare *dst_data) +{ + u8 src[IB_CM_PRIVATE_DATA_COMPARE_SIZE]; + u8 dst[IB_CM_PRIVATE_DATA_COMPARE_SIZE]; + + if (!src_data || !dst_data) + return 0; + + cm_mask_compare_data(src, src_data->data, dst_data->mask); + cm_mask_compare_data(dst, dst_data->data, src_data->mask); + return memcmp(src, dst, IB_CM_PRIVATE_DATA_COMPARE_SIZE); +} + +static int cm_compare_private_data(u8 *private_data, + struct ib_cm_private_data_compare *dst_data) +{ + u8 src[IB_CM_PRIVATE_DATA_COMPARE_SIZE]; + + if (!dst_data) + return 0; + + cm_mask_compare_data(src, private_data, dst_data->mask); + return memcmp(src, dst_data->data, IB_CM_PRIVATE_DATA_COMPARE_SIZE); +} + static struct cm_id_private * cm_insert_listen(struct cm_id_private *cm_id_priv) { struct rb_node **link = &cm.listen_service_table.rb_node; @@ -362,14 +397,18 @@ static struct cm_id_private * cm_insert_ struct cm_id_private *cur_cm_id_priv; __be64 service_id = cm_id_priv->id.service_id; __be64 service_mask = cm_id_priv->id.service_mask; + int data_cmp; while (*link) { parent = *link; cur_cm_id_priv = rb_entry(parent, struct cm_id_private, service_node); + data_cmp = cm_compare_data(cm_id_priv->compare_data, + cur_cm_id_priv->compare_data); if ((cur_cm_id_priv->id.service_mask & service_id) == (service_mask & cur_cm_id_priv->id.service_id) && - (cm_id_priv->id.device == cur_cm_id_priv->id.device)) + (cm_id_priv->id.device == cur_cm_id_priv->id.device) && + !data_cmp) return cur_cm_id_priv; if (cm_id_priv->id.device < cur_cm_id_priv->id.device) @@ -378,6 +417,10 @@ static struct cm_id_private * cm_insert_ link = &(*link)->rb_right; else if (service_id < cur_cm_id_priv->id.service_id) link = &(*link)->rb_left; + else if (service_id > cur_cm_id_priv->id.service_id) + link = &(*link)->rb_right; + else if (data_cmp < 0) + link = &(*link)->rb_left; else link = &(*link)->rb_right; } @@ -387,16 +430,20 @@ static struct cm_id_private * cm_insert_ } static struct cm_id_private * cm_find_listen(struct ib_device *device, -__be64 service_id) +__be64 service_id, +u8 *private_data) { struct rb_node *node = cm.listen_service_table.rb_node; struct cm_id_private *cm_id_priv; + int data_cmp; while (node) { cm_id_priv = rb_entry(node, struct cm_id_private, service_node); + data_cmp = cm_compare_private_data(private_data, + cm_id_priv->compare_data); if ((cm_id_priv->id.service_mask & service_id) == cm_id_priv->id.service_id && - (cm_id_priv->id.device == device)) + (cm_id_priv->id.device == devic
[PATCH 1/5] [RFC] Infiniband: connection abstraction
The following patch provides common handling for marshalling data between userspace clients and kernel mode Infiniband drivers. Signed-off-by: Sean Hefty <[EMAIL PROTECTED]> --- diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/drivers/infiniband/core/Makefile linux-2.6.ib/drivers/infiniband/core/Makefile --- linux-2.6.git/drivers/infiniband/core/Makefile 2006-01-16 10:25:27.0 -0800 +++ linux-2.6.ib/drivers/infiniband/core/Makefile 2006-01-16 15:34:15.0 -0800 @@ -16,4 +16,5 @@ ib_umad-y := user_mad.o ib_ucm-y :=ucm.o -ib_uverbs-y := uverbs_main.o uverbs_cmd.o uverbs_mem.o +ib_uverbs-y := uverbs_main.o uverbs_cmd.o uverbs_mem.o \ + uverbs_marshall.o diff -uprN -X linux-2.6.git/Documentation/dontdiff linux-2.6.git/drivers/infiniband/core/ucm.c linux-2.6.ib/drivers/infiniband/core/ucm.c --- linux-2.6.git/drivers/infiniband/core/ucm.c 2006-01-16 10:25:26.0 -0800 +++ linux-2.6.ib/drivers/infiniband/core/ucm.c 2006-01-16 15:34:15.0 -0800 @@ -30,7 +30,7 @@ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * - * $Id: ucm.c 2594 2005-06-13 19:46:02Z libor $ + * $Id: ucm.c 4311 2005-12-05 18:42:01Z sean.hefty $ */ #include #include @@ -48,6 +48,7 @@ #include #include +#include MODULE_AUTHOR("Libor Michalek"); MODULE_DESCRIPTION("InfiniBand userspace Connection Manager access"); @@ -203,36 +204,6 @@ error: return NULL; } -static void ib_ucm_event_path_get(struct ib_ucm_path_rec *upath, - struct ib_sa_path_rec *kpath) -{ - if (!kpath || !upath) - return; - - memcpy(upath->dgid, kpath->dgid.raw, sizeof *upath->dgid); - memcpy(upath->sgid, kpath->sgid.raw, sizeof *upath->sgid); - - upath->dlid = kpath->dlid; - upath->slid = kpath->slid; - upath->raw_traffic = kpath->raw_traffic; - upath->flow_label = kpath->flow_label; - upath->hop_limit= kpath->hop_limit; - upath->traffic_class= kpath->traffic_class; - upath->reversible = kpath->reversible; - upath->numb_path= kpath->numb_path; - upath->pkey = kpath->pkey; - upath->sl = kpath->sl; - upath->mtu_selector = kpath->mtu_selector; - upath->mtu = kpath->mtu; - upath->rate_selector= kpath->rate_selector; - upath->rate = kpath->rate; - upath->packet_life_time = kpath->packet_life_time; - upath->preference = kpath->preference; - - upath->packet_life_time_selector = - kpath->packet_life_time_selector; -} - static void ib_ucm_event_req_get(struct ib_ucm_req_event_resp *ureq, struct ib_cm_req_event_param *kreq) { @@ -251,8 +222,10 @@ static void ib_ucm_event_req_get(struct ureq->srq= kreq->srq; ureq->port = kreq->port; - ib_ucm_event_path_get(&ureq->primary_path, kreq->primary_path); - ib_ucm_event_path_get(&ureq->alternate_path, kreq->alternate_path); + ib_copy_path_rec_to_user(&ureq->primary_path, kreq->primary_path); + if (kreq->alternate_path) + ib_copy_path_rec_to_user(&ureq->alternate_path, +kreq->alternate_path); } static void ib_ucm_event_rep_get(struct ib_ucm_rep_event_resp *urep, @@ -322,8 +295,8 @@ static int ib_ucm_event_process(struct i info = evt->param.rej_rcvd.ari; break; case IB_CM_LAP_RECEIVED: - ib_ucm_event_path_get(&uvt->resp.u.lap_resp.path, - evt->param.lap_rcvd.alternate_path); + ib_copy_path_rec_to_user(&uvt->resp.u.lap_resp.path, +evt->param.lap_rcvd.alternate_path); uvt->data_len = IB_CM_LAP_PRIVATE_DATA_SIZE; uvt->resp.present = IB_UCM_PRES_ALTERNATE; break; @@ -635,65 +608,11 @@ static ssize_t ib_ucm_attr_id(struct ib_ return result; } -static void ib_ucm_copy_ah_attr(struct ib_ucm_ah_attr *dest_attr, - struct ib_ah_attr *src_attr) -{ - memcpy(dest_attr->grh_dgid, src_attr->grh.dgid.raw, - sizeof src_attr->grh.dgid); - dest_attr->grh_flow_label = src_attr->grh.flow_label; - dest_attr->grh_sgid_index = src_attr->grh.sgid_index; - dest_attr->grh_hop_limit = src_attr->grh.hop_limit; - dest_attr->grh_
[PATCH 0/5] [RFC] Infiniband: connection abstraction
The following set of patches defines a connection abstraction for Infiniband and other RDMA devices, and serves several purposes: * It implements a connection protocol over Infiniband based on IP addressing. This greatly simplifies clients wishing to establish connections over Infiniband. * It defines a connection abstraction that works over multiple RDMA devices. The submitted implementation targets Infiniband, but has been tested over other RDMA devices as well. * It handles RDMA device insertion and removal on behalf of its clients. The changes have been broken into 5 separate patches. The basic purpose of each patch is: 1. Provide common handling for marshalling data between userspace clients and kernel mode Infiniband drivers. 2. Extend the Infiniband CM to include private data comparisons as part of its connection request matching process. 3. Provide an address translation service that maps IP addresses to Infiniband addresses (GIDs). This patch touches outside of the Infiniband core, so I'm including the netdev mailing list. 4. Implement the kernel mode RDMA connection management agent. 5. Implement the userspace RDMA connection management agent kernel support module. Please copy the openib-general mailing list on any replies. Thanks, Sean - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html