[PATCH v2] opensm/osmeventplugin: added couple of events to monitor SM
Hi Sasha,

I've added a couple of new events that allow an event plug-in to see what
the SM is doing, when it is sweeping, and when it updates dump files:

  OSM_EVENT_ID_L_SWEEP_STARTED
  OSM_EVENT_ID_L_SWEEP_DONE
  OSM_EVENT_ID_H_SWEEP_STARTED
  OSM_EVENT_ID_H_SWEEP_DONE
  OSM_EVENT_ID_REROUTE_DONE
  OSM_EVENT_ID_ENTERING_STANDBY
  OSM_EVENT_ID_SM_PORT_DOWN
  OSM_EVENT_ID_SA_DB_DUMPED

The last event is reported when the SA DB was actually dumped.
I'm thinking of a similar optimization for the guid2lid file - it doesn't
have to be dumped at the end of each heavy sweep, as many heavy sweeps
don't really happen because of nodes appearing/disappearing.

Signed-off-by: Yevgeny Kliteynik <klit...@dev.mellanox.co.il>
---
Changes from V1:
 - added reporting OSM_EVENT_ID_H_SWEEP_DONE event
 - rebased to latest master

 opensm/include/opensm/osm_event_plugin.h   |   10 +-
 opensm/opensm/osm_state_mgr.c              |   22 +-
 opensm/osmeventplugin/src/osmeventplugin.c |   24
 3 files changed, 54 insertions(+), 2 deletions(-)

diff --git a/opensm/include/opensm/osm_event_plugin.h b/opensm/include/opensm/osm_event_plugin.h
index 33d1920..f5a57d7 100644
--- a/opensm/include/opensm/osm_event_plugin.h
+++ b/opensm/include/opensm/osm_event_plugin.h
@@ -72,7 +72,15 @@ typedef enum {
 	OSM_EVENT_ID_PORT_SELECT,
 	OSM_EVENT_ID_TRAP,
 	OSM_EVENT_ID_SUBNET_UP,
-	OSM_EVENT_ID_MAX
+	OSM_EVENT_ID_MAX,
+	OSM_EVENT_ID_L_SWEEP_STARTED,
+	OSM_EVENT_ID_L_SWEEP_DONE,
+	OSM_EVENT_ID_H_SWEEP_STARTED,
+	OSM_EVENT_ID_H_SWEEP_DONE,
+	OSM_EVENT_ID_REROUTE_DONE,
+	OSM_EVENT_ID_ENTERING_STANDBY,
+	OSM_EVENT_ID_SM_PORT_DOWN,
+	OSM_EVENT_ID_SA_DB_DUMPED
 } osm_epi_event_id_t;
 
 typedef struct osm_epi_port_id {
diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c
index e43463f..d5dff14 100644
--- a/opensm/opensm/osm_state_mgr.c
+++ b/opensm/opensm/osm_state_mgr.c
@@ -1076,6 +1076,9 @@ static void do_sweep(osm_sm_t * sm)
 	    sm->p_subn->sm_state != IB_SMINFO_STATE_DISCOVERING)
 		return;
 
+	osm_opensm_report_event(sm->p_subn->p_osm,
+				OSM_EVENT_ID_L_SWEEP_STARTED, NULL);
+
 	if (sm->p_subn->coming_out_of_standby)
 		/*
 		 * Need to force re-write of sm_base_lid to all ports
@@ -,6 +1114,8 @@ static void do_sweep(osm_sm_t * sm)
 				osm_sa_db_file_dump(sm->p_subn->p_osm);
 			OSM_LOG_MSG_BOX(sm->p_log, OSM_LOG_VERBOSE,
 					"LIGHT SWEEP COMPLETE");
+			osm_opensm_report_event(sm->p_subn->p_osm,
+				OSM_EVENT_ID_L_SWEEP_DONE, NULL);
 			return;
 		}
 	}
@@ -1151,6 +1156,8 @@ static void do_sweep(osm_sm_t * sm)
 		if (!sm->p_subn->subnet_initialization_error) {
 			OSM_LOG_MSG_BOX(sm->p_log, OSM_LOG_VERBOSE,
 					"REROUTE COMPLETE");
+			osm_opensm_report_event(sm->p_subn->p_osm,
+				OSM_EVENT_ID_REROUTE_DONE, NULL);
 			return;
 		}
 	}
@@ -1158,6 +1165,9 @@ static void do_sweep(osm_sm_t * sm)
 	/* go to heavy sweep */
 repeat_discovery:
 
+	osm_opensm_report_event(sm->p_subn->p_osm,
+				OSM_EVENT_ID_H_SWEEP_STARTED, NULL);
+
 	/* First of all - unset all flags */
 	sm->p_subn->force_heavy_sweep = FALSE;
 	sm->p_subn->force_reroute = FALSE;
@@ -1185,6 +1195,8 @@ repeat_discovery:
 
 		/* Move to DISCOVERING state */
 		osm_sm_state_mgr_process(sm, OSM_SM_SIGNAL_DISCOVER);
+		osm_opensm_report_event(sm->p_subn->p_osm,
+				OSM_EVENT_ID_SM_PORT_DOWN, NULL);
 		return;
 	}
 
@@ -1205,6 +1217,8 @@ repeat_discovery:
 				"ENTERING STANDBY STATE");
 		/* notify master SM about us */
 		osm_send_trap144(sm, 0);
+		osm_opensm_report_event(sm->p_subn->p_osm,
+				OSM_EVENT_ID_ENTERING_STANDBY, NULL);
 		return;
 	}
 
@@ -1212,6 +1226,9 @@ repeat_discovery:
 	if (sm->p_subn->force_heavy_sweep)
 		goto repeat_discovery;
 
+	osm_opensm_report_event(sm->p_subn->p_osm,
+				OSM_EVENT_ID_H_SWEEP_DONE, NULL);
+
 	OSM_LOG_MSG_BOX(sm->p_log, OSM_LOG_VERBOSE, "HEAVY SWEEP COMPLETE");
 
 	/* If we are MASTER - get the highest remote_sm, and
@@ -1375,7 +1392,10 @@ repeat_discovery:
 
 	if (osm_log_is_active(sm->p_log, OSM_LOG_VERBOSE) ||
 	    sm->p_subn->opt.sa_db_dump)
-		osm_sa_db_file_dump(sm->p_subn->p_osm);
+		if (!osm_sa_db_file_dump(sm->p_subn->p_osm))
+
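A plug-in consuming these events might be sketched roughly as follows. This is only an illustration: the enum is a local copy of the eight new names so the fragment compiles on its own (a real plug-in would include opensm/osm_event_plugin.h), and `struct sweep_stats` and `example_report()` are hypothetical names, not part of the patch. The callback shape (plug-in data, event ID, event data) is assumed to match the existing report hook.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Local stand-in for the new event IDs.  The names come from the patch;
 * the numeric values here are an assumption made for this sketch only.
 */
typedef enum {
	OSM_EVENT_ID_L_SWEEP_STARTED,
	OSM_EVENT_ID_L_SWEEP_DONE,
	OSM_EVENT_ID_H_SWEEP_STARTED,
	OSM_EVENT_ID_H_SWEEP_DONE,
	OSM_EVENT_ID_REROUTE_DONE,
	OSM_EVENT_ID_ENTERING_STANDBY,
	OSM_EVENT_ID_SM_PORT_DOWN,
	OSM_EVENT_ID_SA_DB_DUMPED
} example_event_id_t;

/* Hypothetical per-plug-in state: simple counters plus a "busy" flag. */
struct sweep_stats {
	unsigned light_sweeps;
	unsigned heavy_sweeps;
	unsigned reroutes;
	unsigned sa_db_dumps;
	int in_heavy_sweep;
};

/* Hypothetical report callback; event_data is unused in this sketch. */
static void example_report(void *plugin_data, example_event_id_t id,
			   void *event_data)
{
	struct sweep_stats *st = plugin_data;

	(void) event_data;

	switch (id) {
	case OSM_EVENT_ID_L_SWEEP_DONE:
		st->light_sweeps++;
		break;
	case OSM_EVENT_ID_H_SWEEP_STARTED:
		/* may be seen more than once if discovery is repeated */
		st->in_heavy_sweep = 1;
		break;
	case OSM_EVENT_ID_H_SWEEP_DONE:
		st->in_heavy_sweep = 0;
		st->heavy_sweeps++;
		break;
	case OSM_EVENT_ID_ENTERING_STANDBY:
	case OSM_EVENT_ID_SM_PORT_DOWN:
		/* in the diff, do_sweep() returns here without H_SWEEP_DONE */
		st->in_heavy_sweep = 0;
		break;
	case OSM_EVENT_ID_REROUTE_DONE:
		st->reroutes++;
		break;
	case OSM_EVENT_ID_SA_DB_DUMPED:
		st->sa_db_dumps++;
		break;
	default:
		break;
	}
}
```

Note that a consumer should not assume every H_SWEEP_STARTED is paired with an H_SWEEP_DONE, since the standby and port-down paths in the diff return early.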
[patch] infiniband: checking the wrong variable
The intent here was to check the mfrpl->mapped_page_list allocation.
We checked mfrpl->ibfrpl.page_list earlier.

Signed-off-by: Dan Carpenter <erro...@gmail.com>

diff --git a/drivers/infiniband/hw/mlx4/mr.c b/drivers/infiniband/hw/mlx4/mr.c
index 56147b2..1d27b9a 100644
--- a/drivers/infiniband/hw/mlx4/mr.c
+++ b/drivers/infiniband/hw/mlx4/mr.c
@@ -240,7 +240,7 @@ struct ib_fast_reg_page_list *mlx4_ib_alloc_fast_reg_page_list(struct ib_device
 	mfrpl->mapped_page_list = dma_alloc_coherent(&dev->dev->pdev->dev,
 						     size, &mfrpl->map,
 						     GFP_KERNEL);
-	if (!mfrpl->ibfrpl.page_list)
+	if (!mfrpl->mapped_page_list)
 		goto err_free;
 
 	WARN_ON(mfrpl->map & 0x3f);
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] [RFC] ummunotify: Userspace support for MMU notifications
I am resubmitting this to try and restart the discussion about how this
should be implemented properly. As the discussion was left, I see two
possible solutions. One is a new class of perf event, set up to
prioritize catching every event over resource consumption, that would
monitor MMU events filtered by registered address ranges. The other
option is the one presented below: a character device that uses ioctl
and read to register address ranges and return MMU events. I want to
try and pick the best solution so I can move forward with it.

From: Roland Dreier <rolandd at cisco.com>

As discussed in http://article.gmane.org/gmane.linux.drivers.openib/61925
and follow-up messages, libraries using RDMA would like to track
precisely when application code changes memory mapping via free(),
munmap(), etc. Current pure-userspace solutions using malloc hooks
and other tricks are not robust, and the feeling among experts is that
the issue is unfixable without kernel help.

We solve this not by implementing the full API proposed in the email
linked above but rather with a simpler and more generic interface,
which may be useful in other contexts. Specifically, we implement a
new character device driver, ummunotify, that creates a /dev/ummunotify
node. A userspace process can open this node read-only and use the fd
as follows:

 1. ioctl() to register/unregister an address range to watch in the
    kernel (cf. struct ummunotify_register_ioctl in linux/ummunotify.h).

 2. read() to retrieve events generated when a mapping in a watched
    address range is invalidated (cf. struct ummunotify_event in
    linux/ummunotify.h). select()/poll()/epoll() and SIGIO are
    handled for this IO.

 3. mmap() one page at offset 0 to map a kernel page that contains a
    generation counter that is incremented each time an event is
    generated. This allows userspace to have a fast path that checks
    that no events have occurred without a system call.
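The fast path in step 3 can be modelled in plain userspace C. The sketch below is not the ummunotify API: `struct reg_cache`, `cache_validate()` and `count_drain()` are hypothetical names, the counter is assumed to be a 64-bit value, and an ordinary variable stands in for the mmap()ed kernel page so the fragment runs without the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical state kept by a registration cache in the library. */
struct reg_cache {
	const volatile uint64_t *generation;	/* would point into the mmap()ed page */
	uint64_t last_seen;			/* counter value at the last drain */
};

/* Callback that would read() pending events and invalidate cache entries. */
typedef void (*drain_fn)(void *ctx);

/*
 * Returns 0 when the counter is unchanged (fast path, no system call),
 * or 1 after calling drain().  The counter is sampled before draining,
 * so an event that arrives during the drain forces another slow path on
 * the next call rather than being missed.
 */
static int cache_validate(struct reg_cache *c, drain_fn drain, void *ctx)
{
	uint64_t gen = *c->generation;

	if (gen == c->last_seen)
		return 0;

	drain(ctx);
	c->last_seen = gen;
	return 1;
}

/* Stand-in drain routine for the sketch: just counts how often it ran. */
static void count_drain(void *ctx)
{
	int *drains = ctx;

	(*drains)++;
}
```

A library would call something like cache_validate() before reusing a cached registration, and only fall back to read() on the fd when it returns nonzero.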
Thanks to Jason Gunthorpe jgunthorpe at obsidianresearch.com for suggestions on the interface design. Also thanks to Jeff Squyres jsquyres at cisco.com for prototyping support for this in Open MPI, which helped find several bugs during development. Signed-off-by: Roland Dreier rolandd at cisco.com Signed-off-by: Eric B Munson ebmun...@us.ibm.com --- Changes since v3: - Fixed replaced [get|put] user with copy_[from|to]_user to fix x86 builds --- Documentation/Makefile|3 +- drivers/char/Kconfig | 12 + drivers/char/Makefile |1 + drivers/char/ummunotify.c | 567 + 4 files changed, 582 insertions(+), 1 deletions(-) create mode 100644 drivers/char/ummunotify.c diff --git a/Documentation/Makefile b/Documentation/Makefile index 6fc7ea1..27ba76a 100644 --- a/Documentation/Makefile +++ b/Documentation/Makefile @@ -1,3 +1,4 @@ obj-m := DocBook/ accounting/ auxdisplay/ connector/ \ filesystems/ filesystems/configfs/ ia64/ laptops/ networking/ \ - pcmcia/ spi/ timers/ video4linux/ vm/ watchdog/src/ + pcmcia/ spi/ timers/ video4linux/ vm/ ummunotify/ \ + watchdog/src/ diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig index 3141dd3..cf26019 100644 --- a/drivers/char/Kconfig +++ b/drivers/char/Kconfig @@ -,6 +,18 @@ config DEVPORT depends on ISA || PCI default y +config UMMUNOTIFY + tristate Userspace MMU notifications + select MMU_NOTIFIER + help + The ummunotify (userspace MMU notification) driver creates a + character device that can be used by userspace libraries to + get notifications when an application's memory mapping + changed. This is used, for example, by RDMA libraries to + improve the reliability of memory registration caching, since + the kernel's MMU notifications can be used to know precisely + when to shoot down a cached registration. 
+ source drivers/s390/char/Kconfig endmenu diff --git a/drivers/char/Makefile b/drivers/char/Makefile index f957edf..521e5de 100644 --- a/drivers/char/Makefile +++ b/drivers/char/Makefile @@ -97,6 +97,7 @@ obj-$(CONFIG_NSC_GPIO)+= nsc_gpio.o obj-$(CONFIG_CS5535_GPIO) += cs5535_gpio.o obj-$(CONFIG_GPIO_TB0219) += tb0219.o obj-$(CONFIG_TELCLOCK) += tlclk.o +obj-$(CONFIG_UMMUNOTIFY) += ummunotify.o obj-$(CONFIG_MWAVE)+= mwave/ obj-$(CONFIG_AGP) += agp/ diff --git a/drivers/char/ummunotify.c b/drivers/char/ummunotify.c new file mode 100644 index 000..c14df3f --- /dev/null +++ b/drivers/char/ummunotify.c @@ -0,0 +1,567 @@ +/* + * Copyright (c) 2009 Cisco Systems. All rights reserved. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License version + * 2 as published by the Free Software Foundation. + * + * THE SOFTWARE IS PROVIDED AS IS, WITHOUT WARRANTY OF ANY KIND, + *
Re: [PATCH v2 13/51] IB/qib: Add qib_driver.c
> +DEFINE_MUTEX(qib_mutex);	/* general driver use */

Rather than having this ill-defined mutex that I think is going to make
it hard to understand the locking and get the lock ordering right, would
it be better to have well-defined locking rules?

AFAICT this mutex is used in only two places, qib_diag.c and
qib_file_op.c. Are those two uses protecting the same thing? Or could
we have two static mutexes, one in each file, that protect what each
file needs protected?
--
Roland Dreier <rola...@cisco.com> || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
Re: Fork safe clarification
I know many of the people here are very busy and these might be
questions that are not about your work at hand, but these intricacies of
libibverbs are hard to figure out without an intimate knowledge of the
drivers (from the user side). I would really appreciate it if anyone
would take the time to respond.

On Thu, Apr 1, 2010 at 10:09 AM, Matthew Small <matthewtsm...@gmail.com> wrote:
> I am trying to understand the behavior of libibverbs after it has
> been set into fork safe mode via a successful call to ibv_fork_init()
> or by setting the environment variable IBV_FORK_SAFE. For my purposes
> I would like to know the following:
>
> Are PDs, QPs and CQs created before a fork shared by the parent and
> child after fork() has returned (i.e. both can submit WRs, poll the
> CQ, etc.)?
>
> What about MRs registered before the fork? Even though the child
> doesn't have access to the parent's memory, can it still submit WRs on
> a QP with an MR created before the fork?
>
> What if the MR pages in the above scenario are accessible in both
> parent and child (shared memory)? Are there complications with
> registering shared memory?
>
> In general, are pointers returned by libibverbs pointers to
> user/process address space (as ibv_mr pointers must be) or kernel
> space? E.g. if an unrelated process had another process's QP pointer,
> lkey, and a virtual address, could it post (almost certainly unsafely)
> a WR to the other process's QP?
>
> Sorry if the questions seem progressively more goofy and thanks in
> advance for any clarification.
>
> -Matt
[PATCH 35/37] librdmacm/mckey: use AF_IB for unmapped multicast addresses
If the user joins an unmapped multicast address, use AF_IB, rather than
AF_INET6, to communicate that information with the kernel.

Signed-off-by: Sean Hefty <sean.he...@intel.com>
---
This requires AF_IB support in the kernel.

 examples/mckey.c |   21 ++---
 1 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/examples/mckey.c b/examples/mckey.c
index ddc3495..a6b5c4d 100644
--- a/examples/mckey.c
+++ b/examples/mckey.c
@@ -46,6 +46,7 @@
 #include <getopt.h>
 
 #include <rdma/rdma_cma.h>
+#include <infiniband/ib.h>
 
 struct cmatest_node {
 	int			id;
@@ -67,9 +68,9 @@ struct cmatest {
 	int			conn_index;
 	int			connects_left;
 
-	struct sockaddr_in6	dst_in;
+	struct sockaddr_storage	dst_in;
 	struct sockaddr		*dst_addr;
-	struct sockaddr_in6	src_in;
+	struct sockaddr_storage	src_in;
 	struct sockaddr		*src_addr;
 };
 
@@ -460,6 +461,20 @@ static int get_addr(char *dst, struct sockaddr *addr)
 	return ret;
 }
 
+static int get_dst_addr(char *dst, struct sockaddr *addr)
+{
+	struct sockaddr_ib *sib;
+
+	if (!unmapped_addr)
+		return get_addr(dst, addr);
+
+	sib = (struct sockaddr_ib *) addr;
+	memset(sib, 0, sizeof *sib);
+	sib->sib_family = AF_IB;
+	inet_pton(AF_INET6, dst, &sib->sib_addr);
+	return 0;
+}
+
 static int run(void)
 {
 	int i, ret;
@@ -471,7 +486,7 @@ static int run(void)
 		return ret;
 	}
 
-	ret = get_addr(dst_addr, (struct sockaddr *) test.dst_in);
+	ret = get_dst_addr(dst_addr, (struct sockaddr *) &test.dst_in);
 	if (ret)
 		return ret;
 
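The address conversion used by get_dst_addr() can be tried in isolation. The following is a minimal sketch, assuming only the POSIX inet_pton() call: `struct example_gid` and `parse_mcast_gid()` are hypothetical names standing in for the 16-byte address field of struct sockaddr_ib, so the fragment builds without infiniband/ib.h. Unlike the patch, the sketch checks the inet_pton() return value.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical stand-in for the 16-byte address inside struct sockaddr_ib. */
struct example_gid {
	uint8_t raw[16];
};

/*
 * Parse a multicast GID written in IPv6 text notation.
 * Returns 0 on success, -1 if the text is not a valid IPv6-style address.
 */
static int parse_mcast_gid(const char *text, struct example_gid *gid)
{
	struct in6_addr tmp;

	/* inet_pton() returns 1 only when the string was converted */
	if (inet_pton(AF_INET6, text, &tmp) != 1)
		return -1;

	memcpy(gid->raw, tmp.s6_addr, sizeof gid->raw);
	return 0;
}
```

Because the GID is carried as raw bytes, no IPv6 semantics are implied; the text form is just a convenient way to type 128 bits.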
[PATCH 34/37] librdmacm: update man pages
Update man pages to reflect changes to the APIs. Signed-off-by: Sean Hefty sean.he...@intel.com --- man/rdma_cm.7 | 56 +-- man/rdma_create_ep.3 | 57 man/rdma_create_id.3 | 14 man/rdma_create_qp.3 | 15 +++-- man/rdma_get_request.3 | 31 ++ man/rdma_migrate_id.3 |6 - 6 files changed, 165 insertions(+), 14 deletions(-) diff --git a/man/rdma_cm.7 b/man/rdma_cm.7 index fd04959..ff5d489 100644 --- a/man/rdma_cm.7 +++ b/man/rdma_cm.7 @@ -8,17 +8,59 @@ Used to establish communication over RDMA transports. .SH NOTES The RDMA CM is a communication manager used to setup reliable, connected and unreliable datagram data transfers. It provides an RDMA transport -neutral interface for establishing connections. The API is based on sockets, -but adapted for queue pair (QP) based semantics: communication must be -over a specific RDMA device, and data transfers are message based. +neutral interface for establishing connections. The API concepts are +is based on sockets, but adapted for queue pair (QP) based semantics: +communication must be over a specific RDMA device, and data transfers +are message based. .P -The RDMA CM only provides the communication management (connection setup / -teardown) portion of an RDMA API. It works in conjunction with the verbs +The RDMA CM can control both the QP and communication management (connection setup / +teardown) portions of an RDMA API, or only the communication management +piece. It works in conjunction with the verbs API defined by the libibverbs library. The libibverbs library provides the -interfaces needed to send and receive data. +underlying interfaces needed to send and receive data. +.P +The RDMA CM can operate asynchronously or synchronously. The mode of +operation is controlled by the user through the use of the rdma_cm event channel +parameter in specific calls. If an event channel is provided, an rdma_cm identifier +will report its event data (results of connecting, for example), on that channel. 
+If a channel is not provided, then all rdma_cm operations for the selected +rdma_cm identifier are will block until they complete. +.SH RDMA VERBS +The rdma_cm supports the full range of verbs available through the libibverbs +library and interfaces. However, it also provides wrapper functions for some +of the more commonly used verbs funcationality. The full set of abstracted +verb calls are: +.P rdma_reg_msgs - register an array of buffers for sending and receiving +.P rdma_reg_read - registers a buffer for RDMA read operations +.P rdma_reg_write - registers a buffer for RDMA write operations +.P rdma_dereg_mr - deregisters a memory region +.P +.P rdma_post_recv - post a buffer to receive a message +.P rdma_post_send - post a buffer to send a message +.P rdma_post_read - post an RDMA to read data into a buffer +.P rdma_post_write - post an RDMA to send data from a buffer +.P +.P rdma_post_recvv - post a vector of buffers to receive a message +.P rdma_post_sendv - post a vector of buffers to send a message +.P rdma_post_readv - post a vector of buffers to receive an RDMA read +.P rdma_post_writev - post a vector of buffers to send an RDMA write +.P +.P rdma_post_ud_send - post a buffer to send a message on a UD QP +.P +.P rdma_get_send_comp - get completion status for a send or RDMA operation +.P rdma_get_recv_comp - get information about a completed receive .SH CLIENT OPERATION This section provides a general overview of the basic operation for the active, -or client, side of communication. A general connection flow would be: +or client, side of communication. This flow assume asynchronous operation with +low level call details shown. For +synchronous operation, calls to rdma_create_event_channel, rdma_get_cm_event, +rdma_ack_cm_event, and rdma_destroy_event_channel +would be eliminated. Abstracted calls, such as rdma_create_ep encapsulate +serveral of these calls under a single API. +Users may also refer to the example applications for +code samples. 
A general connection flow would be: +.IP rdma_getaddrinfo +retrieve address information of the destination .IP rdma_create_event_channel create channel to receive events .IP rdma_create_id diff --git a/man/rdma_create_ep.3 b/man/rdma_create_ep.3 new file mode 100644 index 000..ae07113 --- /dev/null +++ b/man/rdma_create_ep.3 @@ -0,0 +1,57 @@ +.TH RDMA_CREATE_EP 3 2007-08-06 librdmacm Librdmacm Programmer's Manual librdmacm +.SH NAME +rdma_create_ep \- Allocate a communication identifier and optional QP. +.SH SYNOPSIS +.B #include rdma/rdma_cma.h +.P +.B int rdma_create_ep +.BI (struct rdma_cm_id ** id , +.BI struct rdma_addrinfo * res , +.BI struct ibv_pd * pd , +.BI struct ibv_qp_init_attr * qp_init_attr ); +.SH ARGUMENTS +.IP id 12 +A reference where the allocated communication identifier will be
[infiniband-diags] [0/3] support --diff and --diffcheck in ibnetdiscover
Hey Sasha, The following sets of patches implement a --diff and --diffcheck options in ibnetdiscover to let users diff an ibnetdiscover state to a previous ibnetdiscover state. The goal of this option is to help system administrators isolate/determine changes in the network quickly compared to a previous state. Here's an example: # ./ibnetdiscover --diff=orig.cache vendid=0x8f1 devid=0x5a30 sysimgguid=0x8f10400411f57 switchguid=0x8f10400411f56(8f10400411f56) Switch 24 S-0008f10400411f56 # ISR9024D Voltaire base port 0 lid 11 lmc 0 [14] H-0002c90200219ef0[1](2c90200219ef1) # wopr0 lid 64 4xDDR [19] H-0002c903ff7c[1](2c903ff7d) # wopr9 lid 48 4xDDR [20] H-0002c903ff7c[1](2c903ff7d) # wopr9 lid 4 4xDDR vendid=0x2c9 devid=0x6282 sysimgguid=0x2c90200219ef3 caguid=0x2c90200219ef0 Ca2 H-0002c90200219ef0 # wopr0 [1](2c90200219ef1)S-0008f10400411f56[14]# lid 64 lmc 2 ISR9024D Voltaire lid 11 4xDDR In this particular example, port 14 on the switch (which is connected to node 'wopr0') was up before but is now down (and the associated CA is noted too). In addition, 'wopr9' is connected to port 20 instead of port 19 on the switch. By default --diff checks switches, cas, routers, and port connections. The --diffcheck option allows the user to specify which diff options they want done, and also adds other diff checks for lids and/or node descriptions. More diff checks could be added later as needed. For example, the following only checks for differences of lids on switches. # ./ibnetdiscover --diff=orig.cache --diffcheck=sw,lid vendid=0x8f1 devid=0x5a30 sysimgguid=0x8f10400411f57 switchguid=0x8f10400411f56(8f10400411f56) Switch24 S-0008f10400411f56 # ISR9024D Voltaire base port 0 lid 11 lmc 0 Switch24 S-0008f10400411f56 # ISR9024D Voltaire base port 0 lid 3 lmc 0 [13] H-0002c90200219e64[1](2c90200219e65) # wopri lid 4 4xDDR [13] H-0002c90200219e64[1](2c90200219e65) # wopri lid 1 4xDDR Others on the list may wonder how this is different than just using the normal 'diff' tool. 
The differences I can think of are:

1) This checks differences in the network, not text. This is
particularly important when lids, lmc, etc. are changed. Otherwise
there are many differences in a normal diff output that aren't
necessary.

2) This provides the appropriate context in the diff output, showing the
appropriate system ids to allow a system administrator to identify which
ports on which switch have changed. Under normal diff output, you may
not get that context. The system administrator can of course use
options like --context in diff, but the goal is to make the diff output
clear and concise, without unnecessary junk.

3) As parallelization has been added into
ibnetdiscover/libibnetdiscover this becomes more critical, as output in
ibnetdiscover/libibnetdiscover can be re-ordered. So a normal diff
suddenly is non-functional.

There are probably other minor advantages. Even if minor output tweaks
happen to ibnetdiscover in the future, this can still work against old
cache files.

Al

--
Albert Chu
ch...@llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory
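To illustrate point 3, here is a rough sketch of an order-independent comparison of two link snapshots. It is not how ibnetdiscover implements --diff: `struct link`, `link_cmp()` and `count_link_changes()` are hypothetical, and a real implementation would work on the libibnetdiscover fabric structures and cache files instead.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record for one end of a link in a fabric snapshot. */
struct link {
	uint64_t node_guid;
	int port;
	uint64_t peer_guid;
	int peer_port;
};

/* Total order over links, so both snapshots can be sorted the same way. */
static int link_cmp(const void *a, const void *b)
{
	const struct link *x = a, *y = b;

	if (x->node_guid != y->node_guid)
		return x->node_guid < y->node_guid ? -1 : 1;
	if (x->port != y->port)
		return x->port < y->port ? -1 : 1;
	if (x->peer_guid != y->peer_guid)
		return x->peer_guid < y->peer_guid ? -1 : 1;
	if (x->peer_port != y->peer_port)
		return x->peer_port < y->peer_port ? -1 : 1;
	return 0;
}

/*
 * Count links that appear in exactly one of the two snapshots.
 * Both arrays are sorted in place, so discovery order does not matter.
 */
static size_t count_link_changes(struct link *old, size_t n_old,
				 struct link *cur, size_t n_cur)
{
	size_t i = 0, j = 0, changes = 0;

	qsort(old, n_old, sizeof *old, link_cmp);
	qsort(cur, n_cur, sizeof *cur, link_cmp);

	while (i < n_old && j < n_cur) {
		int c = link_cmp(&old[i], &cur[j]);

		if (c == 0) {
			i++;
			j++;
		} else if (c < 0) {
			changes++;	/* link only in the old snapshot */
			i++;
		} else {
			changes++;	/* link only in the new snapshot */
			j++;
		}
	}

	/* whatever is left over exists in only one snapshot */
	return changes + (n_old - i) + (n_cur - j);
}
```

Two snapshots that list the same links in a different order compare as equal here, which is exactly where a textual diff of parallel discovery output breaks down.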
[PATCH 31/37] librdmacm: provide abstracted verb calls
Provide abstractions to the verb calls to simplify the user interface for more casual verbs consumers. Users still have access to the full range of verbs functionality by calling verbs directly. Signed-off-by: Sean Hefty sean.he...@intel.com --- Makefile.am |5 - include/rdma/rdma_verbs.h | 287 + 2 files changed, 290 insertions(+), 2 deletions(-) diff --git a/Makefile.am b/Makefile.am index 8d86045..8aef24a 100644 --- a/Makefile.am +++ b/Makefile.am @@ -31,7 +31,8 @@ librdmacmincludedir = $(includedir)/rdma $(includedir)/infiniband librdmacminclude_HEADERS = include/rdma/rdma_cma_abi.h \ include/rdma/rdma_cma.h \ - include/infiniband/ib.h + include/infiniband/ib.h \ + include/rdma/rdma_verbs.h man_MANS = \ man/rdma_accept.3 \ @@ -69,7 +70,7 @@ man_MANS = \ man/rdma_cm.7 EXTRA_DIST = include/rdma/rdma_cma_abi.h include/rdma/rdma_cma.h \ -include/infiniband/ib.h \ +include/infiniband/ib.h include/rdma/rdma_verbs.h \ src/cma.h src/librdmacm.map librdmacm.spec.in $(man_MANS) dist-hook: librdmacm.spec diff --git a/include/rdma/rdma_verbs.h b/include/rdma/rdma_verbs.h new file mode 100644 index 000..05964c1 --- /dev/null +++ b/include/rdma/rdma_verbs.h @@ -0,0 +1,287 @@ +/* + * Copyright (c) 2010 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + *copyright notice, this list of conditions and the following + *disclaimer. 
+ * + * - Redistributions in binary form must reproduce the above + *copyright notice, this list of conditions and the following + *disclaimer in the documentation and/or other materials + *provided with the distribution. + * + * THE SOFTWARE IS PROVIDED AS IS, WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#if !defined(RDMA_VERBS_H) +#define RDMA_VERBS_H + +#include assert.h +#include infiniband/verbs.h +#include rdma/rdma_cma.h + +#ifdef __cplusplus +extern C { +#endif + +/* + * Memory registration helpers. + */ +static inline struct ibv_mr * +rdma_reg_msgs(struct rdma_cm_id *id, void *addr, size_t length) +{ + return ibv_reg_mr(id-qp-pd, addr, length, IBV_ACCESS_LOCAL_WRITE); +} + +static inline struct ibv_mr * +rdma_reg_read(struct rdma_cm_id *id, void *addr, size_t length) +{ + return ibv_reg_mr(id-qp-pd, addr, length, IBV_ACCESS_LOCAL_WRITE| + IBV_ACCESS_REMOTE_READ); +} + +static inline struct ibv_mr * +rdma_reg_write(struct rdma_cm_id *id, void *addr, size_t length) +{ + return ibv_reg_mr(id-qp-pd, addr, length, IBV_ACCESS_LOCAL_WRITE | + IBV_ACCESS_REMOTE_WRITE); +} + +static inline int +rdma_dereg_mr(struct ibv_mr *mr) +{ + return ibv_dereg_mr(mr); +} + + +/* + * Vectored send, receive, and RDMA operations. + * Support multiple scatter-gather entries. 
+ */ +static inline int +rdma_post_recvv(struct rdma_cm_id *id, void *context, struct ibv_sge *sgl, + int nsge) +{ + struct ibv_recv_wr wr, *bad; + + wr.wr_id = (uintptr_t) context; + wr.next = NULL; + wr.sg_list = sgl; + wr.num_sge = nsge; + + return ibv_post_recv(id-qp, wr, bad); +} + +static inline int +rdma_post_sendv(struct rdma_cm_id *id, void *context, struct ibv_sge *sgl, + int nsge, int flags) +{ + struct ibv_send_wr wr, *bad; + + wr.wr_id = (uintptr_t) context; + wr.next = NULL; + wr.sg_list = sgl; + wr.num_sge = nsge; + wr.opcode = IBV_WR_SEND; + wr.send_flags = flags; + + return ibv_post_send(id-qp, wr, bad); +} + +static inline int +rdma_post_readv(struct rdma_cm_id *id, void *context, struct ibv_sge *sgl, + int nsge, int flags, uint64_t remote_addr, uint32_t rkey) +{ + struct ibv_send_wr wr, *bad; + + wr.wr_id = (uintptr_t) context; + wr.next =
[PATCH 27/37] librdmacm: add support for IB ACM service
Allow the librdmacm to contact a service via sockets to obtain address mapping and path record data. The use of the service is controlled through a build option (with-ib_acm). If the library fails to contact the service, it falls back to using the kernel services to resolve address and routing data. Signed-off-by: Sean Hefty sean.he...@intel.com --- Once IB ACM is proven, the build option can be removed. Makefile.am|2 - configure.in | 14 + src/acm.c | 160 src/addrinfo.c |3 + src/cma.c |9 ++- src/cma.h | 13 - 6 files changed, 197 insertions(+), 4 deletions(-) diff --git a/Makefile.am b/Makefile.am index be53c78..8d86045 100644 --- a/Makefile.am +++ b/Makefile.am @@ -12,7 +12,7 @@ else librdmacm_version_script = endif -src_librdmacm_la_SOURCES = src/cma.c src/addrinfo.c +src_librdmacm_la_SOURCES = src/cma.c src/addrinfo.c src/acm.c src_librdmacm_la_LDFLAGS = -version-info 1 -export-dynamic \ $(librdmacm_version_script) src_librdmacm_la_DEPENDENCIES = $(srcdir)/src/librdmacm.map diff --git a/configure.in b/configure.in index 1122966..3db4247 100644 --- a/configure.in +++ b/configure.in @@ -21,6 +21,15 @@ if test $with_valgrind != test $with_valgrind != no; then fi fi +AC_ARG_WITH([ib_acm], +AC_HELP_STRING([--with-ib_acm], + [Use IB ACM for route resolution - default NO])) + +if test $with_ib_acm != test $with_ib_acm != no; then + AC_DEFINE([USE_IB_ACM], 1, + [Define to 1 to use IB ACM for endpoint resolution]) +fi + AC_ARG_ENABLE(libcheck, [ --disable-libcheck do not test for presence of ib libraries], [ if test $enableval = no; then disable_libcheck=yes @@ -51,6 +60,11 @@ AC_CHECK_HEADER(valgrind/memcheck.h, [], AC_MSG_ERROR([valgrind requested but valgrind/memcheck.h not found.])) fi +if test $with_ib_acm != test $with_ib_acm != no; then +AC_CHECK_HEADER(infiniband/acm.h, [], +AC_MSG_ERROR([IB ACM requested but infiniband/acm.h not found.])) +fi + fi AC_CACHE_CHECK(whether ld accepts --version-script, ac_cv_version_script, diff --git a/src/acm.c b/src/acm.c new file 
mode 100644 index 000..34fdf3c --- /dev/null +++ b/src/acm.c @@ -0,0 +1,160 @@ +/* + * Copyright (c) 2010 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + *copyright notice, this list of conditions and the following + *disclaimer. + * + * - Redistributions in binary form must reproduce the above + *copyright notice, this list of conditions and the following + *disclaimer in the documentation and/or other materials + *provided with the distribution. + * + * THE SOFTWARE IS PROVIDED AS IS, WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#if HAVE_CONFIG_H +# include config.h +#endif /* HAVE_CONFIG_H */ + +#include sys/types.h +#include sys/socket.h +#include netdb.h +#include unistd.h + +#include cma.h +#include rdma/rdma_cma.h +#include infiniband/ib.h +#include infiniband/sa.h + +#ifdef USE_IB_ACM +#include infiniband/acm.h + +static pthread_mutex_t acm_lock = PTHREAD_MUTEX_INITIALIZER; +static int sock; +static short server_port = 6125; + +void ucma_ib_init(void) +{ + struct sockaddr_in addr; + int ret; + + sock = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP); + if (sock 0) + return; + + memset(addr, 0, sizeof addr); + addr.sin_family = AF_INET; + addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK); + addr.sin_port = htons(server_port); + ret = connect(sock, (struct sockaddr *) addr, sizeof(addr)); + if (ret) + goto err; + + return; + +err: + close(sock); + sock = 0; +} + +void ucma_ib_cleanup(void) +{ + if (sock 0) { + shutdown(sock, SHUT_RDWR); + close(sock); + } +} + +static void ucma_ib_save_resp(struct rdma_addrinfo *rai, struct acm_resolve_msg *msg) +{ +
[PATCH 26/37] librdmacm: set src_addr in rdma_getaddrinfo
RDMA requires the user to allocate hardware resources before
establishing a connection.  To support this, the user must know
the source address that the connection will use to reach the
remote endpoint.  Modify rdma_getaddrinfo to determine an
appropriate source address based on the specified destination,
when a source address is not given.

Signed-off-by: Sean Hefty <sean.he...@intel.com>
---
 src/addrinfo.c |   60 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 60 insertions(+), 0 deletions(-)

diff --git a/src/addrinfo.c b/src/addrinfo.c
index 15ae071..dfaf9d5 100644
--- a/src/addrinfo.c
+++ b/src/addrinfo.c
@@ -39,6 +39,7 @@
 #include <sys/types.h>
 #include <sys/socket.h>
 #include <netdb.h>
+#include <unistd.h>
 
 #include "cma.h"
 #include <rdma/rdma_cma.h>
@@ -129,6 +130,48 @@ static int ucma_convert_to_rai(struct rdma_addrinfo *rai, struct addrinfo *ai)
 	return 0;
 }
 
+static int ucma_resolve_src(struct rdma_addrinfo *rai)
+{
+	struct sockaddr *addr;
+	socklen_t len;
+	int ret, s;
+
+	s = socket(rai->ai_family, SOCK_DGRAM, IPPROTO_UDP);
+	if (s < 0)
+		return s;
+
+	ret = connect(s, rai->ai_dst_addr, rai->ai_dst_len);
+	if (ret)
+		goto err1;
+
+	addr = zalloc(rai->ai_dst_len);
+	if (!addr) {
+		ret = ERR(ENOMEM);
+		goto err1;
+	}
+
+	len = rai->ai_dst_len;
+	ret = getsockname(s, addr, &len);
+	if (ret)
+		goto err2;
+
+	if (addr->sa_family == AF_INET)
+		((struct sockaddr_in *) addr)->sin_port = 0;
+	else
+		((struct sockaddr_in6 *) addr)->sin6_port = 0;
+	rai->ai_src_addr = addr;
+	rai->ai_src_len = len;
+
+	close(s);
+	return 0;
+
+err2:
+	free(addr);
+err1:
+	close(s);
+	return ret;
+}
+
 int rdma_getaddrinfo(char *node, char *service,
 		     struct rdma_addrinfo *hints,
 		     struct rdma_addrinfo **res)
@@ -159,6 +202,23 @@ int rdma_getaddrinfo(char *node, char *service,
 	if (ret)
 		goto err2;
 
+	if (!rai->ai_src_len) {
+		if (hints && hints->ai_src_len) {
+			rai->ai_src_addr = zalloc(hints->ai_src_len);
+			if (!rai->ai_src_addr) {
+				ret = ERR(ENOMEM);
+				goto err2;
+			}
+			memcpy(rai->ai_src_addr, hints->ai_src_addr,
+			       hints->ai_src_len);
+			rai->ai_src_len = hints->ai_src_len;
+		} else {
+			ret = ucma_resolve_src(rai);
+			if (ret)
+				goto err2;
+		}
+	}
+
 	freeaddrinfo(ai);
 	*res = rai;
 	return 0;
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
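For readers who have not seen the trick used by ucma_resolve_src, here is a minimal standalone sketch of it, assuming IPv4 only. connect() on a UDP socket sends nothing; it only makes the kernel pick a route, and getsockname() then reports the local address chosen for that route. The name pick_src_addr is made up for this illustration and is not part of librdmacm.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper (not a librdmacm function): ask the kernel which
 * local IPv4 address it would use to reach dst.
 * Returns 0 on success and -1 on error.
 */
static int pick_src_addr(const struct sockaddr_in *dst, struct sockaddr_in *src)
{
	socklen_t len = sizeof(*src);
	int s, ret;

	assert(dst && src);

	s = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
	if (s < 0)
		return -1;

	/* No packet is sent here; the socket is only bound to a route. */
	ret = connect(s, (const struct sockaddr *) dst, sizeof(*dst));
	if (!ret)
		ret = getsockname(s, (struct sockaddr *) src, &len);
	if (!ret)
		src->sin_port = 0;	/* keep the address, drop the ephemeral port */

	close(s);
	return ret ? -1 : 0;
}
```

As in the patch, the port is cleared before the address is handed back, since only the address matters for choosing the local device.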
[PATCH 19/37] librdmacm: add rdma_get_request
To simplify passive side operation and better support synchronous
operations, add rdma_get_request().  This function is called on
the listening side to retrieve a connection request event.

Signed-off-by: Sean Hefty <sean.he...@intel.com>
---
Ideally, this call would have been rdma_accept, to match with the
socket accept call, but it was already taken.

 include/rdma/rdma_cma.h |    5 +++++
 src/cma.c               |   38 ++++++++++++++++++++++++++++++++++++++
 src/librdmacm.map       |    1 +
 3 files changed, 44 insertions(+), 0 deletions(-)

diff --git a/include/rdma/rdma_cma.h b/include/rdma/rdma_cma.h
index 1db559e..89013a0 100644
--- a/include/rdma/rdma_cma.h
+++ b/include/rdma/rdma_cma.h
@@ -381,6 +381,11 @@ int rdma_connect(struct rdma_cm_id *id, struct rdma_conn_param *conn_param);
 int rdma_listen(struct rdma_cm_id *id, int backlog);
 
 /**
+ * rdma_get_request
+ */
+int rdma_get_request(struct rdma_cm_id *listen, struct rdma_cm_id **id);
+
+/**
  * rdma_accept - Called to accept a connection request.
  * @id: Connection identifier associated with the request.
  * @conn_param: Optional information needed to establish the connection.
diff --git a/src/cma.c b/src/cma.c
index 8aa7b05..9de33d4 100644
--- a/src/cma.c
+++ b/src/cma.c
@@ -1242,6 +1242,44 @@ int rdma_listen(struct rdma_cm_id *id, int backlog)
 	return ucma_query_route(id);
 }
 
+int rdma_get_request(struct rdma_cm_id *listen, struct rdma_cm_id **id)
+{
+	struct cma_id_private *id_priv;
+	struct rdma_cm_event *event;
+	int ret;
+
+	id_priv = container_of(listen, struct cma_id_private, id);
+	if (!id_priv->sync)
+		return ERR(EINVAL);
+
+	if (listen->event) {
+		rdma_ack_cm_event(listen->event);
+		listen->event = NULL;
+	}
+
+	ret = rdma_get_cm_event(listen->channel, &event);
+	if (ret)
+		return ret;
+
+	if (event->status) {
+		ret = event->status;
+		goto err;
+	}
+
+	if (event->event != RDMA_CM_EVENT_CONNECT_REQUEST) {
+		ret = ERR(EINVAL);
+		goto err;
+	}
+
+	*id = event->id;
+	(*id)->event = event;
+	return 0;
+
+err:
+	listen->event = event;
+	return ret;
+}
+
 int rdma_accept(struct rdma_cm_id *id, struct rdma_conn_param *conn_param)
 {
 	struct ucma_abi_accept *cmd;
diff --git a/src/librdmacm.map b/src/librdmacm.map
index 1f07102..f6af452 100644
--- a/src/librdmacm.map
+++ b/src/librdmacm.map
@@ -30,5 +30,6 @@ RDMACM_1.0 {
 		rdma_migrate_id;
 		rdma_getaddrinfo;
 		rdma_freeaddrinfo;
+		rdma_get_request;
 	local: *;
 };
[PATCH 17/37] librdmacm: expose ucma_init to other internal modules
Remove static property from ucma_init and expose its definition
in cma.h.  The address resolution module will need access to this
function.

Signed-off-by: Sean Hefty <sean.he...@intel.com>
---
 src/cma.c |   14 +++++++++-----
 src/cma.h |    2 ++
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/src/cma.c b/src/cma.c
index 6ef4b96..8aa7b05 100644
--- a/src/cma.c
+++ b/src/cma.c
@@ -188,13 +188,17 @@ static int check_abi_version(void)
 	return 0;
 }
 
-static int ucma_init(void)
+int ucma_init(void)
 {
 	struct ibv_device **dev_list = NULL;
 	struct cma_device *cma_dev;
 	struct ibv_device_attr attr;
 	int i, ret, dev_cnt;
 
+	/* Quick check without lock to see if we're already initialized */
+	if (cma_dev_cnt)
+		return 0;
+
 	pthread_mutex_lock(&mut);
 	if (cma_dev_cnt) {
 		pthread_mutex_unlock(&mut);
@@ -271,7 +275,7 @@ struct ibv_context **rdma_get_devices(int *num_devices)
 	struct ibv_context **devs = NULL;
 	int i;
 
-	if (!cma_dev_cnt && ucma_init())
+	if (ucma_init())
 		goto out;
 
 	devs = malloc(sizeof *devs * (cma_dev_cnt + 1));
@@ -301,7 +305,7 @@ struct rdma_event_channel *rdma_create_event_channel(void)
 {
 	struct rdma_event_channel *channel;
 
-	if (!cma_dev_cnt && ucma_init())
+	if (ucma_init())
 		return NULL;
 
 	channel = malloc(sizeof *channel);
@@ -396,7 +400,7 @@ int rdma_create_id(struct rdma_event_channel *channel,
 	void *msg;
 	int ret, size;
 
-	ret = cma_dev_cnt ? 0 : ucma_init();
+	ret = ucma_init();
 	if (ret)
 		return ret;
 
@@ -1712,7 +1716,7 @@ int rdma_get_cm_event(struct rdma_event_channel *channel,
 	void *msg;
 	int ret, size;
 
-	ret = cma_dev_cnt ? 0 : ucma_init();
+	ret = ucma_init();
 	if (ret)
 		return ret;
 
diff --git a/src/cma.h b/src/cma.h
index 92e771e..06ca38c 100644
--- a/src/cma.h
+++ b/src/cma.h
@@ -82,5 +82,7 @@ static inline void *zalloc(size_t size)
 	return buf;
 }
 
+int ucma_init();
+
 #endif /* CMA_H */
 
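The check, lock, recheck shape that this patch moves into ucma_init can be sketched on its own as below. Everything here (lib_init, discover_devices, the counters) is a made-up stand-in, not librdmacm code. Note that the unlocked read is a shortcut that mirrors the patch; a stricter version would probably use C11 atomics or pthread_once.

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t init_mut = PTHREAD_MUTEX_INITIALIZER;
static int dev_cnt;	/* nonzero once initialization has completed */
static int init_calls;	/* counts how often the slow path actually ran */

/* Stand-in for device discovery; returns the number of devices found. */
static int discover_devices(void)
{
	init_calls++;
	return 2;
}

/* Idempotent init: callers no longer need their own "already done?" test. */
static int lib_init(void)
{
	/* Fast path: skip the lock once initialization is known to be done. */
	if (dev_cnt)
		return 0;

	pthread_mutex_lock(&init_mut);
	if (!dev_cnt) {		/* recheck: another thread may have won the race */
		dev_cnt = discover_devices();
		assert(dev_cnt >= 0);	/* discovery must not report a negative count */
	}
	pthread_mutex_unlock(&init_mut);

	return dev_cnt ? 0 : -1;
}
```

With the test folded into the init function, each entry point reduces to a plain call, which is what the hunks in rdma_get_devices, rdma_create_event_channel, rdma_create_id and rdma_get_cm_event do.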
[PATCH 15/37] librdmacm: allow user to specify max RDMA resources
Allow the user to indicate that the library should select the maximum RDMA read values available that should be used when establishing a connection. The library selects the maximum based on local hardware limitations and connection request data. Signed-off-by: Sean Hefty sean.he...@intel.com --- include/rdma/rdma_cma.h |5 +++ src/cma.c | 83 +++ src/cma.h |2 + 3 files changed, 62 insertions(+), 28 deletions(-) diff --git a/include/rdma/rdma_cma.h b/include/rdma/rdma_cma.h index d8cbb91..f50b4dd 100644 --- a/include/rdma/rdma_cma.h +++ b/include/rdma/rdma_cma.h @@ -121,6 +121,11 @@ struct rdma_cm_id { struct ibv_cq *recv_cq; }; +enum { + RDMA_MAX_RESP_RES = 0xFF, + RDMA_MAX_INIT_DEPTH = 0xFF +}; + struct rdma_conn_param { const void *private_data; uint8_t private_data_len; diff --git a/src/cma.c b/src/cma.c index 805aca3..b8d57a5 100644 --- a/src/cma.c +++ b/src/cma.c @@ -112,6 +112,8 @@ struct cma_id_private { pthread_mutex_t mut; uint32_t handle; struct cma_multicast *mc_list; + uint8_t initiator_depth; + uint8_t responder_resources; }; struct cma_multicast { @@ -850,8 +852,7 @@ static int rdma_init_qp_attr(struct rdma_cm_id *id, struct ibv_qp_attr *qp_attr, return 0; } -static int ucma_modify_qp_rtr(struct rdma_cm_id *id, - struct rdma_conn_param *conn_param) +static int ucma_modify_qp_rtr(struct rdma_cm_id *id, uint8_t resp_res) { struct ibv_qp_attr qp_attr; int qp_attr_mask, ret; @@ -874,13 +875,12 @@ static int ucma_modify_qp_rtr(struct rdma_cm_id *id, if (ret) return ret; - if (conn_param) - qp_attr.max_dest_rd_atomic = conn_param-responder_resources; + if (resp_res != RDMA_MAX_RESP_RES) + qp_attr.max_dest_rd_atomic = resp_res; return ibv_modify_qp(id-qp, qp_attr, qp_attr_mask); } -static int ucma_modify_qp_rts(struct rdma_cm_id *id, - struct rdma_conn_param *conn_param) +static int ucma_modify_qp_rts(struct rdma_cm_id *id, uint8_t init_depth) { struct ibv_qp_attr qp_attr; int qp_attr_mask, ret; @@ -890,8 +890,8 @@ static int ucma_modify_qp_rts(struct 
rdma_cm_id *id, if (ret) return ret; - if (conn_param) - qp_attr.max_rd_atomic = conn_param-initiator_depth; + if (init_depth != RDMA_MAX_INIT_DEPTH) + qp_attr.max_rd_atomic = init_depth; return ibv_modify_qp(id-qp, qp_attr, qp_attr_mask); } @@ -1128,28 +1128,31 @@ void rdma_destroy_qp(struct rdma_cm_id *id) } static int ucma_valid_param(struct cma_id_private *id_priv, - struct rdma_conn_param *conn_param) + struct rdma_conn_param *param) { if (id_priv-id.ps != RDMA_PS_TCP) return 0; - if ((conn_param-responder_resources -id_priv-cma_dev-max_responder_resources) || - (conn_param-initiator_depth -id_priv-cma_dev-max_initiator_depth)) + if ((param-responder_resources != RDMA_MAX_RESP_RES) + (param-responder_resources id_priv-cma_dev-max_responder_resources)) + return ERR(EINVAL); + + if ((param-initiator_depth != RDMA_MAX_INIT_DEPTH) + (param-initiator_depth id_priv-cma_dev-max_initiator_depth)) return ERR(EINVAL); return 0; } -static void ucma_copy_conn_param_to_kern(struct ucma_abi_conn_param *dst, +static void ucma_copy_conn_param_to_kern(struct cma_id_private *id_priv, +struct ucma_abi_conn_param *dst, struct rdma_conn_param *src, uint32_t qp_num, uint8_t srq) { dst-qp_num = qp_num; dst-srq = srq; - dst-responder_resources = src-responder_resources; - dst-initiator_depth = src-initiator_depth; + dst-responder_resources = id_priv-responder_resources; + dst-initiator_depth = id_priv-initiator_depth; dst-flow_control = src-flow_control; dst-retry_count = src-retry_count; dst-rnr_retry_count = src-rnr_retry_count; @@ -1174,15 +1177,24 @@ int rdma_connect(struct rdma_cm_id *id, struct rdma_conn_param *conn_param) if (ret) return ret; + if (conn_param-initiator_depth != RDMA_MAX_INIT_DEPTH) + id_priv-initiator_depth = conn_param-initiator_depth; + else + id_priv-initiator_depth = id_priv-cma_dev-max_initiator_depth; + if (conn_param-responder_resources != RDMA_MAX_RESP_RES) + id_priv-responder_resources = conn_param-responder_resources; + else + 
id_priv-responder_resources =
[PATCH 14/37] librdmacm: make CQs optional for rdma_create_qp
Allow the user to specify NULL for the send and receive CQs when
creating a QP through rdma_create_qp.  The librdmacm will
automatically create CQs for the user, along with completion
channel.

Signed-off-by: Sean Hefty <sean.he...@intel.com>
---
 include/rdma/rdma_cma.h |    4 +++
 src/cma.c               |   74 ++++++++++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 74 insertions(+), 4 deletions(-)

diff --git a/include/rdma/rdma_cma.h b/include/rdma/rdma_cma.h
index ccf6cd4..d8cbb91 100644
--- a/include/rdma/rdma_cma.h
+++ b/include/rdma/rdma_cma.h
@@ -115,6 +115,10 @@ struct rdma_cm_id {
 	enum rdma_port_space	 ps;
 	uint8_t			 port_num;
 	struct rdma_cm_event	*event;
+	struct ibv_comp_channel *send_cq_channel;
+	struct ibv_cq		*send_cq;
+	struct ibv_comp_channel *recv_cq_channel;
+	struct ibv_cq		*recv_cq;
 };
 
 struct rdma_conn_param {
diff --git a/src/cma.c b/src/cma.c
index 0587ab3..805aca3 100644
--- a/src/cma.c
+++ b/src/cma.c
@@ -1025,6 +1025,63 @@ static int ucma_init_ud_qp(struct cma_id_private *id_priv, struct ibv_qp *qp)
 	return ibv_modify_qp(qp, &qp_attr, IBV_QP_STATE | IBV_QP_SQ_PSN);
 }
 
+static void ucma_destroy_cqs(struct rdma_cm_id *id)
+{
+	if (id->recv_cq)
+		ibv_destroy_cq(id->recv_cq);
+
+	if (id->recv_cq_channel)
+		ibv_destroy_comp_channel(id->recv_cq_channel);
+
+	if (id->send_cq)
+		ibv_destroy_cq(id->send_cq);
+
+	if (id->send_cq_channel)
+		ibv_destroy_comp_channel(id->send_cq_channel);
+}
+
+static int ucma_create_cqs(struct rdma_cm_id *id, struct ibv_qp_init_attr *attr)
+{
+	int ret;
+
+	if (!attr->recv_cq) {
+		id->recv_cq_channel = ibv_create_comp_channel(id->verbs);
+		if (!id->recv_cq_channel) {
+			ret = ERR(ENOMEM);
+			goto err;
+		}
+
+		id->recv_cq = ibv_create_cq(id->verbs, attr->cap.max_recv_wr,
+					    id, id->recv_cq_channel, 0);
+		if (!id->recv_cq) {
+			ret = ERR(ENOMEM);
+			goto err;
+		}
+		attr->recv_cq = id->recv_cq;
+	}
+
+	if (!attr->send_cq) {
+		id->send_cq_channel = ibv_create_comp_channel(id->verbs);
+		if (!id->send_cq_channel) {
+			ret = ERR(ENOMEM);
+			goto err;
+		}
+
+		id->send_cq = ibv_create_cq(id->verbs, attr->cap.max_send_wr,
+					    id, id->send_cq_channel, 0);
+		if (!id->send_cq) {
+			ret = ERR(ENOMEM);
+			goto err;
+		}
+		attr->send_cq = id->send_cq;
+	}
+
+	return 0;
+err:
+	ucma_destroy_cqs(id);
+	return ret;
+}
+
 int rdma_create_qp(struct rdma_cm_id *id, struct ibv_pd *pd,
 		   struct ibv_qp_init_attr *qp_init_attr)
 {
@@ -1038,27 +1095,36 @@ int rdma_create_qp(struct rdma_cm_id *id, struct ibv_pd *pd,
 	else if (id->verbs != pd->context)
 		return ERR(EINVAL);
 
+	ret = ucma_create_cqs(id, qp_init_attr);
+	if (ret)
+		return ret;
+
 	qp = ibv_create_qp(pd, qp_init_attr);
-	if (!qp)
-		return ERR(ENOMEM);
+	if (!qp) {
+		ret = ERR(ENOMEM);
+		goto err1;
+	}
 
 	if (ucma_is_ud_ps(id->ps))
 		ret = ucma_init_ud_qp(id_priv, qp);
 	else
 		ret = ucma_init_conn_qp(id_priv, qp);
 	if (ret)
-		goto err;
+		goto err2;
 
 	id->qp = qp;
 	return 0;
-err:
+err2:
 	ibv_destroy_qp(qp);
+err1:
+	ucma_destroy_cqs(id);
 	return ret;
 }
 
 void rdma_destroy_qp(struct rdma_cm_id *id)
 {
 	ibv_destroy_qp(id->qp);
+	ucma_destroy_cqs(id);
 }
 
 static int ucma_valid_param(struct cma_id_private *id_priv,
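The error handling in ucma_create_cqs gets away with a single err label because the destroy routine tolerates members that were never created. A small sketch of that shape with plain heap allocations follows; struct pair, pair_create and the fail_second test hook are invented for this illustration and have no counterpart in the library.

```c
#include <assert.h>
#include <stdlib.h>

struct pair {
	int *first;	/* acquired first */
	int *second;	/* acquired second */
};

/* Release whatever was acquired; safe on a partially built pair. */
static void pair_destroy(struct pair *p)
{
	free(p->second);	/* free(NULL) is a no-op */
	p->second = NULL;
	free(p->first);
	p->first = NULL;
}

/*
 * Acquire two resources in order. fail_second is a test hook (an
 * assumption of this sketch) that simulates the second step failing.
 */
static int pair_create(struct pair *p, int fail_second)
{
	assert(p);
	p->first = NULL;
	p->second = NULL;

	p->first = malloc(sizeof *p->first);
	if (!p->first)
		goto err;

	p->second = fail_second ? NULL : malloc(sizeof *p->second);
	if (!p->second)
		goto err;

	return 0;
err:
	pair_destroy(p);	/* one label suffices: destroy tolerates unset members */
	return -1;
}
```

The caller (rdma_create_qp in the patch) still needs ordered labels of its own, since it has to undo the QP before the CQs.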
[PATCH 12/37] librdmacm: support synchronous rdma_cm_id's
Allow the user to specify NULL as the rdma_event_channel in order to indicate that the rdma_cm_id should process all requests synchronously. Signed-off-by: Sean Hefty sean.he...@intel.com --- include/rdma/rdma_cma.h |1 + src/cma.c | 93 +++ 2 files changed, 78 insertions(+), 16 deletions(-) diff --git a/include/rdma/rdma_cma.h b/include/rdma/rdma_cma.h index a071a9b..83418c3 100644 --- a/include/rdma/rdma_cma.h +++ b/include/rdma/rdma_cma.h @@ -114,6 +114,7 @@ struct rdma_cm_id { struct rdma_routeroute; enum rdma_port_space ps; uint8_t port_num; + struct rdma_cm_event*event; }; struct rdma_conn_param { diff --git a/src/cma.c b/src/cma.c index 4025aeb..c7a3a7b 100644 --- a/src/cma.c +++ b/src/cma.c @@ -106,6 +106,7 @@ struct cma_id_private { struct cma_device *cma_dev; int events_completed; int connect_error; + int sync; pthread_cond_tcond; pthread_mutex_t mut; uint32_t handle; @@ -333,6 +334,9 @@ static void ucma_free_id(struct cma_id_private *id_priv) pthread_mutex_destroy(id_priv-mut); if (id_priv-id.route.path_rec) free(id_priv-id.route.path_rec); + + if (id_priv-sync) + rdma_destroy_event_channel(id_priv-id.channel); free(id_priv); } @@ -348,7 +352,16 @@ static struct cma_id_private *ucma_alloc_id(struct rdma_event_channel *channel, id_priv-id.context = context; id_priv-id.ps = ps; - id_priv-id.channel = channel; + + if (!channel) { + id_priv-id.channel = rdma_create_event_channel(); + if (!id_priv-id.channel) + goto err; + id_priv-sync = 1; + } else { + id_priv-id.channel = channel; + } + pthread_mutex_init(id_priv-mut, NULL); if (pthread_cond_init(id_priv-cond, NULL)) goto err; @@ -381,7 +394,7 @@ int rdma_create_id(struct rdma_event_channel *channel, cmd-uid = (uintptr_t) id_priv; cmd-ps = ps; - ret = write(channel-fd, msg, size); + ret = write(id_priv-id.channel-fd, msg, size); if (ret != size) goto err; @@ -424,6 +437,9 @@ int rdma_destroy_id(struct rdma_cm_id *id) if (ret 0) return ret; + if (id_priv-id.event) + rdma_ack_cm_event(id_priv-id.event); + 
pthread_mutex_lock(id_priv-mut); while (id_priv-events_completed ret) pthread_cond_wait(id_priv-cond, id_priv-mut); @@ -694,6 +710,25 @@ int rdma_bind_addr(struct rdma_cm_id *id, struct sockaddr *addr) return ucma_query_route(id); } +static int ucma_complete(struct cma_id_private *id_priv) +{ + int ret; + + if (!id_priv-sync) + return 0; + + if (id_priv-id.event) { + rdma_ack_cm_event(id_priv-id.event); + id_priv-id.event = NULL; + } + + ret = rdma_get_cm_event(id_priv-id.channel, id_priv-id.event); + if (ret) + return ret; + + return id_priv-id.event-status; +} + static int rdma_resolve_addr2(struct rdma_cm_id *id, struct sockaddr *src_addr, socklen_t src_len, struct sockaddr *dst_addr, socklen_t dst_len, int timeout_ms) @@ -718,7 +753,7 @@ static int rdma_resolve_addr2(struct rdma_cm_id *id, struct sockaddr *src_addr, return (ret = 0) ? ERR(ENODATA) : -1; memcpy(id-route.addr.dst_addr, dst_addr, dst_len); - return 0; + return ucma_complete(id_priv); } int rdma_resolve_addr(struct rdma_cm_id *id, struct sockaddr *src_addr, @@ -751,7 +786,7 @@ int rdma_resolve_addr(struct rdma_cm_id *id, struct sockaddr *src_addr, return (ret = 0) ? ERR(ENODATA) : -1; memcpy(id-route.addr.dst_addr, dst_addr, dst_len); - return 0; + return ucma_complete(id_priv); } int rdma_resolve_route(struct rdma_cm_id *id, int timeout_ms) @@ -770,7 +805,7 @@ int rdma_resolve_route(struct rdma_cm_id *id, int timeout_ms) if (ret != size) return (ret = 0) ? ERR(ENODATA) : -1; - return 0; + return ucma_complete(id_priv); } static int ucma_is_ud_ps(enum rdma_port_space ps) @@ -1074,7 +1109,7 @@ int rdma_connect(struct rdma_cm_id *id, struct rdma_conn_param *conn_param) if (ret != size) return (ret = 0) ? ERR(ENODATA) : -1; - return 0; + return ucma_complete(id_priv); } int rdma_listen(struct rdma_cm_id *id, int backlog) @@ -1139,7 +1174,7 @@ int rdma_accept(struct rdma_cm_id *id, struct rdma_conn_param *conn_param) return (ret = 0) ? 
ERR(ENODATA) : -1; } - return 0; + return ucma_complete(id_priv); } int
[PATCH 11/37] librdmacm: add zalloc call
Signed-off-by: Sean Hefty <sean.he...@intel.com>
---
 src/cma.c |    6 ++----
 src/cma.h |   10 ++++++++++
 2 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/src/cma.c b/src/cma.c
index a85448b..4025aeb 100644
--- a/src/cma.c
+++ b/src/cma.c
@@ -342,11 +342,10 @@ static struct cma_id_private *ucma_alloc_id(struct rdma_event_channel *channel,
 {
 	struct cma_id_private *id_priv;
 
-	id_priv = malloc(sizeof *id_priv);
+	id_priv = zalloc(sizeof *id_priv);
 	if (!id_priv)
 		return NULL;
 
-	memset(id_priv, 0, sizeof *id_priv);
 	id_priv->id.context = context;
 	id_priv->id.ps = ps;
 	id_priv->id.channel = channel;
@@ -1228,11 +1227,10 @@ static int rdma_join_multicast2(struct rdma_cm_id *id, struct sockaddr *addr,
 	int ret, size;
 
 	id_priv = container_of(id, struct cma_id_private, id);
-	mc = malloc(sizeof *mc);
+	mc = zalloc(sizeof *mc);
 	if (!mc)
 		return ERR(ENOMEM);
 
-	memset(mc, 0, sizeof *mc);
 	mc->context = context;
 	mc->id_priv = id_priv;
 	memcpy(&mc->addr, addr, addrlen);
diff --git a/src/cma.h b/src/cma.h
index 1c0ab8b..fcfb1f7 100644
--- a/src/cma.h
+++ b/src/cma.h
@@ -42,6 +42,7 @@
 #include <errno.h>
 #include <endian.h>
 #include <byteswap.h>
+#include <string.h>
 
 #ifdef INCLUDE_VALGRIND
 # include <valgrind/memcheck.h>
@@ -70,5 +71,14 @@ static inline int ERR(int err)
 	return -1;
 }
 
+static inline void *zalloc(size_t size)
+{
+	void *buf;
+
+	if ((buf = malloc(size)))
+		memset(buf, 0, size);
+	return buf;
+}
+
 #endif /* CMA_H */
 
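A short standalone sketch of what the helper buys at a call site. The zalloc body follows the patch, except for the size check, which is an addition of this sketch; struct conn_state and conn_state_new are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Same shape as the helper in the patch: malloc, then clear on success. */
static void *zalloc(size_t size)
{
	void *buf;

	assert(size > 0);	/* malloc(0) is implementation-defined; reject it here */
	buf = malloc(size);
	if (buf)
		memset(buf, 0, size);
	return buf;
}

struct conn_state {	/* hypothetical struct, for illustration only */
	int sync;
	void *event;
	unsigned char depth;
};

/* Fields start out cleared without a separate memset at the call site. */
static struct conn_state *conn_state_new(void)
{
	return zalloc(sizeof(struct conn_state));
}
```

Standard calloc(1, size) gives the same result; the patch presumably keeps a local helper so that call sites read like the malloc they replace.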
[PATCH 10/37] librdmacm: move common definitions to internal header file
Signed-off-by: Sean Hefty sean.he...@intel.com --- Makefile.am |2 +- src/cma.c | 28 +- src/cma.h | 74 +++ 3 files changed, 76 insertions(+), 28 deletions(-) diff --git a/Makefile.am b/Makefile.am index 2898ad9..c9be437 100644 --- a/Makefile.am +++ b/Makefile.am @@ -70,7 +70,7 @@ man_MANS = \ EXTRA_DIST = include/rdma/rdma_cma_abi.h include/rdma/rdma_cma.h \ include/infiniband/ib.h \ -src/librdmacm.map librdmacm.spec.in $(man_MANS) +src/cma.h src/librdmacm.map librdmacm.spec.in $(man_MANS) dist-hook: librdmacm.spec cp librdmacm.spec $(distdir) diff --git a/src/cma.c b/src/cma.c index c83d9d2..a85448b 100644 --- a/src/cma.c +++ b/src/cma.c @@ -50,39 +50,13 @@ #include byteswap.h #include stddef.h +#include cma.h #include infiniband/driver.h #include infiniband/marshall.h #include rdma/rdma_cma.h #include rdma/rdma_cma_abi.h #include infiniband/ib.h -#ifdef INCLUDE_VALGRIND -# include valgrind/memcheck.h -# ifndef VALGRIND_MAKE_MEM_DEFINED -# warning Valgrind requested, but VALGRIND_MAKE_MEM_DEFINED undefined -# endif -#endif - -#ifndef VALGRIND_MAKE_MEM_DEFINED -# define VALGRIND_MAKE_MEM_DEFINED(addr,len) -#endif - -#define PFX librdmacm: - -#if __BYTE_ORDER == __LITTLE_ENDIAN -static inline uint64_t htonll(uint64_t x) { return bswap_64(x); } -static inline uint64_t ntohll(uint64_t x) { return bswap_64(x); } -#else -static inline uint64_t htonll(uint64_t x) { return x; } -static inline uint64_t ntohll(uint64_t x) { return x; } -#endif - -static inline int ERR(int err) -{ - errno = err; - return -1; -} - #define CMA_CREATE_MSG_CMD_RESP(msg, cmd, resp, type, size) \ do {\ struct ucma_abi_cmd_hdr *hdr; \ diff --git a/src/cma.h b/src/cma.h new file mode 100644 index 000..1c0ab8b --- /dev/null +++ b/src/cma.h @@ -0,0 +1,74 @@ +/* + * Copyright (c) 2005-2010 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + *copyright notice, this list of conditions and the following + *disclaimer. + * + * - Redistributions in binary form must reproduce the above + *copyright notice, this list of conditions and the following + *disclaimer in the documentation and/or other materials + *provided with the distribution. + * + * THE SOFTWARE IS PROVIDED AS IS, WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + */ + +#if !defined(CMA_H) +#define CMA_H + +#if HAVE_CONFIG_H +# include config.h +#endif /* HAVE_CONFIG_H */ + +#include stdlib.h +#include errno.h +#include endian.h +#include byteswap.h + +#ifdef INCLUDE_VALGRIND +# include valgrind/memcheck.h +# ifndef VALGRIND_MAKE_MEM_DEFINED +# warning Valgrind requested, but VALGRIND_MAKE_MEM_DEFINED undefined +# endif +#endif + +#ifndef VALGRIND_MAKE_MEM_DEFINED +# define VALGRIND_MAKE_MEM_DEFINED(addr,len) +#endif + +#define PFX librdmacm: + +#if __BYTE_ORDER == __LITTLE_ENDIAN +static inline uint64_t htonll(uint64_t x) { return bswap_64(x); } +static inline uint64_t ntohll(uint64_t x) { return bswap_64(x); } +#else +static inline uint64_t htonll(uint64_t x) { return x; } +static inline uint64_t ntohll(uint64_t x) { return x; } +#endif + +static inline int ERR(int err) +{ + errno = err; + return -1; +} + +#endif /* CMA_H */ + -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
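The htonll/ntohll pair moved into cma.h selects between bswap_64 and a no-op at compile time, based on __BYTE_ORDER. As a point of comparison only, the sketch below shows a different technique that needs no byte-order test at all: it writes and reads the big-endian byte sequence explicitly. put_be64 and get_be64 are names invented here, not librdmacm functions.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Store a 64-bit host value as 8 big-endian (network order) bytes.
 * Shifting works on values, so the result does not depend on host
 * byte order.
 */
static void put_be64(uint64_t host, unsigned char out[8])
{
	int i;

	assert(out);
	for (i = 0; i < 8; i++)
		out[i] = (unsigned char) (host >> (56 - 8 * i));
}

/* Inverse of put_be64: rebuild the host value from network-order bytes. */
static uint64_t get_be64(const unsigned char in[8])
{
	uint64_t host = 0;
	int i;

	assert(in);
	for (i = 0; i < 8; i++)
		host = (host << 8) | in[i];
	return host;
}
```

The bswap_64 form in the patch is likely cheaper when the value already sits in a struct field; the byte-wise form is mainly useful when serializing into a buffer.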
[PATCH 5/37] librdmacm: replace query_route call with separate queries
To support other address families and multiple path records, replace the query_route call with specific query calls to obtain only the desired information. Signed-off-by: Sean Hefty sean.he...@intel.com --- src/cma.c | 69 + 1 files changed, 60 insertions(+), 9 deletions(-) diff --git a/src/cma.c b/src/cma.c index 2a70d20..c57d166 100644 --- a/src/cma.c +++ b/src/cma.c @@ -161,6 +161,7 @@ static struct cma_device *cma_dev_array; static int cma_dev_cnt; static pthread_mutex_t mut = PTHREAD_MUTEX_INITIALIZER; static int abi_ver = RDMA_USER_CM_MAX_ABI_VERSION; +int af_ib_support; #define container_of(ptr, type, field) \ ((type *) ((void *)ptr - offsetof(type, field))) @@ -627,7 +628,7 @@ static int ucma_query_route(struct rdma_cm_id *id) struct cma_id_private *id_priv; void *msg; int ret, size, i; - + CMA_CREATE_MSG_CMD_RESP(msg, cmd, resp, UCMA_CMD_QUERY_ROUTE, size); id_priv = container_of(id, struct cma_id_private, id); cmd-id = id_priv-handle; @@ -1060,7 +1061,10 @@ int rdma_listen(struct rdma_cm_id *id, int backlog) if (ret != size) return (ret = 0) ? 
ERR(ENODATA) : -1; - return ucma_query_route(id); + if (af_ib_support) + return ucma_query_addr(id); + else + return ucma_query_route(id); } int rdma_accept(struct rdma_cm_id *id, struct rdma_conn_param *conn_param) @@ -1326,6 +1330,57 @@ int rdma_ack_cm_event(struct rdma_cm_event *event) return 0; } +static void ucma_process_addr_resolved(struct cma_event *evt) +{ + if (af_ib_support) { + evt-event.status = ucma_query_addr(evt-id_priv-id); + if (!evt-event.status + evt-id_priv-id.verbs-device-transport_type == IBV_TRANSPORT_IB) + evt-event.status = ucma_query_gid(evt-id_priv-id); + } else { + evt-event.status = ucma_query_route(evt-id_priv-id); + } + + if (evt-event.status) + evt-event.event = RDMA_CM_EVENT_ADDR_ERROR; +} + +static void ucma_process_route_resolved(struct cma_event *evt) +{ + if (evt-id_priv-id.verbs-device-transport_type != IBV_TRANSPORT_IB) + return; + + if (af_ib_support) + evt-event.status = ucma_query_path(evt-id_priv-id); + else + evt-event.status = ucma_query_route(evt-id_priv-id); + + if (evt-event.status) + evt-event.event = RDMA_CM_EVENT_ROUTE_ERROR; +} + +static int ucma_query_req_info(struct rdma_cm_id *id) +{ + int ret; + + if (!af_ib_support) + return ucma_query_route(id); + + ret = ucma_query_addr(id); + if (ret) + return ret; + + ret = ucma_query_gid(id); + if (ret) + return ret; + + ret = ucma_query_path(id); + if (ret) + return ret; + + return 0; +} + static int ucma_process_conn_req(struct cma_event *evt, uint32_t handle) { @@ -1344,7 +1399,7 @@ static int ucma_process_conn_req(struct cma_event *evt, evt-event.id = id_priv-id; id_priv-handle = handle; - ret = ucma_query_route(id_priv-id); + ret = ucma_query_req_info(id_priv-id); if (ret) { rdma_destroy_id(id_priv-id); goto err; @@ -1473,14 +1528,10 @@ retry: switch (resp-event) { case RDMA_CM_EVENT_ADDR_RESOLVED: - evt-event.status = ucma_query_route(evt-id_priv-id); - if (evt-event.status) - evt-event.event = RDMA_CM_EVENT_ADDR_ERROR; + ucma_process_addr_resolved(evt); break; 
case RDMA_CM_EVENT_ROUTE_RESOLVED: - evt-event.status = ucma_query_route(evt-id_priv-id); - if (evt-event.status) - evt-event.event = RDMA_CM_EVENT_ROUTE_ERROR; + ucma_process_route_resolved(evt); break; case RDMA_CM_EVENT_CONNECT_REQUEST: evt-id_priv = (void *) (uintptr_t) resp-uid; -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
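Most of the functions touched here start by recovering the private id from the public rdma_cm_id with container_of. A minimal sketch of that idiom is below. The macro in cma.c subtracts from a void pointer, which relies on a GNU extension; this variant uses char * instead so it stays within standard C. The two structs and handle_of are simplified stand-ins, not the real librdmacm types.

```c
#include <assert.h>
#include <stddef.h>

/* Portable variant: do the pointer arithmetic on char *, not void *. */
#define container_of(ptr, type, field) \
	((type *) ((char *) (ptr) - offsetof(type, field)))

struct public_id {	/* what the caller sees */
	int port_num;
};

struct private_id {	/* what the library allocates */
	int handle;
	struct public_id id;	/* embedded, not pointed to */
};

/* Recover the library's bookkeeping from the pointer given to the user. */
static int handle_of(struct public_id *id)
{
	assert(id);
	return container_of(id, struct private_id, id)->handle;
}
```

This only works because the public struct is embedded in the private one; passing a public_id that was allocated on its own would read out of bounds.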
[PATCH 3/37] librdmacm: add ability to query IB path records
The current query_route command only supports 2 path records.
Add support for query_path, which is capable of supporting
multiple paths.

Signed-off-by: Sean Hefty <sean.he...@intel.com>
---
 include/rdma/rdma_cma_abi.h |   10 +++++++++-
 src/cma.c                   |   82 ++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 91 insertions(+), 1 deletions(-)

diff --git a/include/rdma/rdma_cma_abi.h b/include/rdma/rdma_cma_abi.h
index 5c736fb..6c83fe8 100644
--- a/include/rdma/rdma_cma_abi.h
+++ b/include/rdma/rdma_cma_abi.h
@@ -35,6 +35,7 @@
 
 #include <infiniband/kern-abi.h>
 #include <infiniband/sa-kern-abi.h>
+#include <infiniband/sa.h>
 
 /*
  * This file must be kept in sync with the kernel's version of rdma_user_cm.h
@@ -114,7 +115,8 @@ struct ucma_abi_resolve_route {
 };
 
 enum {
-	UCMA_QUERY_ADDR
+	UCMA_QUERY_ADDR,
+	UCMA_QUERY_PATH
 };
 
 struct ucma_abi_query {
@@ -144,6 +146,12 @@ struct ucma_abi_query_addr_resp {
 	struct sockaddr_storage dst_addr;
 };
 
+struct ucma_abi_query_path_resp {
+	__u32 num_paths;
+	__u32 reserved;
+	struct ib_path_data path_data[0];
+};
+
 struct ucma_abi_conn_param {
 	__u32 qp_num;
 	__u32 reserved;
diff --git a/src/cma.c b/src/cma.c
index 2aef594..c3c6b73 100644
--- a/src/cma.c
+++ b/src/cma.c
@@ -506,6 +506,88 @@ static int ucma_query_addr(struct rdma_cm_id *id)
 	return 0;
 }
 
+static void ucma_convert_path(struct ib_path_data *path_data,
+			      struct ibv_sa_path_rec *sa_path)
+{
+	uint32_t fl_hop;
+
+	sa_path->dgid = path_data->path.dgid;
+	sa_path->sgid = path_data->path.sgid;
+	sa_path->dlid = path_data->path.dlid;
+	sa_path->slid = path_data->path.slid;
+	sa_path->raw_traffic = 0;
+
+	fl_hop = ntohl(path_data->path.flowlabel_hoplimit);
+	sa_path->flow_label = htonl(fl_hop >> 8);
+	sa_path->hop_limit = (uint8_t) fl_hop;
+
+	sa_path->traffic_class = path_data->path.tclass;
+	sa_path->reversible = path_data->path.reversible_numpath >> 7;
+	sa_path->numb_path = 1;
+	sa_path->pkey = path_data->path.pkey;
+	sa_path->sl = ntohs(path_data->path.qosclass_sl) & 0xF;
+	sa_path->mtu_selector = 1;
+	sa_path->mtu = path_data->path.mtu & 0x1F;
+	sa_path->rate_selector = 1;
+	sa_path->rate = path_data->path.rate & 0x1F;
+	sa_path->packet_life_time_selector = 1;
+	sa_path->packet_life_time = path_data->path.packetlifetime & 0x1F;
+
+	sa_path->preference = (uint8_t) path_data->flags;
+}
+
+static int ucma_query_path(struct rdma_cm_id *id)
+{
+	struct ucma_abi_query_path_resp *resp;
+	struct ucma_abi_query *cmd;
+	struct ucma_abi_cmd_hdr *hdr;
+	struct cma_id_private *id_priv;
+	void *msg;
+	int ret, size, i;
+
+	size = sizeof(*hdr) + sizeof(*cmd);
+	msg = alloca(size);
+	if (!msg)
+		return ERR(ENOMEM);
+
+	hdr = msg;
+	cmd = msg + sizeof(*hdr);
+
+	hdr->cmd = UCMA_CMD_QUERY;
+	hdr->in  = sizeof(*cmd);
+	hdr->out = sizeof(*resp) + sizeof(struct ib_path_data) * 6;
+
+	memset(cmd, 0, sizeof(*cmd));
+
+	resp = alloca(hdr->out);
+	if (!resp)
+		return ERR(ENOMEM);
+
+	id_priv = container_of(id, struct cma_id_private, id);
+	cmd->response = (uintptr_t) resp;
+	cmd->id = id_priv->handle;
+	cmd->option = UCMA_QUERY_PATH;
+
+	ret = write(id->channel->fd, msg, size);
+	if (ret != size)
+		return (ret >= 0) ? ERR(ENODATA) : -1;
+
+	VALGRIND_MAKE_MEM_DEFINED(resp, hdr->out);
+
+	if (resp->num_paths) {
+		id->route.path_rec = malloc(sizeof(*id->route.path_rec) *
+					    resp->num_paths);
+		if (!id->route.path_rec)
+			return ERR(ENOMEM);
+
+		id->route.num_paths = resp->num_paths;
+		for (i = 0; i < resp->num_paths; i++)
+			ucma_convert_path(&resp->path_data[i], &id->route.path_rec[i]);
+	}
+
+	return 0;
+}
+
 static int ucma_query_route(struct rdma_cm_id *id)
 {
 	struct ucma_abi_query_route_resp *resp;
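The flowlabel_hoplimit handling in ucma_convert_path is the least obvious part of the conversion, so here is a small standalone sketch of the unpacking. It assumes the usual IB path record layout (hop limit in the low 8 bits, a 20-bit flow label above it). Unlike the patch, it masks the label to 20 bits and returns it in host byte order. unpack_fl_hop is a name invented for this example.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>

/*
 * Split the packed 32-bit word (network byte order on the wire) into
 * flow label and hop limit. Field widths are an assumption of this
 * sketch, taken from the usual IB path record layout.
 */
static void unpack_fl_hop(uint32_t wire, uint32_t *flow_label, uint8_t *hop_limit)
{
	uint32_t host = ntohl(wire);

	assert(flow_label && hop_limit);
	*flow_label = (host >> 8) & 0xFFFFF;	/* 20 bits, host byte order */
	*hop_limit = (uint8_t) host;		/* low 8 bits */
}
```

The byte swap has to happen before the shift; shifting the wire value directly would pick up the wrong bytes on a little-endian host.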
[PATCH 1/37] librdmacm: name changes to indicate only IP addresses supported
Several commands to the kernel RDMA CM only support IP addresses because of limitations in the structure definition. Update the library to match the name changes in the kernel and indicate that only IP addresses can be used with the current commands. Signed-off-by: Sean Hefty sean.he...@intel.com --- include/rdma/rdma_cma_abi.h | 12 ++-- src/cma.c | 12 ++-- 2 files changed, 12 insertions(+), 12 deletions(-) diff --git a/include/rdma/rdma_cma_abi.h b/include/rdma/rdma_cma_abi.h index 1a3a9c2..e51a372 100644 --- a/include/rdma/rdma_cma_abi.h +++ b/include/rdma/rdma_cma_abi.h @@ -48,8 +48,8 @@ enum { UCMA_CMD_CREATE_ID, UCMA_CMD_DESTROY_ID, - UCMA_CMD_BIND_ADDR, - UCMA_CMD_RESOLVE_ADDR, + UCMA_CMD_BIND_IP, + UCMA_CMD_RESOLVE_IP, UCMA_CMD_RESOLVE_ROUTE, UCMA_CMD_QUERY_ROUTE, UCMA_CMD_CONNECT, @@ -62,7 +62,7 @@ enum { UCMA_CMD_GET_OPTION, UCMA_CMD_SET_OPTION, UCMA_CMD_NOTIFY, - UCMA_CMD_JOIN_MCAST, + UCMA_CMD_JOIN_IP_MCAST, UCMA_CMD_LEAVE_MCAST, UCMA_CMD_MIGRATE_ID }; @@ -94,13 +94,13 @@ struct ucma_abi_destroy_id_resp { __u32 events_reported; }; -struct ucma_abi_bind_addr { +struct ucma_abi_bind_ip { __u64 response; struct sockaddr_in6 addr; __u32 id; }; -struct ucma_abi_resolve_addr { +struct ucma_abi_resolve_ip { struct sockaddr_in6 src_addr; struct sockaddr_in6 dst_addr; __u32 id; @@ -192,7 +192,7 @@ struct ucma_abi_notify { __u32 event; }; -struct ucma_abi_join_mcast { +struct ucma_abi_join_ip_mcast { __u64 response; /* ucma_abi_create_id_resp */ __u64 uid; struct sockaddr_in6 addr; diff --git a/src/cma.c b/src/cma.c index 59e89dd..b5f71d0 100644 --- a/src/cma.c +++ b/src/cma.c @@ -525,7 +525,7 @@ static int ucma_query_route(struct rdma_cm_id *id) int rdma_bind_addr(struct rdma_cm_id *id, struct sockaddr *addr) { - struct ucma_abi_bind_addr *cmd; + struct ucma_abi_bind_ip *cmd; struct cma_id_private *id_priv; void *msg; int ret, size, addrlen; @@ -534,7 +534,7 @@ int rdma_bind_addr(struct rdma_cm_id *id, struct sockaddr *addr) if (!addrlen) return ERR(EINVAL); - 
CMA_CREATE_MSG_CMD(msg, cmd, UCMA_CMD_BIND_ADDR, size); + CMA_CREATE_MSG_CMD(msg, cmd, UCMA_CMD_BIND_IP, size); id_priv = container_of(id, struct cma_id_private, id); cmd-id = id_priv-handle; memcpy(cmd-addr, addr, addrlen); @@ -549,7 +549,7 @@ int rdma_bind_addr(struct rdma_cm_id *id, struct sockaddr *addr) int rdma_resolve_addr(struct rdma_cm_id *id, struct sockaddr *src_addr, struct sockaddr *dst_addr, int timeout_ms) { - struct ucma_abi_resolve_addr *cmd; + struct ucma_abi_resolve_ip *cmd; struct cma_id_private *id_priv; void *msg; int ret, size, daddrlen; @@ -558,7 +558,7 @@ int rdma_resolve_addr(struct rdma_cm_id *id, struct sockaddr *src_addr, if (!daddrlen) return ERR(EINVAL); - CMA_CREATE_MSG_CMD(msg, cmd, UCMA_CMD_RESOLVE_ADDR, size); + CMA_CREATE_MSG_CMD(msg, cmd, UCMA_CMD_RESOLVE_IP, size); id_priv = container_of(id, struct cma_id_private, id); cmd-id = id_priv-handle; if (src_addr) @@ -1037,7 +1037,7 @@ int rdma_disconnect(struct rdma_cm_id *id) int rdma_join_multicast(struct rdma_cm_id *id, struct sockaddr *addr, void *context) { - struct ucma_abi_join_mcast *cmd; + struct ucma_abi_join_ip_mcast *cmd; struct ucma_abi_create_id_resp *resp; struct cma_id_private *id_priv; struct cma_multicast *mc, **pos; @@ -1067,7 +1067,7 @@ int rdma_join_multicast(struct rdma_cm_id *id, struct sockaddr *addr, id_priv-mc_list = mc; pthread_mutex_unlock(id_priv-mut); - CMA_CREATE_MSG_CMD_RESP(msg, cmd, resp, UCMA_CMD_JOIN_MCAST, size); + CMA_CREATE_MSG_CMD_RESP(msg, cmd, resp, UCMA_CMD_JOIN_IP_MCAST, size); cmd-id = id_priv-handle; memcpy(cmd-addr, addr, addrlen); cmd-uid = (uintptr_t) mc; -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/37] librdmacm: support querying AF_IB addresses
The current query route command returns path record data and address
information.  The latter is restricted to sizeof(sockaddr_in6).  In order
to support AF_IB, modify the library to use the new query addr command,
which supports larger address sizes and avoids querying for path record
data when none is available.

Signed-off-by: Sean Hefty <sean.he...@intel.com>
---
 include/rdma/rdma_cma_abi.h | 22 +++---
 src/cma.c                   | 35 ++-
 2 files changed, 53 insertions(+), 4 deletions(-)

diff --git a/include/rdma/rdma_cma_abi.h b/include/rdma/rdma_cma_abi.h
index e51a372..5c736fb 100644
--- a/include/rdma/rdma_cma_abi.h
+++ b/include/rdma/rdma_cma_abi.h
@@ -64,7 +64,8 @@ enum {
 	UCMA_CMD_NOTIFY,
 	UCMA_CMD_JOIN_IP_MCAST,
 	UCMA_CMD_LEAVE_MCAST,
-	UCMA_CMD_MIGRATE_ID
+	UCMA_CMD_MIGRATE_ID,
+	UCMA_CMD_QUERY
 };

 struct ucma_abi_cmd_hdr {
@@ -112,10 +113,14 @@ struct ucma_abi_resolve_route {
 	__u32 timeout_ms;
 };

-struct ucma_abi_query_route {
+enum {
+	UCMA_QUERY_ADDR
+};
+
+struct ucma_abi_query {
 	__u64 response;
 	__u32 id;
-	__u32 reserved;
+	__u32 option;
 };

 struct ucma_abi_query_route_resp {
@@ -128,6 +133,17 @@ struct ucma_abi_query_route_resp {
 	__u8 reserved[3];
 };

+struct ucma_abi_query_addr_resp {
+	__u64 node_guid;
+	__u8  port_num;
+	__u8  reserved;
+	__u16 pkey;
+	__u16 src_size;
+	__u16 dst_size;
+	struct sockaddr_storage src_addr;
+	struct sockaddr_storage dst_addr;
+};
+
 struct ucma_abi_conn_param {
 	__u32 qp_num;
 	__u32 reserved;
diff --git a/src/cma.c b/src/cma.c
index b5f71d0..2aef594 100644
--- a/src/cma.c
+++ b/src/cma.c
@@ -473,10 +473,43 @@ static int ucma_addrlen(struct sockaddr *addr)
 	}
 }

+static int ucma_query_addr(struct rdma_cm_id *id)
+{
+	struct ucma_abi_query_addr_resp *resp;
+	struct ucma_abi_query *cmd;
+	struct cma_id_private *id_priv;
+	void *msg;
+	int ret, size;
+
+	CMA_CREATE_MSG_CMD_RESP(msg, cmd, resp, UCMA_CMD_QUERY, size);
+	id_priv = container_of(id, struct cma_id_private, id);
+	cmd->id = id_priv->handle;
+	cmd->option = UCMA_QUERY_ADDR;
+
+	ret = write(id->channel->fd, msg, size);
+	if (ret != size)
+		return (ret >= 0) ? ERR(ENODATA) : -1;
+
+	VALGRIND_MAKE_MEM_DEFINED(resp, sizeof *resp);
+
+	memcpy(&id->route.addr.src_addr, &resp->src_addr, resp->src_size);
+	memcpy(&id->route.addr.dst_addr, &resp->dst_addr, resp->dst_size);
+
+	if (!id_priv->cma_dev && resp->node_guid) {
+		ret = ucma_get_device(id_priv, resp->node_guid);
+		if (ret)
+			return ret;
+		id->port_num = resp->port_num;
+		id->route.addr.addr.ibaddr.pkey = resp->pkey;
+	}
+
+	return 0;
+}
+
 static int ucma_query_route(struct rdma_cm_id *id)
 {
 	struct ucma_abi_query_route_resp *resp;
-	struct ucma_abi_query_route *cmd;
+	struct ucma_abi_query *cmd;
 	struct cma_id_private *id_priv;
 	void *msg;
 	int ret, size, i;
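For readers following the error handling in ucma_query_addr(), the check on
the result of write() can be sketched on its own.  This is only an
illustration: sketch_check_write() and SKETCH_ERR() are made-up names, and
SKETCH_ERR() is assumed to behave like the library's ERR() macro (set errno,
evaluate to -1).

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Assumption: mirrors the library's ERR() macro (set errno, yield -1). */
#define SKETCH_ERR(err) (errno = (err), -1)

/*
 * Classify the result of write(fd, msg, size) on the CM channel.
 * Returns 0 when the whole command was written, -1 otherwise.
 */
static int sketch_check_write(ssize_t ret, size_t size)
{
	if (ret >= 0 && (size_t) ret == size)
		return 0;

	/* Short write: errno was not set by write(), so report ENODATA. */
	if (ret >= 0)
		return SKETCH_ERR(ENODATA);

	/* ret < 0: write() already set errno, pass the failure through. */
	return -1;
}
```

A caller would pass the return value of write() and the message size, then
inspect errno only when the helper returns -1.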
[PATCH 4/37] librdmacm: add support to query GIDs
Support query GID ABI to obtain GID information separately from path record data and sa_family addressing. This patch also adds the definition for sockaddr_ib for userspace. Signed-off-by: Sean Hefty sean.he...@intel.com --- Makefile.am |6 ++- include/infiniband/ib.h | 97 +++ include/rdma/rdma_cma.h |4 +- include/rdma/rdma_cma_abi.h |3 + src/cma.c | 32 ++ 5 files changed, 137 insertions(+), 5 deletions(-) diff --git a/Makefile.am b/Makefile.am index 290cbc3..2898ad9 100644 --- a/Makefile.am +++ b/Makefile.am @@ -27,10 +27,11 @@ examples_udaddy_LDADD = $(top_builddir)/src/librdmacm.la examples_mckey_SOURCES = examples/mckey.c examples_mckey_LDADD = $(top_builddir)/src/librdmacm.la -librdmacmincludedir = $(includedir)/rdma +librdmacmincludedir = $(includedir)/rdma $(includedir)/infiniband librdmacminclude_HEADERS = include/rdma/rdma_cma_abi.h \ - include/rdma/rdma_cma.h + include/rdma/rdma_cma.h \ + include/infiniband/ib.h man_MANS = \ man/rdma_accept.3 \ @@ -68,6 +69,7 @@ man_MANS = \ man/rdma_cm.7 EXTRA_DIST = include/rdma/rdma_cma_abi.h include/rdma/rdma_cma.h \ +include/infiniband/ib.h \ src/librdmacm.map librdmacm.spec.in $(man_MANS) dist-hook: librdmacm.spec diff --git a/include/infiniband/ib.h b/include/infiniband/ib.h new file mode 100644 index 000..3a97322 --- /dev/null +++ b/include/infiniband/ib.h @@ -0,0 +1,97 @@ +/* + * Copyright (c) 2010 Intel Corporation. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. 
You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + *copyright notice, this list of conditions and the following + *disclaimer. + * + * - Redistributions in binary form must reproduce the above + *copyright notice, this list of conditions and the following + *disclaimer in the documentation and/or other materials + *provided with the distribution. + * + * THE SOFTWARE IS PROVIDED AS IS, WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ */ + +#if !defined(_RDMA_IB_H) +#define _RDMA_IB_H + +#include linux/types.h +#include string.h + +#ifndef AF_IB +#define AF_IB 27 +#endif +#ifndef PF_IB +#define PF_IB AF_IB +#endif + +struct ib_addr { + union { + __u8uib_addr8[16]; + __be16 uib_addr16[8]; + __be32 uib_addr32[4]; + __be64 uib_addr64[2]; + } ib_u; +#define sib_addr8 ib_u.uib_addr8 +#define sib_addr16 ib_u.uib_addr16 +#define sib_addr32 ib_u.uib_addr32 +#define sib_addr64 ib_u.uib_addr64 +#define sib_rawib_u.uib_addr8 +#define sib_subnet_prefix ib_u.uib_addr64[0] +#define sib_interface_id ib_u.uib_addr64[1] +}; + +static inline int ib_addr_any(const struct ib_addr *a) +{ + return ((a-sib_addr64[0] | a-sib_addr64[1]) == 0); +} + +static inline int ib_addr_loopback(const struct ib_addr *a) +{ + return ((a-sib_addr32[0] | a-sib_addr32[1] | +a-sib_addr32[2] | (a-sib_addr32[3] ^ htonl(1))) == 0); +} + +static inline void ib_addr_set(struct ib_addr *addr, + __be32 w1, __be32 w2, __be32 w3, __be32 w4) +{ + addr-sib_addr32[0] = w1; + addr-sib_addr32[1] = w2; + addr-sib_addr32[2] = w3; + addr-sib_addr32[3] = w4; +} + +static inline int ib_addr_cmp(const struct ib_addr *a1, const struct ib_addr *a2) +{ + return memcmp(a1, a2, sizeof(struct ib_addr)); +} + +struct sockaddr_ib { + unsigned short int sib_family; /* AF_IB */ + __be16 sib_pkey; + __be32 sib_flowinfo; + struct ib_addr sib_addr; + __be64 sib_sid; + __be64 sib_sid_mask; + __u64 sib_scope_id; +}; + +#endif /* _RDMA_IB_H */ diff --git a/include/rdma/rdma_cma.h
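As a rough illustration of how the ib_addr helpers in the new header behave,
here is a standalone sketch.  It does not include the proposed header; the
sketch_* names and the use of stdint types in place of __u8/__be32/__be64 are
assumptions made so the fragment compiles on its own.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for struct ib_addr: a 128-bit address viewed at several widths. */
struct sketch_ib_addr {
	union {
		uint8_t  addr8[16];
		uint32_t addr32[4];	/* network byte order */
		uint64_t addr64[2];	/* network byte order */
	} u;
};

/* True for the all-zero address. */
static int sketch_addr_any(const struct sketch_ib_addr *a)
{
	assert(a != NULL);
	return (a->u.addr64[0] | a->u.addr64[1]) == 0;
}

/* True when only the lowest-order bit of the address is set. */
static int sketch_addr_loopback(const struct sketch_ib_addr *a)
{
	assert(a != NULL);
	return (a->u.addr32[0] | a->u.addr32[1] | a->u.addr32[2] |
		(a->u.addr32[3] ^ htonl(1))) == 0;
}

/* Fill the address from four 32-bit words already in network order. */
static void sketch_addr_set(struct sketch_ib_addr *a, uint32_t w1,
			    uint32_t w2, uint32_t w3, uint32_t w4)
{
	assert(a != NULL);
	a->u.addr32[0] = w1;
	a->u.addr32[1] = w2;
	a->u.addr32[2] = w3;
	a->u.addr32[3] = w4;
}
```

Setting the words to (0, 0, 0, htonl(1)) yields an address that the loopback
test accepts and the any test rejects, which matches the IPv6-style
conventions the header appears to follow.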
[PATCH 8/37] librdmacm: add support for PF_IB to resolve_addr
Allow user to specify PF_IB addresses to rdma_resolve_addr.

Signed-off-by: Sean Hefty <sean.he...@intel.com>
---
 include/rdma/rdma_cma_abi.h | 13 -
 src/cma.c                   | 44 +--
 2 files changed, 50 insertions(+), 7 deletions(-)

diff --git a/include/rdma/rdma_cma_abi.h b/include/rdma/rdma_cma_abi.h
index 8add397..4a7a55d 100644
--- a/include/rdma/rdma_cma_abi.h
+++ b/include/rdma/rdma_cma_abi.h
@@ -67,7 +67,8 @@ enum {
 	UCMA_CMD_LEAVE_MCAST,
 	UCMA_CMD_MIGRATE_ID,
 	UCMA_CMD_QUERY,
-	UCMA_CMD_BIND
+	UCMA_CMD_BIND,
+	UCMA_CMD_RESOLVE_ADDR
 };

 struct ucma_abi_cmd_hdr {
@@ -117,6 +118,16 @@ struct ucma_abi_resolve_ip {
 	__u32 timeout_ms;
 };

+struct ucma_abi_resolve_addr {
+	__u32 id;
+	__u32 timeout_ms;
+	__u16 src_size;
+	__u16 dst_size;
+	__u32 reserved;
+	struct sockaddr_storage src_addr;
+	struct sockaddr_storage dst_addr;
+};
+
 struct ucma_abi_resolve_route {
 	__u32 id;
 	__u32 timeout_ms;
diff --git a/src/cma.c b/src/cma.c
index be61333..e22e1b4 100644
--- a/src/cma.c
+++ b/src/cma.c
@@ -721,31 +721,63 @@ int rdma_bind_addr(struct rdma_cm_id *id, struct sockaddr *addr)
 	return ucma_query_route(id);
 }

+static int rdma_resolve_addr2(struct rdma_cm_id *id, struct sockaddr *src_addr,
+			      socklen_t src_len, struct sockaddr *dst_addr,
+			      socklen_t dst_len, int timeout_ms)
+{
+	struct ucma_abi_resolve_addr *cmd;
+	struct cma_id_private *id_priv;
+	void *msg;
+	int ret, size;
+
+	CMA_CREATE_MSG_CMD(msg, cmd, UCMA_CMD_RESOLVE_ADDR, size);
+	id_priv = container_of(id, struct cma_id_private, id);
+	cmd->id = id_priv->handle;
+	if ((cmd->src_size = src_len))
+		memcpy(&cmd->src_addr, src_addr, src_len);
+	memcpy(&cmd->dst_addr, dst_addr, dst_len);
+	cmd->dst_size = dst_len;
+	cmd->timeout_ms = timeout_ms;
+	cmd->reserved = 0;
+
+	ret = write(id->channel->fd, msg, size);
+	if (ret != size)
+		return (ret >= 0) ? ERR(ENODATA) : -1;
+
+	memcpy(&id->route.addr.dst_addr, dst_addr, dst_len);
+	return 0;
+}
+
 int rdma_resolve_addr(struct rdma_cm_id *id, struct sockaddr *src_addr,
 		      struct sockaddr *dst_addr, int timeout_ms)
 {
 	struct ucma_abi_resolve_ip *cmd;
 	struct cma_id_private *id_priv;
 	void *msg;
-	int ret, size, daddrlen;
+	int ret, size, dst_len, src_len;

-	daddrlen = ucma_addrlen(dst_addr);
-	if (!daddrlen)
+	dst_len = ucma_addrlen(dst_addr);
+	if (!dst_len)
 		return ERR(EINVAL);

+	src_len = ucma_addrlen(src_addr);
+	if (af_ib_support)
+		return rdma_resolve_addr2(id, src_addr, src_len, dst_addr,
+					  dst_len, timeout_ms);
+
 	CMA_CREATE_MSG_CMD(msg, cmd, UCMA_CMD_RESOLVE_IP, size);
 	id_priv = container_of(id, struct cma_id_private, id);
 	cmd->id = id_priv->handle;
 	if (src_addr)
-		memcpy(&cmd->src_addr, src_addr, ucma_addrlen(src_addr));
-	memcpy(&cmd->dst_addr, dst_addr, daddrlen);
+		memcpy(&cmd->src_addr, src_addr, src_len);
+	memcpy(&cmd->dst_addr, dst_addr, dst_len);
 	cmd->timeout_ms = timeout_ms;

 	ret = write(id->channel->fd, msg, size);
 	if (ret != size)
 		return (ret >= 0) ? ERR(ENODATA) : -1;

-	memcpy(&id->route.addr.dst_addr, dst_addr, dst_len);
+	memcpy(&id->route.addr.dst_addr, dst_addr, dst_len);
 	return 0;
 }

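The new resolve command carries explicit address sizes next to
sockaddr_storage buffers, which is what lets it hold addresses larger than
sockaddr_in6.  A minimal standalone sketch of packing such a command follows.
The struct and function names are hypothetical, and the bounds checks are an
addition of this sketch rather than something the patch itself performs.

```c
#include <assert.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical layout modelled on ucma_abi_resolve_addr from the patch. */
struct sketch_resolve_cmd {
	uint32_t id;
	uint32_t timeout_ms;
	uint16_t src_size;
	uint16_t dst_size;
	uint32_t reserved;
	struct sockaddr_storage src_addr;
	struct sockaddr_storage dst_addr;
};

/*
 * Copy optional source and mandatory destination addresses into the
 * command, recording how many bytes of each buffer are meaningful.
 * Returns 0 on success, -1 if an address is missing or does not fit.
 */
static int sketch_pack_resolve(struct sketch_resolve_cmd *cmd, uint32_t id,
			       const struct sockaddr *src, socklen_t src_len,
			       const struct sockaddr *dst, socklen_t dst_len,
			       uint32_t timeout_ms)
{
	assert(cmd != NULL);

	if (!dst || !dst_len || dst_len > sizeof(cmd->dst_addr))
		return -1;
	if (src_len > sizeof(cmd->src_addr) || (src_len && !src))
		return -1;

	memset(cmd, 0, sizeof(*cmd));
	cmd->id = id;
	cmd->timeout_ms = timeout_ms;
	cmd->src_size = (uint16_t) src_len;	/* 0 means "no source given" */
	if (src_len)
		memcpy(&cmd->src_addr, src, src_len);
	cmd->dst_size = (uint16_t) dst_len;
	memcpy(&cmd->dst_addr, dst, dst_len);
	return 0;
}
```

Because the sizes travel with the command, the receiver does not need to
infer the address length from the family, which is presumably why the same
layout can serve AF_INET, AF_INET6 and AF_IB.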
[PATCH 13/37] librdmacm: allow pd parameter to be optional
Allow the user to create a QP using rdma_create_qp without specifying
a PD.  If a PD is not given, a default PD will be used instead.  This
simplifies the user interface.

Signed-off-by: Sean Hefty <sean.he...@intel.com>
---
 include/rdma/rdma_cma.h |  4 +++-
 src/cma.c               | 24 +++-
 2 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/include/rdma/rdma_cma.h b/include/rdma/rdma_cma.h
index 83418c3..ccf6cd4 100644
--- a/include/rdma/rdma_cma.h
+++ b/include/rdma/rdma_cma.h
@@ -279,7 +279,7 @@ int rdma_resolve_route(struct rdma_cm_id *id, int timeout_ms);
 /**
  * rdma_create_qp - Allocate a QP.
  * @id: RDMA identifier.
- * @pd: protection domain for the QP.
+ * @pd: Optional protection domain for the QP.
  * @qp_init_attr: initial QP attributes.
  * Description:
  *   Allocate a QP associated with the specified rdma_cm_id and transition it
@@ -291,6 +291,8 @@ int rdma_resolve_route(struct rdma_cm_id *id, int timeout_ms);
  *   librdmacm through their states.  After being allocated, the QP will be
  *   ready to handle posting of receives.  If the QP is unconnected, it will
  *   be ready to post sends.
+ *   If pd is NULL, then the QP will be allocated using a default protection
+ *   domain associated with the underlying RDMA device.
  * See also:
  *   rdma_bind_addr, rdma_resolve_addr, rdma_destroy_qp, ibv_create_qp,
  *   ibv_modify_qp
diff --git a/src/cma.c b/src/cma.c
index c7a3a7b..0587ab3 100644
--- a/src/cma.c
+++ b/src/cma.c
@@ -95,6 +95,7 @@ do {							\

 struct cma_device {
 	struct ibv_context *verbs;
+	struct ibv_pd	   *pd;
 	uint64_t	    guid;
 	int		    port_cnt;
 	uint8_t		    max_initiator_depth;
@@ -144,9 +145,11 @@ int af_ib_support;
 static void ucma_cleanup(void)
 {
 	if (cma_dev_cnt) {
-		while (cma_dev_cnt)
-			ibv_close_device(cma_dev_array[--cma_dev_cnt].verbs);
-
+		while (cma_dev_cnt--) {
+			ibv_dealloc_pd(cma_dev_array[cma_dev_cnt].pd);
+			ibv_close_device(cma_dev_array[cma_dev_cnt].verbs);
+		}
+
 		free(cma_dev_array);
 		cma_dev_cnt = 0;
 	}
@@ -224,6 +227,13 @@ static int ucma_init(void)
 			goto err3;
 		}

+		cma_dev->pd = ibv_alloc_pd(cma_dev->verbs);
+		if (!cma_dev->pd) {
+			ibv_close_device(cma_dev->verbs);
+			ret = ERR(ENOMEM);
+			goto err3;
+		}
+
 		i++;
 		ret = ibv_query_device(cma_dev->verbs, &attr);
 		if (ret) {
@@ -242,8 +252,10 @@ static int ucma_init(void)
 	return 0;

 err3:
-	while (i--)
+	while (i--) {
+		ibv_dealloc_pd(cma_dev_array[i].pd);
 		ibv_close_device(cma_dev_array[i].verbs);
+	}
 	free(cma_dev_array);
 err2:
 	ibv_free_device_list(dev_list);
@@ -1021,7 +1033,9 @@ int rdma_create_qp(struct rdma_cm_id *id, struct ibv_pd *pd,
 	int ret;

 	id_priv = container_of(id, struct cma_id_private, id);
-	if (id->verbs != pd->context)
+	if (!pd)
+		pd = id_priv->cma_dev->pd;
+	else if (id->verbs != pd->context)
 		return ERR(EINVAL);

 	qp = ibv_create_qp(pd, qp_init_attr);
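The PD selection added to rdma_create_qp() can be looked at in isolation.
The sketch below uses made-up stand-in types (sketch_ctx, sketch_pd,
sketch_dev) rather than the libibverbs structures, so it only illustrates the
decision and is not the library code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for ibv_context, ibv_pd and cma_device. */
struct sketch_ctx { int unused; };
struct sketch_pd  { struct sketch_ctx *context; };
struct sketch_dev {
	struct sketch_ctx *verbs;	/* device context */
	struct sketch_pd  *default_pd;	/* allocated once per device */
};

/*
 * Pick the PD to create a QP on:
 *  - no PD given        -> the device's default PD
 *  - PD from this device -> the caller's PD
 *  - PD from elsewhere   -> NULL (the patch returns EINVAL here)
 */
static struct sketch_pd *sketch_pick_pd(const struct sketch_dev *dev,
					struct sketch_pd *pd)
{
	assert(dev != NULL);

	if (!pd)
		return dev->default_pd;
	if (pd->context != dev->verbs)
		return NULL;
	return pd;
}
```

The NULL check has to come first, since the context comparison dereferences
pd; the patch orders the two tests the same way.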
[PATCH 21/37] librdmacm: specify qp_type when creating id
To support AF_IB / PS_IB, we need to specify the qp type when creating the
rdma_cm_id.  The kernel requires this in order to select the correct type
of operation to perform (e.g. SIDR versus REQ).

Signed-off-by: Sean Hefty <sean.he...@intel.com>
---
 include/rdma/rdma_cma_abi.h |  3 ++-
 src/cma.c                   | 18 +++---
 2 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/include/rdma/rdma_cma_abi.h b/include/rdma/rdma_cma_abi.h
index c3981e6..bd4ca0f 100644
--- a/include/rdma/rdma_cma_abi.h
+++ b/include/rdma/rdma_cma_abi.h
@@ -82,7 +82,8 @@ struct ucma_abi_create_id {
 	__u64 uid;
 	__u64 response;
 	__u16 ps;
-	__u8  reserved[6];
+	__u8  qp_type;
+	__u8  reserved[5];
 };

 struct ucma_abi_create_id_resp {
diff --git a/src/cma.c b/src/cma.c
index 9de33d4..e31fb8a 100644
--- a/src/cma.c
+++ b/src/cma.c
@@ -390,9 +390,9 @@ err:	ucma_free_id(id_priv);
 	return NULL;
 }

-int rdma_create_id(struct rdma_event_channel *channel,
-		   struct rdma_cm_id **id, void *context,
-		   enum rdma_port_space ps)
+static int rdma_create_id2(struct rdma_event_channel *channel,
+			   struct rdma_cm_id **id, void *context,
+			   enum rdma_port_space ps, enum ibv_qp_type qp_type)
 {
 	struct ucma_abi_create_id_resp *resp;
 	struct ucma_abi_create_id *cmd;
@@ -411,6 +411,7 @@ int rdma_create_id(struct rdma_event_channel *channel,
 	CMA_CREATE_MSG_CMD_RESP(msg, cmd, resp, UCMA_CMD_CREATE_ID, size);
 	cmd->uid = (uintptr_t) id_priv;
 	cmd->ps = ps;
+	cmd->qp_type = qp_type;

 	ret = write(id_priv->id.channel->fd, msg, size);
 	if (ret != size)
@@ -426,6 +427,17 @@ err:	ucma_free_id(id_priv);
 	return ret;
 }

+int rdma_create_id(struct rdma_event_channel *channel,
+		   struct rdma_cm_id **id, void *context,
+		   enum rdma_port_space ps)
+{
+	enum ibv_qp_type qp_type;
+
+	qp_type = (ps == RDMA_PS_IPOIB || ps == RDMA_PS_UDP) ?
+		  IBV_QPT_UD : IBV_QPT_RC;
+	return rdma_create_id2(channel, id, context, ps, qp_type);
+}
+
 static int ucma_destroy_kern_id(int fd, uint32_t handle)
 {
 	struct ucma_abi_destroy_id_resp *resp;
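The default chosen by the rdma_create_id() wrapper amounts to a small mapping
from port space to QP type.  A standalone sketch, with hypothetical enums in
place of rdma_port_space and ibv_qp_type (the real values are not reproduced
here):

```c
/* Hypothetical enums; the patch uses rdma_port_space and ibv_qp_type. */
enum sketch_port_space {
	SKETCH_PS_IPOIB,
	SKETCH_PS_TCP,
	SKETCH_PS_UDP
};

enum sketch_qp_type {
	SKETCH_QPT_RC,	/* reliable connected */
	SKETCH_QPT_UD	/* unreliable datagram */
};

/*
 * Datagram-style port spaces default to a UD QP; everything else
 * defaults to RC, matching the ternary in the patch.
 */
static enum sketch_qp_type sketch_default_qp_type(enum sketch_port_space ps)
{
	return (ps == SKETCH_PS_IPOIB || ps == SKETCH_PS_UDP) ?
	       SKETCH_QPT_UD : SKETCH_QPT_RC;
}
```

Keeping the mapping inside the public wrapper means existing callers of
rdma_create_id() are unaffected, while the internal rdma_create_id2() can be
given an explicit QP type.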
[infiniband-diags] [1/3] support --diff in ibnetdiscover
Hi Sasha, This patch adds the default --diff support in ibnetdiscover. Al -- Albert Chu ch...@llnl.gov Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory ---BeginMessage--- Signed-off-by: Albert Chu ch...@llnl.gov --- infiniband-diags/man/ibnetdiscover.8 |7 + infiniband-diags/src/ibnetdiscover.c | 246 + 2 files changed, 223 insertions(+), 30 deletions(-) diff --git a/infiniband-diags/man/ibnetdiscover.8 b/infiniband-diags/man/ibnetdiscover.8 index 082a8e4..975b999 100644 --- a/infiniband-diags/man/ibnetdiscover.8 +++ b/infiniband-diags/man/ibnetdiscover.8 @@ -57,6 +57,13 @@ Load and use the cached ibnetdiscover data stored in the specified filename. May be useful for outputting and learning about other fabrics or a previous state of a fabric. .TP +\fB\-\-diff\fR filename +Load cached ibnetdiscover data and do a diff comparison to the current +network or another cache. A special diff output for ibnetdiscover +output will be displayed showing differences between the old and current +fabric. By default, the following are compared for differences: switches, +channel adapters, routers, and port connections. 
+.TP \fB\-p\fR, \fB\-\-ports\fR Obtain a ports report which is a list of connected ports with relevant information (like LID, portnum, diff --git a/infiniband-diags/src/ibnetdiscover.c b/infiniband-diags/src/ibnetdiscover.c index 651bafd..4da09ce 100644 --- a/infiniband-diags/src/ibnetdiscover.c +++ b/infiniband-diags/src/ibnetdiscover.c @@ -57,6 +57,16 @@ #define LIST_SWITCH_NODE (1 IB_NODE_SWITCH) #define LIST_ROUTER_NODE (1 IB_NODE_ROUTER) +#define DIFF_FLAG_SWITCH 0x0001 +#define DIFF_FLAG_CA 0x0002 +#define DIFF_FLAG_ROUTER 0x0004 +#define DIFF_FLAG_PORT_CONNECTION 0x0008 + +#define DIFF_FLAG_DEFAULT (DIFF_FLAG_SWITCH \ + | DIFF_FLAG_CA \ + | DIFF_FLAG_ROUTER \ + | DIFF_FLAG_PORT_CONNECTION) + struct ibmad_port *srcport; static FILE *f; @@ -65,6 +75,7 @@ static char *node_name_map_file = NULL; static nn_map_t *node_name_map = NULL; static char *cache_file = NULL; static char *load_cache_file = NULL; +static char *diff_cache_file = NULL; static int report_max_hops = 0; @@ -183,16 +194,20 @@ void list_nodes(ibnd_fabric_t * fabric, int list) ibnd_iter_nodes_type(fabric, list_node, IB_NODE_ROUTER, NULL); } -void out_ids(ibnd_node_t * node, int group, char *chname) +void out_ids(ibnd_node_t * node, int group, char *chname, char *out_prefix) { uint64_t sysimgguid = mad_get_field64(node-info, 0, IB_NODE_SYSTEM_GUID_F); - fprintf(f, \nvendid=0x%x\ndevid=0x%x\n, - mad_get_field(node-info, 0, IB_NODE_VENDORID_F), + fprintf(f, \n%svendid=0x%x\n, + out_prefix ? out_prefix : , + mad_get_field(node-info, 0, IB_NODE_VENDORID_F)); + fprintf(f, %sdevid=0x%x\n, + out_prefix ? out_prefix : , mad_get_field(node-info, 0, IB_NODE_DEVID_F)); if (sysimgguid) - fprintf(f, sysimgguid=0x% PRIx64, sysimgguid); + fprintf(f, %ssysimgguid=0x% PRIx64, + out_prefix ? 
out_prefix : , sysimgguid); if (group node-chassis node-chassis-chassisnum) { fprintf(f, \t\t# Chassis %d, node-chassis-chassisnum); if (chname) @@ -217,14 +232,15 @@ uint64_t out_chassis(ibnd_fabric_t * fabric, unsigned char chassisnum) return guid; } -void out_switch(ibnd_node_t * node, int group, char *chname) +void out_switch(ibnd_node_t * node, int group, char *chname, char *out_prefix) { char *str; char str2[256]; char *nodename = NULL; - out_ids(node, group, chname); - fprintf(f, switchguid=0x% PRIx64, node-guid); + out_ids(node, group, chname, out_prefix); + fprintf(f, %sswitchguid=0x% PRIx64, + out_prefix ? out_prefix : , node-guid); fprintf(f, (% PRIx64 ), mad_get_field64(node-info, 0, IB_NODE_PORT_GUID_F)); if (group) { @@ -239,7 +255,8 @@ void out_switch(ibnd_node_t * node, int group, char *chname) nodename = remap_node_name(node_name_map, node-guid, node-nodedesc); - fprintf(f, \nSwitch\t%d %s\t\t# \%s\ %s port 0 lid %d lmc %d\n, + fprintf(f, \n%sSwitch\t%d %s\t\t# \%s\ %s port 0 lid %d lmc %d\n, + out_prefix ? out_prefix : , node-numports, node_name(node), nodename, node-smaenhsp0 ? enhanced : base, node-smalid, node-smalmc); @@ -247,12 +264,12 @@ void out_switch(ibnd_node_t * node, int group, char *chname) free(nodename); } -void out_ca(ibnd_node_t * node, int group, char *chname) +void out_ca(ibnd_node_t * node, int group, char *chname, char *out_prefix) { char
Re: Fork safe clarification
> Are PDs, QPs and CQs created before a fork shared by the parent and
> child after fork() has returned (ie. both can submit WRs, poll CQ, etc.)?

no, QPs and CQs are accessible only in the parent.  The child can still
use the uverbs file descriptor to do things, but libibverbs will probably
get very confused in this case.  More userspace development would probably
be required to make this really work.  Since the PD is attached to the FD,
it could be shared.

> What about MRs registered before the fork?  Even though the child doesn't
> have access to the parent's memory, can he still submit WRs on a QP with
> an MR created before the fork?

yes.

> What if the MR pages in the above scenario are accessible in both parent
> and child (shared memory)?  Are there complications with registering
> shared memory?

shouldn't make a difference.

> In general, are pointers returned by libibverbs pointers to user/process
> address space (as ibv_mr pointers must be) or kernel space (eg. if an
> unrelated process had another process's QP pointer, lkey, and a virtual
> address could it post (almost certainly unsafely) a WR to the other
> process's QP)?

Not sure I understand this.  All the pointers from libibverbs are of
course userspace pointers.  What could a userspace process do with a
kernel pointer?  Processes own all their resources and can't access
other resources.

 - R.
--
Roland Dreier <rola...@cisco.com> || For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/index.html
[infiniband-diags] [3/3] support lid and nodedesc diffchecks in ibnetdiscover
Hi Sasha, This patch adds lid and node description diff options for --diffcheck in ibnetdiscover. Al -- Albert Chu ch...@llnl.gov Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory ---BeginMessage--- Signed-off-by: Albert Chu ch...@llnl.gov --- infiniband-diags/man/ibnetdiscover.8 |3 +- infiniband-diags/src/ibnetdiscover.c | 211 -- 2 files changed, 154 insertions(+), 60 deletions(-) diff --git a/infiniband-diags/man/ibnetdiscover.8 b/infiniband-diags/man/ibnetdiscover.8 index e122736..76cfbc8 100644 --- a/infiniband-diags/man/ibnetdiscover.8 +++ b/infiniband-diags/man/ibnetdiscover.8 @@ -68,7 +68,8 @@ channel adapters, routers, and port connections. Specify what diff checks should be done in the \fB\-\-diff\fR option above. Comma separate multiple diff check key(s). The available diff checks are: \fIsw\fR = switches, \fIca\fR = channel adapters, \fIrouter\fR = routers, -\fIport\fR = port connections descriptions. Note that \fIport\fR is +\fIport\fR = port connections, \fIlid\fR = lids, \fInodedesc\fR = node +descriptions. Note that \fIport\fR, \fIlid\fR, and \fInodedesc\fR are checked only for the node types that are specified (e.g. \fIsw\fR, \fIca\fR, \fIrouter\fR). 
.TP diff --git a/infiniband-diags/src/ibnetdiscover.c b/infiniband-diags/src/ibnetdiscover.c index 4435ade..770c589 100644 --- a/infiniband-diags/src/ibnetdiscover.c +++ b/infiniband-diags/src/ibnetdiscover.c @@ -61,6 +61,8 @@ #define DIFF_FLAG_CA 0x0002 #define DIFF_FLAG_ROUTER 0x0004 #define DIFF_FLAG_PORT_CONNECTION 0x0008 +#define DIFF_FLAG_LID 0x0010 +#define DIFF_FLAG_NODE_DESCRIPTION 0x0020 #define DIFF_FLAG_DEFAULT (DIFF_FLAG_SWITCH \ | DIFF_FLAG_CA \ @@ -233,15 +235,29 @@ uint64_t out_chassis(ibnd_fabric_t * fabric, unsigned char chassisnum) return guid; } -void out_switch(ibnd_node_t * node, int group, char *chname, char *out_prefix) +void out_switch_detail(ibnd_node_t * node, char *sw_prefix) +{ + char *nodename = NULL; + + nodename = remap_node_name(node_name_map, node-guid, node-nodedesc); + + fprintf(f, %sSwitch\t%d %s\t\t# \%s\ %s port 0 lid %d lmc %d, + sw_prefix ? sw_prefix : , + node-numports, node_name(node), nodename, + node-smaenhsp0 ? enhanced : base, + node-smalid, node-smalmc); + + free(nodename); +} + +void out_switch(ibnd_node_t * node, int group, char *chname, char *id_prefix, char *sw_prefix) { char *str; char str2[256]; - char *nodename = NULL; - out_ids(node, group, chname, out_prefix); + out_ids(node, group, chname, id_prefix); fprintf(f, %sswitchguid=0x% PRIx64, - out_prefix ? out_prefix : , node-guid); + id_prefix ? id_prefix : , node-guid); fprintf(f, (% PRIx64 ), mad_get_field64(node-info, 0, IB_NODE_PORT_GUID_F)); if (group) { @@ -253,45 +269,54 @@ void out_switch(ibnd_node_t * node, int group, char *chname, char *out_prefix) if (str) fprintf(f, %s, str); } + fprintf(f, \n); - nodename = remap_node_name(node_name_map, node-guid, node-nodedesc); + out_switch_detail(node, sw_prefix); + fprintf(f, \n); +} - fprintf(f, \n%sSwitch\t%d %s\t\t# \%s\ %s port 0 lid %d lmc %d\n, - out_prefix ? out_prefix : , - node-numports, node_name(node), nodename, - node-smaenhsp0 ? 
enhanced : base, - node-smalid, node-smalmc); +void out_ca_detail(ibnd_node_t * node, char *ca_prefix) +{ + char *node_type; - free(nodename); + switch (node-type) { + case IB_NODE_CA: + node_type = Ca; + break; + case IB_NODE_ROUTER: + node_type = Rt; + break; + default: + node_type = ???; + break; + } + + fprintf(f, %s%s\t%d %s\t\t# \%s\, + ca_prefix ? ca_prefix : , + node_type, node-numports, node_name(node), + clean_nodedesc(node-nodedesc)); } -void out_ca(ibnd_node_t * node, int group, char *chname, char *out_prefix) +void out_ca(ibnd_node_t * node, int group, char *chname, char *id_prefix, char *ca_prefix) { char *node_type; - char *node_type2; - out_ids(node, group, chname, out_prefix); + out_ids(node, group, chname, id_prefix); switch (node-type) { case IB_NODE_CA: node_type = ca; - node_type2 = Ca; break; case IB_NODE_ROUTER: node_type = rt; - node_type2 = Rt; break; default: node_type = ???; - node_type2 = ???; break; }
[infiniband-diags] [2/3] support --diffcheck in ibnetdiscover
Hi Sasha, This patch adds basic --diffcheck support in ibnetdiscover, allowing configuration of the diff checks done in the default --diff option. Al -- Albert Chu ch...@llnl.gov Computer Scientist High Performance Systems Division Lawrence Livermore National Laboratory ---BeginMessage--- Signed-off-by: Albert Chu ch...@llnl.gov --- infiniband-diags/man/ibnetdiscover.8 |8 + infiniband-diags/src/ibnetdiscover.c | 50 +- 2 files changed, 45 insertions(+), 13 deletions(-) diff --git a/infiniband-diags/man/ibnetdiscover.8 b/infiniband-diags/man/ibnetdiscover.8 index 975b999..e122736 100644 --- a/infiniband-diags/man/ibnetdiscover.8 +++ b/infiniband-diags/man/ibnetdiscover.8 @@ -64,6 +64,14 @@ output will be displayed showing differences between the old and current fabric. By default, the following are compared for differences: switches, channel adapters, routers, and port connections. .TP +\fB\-\-diffcheck\fR key(s) +Specify what diff checks should be done in the \fB\-\-diff\fR option above. +Comma separate multiple diff check key(s). The available diff checks +are: \fIsw\fR = switches, \fIca\fR = channel adapters, \fIrouter\fR = routers, +\fIport\fR = port connections descriptions. Note that \fIport\fR is +checked only for the node types that are specified (e.g. \fIsw\fR, +\fIca\fR, \fIrouter\fR). 
+.TP \fB\-p\fR, \fB\-\-ports\fR Obtain a ports report which is a list of connected ports with relevant information (like LID, portnum, diff --git a/infiniband-diags/src/ibnetdiscover.c b/infiniband-diags/src/ibnetdiscover.c index 4da09ce..4435ade 100644 --- a/infiniband-diags/src/ibnetdiscover.c +++ b/infiniband-diags/src/ibnetdiscover.c @@ -57,10 +57,10 @@ #define LIST_SWITCH_NODE (1 IB_NODE_SWITCH) #define LIST_ROUTER_NODE (1 IB_NODE_ROUTER) -#define DIFF_FLAG_SWITCH 0x0001 -#define DIFF_FLAG_CA 0x0002 -#define DIFF_FLAG_ROUTER 0x0004 -#define DIFF_FLAG_PORT_CONNECTION 0x0008 +#define DIFF_FLAG_SWITCH 0x0001 +#define DIFF_FLAG_CA 0x0002 +#define DIFF_FLAG_ROUTER 0x0004 +#define DIFF_FLAG_PORT_CONNECTION 0x0008 #define DIFF_FLAG_DEFAULT (DIFF_FLAG_SWITCH \ | DIFF_FLAG_CA \ @@ -76,6 +76,7 @@ static nn_map_t *node_name_map = NULL; static char *cache_file = NULL; static char *load_cache_file = NULL; static char *diff_cache_file = NULL; +static uint32_t diffcheck_flags = DIFF_FLAG_DEFAULT; static int report_max_hops = 0; @@ -735,7 +736,9 @@ static int diff_common(ibnd_fabric_t * orig_fabric, * in new_fabric but not in orig_fabric. * * In this diff, we don't need to check port connections, -* since it has already been done before. +* lids, or node descriptions since it has already been + * done (i.e. checks are only done when guid exists on both +* orig and new). 
*/ iter_diff_data.diff_flags = diff_flags ~DIFF_FLAG_PORT_CONNECTION; iter_diff_data.fabric1 = new_fabric; @@ -752,29 +755,27 @@ static int diff_common(ibnd_fabric_t * orig_fabric, int diff(ibnd_fabric_t * orig_fabric, ibnd_fabric_t * new_fabric) { - uint32_t diff_flags = DIFF_FLAG_DEFAULT; - - if (diff_flags DIFF_FLAG_SWITCH) + if (diffcheck_flags DIFF_FLAG_SWITCH) diff_common(orig_fabric, new_fabric, IB_NODE_SWITCH, - diff_flags, + diffcheck_flags, out_switch, out_switch_port); - if (diff_flags DIFF_FLAG_CA) + if (diffcheck_flags DIFF_FLAG_CA) diff_common(orig_fabric, new_fabric, IB_NODE_CA, - diff_flags, + diffcheck_flags, out_ca, out_ca_port); - if (diff_flags DIFF_FLAG_ROUTER) + if (diffcheck_flags DIFF_FLAG_ROUTER) diff_common(orig_fabric, new_fabric, IB_NODE_ROUTER, - diff_flags, + diffcheck_flags, out_ca, out_ca_port); @@ -786,6 +787,8 @@ static int list, group, ports_report; static int process_opt(void *context, int ch, char *optarg) { + char *p; + switch (ch) { case 1: node_name_map_file = strdup(optarg); @@ -799,6 +802,25 @@ static int process_opt(void *context, int ch, char *optarg) case 4: diff_cache_file = strdup(optarg); break; + case 5: + diffcheck_flags = 0; + p = strtok(optarg, ,); + while (p) { + if
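The option handling in process_opt() is cut off above, so the following is
only a guess at its general shape: a standalone sketch that turns a
comma-separated --diffcheck argument into flag bits with strtok().  The key
names come from the man page hunk and the flag values from the #defines in
the patch; the function name, the copy of the argument, and the behaviour on
an unknown key are assumptions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_DIFF_SWITCH		0x0001
#define SKETCH_DIFF_CA			0x0002
#define SKETCH_DIFF_ROUTER		0x0004
#define SKETCH_DIFF_PORT_CONNECTION	0x0008

/*
 * Parse keys such as "sw,ca,port" into a flag mask.
 * Returns 0 and stores the mask in *flags on success; returns -1 and
 * leaves *flags untouched on an unknown key or allocation failure.
 */
static int sketch_parse_diffcheck(const char *arg, uint32_t *flags)
{
	size_t len;
	char *copy, *p;
	uint32_t out = 0;
	int ret = 0;

	assert(arg != NULL && flags != NULL);

	/* strtok() modifies its input, so work on a private copy. */
	len = strlen(arg) + 1;
	copy = malloc(len);
	if (!copy)
		return -1;
	memcpy(copy, arg, len);

	for (p = strtok(copy, ","); p; p = strtok(NULL, ",")) {
		if (!strcmp(p, "sw"))
			out |= SKETCH_DIFF_SWITCH;
		else if (!strcmp(p, "ca"))
			out |= SKETCH_DIFF_CA;
		else if (!strcmp(p, "router"))
			out |= SKETCH_DIFF_ROUTER;
		else if (!strcmp(p, "port"))
			out |= SKETCH_DIFF_PORT_CONNECTION;
		else {
			ret = -1;	/* assumption: reject unknown keys */
			break;
		}
	}

	free(copy);
	if (!ret)
		*flags = out;
	return ret;
}
```

With this shape, "sw,port" selects switch and port-connection checks only,
which lines up with the man page note that port checks apply just to the
node types that were also selected.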
Re: [PATCH] [RFC] ummunotify: Userspace support for MMU notifications
On Wed,  7 Apr 2010 13:30:29 +0100 Eric B Munson wrote:

> Signed-off-by: Roland Dreier <rolandd at cisco.com>

Use unobfuscated @.

> Signed-off-by: Eric B Munson <ebmun...@us.ibm.com>
> ---
>
> Changes since v3:
> - Fixed replaced [get|put] user with copy_[from|to]_user to fix x86
>   builds
> ---
>  Documentation/Makefile    |   3 +-
>  drivers/char/Kconfig      |  12 +
>  drivers/char/Makefile     |   1 +
>  drivers/char/ummunotify.c | 567 +
>  4 files changed, 582 insertions(+), 1 deletions(-)
>  create mode 100644 drivers/char/ummunotify.c
>
> diff --git a/Documentation/Makefile b/Documentation/Makefile
> index 6fc7ea1..27ba76a 100644
> --- a/Documentation/Makefile
> +++ b/Documentation/Makefile
> @@ -1,3 +1,4 @@
>  obj-m := DocBook/ accounting/ auxdisplay/ connector/ \
>  	filesystems/ filesystems/configfs/ ia64/ laptops/ networking/ \
> -	pcmcia/ spi/ timers/ video4linux/ vm/ watchdog/src/
> +	pcmcia/ spi/ timers/ video4linux/ vm/ ummunotify/ \
> +	watchdog/src/

What is this change to Documentation/Makefile for?
Is there some file that should be added in Documentation/ummunotify/ ?

> diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig
> index 3141dd3..cf26019 100644
> --- a/drivers/char/Kconfig
> +++ b/drivers/char/Kconfig
> @@ -,6 +,18 @@ config DEVPORT
>  	depends on ISA || PCI
>  	default y
>
> +config UMMUNOTIFY
> +	tristate "Userspace MMU notifications"
> +	select MMU_NOTIFIER
> +	help
> +	  The ummunotify (userspace MMU notification) driver creates a
> +	  character device that can be used by userspace libraries to
> +	  get notifications when an application's memory mapping
> +	  changed.  This is used, for example, by RDMA libraries to
> +	  improve the reliability of memory registration caching, since
> +	  the kernel's MMU notifications can be used to know precisely
> +	  when to shoot down a cached registration.
> +
>  source "drivers/s390/char/Kconfig"
>
>  endmenu
> diff --git a/drivers/char/Makefile b/drivers/char/Makefile
> index f957edf..521e5de 100644
> --- a/drivers/char/Makefile
> +++ b/drivers/char/Makefile
> @@ -97,6 +97,7 @@ obj-$(CONFIG_NSC_GPIO)	+= nsc_gpio.o
>  obj-$(CONFIG_CS5535_GPIO)	+= cs5535_gpio.o
>  obj-$(CONFIG_GPIO_TB0219)	+= tb0219.o
>  obj-$(CONFIG_TELCLOCK)	+= tlclk.o
> +obj-$(CONFIG_UMMUNOTIFY)	+= ummunotify.o
>
>  obj-$(CONFIG_MWAVE)		+= mwave/
>  obj-$(CONFIG_AGP)		+= agp/

---
~Randy
[PATCH 0/37] librdmacm: add support for AF_IB
The following patch series adds several enhancements to the librdmacm intended to simplify using RDMA devices and address scalability issues. Major changes include:

* Adding support for AF_IB.

* The addition of a new API: rdma_getaddrinfo. This call provides functionality similar to getaddrinfo for RDMA devices. In addition to resolving names to addresses, it can also resolve route and connection data. rdma_getaddrinfo can return addresses using AF_INET, AF_INET6, and AF_IB.

* Add support for IB ACM. IB ACM defines a socket based protocol to an IB address and route resolution service. One implementation of that service is provided separately, but anyone can implement the service provided that they adhere to the IB ACM communication protocol. Use of IB ACM is not required.

* Support synchronous operation for library calls. Users can control whether an rdma_cm_id operates asynchronously or synchronously based on the rdma_event_channel parameter. Use of synchronous operations reduces the amount of application code required to use the librdmacm.

* Allow the library to abstract RDMA resource creation for simpler RDMA applications. The library can now allocate PDs, CQs, and QPs for the user, if not provided.

* Provide a set of helper verbs calls for posting work requests and checking for completions. These are simple wrappers around libibverbs calls.

This patch series is also available through my git tree at:

git://git.openfabrics.org/~shefty/librdmacm.git af_ib

Signed-off-by: Sean Hefty <sean.he...@intel.com>
Re: [PATCH v2 13/51] IB/qib: Add qib_driver.c
Roland Dreier wrote:
> > +unsigned qib_debug;
> > +module_param_named(debug, qib_debug, uint, S_IWUSR | S_IRUGO);
> > +MODULE_PARM_DESC(debug, "mask for debug prints");
>
> Did you look at using trace events for this stuff? That gives you extremely low overhead when tracing is turned off (dynamic patching to NOP out the tracing when it's disabled) and also very fine-grained (per trace site) control over what gets printed; plus you get dumping of the trace buffer on crash, etc.
>
>  - R.

Where can I find information on trace events? Something in Documentation/*?

Thanks,

Steve.
Re: [PATCH] Dimension port order file support
On Wed, 24 Mar 2010, Sasha Khapyorsky wrote:

> Hi Dale,
>
> On 18:06 Wed 03 Mar, Dale Purdy wrote:
>> Provide a means to specify on a per switch basis the mapping (order) between switch ports and dimensions for Dimension Order Routing. This allows the DOR routing engine to be used when the cabling is not properly aligned for DOR, either initially, or for an upgrade.
>
> Nice stuff. Is this something useful with ! '-R dor'?

I'm not using the dimn_ports array in anything but DOR, but I do think it could be useful for some of the other routing engines.

>> Signed-off-by: Dale Purdy <pu...@sgi.com>
>
> The patch itself is broken somehow - it has a double space at the start of each non-changed line (it is fixable with sed -e 's/^  / /', so don't resend the patch only for this).

Yes I see - odd. The original patch file didn't have this - must have happened when loading it into mail. Hopefully my updated patch will be ok.

> Some more minor comments are below.
> ...
>> +static int set_dimn_ports(void *ctx, uint64_t guid, char *p)
>> +{
>> +	osm_ucast_mgr_t *m = ctx;
>> +	osm_node_t *node = osm_get_node_by_guid(m->p_subn, cl_hton64(guid));
>> +	osm_switch_t *sw;
>> +	uint8_t *dimn_ports = NULL;
>> +	uint8_t port;
>> +	uint *ports = NULL;
>
> 'uint' is not something standard (we had some build compatibility issues with 'uint' in infiniband-diags in the past), so what about 'unsigned int'?

ok, fixed.

>> +	const int bpw = sizeof(*ports)*8;
>> +	int words;
>> +	int i = 1; /* port 0 maps to port 0 */
>> +
>> +	if (!node || !(sw = node->sw)) {
>> +		OSM_LOG(m->p_log, OSM_LOG_DEBUG,
>> +			"switch with guid 0x%016" PRIx64 " is not found\n",
>> +			guid);
>> +		return 0;
>> +	}
>> +
>> +	if (sw->dimn_ports) {
>> +		OSM_LOG(m->p_log, OSM_LOG_DEBUG,
>> +			"switch with guid 0x%016" PRIx64 " already listed\n",
>> +			guid);
>
> This is the case of a GUID listed twice, right? Wouldn't OSM_LOG_VERBOSE be more appropriate?

fixed.

>> +	while ((*p != '\0') && (*p != '#')) {
>> +		char *e;
>> +
>> +		port = strtoul(p, &e, 0);
>> +		if ((p == e) || (port == 0) || (port >= sw->num_ports) ||
>> +		    !osm_node_get_physp_ptr(node, port)) {
>> +			OSM_LOG(m->p_log, OSM_LOG_DEBUG,
>> +				"bad port %d specified for guid 0x%016" PRIx64 "\n",
>> +				port, guid);
>> +			free(dimn_ports);
>> +			free(ports);
>
> Ditto.

fixed.

>> +			return 0;
>> +		}
>> +
>> +		if (ports[port/bpw] & (1u << (port%bpw))) {
>> +			OSM_LOG(m->p_log, OSM_LOG_DEBUG,
>> +				"port %d already specified for guid 0x%016" PRIx64 "\n",
>> +				port, guid);
>
> Ditto.

fixed.

>> +	cl_qmap_apply_func(p_sw_guid_tbl, free_dimn_ports, NULL);
>> +	if (p_mgr->p_subn->opt.dimn_ports_file) {
>> +		OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG,
>> +			"Fetching dimension ports file \'%s\'\n",
>> +			p_mgr->p_subn->opt.dimn_ports_file);
>> +		if (parse_node_map(p_mgr->p_subn->opt.dimn_ports_file,
>> +				   set_dimn_ports, p_mgr)) {
>> +			OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR, "ERR 3A05: "
>> +				"cannot parse dimn_ports_file \'%s\'\n",
>> +				p_mgr->p_subn->opt.dimn_ports_file);
>> +		}
>> +	}
>> +
>
> Hmm, if it is DOR only it can be done under 'if (is_dor)' (to save cycles of other REs). Otherwise (generic usability) ucast_mgr_setup_all_switches() seems a more appropriate place to have such setup, no?

moved to ucast_mgr_setup_all_switches() as you suggested.

> And what about adding:
>
>	if (sw->dimn_ports)
>		free(dimn_ports);
>
> in osm_switch_delete()?

fixed.

New patch attached.

Dale

Dimension port order file support (V2)

Provide a means to specify on a per switch basis the mapping (order) between switch ports and dimensions for Dimension Order Routing. This allows the DOR routing engine to be used when the cabling is not properly aligned for DOR, either initially, or for an upgrade.
Signed-off-by: Dale Purdy pu...@sgi.com --- opensm/include/opensm/osm_subnet.h |1 + opensm/include/opensm/osm_switch.h | 30 + opensm/man/opensm.8.in | 31 -- opensm/opensm/main.c | 13 - opensm/opensm/osm_subnet.c |7 ++ opensm/opensm/osm_switch.c |4 +- opensm/opensm/osm_ucast_mgr.c | 116 +++- 7 files changed, 192 insertions(+), 10 deletions(-) diff --git a/opensm/include/opensm/osm_subnet.h b/opensm/include/opensm/osm_subnet.h index 3970e98..e4e298e 100644 --- a/opensm/include/opensm/osm_subnet.h +++ b/opensm/include/opensm/osm_subnet.h @@ -186,6 +186,7 @@ typedef struct osm_subn_opt { uint16_t
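The duplicate-port check reviewed in this thread (a word-array bitmap indexed by port/bpw, with bit port%bpw) can be illustrated on its own. The following is only a minimal userspace sketch, not code from the patch: check_port_list() and its parameters are hypothetical names, and the real set_dimn_ports() parses ports from a text line and additionally calls osm_node_get_physp_ptr().

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Hypothetical helper: returns 0 if every entry of list[] is a valid,
 * non-zero port number below num_ports and appears only once,
 * otherwise -1. Mirrors the range and bitmap tests from the review.
 */
static int check_port_list(const unsigned char *list, size_t n,
			   unsigned int num_ports)
{
	const unsigned int bpw = sizeof(unsigned int) * 8; /* bits per word */
	size_t words = (num_ports + bpw - 1) / bpw;
	unsigned int *seen = calloc(words ? words : 1, sizeof(*seen));
	int rc = 0;
	size_t i;

	if (!seen)
		return -1;

	for (i = 0; i < n; i++) {
		unsigned int port = list[i];

		/* port 0 is reserved and ports must be below num_ports */
		if (port == 0 || port >= num_ports) {
			rc = -1;
			break;
		}
		/* bit already set: this port was listed before */
		if (seen[port / bpw] & (1u << (port % bpw))) {
			rc = -1;
			break;
		}
		seen[port / bpw] |= 1u << (port % bpw);
	}

	free(seen);
	return rc;
}
```

With num_ports set to 4, the list {3, 1, 2} passes, while {1, 2, 1} (duplicate), {4} (out of range) and {0} (reserved port) are all rejected.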
Re: [PATCH 26/37] librdmacm: set src_addr in rdma_getaddrinfo
On Wed, Apr 07, 2010 at 10:12:43AM -0700, Sean Hefty wrote:

> RDMA requires the user to allocate hardware resources before establishing a connection. To support this, the user must know the source address that the connection will use to reach the remote endpoint. Modify rdma_getaddrinfo to determine an appropriate source address based on the specified destination, when a source address is not given.

I haven't looked through everything you posted to make a suggestion here, but this bothers me.. The resources should be allocated after the rdma_bind syscall, prior to listen/accept or connect, IMHO.

How does the rai->ai_src_addr get used to allocate resources anyhow?

Jason
Re: [PATCH 22/37] librdmacm: add new call to create id
On Wed, Apr 07, 2010 at 10:12:44AM -0700, Sean Hefty wrote:

> + * The rdma_cm_id will be set to use synchronous operations (connect,
> + * listen, and get_request).  To convert to synchronous operation, the
                                              ^ asynchronous?

Jason
RE: [PATCH 22/37] librdmacm: add new call to create id
>> + * The rdma_cm_id will be set to use synchronous operations (connect,
>> + * listen, and get_request).  To convert to synchronous operation, the
>                                               ^ asynchronous?

yes - thanks
RE: [PATCH 26/37] librdmacm: set src_addr in rdma_getaddrinfo
> I haven't looked through everything you posted to make a suggestion here, but this bothers me.. The resources should be allocated after the rdma_bind syscall, prior to listen/accept or connect, IMHO.
>
> How does the rai->ai_src_addr get used to allocate resources anyhow?

Maybe the patch description is off. All this does (in a very non-sexy way) is set ai_src_addr. It does not allocate any hardware resources. A user can provide ai_src_addr as input into rdma_bind or rdma_resolve_addr.

The motivation is twofold. First, the user can select the rdma_addrinfo for a connection by examining the src/dst address pair. This may be desired for failover or performance reasons. Second, route resolution may require knowing both the source and destination addresses. For example, IB ACM requires both addresses as input.

- Sean
Re: Ummunotify: progress at last!
On Wed, Apr 07, 2010 at 12:37:03PM -0700, Roland Dreier wrote:

> > No, there is no mmap. Like this:
> >
> >   u64 my_counter = 0;
> >   ibv_set_mmu_counter(verbs, &my_counter);
> >   [..]
> >   while (my_counter != last_my_counter) {
> >       last_my_counter = my_counter;
> >       ibv_get_mmu_notifications(verbs, ...); // <- I am a memory barrier as well
> >   }
> >
> > The kernel 'syscall' ibv_set_mmu_counter would bind the given verbs to the 8 byte counter you specified without having to do the mmap thing. As I understand it this is what perfevents does.
>
> I was trying to look at how perf events handles this, and AFAICT it looks like kernel/perf_event.c just supports mmap(). Can you expand on what you meant here?
>
> (I was trying to figure out how one would handle the case where userspace gives us a counter in highmem -- doing kmap_atomic() seems to be the only option but then I'm not sure if I want to deal with that...)

I think I was mistaken here, disregard..

Jason
Re: [PATCH 26/37] librdmacm: set src_addr in rdma_getaddrinfo
On Wed, Apr 07, 2010 at 12:54:56PM -0700, Sean Hefty wrote:

> > I haven't looked through everything you posted to make a suggestion here, but this bothers me.. The resources should be allocated after the rdma_bind syscall, prior to listen/accept or connect, IMHO.
> >
> > How does the rai->ai_src_addr get used to allocate resources anyhow?
>
> Maybe the patch description is off. All this does (in a very non-sexy way) is set ai_src_addr. It does not allocate any hardware resources. A user can provide ai_src_addr as input into rdma_bind or rdma_resolve_addr.
>
> The motivation is twofold. First, the user can select the rdma_addrinfo for a connection by examining the src/dst address pair. This may be desired for failover or performance reasons. Second, route resolution may require knowing both the source and destination addresses. For example, IB ACM requires both addresses as input.

Hmm. I don't have a problem with ai_src_addr being set when necessary, but setting it unconditionally seems wrong to me. In most cases the kernel should select the source during route resolution, not be forced to something chosen in userspace. Certainly for AF_INET/6 I don't think this should be done..

Apps doing complex things for failover should supply a source address in the hints and call rdma_getaddrinfo for each adaptor. AF_IB has the scope ID in the destination to specify the adaptor for link-local GIDs, so the source should not often be needed.

Not sure what you mean that ACM requires it? Doesn't ACM plug in at the rdma_getaddrinfo stage? If so it can get the source on its own like you did in this patch. I agree that ACM should always return results with the source set, because it is providing path records relative to a specific adaptor.

Jason
RE: [PATCH 31/37] librdmacm: provide abstracted verb calls
>> +static inline int
>> +rdma_get_send_comp(struct rdma_cm_id *id, struct ibv_wc *wc)
>> +{
>> +	struct ibv_cq *cq;
>> +	void *context;
>> +	int ret;
>> +
>> +	ret = ibv_poll_cq(id->send_cq, 1, wc);
>> +	if (ret)
>> +		return ret;
>> +
>> +	ret = ibv_req_notify_cq(id->send_cq, 0);
>> +	if (ret)
>> +		return ret;
>> +
>> +	ret = ibv_poll_cq(id->send_cq, 1, wc);
>> +	if (ret)
>> +		return ret;
>> +
>> +	ret = ibv_get_cq_event(id->send_cq_channel, &cq, &context);
>> +	if (ret)
>> +		return ret;
>
> This doesn't look correct. If the send isn't complete by the time the 2nd ibv_poll_cq() completes, then this function will return without having filled in the wc. Or am I missing something? Shouldn't the ibv_get_cq_event() be the first thing this function does?
>
> The same issue/question exists for rdma_get_recv_comp().

I think it's possible for the function to return without having filled in a wc. If the 2nd poll removes a completion, it can leave a cq event on the channel, which a subsequent call could retrieve, but then find the cq empty.

The idea for this call is to abstract poll, notify_cq, and get_cq_event, but still provide decent performance. (Scalability is a separate matter. I couldn't find a decent way to abstract a CQ shared across QPs or between the receive and send queues.) To avoid returning from the call without a completion, I think the following structure works:

	poll()
	notify_cq()
	poll()
	while (no completion) {
		get_cq_event()
		poll()
	}

The only drawback I see is that it's theoretically possible to build up a queue of cq events in the kernel. Not sure how to fix that. Any ideas?
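The poll / notify_cq / poll / wait-loop structure outlined above can be tried out in a small self-contained model. This is only a sketch under stated assumptions: every fake_* function and struct fake_cq are hypothetical stand-ins that mimic the return conventions of ibv_poll_cq(), ibv_req_notify_cq() and ibv_get_cq_event() as discussed in this thread, and the re-arm inside the loop is an addition that is not part of the outline in the message. It is not the librdmacm implementation.

```c
#include <assert.h>

/* Hypothetical model of a CQ and its completion event channel. */
struct fake_cq {
	int pending;	/* completions currently in the CQ */
	int armed;	/* set by notify, cleared when an event fires */
	int events;	/* CQ events queued on the channel */
	int arrivals;	/* completions that arrive while we block */
};

/* Like ibv_poll_cq(cq, 1, wc): 1 if a completion was returned, else 0. */
static int fake_poll_cq(struct fake_cq *cq, int *wc)
{
	if (!cq->pending)
		return 0;
	cq->pending--;
	*wc = 1;
	return 1;
}

/* Like ibv_req_notify_cq(): arm the CQ, 0 on success. */
static int fake_req_notify_cq(struct fake_cq *cq)
{
	cq->armed = 1;
	return 0;
}

/*
 * Like a blocking ibv_get_cq_event(): 0 once an event was consumed.
 * Returns -1 where the real call would block forever.
 */
static int fake_get_cq_event(struct fake_cq *cq)
{
	if (!cq->events && cq->arrivals) {
		cq->arrivals--;
		cq->pending++;
		if (cq->armed) {	/* an armed CQ raises one event */
			cq->armed = 0;
			cq->events++;
		}
	}
	if (!cq->events)
		return -1;
	cq->events--;
	return 0;
}

/* Returns 1 with *wc filled in, or a negative error; never 0. */
static int wait_one_completion(struct fake_cq *cq, int *wc)
{
	int ret;

	assert(cq && wc);

	ret = fake_poll_cq(cq, wc);	/* fast path, no event needed */
	if (ret)
		return ret;

	ret = fake_req_notify_cq(cq);	/* arm before sleeping */
	if (ret)
		return ret;

	ret = fake_poll_cq(cq, wc);	/* catch completions that raced the arm */
	while (ret == 0) {
		ret = fake_get_cq_event(cq);
		if (ret)
			return ret;
		ret = fake_req_notify_cq(cq);	/* assumption: re-arm each pass */
		if (ret)
			return ret;
		ret = fake_poll_cq(cq, wc);
	}
	return ret;
}
```

In this model a CQ that already holds a completion returns on the first poll without touching the event channel, and a completion that arrives after arming is picked up by the loop, so the caller never sees a zero return with an unfilled wc.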
Re: [PATCH 31/37] librdmacm: provide abstracted verb calls
Sean Hefty wrote:
>>> +static inline int
>>> +rdma_get_send_comp(struct rdma_cm_id *id, struct ibv_wc *wc)
>>> +{
>>> +	struct ibv_cq *cq;
>>> +	void *context;
>>> +	int ret;
>>> +
>>> +	ret = ibv_poll_cq(id->send_cq, 1, wc);
>>> +	if (ret)
>>> +		return ret;
>>> +
>>> +	ret = ibv_req_notify_cq(id->send_cq, 0);
>>> +	if (ret)
>>> +		return ret;
>>> +
>>> +	ret = ibv_poll_cq(id->send_cq, 1, wc);
>>> +	if (ret)
>>> +		return ret;
>>> +
>>> +	ret = ibv_get_cq_event(id->send_cq_channel, &cq, &context);
>>> +	if (ret)
>>> +		return ret;
>>
>> This doesn't look correct. If the send isn't complete by the time the 2nd ibv_poll_cq() completes, then this function will return without having filled in the wc. Or am I missing something? Shouldn't the ibv_get_cq_event() be the first thing this function does?
>>
>> The same issue/question exists for rdma_get_recv_comp().
>
> I think it's possible for the function to return without having filled in a wc.

So it's busted? Or is this intended behavior?

> If the 2nd poll removes a completion, it can leave a cq event on the channel, which a subsequent call could retrieve, but then find the cq empty.
>
> The idea for this call is to abstract poll, notify_cq, and get_cq_event, but still provide decent performance. (Scalability is a separate matter. I couldn't find a decent way to abstract a CQ shared across QPs or between the receive and send queues.) To avoid returning from the call without a completion, I think the following structure works:
>
>	poll()
>	notify_cq()
>	poll()
>	while (no completion) {
>		get_cq_event()
>		poll()
>	}

Is rdma_get_send_completion() supposed to return exactly one wc? If so then the 2 polls can cause a wc to get silently discarded. I must still not be understanding the intended use?

I would think this should just be:

	get_cq_event()
	notify_cq()
	poll()

> The only drawback I see is that it's theoretically possible to build up a queue of cq events in the kernel. Not sure how to fix that. Any ideas?

That can always happen, yes?

Steve.
Re: [PATCH] RDMA/nes: correct cap.max_inline_data assignment in nes_query_qp
thanks, applied -- Roland Dreier rola...@cisco.com || For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html
Re: [patch] infiniband: checking the wrong variable
thanks, applied. -- Roland Dreier rola...@cisco.com
Re: [PATCH 05/10] iw_cxgb4: Add connection management functions.
> +void _c4iw_free_ep(struct kref *kref)
> ...
> +	ep = container_of(container_of(kref, struct c4iw_ep_common, kref),
> +			  struct c4iw_ep, com);

sparse warns of some internal container_of variable shadowing itself here. You can avoid that and write this more simply as:

	ep = container_of(kref, struct c4iw_ep, com.kref);

-- Roland Dreier rola...@cisco.com
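To see that the nested and the single container_of() forms recover the same pointer, here is a small userspace sketch. The macro below is a simplified re-creation of the kernel one (without its type checking), and struct kref, struct ep_common and struct ep are cut-down, hypothetical stand-ins for the real kref, c4iw_ep_common and c4iw_ep definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace version of the kernel macro, for illustration. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Stand-in structures; the real ones carry many more fields. */
struct kref { int refcount; };
struct ep_common { int state; struct kref kref; };
struct ep { int id; struct ep_common com; };

/* Two-step form, as in the patch under review. */
static struct ep *ep_from_kref_nested(struct kref *k)
{
	struct ep_common *c = container_of(k, struct ep_common, kref);

	return container_of(c, struct ep, com);
}

/* One-step form using the nested member designator com.kref. */
static struct ep *ep_from_kref_direct(struct kref *k)
{
	return container_of(k, struct ep, com.kref);
}
```

Both helpers subtract the same total offset, so given a pointer to e.com.kref they return &e. The one-step form also avoids expanding one container_of() inside another, which is what made sparse report the shadowed temporary.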
Re: [PATCH 05/10] iw_cxgb4: Add connection management functions.
> +	wr_waitp = (struct c4iw_wr_wait *)rpl->data[1];

Sparse complains about this cast from __be64 to a pointer. I assume this is OK but you probably want to stick a __force in there to annotate it.

-- Roland Dreier rola...@cisco.com
Re: [PATCH 06/10] iw_cxgb4: Add memory management functions.
> +	req->wr.wr_lo = (u64)wr_wait;

wr_lo is __be64. The cast should be to __force __be64 here I think.

-- Roland Dreier rola...@cisco.com
Re: [PATCH 05/10] iw_cxgb4: Add connection management functions.
Roland Dreier wrote:
> > +int peer2peer = 0;
> > +module_param(peer2peer, int, 0644);
> > +MODULE_PARM_DESC(peer2peer, "Support peer2peer ULPs (default=0)");
>
> If you build iw_cxgb3 and iw_cxgb4 into the kernel, the peer2peer symbol names clash. (Same problem occurs if you try to load cxgb3 and cxgb4 modules at the same time, I think).

Both iw_cxgb3 and iw_cxgb4 load ok concurrently when compiled as modules.

The option was originally intended to be used in more than just cm.c. So there's a piece of code missing in qp.c. I'll clean this up. I might make an attribute in c4iw_endpoint that indicates this mode. Then the qp code won't need the global option and can key off the endpoint attribute. So I can make this a static as you suggest.

> We can fix it here in cxgb4 by just making peer2peer static (and deleting the extern declaration). However peer2peer is not that great of a name for a global symbol; might be good to add a patch to cxgb3 to rename peer2peer to something like iwch_peer2peer and using module_param_named()...

I'll do this for cxgb3.

Steve.
Re: [PATCH 02/10] iw_cxgb4: Add driver, fw, and hw headers.
Roland Dreier wrote:
> You have:
>
> > +struct fw_ri_send_wr {
> ...
> > +	__be16 wrid;
>
> > +struct fw_ri_recv_wr {
> ...
> > +	__be16 wrid;
>
> But also:
>
> > +static inline void init_wr_hdr(union t4_wr *wqe, u16 wrid,
> > +			       enum fw_wr_opcodes opcode, u8 flags, u8 len16)
> ...
> > +	wqe->send.wrid = wrid;
>
> and similar for recv.wrid in qp.c. sparse correctly warns about this endianness clash. The intention is that the device just treats wrid as opaque I assume so I think the correct fix is to go from __be16 to u16 in the structure declarations.
>
>  - R.

Yes, it should be a u16 in the wr structs.
Re: [PATCH 05/10] iw_cxgb4: Add connection management functions.
> Both iw_cxgb3 and iw_cxgb4 load ok concurrently when compiled as modules.

Oh, right. The peer2peer symbol isn't exported so the clash is only if you try to build them both into the kernel (as I often do as part of my quick build tests).

-- Roland Dreier rola...@cisco.com
Re: [PATCH 31/37] librdmacm: provide abstracted verb calls
Sean Hefty wrote:
>>> I think it's possible for the function to return without having filled in a wc.
>>
>> So it's busted? Or is this intended behavior?
>
> Depends on the point of view, I guess. :) It would be nice to avoid that situation.
>
>> Is rdma_get_send_completion() supposed to return exactly one wc? If so then the 2 polls can cause a wc to get silently discarded. I must still not be understanding the intended use?
>
> How can a wc get discarded? Maybe the return code from ibv_poll_cq is confusing you? If the first poll finds a wc, ibv_poll_cq returns 1, and we exit the function. Otherwise, we rearm the cq, then poll again to make sure that nothing got missed.

Right. I missed that. poll will return 1 if there's a completion returned. Nevermind :)

>> I would think this should just be:
>>
>>	get_cq_event()
>>	notify_cq()
>>	poll()
>
> This requires arming the CQ up front. I was also trying to avoid the overhead of always calling get_cq_event and notify_cq to just pull a completed request off of the work queue.

I was confused on the poll_cq return code (and I've been working in this code for umpteen years :) ).

>>> The only drawback I see is that it's theoretically possible to build up a queue of cq events in the kernel. Not sure how to fix that. Any ideas?
>>
>> That can always happen, yes?
>
> It seems like it should be avoidable. Maybe 1 event can queue up, but I think we can prevent more by not rearming until that event gets pulled.
>
> If nothing else, I think this discussion shows why we need this sort of wrapper. :)

Indeed! I like the wrappers.
RE: [PATCH 26/37] librdmacm: set src_addr in rdma_getaddrinfo
> I don't have a problem with ai_src_addr being set, when necessary, but setting it unconditionally seems wrong to me. In most cases the kernel should select the source during route resolution, not be forced to something in userspace.

Just to be precise, the source is selected during address resolution, and the existing APIs allow the user to indicate that a specific source should be used. This is a requirement of some applications.

> Not sure what you mean that ACM requires it? Doesn't ACM plug in at the rdma_getaddrinfo stage? If so it can get the source on its own like you did in this patch. I agree that ACM should always return results with the source set, because it is providing path records relative to a specific adaptor.

Yes - the code to set the source could move from librdmacm into ACM. I can change rdma_getaddrinfo to only set the source address if either the user provides one through a hint, or if resolved through ACM.

- Sean
Re: [PATCH 26/37] librdmacm: set src_addr in rdma_getaddrinfo
On Wed, Apr 07, 2010 at 03:10:36PM -0700, Sean Hefty wrote:

> > Not sure what you mean that ACM requires it? Doesn't ACM plug in at the rdma_getaddrinfo stage? If so it can get the source on its own like you did in this patch. I agree that ACM should always return results with the source set, because it is providing path records relative to a specific adaptor.
>
> Yes - the code to set the source could move from librdmacm into ACM. I can change rdma_getaddrinfo to only set the source address if either the user provides one through a hint, or if resolved through ACM.

That would be my preference. I think the kernel calls should use a null source address in the common case and a set source should be an exceptional case. This matches sockets very well.

I'd see two cases for setting a source address. One is an app that wants to control the bind port - this is similar to socket cases, and is generally an exceptional case.

The other is that an app wants the connection to be usable with a certain PD. This is more like the DAPL case, as far as I understand it (ie resources have been allocated against a PD prior to the addresses being known). This would be best served by having the hints include a PD and have rdma_getaddrinfo generate a source address that works with that PD. A PD is more general than a source address - in single HCA cases a PD will be usable with all ports.

Jason
RE: librdmacm meets libiwarp
> it's nice to see, that you seem to have liked most of my ideas and intentions of libiwarp :) thanks for making that work a bit more sustainable!

Thanks for the input. I think there could still be a little more work done to handle completions across shared CQs, plus add in SRQ support.

> I was wondering where I could get a copy of your latest code to look at it as a whole and (maybe) comment on it.

The code is available from my git tree, in the af_ib branch:

git://git.openfabrics.org/~shefty/librdmacm.git af_ib

- Sean
Re: [PATCH 02/10] iw_cxgb4: Add driver, fw, and hw headers.
> Shouldn't it be __u16? These structs are part of the firmware to host driver/lib API.

Yes, if this header is used by userspace too then you want __u16.

-- Roland Dreier rola...@cisco.com
Re: [PATCH v2 13/51] IB/qib: Add qib_driver.c
Roland Dreier rdre...@cisco.com wrote: Where can I find information on trace events? Something in Documentation/*? Yep, Documentation/trace/events.txt. LWN just did a really good writeup on using the TRACE_EVENT macro: http://lwn.net/Articles/379903/ Part 2 is still behind the paywall. -John Gregor -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html