Re: P
Sorry for the late reply.

2010/6/12 Dotan Barak dota...@gmail.com:
> On 12/06/2010 03:22, Ding Dinghua wrote:
>> 2010/6/11 Dotan Barak dota...@gmail.com:
>>> Hi.
>>>
>>> On 11/06/2010 10:51, Ding Dinghua wrote:
>>>> Hi all:
>>>> I'm using RDMA to do fs-metadata mirror between nodes. I encountered a strange problem when the program was running: Complete queue handler reported that the RDMA-Write operation failed, the status of corresponding struct ib_wc is IB_WC_RETRY_EXC_ERR. The problem is encountered randomly. I don't know the meaning of this error code as well as what to do next. Would anyone give me some tips? thanks a lot.
>>>
>>> Do you sync between the sides before closing the QPs?
>>
>> Can you say it more detail? thanks.
>
> If you try to send a message from local QP to a remote QP before the remote QP is in RTR state (or after it was closed/transferred to the ERROR state), you may get RETRY EXCEEDED, because there isn't any QP in the remote side that can accept your message (and send a response).
>
> How do you connect the QPs? (And how do you close the connection between them)

I call rdma_create_id to create an ib id, then resolve the remote address and the route, then set up the QP and call rdma_connect to set up the connection. Until an ack or an error reply arrives, the thread waits on a wait queue. The listening ib id of the remote node catches the connect request, sets up its QP, allocates and maps pages to construct the RDMA-WRITE space, and calls rdma_accept to reply to the request.

Some other information which may be useful:
1. All the RETRY EXCEEDED problems happened when there were two connections using RDMA-WRITE to transfer things, and the latter connection had a high probability of hitting this problem.
2. All the RETRY EXCEEDED problems happened when the RDMA-WRITE space was 256MB each (that is, two connections consume 512MB of memory). When the RDMA-WRITE space was 64MB, this problem never happened in our tests. The remote node's total memory is 2GB.

Thanks a lot.

> Dotan

--
Ding Dinghua
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: P
2010/6/12 Dotan Barak dota...@gmail.com:
> On 12/06/2010 03:22, Ding Dinghua wrote:
>> 2010/6/11 Dotan Barak dota...@gmail.com:
>>> Hi.
>>>
>>> On 11/06/2010 10:51, Ding Dinghua wrote:
>>>> Hi all:
>>>> I'm using RDMA to do fs-metadata mirror between nodes. I encountered a strange problem when the program was running: Complete queue handler reported that the RDMA-Write operation failed, the status of corresponding struct ib_wc is IB_WC_RETRY_EXC_ERR. The problem is encountered randomly. I don't know the meaning of this error code as well as what to do next. Would anyone give me some tips? thanks a lot.
>>>
>>> Do you sync between the sides before closing the QPs?
>>
>> Can you say it more detail? thanks.
>
> If you try to send a message from local QP to a remote QP before the remote QP is in RTR state (or after it was closed/transferred to the ERROR state), you may get RETRY EXCEEDED, because there isn't any QP in the remote side that can accept your message (and send a response).
>
> How do you connect the QPs? (And how do you close the connection between them)

Sorry, I forgot the close issue.
1. The local node calls ib_poll_cq to process the remaining completion queue entries.
2. The local node calls rdma_disconnect to tear down the connection. Until the remote side acks, the thread waits on a wait queue.
3. After catching this request, the remote node also calls ib_poll_cq to process its remaining completion queue entries, then does some resource-release work, then sends a reply.
4. The local node is woken up and does its resource-release work.

> Dotan

--
Ding Dinghua
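Dotan's explanation in the quoted text can be illustrated with a toy model. This is only a sketch: the enum names, `responder_acks` and `rdma_write_model` are hypothetical stand-ins rather than verbs API calls, and the model assumes (as a simplification) that only a responder in RTR or RTS acknowledges a request. It shows why a write posted before the remote QP reaches RTR, or after it has entered ERROR, ends up as a retry-exceeded completion once the retry budget is used up.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified QP states; the real state machine has more. */
enum qp_state { QP_RESET, QP_INIT, QP_RTR, QP_RTS, QP_ERROR };

/* Hypothetical stand-ins for IB_WC_SUCCESS and IB_WC_RETRY_EXC_ERR. */
enum wc_status { WC_SUCCESS, WC_RETRY_EXC_ERR };

/* Simplifying assumption: only a responder in RTR or RTS sends an ACK. */
static bool responder_acks(enum qp_state remote)
{
    return remote == QP_RTR || remote == QP_RTS;
}

/*
 * Model one RDMA Write on a reliable connection: the first attempt plus up
 * to retry_cnt retransmissions. remote_state_at[i] is the remote QP state
 * at the moment attempt i arrives, so the array needs retry_cnt + 1 entries.
 */
static enum wc_status rdma_write_model(const enum qp_state *remote_state_at,
                                       int retry_cnt)
{
    assert(retry_cnt >= 0);
    for (int attempt = 0; attempt <= retry_cnt; attempt++) {
        if (responder_acks(remote_state_at[attempt]))
            return WC_SUCCESS;
    }
    /* No attempt was acknowledged: the requester gives up. */
    return WC_RETRY_EXC_ERR;
}
```

In this model the fix suggested by the thread corresponds to making sure the remote side is already in RTR before the first attempt, and is still there when the last write is posted before disconnect.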
Re: [PATCH 5/5] dapl-2.0 - scm, ucm: add pkey, pkey_index, sl override for QP's
Hefty, Sean wrote:
> The index isn't guaranteed to be the same across all nodes. If a consumer is going to manually control this, they should really be forced to use the actual pkey.

Yes, I saw this confusion in action. For most users the pkey index doesn't mean anything, and it may also change over time, which can break scripts/settings used to run specific jobs on specific partitions.

Or.
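The point about indexes differing between nodes can be sketched as follows. This is an illustrative helper, not dapl code: `find_pkey_index` is a hypothetical name, and the comparison ignores the membership bit (bit 15) on the assumption that a consumer cares about the partition rather than the membership type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Look up the index of a P_Key in a node-local P_Key table.
 * The top bit (membership) is masked off before comparing.
 * Returns the index, or -1 if the partition is not in the table.
 */
static int find_pkey_index(const uint16_t *table, size_t len, uint16_t pkey)
{
    assert(table != NULL);
    for (size_t i = 0; i < len; i++) {
        if ((table[i] & 0x7fff) == (pkey & 0x7fff))
            return (int)i;
    }
    return -1;
}
```

Given two nodes whose tables hold the same partitions in a different order, the same P_Key resolves to different indexes, which is why a setting expressed as a P_Key is portable and one expressed as an index is not.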
Re: Handling busy responses from the SA
Mike,

On Wed, Jun 16, 2010 at 3:57 PM, Mike Heinz michael.he...@qlogic.com wrote:
> Hal,
>
> But if the original trap had retries > 0, wouldn't resending the trap be what the issuer intended?

I suppose, as there's nothing in the IBA spec that precludes using busy on TrapRepresses, although I'd be hard pressed to rationalize using that, particularly for SMP traps.

-- Hal

> I guess I'm confused why treating BUSY as similar to simply never getting a response at all is a bad thing. In my mind, receiving a BUSY response is like getting a busy signal when you call someone on the phone - a sign you need to wait a bit then try again. Similarly, if I call someone and never get an answer my strategy is going to be to wait, then try again.
>
> -----Original Message-----
> From: Hal Rosenstock [mailto:hal.rosenst...@gmail.com]
> Sent: Tuesday, June 08, 2010 8:16 PM
> To: Mike Heinz
> Cc: Hefty, Sean; linux-rdma@vger.kernel.org
> Subject: Re: Handling busy responses from the SA
>
> Mike,
>
> I'm referring to the receipt of the TrapRepress with busy status. Wouldn't your patch cause the original Trap to be resent when retries > 0? TrapRepress is essentially a response to Trap and classified as such by ib_response_mad. Your proposed patch treats a busy as a timeout and can cause retry of the original sent Trap.
>
> -- Hal
Re: [PATCH][TRIVIAL] infiniband-diags/perfquery.8: Add some missing counters to description
On 07:05 Wed 16 Jun , Hal Rosenstock wrote:
> Also, updated email address
>
> Signed-off-by: Hal Rosenstock hal.rosenst...@gmail.com

Applied. Thanks.

Sasha
RE: Handling busy responses from the SA
To be honest, we haven't been able to think of a case where a sender would use retries on a trap or a busy on a repress either, but I don't think it would hurt to omit represses from the busy handling.

Would that be acceptable to everyone? To alter the patch to allow BUSY trap repress MADs to pass through?

-----Original Message-----
From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Hal Rosenstock
Sent: Thursday, June 17, 2010 9:30 AM
To: Mike Heinz
Cc: Hefty, Sean; linux-rdma@vger.kernel.org; Todd Rimmer
Subject: Re: Handling busy responses from the SA

Mike,

On Wed, Jun 16, 2010 at 3:57 PM, Mike Heinz michael.he...@qlogic.com wrote:
> Hal,
>
> But if the original trap had retries > 0, wouldn't resending the trap be what the issuer intended?

I suppose, as there's nothing in the IBA spec that precludes using busy on TrapRepresses, although I'd be hard pressed to rationalize using that, particularly for SMP traps.

-- Hal

> I guess I'm confused why treating BUSY as similar to simply never getting a response at all is a bad thing. In my mind, receiving a BUSY response is like getting a busy signal when you call someone on the phone - a sign you need to wait a bit then try again. Similarly, if I call someone and never get an answer my strategy is going to be to wait, then try again.
>
> -----Original Message-----
> From: Hal Rosenstock [mailto:hal.rosenst...@gmail.com]
> Sent: Tuesday, June 08, 2010 8:16 PM
> To: Mike Heinz
> Cc: Hefty, Sean; linux-rdma@vger.kernel.org
> Subject: Re: Handling busy responses from the SA
>
> Mike,
>
> I'm referring to the receipt of the TrapRepress with busy status. Wouldn't your patch cause the original Trap to be resent when retries > 0? TrapRepress is essentially a response to Trap and classified as such by ib_response_mad. Your proposed patch treats a busy as a timeout and can cause retry of the original sent Trap.
>
> -- Hal
Re: [PATCH] opensm/osmtest.c: fix bug in getting attr offset
On 11:33 Tue 15 Jun , Yevgeny Kliteynik wrote:
> Fix bug that was introduced by commit 4fd4ca306f93376963725285f3bf7c87a76055b0
>
> Signed-off-by: Yevgeny Kliteynik klit...@dev.mellanox.co.il

Applied. Thanks.

Sasha
Re: [PATCH] [RESEND] opensm/osm_mcast_mgr.c: Only route MLIDs with more than 1 member
On 08:46 Mon 14 Jun , Hal Rosenstock wrote:
> rather than just more than 0 members. There is no need to route MLIDs with only 1 member either. MLIDs only need routing when 2 or more members. Single member case is handled locally.
>
> Signed-off-by: Hal Rosenstock hal.rosenst...@gmail.com

Applied. Thanks.

Sasha
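The rule stated in the patch description boils down to a one-line test. The sketch below is illustrative only; `mlid_needs_routing` and `count_routed_groups` are hypothetical helpers, not functions from osm_mcast_mgr.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* A multicast group needs switch routing only with two or more members. */
static bool mlid_needs_routing(unsigned int member_count)
{
    return member_count > 1;
}

/* Count how many groups in a list of member counts would be routed. */
static size_t count_routed_groups(const unsigned int *members, size_t n)
{
    size_t routed = 0;

    assert(members != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (mlid_needs_routing(members[i]))
            routed++;
    }
    return routed;
}
```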
Re: [PATCH v2] RDMA/CMA: fix iWARP adapter TCP port space usage
I agree that the issue must get solved and its good that it has been brought up again. I agree with Chien that the solution should respect and interface to a single in kernel instance maintaining host global TCP port space. iWARP is just another protocol on top of TCP - like iSCSI. There is no good reason to invent another TCP port maintainer per TCP user type trying to synchonize with the kernel if the resource is host global and already maintained by the kernel. Since we are developing and already open sourced a full software implementation (SoftiWARP) of RDMA, our view on the optimal solution must be different. Like kernel iSCSI, we are running on top of regular kernel sockets. With that, there is no point having a connection manager blocking just the port we wanted to use for communication - SoftiWARP uses kernel sockets for data communication. Therefore, I propose pushing back responsibility to the RDMA device driver, where the actual connection setup is initiated (RNIC) or takes place (software RMDA stack). I think, it is not the job of the RDMA connection manager to maintain TCP port space at all. It should be up to the driver to do the appropriate steps. Due to the lack of another interface, an RNIC driver would create and bind a kernel socket to get hold of the TCP port it is intending to use for offloaded communication, while a software RDMA stack just goes forward doing communication on that socket. For the future it might be a good idea to approach the netdev folks kindly asking for a neat interface for just TCP port maintainance without the need to create and bind an otherwise useless socket. Of course, the RNIC driver must restrict its activities to local IP adresses on its cards (or, for SoftiWARP, to IP adresses of interfaces it is bound to). For example, a wildcard listen must get translated into a listen restricted to the interface(s) under local control. 
With that, the RDMA connection manager should simply be aware of the possibility that a listen or connect call may fail for one more reason. From using SoftiWARP in that environment I know, that's already the case (-EADDRINUSE is always an acceptable return value). Thanks, Bernard. linux-rdma-ow...@vger.kernel.org wrote on 06/12/2010 05:17:58 PM: Roland Dreier wrote: Other protocols are also running over networking today, such as iSCSI and FCoE. These happily co-exist with other L2-L4 protocols in the stack. This iWARP patch allows iWARP to happily co-exist on a TCP connection, and does *not* negatively affect the networking stack at all. How do iSCSI offload HBAs coexist? As I understand it, they typically just choose a separate IP address. In any case I'm not going to slip in a patch that another maintainer has explicitly NAKed. Maybe one way to force things forward would be to write up an exhaustive explanation of the underlying problem and the impact on end users, include this patch, explain that it touches only RDMA code, and point out that most end users are already using this patch since it's shipped in OFED. Then send the whole thing to Linus and Andrew Morton, making sure to cc Dave Miller, netdev, and linux-rdma. - R. My 2007 thread does this basically, but posted it to lkml and David Miller. But the rationale for why we need it as well as other possible solutions is included in that thread. We could re-package it and send it on as you suggest. It might carry more weight coming from the linux rdma maintainer though. :) -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
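Bernard's suggestion of binding an otherwise unused socket to hold a TCP port can be tried out from user space. The sketch below is an analogy only: an RNIC driver would use the in-kernel socket interfaces, and `reserve_tcp_port` and `bound_port` are hypothetical helper names. It shows the two properties the proposal relies on: the host stack owns the port once the bind succeeds, and a second claimant gets EADDRINUSE.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Bind a TCP socket to ip:port so the host stack records the port as used.
 * Port 0 lets the kernel pick a free port.
 * Returns the socket fd (>= 0) on success, or a negative errno value.
 */
static int reserve_tcp_port(const char *ip, uint16_t port)
{
    struct sockaddr_in sa;
    int fd, err;

    assert(ip != NULL);
    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    sa.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &sa.sin_addr) != 1)
        return -EINVAL;

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -errno;

    if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
        err = errno;
        close(fd);
        return -err;
    }
    return fd; /* closing the fd releases the port */
}

/* Return the local port a bound socket holds, or a negative errno value. */
static int bound_port(int fd)
{
    struct sockaddr_in sa;
    socklen_t len = sizeof(sa);

    if (getsockname(fd, (struct sockaddr *)&sa, &len) < 0)
        return -errno;
    return ntohs(sa.sin_port);
}
```

Binding to a specific local address rather than the wildcard mirrors the requirement in the thread that a driver restrict itself to addresses on interfaces it controls.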
Re: [PATCH v3] opensm/osmeventplugin: added new events to monitor SM
On 14:41 Thu 10 Jun , Yevgeny Kliteynik wrote: Hi Sasha, Adding new events that allow event plug-in to see when SM finishes heavy sweep and routing configuration, when it updates dump files, when it is no longer master, and when SM port is down: OSM_EVENT_ID_HEAVY_SWEEP_DONE OSM_EVENT_ID_UCAST_ROUTING_DONE What is wrong with using Subnet Up event for those purposes? OSM_EVENT_ID_ENTERING_STANDBY OSM_EVENT_ID_SM_PORT_DOWN Instead I would suggest to make state change event. OSM_EVENT_ID_SA_DB_DUMPED Again, Subnet Up indicates that all sweep stuff is done (including dump files). The last event is reported when SA DB is actually dumped. Signed-off-by: Yevgeny Kliteynik klit...@dev.mellanox.co.il --- Changes from V2: - reduced number of events that are reported - rebased to latest master --- opensm/include/opensm/osm_event_plugin.h |7 ++- opensm/opensm/osm_state_mgr.c | 16 +++- opensm/osmeventplugin/src/osmeventplugin.c | 15 +++ 3 files changed, 36 insertions(+), 2 deletions(-) diff --git a/opensm/include/opensm/osm_event_plugin.h b/opensm/include/opensm/osm_event_plugin.h index 33d1920..a565123 100644 --- a/opensm/include/opensm/osm_event_plugin.h +++ b/opensm/include/opensm/osm_event_plugin.h @@ -72,7 +72,12 @@ typedef enum { OSM_EVENT_ID_PORT_SELECT, OSM_EVENT_ID_TRAP, OSM_EVENT_ID_SUBNET_UP, - OSM_EVENT_ID_MAX + OSM_EVENT_ID_MAX, Likely you wanted to move OSM_EVENT_ID_MAX to be last in the list. 
Sasha + OSM_EVENT_ID_HEAVY_SWEEP_DONE, + OSM_EVENT_ID_UCAST_ROUTING_DONE, + OSM_EVENT_ID_ENTERING_STANDBY, + OSM_EVENT_ID_SM_PORT_DOWN, + OSM_EVENT_ID_SA_DB_DUMPED } osm_epi_event_id_t; typedef struct osm_epi_port_id { diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c index 81c8f54..3231ae9 100644 --- a/opensm/opensm/osm_state_mgr.c +++ b/opensm/opensm/osm_state_mgr.c @@ -1151,6 +1151,8 @@ static void do_sweep(osm_sm_t * sm) if (!sm-p_subn-subnet_initialization_error) { OSM_LOG_MSG_BOX(sm-p_log, OSM_LOG_VERBOSE, REROUTE COMPLETE); + osm_opensm_report_event(sm-p_subn-p_osm, + OSM_EVENT_ID_UCAST_ROUTING_DONE, NULL); return; } } @@ -1185,6 +1187,8 @@ repeat_discovery: /* Move to DISCOVERING state */ osm_sm_state_mgr_process(sm, OSM_SM_SIGNAL_DISCOVER); + osm_opensm_report_event(sm-p_subn-p_osm, + OSM_EVENT_ID_SM_PORT_DOWN, NULL); return; } @@ -1205,6 +1209,8 @@ repeat_discovery: ENTERING STANDBY STATE); /* notify master SM about us */ osm_send_trap144(sm, 0); + osm_opensm_report_event(sm-p_subn-p_osm, + OSM_EVENT_ID_ENTERING_STANDBY, NULL); return; } @@ -1212,6 +1218,9 @@ repeat_discovery: if (sm-p_subn-force_heavy_sweep) goto repeat_discovery; + osm_opensm_report_event(sm-p_subn-p_osm, + OSM_EVENT_ID_HEAVY_SWEEP_DONE, NULL); + OSM_LOG_MSG_BOX(sm-p_log, OSM_LOG_VERBOSE, HEAVY SWEEP COMPLETE); /* If we are MASTER - get the highest remote_sm, and @@ -1314,6 +1323,8 @@ repeat_discovery: OSM_LOG_MSG_BOX(sm-p_log, OSM_LOG_VERBOSE, SWITCHES CONFIGURED FOR UNICAST); + osm_opensm_report_event(sm-p_subn-p_osm, + OSM_EVENT_ID_UCAST_ROUTING_DONE, NULL); if (!sm-p_subn-opt.disable_multicast) { osm_mcast_mgr_process(sm); @@ -1375,7 +1386,10 @@ repeat_discovery: if (osm_log_is_active(sm-p_log, OSM_LOG_VERBOSE) || sm-p_subn-opt.sa_db_dump) - osm_sa_db_file_dump(sm-p_subn-p_osm); + if (!osm_sa_db_file_dump(sm-p_subn-p_osm)) + osm_opensm_report_event(sm-p_subn-p_osm, + OSM_EVENT_ID_SA_DB_DUMPED, NULL); + } /* diff --git 
a/opensm/osmeventplugin/src/osmeventplugin.c b/opensm/osmeventplugin/src/osmeventplugin.c index b4d9ce9..af68a5c 100644 --- a/opensm/osmeventplugin/src/osmeventplugin.c +++ b/opensm/osmeventplugin/src/osmeventplugin.c @@ -176,6 +176,21 @@ static void report(void *_log, osm_epi_event_id_t event_id, void *event_data) case OSM_EVENT_ID_SUBNET_UP: fprintf(log-log_file, Subnet up reported\n); break; + case OSM_EVENT_ID_HEAVY_SWEEP_DONE: + fprintf(log-log_file, Heavy sweep completed\n); + break; + case OSM_EVENT_ID_UCAST_ROUTING_DONE: + fprintf(log-log_file, Unicast routing completed\n); + break; +
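Sasha's remark about OSM_EVENT_ID_MAX matters because a MAX enumerator is typically used as a count or upper bound. The sketch below uses a reduced, hypothetical enum (`EV_*`, `event_names`, `event_name`), not the real osm_epi_event_id_t, to show the usual pattern that breaks if new IDs are appended after MAX.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, shortened event list; EV_MAX must remain the last entry. */
typedef enum {
    EV_TRAP,
    EV_SUBNET_UP,
    EV_HEAVY_SWEEP_DONE,
    EV_UCAST_ROUTING_DONE,
    EV_MAX
} event_id_t;

/* Table sized by EV_MAX: only correct while every real ID is below EV_MAX. */
static const char *const event_names[EV_MAX] = {
    [EV_TRAP]               = "trap",
    [EV_SUBNET_UP]          = "subnet up",
    [EV_HEAVY_SWEEP_DONE]   = "heavy sweep done",
    [EV_UCAST_ROUTING_DONE] = "unicast routing done",
};

/* Compile-time reminder to extend the table when an ID is added. */
static_assert(EV_MAX == 4, "update event_names when adding an event ID");

/* Bounds-checked lookup; IDs at or beyond EV_MAX are rejected. */
static const char *event_name(event_id_t id)
{
    return ((unsigned int)id < (unsigned int)EV_MAX) ? event_names[id] : NULL;
}
```

If the new IDs were placed after EV_MAX, as in the posted hunk, they would fall outside the table and be rejected by any range check of this kind.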
Re: [Patch v2] opensm/main.c: force stdout to be line-buffered
On 15:00 Thu 10 Jun , Yevgeny Kliteynik wrote:
> When stdout is assigned to a terminal, it is line-buffered. But when opensm's stdout is redirected to a file, stdout becomes block-buffered, which means that '\n' won't cause the buffer to be flushed.
>
> Forcing stdout to always be line-buffered and to have a more predictable behavior when used as "opensm > some_file".
>
> Signed-off-by: Yevgeny Kliteynik klit...@dev.mellanox.co.il

Applied. Thanks.

Sasha
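For readers unfamiliar with the mechanism, the standard C way to get this behavior is setvbuf() with _IOLBF. The sketch below is not taken from the patch; `force_line_buffered` is a hypothetical wrapper, and the call is assumed to happen before any other I/O on the stream, as the C standard requires.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Switch a stream to line buffering so each '\n' flushes the buffer,
 * whether the stream is a terminal or a redirected file.
 * Passing NULL lets the C library allocate the buffer itself.
 * Returns 0 on success, nonzero on failure (same as setvbuf).
 */
static int force_line_buffered(FILE *stream)
{
    assert(stream != NULL);
    return setvbuf(stream, NULL, _IOLBF, 0);
}
```

A program would typically call `force_line_buffered(stdout)` once at the top of main(), before the first printf.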
Re: [PATCH resend] opensm/osm_sa_path_record.c: adding wrapper for pr_rcv_get_path_parms()
On 16:49 Thu 10 Jun , Yevgeny Kliteynik wrote:
> Adding non-static wrapper function for pr_rcv_get_path_parms() function to enable calling path record calculation function from outside this file.
>
> Signed-off-by: Yevgeny Kliteynik klit...@dev.mellanox.co.il

Applied. Thanks.

Sasha
Re: [PATCHv2] opensm/osm_qos.c: Eliminate unneeded endport SL to VL setup
On 09:09 Mon 14 Jun , Hal Rosenstock wrote:
> This is intended. It's not needed since it's only doing this in the wildcarded case and the wildcarding includes port 0.
>
> Any reason not to move ahead on this ? Thanks.

No reason. Applied. Thanks.

Sasha
Re: [PATCH v3] opensm/osmeventplugin: added new events to monitor SM
Hi Sasha, On 17-Jun-10 5:18 PM, Sasha Khapyorsky wrote: On 14:41 Thu 10 Jun , Yevgeny Kliteynik wrote: Hi Sasha, Adding new events that allow event plug-in to see when SM finishes heavy sweep and routing configuration, when it updates dump files, when it is no longer master, and when SM port is down: OSM_EVENT_ID_HEAVY_SWEEP_DONE OSM_EVENT_ID_UCAST_ROUTING_DONE What is wrong with using Subnet Up event for those purposes? There is a big difference between SWEEP_DONE and SUBNET_UP events. The former happens before all the managers (drop manager, QoS, unicast and multicast routing, etc), so there is a long period between two events. Moreover, after SWEEP_DONE there is a lot of information that is later cleared. As for ROUTING_DONE, if OSM is doing re-route only, then routing might change, and we don't get SUBNET_UP event. Furthermore, when torus2QoS routing will be included in the SM, the re-route will also cause QoS configuration to change. OSM_EVENT_ID_ENTERING_STANDBY OSM_EVENT_ID_SM_PORT_DOWN Instead I would suggest to make state change event. OK OSM_EVENT_ID_SA_DB_DUMPED Again, Subnet Up indicates that all sweep stuff is done (including dump files). This is true. In fact, the way I posed it, there is no point adding this event. However, this event should also be sent when SA DB is dumped at the end of light sweep, and then SUBNET_UP cannot replace it. The last event is reported when SA DB is actually dumped. 
Signed-off-by: Yevgeny Kliteynikklit...@dev.mellanox.co.il --- Changes from V2: - reduced number of events that are reported - rebased to latest master --- opensm/include/opensm/osm_event_plugin.h |7 ++- opensm/opensm/osm_state_mgr.c | 16 +++- opensm/osmeventplugin/src/osmeventplugin.c | 15 +++ 3 files changed, 36 insertions(+), 2 deletions(-) diff --git a/opensm/include/opensm/osm_event_plugin.h b/opensm/include/opensm/osm_event_plugin.h index 33d1920..a565123 100644 --- a/opensm/include/opensm/osm_event_plugin.h +++ b/opensm/include/opensm/osm_event_plugin.h @@ -72,7 +72,12 @@ typedef enum { OSM_EVENT_ID_PORT_SELECT, OSM_EVENT_ID_TRAP, OSM_EVENT_ID_SUBNET_UP, - OSM_EVENT_ID_MAX + OSM_EVENT_ID_MAX, Likely you wanted to move OSM_EVENT_ID_MAX to be last in the list. Oops... -- Yevgeny Sasha + OSM_EVENT_ID_HEAVY_SWEEP_DONE, + OSM_EVENT_ID_UCAST_ROUTING_DONE, + OSM_EVENT_ID_ENTERING_STANDBY, + OSM_EVENT_ID_SM_PORT_DOWN, + OSM_EVENT_ID_SA_DB_DUMPED } osm_epi_event_id_t; typedef struct osm_epi_port_id { diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c index 81c8f54..3231ae9 100644 --- a/opensm/opensm/osm_state_mgr.c +++ b/opensm/opensm/osm_state_mgr.c @@ -1151,6 +1151,8 @@ static void do_sweep(osm_sm_t * sm) if (!sm-p_subn-subnet_initialization_error) { OSM_LOG_MSG_BOX(sm-p_log, OSM_LOG_VERBOSE, REROUTE COMPLETE); + osm_opensm_report_event(sm-p_subn-p_osm, + OSM_EVENT_ID_UCAST_ROUTING_DONE, NULL); return; } } @@ -1185,6 +1187,8 @@ repeat_discovery: /* Move to DISCOVERING state */ osm_sm_state_mgr_process(sm, OSM_SM_SIGNAL_DISCOVER); + osm_opensm_report_event(sm-p_subn-p_osm, + OSM_EVENT_ID_SM_PORT_DOWN, NULL); return; } @@ -1205,6 +1209,8 @@ repeat_discovery: ENTERING STANDBY STATE); /* notify master SM about us */ osm_send_trap144(sm, 0); + osm_opensm_report_event(sm-p_subn-p_osm, + OSM_EVENT_ID_ENTERING_STANDBY, NULL); return; } @@ -1212,6 +1218,9 @@ repeat_discovery: if (sm-p_subn-force_heavy_sweep) goto repeat_discovery; + 
osm_opensm_report_event(sm-p_subn-p_osm, + OSM_EVENT_ID_HEAVY_SWEEP_DONE, NULL); + OSM_LOG_MSG_BOX(sm-p_log, OSM_LOG_VERBOSE, HEAVY SWEEP COMPLETE); /* If we are MASTER - get the highest remote_sm, and @@ -1314,6 +1323,8 @@ repeat_discovery: OSM_LOG_MSG_BOX(sm-p_log, OSM_LOG_VERBOSE, SWITCHES CONFIGURED FOR UNICAST); + osm_opensm_report_event(sm-p_subn-p_osm, + OSM_EVENT_ID_UCAST_ROUTING_DONE, NULL); if (!sm-p_subn-opt.disable_multicast) { osm_mcast_mgr_process(sm); @@ -1375,7 +1386,10 @@ repeat_discovery: if (osm_log_is_active(sm-p_log, OSM_LOG_VERBOSE) || sm-p_subn-opt.sa_db_dump) - osm_sa_db_file_dump(sm-p_subn-p_osm); + if (!osm_sa_db_file_dump(sm-p_subn-p_osm)) +
Re: [PATCH v2] RDMA/CMA: fix iWARP adapter TCP port space usage
Bernard Metzler wrote: I agree that the issue must get solved and its good that it has been brought up again. I agree with Chien that the solution should respect and interface to a single in kernel instance maintaining host global TCP port space. iWARP is just another protocol on top of TCP - like iSCSI. There is no good reason to invent another TCP port maintainer per TCP user type trying to synchonize with the kernel if the resource is host global and already maintained by the kernel. Since we are developing and already open sourced a full software implementation (SoftiWARP) of RDMA, our view on the optimal solution must be different. Like kernel iSCSI, we are running on top of regular kernel sockets. With that, there is no point having a connection manager blocking just the port we wanted to use for communication - SoftiWARP uses kernel sockets for data communication. Hey Bernard, Has SoftiWARP been submitted upstream yet? Therefore, I propose pushing back responsibility to the RDMA device driver, where the actual connection setup is initiated (RNIC) or takes place (software RMDA stack). I think, it is not the job of the RDMA connection manager to maintain TCP port space at all. It should be up to the driver to do the appropriate steps. Due to the lack of another interface, an RNIC driver would create and bind a kernel socket to get hold of the TCP port it is intending to use for offloaded communication, while a software RDMA stack just goes forward doing communication on that socket. For the future it might be a good idea to approach the netdev folks kindly asking for a neat interface for just TCP port maintainance without the need to create and bind an otherwise useless socket. I proposed this design in 2007. It was NAK'd. Read the tail end of this email where I describe such a solution and indicate that Miller already NAK'd it. 
Now we could try again with this solution, but unless we have end users backing us and showing how much demand there is for this, it won't fly IMO. http://lkml.org/lkml/2007/8/15/174 Of course, the RNIC driver must restrict its activities to local IP adresses on its cards (or, for SoftiWARP, to IP adresses of interfaces it is bound to). For example, a wildcard listen must get translated into a listen restricted to the interface(s) under local control. I implemented and submitted this type of solution for cxgb3 in 2007 as well. http://lkml.org/lkml/2007/9/13/268 Roland didn't like it, I think, because it used well known tokens in the interface name to designate iwarp ip addresses via ifconfig. Like eth0:iw1. So the solution really required the admin to setup these iwarp-only subnets/interfaces. There was nothing that prevented non iwarp traffic to arrive on these ip addresses other than admin policy. I think that was another reason Roland didn't like this solution. Anyway, you can peruse that thread and maybe its a starting point for some separate iwarp ipaddresses solution Steve. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH/RFC] mlx4_core: module param to limit msix vec allocation
> The mlx4_core driver allocates 'nreq' msix vectors (and irqs), where:
>
>   nreq = min_t(int, dev->caps.num_eqs - dev->caps.reserved_eqs, num_possible_cpus() + 1);
>
> ConnectX HCAs support 512 event queues (4 reserved). On a system with enough processors, we get:
>
>   mlx4_core 0006:01:00.0: Requested 508 vectors, but only 256 MSI-X vectors available, trying again
>
> Further attempts (by other drivers) to allocate interrupts fail, because mlx4_core got 'em all. How about this?

Hi,

I think that this patch would do the job. Anyway, we are thinking of ways to change our interrupt allocation scheme.

--Yevgeny
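The arithmetic behind the "Requested 508 vectors" message, and the effect of a cap, can be worked through in a few lines. This is a sketch under assumptions: `msix_vectors_to_request` and its `limit` argument are hypothetical and only model what a module parameter could do, since the actual patch body is not quoted in this message.

```c
#include <assert.h>

static int min_int(int a, int b)
{
    return a < b ? a : b;
}

/*
 * Model of the vector request size described above:
 *   nreq = min(num_eqs - reserved_eqs, possible_cpus + 1)
 * with an optional cap. limit <= 0 means "no cap" in this sketch.
 */
static int msix_vectors_to_request(int num_eqs, int reserved_eqs,
                                   int possible_cpus, int limit)
{
    int nreq;

    assert(num_eqs >= reserved_eqs);
    nreq = min_int(num_eqs - reserved_eqs, possible_cpus + 1);
    if (limit > 0 && nreq > limit)
        nreq = limit;
    return nreq;
}
```

With 512 event queues, 4 reserved and a large enough CPU count the result is 512 - 4 = 508, matching the log line; on an 8-CPU machine it is 9, and a cap of 32 would bring the large system down to 32.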
[PATCH] opensm/osm_trap_rcv.c: No need to check for sweep for trap 145
Trap 145 merely carries the SystemImageGUID (and indication that it changed) so there is no need (to even check) for sweep

Signed-off-by: Hal Rosenstock hal.rosenst...@gmail.com
---
diff --git a/opensm/opensm/osm_trap_rcv.c b/opensm/opensm/osm_trap_rcv.c
index 500632c..71429c4 100644
--- a/opensm/opensm/osm_trap_rcv.c
+++ b/opensm/opensm/osm_trap_rcv.c
@@ -2,7 +2,7 @@
  * Copyright (c) 2004-2009 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2002-2007 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
- * Copyright (c) 2009 HNR Consulting. All rights reserved.
+ * Copyright (c) 2009,2010 HNR Consulting. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses. You may choose to be licensed under the terms of the GNU
@@ -510,10 +510,12 @@ static void trap_rcv_process_request(IN osm_sm_t * sm,
 				"ERR 3812: No physical port found for trap 144: \"node description update\"\n");
 			goto check_sweep;
-		} else if (cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 145)
+		} else if (cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 145) {
 			/* this assumes that trap 145 content is not broken? */
 			p_physp->p_node->node_info.sys_guid =
 				p_ntci->data_details.ntc_145.new_sys_guid;
+			goto check_report;
+		}
 
 check_sweep:
 	/* do a sweep if we received a trap */
[PATCHv3][RESEND] opensm/osm_console.c: Add dump and clear redir perfmgr command support
Follows previous patch that adds better redirection support into PerfMgr Signed-off-by: Hal Rosenstock hal.rosenst...@gmail.com --- Changes since v2: Rebased Changes since v1: Changes based on changes to PerfMgr redir support in v3 patch diff --git a/opensm/opensm/osm_console.c b/opensm/opensm/osm_console.c index 49b0ae0..1779d9d 100644 --- a/opensm/opensm/osm_console.c +++ b/opensm/opensm/osm_console.c @@ -1,6 +1,6 @@ /* * Copyright (c) 2005-2009 Voltaire, Inc. All rights reserved. - * Copyright (c) 2009 HNR Consulting. All rights reserved. + * Copyright (c) 2009,2010 HNR Consulting. All rights reserved. * Copyright (c) 2010 Mellanox Technologies LTD. All rights reserved. * * This software is available to you under a choice of one of two @@ -232,7 +232,7 @@ static void help_update_desc(FILE *out, int detail) static void help_perfmgr(FILE * out, int detail) { fprintf(out, - perfmgr [enable|disable|clear_counters|dump_counters|print_counters|sweep_time[seconds]]\n); + perfmgr [enable|disable|clear_counters|dump_counters|print_counters|dump_redir|clear_redir|sweep_time[seconds]]\n); if (detail) { fprintf(out, perfmgr -- print the performance manager state\n); @@ -246,6 +246,10 @@ static void help_perfmgr(FILE * out, int detail) [dump_counters [mach]] -- dump the counters (optionally in [mach]ine readable format)\n); fprintf(out, [print_counters nodename|nodeguid] -- print the counters for the specified node\n); + fprintf(out, + [dump_redir [nodename|nodeguid]] -- dump the redirection table\n); + fprintf(out, + [clear_redir [nodename|nodeguid]] -- clear the redirection table\n); } } #endif /* ENABLE_OSM_PERF_MGR */ @@ -1180,6 +1184,152 @@ static void update_desc_parse(char **p_last, osm_opensm_t * p_osm, FILE * out) } #ifdef ENABLE_OSM_PERF_MGR +static monitored_node_t *find_node_by_name(osm_opensm_t * p_osm, + char *nodename) +{ + cl_map_item_t *item; + monitored_node_t *node; + + item = cl_qmap_head(p_osm-perfmgr.monitored_map); +while (item != 
cl_qmap_end(p_osm-perfmgr.monitored_map)) { +node = (monitored_node_t *)item; +if (strcmp(node-name, nodename) == 0) + return node; +item = cl_qmap_next(item); +} + + return NULL; +} + +static monitored_node_t *find_node_by_guid(osm_opensm_t * p_osm, + uint64_t guid) +{ + cl_map_item_t *node; + + node = cl_qmap_get(p_osm-perfmgr.monitored_map, guid); + if (node != cl_qmap_end(p_osm-perfmgr.monitored_map)) + return (monitored_node_t *)node; + + return NULL; +} + +static void dump_redir_entry(monitored_node_t *p_mon_node, FILE * out) +{ + int port, redir; + + /* only display monitored nodes with redirection info */ + redir = 0; + for (port = (p_mon_node-esp0) ? 0 : 1; +port p_mon_node-num_ports; port++) { + if (p_mon_node-port[port].redirection) { + if (!redir) { + fprintf(out,Node GUID ESP0 Name\n); + fprintf(out,- \n); + fprintf(out,0x% PRIx64 %d %s\n, + p_mon_node-guid, p_mon_node-esp0, + p_mon_node-name); + fprintf(out, \n Port Valid LIDs PKey QPPKey Index\n); + fprintf(out, - -- --\n); + redir = 1; + } + fprintf(out,%d%d %u-%u 0x%x 0x%x %d\n, + port, p_mon_node-port[port].valid, + cl_ntoh16(p_mon_node-port[port].orig_lid), + cl_ntoh16(p_mon_node-port[port].lid), + cl_ntoh16(p_mon_node-port[port].pkey), + cl_ntoh32(p_mon_node-port[port].qp), + p_mon_node-port[port].pkey_ix); + } + } + if (redir) + fprintf(out, \n); +} + +static void dump_redir(osm_opensm_t * p_osm, char *nodename, FILE * out) +{ + monitored_node_t *p_mon_node; + uint64_t guid; + + if (!p_osm-subn.opt.perfmgr_redir) + fprintf(out, Perfmgr redirection not enabled\n); + + fprintf(out, \nRedirection Table\n); + fprintf(out, -\n); + cl_plock_acquire(p_osm-lock); + if (nodename) { + guid =
Re: [PATCH] pkey fix for ipoib - resubmission
On Wed, Jun 16, 2010 at 4:59 PM, Mike Heinz michael.he...@qlogic.com wrote:
> IPoIB is coded to use the 1st PKey in the PKey table as its ib0 interface. Additional ib0.pkey interfaces may be created using the /sys/class/... add_child interface. However, there is a race. During normal boot, IPoIB will be started before the port is Active. Hence the pkey table has not yet been programmed and has a default pkey table (with 0x as only pkey).

So what's wrong with using the default pkey? It is valid, and I don't see why we should ignore it.

> Later when the SM moves the port to Active, the SM may program the pkey table differently. However at this point IPoIB has already started using the incorrect pkey. It appears that the initially formatted 'broadcast' mgid is never updated to supply actual pkey value if ipoib comes up before hca port. Proposed patch targets two issues:
> 1. Suppress activation of interface and join multicast group queries (it will fail anyway) until hca port is initialized. When port becomes active - update pkey value and move on.

I don't think this is required.

> 2. Update broadcast mgid based on actual pkey, then issue join broadcast group request.

I agree that the broadcast MGID is not updated. But it seems to me that all that's needed is to update priv->dev->broadcast with the updated pkey at ipoib_open(). The rest is already taken care of since pkey change events are handled by IPoIB.

Signed-Off-By: Michael Heinz michael.he...@qlogic.com
---
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index ec6b4fb..496d96c 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -51,6 +51,7 @@ MODULE_PARM_DESC(data_debug_level,
 #endif
 
 static DEFINE_MUTEX(pkey_mutex);
+static void ipoib_pkey_dev_check_presence(struct net_device *dev);
 
 struct ipoib_ah *ipoib_create_ah(struct net_device *dev,
 				 struct ib_pd *pd, struct ib_ah_attr *attr)
@@ -654,12 +655,13 @@ int ipoib_ib_dev_open(struct net_device *dev)
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	int ret;
 
-	if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &priv->pkey_index)) {
+	ipoib_pkey_dev_check_presence(dev);
+
+	if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) {
 		ipoib_warn(priv, "P_Key 0x%04x not found\n", priv->pkey);
 		clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
 		return -1;
 	}
-	set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
 
 	ret = ipoib_init_qp(dev);
 	if (ret) {
@@ -694,9 +696,26 @@ int ipoib_ib_dev_open(struct net_device *dev)
 static void ipoib_pkey_dev_check_presence(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
-	u16 pkey_index = 0;
+	struct ib_port_attr port_attr;
+
+	if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) {
+		clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
+		if (ib_query_port(priv->ca, priv->port, &port_attr)) {
+			ipoib_warn(priv, "Query port attrs failed\n");
+			return;
+		}
+
+		if (port_attr.state != IB_PORT_ACTIVE)
+			return;
+
+		if (ib_query_pkey(priv->ca, priv->port, 0, &priv->pkey)) {
+			ipoib_warn(priv, "Query P_Key table entry 0 failed\n");
+			return;
+		}
+		set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
+	}
 
-	if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &pkey_index))
+	if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &priv->pkey_index))
 		clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
 	else
 		set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
@@ -955,7 +974,8 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv,
 	}
 
 	/* restart QP only if P_Key index is changed */
-	if (test_and_set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags) &&
+	if (test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags) &&
+	    test_and_set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags) &&
 	    new_index == priv->pkey_index) {
 		ipoib_dbg(priv, "Not flushing - P_Key index not changed.\n");
 		return;
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 3871ac6..6fe6527 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -552,6 +552,13 @@ void ipoib_mcast_join_task(struct work_struct *work)
 	}
 
 	spin_lock_irq(&priv->lock);
+
+	if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) {
+		/* fix broadcast gid in case if pkey was changed */
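
For readers following the broadcast MGID discussion, here is a minimal user-space sketch of what "updating the broadcast address with the pkey" amounts to. It assumes the 20-byte IPoIB hardware address layout (4 bytes of flags/QPN followed by the 16-byte GID) with the P_Key stored in bytes 8 and 9. The names `set_broadcast_pkey`, `get_broadcast_pkey` and `IPOIB_HWADDR_LEN` are illustrative only; this is not the code from the truncated hunk above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IPOIB_HWADDR_LEN 20 /* assumed: 4 bytes flags/QPN + 16 bytes GID */

/* Store the P_Key, most significant byte first, in bytes 8 and 9 of an
 * IPoIB broadcast hardware address. Hypothetical helper for illustration. */
void set_broadcast_pkey(uint8_t hwaddr[IPOIB_HWADDR_LEN], uint16_t pkey)
{
	assert(hwaddr != NULL);
	hwaddr[8] = (uint8_t)(pkey >> 8);
	hwaddr[9] = (uint8_t)(pkey & 0xff);
}

/* Read the P_Key back out of the same two bytes. */
uint16_t get_broadcast_pkey(const uint8_t hwaddr[IPOIB_HWADDR_LEN])
{
	assert(hwaddr != NULL);
	return (uint16_t)((hwaddr[8] << 8) | hwaddr[9]);
}
```

Only two bytes of the address change; the rest of the MGID is left alone, which is why a refresh at open time (or on a pkey change event) is cheap.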
RE: [PATCH] pkey fix for ipoib - resubmission
> So what's wrong with using the default pkey? It is valid, and I don't see why we should ignore it.

In fabrics using quality of service and virtual fabrics, the default pkey is probably the wrong one to use for network traffic - and may not work at all. Remember, the real default pkey is 0x7fff, not 0x - and 0x7fff only guarantees communications with the SM, not with other nodes.

> I don't think this is required.

It is certainly required for any fabric that does not permit the use of 0x as a pkey for ipoib traffic.

-----Original Message-----
From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Eli Cohen
Sent: Thursday, June 17, 2010 12:37 PM
To: Mike Heinz
Cc: linux-rdma@vger.kernel.org; Roland Dreier
Subject: Re: [PATCH] pkey fix for ipoib - resubmission

On Wed, Jun 16, 2010 at 4:59 PM, Mike Heinz michael.he...@qlogic.com wrote:
> IPoIB is coded to use the 1st PKey in the PKey table as its ib0 interface. Additional ib0.pkey interfaces may be created using the /sys/class/... add_child interface. However, there is a race. During normal boot, IPoIB will be started before the port is Active. Hence the pkey table has not yet been programmed and has a default pkey table (with 0x as only pkey).

So what's wrong with using the default pkey? It is valid, and I don't see why we should ignore it.

> Later when the SM moves the port to Active, the SM may program the pkey table differently. However at this point IPoIB has already started using the incorrect pkey. It appears that the initially formatted 'broadcast' mgid is never updated to supply actual pkey value if ipoib comes up before hca port. Proposed patch targets two issues:
> 1. Suppress activation of interface and join multicast group queries (it will fail anyway) until hca port is initialized. When port becomes active - update pkey value and move on.

I don't think this is required.

> 2. Update broadcast mgid based on actual pkey, then issue join broadcast group request.

I agree that the broadcast MGID is not updated. But it seems to me that all that's needed is to update priv-dev-broadcast with the updated pkey at ipoib_open(). The rest is already taken care of since pkey change events are handled by IPoIB. Signed-Off-By: Michael Heinz michael.he...@qlogic.com --- diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c index ec6b4fb..496d96c 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c @@ -51,6 +51,7 @@ MODULE_PARM_DESC(data_debug_level, #endif static DEFINE_MUTEX(pkey_mutex); +static void ipoib_pkey_dev_check_presence(struct net_device *dev); struct ipoib_ah *ipoib_create_ah(struct net_device *dev, struct ib_pd *pd, struct ib_ah_attr *attr) @@ -654,12 +655,13 @@ int ipoib_ib_dev_open(struct net_device *dev) struct ipoib_dev_priv *priv = netdev_priv(dev); int ret; - if (ib_find_pkey(priv-ca, priv-port, priv-pkey, priv-pkey_index)) { + ipoib_pkey_dev_check_presence(dev); + + if (!test_bit(IPOIB_PKEY_ASSIGNED, priv-flags)) { ipoib_warn(priv, P_Key 0x%04x not found\n, priv-pkey); clear_bit(IPOIB_PKEY_ASSIGNED, priv-flags); return -1; } - set_bit(IPOIB_PKEY_ASSIGNED, priv-flags); ret = ipoib_init_qp(dev); if (ret) { @@ -694,9 +696,26 @@ int ipoib_ib_dev_open(struct net_device *dev) static void ipoib_pkey_dev_check_presence(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); - u16 pkey_index = 0; + struct ib_port_attr port_attr; + + if (!test_bit(IPOIB_FLAG_SUBINTERFACE, priv-flags)) { + clear_bit(IPOIB_PKEY_ASSIGNED, priv-flags); + if (ib_query_port(priv-ca, priv-port, port_attr)) { + ipoib_warn(priv, Query port attrs failed\n); + return; + } + + if (port_attr.state != IB_PORT_ACTIVE) + return; + + if (ib_query_pkey(priv-ca, priv-port, 0, priv-pkey)) { + ipoib_warn(priv, Query P_Key table entry 0 failed\n); + return; + } + set_bit(IPOIB_PKEY_ASSIGNED, priv-flags); + } - if (ib_find_pkey(priv-ca, priv-port, priv-pkey, 
pkey_index)) + if (ib_find_pkey(priv-ca, priv-port, priv-pkey, priv-pkey_index)) clear_bit(IPOIB_PKEY_ASSIGNED, priv-flags); else set_bit(IPOIB_PKEY_ASSIGNED, priv-flags); @@ -955,7 +974,8 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, } /* restart QP only if P_Key index is changed */ - if (test_and_set_bit(IPOIB_PKEY_ASSIGNED, priv-flags) + if
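
The membership argument in this reply can be made concrete with a small sketch. It relies on the InfiniBand convention that the top bit of a P_Key is the membership bit (set for full, clear for limited) and the low 15 bits identify the partition; two ports can exchange traffic only if the low 15 bits match and at least one side is a full member. The helper names below are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PKEY_FULL_MEMBER 0x8000u /* membership bit */
#define PKEY_BASE_MASK   0x7fffu /* partition number */

/* True when the P_Key carries full membership. */
bool pkey_is_full(uint16_t pkey)
{
	return (pkey & PKEY_FULL_MEMBER) != 0;
}

/* True when two ports holding these P_Keys may talk to each other:
 * same partition, and not both limited members. */
bool pkeys_can_talk(uint16_t a, uint16_t b)
{
	if ((a & PKEY_BASE_MASK) != (b & PKEY_BASE_MASK))
		return false;
	return pkey_is_full(a) || pkey_is_full(b);
}
```

With this rule, two nodes that both hold only 0x7fff cannot reach each other, while either can reach a full member such as the SM node, which is the situation described above.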
Re: [PATCHv4][RESEND] opensm/PerfMgr: Better redirection support
Sasha, I was thinking of doing something similar to this. When can you get this applied? Thanks, Ira On Thu, 17 Jun 2010 09:03:35 -0700 Hal Rosenstock hnr...@comcast.net wrote: Handle PKey and QPN redirection information GID redirection handling remains Signed-off-by: Hal Rosenstock hal.rosenst...@gmail.com --- Changes since v3: Rebased Changes since v2: Use OpenSM DB rather than vendor layer for local port number and PKeys Change most log levels from ERROR to VERBOSE Redirection info validity now determined by single flag validate_redir_pkey returns pkey index or -1 rather than boolean Removed redir_ prefixes Changes since v1: Added include of osm_helper.h to osm_perfmgr.c diff --git a/opensm/include/opensm/osm_perfmgr.h b/opensm/include/opensm/osm_perfmgr.h index c26c141..34925e8 100644 --- a/opensm/include/opensm/osm_perfmgr.h +++ b/opensm/include/opensm/osm_perfmgr.h @@ -1,7 +1,7 @@ /* * Copyright (c) 2007 The Regents of the University of California. * Copyright (c) 2007-2009 Voltaire, Inc. All rights reserved. - * Copyright (c) 2009 HNR Consulting. All rights reserved. + * Copyright (c) 2009,2010 HNR Consulting. All rights reserved. * * This software is available to you under a choice of one of two * licenses. 
You may choose to be licensed under the terms of the GNU @@ -90,11 +90,17 @@ typedef enum { PERFMGR_SWEEP_SUSPENDED } osm_perfmgr_sweep_state_t; -/* Redirection information */ -typedef struct redir { - ib_net16_t redir_lid; - ib_net32_t redir_qp; -} redir_t; +typedef struct monitored_port { + uint16_t pkey_ix; + ib_net16_t orig_lid; + boolean_t redirection; + boolean_t valid; + /* Redirection fields from ClassPortInfo */ + ib_gid_t gid; + ib_net16_t lid; + ib_net16_t pkey; + ib_net32_t qp; +} monitored_port_t; /* Node to store information about nodes being monitored */ typedef struct monitored_node { @@ -104,7 +110,7 @@ typedef struct monitored_node { boolean_t esp0; char *name; uint32_t num_ports; - redir_t redir_port[1]; /* redirection on a per port basis */ + monitored_port_t port[1]; } monitored_node_t; struct osm_opensm; @@ -134,6 +140,8 @@ typedef struct osm_perfmgr { uint32_t max_outstanding_queries; cl_qmap_t monitored_map;/* map the nodes being tracked */ monitored_node_t *remove_list; + ib_net64_t port_guid; + int16_t local_port; } osm_perfmgr_t; /* * FIELDS diff --git a/opensm/opensm/osm_perfmgr.c b/opensm/opensm/osm_perfmgr.c index 398b463..d86e1c6 100644 --- a/opensm/opensm/osm_perfmgr.c +++ b/opensm/opensm/osm_perfmgr.c @@ -1,7 +1,7 @@ /* * Copyright (c) 2007 The Regents of the University of California. * Copyright (c) 2007-2009 Voltaire, Inc. All rights reserved. - * Copyright (c) 2009 HNR Consulting. All rights reserved. + * Copyright (c) 2009,2010 HNR Consulting. All rights reserved. * * This software is available to you under a choice of one of two * licenses. 
You may choose to be licensed under the terms of the GNU @@ -64,6 +64,7 @@ #include opensm/osm_log.h #include opensm/osm_node.h #include opensm/osm_opensm.h +#include opensm/osm_helper.h #define PERFMGR_INITIAL_TID_VALUE 0xcafe @@ -194,6 +195,7 @@ static void perfmgr_mad_send_err_callback(void *bind_context, uint8_t port = context-perfmgr_context.port; cl_map_item_t *p_node; monitored_node_t *p_mon_node; + ib_net16_t orig_lid; OSM_LOG_ENTER(pm-log); @@ -225,9 +227,11 @@ static void perfmgr_mad_send_err_callback(void *bind_context, p_mon_node-num_ports); goto Exit; } - /* Clear redirection info */ - p_mon_node-redir_port[port].redir_lid = 0; - p_mon_node-redir_port[port].redir_qp = 0; + /* Clear redirection info for this port except orig_lid */ + orig_lid = p_mon_node-port[port].orig_lid; + memset(p_mon_node-port[port], 0, sizeof(monitored_port_t)); + p_mon_node-port[port].orig_lid = orig_lid; + p_mon_node-port[port].valid = TRUE; cl_plock_release(pm-osm-lock); } @@ -256,7 +260,7 @@ ib_api_status_t osm_perfmgr_bind(osm_perfmgr_t * pm, ib_net64_t port_guid) goto Exit; } - bind_info.port_guid = port_guid; + bind_info.port_guid = pm-port_guid = port_guid; bind_info.mad_class = IB_MCLASS_PERF; bind_info.class_version = 1; bind_info.is_responder = FALSE; @@ -309,24 +313,14 @@ static ib_net32_t get_qp(monitored_node_t * mon_node, uint8_t port) ib_net32_t qp = IB_QP1; if (mon_node mon_node-num_ports port mon_node-num_ports -
[PATCHv4 2/2][RESEND] opensm/osm_console.c: Add dump and clear redir perfmgr command support
Follows previous patch that adds better redirection support into PerfMgr Signed-off-by: Hal Rosenstock hal.rosenst...@gmail.com --- Changes since v3: Fixed some formatting problems (spaces instead of tabs) Changes since v2: Rebased Changes since v1: Changes based on changes to PerfMgr redir support in v3 patch diff --git a/opensm/opensm/osm_console.c b/opensm/opensm/osm_console.c index 49b0ae0..764235a 100644 --- a/opensm/opensm/osm_console.c +++ b/opensm/opensm/osm_console.c @@ -1,6 +1,6 @@ /* * Copyright (c) 2005-2009 Voltaire, Inc. All rights reserved. - * Copyright (c) 2009 HNR Consulting. All rights reserved. + * Copyright (c) 2009,2010 HNR Consulting. All rights reserved. * Copyright (c) 2010 Mellanox Technologies LTD. All rights reserved. * * This software is available to you under a choice of one of two @@ -232,7 +232,7 @@ static void help_update_desc(FILE *out, int detail) static void help_perfmgr(FILE * out, int detail) { fprintf(out, - perfmgr [enable|disable|clear_counters|dump_counters|print_counters|sweep_time[seconds]]\n); + perfmgr [enable|disable|clear_counters|dump_counters|print_counters|dump_redir|clear_redir|sweep_time[seconds]]\n); if (detail) { fprintf(out, perfmgr -- print the performance manager state\n); @@ -246,6 +246,10 @@ static void help_perfmgr(FILE * out, int detail) [dump_counters [mach]] -- dump the counters (optionally in [mach]ine readable format)\n); fprintf(out, [print_counters nodename|nodeguid] -- print the counters for the specified node\n); + fprintf(out, + [dump_redir [nodename|nodeguid]] -- dump the redirection table\n); + fprintf(out, + [clear_redir [nodename|nodeguid]] -- clear the redirection table\n); } } #endif /* ENABLE_OSM_PERF_MGR */ @@ -1180,6 +1184,152 @@ static void update_desc_parse(char **p_last, osm_opensm_t * p_osm, FILE * out) } #ifdef ENABLE_OSM_PERF_MGR +static monitored_node_t *find_node_by_name(osm_opensm_t * p_osm, + char *nodename) +{ + cl_map_item_t *item; + monitored_node_t *node; + + item = 
cl_qmap_head(p_osm-perfmgr.monitored_map); + while (item != cl_qmap_end(p_osm-perfmgr.monitored_map)) { + node = (monitored_node_t *)item; + if (strcmp(node-name, nodename) == 0) + return node; + item = cl_qmap_next(item); + } + + return NULL; +} + +static monitored_node_t *find_node_by_guid(osm_opensm_t * p_osm, + uint64_t guid) +{ + cl_map_item_t *node; + + node = cl_qmap_get(p_osm-perfmgr.monitored_map, guid); + if (node != cl_qmap_end(p_osm-perfmgr.monitored_map)) + return (monitored_node_t *)node; + + return NULL; +} + +static void dump_redir_entry(monitored_node_t *p_mon_node, FILE * out) +{ + int port, redir; + + /* only display monitored nodes with redirection info */ + redir = 0; + for (port = (p_mon_node-esp0) ? 0 : 1; +port p_mon_node-num_ports; port++) { + if (p_mon_node-port[port].redirection) { + if (!redir) { + fprintf(out,Node GUID ESP0 Name\n); + fprintf(out,- \n); + fprintf(out,0x% PRIx64 %d %s\n, + p_mon_node-guid, p_mon_node-esp0, + p_mon_node-name); + fprintf(out, \n Port Valid LIDs PKey QPPKey Index\n); + fprintf(out, - -- --\n); + redir = 1; + } + fprintf(out,%d%d %u-%u 0x%x 0x%x %d\n, + port, p_mon_node-port[port].valid, + cl_ntoh16(p_mon_node-port[port].orig_lid), + cl_ntoh16(p_mon_node-port[port].lid), + cl_ntoh16(p_mon_node-port[port].pkey), + cl_ntoh32(p_mon_node-port[port].qp), + p_mon_node-port[port].pkey_ix); + } + } + if (redir) + fprintf(out, \n); +} + +static void dump_redir(osm_opensm_t * p_osm, char *nodename, FILE * out) +{ + monitored_node_t *p_mon_node; + uint64_t guid; + + if (!p_osm-subn.opt.perfmgr_redir) + fprintf(out, Perfmgr redirection not enabled\n); + + fprintf(out, \nRedirection Table\n); + fprintf(out, -\n); +
[PATCHv5 1/2][RESEND] opensm/PerfMgr: Better redirection support
Handle PKey and QPN redirection information GID redirection handling remains Signed-off-by: Hal Rosenstock hal.rosenst...@gmail.com --- Changes since v4: Fixed some trailing whitespace problems Changes since v3: Rebased Changes since v2: Use OpenSM DB rather than vendor layer for local port number and PKeys Change most log levels from ERROR to VERBOSE Redirection info validity now determined by single flag validate_redir_pkey returns pkey index or -1 rather than boolean Removed redir_ prefixes Changes since v1: Added include of osm_helper.h to osm_perfmgr.c diff --git a/opensm/include/opensm/osm_perfmgr.h b/opensm/include/opensm/osm_perfmgr.h index c26c141..34925e8 100644 --- a/opensm/include/opensm/osm_perfmgr.h +++ b/opensm/include/opensm/osm_perfmgr.h @@ -1,7 +1,7 @@ /* * Copyright (c) 2007 The Regents of the University of California. * Copyright (c) 2007-2009 Voltaire, Inc. All rights reserved. - * Copyright (c) 2009 HNR Consulting. All rights reserved. + * Copyright (c) 2009,2010 HNR Consulting. All rights reserved. * * This software is available to you under a choice of one of two * licenses. 
You may choose to be licensed under the terms of the GNU @@ -90,11 +90,17 @@ typedef enum { PERFMGR_SWEEP_SUSPENDED } osm_perfmgr_sweep_state_t; -/* Redirection information */ -typedef struct redir { - ib_net16_t redir_lid; - ib_net32_t redir_qp; -} redir_t; +typedef struct monitored_port { + uint16_t pkey_ix; + ib_net16_t orig_lid; + boolean_t redirection; + boolean_t valid; + /* Redirection fields from ClassPortInfo */ + ib_gid_t gid; + ib_net16_t lid; + ib_net16_t pkey; + ib_net32_t qp; +} monitored_port_t; /* Node to store information about nodes being monitored */ typedef struct monitored_node { @@ -104,7 +110,7 @@ typedef struct monitored_node { boolean_t esp0; char *name; uint32_t num_ports; - redir_t redir_port[1]; /* redirection on a per port basis */ + monitored_port_t port[1]; } monitored_node_t; struct osm_opensm; @@ -134,6 +140,8 @@ typedef struct osm_perfmgr { uint32_t max_outstanding_queries; cl_qmap_t monitored_map;/* map the nodes being tracked */ monitored_node_t *remove_list; + ib_net64_t port_guid; + int16_t local_port; } osm_perfmgr_t; /* * FIELDS diff --git a/opensm/opensm/osm_perfmgr.c b/opensm/opensm/osm_perfmgr.c index 398b463..fccf9d6 100644 --- a/opensm/opensm/osm_perfmgr.c +++ b/opensm/opensm/osm_perfmgr.c @@ -1,7 +1,7 @@ /* * Copyright (c) 2007 The Regents of the University of California. * Copyright (c) 2007-2009 Voltaire, Inc. All rights reserved. - * Copyright (c) 2009 HNR Consulting. All rights reserved. + * Copyright (c) 2009,2010 HNR Consulting. All rights reserved. * * This software is available to you under a choice of one of two * licenses. 
You may choose to be licensed under the terms of the GNU @@ -64,6 +64,7 @@ #include opensm/osm_log.h #include opensm/osm_node.h #include opensm/osm_opensm.h +#include opensm/osm_helper.h #define PERFMGR_INITIAL_TID_VALUE 0xcafe @@ -194,6 +195,7 @@ static void perfmgr_mad_send_err_callback(void *bind_context, uint8_t port = context-perfmgr_context.port; cl_map_item_t *p_node; monitored_node_t *p_mon_node; + ib_net16_t orig_lid; OSM_LOG_ENTER(pm-log); @@ -225,9 +227,11 @@ static void perfmgr_mad_send_err_callback(void *bind_context, p_mon_node-num_ports); goto Exit; } - /* Clear redirection info */ - p_mon_node-redir_port[port].redir_lid = 0; - p_mon_node-redir_port[port].redir_qp = 0; + /* Clear redirection info for this port except orig_lid */ + orig_lid = p_mon_node-port[port].orig_lid; + memset(p_mon_node-port[port], 0, sizeof(monitored_port_t)); + p_mon_node-port[port].orig_lid = orig_lid; + p_mon_node-port[port].valid = TRUE; cl_plock_release(pm-osm-lock); } @@ -256,7 +260,7 @@ ib_api_status_t osm_perfmgr_bind(osm_perfmgr_t * pm, ib_net64_t port_guid) goto Exit; } - bind_info.port_guid = port_guid; + bind_info.port_guid = pm-port_guid = port_guid; bind_info.mad_class = IB_MCLASS_PERF; bind_info.class_version = 1; bind_info.is_responder = FALSE; @@ -309,24 +313,14 @@ static ib_net32_t get_qp(monitored_node_t * mon_node, uint8_t port) ib_net32_t qp = IB_QP1; if (mon_node mon_node-num_ports port mon_node-num_ports - mon_node-redir_port[port].redir_lid - mon_node-redir_port[port].redir_qp) - qp = mon_node-redir_port[port].redir_qp; + mon_node-port[port].redirection mon_node-port[port].qp) + qp = mon_node-port[port].qp; return qp;
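
The QP selection in the get_qp() hunk can be sketched stand-alone as follows. This is a simplified illustration, not the OpenSM code: the `sketch_` types are stand-ins for monitored_node_t and monitored_port_t, the port array has a fixed size, and QP numbers are kept in host byte order, whereas the patch uses ib_net32_t values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_QP1 1u       /* well-known GSI QP; host byte order here */
#define SKETCH_MAX_PORTS 4  /* arbitrary bound for the sketch */

typedef struct sketch_port {
	int redirection; /* non-zero once redirection info was recorded */
	uint32_t qp;     /* redirected QP number, 0 if none */
} sketch_port_t;

typedef struct sketch_node {
	uint32_t num_ports;
	sketch_port_t port[SKETCH_MAX_PORTS];
} sketch_node_t;

/* Use the redirected QP only when the port is in range, redirection is
 * flagged and a non-zero QP was supplied; otherwise fall back to QP1. */
uint32_t sketch_get_qp(const sketch_node_t *node, uint8_t port)
{
	assert(node == NULL || node->num_ports <= SKETCH_MAX_PORTS);
	if (node && node->num_ports && port < node->num_ports &&
	    node->port[port].redirection && node->port[port].qp)
		return node->port[port].qp;
	return SKETCH_QP1;
}
```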
[PATCHv5 2/2][RESEND] opensm/osm_console.c: Add dump and clear redir perfmgr command support
Follows previous patch that adds better redirection support for PerfMgr Signed-off-by: Hal Rosenstock hal.rosenst...@gmail.com --- Changes since v4: Fixed rejection of Copyright hunk Changes since v3: Fixed some formatting problems (spaces instead of tabs) Changes since v2: Rebased Changes since v1: Changes based on changes to PerfMgr redir support in v3 patch diff --git a/opensm/opensm/osm_console.c b/opensm/opensm/osm_console.c index bc7bea3..27f1e1e 100644 --- a/opensm/opensm/osm_console.c +++ b/opensm/opensm/osm_console.c @@ -1,6 +1,6 @@ /* * Copyright (c) 2005-2009 Voltaire, Inc. All rights reserved. - * Copyright (c) 2009 HNR Consulting. All rights reserved. + * Copyright (c) 2009,2010 HNR Consulting. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU @@ -231,7 +231,7 @@ static void help_update_desc(FILE *out, int detail) static void help_perfmgr(FILE * out, int detail) { fprintf(out, - perfmgr [enable|disable|clear_counters|dump_counters|print_counters|sweep_time[seconds]]\n); + perfmgr [enable|disable|clear_counters|dump_counters|print_counters|dump_redir|clear_redir|sweep_time[seconds]]\n); if (detail) { fprintf(out, perfmgr -- print the performance manager state\n); @@ -245,6 +245,10 @@ static void help_perfmgr(FILE * out, int detail) [dump_counters [mach]] -- dump the counters (optionally in [mach]ine readable format)\n); fprintf(out, [print_counters nodename|nodeguid] -- print the counters for the specified node\n); + fprintf(out, + [dump_redir [nodename|nodeguid]] -- dump the redirection table\n); + fprintf(out, + [clear_redir [nodename|nodeguid]] -- clear the redirection table\n); } } #endif /* ENABLE_OSM_PERF_MGR */ @@ -1179,6 +1183,152 @@ static void update_desc_parse(char **p_last, osm_opensm_t * p_osm, FILE * out) } #ifdef ENABLE_OSM_PERF_MGR +static monitored_node_t *find_node_by_name(osm_opensm_t * p_osm, + char *nodename) +{ + cl_map_item_t 
*item; + monitored_node_t *node; + + item = cl_qmap_head(p_osm-perfmgr.monitored_map); + while (item != cl_qmap_end(p_osm-perfmgr.monitored_map)) { + node = (monitored_node_t *)item; + if (strcmp(node-name, nodename) == 0) + return node; + item = cl_qmap_next(item); + } + + return NULL; +} + +static monitored_node_t *find_node_by_guid(osm_opensm_t * p_osm, + uint64_t guid) +{ + cl_map_item_t *node; + + node = cl_qmap_get(p_osm-perfmgr.monitored_map, guid); + if (node != cl_qmap_end(p_osm-perfmgr.monitored_map)) + return (monitored_node_t *)node; + + return NULL; +} + +static void dump_redir_entry(monitored_node_t *p_mon_node, FILE * out) +{ + int port, redir; + + /* only display monitored nodes with redirection info */ + redir = 0; + for (port = (p_mon_node-esp0) ? 0 : 1; +port p_mon_node-num_ports; port++) { + if (p_mon_node-port[port].redirection) { + if (!redir) { + fprintf(out,Node GUID ESP0 Name\n); + fprintf(out,- \n); + fprintf(out,0x% PRIx64 %d %s\n, + p_mon_node-guid, p_mon_node-esp0, + p_mon_node-name); + fprintf(out, \n Port Valid LIDs PKey QPPKey Index\n); + fprintf(out, - -- --\n); + redir = 1; + } + fprintf(out,%d%d %u-%u 0x%x 0x%x %d\n, + port, p_mon_node-port[port].valid, + cl_ntoh16(p_mon_node-port[port].orig_lid), + cl_ntoh16(p_mon_node-port[port].lid), + cl_ntoh16(p_mon_node-port[port].pkey), + cl_ntoh32(p_mon_node-port[port].qp), + p_mon_node-port[port].pkey_ix); + } + } + if (redir) + fprintf(out, \n); +} + +static void dump_redir(osm_opensm_t * p_osm, char *nodename, FILE * out) +{ + monitored_node_t *p_mon_node; + uint64_t guid; + + if (!p_osm-subn.opt.perfmgr_redir) + fprintf(out, Perfmgr redirection not enabled\n); + + fprintf(out, \nRedirection Table\n); + fprintf(out,
[PATCH 5/5 v2] dapl-2.0 - scm, ucm: add pkey, pkey_index, sl override for QP's
Or/Sean,

Good points. Here is v2 without index capabilities.

Hefty, Sean wrote:
> The index isn't guaranteed to be the same across all nodes. If a consumer is going to manually control this, they should really be forced to use the actual pkey.

> yes, I saw this confusion in action, for most users pkey index doesn't mean anything, it may also change across time, which can break scripts/setting to run specific jobs using specific partitions.
>
> Or.

On a per open basis, add environment variables DAPL_IB_SL and DAPL_IB_PKEY and use on connection setup (QP modify) to override default values of 0 for SL and PKEY index. If pkey is provided then find the pkey index with ibv_query_pkey for dev_attr.max_pkeys. Will be used for RC and UD type QP's.

Signed-off-by: Arlin Davis arlin.r.da...@intel.com
---
 dapl/openib_cma/dapl_ib_util.h |4 +++-
 dapl/openib_common/qp.c|8
 dapl/openib_common/util.c | 39 +--
 dapl/openib_scm/dapl_ib_util.h |4
 dapl/openib_ucm/dapl_ib_util.h |3 +++
 5 files changed, 51 insertions(+), 7 deletions(-)

diff --git a/dapl/openib_cma/dapl_ib_util.h b/dapl/openib_cma/dapl_ib_util.h
index a710195..471bd7f 100755
--- a/dapl/openib_cma/dapl_ib_util.h
+++ b/dapl/openib_cma/dapl_ib_util.h
@@ -121,7 +121,9 @@ typedef struct _ib_hca_transport
 	uint8_t			tclass;
 	uint8_t			mtu;
 	DAT_NAMED_ATTR		named_attr;
-
+	uint8_t			sl;
+	uint16_t		pkey;
+	int			pkey_idx;
 } ib_hca_transport_t;
 
 /* prototypes */
diff --git a/dapl/openib_common/qp.c b/dapl/openib_common/qp.c
index 473604b..179eef0 100644
--- a/dapl/openib_common/qp.c
+++ b/dapl/openib_common/qp.c
@@ -422,7 +422,7 @@ dapls_modify_qp_state(IN ib_qp_handle_t qp_handle,
 			qp_attr.ah_attr.grh.traffic_class =
 				ia_ptr->hca_ptr->ib_trans.tclass;
 		}
-		qp_attr.ah_attr.sl = 0;
+		qp_attr.ah_attr.sl = ia_ptr->hca_ptr->ib_trans.sl;
 		qp_attr.ah_attr.src_path_bits = 0;
 		qp_attr.ah_attr.port_num = ia_ptr->hca_ptr->port_num;
 
@@ -489,7 +489,7 @@ dapls_modify_qp_state(IN ib_qp_handle_t qp_handle,
 			qp_attr.qkey = DAT_UD_QKEY;
 		}
 
-		qp_attr.pkey_index = 0;
+		qp_attr.pkey_index = ia_ptr->hca_ptr->ib_trans.pkey_idx;
 		qp_attr.port_num = ia_ptr->hca_ptr->port_num;
 
 		dapl_dbg_log(DAPL_DBG_TYPE_EP,
@@ -519,7 +519,7 @@ dapls_modify_qp_ud(IN DAPL_HCA *hca, IN ib_qp_handle_t qp)
 	/* modify QP, setup and prepost buffers */
 	dapl_os_memzero((void *)&qp_attr, sizeof(qp_attr));
 	qp_attr.qp_state = IBV_QPS_INIT;
-        qp_attr.pkey_index = 0;
+        qp_attr.pkey_index = hca->ib_trans.pkey_idx;
         qp_attr.port_num = hca->port_num;
         qp_attr.qkey = DAT_UD_QKEY;
 	if (ibv_modify_qp(qp, &qp_attr,
@@ -582,7 +582,7 @@ dapls_create_ah(IN DAPL_HCA *hca,
 		qp_attr.ah_attr.grh.hop_limit = hca->ib_trans.hop_limit;
 		qp_attr.ah_attr.grh.traffic_class = hca->ib_trans.tclass;
 	}
-	qp_attr.ah_attr.sl = 0;
+	qp_attr.ah_attr.sl = hca->ib_trans.sl;
 	qp_attr.ah_attr.src_path_bits = 0;
 	qp_attr.ah_attr.port_num = hca->port_num;
 
diff --git a/dapl/openib_common/util.c b/dapl/openib_common/util.c
index b83f609..a69261f 100644
--- a/dapl/openib_common/util.c
+++ b/dapl/openib_common/util.c
@@ -321,6 +321,38 @@ DAT_RETURN dapls_ib_query_hca(IN DAPL_HCA * hca_ptr,
 	hca_ptr->ib_trans.named_attr.value =
 		dapl_ib_mtu_str(hca_ptr->ib_trans.mtu);
 
+	if (hca_ptr->ib_hca_handle->device->transport_type != IBV_TRANSPORT_IB)
+		goto skip_ib;
+
+	/* set SL, PKEY values, defaults = 0 */
+	hca_ptr->ib_trans.pkey_idx = 0;
+	hca_ptr->ib_trans.pkey = dapl_os_get_env_val("DAPL_IB_PKEY", 0);
+	hca_ptr->ib_trans.sl = dapl_os_get_env_val("DAPL_IB_SL", 0);
+
+	/* index provided, get pkey; pkey provided, get index */
+	if (hca_ptr->ib_trans.pkey) {
+		int i; uint16_t pkey = 0;
+		for (i=0; i < dev_attr.max_pkeys; i++) {
+			if (ibv_query_pkey(hca_ptr->ib_hca_handle,
+					   hca_ptr->port_num,
+					   i, &pkey)) {
+				i = dev_attr.max_pkeys;
+				break;
+			}
+			if (pkey == hca_ptr->ib_trans.pkey) {
+				hca_ptr->ib_trans.pkey_idx = i;
+				break;
+			}
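
The pkey-to-index search in the util.c hunk boils down to a linear scan of the port's P_Key table. A hedged, hardware-free sketch of that scan is below; the table is passed in as a plain array instead of being read with ibv_query_pkey(), and `find_pkey_index` is a hypothetical name. Returning -1 when nothing matches is a choice made for the sketch; the patch itself starts from a default index of 0 and its not-found handling is cut off above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the index of `pkey` in `table` (max_pkeys entries),
 * or -1 when the table does not contain it. */
int find_pkey_index(const uint16_t *table, int max_pkeys, uint16_t pkey)
{
	assert(table != NULL && max_pkeys >= 0);
	for (int i = 0; i < max_pkeys; i++) {
		if (table[i] == pkey)
			return i;
	}
	return -1;
}
```

In a real verbs program each table entry would come from ibv_query_pkey(), which as far as I know reports the value in network byte order, so the caller would have to keep both sides of the comparison in the same byte order.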
[ANNOUNCE] dapl-2.0.29 release
New release for uDAPL v2.0 (2.0.29) available at: http://www.openfabrics.org/downloads/dapl

Latest Packages (see ChangeLog for details):

md5sum: 76f18eedf0758ca81aaa3923a65808a5 dapl-2.0.29.tar.gz

For 1.2 and 2.0 support on same system, including development, install RPM packages as follows:

dapl-2.0.29-1
dapl-utils-2.0.29-1
dapl-devel-2.0.29-1
dapl-debuginfo-2.0.29-1
compat-dapl-1.2.17-1
compat-dapl-devel-1.2.17-1

Summary of changes:

Release 2.0.29 fixes (OFED 1.5.2):

scm, ucm: add pkey and sl override for QP's, DAPL_IB_SL, DAPL_IB_PKEY.
cma: remove dependency on rdma_cma_abi.h
configure: need a false conditional for verbs attr.link_layer member check
ucm: incorrectly freeing port on passive side after reject
ucm: modify debug CM output for consistency, all ports, qpn in hex

Vlad, please pull into OFED 1.5.2 RC2 (should be final uDAPL package for OFED 1.5.2):

Thanks,

-arlin
Re: [PATCH/RFC] mlx4_core: module param to limit msix vec allocation
On Thu, Jun 17, 2010 at 05:53:58PM +0300, Yevgeny Petrilin wrote:
> I think that this patch would do the job,

(Is that an ack?)

> Anyway we are thinking of ways to change our interrupt allocation scheme.

Would be interested to know what you've got in mind.

--
Arthur
[PATCH 0/7] various fixes for QIB driver
The following patches are for various bug fixes. I'm not sure what counts as a regression for code that is newly introduced. I'm hoping that all except #2 can be made for 2.6.35 whereas #2 can wait for 2.6.36 since it is actually a feature.

IB/qib: avoid a rare 7322 chip problem by not marking VL15 bufs as WC
IB/qib: allow PSM to select from multiple port assignment algorithms
IB/qib: mask hardware error during link reset
IB/qib: clear eager buffer memory for each new process
IB/qib: clear 6120 hardware error register
IB/qib: update 7322 serdes tables
IB/qib: completion queue callback needs to be single threaded
[PATCH 1/7] IB/qib: avoid a rare 7322 chip problem by not marking VL15 bufs as WC
From: Dave Olson <dave.ol...@qlogic.com>

Don't set write combining via PAT on the VL15 buffers to avoid a rare problem with unaligned writes from interrupt-flushed store buffers.

Signed-off-by: Dave Olson <dave.ol...@qlogic.com>
---
 drivers/infiniband/hw/qib/qib.h         |    1 +
 drivers/infiniband/hw/qib/qib_diag.c    |   19 +++
 drivers/infiniband/hw/qib/qib_iba7322.c |   18 +-
 drivers/infiniband/hw/qib/qib_init.c    |    6 ++
 drivers/infiniband/hw/qib/qib_pcie.c    |    2 ++
 drivers/infiniband/hw/qib/qib_tx.c      |    6 +-
 6 files changed, 46 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/hw/qib/qib.h b/drivers/infiniband/hw/qib/qib.h
index 32d9208..3593983 100644
--- a/drivers/infiniband/hw/qib/qib.h
+++ b/drivers/infiniband/hw/qib/qib.h
@@ -686,6 +686,7 @@ struct qib_devdata {
 	void __iomem *piobase;
 	/* mem-mapped pointer to base of user chip regs (if using WC PAT) */
 	u64 __iomem *userbase;
+	void __iomem *piovl15base; /* base of VL15 buffers, if not WC */
 	/*
 	 * points to area where PIOavail registers will be DMA'ed.
 	 * Has to be on a page of it's own, because the page will be
diff --git a/drivers/infiniband/hw/qib/qib_diag.c b/drivers/infiniband/hw/qib/qib_diag.c
index ca98dd5..05dcf0d 100644
--- a/drivers/infiniband/hw/qib/qib_diag.c
+++ b/drivers/infiniband/hw/qib/qib_diag.c
@@ -233,6 +233,7 @@ static u32 __iomem *qib_remap_ioaddr32(struct qib_devdata *dd, u32 offset,
 	u32 __iomem *krb32 = (u32 __iomem *)dd->kregbase;
 	u32 __iomem *map = NULL;
 	u32 cnt = 0;
+	u32 tot4k, offs4k;
 
 	/* First, simplest case, offset is within the first map. */
 	kreglen = (dd->kregend - dd->kregbase) * sizeof(u64);
@@ -250,7 +251,8 @@ static u32 __iomem *qib_remap_ioaddr32(struct qib_devdata *dd, u32 offset,
 	if (dd->userbase) {
 		/* If user regs mapped, they are after send, so set limit. */
 		u32 ulim = (dd->cfgctxts * dd->ureg_align) + dd->uregbase;
-		snd_lim = dd->uregbase;
+		if (!dd->piovl15base)
+			snd_lim = dd->uregbase;
 		krb32 = (u32 __iomem *)dd->userbase;
 		if (offset >= dd->uregbase && offset < ulim) {
 			map = krb32 + (offset - dd->uregbase) / sizeof(u32);
@@ -277,14 +279,14 @@ static u32 __iomem *qib_remap_ioaddr32(struct qib_devdata *dd, u32 offset,
 	/* If 4k buffers exist, account for them by bumping
 	 * appropriate limit.
 	 */
+	tot4k = dd->piobcnt4k * dd->align4k;
+	offs4k = dd->piobufbase >> 32;
 	if (dd->piobcnt4k) {
-		u32 tot4k = dd->piobcnt4k * dd->align4k;
-		u32 offs4k = dd->piobufbase >> 32;
 		if (snd_bottom > offs4k)
 			snd_bottom = offs4k;
 		else {
 			/* 4k above 2k. Bump snd_lim, if needed*/
-			if (!dd->userbase)
+			if (!dd->userbase || dd->piovl15base)
 				snd_lim = offs4k + tot4k;
 		}
 	}
@@ -298,6 +300,15 @@ static u32 __iomem *qib_remap_ioaddr32(struct qib_devdata *dd, u32 offset,
 		cnt = snd_lim - offset;
 	}
 
+	if (!map && offs4k && dd->piovl15base) {
+		snd_lim = offs4k + tot4k + 2 * dd->align4k;
+		if (offset >= (offs4k + tot4k) && offset < snd_lim) {
+			map = (u32 __iomem *)dd->piovl15base +
+				((offset - (offs4k + tot4k)) / sizeof(u32));
+			cnt = snd_lim - offset;
+		}
+	}
+
 mapped:
 	if (cntp)
 		*cntp = cnt;
diff --git a/drivers/infiniband/hw/qib/qib_iba7322.c b/drivers/infiniband/hw/qib/qib_iba7322.c
index 503992d..3e9828b 100644
--- a/drivers/infiniband/hw/qib/qib_iba7322.c
+++ b/drivers/infiniband/hw/qib/qib_iba7322.c
@@ -6119,9 +6119,25 @@ static int qib_init_7322_variables(struct qib_devdata *dd)
 	qib_set_ctxtcnt(dd);
 
 	if (qib_wc_pat) {
-		ret = init_chip_wc_pat(dd, NUM_VL15_BUFS * dd->align4k);
+		resource_size_t vl15off;
+		/*
+		 * We do not set WC on the VL15 buffers to avoid
+		 * a rare problem with unaligned writes from
+		 * interrupt-flushed store buffers, so we need
+		 * to map those separately here. We can't solve
+		 * this for the rarely used mtrr case.
+		 */
+		ret = init_chip_wc_pat(dd, 0);
 		if (ret)
 			goto bail;
+
+		/* vl15 buffers start just after the 4k buffers */
+		vl15off = dd->physaddr + (dd->piobufbase >> 32) +
+			dd->piobcnt4k * dd->align4k;
+		dd->piovl15base = ioremap_nocache(vl15off,
+						  NUM_VL15_BUFS * dd->align4k);
+		if
[PATCH 2/7] IB/qib: allow PSM to select from multiple port assignment algorithms
From: Dave Olson <dave.ol...@qlogic.com>

We formerly allowed only full specification, or using all contexts within an HCA before moving to the next HCA. We now allow an additional method, of round-robining through HCAs, and make that the default.

Signed-off-by: Dave Olson <dave.ol...@qlogic.com>
---
 drivers/infiniband/hw/qib/qib_common.h   |   16 ++
 drivers/infiniband/hw/qib/qib_file_ops.c |  203 +++---
 2 files changed, 118 insertions(+), 101 deletions(-)

diff --git a/drivers/infiniband/hw/qib/qib_common.h b/drivers/infiniband/hw/qib/qib_common.h
index b3955ed..145da40 100644
--- a/drivers/infiniband/hw/qib/qib_common.h
+++ b/drivers/infiniband/hw/qib/qib_common.h
@@ -279,7 +279,7 @@ struct qib_base_info {
  * may not be implemented; the user code must deal with this if it
  * cares, or it must abort after initialization reports the difference.
  */
-#define QIB_USER_SWMINOR 10
+#define QIB_USER_SWMINOR 11
 
 #define QIB_USER_SWVERSION ((QIB_USER_SWMAJOR << 16) | QIB_USER_SWMINOR)
 
@@ -302,6 +302,18 @@ struct qib_base_info {
 #define QIB_KERN_SWVERSION ((QIB_KERN_TYPE << 31) | QIB_USER_SWVERSION)
 
 /*
+ * If the unit is specified via open, HCA choice is fixed. If port is
+ * specified, it's also fixed. Otherwise we try to spread contexts
+ * across ports and HCAs, using different algorithims. WITHIN is
+ * the old default, prior to this mechanism.
+ */
+#define QIB_PORT_ALG_ACROSS 0 /* round robin contexts across HCAs, then
+			       * ports; this is the default */
+#define QIB_PORT_ALG_WITHIN 1 /* use all contexts on an HCA (round robin
+			       * active ports within), then next HCA */
+#define QIB_PORT_ALG_COUNT 2 /* number of algorithm choices */
+
+/*
  * This structure is passed to qib_userinit() to tell the driver where
  * user code buffers are, sizes, etc. The offsets and sizes of the
  * fields must remain unchanged, for binary compatibility. It can
@@ -319,7 +331,7 @@ struct qib_user_info {
 	/* size of struct base_info to write to */
 	__u32 spu_base_info_size;
 
-	__u32 _spu_unused3;
+	__u32 spu_port_alg; /* which QIB_PORT_ALG_*; unused user minor < 11 */
 
 	/*
 	 * If two or more processes wish to share a context, each process
diff --git a/drivers/infiniband/hw/qib/qib_file_ops.c b/drivers/infiniband/hw/qib/qib_file_ops.c
index a142a9e..6b11645 100644
--- a/drivers/infiniband/hw/qib/qib_file_ops.c
+++ b/drivers/infiniband/hw/qib/qib_file_ops.c
@@ -1294,128 +1294,130 @@ bail:
 	return ret;
 }
 
-static inline int usable(struct qib_pportdata *ppd, int active_only)
+static inline int usable(struct qib_pportdata *ppd)
 {
 	struct qib_devdata *dd = ppd->dd;
-	u32 linkok = active_only ? QIBL_LINKACTIVE :
-		(QIBL_LINKINIT | QIBL_LINKARMED | QIBL_LINKACTIVE);
 
 	return dd && (dd->flags & QIB_PRESENT) && dd->kregbase && ppd->lid &&
-		(ppd->lflags & linkok);
+		(ppd->lflags & QIBL_LINKACTIVE);
 }
 
-static int find_free_ctxt(int unit, struct file *fp,
-			  const struct qib_user_info *uinfo)
+/*
+ * Select a context on the given device, either using a requested port
+ * or the port based on the context number.
+ */
+static int choose_port_ctxt(struct file *fp, struct qib_devdata *dd, u32 port,
+			    const struct qib_user_info *uinfo)
 {
-	struct qib_devdata *dd = qib_lookup(unit);
 	struct qib_pportdata *ppd = NULL;
-	int ret;
-	u32 ctxt;
+	int ret, ctxt;
 
-	if (!dd || (uinfo->spu_port && uinfo->spu_port > dd->num_pports)) {
-		ret = -ENODEV;
-		goto bail;
-	}
-
-	/*
-	 * If users requests specific port, only try that one port, else
-	 * select best port below, based on context.
-	 */
-	if (uinfo->spu_port) {
-		ppd = dd->pport + uinfo->spu_port - 1;
-		if (!usable(ppd, 0)) {
+	if (port) {
+		if (!usable(dd->pport + port - 1)) {
 			ret = -ENETDOWN;
-			goto bail;
-		}
+			goto done;
+		} else
+			ppd = dd->pport + port - 1;
 	}
-
-	for (ctxt = dd->first_user_ctxt; ctxt < dd->cfgctxts; ctxt++) {
-		if (dd->rcd[ctxt])
-			continue;
-		/*
-		 * The setting and clearing of user context rcd[x] protected
-		 * by the qib_mutex
-		 */
-		if (!ppd) {
-			/* choose port based on ctxt, if up, else 1st up */
-			ppd = dd->pport + (ctxt % dd->num_pports);
-			if (!usable(ppd, 0)) {
-				int i;
-				for (i = 0; i < dd->num_pports; i++) {
-					ppd = dd->pport + i;
-					if (usable(ppd, 0))
-
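For readers who want to see the difference between the two policies named in the qib_common.h hunk, here is a rough userspace toy model. Only the two QIB_PORT_ALG_* values come from the patch; struct toy_hca, toy_pick_unit() and the "fewest contexts in use" rule for ACROSS are assumptions made for illustration and are not the driver's actual selection code.

```c
#include <assert.h>
#include <stddef.h>

/* Values copied from the qib_common.h hunk in this patch. */
#define QIB_PORT_ALG_ACROSS 0
#define QIB_PORT_ALG_WITHIN 1

/* Hypothetical toy model: an HCA is just a pool of user contexts. */
struct toy_hca {
	int nctxts;	/* user contexts available on this HCA */
	int used;	/* contexts already handed out */
};

/*
 * Pick the HCA (unit index) for the next open, or -1 if every
 * context on every HCA is taken.
 *  WITHIN: fill unit 0 completely, then unit 1, and so on.
 *  ACROSS: take the unit with the fewest contexts in use, which
 *          spreads successive opens over the HCAs in turn
 *          (an assumed stand-in for the driver's round robin).
 */
static int toy_pick_unit(struct toy_hca *hcas, size_t nhcas, int alg)
{
	int best = -1;
	size_t u;

	assert(hcas != NULL);
	for (u = 0; u < nhcas; u++) {
		if (hcas[u].used >= hcas[u].nctxts)
			continue;	/* this HCA is full */
		if (alg == QIB_PORT_ALG_WITHIN) {
			best = (int)u;	/* first HCA with room wins */
			break;
		}
		if (best < 0 || hcas[u].used < hcas[best].used)
			best = (int)u;
	}
	if (best >= 0)
		hcas[best].used++;
	return best;
}
```

With two HCAs of two contexts each, four opens land on units 0, 1, 0, 1 under ACROSS and on units 0, 0, 1, 1 under WITHIN; a fifth open fails in both cases.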
[PATCH 3/7] IB/qib: mask hardware error during link reset
The HCA checks for certain hardware errors which can be falsely triggered when the IB link is reset. The fix is to mask them rather than report them.

Signed-off-by: Ralph Campbell <ralph.campb...@qlogic.com>
---
 drivers/infiniband/hw/qib/qib_7322_regs.h |   48 +++--
 drivers/infiniband/hw/qib/qib_iba7322.c   |    9 -
 2 files changed, 31 insertions(+), 26 deletions(-)

diff --git a/drivers/infiniband/hw/qib/qib_7322_regs.h b/drivers/infiniband/hw/qib/qib_7322_regs.h
index a97440b..32dc81f 100644
--- a/drivers/infiniband/hw/qib/qib_7322_regs.h
+++ b/drivers/infiniband/hw/qib/qib_7322_regs.h
@@ -742,15 +742,15 @@
 #define QIB_7322_HwErrMask_IBCBusFromSPCParityErrMask_1_LSB 0xF
 #define QIB_7322_HwErrMask_IBCBusFromSPCParityErrMask_1_MSB 0xF
 #define QIB_7322_HwErrMask_IBCBusFromSPCParityErrMask_1_RMASK 0x1
-#define QIB_7322_HwErrMask_statusValidNoEopMask_1_LSB 0xE
-#define QIB_7322_HwErrMask_statusValidNoEopMask_1_MSB 0xE
-#define QIB_7322_HwErrMask_statusValidNoEopMask_1_RMASK 0x1
+#define QIB_7322_HwErrMask_IBCBusToSPCParityErrMask_1_LSB 0xE
+#define QIB_7322_HwErrMask_IBCBusToSPCParityErrMask_1_MSB 0xE
+#define QIB_7322_HwErrMask_IBCBusToSPCParityErrMask_1_RMASK 0x1
 #define QIB_7322_HwErrMask_IBCBusFromSPCParityErrMask_0_LSB 0xD
 #define QIB_7322_HwErrMask_IBCBusFromSPCParityErrMask_0_MSB 0xD
 #define QIB_7322_HwErrMask_IBCBusFromSPCParityErrMask_0_RMASK 0x1
-#define QIB_7322_HwErrMask_statusValidNoEopMask_0_LSB 0xC
-#define QIB_7322_HwErrMask_statusValidNoEopMask_0_MSB 0xC
-#define QIB_7322_HwErrMask_statusValidNoEopMask_0_RMASK 0x1
+#define QIB_7322_HwErrMask_statusValidNoEopMask_LSB 0xC
+#define QIB_7322_HwErrMask_statusValidNoEopMask_MSB 0xC
+#define QIB_7322_HwErrMask_statusValidNoEopMask_RMASK 0x1
 #define QIB_7322_HwErrMask_LATriggeredMask_LSB 0xB
 #define QIB_7322_HwErrMask_LATriggeredMask_MSB 0xB
 #define QIB_7322_HwErrMask_LATriggeredMask_RMASK 0x1
@@ -796,15 +796,15 @@
 #define QIB_7322_HwErrStatus_IBCBusFromSPCParityErr_1_LSB 0xF
 #define QIB_7322_HwErrStatus_IBCBusFromSPCParityErr_1_MSB 0xF
 #define QIB_7322_HwErrStatus_IBCBusFromSPCParityErr_1_RMASK 0x1
-#define QIB_7322_HwErrStatus_statusValidNoEop_1_LSB 0xE
-#define QIB_7322_HwErrStatus_statusValidNoEop_1_MSB 0xE
-#define QIB_7322_HwErrStatus_statusValidNoEop_1_RMASK 0x1
+#define QIB_7322_HwErrStatus_IBCBusToSPCParityErr_1_LSB 0xE
+#define QIB_7322_HwErrStatus_IBCBusToSPCParityErr_1_MSB 0xE
+#define QIB_7322_HwErrStatus_IBCBusToSPCParityErr_1_RMASK 0x1
 #define QIB_7322_HwErrStatus_IBCBusFromSPCParityErr_0_LSB 0xD
 #define QIB_7322_HwErrStatus_IBCBusFromSPCParityErr_0_MSB 0xD
 #define QIB_7322_HwErrStatus_IBCBusFromSPCParityErr_0_RMASK 0x1
-#define QIB_7322_HwErrStatus_statusValidNoEop_0_LSB 0xC
-#define QIB_7322_HwErrStatus_statusValidNoEop_0_MSB 0xC
-#define QIB_7322_HwErrStatus_statusValidNoEop_0_RMASK 0x1
+#define QIB_7322_HwErrStatus_statusValidNoEop_LSB 0xC
+#define QIB_7322_HwErrStatus_statusValidNoEop_MSB 0xC
+#define QIB_7322_HwErrStatus_statusValidNoEop_RMASK 0x1
 #define QIB_7322_HwErrStatus_LATriggered_LSB 0xB
 #define QIB_7322_HwErrStatus_LATriggered_MSB 0xB
 #define QIB_7322_HwErrStatus_LATriggered_RMASK 0x1
@@ -850,15 +850,15 @@
 #define QIB_7322_HwErrClear_IBCBusFromSPCParityErrClear_1_LSB 0xF
 #define QIB_7322_HwErrClear_IBCBusFromSPCParityErrClear_1_MSB 0xF
 #define QIB_7322_HwErrClear_IBCBusFromSPCParityErrClear_1_RMASK 0x1
-#define QIB_7322_HwErrClear_IBCBusToSPCparityErrClear_1_LSB 0xE
-#define QIB_7322_HwErrClear_IBCBusToSPCparityErrClear_1_MSB 0xE
-#define QIB_7322_HwErrClear_IBCBusToSPCparityErrClear_1_RMASK 0x1
+#define QIB_7322_HwErrClear_IBCBusToSPCParityErrClear_1_LSB 0xE
+#define QIB_7322_HwErrClear_IBCBusToSPCParityErrClear_1_MSB 0xE
+#define QIB_7322_HwErrClear_IBCBusToSPCParityErrClear_1_RMASK 0x1
 #define QIB_7322_HwErrClear_IBCBusFromSPCParityErrClear_0_LSB 0xD
 #define QIB_7322_HwErrClear_IBCBusFromSPCParityErrClear_0_MSB 0xD
 #define QIB_7322_HwErrClear_IBCBusFromSPCParityErrClear_0_RMASK 0x1
-#define QIB_7322_HwErrClear_IBCBusToSPCparityErrClear_0_LSB 0xC
-#define QIB_7322_HwErrClear_IBCBusToSPCparityErrClear_0_MSB 0xC
-#define QIB_7322_HwErrClear_IBCBusToSPCparityErrClear_0_RMASK 0x1
+#define QIB_7322_HwErrClear_statusValidNoEopClear_LSB 0xC
+#define QIB_7322_HwErrClear_statusValidNoEopClear_MSB 0xC
+#define QIB_7322_HwErrClear_statusValidNoEopClear_RMASK 0x1
 #define QIB_7322_HwErrClear_LATriggeredClear_LSB 0xB
 #define QIB_7322_HwErrClear_LATriggeredClear_MSB 0xB
 #define QIB_7322_HwErrClear_LATriggeredClear_RMASK 0x1
@@ -880,15 +880,15 @@
 #define QIB_7322_HwDiagCtrl_ForceIBCBusFromSPCParityErr_1_LSB 0xF
 #define QIB_7322_HwDiagCtrl_ForceIBCBusFromSPCParityErr_1_MSB 0xF
 #define QIB_7322_HwDiagCtrl_ForceIBCBusFromSPCParityErr_1_RMASK 0x1
-#define QIB_7322_HwDiagCtrl_ForcestatusValidNoEop_1_LSB 0xE
-#define QIB_7322_HwDiagCtrl_ForcestatusValidNoEop_1_MSB 0xE
-#define QIB_7322_HwDiagCtrl_ForcestatusValidNoEop_1_RMASK 0x1
+#define
[PATCH 4/7] IB/qib: clear eager buffer memory for each new process
The eager buffers are not being cleared before being mmapped into a new user address space. This is a potential security risk and should be fixed. Note that the eager header queue is already being cleared OK.

Signed-off-by: Ralph Campbell <ralph.campb...@qlogic.com>
---
 drivers/infiniband/hw/qib/qib_init.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/qib/qib_init.c b/drivers/infiniband/hw/qib/qib_init.c
index 2589599..1d4db4b 100644
--- a/drivers/infiniband/hw/qib/qib_init.c
+++ b/drivers/infiniband/hw/qib/qib_init.c
@@ -1472,6 +1472,9 @@ int qib_setup_eagerbufs(struct qib_ctxtdata *rcd)
 		dma_addr_t pa = rcd->rcvegrbuf_phys[chunk];
 		unsigned i;
 
+		/* clear for security and sanity on each use */
+		memset(rcd->rcvegrbuf[chunk], 0, size);
+
 		for (i = 0; e < egrcnt && i < egrperchunk; e++, i++) {
 			dd->f_put_tid(dd, e + egroff +
 					  (u64 __iomem *)
[PATCH 5/7] IB/qib: clear 6120 hardware error register
The hardware error register needs to be cleared or another interrupt will be generated, thus causing an infinite loop. This is a regression introduced when removing debug output.

Signed-off-by: Ralph Campbell <ralph.campb...@qlogic.com>
---
 drivers/infiniband/hw/qib/qib_iba6120.c |    3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/qib/qib_iba6120.c b/drivers/infiniband/hw/qib/qib_iba6120.c
index 1eadadc..a5e29db 100644
--- a/drivers/infiniband/hw/qib/qib_iba6120.c
+++ b/drivers/infiniband/hw/qib/qib_iba6120.c
@@ -1355,8 +1355,7 @@ static int qib_6120_bringup_serdes(struct qib_pportdata *ppd)
 	hwstat = qib_read_kreg64(dd, kr_hwerrstatus);
 	if (hwstat) {
 		/* should just have PLL, clear all set, in an case */
-		if (hwstat & ~QLOGIC_IB_HWE_SERDESPLLFAILED)
-			qib_write_kreg(dd, kr_hwerrclear, hwstat);
+		qib_write_kreg(dd, kr_hwerrclear, hwstat);
 		qib_write_kreg(dd, kr_errclear, ERR_MASK(HardwareErr));
 	}
 
[PATCH 6/7] IB/qib: update 7322 serdes tables
Signed-off-by: Ralph Campbell <ralph.campb...@qlogic.com>
---
 drivers/infiniband/hw/qib/qib_iba7322.c |   16 
 1 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/hw/qib/qib_iba7322.c b/drivers/infiniband/hw/qib/qib_iba7322.c
index 8ee0ac6..5eedf83 100644
--- a/drivers/infiniband/hw/qib/qib_iba7322.c
+++ b/drivers/infiniband/hw/qib/qib_iba7322.c
@@ -543,7 +543,7 @@ struct vendor_txdds_ent {
 static void write_tx_serdes_param(struct qib_pportdata *, struct txdds_ent *);
 
 #define TXDDS_TABLE_SZ 16 /* number of entries per speed in onchip table */
-#define TXDDS_EXTRA_SZ 11 /* number of extra tx settings entries */
+#define TXDDS_EXTRA_SZ 13 /* number of extra tx settings entries */
 #define SERDES_CHANS 4 /* yes, it's obvious, but one less magic number */
 
 #define H1_FORCE_VAL 8
@@ -5629,6 +5629,8 @@ static void set_no_qsfp_atten(struct qib_devdata *dd, int change)
 			if (ppd->port != port || !ppd->link_speed_supported)
 				continue;
 			ppd->cpspec->no_eep = val;
+			if (seth1)
+				ppd->cpspec->h1_val = h1;
 			/* now change the IBC and serdes, overriding generic */
 			init_txdds_table(ppd, 1);
 			any++;
@@ -6069,9 +6071,9 @@ static int qib_init_7322_variables(struct qib_devdata *dd)
 		 * the cable info setup here. Can be overridden
 		 * in adapter-specific routines.
 		 */
-		if (!(ppd->dd->flags & QIB_HAS_QSFP)) {
-			if (!IS_QMH(ppd->dd) && !IS_QME(ppd->dd))
-				qib_devinfo(ppd->dd->pcidev, "IB%u:%u: "
+		if (!(dd->flags & QIB_HAS_QSFP)) {
+			if (!IS_QMH(dd) && !IS_QME(dd))
+				qib_devinfo(dd->pcidev, "IB%u:%u: "
 					    "Unknown mezzanine card type\n",
 					    dd->unit, ppd->port);
 			cp->h1_val = IS_QMH(dd) ? H1_FORCE_QMH : H1_FORCE_QME;
@@ -6953,6 +6955,8 @@ static const struct txdds_ent txdds_extra_sdr[TXDDS_EXTRA_SZ] = {
 	{ 0, 0, 0, 11 }, /* QME7342 backplane settings */
 	{ 0, 0, 0, 11 }, /* QME7342 backplane settings */
 	{ 0, 0, 0, 11 }, /* QME7342 backplane settings */
+	{ 0, 0, 0, 3 }, /* QMH7342 backplane settings */
+	{ 0, 0, 0, 4 }, /* QMH7342 backplane settings */
 };
 
 static const struct txdds_ent txdds_extra_ddr[TXDDS_EXTRA_SZ] = {
@@ -6968,6 +6972,8 @@ static const struct txdds_ent txdds_extra_ddr[TXDDS_EXTRA_SZ] = {
 	{ 0, 0, 0, 13 }, /* QME7342 backplane settings */
 	{ 0, 0, 0, 13 }, /* QME7342 backplane settings */
 	{ 0, 0, 0, 13 }, /* QME7342 backplane settings */
+	{ 0, 0, 0, 9 }, /* QMH7342 backplane settings */
+	{ 0, 0, 0, 10 }, /* QMH7342 backplane settings */
 };
 
 static const struct txdds_ent txdds_extra_qdr[TXDDS_EXTRA_SZ] = {
@@ -6983,6 +6989,8 @@ static const struct txdds_ent txdds_extra_qdr[TXDDS_EXTRA_SZ] = {
 	{ 0, 1, 12, 6 }, /* QME7342 backplane setting */
 	{ 0, 1, 12, 7 }, /* QME7342 backplane setting */
 	{ 0, 1, 12, 8 }, /* QME7342 backplane setting */
+	{ 0, 1, 0, 10 }, /* QMH7342 backplane settings */
+	{ 0, 1, 0, 12 }, /* QMH7342 backplane settings */
 };
 
 static const struct txdds_ent *get_atten_table(const struct txdds_ent *txdds,
[PATCH 7/7] IB/qib: completion queue callback needs to be single threaded
Workqueues aren't exactly equivalent to tasklets since the callback function may be called from multiple CPUs before the callback returns. This causes completion notification callbacks to have MT bugs since they weren't expecting this behavior. The fix is to use a single threaded work queue.

Signed-off-by: Ralph Campbell <ralph.campb...@qlogic.com>
---
 drivers/infiniband/hw/qib/qib_init.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/qib/qib_init.c b/drivers/infiniband/hw/qib/qib_init.c
index 1d4db4b..7831ff8 100644
--- a/drivers/infiniband/hw/qib/qib_init.c
+++ b/drivers/infiniband/hw/qib/qib_init.c
@@ -1059,7 +1059,7 @@ static int __init qlogic_ib_init(void)
 		goto bail_dev;
 	}
 
-	qib_cq_wq = create_workqueue("qib_cq");
+	qib_cq_wq = create_singlethread_workqueue("qib_cq");
 	if (!qib_cq_wq) {
 		ret = -ENOMEM;
 		goto bail_wq;
RE: [PATCH] librdmacm/mcraw: Add a new test application for user-space IBV_QPT_RAW_ETH QP type
FYI - I don't track the ewg mail list, so I missed any discussion there. See my detailed comments inline.

I track the librdmacm against Roland's libibverbs releases and upstream kernel features, rather than against OFED features. As a result, I do think there's some functionality missing in both the upstream libibverbs and kernel that need to be resolved. I tried to identify these below.

> The patch adds a new test application describing a usage of the IBV_QPT_RAW_ETH

Where is IBV_QPT_RAW_ETH defined? Roland's version of verbs.h only defines IBV_QPT_RC/UC/UD. I think we need to get this defined there first, then figure out if anything new is needed for the librdmacm.

Also, if I understand this correctly, a RAW_ETH QP exposes the contents of the Ethernet frame and header to the user. This should be restricted to privileged applications, and I'm guessing uverbs should verify that before allocating a RAW_ETH QP for the user.

> +struct cmatest {
> +	struct rdma_event_channel *channel;
> +	struct cmatest_node *nodes;
> +	int conn_index;
> +	int connects_left;
> +
> +	struct sockaddr_in6 dst_in;
> +	struct sockaddr *dst_addr;
> +	struct sockaddr_in6 src_in;
> +	struct sockaddr *src_addr;
> +	int fd[1024];

See comments below regarding the fd array usage.

> +};
> +
> +static struct cmatest test;
> +static int connections = 1;
> +static int message_size = 100;
> +static int message_count = 10;
> +static int is_sender;
> +static int unmapped_addr;
> +static char *dst_addr;
> +static char *src_addr;
> +static enum rdma_port_space port_space = RDMA_PS_UDP;
> +
> +int vlan_flag;
> +int vlan_ident;
> +
> +static int cq_len = 512;
> +static int qp_len = 256;
> +
> +uint16_t IP_CRC(void *buf, int hdr_len)
> +{
> +	unsigned long sum = 0;
> +	const uint16_t *ip1;
> +
> +	ip1 = (uint16_t *)buf;
> +	while (hdr_len > 1) {
> +		sum += *ip1++;
> +		if (sum & 0x8000)
> +			sum = (sum & 0xFFFF) + (sum >> 16);
> +		hdr_len -= 2;
> +	}
> +
> +	while (sum >> 16)
> +		sum = (sum & 0xFFFF) + (sum >> 16);
> +
> +	return ~sum;
> +}
> +
> +uint16_t udp_checksum(struct udphdr *udp_head,
> +		      int header_size,
> +		      int pay_load_size,
> +		      uint32_t src_addr,
> +		      uint32_t dest_addr,
> +		      unsigned char *payload)
> +{
> +	uint16_t *buf = (void *)udp_head;
> +	uint16_t *ip_src = (void *)&src_addr;
> +	uint16_t *ip_dst = (void *)&dest_addr;
> +	uint32_t sum;
> +	size_t len = header_size;
> +
> +	sum = 0;
> +	while (len > 1) {
> +		sum += *buf++;
> +		if (sum & 0x8000)
> +			sum = (sum & 0xFFFF) + (sum >> 16);
> +		len -= 2;
> +	}
> +
> +	buf = (void *)payload;
> +	len = pay_load_size;
> +	while (len > 1) {
> +		sum += *buf++;
> +		if (sum & 0x8000)
> +			sum = (sum & 0xFFFF) + (sum >> 16);
> +		len -= 2;
> +	}
> +
> +	if (len & 1)
> +		sum += *((uint8_t *)buf);
> +	sum += *(ip_src++);
> +	sum += *ip_src;
> +
> +	sum += *(ip_dst++);
> +	sum += *ip_dst;
> +
> +	sum += htons(IPPROTO_UDP);
> +	len = (header_size + pay_load_size);
> +	sum += htons(len);
> +
> +	while (sum >> 16)
> +		sum = (sum & 0xFFFF) + (sum >> 16);
> +
> +	return (uint16_t)(~sum);
> +}

The above two calls look like candidates for common code - not part of librdmacm, but common to some other library that provides functionality similar to: ip_crc(), udp_checksum(), format_eth_hdr(), format_ip_hdr(), format_udp_hdr(), etc. Even separating that functionality out into another source file would make it easier for another application to pick up and reuse.
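As a rough idea of what such a shared helper could look like, here is a minimal sketch of the Internet checksum (RFC 1071 style) split into an accumulate step and a fold step. The names csum_add() and csum_fold() are hypothetical and not part of librdmacm or the patch. The sketch reads the buffer byte by byte, so it does not depend on host endianness or on the buffer being 16-bit aligned, unlike the uint16_t casts in the quoted code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Add a byte range to a running one's-complement sum, treating the
 * data as big-endian 16-bit words. An odd trailing byte is padded
 * with a zero on the right. The 32-bit accumulator is fine for
 * packet-sized inputs (well under 128 KB per running sum).
 */
static uint32_t csum_add(uint32_t sum, const void *data, size_t len)
{
	const uint8_t *p = data;

	assert(data != NULL || len == 0);
	while (len > 1) {
		sum += ((uint32_t)p[0] << 8) | p[1];
		p += 2;
		len -= 2;
	}
	if (len)
		sum += (uint32_t)p[0] << 8;
	return sum;
}

/*
 * Fold the carries back into 16 bits and complement. The result is
 * the checksum value in host order; store it big-endian in the header.
 */
static uint16_t csum_fold(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xFFFF) + (sum >> 16);
	return (uint16_t)~sum;
}
```

An IP header checksum is then csum_fold(csum_add(0, hdr, hdr_len)) with the checksum field zeroed. A UDP checksum can chain csum_add() over the pseudo-header fields, the UDP header and the payload before a single csum_fold(). Running the same computation over a header that already contains a valid checksum yields 0.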
> +static int create_message(struct cmatest_node *node)
> +{
> +	if (!message_size)
> +		message_count = 0;
> +
> +	if (!message_count)
> +		return 0;
> +
> +	node->mem = NULL;
> +	posix_memalign((void *)&node->mem, 4096,
> +		       (message_size + HEADER_LEN ) * sizeof(char));
> +	if (node->mem == NULL) {
> +		printf("failed message allocation\n");
> +		return -1;
> +	}
> +
> +	node->mr = ibv_reg_mr(node->pd, node->mem,
> +			      message_size + HEADER_LEN,
> +			      IBV_ACCESS_LOCAL_WRITE);
> +	if (!node->mr) {
> +		printf("failed to reg MR\n");
> +		goto err;
> +	}
> +	return 0;
> +err:
> +	free(node->mem);
> +	return -1;
> +}
> +
> +static int verify_test_params(struct cmatest_node *node)
> +{
> +	struct ibv_port_attr port_attr;
> +	int ret;
> +
> +	ret = ibv_query_port(node->cma_id->verbs, node->cma_id->port_num,
> +			     &port_attr);
> +	if (ret)
> +		return ret;
> +
> +	printf("\nibv_query_port %x\n", node->cma_id->port_num);
> +	if (message_count && message_size > (1