Re: P

2010-06-17 Thread Ding Dinghua
Sorry for the late reply.

2010/6/12 Dotan Barak dota...@gmail.com:
 On 12/06/2010 03:22, Ding Dinghua wrote:

 2010/6/11 Dotan Barakdota...@gmail.com:


 Hi.

 On 11/06/2010 10:51, Ding Dinghua wrote:


 Hi all:
          I'm using RDMA to do fs-metadata mirror between nodes. I
 encountered a strange problem when the program was running:
 Complete queue handler reported that the  RDMA-Write operation failed,
  the status of  corresponding struct ib_wc is IB_WC_RETRY_EXC_ERR.
 The problem is encountered randomly. I don't know the meaning of this
 error code as well as what to do next. Would anyone give me some tips?
 thanks a lot.



 Do you sync between the sides before closing the QPs?


 Can you say it more detail? thanks.


 If you try to send a message from local QP to a remote QP before the remote
 QP is in RTR state (or after it was closed/transferred to the ERROR state),
 you may get RETRY EXCEEDED, because there isn't any QP in the remote side
 that can accept your message (and send a response).

 How do you connect the QPs? (And how do you close the connection between
 them)

I call rdma_create_id to create an IB ID, resolve the remote address
and the route, then set up the QP and call rdma_connect to establish
the connection. Until an ack or an error reply arrives, the thread
waits on a wait queue. The listening IB ID on the remote node catches
the connect request, sets up its QP, allocates and maps pages to
construct the RDMA-WRITE space, and calls rdma_accept to reply to
the request.

Some other information that may be useful:
1. All the RETRY EXCEEDED problems happened when there were two
connections using RDMA-WRITE to transfer data, and the later
connection was much more likely to hit the problem.
2. All the RETRY EXCEEDED problems happened when the RDMA-WRITE
space was 256MB per connection (that is, 512MB of memory for two
connections). With a 64MB RDMA-WRITE space, the problem never
happened in our tests. The remote node has 2GB of memory in total.

Thanks a lot.
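
Dotan's point about the remote QP state can be pictured with a small model. This is only a sketch under stated assumptions: the `demo_` names are hypothetical and do not come from the verbs headers, and the retry loop stands in for the HCA's transport retries rather than reproducing them. It shows why a request posted while the remote QP is not yet in RTR (or is already in the ERROR state) ends in "retry exceeded".

```c
#include <stdbool.h>

/* Hypothetical, simplified mirror of the QP states discussed in the
 * thread; the real definitions live in the verbs headers. */
enum demo_qp_state {
	DEMO_QPS_RESET,
	DEMO_QPS_INIT,
	DEMO_QPS_RTR,	/* Ready To Receive */
	DEMO_QPS_RTS,	/* Ready To Send */
	DEMO_QPS_ERR
};

/* True when a remote QP in this state can accept an incoming
 * RDMA Write and send a response. */
static bool demo_remote_can_ack(enum demo_qp_state remote)
{
	return remote == DEMO_QPS_RTR || remote == DEMO_QPS_RTS;
}

/* Simplified requester: resend until a response arrives or the retry
 * budget is used up. Returns 0 on success, -1 for "retry exceeded". */
static int demo_post_rdma_write(enum demo_qp_state remote, int retry_cnt)
{
	int attempt;

	for (attempt = 0; attempt <= retry_cnt; attempt++) {
		if (demo_remote_can_ack(remote))
			return 0;
	}
	return -1;
}
```

In this model the only remedy is ordering: the requester should not post until it knows the remote side has finished its QP transitions, and should stop posting before the remote side tears its QP down.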


 Dotan




-- 
Ding Dinghua
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: P

2010-06-17 Thread Ding Dinghua
2010/6/12 Dotan Barak dota...@gmail.com:
 On 12/06/2010 03:22, Ding Dinghua wrote:

 2010/6/11 Dotan Barakdota...@gmail.com:


 Hi.

 On 11/06/2010 10:51, Ding Dinghua wrote:


 Hi all:
          I'm using RDMA to do fs-metadata mirror between nodes. I
 encountered a strange problem when the program was running:
 Complete queue handler reported that the  RDMA-Write operation failed,
  the status of  corresponding struct ib_wc is IB_WC_RETRY_EXC_ERR.
 The problem is encountered randomly. I don't know the meaning of this
 error code as well as what to do next. Would anyone give me some tips?
 thanks a lot.



 Do you sync between the sides before closing the QPs?


 Can you say it more detail? thanks.


 If you try to send a message from local QP to a remote QP before the remote
 QP is in RTR state (or after it was closed/transferred to the ERROR state),
 you may get RETRY EXCEEDED, because there isn't any QP in the remote side
 that can accept your message (and send a response).

 How do you connect the QPs? (And how do you close the connection between
 them)

Sorry, I forgot the close issue.

1. The local node calls ib_poll_cq to process the remaining completion
queue entries.
2. The local node calls rdma_disconnect to tear down the connection;
until the remote side acks, the thread waits on a wait queue.
3. After catching this request, the remote node also calls
ib_poll_cq to process the remaining completion queue entries,
then does some resource-release work and sends a reply.
4. The local node is woken up and does its resource-release work.


 Dotan




-- 
Ding Dinghua


Re: [PATCH 5/5] dapl-2.0 - scm, ucm: add pkey, pkey_index, sl override for QP's

2010-06-17 Thread Or Gerlitz

Hefty, Sean wrote:

The index isn't guaranteed to be the same across all nodes.  If a consumer is 
going to manually control this, they should really be forced to use the actual 
pkey.
Yes, I have seen this confusion in action. For most users the pkey index
doesn't mean anything, and it may also change over time, which can break
scripts/settings that run specific jobs on specific partitions.


Or.


Re: Handling busy responses from the SA

2010-06-17 Thread Hal Rosenstock
Mike,

On Wed, Jun 16, 2010 at 3:57 PM, Mike Heinz michael.he...@qlogic.com wrote:
 Hal,

 But if the original trap had retries > 0, wouldn't resending the trap be what 
 the issuer intended?

I suppose so, as there's nothing in the IBA spec that precludes using busy
on TrapRepresses, although I'd be hard pressed to rationalize using
that, particularly for SMP traps.

-- Hal

 I guess I'm confused why treating BUSY as similar to simply never getting a 
 response at all is a bad thing. In my mind, receiving a BUSY response is like 
 getting a busy signal when you call someone on the phone - a sign you need to 
 wait a bit then try again. Similarly, if I call someone and never get an 
 answer my strategy is going to be to wait, then try again.

 -----Original Message-----
 From: Hal Rosenstock [mailto:hal.rosenst...@gmail.com]
 Sent: Tuesday, June 08, 2010 8:16 PM
 To: Mike Heinz
 Cc: Hefty, Sean; linux-rdma@vger.kernel.org
 Subject: Re: Handling busy responses from the SA

 Mike,

 I'm referring to the receipt of the TrapRepress with busy status.
 Wouldn't your patch cause the original Trap to be resent when retries
 > 0? TrapRepress is essentially a response to Trap and classified as
 such by ib_response_mad. Your proposed patch treats a busy as a
 timeout and can cause retry of the original sent Trap.

 -- Hal



Re: [PATCH][TRIVIAL] infiniband-diags/perfquery.8: Add some missing counters to description

2010-06-17 Thread Sasha Khapyorsky
On 07:05 Wed 16 Jun , Hal Rosenstock wrote:
 
 Also, updated email address
 
 Signed-off-by: Hal Rosenstock hal.rosenst...@gmail.com

Applied. Thanks.

Sasha


RE: Handling busy responses from the SA

2010-06-17 Thread Mike Heinz
To be honest, we haven't been able to think of a case where a sender would use 
retries on a trap or a busy on a repress either, but I don't think it would 
hurt to omit represses from the busy handling either.

Would that be acceptable to everyone? To alter the patch to allow BUSY trap 
repress MADs to pass through?
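
The proposed change can be sketched as a small predicate. This is an illustration under assumptions, not the patch itself: the `DEMO_` names are local to the sketch, the values are assumed to match the MAD method and status encodings (TrapRepress = 0x07, GetResp = 0x81, busy = status bit 0), and the status is assumed to be in host byte order already.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local names; values assumed to match the MAD header encodings. */
#define DEMO_METHOD_GET_RESP		0x81
#define DEMO_METHOD_TRAP_REPRESS	0x07
#define DEMO_MAD_STATUS_BUSY		0x0001

/* Decide whether a received response should be handled like a timeout
 * (so that the original request is retried). A BUSY TrapRepress is
 * passed through to the consumer instead of triggering a Trap resend. */
static bool demo_busy_is_timeout(uint8_t method, uint16_t status)
{
	if (!(status & DEMO_MAD_STATUS_BUSY))
		return false;	/* not busy: normal completion */
	if (method == DEMO_METHOD_TRAP_REPRESS)
		return false;	/* busy repress: pass through */
	return true;		/* any other busy response: retry */
}
```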

-----Original Message-----
From: linux-rdma-ow...@vger.kernel.org 
[mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Hal Rosenstock
Sent: Thursday, June 17, 2010 9:30 AM
To: Mike Heinz
Cc: Hefty, Sean; linux-rdma@vger.kernel.org; Todd Rimmer
Subject: Re: Handling busy responses from the SA

Mike,

On Wed, Jun 16, 2010 at 3:57 PM, Mike Heinz michael.he...@qlogic.com wrote:
 Hal,

 But if the original trap had retries > 0, wouldn't resending the trap be what 
 the issuer intended?

I suppose so, as there's nothing in the IBA spec that precludes using busy
on TrapRepresses, although I'd be hard pressed to rationalize using
that, particularly for SMP traps.

-- Hal

 I guess I'm confused why treating BUSY as similar to simply never getting a 
 response at all is a bad thing. In my mind, receiving a BUSY response is like 
 getting a busy signal when you call someone on the phone - a sign you need to 
 wait a bit then try again. Similarly, if I call someone and never get an 
 answer my strategy is going to be to wait, then try again.

 -----Original Message-----
 From: Hal Rosenstock [mailto:hal.rosenst...@gmail.com]
 Sent: Tuesday, June 08, 2010 8:16 PM
 To: Mike Heinz
 Cc: Hefty, Sean; linux-rdma@vger.kernel.org
 Subject: Re: Handling busy responses from the SA

 Mike,

 I'm referring to the receipt of the TrapRepress with busy status.
 Wouldn't your patch cause the original Trap to be resent when retries
 > 0? TrapRepress is essentially a response to Trap and classified as
 such by ib_response_mad. Your proposed patch treats a busy as a
 timeout and can cause retry of the original sent Trap.

 -- Hal



Re: [PATCH] opensm/osmtest.c: fix bug in getting attr offset

2010-06-17 Thread Sasha Khapyorsky
On 11:33 Tue 15 Jun , Yevgeny Kliteynik wrote:
 Fix bug that was introduced by commit 4fd4ca306f93376963725285f3bf7c87a76055b0
 
 Signed-off-by: Yevgeny Kliteynik klit...@dev.mellanox.co.il

Applied. Thanks.

Sasha


Re: [PATCH] [RESEND] opensm/osm_mcast_mgr.c: Only route MLIDs with more than 1 member

2010-06-17 Thread Sasha Khapyorsky
On 08:46 Mon 14 Jun , Hal Rosenstock wrote:
 
 rather than just more than 0 members. There is no need to route MLIDs with
 only 1 member either. MLIDs only need routing when 2 or more members. Single
 member case is handled locally.
 
 Signed-off-by: Hal Rosenstock hal.rosenst...@gmail.com

Applied. Thanks.

Sasha
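
The rule in the patch description reduces to a one-line test. A minimal sketch, with a hypothetical function name and a plain member count standing in for the real multicast group structure:

```c
#include <stdbool.h>

/* An MLID needs multicast routing only when the group has two or more
 * members; a group with zero or one member needs none. */
static bool demo_mlid_needs_routing(unsigned int num_members)
{
	return num_members > 1;
}
```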


Re: [PATCH v2] RDMA/CMA: fix iWARP adapter TCP port space usage

2010-06-17 Thread Bernard Metzler
I agree that the issue must get solved, and it's good that it has
been brought up again. I agree with Chien that the
solution should respect and interface to a single in-kernel instance
maintaining the host-global TCP port space. iWARP is just another
protocol on top of TCP, like iSCSI. There is no good reason to
invent another TCP port maintainer per TCP user type that tries to
synchronize with the kernel if the resource is host-global and
already maintained by the kernel.

Since we are developing, and have already open sourced, a full software
implementation of RDMA (SoftiWARP), our view on the optimal solution
is necessarily different. Like kernel iSCSI, we are running on top of regular
kernel sockets. With that, there is no point in having a connection manager
block just the port we want to use for communication: SoftiWARP
uses kernel sockets for data communication.

Therefore, I propose pushing responsibility back to the RDMA device driver,
where the actual connection setup is initiated (RNIC) or takes place
(software RDMA stack). I think it is not the job of the RDMA connection
manager to maintain TCP port space at all. It should be up to the driver
to take the appropriate steps. For lack of another interface, an
RNIC driver would create and bind a kernel socket to get hold of
the TCP port it intends to use for offloaded communication,
while a software RDMA stack simply goes ahead and communicates on
that socket. In the future it might be a good idea to approach the
netdev folks and kindly ask for a neat interface for just TCP port
maintenance, without the need to create and bind an otherwise
useless socket.

Of course, the RNIC driver must restrict its activities to local
IP addresses on its cards (or, for SoftiWARP, to IP addresses of interfaces
it is bound to). For example, a wildcard listen must get translated
into a listen restricted to the interface(s) under local control.

With that, the RDMA connection manager should simply be aware of
the possibility that a listen or connect call may fail for one
more reason. From using SoftiWARP in that environment, I know
that this is already the case (-EADDRINUSE is always an acceptable
return value).
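
The "bind a socket just to hold the port" idea can be illustrated from user space. This is a sketch under assumptions, not driver code: a kernel driver would use the in-kernel socket interface instead, and the `demo_` helpers are hypothetical names. It shows that a bound socket, even one that never listens, keeps the port out of the host-global TCP port space, and that a second claim fails with EADDRINUSE.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Bind a TCP socket to addr:port (both in host byte order) purely to
 * hold the port. Port 0 lets the kernel pick a free one.
 * Returns the socket fd, or a negative errno value on failure. */
static int demo_reserve_tcp_port(uint32_t addr, uint16_t port)
{
	struct sockaddr_in sin;
	int fd, err;

	fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0)
		return -errno;

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(addr);
	sin.sin_port = htons(port);

	if (bind(fd, (struct sockaddr *)&sin, sizeof(sin)) < 0) {
		err = errno;
		close(fd);
		return -err;
	}
	return fd;
}

/* Port actually held by fd, in host byte order; 0 on failure. */
static uint16_t demo_reserved_port(int fd)
{
	struct sockaddr_in sin;
	socklen_t len = sizeof(sin);

	if (getsockname(fd, (struct sockaddr *)&sin, &len) < 0)
		return 0;
	return ntohs(sin.sin_port);
}
```

Closing the fd releases the port again, which is the step an RNIC driver would take when the offloaded connection or listener goes away.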

Thanks,
Bernard.

linux-rdma-ow...@vger.kernel.org wrote on 06/12/2010 05:17:58 PM:

 Roland Dreier wrote:
Other protocols are also running over networking today, such as
iSCSI
and FCoE.  These happily co-exist with other L2-L4 protocols in the
stack. This iWARP patch allows iWARP to happily co-exist on a TCP
connection, and does *not* negatively affect the networking stack at
all.
 
  How do iSCSI offload HBAs coexist?  As I understand it, they typically
  just choose a separate IP address.
 
  In any case I'm not going to slip in a patch that another maintainer
has
  explicitly NAKed.  Maybe one way to force things forward would be to
  write up an exhaustive explanation of the underlying problem and the
  impact on end users, include this patch, explain that it touches only
  RDMA code, and point out that most end users are already using this
  patch since it's shipped in OFED.  Then send the whole thing to Linus
  and Andrew Morton, making sure to cc Dave Miller, netdev, and
  linux-rdma.
 
   - R.
 

 My 2007 thread does this basically, but posted it to lkml and David
 Miller.  But the rationale for why we need it as well as other possible
 solutions is included in that thread.  We could re-package it and send
 it on as you suggest.  It might carry more weight coming from the linux
 rdma maintainer though. :)







Re: [PATCH v3] opensm/osmeventplugin: added new events to monitor SM

2010-06-17 Thread Sasha Khapyorsky
On 14:41 Thu 10 Jun , Yevgeny Kliteynik wrote:
 Hi Sasha,
 
 Adding new events that allow event plug-in to see
 when SM finishes heavy sweep and routing configuration,
 when it updates dump files, when it is no longer master,
 and when SM port is down:
 
   OSM_EVENT_ID_HEAVY_SWEEP_DONE
   OSM_EVENT_ID_UCAST_ROUTING_DONE

What is wrong with using Subnet Up event for those purposes?

   OSM_EVENT_ID_ENTERING_STANDBY
   OSM_EVENT_ID_SM_PORT_DOWN

Instead I would suggest making a state change event.

   OSM_EVENT_ID_SA_DB_DUMPED

Again, Subnet Up indicates that all sweep stuff is done (including
dump files).

 
 The last event is reported when SA DB is actually dumped.
 
 Signed-off-by: Yevgeny Kliteynik klit...@dev.mellanox.co.il
 ---
 
 Changes from V2:
   - reduced number of events that are reported
   - rebased to latest master
 
 ---
  opensm/include/opensm/osm_event_plugin.h   |7 ++-
  opensm/opensm/osm_state_mgr.c  |   16 +++-
  opensm/osmeventplugin/src/osmeventplugin.c |   15 +++
  3 files changed, 36 insertions(+), 2 deletions(-)
 
 diff --git a/opensm/include/opensm/osm_event_plugin.h 
 b/opensm/include/opensm/osm_event_plugin.h
 index 33d1920..a565123 100644
 --- a/opensm/include/opensm/osm_event_plugin.h
 +++ b/opensm/include/opensm/osm_event_plugin.h
 @@ -72,7 +72,12 @@ typedef enum {
   OSM_EVENT_ID_PORT_SELECT,
   OSM_EVENT_ID_TRAP,
   OSM_EVENT_ID_SUBNET_UP,
 - OSM_EVENT_ID_MAX
 + OSM_EVENT_ID_MAX,

Likely you wanted to move OSM_EVENT_ID_MAX to be last in the list.

Sasha

 + OSM_EVENT_ID_HEAVY_SWEEP_DONE,
 + OSM_EVENT_ID_UCAST_ROUTING_DONE,
 + OSM_EVENT_ID_ENTERING_STANDBY,
 + OSM_EVENT_ID_SM_PORT_DOWN,
 + OSM_EVENT_ID_SA_DB_DUMPED
  } osm_epi_event_id_t;
 
  typedef struct osm_epi_port_id {
 diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c
 index 81c8f54..3231ae9 100644
 --- a/opensm/opensm/osm_state_mgr.c
 +++ b/opensm/opensm/osm_state_mgr.c
 @@ -1151,6 +1151,8 @@ static void do_sweep(osm_sm_t * sm)
   if (!sm->p_subn->subnet_initialization_error) {
   OSM_LOG_MSG_BOX(sm->p_log, OSM_LOG_VERBOSE,
   "REROUTE COMPLETE");
 + osm_opensm_report_event(sm->p_subn->p_osm,
 + OSM_EVENT_ID_UCAST_ROUTING_DONE, NULL);
   return;
   }
   }
 @@ -1185,6 +1187,8 @@ repeat_discovery:
 
   /* Move to DISCOVERING state */
   osm_sm_state_mgr_process(sm, OSM_SM_SIGNAL_DISCOVER);
 + osm_opensm_report_event(sm->p_subn->p_osm,
 + OSM_EVENT_ID_SM_PORT_DOWN, NULL);
   return;
   }
 
 @@ -1205,6 +1209,8 @@ repeat_discovery:
   "ENTERING STANDBY STATE");
   /* notify master SM about us */
   osm_send_trap144(sm, 0);
 + osm_opensm_report_event(sm->p_subn->p_osm,
 + OSM_EVENT_ID_ENTERING_STANDBY, NULL);
   return;
   }
 
 @@ -1212,6 +1218,9 @@ repeat_discovery:
   if (sm->p_subn->force_heavy_sweep)
   goto repeat_discovery;
 
 + osm_opensm_report_event(sm->p_subn->p_osm,
 + OSM_EVENT_ID_HEAVY_SWEEP_DONE, NULL);
 +
   OSM_LOG_MSG_BOX(sm->p_log, OSM_LOG_VERBOSE, "HEAVY SWEEP COMPLETE");
 
   /* If we are MASTER - get the highest remote_sm, and
 @@ -1314,6 +1323,8 @@ repeat_discovery:
 
   OSM_LOG_MSG_BOX(sm->p_log, OSM_LOG_VERBOSE,
   "SWITCHES CONFIGURED FOR UNICAST");
 + osm_opensm_report_event(sm->p_subn->p_osm,
 + OSM_EVENT_ID_UCAST_ROUTING_DONE, NULL);
 
   if (!sm->p_subn->opt.disable_multicast) {
   osm_mcast_mgr_process(sm);
 @@ -1375,7 +1386,10 @@ repeat_discovery:
 
   if (osm_log_is_active(sm->p_log, OSM_LOG_VERBOSE) ||
   sm->p_subn->opt.sa_db_dump)
 - osm_sa_db_file_dump(sm->p_subn->p_osm);
 + if (!osm_sa_db_file_dump(sm->p_subn->p_osm))
 + osm_opensm_report_event(sm->p_subn->p_osm,
 + OSM_EVENT_ID_SA_DB_DUMPED, NULL);
 +
   }
 
   /*
 diff --git a/opensm/osmeventplugin/src/osmeventplugin.c 
 b/opensm/osmeventplugin/src/osmeventplugin.c
 index b4d9ce9..af68a5c 100644
 --- a/opensm/osmeventplugin/src/osmeventplugin.c
 +++ b/opensm/osmeventplugin/src/osmeventplugin.c
 @@ -176,6 +176,21 @@ static void report(void *_log, osm_epi_event_id_t 
 event_id, void *event_data)
   case OSM_EVENT_ID_SUBNET_UP:
   fprintf(log->log_file, "Subnet up reported\n");
   break;
 + case OSM_EVENT_ID_HEAVY_SWEEP_DONE:
 + fprintf(log->log_file, "Heavy sweep completed\n");
 + break;
 + case OSM_EVENT_ID_UCAST_ROUTING_DONE:
 + fprintf(log->log_file, "Unicast routing completed\n");
 + break;
 +   

Re: [Patch v2] opensm/main.c: force stdout to be line-buffered

2010-06-17 Thread Sasha Khapyorsky
On 15:00 Thu 10 Jun , Yevgeny Kliteynik wrote:
 When stdout is assigned to a terminal, it is line-buffered.
 But when opensm's stdout is redirected to a file, stdout
 becomes block-buffered, which means that '\n' won't cause
 the buffer to be flushed.
 
 Forcing stdout to always be line-buffered and to have a
 more predictable behavior when used as opensm > some_file.
 
 Signed-off-by: Yevgeny Kliteynik klit...@dev.mellanox.co.il

Applied. Thanks.

Sasha
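
For readers who have not seen the idiom, forcing line buffering is a single standard C call. The sketch below is an illustration only: the patch itself is not quoted in this thread, so the wrapper name is hypothetical and the actual change may be written differently.

```c
#include <stdio.h>

/* Switch a stream to line buffering so that each '\n' flushes it,
 * whether the stream is a terminal or a redirected file.
 * Should be called before other I/O on the stream.
 * Returns 0 on success, -1 on failure. */
static int demo_force_line_buffered(FILE *stream)
{
	/* NULL buffer and size 0: let the C library allocate the buffer */
	return setvbuf(stream, NULL, _IOLBF, 0) == 0 ? 0 : -1;
}
```

Calling `demo_force_line_buffered(stdout)` early in `main()` gives the behavior described above for `opensm > some_file`.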


Re: [PATCH resend] opensm/osm_sa_path_record.c: adding wrapper for pr_rcv_get_path_parms()

2010-06-17 Thread Sasha Khapyorsky
On 16:49 Thu 10 Jun , Yevgeny Kliteynik wrote:
 Adding non-static wrapper function for pr_rcv_get_path_parms()
 function to enable calling path record calculation function from
 outside this file.
 
 Signed-off-by: Yevgeny Kliteynik klit...@dev.mellanox.co.il

Applied. Thanks.

Sasha


Re: [PATCHv2] opensm/osm_qos.c: Eliminate unneeded endport SL to VL setup

2010-06-17 Thread Sasha Khapyorsky
On 09:09 Mon 14 Jun , Hal Rosenstock wrote:
 
  This is intended. It's not needed since it's only doing this in the
  wildcarded case and the wildcarding includes port 0.
 
 Any reason not to move ahead on this ? Thanks.

No reason. Applied. Thanks.

Sasha


Re: [PATCH v3] opensm/osmeventplugin: added new events to monitor SM

2010-06-17 Thread Yevgeny Kliteynik

Hi Sasha,

On 17-Jun-10 5:18 PM, Sasha Khapyorsky wrote:

On 14:41 Thu 10 Jun , Yevgeny Kliteynik wrote:

Hi Sasha,

Adding new events that allow event plug-in to see
when SM finishes heavy sweep and routing configuration,
when it updates dump files, when it is no longer master,
and when SM port is down:

   OSM_EVENT_ID_HEAVY_SWEEP_DONE
   OSM_EVENT_ID_UCAST_ROUTING_DONE


What is wrong with using Subnet Up event for those purposes?


There is a big difference between the SWEEP_DONE and SUBNET_UP
events. The former happens before all the managers run (drop
manager, QoS, unicast and multicast routing, etc.), so there
is a long period between the two events.
Moreover, after SWEEP_DONE there is a lot of information
that is later cleared.

As for ROUTING_DONE, if OSM is doing a re-route only, then
routing might change without us getting a SUBNET_UP event.
Furthermore, once torus-2QoS routing is included in
the SM, a re-route will also cause the QoS configuration
to change.
 

   OSM_EVENT_ID_ENTERING_STANDBY
   OSM_EVENT_ID_SM_PORT_DOWN


Instead I would suggest making a state change event.


OK
 

   OSM_EVENT_ID_SA_DB_DUMPED


Again, Subnet Up indicates that all sweep stuff is done (including
dump files).


This is true. In fact, the way I posed it, there is no
point adding this event. However, this event should also
be sent when SA DB is dumped at the end of light sweep,
and then SUBNET_UP cannot replace it.



The last event is reported when SA DB is actually dumped.

Signed-off-by: Yevgeny Kliteynikklit...@dev.mellanox.co.il
---

Changes from V2:
   - reduced number of events that are reported
   - rebased to latest master

---
  opensm/include/opensm/osm_event_plugin.h   |7 ++-
  opensm/opensm/osm_state_mgr.c  |   16 +++-
  opensm/osmeventplugin/src/osmeventplugin.c |   15 +++
  3 files changed, 36 insertions(+), 2 deletions(-)

diff --git a/opensm/include/opensm/osm_event_plugin.h 
b/opensm/include/opensm/osm_event_plugin.h
index 33d1920..a565123 100644
--- a/opensm/include/opensm/osm_event_plugin.h
+++ b/opensm/include/opensm/osm_event_plugin.h
@@ -72,7 +72,12 @@ typedef enum {
OSM_EVENT_ID_PORT_SELECT,
OSM_EVENT_ID_TRAP,
OSM_EVENT_ID_SUBNET_UP,
-   OSM_EVENT_ID_MAX
+   OSM_EVENT_ID_MAX,


Likely you wanted to move OSM_EVENT_ID_MAX to be last in the list.


Oops...

-- Yevgeny
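
The reason the position of the MAX entry matters can be shown with a small sketch. The `demo_` names below are hypothetical and the event list is shortened; the point is only that a sentinel kept last equals the number of real events, so it can size tables and bound checks, whereas new entries placed after it fall outside both.

```c
#include <stddef.h>

typedef enum {
	DEMO_EVENT_ID_SUBNET_UP,
	DEMO_EVENT_ID_HEAVY_SWEEP_DONE,
	DEMO_EVENT_ID_UCAST_ROUTING_DONE,
	DEMO_EVENT_ID_MAX	/* sentinel: must stay last */
} demo_event_id_t;

/* Table sized by the sentinel: one slot per real event. */
static const char *const demo_event_names[DEMO_EVENT_ID_MAX] = {
	[DEMO_EVENT_ID_SUBNET_UP] = "subnet up",
	[DEMO_EVENT_ID_HEAVY_SWEEP_DONE] = "heavy sweep done",
	[DEMO_EVENT_ID_UCAST_ROUTING_DONE] = "unicast routing done",
};

/* Returns the event name, or NULL for an out-of-range id. */
static const char *demo_event_name(int id)
{
	if (id < 0 || id >= DEMO_EVENT_ID_MAX)
		return NULL;
	return demo_event_names[id];
}
```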


Sasha


+   OSM_EVENT_ID_HEAVY_SWEEP_DONE,
+   OSM_EVENT_ID_UCAST_ROUTING_DONE,
+   OSM_EVENT_ID_ENTERING_STANDBY,
+   OSM_EVENT_ID_SM_PORT_DOWN,
+   OSM_EVENT_ID_SA_DB_DUMPED
  } osm_epi_event_id_t;

  typedef struct osm_epi_port_id {
diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c
index 81c8f54..3231ae9 100644
--- a/opensm/opensm/osm_state_mgr.c
+++ b/opensm/opensm/osm_state_mgr.c
@@ -1151,6 +1151,8 @@ static void do_sweep(osm_sm_t * sm)
if (!sm->p_subn->subnet_initialization_error) {
OSM_LOG_MSG_BOX(sm->p_log, OSM_LOG_VERBOSE,
"REROUTE COMPLETE");
+   osm_opensm_report_event(sm->p_subn->p_osm,
+   OSM_EVENT_ID_UCAST_ROUTING_DONE, NULL);
return;
}
}
@@ -1185,6 +1187,8 @@ repeat_discovery:

/* Move to DISCOVERING state */
osm_sm_state_mgr_process(sm, OSM_SM_SIGNAL_DISCOVER);
+   osm_opensm_report_event(sm->p_subn->p_osm,
+   OSM_EVENT_ID_SM_PORT_DOWN, NULL);
return;
}

@@ -1205,6 +1209,8 @@ repeat_discovery:
"ENTERING STANDBY STATE");
/* notify master SM about us */
osm_send_trap144(sm, 0);
+   osm_opensm_report_event(sm->p_subn->p_osm,
+   OSM_EVENT_ID_ENTERING_STANDBY, NULL);
return;
}

@@ -1212,6 +1218,9 @@ repeat_discovery:
if (sm->p_subn->force_heavy_sweep)
goto repeat_discovery;

+   osm_opensm_report_event(sm->p_subn->p_osm,
+   OSM_EVENT_ID_HEAVY_SWEEP_DONE, NULL);
+
OSM_LOG_MSG_BOX(sm->p_log, OSM_LOG_VERBOSE, "HEAVY SWEEP COMPLETE");

/* If we are MASTER - get the highest remote_sm, and
@@ -1314,6 +1323,8 @@ repeat_discovery:

OSM_LOG_MSG_BOX(sm->p_log, OSM_LOG_VERBOSE,
"SWITCHES CONFIGURED FOR UNICAST");
+   osm_opensm_report_event(sm->p_subn->p_osm,
+   OSM_EVENT_ID_UCAST_ROUTING_DONE, NULL);

if (!sm->p_subn->opt.disable_multicast) {
osm_mcast_mgr_process(sm);
@@ -1375,7 +1386,10 @@ repeat_discovery:

if (osm_log_is_active(sm->p_log, OSM_LOG_VERBOSE) ||
sm->p_subn->opt.sa_db_dump)
-   osm_sa_db_file_dump(sm->p_subn->p_osm);
+   if (!osm_sa_db_file_dump(sm->p_subn->p_osm))
+   

Re: [PATCH v2] RDMA/CMA: fix iWARP adapter TCP port space usage

2010-06-17 Thread Steve Wise

Bernard Metzler wrote:

I agree that the issue must get solved, and it's good that it has
been brought up again. I agree with Chien that the
solution should respect and interface to a single in-kernel instance
maintaining the host-global TCP port space. iWARP is just another
protocol on top of TCP, like iSCSI. There is no good reason to
invent another TCP port maintainer per TCP user type that tries to
synchronize with the kernel if the resource is host-global and
already maintained by the kernel.

Since we are developing, and have already open sourced, a full software
implementation of RDMA (SoftiWARP), our view on the optimal solution
is necessarily different. Like kernel iSCSI, we are running on top of regular
kernel sockets. With that, there is no point in having a connection manager
block just the port we want to use for communication: SoftiWARP
uses kernel sockets for data communication.

  



Hey Bernard,

Has SoftiWARP been submitted upstream yet?



Therefore, I propose pushing responsibility back to the RDMA device driver,
where the actual connection setup is initiated (RNIC) or takes place
(software RDMA stack). I think it is not the job of the RDMA connection
manager to maintain TCP port space at all. It should be up to the driver
to take the appropriate steps. For lack of another interface, an
RNIC driver would create and bind a kernel socket to get hold of
the TCP port it intends to use for offloaded communication,
while a software RDMA stack simply goes ahead and communicates on
that socket. In the future it might be a good idea to approach the
netdev folks and kindly ask for a neat interface for just TCP port
maintenance, without the need to create and bind an otherwise
useless socket.
  



I proposed this design in 2007. It was NAK'd. Read the tail end of this 
email where I describe such a solution and indicate that Miller already 
NAK'd it. Now we could try again with this solution, but unless we have 
end users backing us and showing how much demand there is for this, it 
won't fly IMO.



http://lkml.org/lkml/2007/8/15/174




Of course, the RNIC driver must restrict its activities to local
IP addresses on its cards (or, for SoftiWARP, to IP addresses of interfaces
it is bound to). For example, a wildcard listen must get translated
into a listen restricted to the interface(s) under local control.

  



I implemented and submitted this type of solution for cxgb3 in 2007 as 
well.



http://lkml.org/lkml/2007/9/13/268


Roland didn't like it, I think, because it used well-known tokens in the 
interface name to designate iWARP IP addresses via ifconfig, like 
eth0:iw1. So the solution really required the admin to set up these 
iWARP-only subnets/interfaces. There was nothing that prevented non-iWARP 
traffic from arriving on these IP addresses other than admin policy. I 
think that was another reason Roland didn't like this solution. Anyway, 
you can peruse that thread, and maybe it's a starting point for some 
solution based on separate iWARP IP addresses.




Steve.




RE: [PATCH/RFC] mlx4_core: module param to limit msix vec allocation

2010-06-17 Thread Yevgeny Petrilin
 

 The mlx4_core driver allocates 'nreq' msix vectors (and irqs),
 where:
 
 nreq = min_t(int, dev->caps.num_eqs - dev->caps.reserved_eqs,
   num_possible_cpus() + 1);

 ConnectX HCAs support 512 event queues (4 reserved). On a system with enough 
 processors, we get:

  mlx4_core 0006:01:00.0: Requested 508 vectors, but only 256 MSI-X vectors 
 available, trying again

 Further attempts (by other drivers) to allocate interrupts fail, because 
 mlx4_core got 'em all.

 How about this?

Hi,
I think that this patch would do the job.
Anyway we are thinking of ways to change our interrupt allocation scheme.
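
The arithmetic behind the report, and the effect of a cap, can be sketched as follows. The patch under discussion is not quoted here, so `msix_limit` is a hypothetical stand-in for the proposed module parameter and the function names are illustrative only.

```c
static int demo_min_int(int a, int b)
{
	return a < b ? a : b;
}

/* Number of MSI-X vectors to request: one per possible CPU plus one,
 * bounded by the non-reserved event queues, and optionally bounded by
 * a cap. msix_limit <= 0 means "no extra cap" (hypothetical knob). */
static int demo_msix_nreq(int num_eqs, int reserved_eqs,
			  int possible_cpus, int msix_limit)
{
	int nreq = demo_min_int(num_eqs - reserved_eqs, possible_cpus + 1);

	if (msix_limit > 0)
		nreq = demo_min_int(nreq, msix_limit);
	return nreq;
}
```

With 512 event queues, 4 of them reserved, and enough possible CPUs, the uncapped request is 508 vectors, which matches the log line quoted above; a cap of 32 would leave the rest for other drivers.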

--Yevgeny


[PATCH] opensm/osm_trap_rcv.c: No need to check for sweep for trap 145

2010-06-17 Thread Hal Rosenstock

Trap 145 merely carries the SystemImageGUID (and an indication that it changed),
so there is no need to sweep, or even to check for one.

Signed-off-by: Hal Rosenstock hal.rosenst...@gmail.com
---
diff --git a/opensm/opensm/osm_trap_rcv.c b/opensm/opensm/osm_trap_rcv.c
index 500632c..71429c4 100644
--- a/opensm/opensm/osm_trap_rcv.c
+++ b/opensm/opensm/osm_trap_rcv.c
@@ -2,7 +2,7 @@
  * Copyright (c) 2004-2009 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2002-2007 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
- * Copyright (c) 2009 HNR Consulting. All rights reserved.
+ * Copyright (c) 2009,2010 HNR Consulting. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -510,10 +510,12 @@ static void trap_rcv_process_request(IN osm_sm_t * sm,
"ERR 3812: No physical port found for "
"trap 144: \"node description update\"\n");
goto check_sweep;
-   } else if (cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 145)
+   } else if (cl_ntoh16(p_ntci->g_or_v.generic.trap_num) == 145) {
/* this assumes that trap 145 content is not broken? */
p_physp->p_node->node_info.sys_guid =
p_ntci->data_details.ntc_145.new_sys_guid;
+   goto check_report;
+   }
 
 check_sweep:
/* do a sweep if we received a trap */


[PATCHv3][RESEND] opensm/osm_console.c: Add dump and clear redir perfmgr command support

2010-06-17 Thread Hal Rosenstock

Follows previous patch that adds better redirection support into PerfMgr

Signed-off-by: Hal Rosenstock hal.rosenst...@gmail.com

---
Changes since v2:
Rebased

Changes since v1:
Changes based on changes to PerfMgr redir support in v3 patch

diff --git a/opensm/opensm/osm_console.c b/opensm/opensm/osm_console.c
index 49b0ae0..1779d9d 100644
--- a/opensm/opensm/osm_console.c
+++ b/opensm/opensm/osm_console.c
@@ -1,6 +1,6 @@
 /*
  * Copyright (c) 2005-2009 Voltaire, Inc. All rights reserved.
- * Copyright (c) 2009 HNR Consulting. All rights reserved.
+ * Copyright (c) 2009,2010 HNR Consulting. All rights reserved.
  * Copyright (c) 2010 Mellanox Technologies LTD. All rights reserved.
  *
  * This software is available to you under a choice of one of two
@@ -232,7 +232,7 @@ static void help_update_desc(FILE *out, int detail)
 static void help_perfmgr(FILE * out, int detail)
 {
fprintf(out,
-   "perfmgr [enable|disable|clear_counters|dump_counters|print_counters|sweep_time[seconds]]\n");
+   "perfmgr [enable|disable|clear_counters|dump_counters|print_counters|dump_redir|clear_redir|sweep_time[seconds]]\n");
if (detail) {
fprintf(out,
"perfmgr -- print the performance manager state\n");
@@ -246,6 +246,10 @@ static void help_perfmgr(FILE * out, int detail)
"   [dump_counters [mach]] -- dump the counters (optionally in [mach]ine readable format)\n");
fprintf(out,
"   [print_counters nodename|nodeguid] -- print the counters for the specified node\n");
+   fprintf(out,
+  "[dump_redir [nodename|nodeguid]] -- dump the redirection table\n");
+   fprintf(out,
+  "[clear_redir [nodename|nodeguid]] -- clear the redirection table\n");
}
 }
 #endif /* ENABLE_OSM_PERF_MGR */
@@ -1180,6 +1184,152 @@ static void update_desc_parse(char **p_last, 
osm_opensm_t * p_osm, FILE * out)
 }
 
 #ifdef ENABLE_OSM_PERF_MGR
+static monitored_node_t *find_node_by_name(osm_opensm_t * p_osm,
+  char *nodename)
+{
+   cl_map_item_t *item;
+   monitored_node_t *node;
+
+   item = cl_qmap_head(&p_osm->perfmgr.monitored_map);
+while (item != cl_qmap_end(&p_osm->perfmgr.monitored_map)) {
+node = (monitored_node_t *)item;
+if (strcmp(node->name, nodename) == 0)
+   return node;
+item = cl_qmap_next(item);
+}
+
+   return NULL;
+}
+
+static monitored_node_t *find_node_by_guid(osm_opensm_t * p_osm,
+  uint64_t guid)
+{
+   cl_map_item_t *node;
+
+   node = cl_qmap_get(&p_osm->perfmgr.monitored_map, guid);
+   if (node != cl_qmap_end(&p_osm->perfmgr.monitored_map))
+   return (monitored_node_t *)node;
+
+   return NULL;
+}
+
+static void dump_redir_entry(monitored_node_t *p_mon_node, FILE * out)
+{
+   int port, redir;
+
+   /* only display monitored nodes with redirection info */
+   redir = 0;
+   for (port = (p_mon_node-esp0) ? 0 : 1;
+port  p_mon_node-num_ports; port++) {
+   if (p_mon_node-port[port].redirection) {
+   if (!redir) {
+   fprintf(out,Node GUID   ESP0   
Name\n);
+   fprintf(out,-      
\n);
+   fprintf(out,0x% PRIx64  %d  %s\n,
+   p_mon_node-guid, p_mon_node-esp0,
+   p_mon_node-name);
+   fprintf(out, \n   Port Valid  LIDs PKey  
QPPKey Index\n);
+   fprintf(out, -     --  
  --\n);
+   redir = 1;
+   }
+   fprintf(out,%d%d  %u-%u  0x%x 0x%x   
%d\n,
+   port, p_mon_node-port[port].valid,
+   cl_ntoh16(p_mon_node-port[port].orig_lid),
+   cl_ntoh16(p_mon_node-port[port].lid),
+   cl_ntoh16(p_mon_node-port[port].pkey),
+   cl_ntoh32(p_mon_node-port[port].qp),
+   p_mon_node-port[port].pkey_ix);
+   }
+   }
+   if (redir)
+   fprintf(out, \n);
+}
+
+static void dump_redir(osm_opensm_t * p_osm, char *nodename, FILE * out)
+{
+   monitored_node_t *p_mon_node;
+   uint64_t guid;
+
+   if (!p_osm->subn.opt.perfmgr_redir)
+   fprintf(out, "Perfmgr redirection not enabled\n");
+
+   fprintf(out, "\nRedirection Table\n");
+   fprintf(out, -\n);
+   cl_plock_acquire(p_osm-lock);
+   if (nodename) {
+   guid = 

Re: [PATCH] pkey fix for ipoib - resubmission

2010-06-17 Thread Eli Cohen
On Wed, Jun 16, 2010 at 4:59 PM, Mike Heinz michael.he...@qlogic.com wrote:

 IPoIB is coded to use the 1st PKey in the PKey table as its ib0 interface. 
 Additional ib0.pkey interfaces may be created using the /sys/class/... 
 add_child interface.

 However, there is a race.  During normal boot, IPoIB will be started before 
 the port is Active.  Hence the pkey table has not yet been programmed and has 
 a default pkey table (with 0x as only pkey).
So what's wrong with using the default pkey? It is a valid pkey and I don't
see why we should ignore it.


 Later when the SM moves the port to Active, the SM may program the pkey table 
 differently.  However at this point IPoIB has already started using the 
 incorrect pkey.

 It appears that the initially formatted 'broadcast' mgid is never updated to 
 supply actual pkey value if ipoib comes up before hca port. Proposed patch 
 targets two issues:

 1. Suppress activation of interface and join multicast group queries (it will 
 fail anyway) until hca port is initialized. When port becomes active - update 
 pkey value and move on.
I don't think this is required.


 2. Update broadcast mgid based on actual pkey, then issue join broadcast 
 group request.
I agree that the broadcast MGID is not updated. But it seems to me
that all that's needed is to update priv->dev->broadcast with the
updated pkey at ipoib_open(). The rest is already taken care of since
pkey change events are handled by IPoIB.
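
For illustration, here is a minimal sketch of what "update the broadcast address with the updated pkey" amounts to. It assumes the usual IPoIB layout: a 20-byte hardware address made of 4 bytes of flags/QPN followed by the 16-byte MGID, with the P_Key carried in MGID bytes 4-5 (address bytes 8-9). The names init_broadcast_addr and set_broadcast_pkey are made up for this example; this is not the driver's code.

```c
#include <stdint.h>
#include <string.h>

#define IPOIB_ADDR_LEN 20	/* 4 bytes flags/QPN + 16 bytes MGID */

/* IPv4 broadcast template: MGID ff12:401b:<pkey>:0000:0000:0000:ffff:ffff */
static const uint8_t bcast_template[IPOIB_ADDR_LEN] = {
	0x00, 0xff, 0xff, 0xff,
	0xff, 0x12, 0x40, 0x1b, 0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x00, 0xff, 0xff, 0xff, 0xff
};

/* Write the pkey into the MGID part of the address (hypothetical helper). */
static void set_broadcast_pkey(uint8_t addr[IPOIB_ADDR_LEN], uint16_t pkey)
{
	pkey |= 0x8000;		/* the group is joined with the full-member form */
	addr[8] = pkey >> 8;
	addr[9] = pkey & 0xff;
}

/* Start from the template, then stamp in the pkey currently in use. */
static void init_broadcast_addr(uint8_t addr[IPOIB_ADDR_LEN], uint16_t pkey)
{
	memcpy(addr, bcast_template, IPOIB_ADDR_LEN);
	set_broadcast_pkey(addr, pkey);
}
```

If the pkey at table index 0 changes after the interface was first set up, only bytes 8 and 9 of the address need to be rewritten before the broadcast group is joined again.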




 Signed-Off-By: Michael Heinz michael.he...@qlogic.com

 ---
 diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c 
 b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
 index ec6b4fb..496d96c 100644
 --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
 +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
 @@ -51,6 +51,7 @@ MODULE_PARM_DESC(data_debug_level,
  #endif

  static DEFINE_MUTEX(pkey_mutex);
 +static void ipoib_pkey_dev_check_presence(struct net_device *dev);

  struct ipoib_ah *ipoib_create_ah(struct net_device *dev,
                                 struct ib_pd *pd, struct ib_ah_attr *attr)
 @@ -654,12 +655,13 @@ int ipoib_ib_dev_open(struct net_device *dev)
        struct ipoib_dev_priv *priv = netdev_priv(dev);
        int ret;

 -       if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &priv->pkey_index)) {
 +       ipoib_pkey_dev_check_presence(dev);
 +
 +       if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) {
                ipoib_warn(priv, "P_Key 0x%04x not found\n", priv->pkey);
                clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
                return -1;
        }
 -       set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);

        ret = ipoib_init_qp(dev);
        if (ret) {
 @@ -694,9 +696,26 @@ int ipoib_ib_dev_open(struct net_device *dev)
  static void ipoib_pkey_dev_check_presence(struct net_device *dev)
  {
        struct ipoib_dev_priv *priv = netdev_priv(dev);
 -       u16 pkey_index = 0;
 +       struct ib_port_attr    port_attr;
 +
 +       if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) {
 +               clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
 +               if (ib_query_port(priv->ca, priv->port, &port_attr)) {
 +                       ipoib_warn(priv, "Query port attrs failed\n");
 +                       return;
 +               }
 +
 +               if (port_attr.state != IB_PORT_ACTIVE)
 +                       return;
 +
 +               if (ib_query_pkey(priv->ca, priv->port, 0, &priv->pkey)) {
 +                       ipoib_warn(priv, "Query P_Key table entry 0 failed\n");
 +                       return;
 +               }
 +               set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
 +       }

 -       if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &pkey_index))
 +       if (ib_find_pkey(priv->ca, priv->port, priv->pkey, &priv->pkey_index))
                clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
        else
                set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
 @@ -955,7 +974,8 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv,
                }

                /* restart QP only if P_Key index is changed */
 -               if (test_and_set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags) &&
 +               if (test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags) &&
 +                   test_and_set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags) &&
                    new_index == priv->pkey_index) {
                        ipoib_dbg(priv, "Not flushing - P_Key index not changed.\n");
                        return;
 diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 
 b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
 index 3871ac6..6fe6527 100644
 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
 +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
 @@ -552,6 +552,13 @@ void ipoib_mcast_join_task(struct work_struct *work)
                }

                spin_lock_irq(priv-lock);
 +
 +               if (!test_bit(IPOIB_FLAG_SUBINTERFACE, priv-flags)) {
 +                       /* fix broadcast gid in case if pkey was changed */
 +                

RE: [PATCH] pkey fix for ipoib - resubmission

2010-06-17 Thread Mike Heinz
 So what's wrong with using the default pkey? It is a valid pkey and I don't
 see why we should ignore it.

In fabrics using quality of service and virtual fabrics, the default pkey is 
probably the wrong one to use for network traffic - and may not work at all. 
Remember, the real default pkey is 0x7fff, not 0x - and 0x7fff only 
guarantees communications with the SM not with other nodes.

 I don't think this is required.

It is certainly required for any fabric that does not permit the use of 0x 
as a pkey for ipoib traffic.
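
To make the membership point concrete, here is a small sketch of the P_Key matching rule as commonly described for InfiniBand partitions: the top bit of the 16-bit P_Key marks full membership, the low 15 bits identify the partition, and two ports can exchange traffic only if the partition matches and at least one side is a full member. The helper names are hypothetical and this is not code from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define PKEY_FULL_MEMBER 0x8000u	/* top bit: full (1) or limited (0) member */
#define PKEY_BASE_MASK   0x7fffu	/* low 15 bits: partition number */

static bool pkey_is_full(uint16_t pkey)
{
	return (pkey & PKEY_FULL_MEMBER) != 0;
}

/* Same partition, and at least one of the two ends is a full member. */
static bool pkeys_can_communicate(uint16_t a, uint16_t b)
{
	if ((a & PKEY_BASE_MASK) != (b & PKEY_BASE_MASK))
		return false;
	return pkey_is_full(a) || pkey_is_full(b);
}
```

Under this rule a node holding only the limited key 0x7fff can reach a full member of the same partition (such as the SM), but two limited members cannot reach each other.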


-Original Message-
From: linux-rdma-ow...@vger.kernel.org 
[mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Eli Cohen
Sent: Thursday, June 17, 2010 12:37 PM
To: Mike Heinz
Cc: linux-rdma@vger.kernel.org; Roland Dreier
Subject: Re: [PATCH] pkey fix for ipoib - resubmission


Re: [PATCHv4][RESEND] opensm/PerfMgr: Better redirection support

2010-06-17 Thread Ira Weiny
Sasha,

I was thinking of doing something similar to this.  When can you get this 
applied?

Thanks,
Ira

On Thu, 17 Jun 2010 09:03:35 -0700
Hal Rosenstock hnr...@comcast.net wrote:

 
 Handle PKey and QPN redirection information
 GID redirection handling remains
 
 Signed-off-by: Hal Rosenstock hal.rosenst...@gmail.com
 
 ---
 Changes since v3:
 Rebased
 
 Changes since v2:
 Use OpenSM DB rather than vendor layer for local port number and PKeys
 Change most log levels from ERROR to VERBOSE
 Redirection info validity now determined by single flag
 validate_redir_pkey returns pkey index or -1 rather than boolean
 Removed redir_ prefixes
 
 Changes since v1:
 Added include of osm_helper.h to osm_perfmgr.c
 
 diff --git a/opensm/include/opensm/osm_perfmgr.h 
 b/opensm/include/opensm/osm_perfmgr.h
 index c26c141..34925e8 100644
 --- a/opensm/include/opensm/osm_perfmgr.h
 +++ b/opensm/include/opensm/osm_perfmgr.h
 @@ -1,7 +1,7 @@
  /*
   * Copyright (c) 2007 The Regents of the University of California.
   * Copyright (c) 2007-2009 Voltaire, Inc. All rights reserved.
 - * Copyright (c) 2009 HNR Consulting. All rights reserved.
 + * Copyright (c) 2009,2010 HNR Consulting. All rights reserved.
   *
   * This software is available to you under a choice of one of two
   * licenses.  You may choose to be licensed under the terms of the GNU
 @@ -90,11 +90,17 @@ typedef enum {
 PERFMGR_SWEEP_SUSPENDED
  } osm_perfmgr_sweep_state_t;
 
 -/* Redirection information */
 -typedef struct redir {
 -   ib_net16_t redir_lid;
 -   ib_net32_t redir_qp;
 -} redir_t;
 +typedef struct monitored_port {
 +   uint16_t pkey_ix;
 +   ib_net16_t orig_lid;
 +   boolean_t redirection;
 +   boolean_t valid;
 +   /* Redirection fields from ClassPortInfo */
 +   ib_gid_t gid;
 +   ib_net16_t lid;
 +   ib_net16_t pkey;
 +   ib_net32_t qp;
 +} monitored_port_t;
 
  /* Node to store information about nodes being monitored */
  typedef struct monitored_node {
 @@ -104,7 +110,7 @@ typedef struct monitored_node {
 boolean_t esp0;
 char *name;
 uint32_t num_ports;
 -   redir_t redir_port[1];  /* redirection on a per port basis */
 +   monitored_port_t port[1];
  } monitored_node_t;
 
  struct osm_opensm;
 @@ -134,6 +140,8 @@ typedef struct osm_perfmgr {
 uint32_t max_outstanding_queries;
 cl_qmap_t monitored_map;/* map the nodes being tracked */
 monitored_node_t *remove_list;
 +   ib_net64_t port_guid;
 +   int16_t local_port;
  } osm_perfmgr_t;
  /*
  * FIELDS
 diff --git a/opensm/opensm/osm_perfmgr.c b/opensm/opensm/osm_perfmgr.c
 index 398b463..d86e1c6 100644
 --- a/opensm/opensm/osm_perfmgr.c
 +++ b/opensm/opensm/osm_perfmgr.c
 @@ -1,7 +1,7 @@
  /*
   * Copyright (c) 2007 The Regents of the University of California.
   * Copyright (c) 2007-2009 Voltaire, Inc. All rights reserved.
 - * Copyright (c) 2009 HNR Consulting. All rights reserved.
 + * Copyright (c) 2009,2010 HNR Consulting. All rights reserved.
   *
   * This software is available to you under a choice of one of two
   * licenses.  You may choose to be licensed under the terms of the GNU
 @@ -64,6 +64,7 @@
  #include opensm/osm_log.h
  #include opensm/osm_node.h
  #include opensm/osm_opensm.h
 +#include opensm/osm_helper.h
 
  #define PERFMGR_INITIAL_TID_VALUE 0xcafe
 
 @@ -194,6 +195,7 @@ static void perfmgr_mad_send_err_callback(void 
 *bind_context,
 uint8_t port = context-perfmgr_context.port;
 cl_map_item_t *p_node;
 monitored_node_t *p_mon_node;
 +   ib_net16_t orig_lid;
 
 OSM_LOG_ENTER(pm-log);
 
 @@ -225,9 +227,11 @@ static void perfmgr_mad_send_err_callback(void 
 *bind_context,
 p_mon_node-num_ports);
 goto Exit;
 }
 -   /* Clear redirection info */
 -   p_mon_node-redir_port[port].redir_lid = 0;
 -   p_mon_node-redir_port[port].redir_qp = 0;
 +   /* Clear redirection info for this port except orig_lid */
 +   orig_lid = p_mon_node-port[port].orig_lid;
 +   memset(p_mon_node-port[port], 0, sizeof(monitored_port_t));
 +   p_mon_node-port[port].orig_lid = orig_lid;
 +   p_mon_node-port[port].valid = TRUE;
 cl_plock_release(pm-osm-lock);
 }
 
 @@ -256,7 +260,7 @@ ib_api_status_t osm_perfmgr_bind(osm_perfmgr_t * pm, 
 ib_net64_t port_guid)
 goto Exit;
 }
 
 -   bind_info.port_guid = port_guid;
 +   bind_info.port_guid = pm-port_guid = port_guid;
 bind_info.mad_class = IB_MCLASS_PERF;
 bind_info.class_version = 1;
 bind_info.is_responder = FALSE;
 @@ -309,24 +313,14 @@ static ib_net32_t get_qp(monitored_node_t * mon_node, 
 uint8_t port)
 ib_net32_t qp = IB_QP1;
 
 if (mon_node  mon_node-num_ports  port  mon_node-num_ports 
 -   

[PATCHv4 2/2][RESEND] opensm/osm_console.c: Add dump and clear redir perfmgr command support

2010-06-17 Thread Hal Rosenstock

Follows previous patch that adds better redirection support into PerfMgr

Signed-off-by: Hal Rosenstock hal.rosenst...@gmail.com
---
Changes since v3:
Fixed some formatting problems (spaces instead of tabs)

Changes since v2:
Rebased

Changes since v1:
Changes based on changes to PerfMgr redir support in v3 patch

diff --git a/opensm/opensm/osm_console.c b/opensm/opensm/osm_console.c
index 49b0ae0..764235a 100644
--- a/opensm/opensm/osm_console.c
+++ b/opensm/opensm/osm_console.c
@@ -1,6 +1,6 @@
 /*
  * Copyright (c) 2005-2009 Voltaire, Inc. All rights reserved.
- * Copyright (c) 2009 HNR Consulting. All rights reserved.
+ * Copyright (c) 2009,2010 HNR Consulting. All rights reserved.
  * Copyright (c) 2010 Mellanox Technologies LTD. All rights reserved.
  *
  * This software is available to you under a choice of one of two
@@ -232,7 +232,7 @@ static void help_update_desc(FILE *out, int detail)
 static void help_perfmgr(FILE * out, int detail)
 {
fprintf(out,
-   "perfmgr [enable|disable|clear_counters|dump_counters|print_counters|sweep_time[seconds]]\n");
+   "perfmgr [enable|disable|clear_counters|dump_counters|print_counters|dump_redir|clear_redir|sweep_time[seconds]]\n");
if (detail) {
fprintf(out,
perfmgr -- print the performance manager state\n);
@@ -246,6 +246,10 @@ static void help_perfmgr(FILE * out, int detail)
   [dump_counters [mach]] -- dump the counters 
(optionally in [mach]ine readable format)\n);
fprintf(out,
   [print_counters nodename|nodeguid] -- print the 
counters for the specified node\n);
+   fprintf(out,
+  [dump_redir [nodename|nodeguid]] -- dump the 
redirection table\n);
+   fprintf(out,
+  [clear_redir [nodename|nodeguid]] -- clear the 
redirection table\n);
}
 }
 #endif /* ENABLE_OSM_PERF_MGR */
@@ -1180,6 +1184,152 @@ static void update_desc_parse(char **p_last, 
osm_opensm_t * p_osm, FILE * out)
 }
 
 #ifdef ENABLE_OSM_PERF_MGR
+static monitored_node_t *find_node_by_name(osm_opensm_t * p_osm,
+  char *nodename)
+{
+   cl_map_item_t *item;
+   monitored_node_t *node;
+
+   item = cl_qmap_head(&p_osm->perfmgr.monitored_map);
+   while (item != cl_qmap_end(&p_osm->perfmgr.monitored_map)) {
+   node = (monitored_node_t *)item;
+   if (strcmp(node->name, nodename) == 0)
+   return node;
+   item = cl_qmap_next(item);
+   }
+
+   return NULL;
+}
+
+static monitored_node_t *find_node_by_guid(osm_opensm_t * p_osm,
+  uint64_t guid)
+{
+   cl_map_item_t *node;
+
+   node = cl_qmap_get(&p_osm->perfmgr.monitored_map, guid);
+   if (node != cl_qmap_end(&p_osm->perfmgr.monitored_map))
+   return (monitored_node_t *)node;
+
+   return NULL;
+}
+
+static void dump_redir_entry(monitored_node_t *p_mon_node, FILE * out)
+{
+   int port, redir;
+
+   /* only display monitored nodes with redirection info */
+   redir = 0;
+   for (port = (p_mon_node-esp0) ? 0 : 1;
+port  p_mon_node-num_ports; port++) {
+   if (p_mon_node-port[port].redirection) {
+   if (!redir) {
+   fprintf(out,Node GUID   ESP0   
Name\n);
+   fprintf(out,-      
\n);
+   fprintf(out,0x% PRIx64  %d  %s\n,
+   p_mon_node-guid, p_mon_node-esp0,
+   p_mon_node-name);
+   fprintf(out, \n   Port Valid  LIDs PKey  
QPPKey Index\n);
+   fprintf(out, -     --  
  --\n);
+   redir = 1;
+   }
+   fprintf(out,%d%d  %u-%u  0x%x 0x%x   
%d\n,
+   port, p_mon_node-port[port].valid,
+   cl_ntoh16(p_mon_node-port[port].orig_lid),
+   cl_ntoh16(p_mon_node-port[port].lid),
+   cl_ntoh16(p_mon_node-port[port].pkey),
+   cl_ntoh32(p_mon_node-port[port].qp),
+   p_mon_node-port[port].pkey_ix);
+   }
+   }
+   if (redir)
+   fprintf(out, \n);
+}
+
+static void dump_redir(osm_opensm_t * p_osm, char *nodename, FILE * out)
+{
+   monitored_node_t *p_mon_node;
+   uint64_t guid;
+
+   if (!p_osm->subn.opt.perfmgr_redir)
+   fprintf(out, "Perfmgr redirection not enabled\n");
+
+   fprintf(out, "\nRedirection Table\n");
+   fprintf(out, -\n);
+   

[PATCHv5 1/2][RESEND] opensm/PerfMgr: Better redirection support

2010-06-17 Thread Hal Rosenstock

Handle PKey and QPN redirection information
GID redirection handling remains

Signed-off-by: Hal Rosenstock hal.rosenst...@gmail.com
---
Changes since v4:
Fixed some trailing whitespace problems

Changes since v3:
Rebased

Changes since v2:
Use OpenSM DB rather than vendor layer for local port number and PKeys
Change most log levels from ERROR to VERBOSE
Redirection info validity now determined by single flag
validate_redir_pkey returns pkey index or -1 rather than boolean
Removed redir_ prefixes

Changes since v1:
Added include of osm_helper.h to osm_perfmgr.c
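
As a reading aid for the changelog above, here is a self-contained sketch of the two behaviours the patch describes: clearing a port's redirection state while keeping its original LID, and using the redirected QP only when redirection is flagged. The type port_state_t and the functions clear_redirection and select_qp are simplified stand-ins invented for this example (host byte order, no GID field); the real code operates on monitored_port_t inside the PerfMgr.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for monitored_port_t (assumption: host byte order) */
typedef struct port_state {
	uint16_t pkey_ix;
	uint16_t orig_lid;
	int redirection;
	int valid;
	uint16_t lid;
	uint16_t pkey;
	uint32_t qp;
} port_state_t;

#define DEFAULT_QP 1u	/* QP1, the general services QP */

/* Forget all redirection info for a port except where it originally lived. */
static void clear_redirection(port_state_t *p)
{
	uint16_t orig_lid = p->orig_lid;

	memset(p, 0, sizeof(*p));
	p->orig_lid = orig_lid;
	p->valid = 1;
}

/* Use the redirected QP only if redirection is recorded and a QP is set. */
static uint32_t select_qp(const port_state_t *p)
{
	if (p != NULL && p->redirection && p->qp)
		return p->qp;
	return DEFAULT_QP;
}
```

Saving the one field that must survive, zeroing the whole struct, and restoring it keeps the reset correct even when new redirection fields are added later.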

diff --git a/opensm/include/opensm/osm_perfmgr.h 
b/opensm/include/opensm/osm_perfmgr.h
index c26c141..34925e8 100644
--- a/opensm/include/opensm/osm_perfmgr.h
+++ b/opensm/include/opensm/osm_perfmgr.h
@@ -1,7 +1,7 @@
 /*
  * Copyright (c) 2007 The Regents of the University of California.
  * Copyright (c) 2007-2009 Voltaire, Inc. All rights reserved.
- * Copyright (c) 2009 HNR Consulting. All rights reserved.
+ * Copyright (c) 2009,2010 HNR Consulting. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -90,11 +90,17 @@ typedef enum {
PERFMGR_SWEEP_SUSPENDED
 } osm_perfmgr_sweep_state_t;
 
-/* Redirection information */
-typedef struct redir {
-   ib_net16_t redir_lid;
-   ib_net32_t redir_qp;
-} redir_t;
+typedef struct monitored_port {
+   uint16_t pkey_ix;
+   ib_net16_t orig_lid;
+   boolean_t redirection;
+   boolean_t valid;
+   /* Redirection fields from ClassPortInfo */
+   ib_gid_t gid;
+   ib_net16_t lid;
+   ib_net16_t pkey;
+   ib_net32_t qp;
+} monitored_port_t;
 
 /* Node to store information about nodes being monitored */
 typedef struct monitored_node {
@@ -104,7 +110,7 @@ typedef struct monitored_node {
boolean_t esp0;
char *name;
uint32_t num_ports;
-   redir_t redir_port[1];  /* redirection on a per port basis */
+   monitored_port_t port[1];
 } monitored_node_t;
 
 struct osm_opensm;
@@ -134,6 +140,8 @@ typedef struct osm_perfmgr {
uint32_t max_outstanding_queries;
cl_qmap_t monitored_map;/* map the nodes being tracked */
monitored_node_t *remove_list;
+   ib_net64_t port_guid;
+   int16_t local_port;
 } osm_perfmgr_t;
 /*
 * FIELDS
diff --git a/opensm/opensm/osm_perfmgr.c b/opensm/opensm/osm_perfmgr.c
index 398b463..fccf9d6 100644
--- a/opensm/opensm/osm_perfmgr.c
+++ b/opensm/opensm/osm_perfmgr.c
@@ -1,7 +1,7 @@
 /*
  * Copyright (c) 2007 The Regents of the University of California.
  * Copyright (c) 2007-2009 Voltaire, Inc. All rights reserved.
- * Copyright (c) 2009 HNR Consulting. All rights reserved.
+ * Copyright (c) 2009,2010 HNR Consulting. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -64,6 +64,7 @@
 #include <opensm/osm_log.h>
 #include <opensm/osm_node.h>
 #include <opensm/osm_opensm.h>
+#include <opensm/osm_helper.h>
 
 #define PERFMGR_INITIAL_TID_VALUE 0xcafe
 
@@ -194,6 +195,7 @@ static void perfmgr_mad_send_err_callback(void *bind_context,
uint8_t port = context->perfmgr_context.port;
cl_map_item_t *p_node;
monitored_node_t *p_mon_node;
+   ib_net16_t orig_lid;
 
OSM_LOG_ENTER(pm->log);
 
@@ -225,9 +227,11 @@ static void perfmgr_mad_send_err_callback(void *bind_context,
p_mon_node->num_ports);
goto Exit;
}
-   /* Clear redirection info */
-   p_mon_node->redir_port[port].redir_lid = 0;
-   p_mon_node->redir_port[port].redir_qp = 0;
+   /* Clear redirection info for this port except orig_lid */
+   orig_lid = p_mon_node->port[port].orig_lid;
+   memset(&p_mon_node->port[port], 0, sizeof(monitored_port_t));
+   p_mon_node->port[port].orig_lid = orig_lid;
+   p_mon_node->port[port].valid = TRUE;
cl_plock_release(&pm->osm->lock);
}
 
@@ -256,7 +260,7 @@ ib_api_status_t osm_perfmgr_bind(osm_perfmgr_t * pm, ib_net64_t port_guid)
goto Exit;
}
 
-   bind_info.port_guid = port_guid;
+   bind_info.port_guid = pm->port_guid = port_guid;
bind_info.mad_class = IB_MCLASS_PERF;
bind_info.class_version = 1;
bind_info.is_responder = FALSE;
@@ -309,24 +313,14 @@ static ib_net32_t get_qp(monitored_node_t * mon_node, uint8_t port)
ib_net32_t qp = IB_QP1;
 
if (mon_node && mon_node->num_ports && port < mon_node->num_ports &&
-   mon_node->redir_port[port].redir_lid &&
-   mon_node->redir_port[port].redir_qp)
-   qp = mon_node->redir_port[port].redir_qp;
+   mon_node->port[port].redirection && mon_node->port[port].qp)
+   qp = mon_node->port[port].qp;
 
return qp;
 

[PATCHv5 2/2][RESEND] opensm/osm_console.c: Add dump and clear redir perfmgr command support

2010-06-17 Thread Hal Rosenstock
Follows previous patch that adds better redirection support for PerfMgr

Signed-off-by: Hal Rosenstock hal.rosenst...@gmail.com
---
Changes since v4:
Fixed rejection of Copyright hunk

Changes since v3:
Fixed some formatting problems (spaces instead of tabs)

Changes since v2:
Rebased

Changes since v1:
Changes based on changes to PerfMgr redir support in v3 patch

diff --git a/opensm/opensm/osm_console.c b/opensm/opensm/osm_console.c
index bc7bea3..27f1e1e 100644
--- a/opensm/opensm/osm_console.c
+++ b/opensm/opensm/osm_console.c
@@ -1,6 +1,6 @@
 /*
  * Copyright (c) 2005-2009 Voltaire, Inc. All rights reserved.
- * Copyright (c) 2009 HNR Consulting. All rights reserved.
+ * Copyright (c) 2009,2010 HNR Consulting. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -231,7 +231,7 @@ static void help_update_desc(FILE *out, int detail)
 static void help_perfmgr(FILE * out, int detail)
 {
fprintf(out,
-   "perfmgr [enable|disable|clear_counters|dump_counters|print_counters|sweep_time[seconds]]\n");
+   "perfmgr [enable|disable|clear_counters|dump_counters|print_counters|dump_redir|clear_redir|sweep_time[seconds]]\n");
if (detail) {
fprintf(out,
perfmgr -- print the performance manager state\n);
@@ -245,6 +245,10 @@ static void help_perfmgr(FILE * out, int detail)
   [dump_counters [mach]] -- dump the counters 
(optionally in [mach]ine readable format)\n);
fprintf(out,
   [print_counters nodename|nodeguid] -- print the 
counters for the specified node\n);
+   fprintf(out,
+  [dump_redir [nodename|nodeguid]] -- dump the 
redirection table\n);
+   fprintf(out,
+  [clear_redir [nodename|nodeguid]] -- clear the 
redirection table\n);
}
 }
 #endif /* ENABLE_OSM_PERF_MGR */
@@ -1179,6 +1183,152 @@ static void update_desc_parse(char **p_last, 
osm_opensm_t * p_osm, FILE * out)
 }
 
 #ifdef ENABLE_OSM_PERF_MGR
+static monitored_node_t *find_node_by_name(osm_opensm_t * p_osm,
+  char *nodename)
+{
+   cl_map_item_t *item;
+   monitored_node_t *node;
+
+   item = cl_qmap_head(&p_osm->perfmgr.monitored_map);
+   while (item != cl_qmap_end(&p_osm->perfmgr.monitored_map)) {
+   node = (monitored_node_t *)item;
+   if (strcmp(node->name, nodename) == 0)
+   return node;
+   item = cl_qmap_next(item);
+   }
+
+   return NULL;
+}
+
+static monitored_node_t *find_node_by_guid(osm_opensm_t * p_osm,
+  uint64_t guid)
+{
+   cl_map_item_t *node;
+
+   node = cl_qmap_get(&p_osm->perfmgr.monitored_map, guid);
+   if (node != cl_qmap_end(&p_osm->perfmgr.monitored_map))
+   return (monitored_node_t *)node;
+
+   return NULL;
+}
+
+static void dump_redir_entry(monitored_node_t *p_mon_node, FILE * out)
+{
+   int port, redir;
+
+   /* only display monitored nodes with redirection info */
+   redir = 0;
+   for (port = (p_mon_node-esp0) ? 0 : 1;
+port  p_mon_node-num_ports; port++) {
+   if (p_mon_node-port[port].redirection) {
+   if (!redir) {
+   fprintf(out,Node GUID   ESP0   
Name\n);
+   fprintf(out,-      
\n);
+   fprintf(out,0x% PRIx64  %d  %s\n,
+   p_mon_node-guid, p_mon_node-esp0,
+   p_mon_node-name);
+   fprintf(out, \n   Port Valid  LIDs PKey  
QPPKey Index\n);
+   fprintf(out, -     --  
  --\n);
+   redir = 1;
+   }
+   fprintf(out,%d%d  %u-%u  0x%x 0x%x   
%d\n,
+   port, p_mon_node-port[port].valid,
+   cl_ntoh16(p_mon_node-port[port].orig_lid),
+   cl_ntoh16(p_mon_node-port[port].lid),
+   cl_ntoh16(p_mon_node-port[port].pkey),
+   cl_ntoh32(p_mon_node-port[port].qp),
+   p_mon_node-port[port].pkey_ix);
+   }
+   }
+   if (redir)
+   fprintf(out, \n);
+}
+
+static void dump_redir(osm_opensm_t * p_osm, char *nodename, FILE * out)
+{
+   monitored_node_t *p_mon_node;
+   uint64_t guid;
+
+   if (!p_osm->subn.opt.perfmgr_redir)
+   fprintf(out, "Perfmgr redirection not enabled\n");
+
+   fprintf(out, "\nRedirection Table\n");
+   fprintf(out, 

[PATCH 5/5 v2] dapl-2.0 - scm, ucm: add pkey, pkey_index, sl override for QP's

2010-06-17 Thread Davis, Arlin R
Or/Sean, 

Good points. Here is v2 without index capabilities. 

Hefty, Sean wrote:
 The index isn't guaranteed to be the same across all nodes.  
If a consumer is going to manually control this, they should 
really be forced to use the actual pkey.
yes, I saw this confusion in action, for most users pkey index doesn't 
mean anything, it may also change across time, which can break 
scripts/setting to run specific jobs using specific partitions.

Or.


On a per-open basis, add environment variables
DAPL_IB_SL and DAPL_IB_PKEY and use them during
connection setup (QP modify) to override the default
values of 0 for SL and PKEY index. If a pkey is
provided, find its pkey index by scanning up to
dev_attr.max_pkeys table entries with ibv_query_pkey.
Used for both RC and UD type QPs.
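
The lookup described above can be sketched in isolation as follows. This is an illustrative stand-in, not the patch itself: env_val, find_pkey_index, pkey_query_fn and table_query are hypothetical names, and the P_Key table read is abstracted behind a callback so the example runs without hardware. In real code the callback would wrap ibv_query_pkey, which, as far as I know, reports the value in network byte order, so a caller would convert before comparing.

```c
#include <stdint.h>
#include <stdlib.h>

/* Callback standing in for a P_Key table read; returns 0 on success. */
typedef int (*pkey_query_fn)(void *ctx, int index, uint16_t *pkey);

/* Read a numeric environment variable, falling back to a default. */
static unsigned long env_val(const char *name, unsigned long def)
{
	const char *s = getenv(name);

	return s ? strtoul(s, NULL, 0) : def;
}

/* Return the table index holding 'wanted', or -1 if absent or a read fails. */
static int find_pkey_index(pkey_query_fn query, void *ctx,
			   int max_pkeys, uint16_t wanted)
{
	for (int i = 0; i < max_pkeys; i++) {
		uint16_t pkey = 0;

		if (query(ctx, i, &pkey))
			return -1;
		if (pkey == wanted)
			return i;
	}
	return -1;
}

/* Array-backed query used only to demonstrate the lookup. */
static int table_query(void *ctx, int index, uint16_t *pkey)
{
	const uint16_t *table = ctx;

	*pkey = table[index];
	return 0;
}
```

A caller would read the requested pkey with env_val("DAPL_IB_PKEY", 0) and, when it is non-zero, resolve it to an index on the local port; because the index can differ between nodes, only the pkey value itself is taken from the user.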

Signed-off-by: Arlin Davis arlin.r.da...@intel.com
---
 dapl/openib_cma/dapl_ib_util.h |4 +++-
 dapl/openib_common/qp.c|8 
 dapl/openib_common/util.c  |   39 +--
 dapl/openib_scm/dapl_ib_util.h |4 
 dapl/openib_ucm/dapl_ib_util.h |3 +++
 5 files changed, 51 insertions(+), 7 deletions(-)

diff --git a/dapl/openib_cma/dapl_ib_util.h b/dapl/openib_cma/dapl_ib_util.h
index a710195..471bd7f 100755
--- a/dapl/openib_cma/dapl_ib_util.h
+++ b/dapl/openib_cma/dapl_ib_util.h
@@ -121,7 +121,9 @@ typedef struct _ib_hca_transport
uint8_t tclass;
uint8_t mtu;
DAT_NAMED_ATTR  named_attr;
-
+   uint8_t sl;
+   uint16_tpkey;
+   int pkey_idx;
 } ib_hca_transport_t;
 
 /* prototypes */
diff --git a/dapl/openib_common/qp.c b/dapl/openib_common/qp.c
index 473604b..179eef0 100644
--- a/dapl/openib_common/qp.c
+++ b/dapl/openib_common/qp.c
@@ -422,7 +422,7 @@ dapls_modify_qp_state(IN ib_qp_handle_t qp_handle,
qp_attr.ah_attr.grh.traffic_class =
ia_ptr->hca_ptr->ib_trans.tclass;
}
-   qp_attr.ah_attr.sl = 0;
+   qp_attr.ah_attr.sl = ia_ptr->hca_ptr->ib_trans.sl;
qp_attr.ah_attr.src_path_bits = 0;
qp_attr.ah_attr.port_num = ia_ptr->hca_ptr->port_num;
 
@@ -489,7 +489,7 @@ dapls_modify_qp_state(IN ib_qp_handle_t qp_handle,
qp_attr.qkey = DAT_UD_QKEY;
}
 
-   qp_attr.pkey_index = 0;
+   qp_attr.pkey_index = ia_ptr->hca_ptr->ib_trans.pkey_idx;
qp_attr.port_num = ia_ptr->hca_ptr->port_num;
 
dapl_dbg_log(DAPL_DBG_TYPE_EP,
@@ -519,7 +519,7 @@ dapls_modify_qp_ud(IN DAPL_HCA *hca, IN ib_qp_handle_t qp)
/* modify QP, setup and prepost buffers */
dapl_os_memzero((void *)&qp_attr, sizeof(qp_attr));
qp_attr.qp_state = IBV_QPS_INIT;
-qp_attr.pkey_index = 0;
+qp_attr.pkey_index = hca->ib_trans.pkey_idx;
 qp_attr.port_num = hca->port_num;
 qp_attr.qkey = DAT_UD_QKEY;
if (ibv_modify_qp(qp, &qp_attr, 
@@ -582,7 +582,7 @@ dapls_create_ah(IN DAPL_HCA *hca,
qp_attr.ah_attr.grh.hop_limit = hca->ib_trans.hop_limit;
qp_attr.ah_attr.grh.traffic_class = hca->ib_trans.tclass;
}
-   qp_attr.ah_attr.sl = 0;
+   qp_attr.ah_attr.sl = hca->ib_trans.sl;
qp_attr.ah_attr.src_path_bits = 0;
qp_attr.ah_attr.port_num = hca->port_num;
 
diff --git a/dapl/openib_common/util.c b/dapl/openib_common/util.c
index b83f609..a69261f 100644
--- a/dapl/openib_common/util.c
+++ b/dapl/openib_common/util.c
@@ -321,6 +321,38 @@ DAT_RETURN dapls_ib_query_hca(IN DAPL_HCA * hca_ptr,
hca_ptr->ib_trans.named_attr.value =
dapl_ib_mtu_str(hca_ptr->ib_trans.mtu);
 
+   if (hca_ptr->ib_hca_handle->device->transport_type != IBV_TRANSPORT_IB)
+   goto skip_ib;
+
+   /* set SL, PKEY values, defaults = 0 */
+   hca_ptr->ib_trans.pkey_idx = 0;
+   hca_ptr->ib_trans.pkey = dapl_os_get_env_val("DAPL_IB_PKEY", 0);
+   hca_ptr->ib_trans.sl = dapl_os_get_env_val("DAPL_IB_SL", 0);
+
+   /* index provided, get pkey; pkey provided, get index */
+   if (hca_ptr->ib_trans.pkey) {
+   int i; uint16_t pkey = 0;
+   for (i=0; i < dev_attr.max_pkeys; i++) {
+   if (ibv_query_pkey(hca_ptr->ib_hca_handle,
+  hca_ptr->port_num,
+  i, &pkey)) {
+   i = dev_attr.max_pkeys;
+   break;
+   }
+   if (pkey == hca_ptr->ib_trans.pkey) {
+   hca_ptr->ib_trans.pkey_idx = i;
+   break;
+   }
+  

[ANNOUNCE] dapl-2.0.29 release

2010-06-17 Thread Davis, Arlin R
 
New release for uDAPL v2.0 (2.0.29) available at:

http://www.openfabrics.org/downloads/dapl

Latest Packages (see ChangeLog for details):

md5sum: 76f18eedf0758ca81aaa3923a65808a5 dapl-2.0.29.tar.gz 

For 1.2 and 2.0 support on the same system, including development, install the 
RPM packages as follows: 

dapl-2.0.29-1 
dapl-utils-2.0.29-1 
dapl-devel-2.0.29-1 
dapl-debuginfo-2.0.29-1 
compat-dapl-1.2.17-1 
compat-dapl-devel-1.2.17-1 

Summary of changes: 

Release 2.0.29 fixes (OFED 1.5.2): 

scm, ucm: add pkey and sl override for QP's, DAPL_IB_SL, DAPL_IB_PKEY.
cma: remove dependency on rdma_cma_abi.h 
configure: need a false conditional for verbs attr.link_layer member check 
ucm: incorrectly freeing port on passive side after reject 
ucm: modify debug CM output for consistency, all ports, qpn in hex 

Vlad, please pull into OFED 1.5.2 RC2 (should be final uDAPL package for OFED 
1.5.2):

Thanks,

-arlin

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH/RFC] mlx4_core: module param to limit msix vec allocation

2010-06-17 Thread Arthur Kepner
On Thu, Jun 17, 2010 at 05:53:58PM +0300, Yevgeny Petrilin wrote:
 I think that this patch would do the job,

(Is that an ack?)

 Anyway we are thinking of ways to change our interrupt allocation scheme.
 

Would be interested to know what you've got in mind.

-- 
Arthur


[PATCH 0/7] various fixes for QIB driver

2010-06-17 Thread Ralph Campbell
The following patches are for various bug fixes.
I'm not sure what counts as a regression for code that is newly introduced.
I'm hoping that all except #2 can make it into 2.6.35, whereas
#2 can wait for 2.6.36 since it is actually a feature.

IB/qib: avoid a rare 7322 chip problem by not marking VL15 bufs as WC
IB/qib: allow PSM to select from multiple port assignment algorithms
IB/qib: mask hardware error during link reset
IB/qib: clear eager buffer memory for each new process
IB/qib: clear 6120 hardware error register
IB/qib: update 7322 serdes tables
IB/qib: completion queue callback needs to be single threaded


[PATCH 1/7] IB/qib: avoid a rare 7322 chip problem by not marking VL15 bufs as WC

2010-06-17 Thread Ralph Campbell
From: Dave Olson dave.ol...@qlogic.com

Don't set write combining via PAT on the VL15 buffers to avoid a
rare problem with unaligned writes from interrupt-flushed store buffers.

Signed-off-by: Dave Olson dave.ol...@qlogic.com
---

 drivers/infiniband/hw/qib/qib.h |1 +
 drivers/infiniband/hw/qib/qib_diag.c|   19 +++
 drivers/infiniband/hw/qib/qib_iba7322.c |   18 +-
 drivers/infiniband/hw/qib/qib_init.c|6 ++
 drivers/infiniband/hw/qib/qib_pcie.c|2 ++
 drivers/infiniband/hw/qib/qib_tx.c  |6 +-
 6 files changed, 46 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/hw/qib/qib.h b/drivers/infiniband/hw/qib/qib.h
index 32d9208..3593983 100644
--- a/drivers/infiniband/hw/qib/qib.h
+++ b/drivers/infiniband/hw/qib/qib.h
@@ -686,6 +686,7 @@ struct qib_devdata {
void __iomem *piobase;
/* mem-mapped pointer to base of user chip regs (if using WC PAT) */
u64 __iomem *userbase;
+   void __iomem *piovl15base; /* base of VL15 buffers, if not WC */
/*
 * points to area where PIOavail registers will be DMA'ed.
 * Has to be on a page of it's own, because the page will be
diff --git a/drivers/infiniband/hw/qib/qib_diag.c b/drivers/infiniband/hw/qib/qib_diag.c
index ca98dd5..05dcf0d 100644
--- a/drivers/infiniband/hw/qib/qib_diag.c
+++ b/drivers/infiniband/hw/qib/qib_diag.c
@@ -233,6 +233,7 @@ static u32 __iomem *qib_remap_ioaddr32(struct qib_devdata *dd, u32 offset,
u32 __iomem *krb32 = (u32 __iomem *)dd->kregbase;
u32 __iomem *map = NULL;
u32 cnt = 0;
+   u32 tot4k, offs4k;
 
/* First, simplest case, offset is within the first map. */
kreglen = (dd->kregend - dd->kregbase) * sizeof(u64);
@@ -250,7 +251,8 @@ static u32 __iomem *qib_remap_ioaddr32(struct qib_devdata *dd, u32 offset,
if (dd->userbase) {
/* If user regs mapped, they are after send, so set limit. */
u32 ulim = (dd->cfgctxts * dd->ureg_align) + dd->uregbase;
-   snd_lim = dd->uregbase;
+   if (!dd->piovl15base)
+   snd_lim = dd->uregbase;
krb32 = (u32 __iomem *)dd->userbase;
if (offset >= dd->uregbase && offset < ulim) {
map = krb32 + (offset - dd->uregbase) / sizeof(u32);
@@ -277,14 +279,14 @@ static u32 __iomem *qib_remap_ioaddr32(struct qib_devdata *dd, u32 offset,
/* If 4k buffers exist, account for them by bumping
 * appropriate limit.
 */
+   tot4k = dd->piobcnt4k * dd->align4k;
+   offs4k = dd->piobufbase >> 32;
if (dd->piobcnt4k) {
-   u32 tot4k = dd->piobcnt4k * dd->align4k;
-   u32 offs4k = dd->piobufbase >> 32;
if (snd_bottom > offs4k)
snd_bottom = offs4k;
else {
/* 4k above 2k. Bump snd_lim, if needed*/
-   if (!dd->userbase)
+   if (!dd->userbase || dd->piovl15base)
snd_lim = offs4k + tot4k;
}
}
@@ -298,6 +300,15 @@ static u32 __iomem *qib_remap_ioaddr32(struct qib_devdata *dd, u32 offset,
cnt = snd_lim - offset;
}
 
+   if (!map && offs4k && dd->piovl15base) {
+   snd_lim = offs4k + tot4k + 2 * dd->align4k;
+   if (offset >= (offs4k + tot4k) && offset < snd_lim) {
+   map = (u32 __iomem *)dd->piovl15base +
+   ((offset - (offs4k + tot4k)) / sizeof(u32));
+   cnt = snd_lim - offset;
+   }
+   }
+
 mapped:
if (cntp)
*cntp = cnt;
diff --git a/drivers/infiniband/hw/qib/qib_iba7322.c b/drivers/infiniband/hw/qib/qib_iba7322.c
index 503992d..3e9828b 100644
--- a/drivers/infiniband/hw/qib/qib_iba7322.c
+++ b/drivers/infiniband/hw/qib/qib_iba7322.c
@@ -6119,9 +6119,25 @@ static int qib_init_7322_variables(struct qib_devdata *dd)
qib_set_ctxtcnt(dd);
 
if (qib_wc_pat) {
-   ret = init_chip_wc_pat(dd, NUM_VL15_BUFS * dd->align4k);
+   resource_size_t vl15off;
+   /*
+* We do not set WC on the VL15 buffers to avoid
+* a rare problem with unaligned writes from
+* interrupt-flushed store buffers, so we need
+* to map those separately here.  We can't solve
+* this for the rarely used mtrr case.
+*/
+   ret = init_chip_wc_pat(dd, 0);
if (ret)
goto bail;
+
+   /* vl15 buffers start just after the 4k buffers */
+   vl15off = dd->physaddr + (dd->piobufbase >> 32) +
+   dd->piobcnt4k * dd->align4k;
+   dd->piovl15base = ioremap_nocache(vl15off,
+ NUM_VL15_BUFS * dd->align4k);
+   if 

[PATCH 2/7] IB/qib: allow PSM to select from multiple port assignment algorithms

2010-06-17 Thread Ralph Campbell
From: Dave Olson dave.ol...@qlogic.com

We formerly allowed only full specification, or using all contexts
within an HCA before moving to the next HCA.  We now allow an additional
method, round-robin across HCAs, and make that the default.
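
To make the difference between the two policies concrete, here is a simplified sketch. It assumes every HCA has the same number of free contexts and ignores ports and link state, which the driver does take into account; pick_hca and the PORT_ALG_* names are hypothetical stand-ins for illustration only.

```c
#include <assert.h>

enum { PORT_ALG_ACROSS = 0, PORT_ALG_WITHIN = 1 };

/*
 * Hypothetical model: return the HCA that the n-th context opened
 * (counting from 0) would land on, or -1 when all contexts are used.
 * ACROSS alternates between HCAs; WITHIN fills one HCA before the next.
 */
static int pick_hca(int alg, unsigned n, unsigned num_hcas,
                    unsigned ctxts_per_hca)
{
    assert(num_hcas > 0 && ctxts_per_hca > 0);

    if (n >= num_hcas * ctxts_per_hca)
        return -1;
    if (alg == PORT_ALG_WITHIN)
        return (int)(n / ctxts_per_hca);
    return (int)(n % num_hcas);
}
```

With two HCAs of four contexts each, the first four opens land on HCAs 0, 1, 0, 1 under ACROSS and on 0, 0, 0, 0 under WITHIN.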

Signed-off-by: Dave Olson dave.ol...@qlogic.com
---

 drivers/infiniband/hw/qib/qib_common.h   |   16 ++
 drivers/infiniband/hw/qib/qib_file_ops.c |  203 +++---
 2 files changed, 118 insertions(+), 101 deletions(-)

diff --git a/drivers/infiniband/hw/qib/qib_common.h 
b/drivers/infiniband/hw/qib/qib_common.h
index b3955ed..145da40 100644
--- a/drivers/infiniband/hw/qib/qib_common.h
+++ b/drivers/infiniband/hw/qib/qib_common.h
@@ -279,7 +279,7 @@ struct qib_base_info {
  * may not be implemented; the user code must deal with this if it
  * cares, or it must abort after initialization reports the difference.
  */
-#define QIB_USER_SWMINOR 10
+#define QIB_USER_SWMINOR 11
 
 #define QIB_USER_SWVERSION ((QIB_USER_SWMAJOR << 16) | QIB_USER_SWMINOR)
 
@@ -302,6 +302,18 @@ struct qib_base_info {
 #define QIB_KERN_SWVERSION ((QIB_KERN_TYPE << 31) | QIB_USER_SWVERSION)
 
 /*
+ * If the unit is specified via open, HCA choice is fixed.  If port is
+ * specified, it's also fixed.  Otherwise we try to spread contexts
+ * across ports and HCAs, using different algorithims.  WITHIN is
+ * the old default, prior to this mechanism.
+ */
+#define QIB_PORT_ALG_ACROSS 0 /* round robin contexts across HCAs, then
+  * ports; this is the default */
+#define QIB_PORT_ALG_WITHIN 1 /* use all contexts on an HCA (round robin
+  * active ports within), then next HCA */
+#define QIB_PORT_ALG_COUNT 2 /* number of algorithm choices */
+
+/*
  * This structure is passed to qib_userinit() to tell the driver where
  * user code buffers are, sizes, etc.   The offsets and sizes of the
  * fields must remain unchanged, for binary compatibility.  It can
@@ -319,7 +331,7 @@ struct qib_user_info {
/* size of struct base_info to write to */
__u32 spu_base_info_size;
 
-   __u32 _spu_unused3;
+   __u32 spu_port_alg; /* which QIB_PORT_ALG_*; unused user minor < 11 */
 
/*
 * If two or more processes wish to share a context, each process
diff --git a/drivers/infiniband/hw/qib/qib_file_ops.c b/drivers/infiniband/hw/qib/qib_file_ops.c
index a142a9e..6b11645 100644
--- a/drivers/infiniband/hw/qib/qib_file_ops.c
+++ b/drivers/infiniband/hw/qib/qib_file_ops.c
@@ -1294,128 +1294,130 @@ bail:
return ret;
 }
 
-static inline int usable(struct qib_pportdata *ppd, int active_only)
+static inline int usable(struct qib_pportdata *ppd)
 {
struct qib_devdata *dd = ppd->dd;
-   u32 linkok = active_only ? QIBL_LINKACTIVE :
-(QIBL_LINKINIT | QIBL_LINKARMED | QIBL_LINKACTIVE);
 
return dd && (dd->flags & QIB_PRESENT) && dd->kregbase && ppd->lid &&
-   (ppd->lflags & linkok);
+   (ppd->lflags & QIBL_LINKACTIVE);
 }
 
-static int find_free_ctxt(int unit, struct file *fp,
- const struct qib_user_info *uinfo)
+/*
+ * Select a context on the given device, either using a requested port
+ * or the port based on the context number.
+ */
+static int choose_port_ctxt(struct file *fp, struct qib_devdata *dd, u32 port,
+   const struct qib_user_info *uinfo)
 {
-   struct qib_devdata *dd = qib_lookup(unit);
struct qib_pportdata *ppd = NULL;
-   int ret;
-   u32 ctxt;
+   int ret, ctxt;
 
-   if (!dd || (uinfo->spu_port && uinfo->spu_port > dd->num_pports)) {
-   ret = -ENODEV;
-   goto bail;
-   }
-
-   /*
-* If users requests specific port, only try that one port, else
-* select best port below, based on context.
-*/
-   if (uinfo->spu_port) {
-   ppd = dd->pport + uinfo->spu_port - 1;
-   if (!usable(ppd, 0)) {
+   if (port) {
+   if (!usable(dd->pport + port - 1)) {
ret = -ENETDOWN;
-   goto bail;
-   }
+   goto done;
+   } else
+   ppd = dd->pport + port - 1;
}
-
-   for (ctxt = dd->first_user_ctxt; ctxt < dd->cfgctxts; ctxt++) {
-   if (dd->rcd[ctxt])
-   continue;
-   /*
-* The setting and clearing of user context rcd[x] protected
-* by the qib_mutex
-*/
-   if (!ppd) {
-   /* choose port based on ctxt, if up, else 1st up */
-   ppd = dd->pport + (ctxt % dd->num_pports);
-   if (!usable(ppd, 0)) {
-   int i;
-   for (i = 0; i < dd->num_pports; i++) {
-   ppd = dd->pport + i;
-   if (usable(ppd, 0))
-   

[PATCH 3/7] IB/qib: mask hardware error during link reset

2010-06-17 Thread Ralph Campbell
The HCA checks for certain hardware errors which can be falsely
triggered when the IB link is reset. The fix is to mask them rather
than report them.
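
For readers unfamiliar with the _LSB/_RMASK register defines touched by this patch, the following sketch shows how such a pair is typically combined into a bit mask and used to suppress one error. It assumes that a set bit in the mask register means "report this error", so masking clears the bit. The helper names are hypothetical, and which bits the patch actually masks is not modeled here.

```c
#include <assert.h>
#include <stdint.h>

/* Field position taken from the statusValidNoEopMask defines in this patch. */
#define STATUS_VALID_NO_EOP_LSB   0xC
#define STATUS_VALID_NO_EOP_RMASK 0x1

/* Build the in-register mask for a field from its LSB and right-justified mask. */
static uint64_t field_mask(unsigned lsb, uint64_t rmask)
{
    assert(lsb < 64);
    return rmask << lsb;
}

/* Assumption: clearing the field in the enable mask stops the error being reported. */
static uint64_t mask_error(uint64_t enable_mask, unsigned lsb, uint64_t rmask)
{
    return enable_mask & ~field_mask(lsb, rmask);
}
```

For example, mask_error(0xFFFF, STATUS_VALID_NO_EOP_LSB, STATUS_VALID_NO_EOP_RMASK) clears only bit 12 and leaves the other error bits enabled.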

Signed-off-by: Ralph Campbell ralph.campb...@qlogic.com
---

 drivers/infiniband/hw/qib/qib_7322_regs.h |   48 +++--
 drivers/infiniband/hw/qib/qib_iba7322.c   |9 -
 2 files changed, 31 insertions(+), 26 deletions(-)

diff --git a/drivers/infiniband/hw/qib/qib_7322_regs.h 
b/drivers/infiniband/hw/qib/qib_7322_regs.h
index a97440b..32dc81f 100644
--- a/drivers/infiniband/hw/qib/qib_7322_regs.h
+++ b/drivers/infiniband/hw/qib/qib_7322_regs.h
@@ -742,15 +742,15 @@
 #define QIB_7322_HwErrMask_IBCBusFromSPCParityErrMask_1_LSB 0xF
 #define QIB_7322_HwErrMask_IBCBusFromSPCParityErrMask_1_MSB 0xF
 #define QIB_7322_HwErrMask_IBCBusFromSPCParityErrMask_1_RMASK 0x1
-#define QIB_7322_HwErrMask_statusValidNoEopMask_1_LSB 0xE
-#define QIB_7322_HwErrMask_statusValidNoEopMask_1_MSB 0xE
-#define QIB_7322_HwErrMask_statusValidNoEopMask_1_RMASK 0x1
+#define QIB_7322_HwErrMask_IBCBusToSPCParityErrMask_1_LSB 0xE
+#define QIB_7322_HwErrMask_IBCBusToSPCParityErrMask_1_MSB 0xE
+#define QIB_7322_HwErrMask_IBCBusToSPCParityErrMask_1_RMASK 0x1
 #define QIB_7322_HwErrMask_IBCBusFromSPCParityErrMask_0_LSB 0xD
 #define QIB_7322_HwErrMask_IBCBusFromSPCParityErrMask_0_MSB 0xD
 #define QIB_7322_HwErrMask_IBCBusFromSPCParityErrMask_0_RMASK 0x1
-#define QIB_7322_HwErrMask_statusValidNoEopMask_0_LSB 0xC
-#define QIB_7322_HwErrMask_statusValidNoEopMask_0_MSB 0xC
-#define QIB_7322_HwErrMask_statusValidNoEopMask_0_RMASK 0x1
+#define QIB_7322_HwErrMask_statusValidNoEopMask_LSB 0xC
+#define QIB_7322_HwErrMask_statusValidNoEopMask_MSB 0xC
+#define QIB_7322_HwErrMask_statusValidNoEopMask_RMASK 0x1
 #define QIB_7322_HwErrMask_LATriggeredMask_LSB 0xB
 #define QIB_7322_HwErrMask_LATriggeredMask_MSB 0xB
 #define QIB_7322_HwErrMask_LATriggeredMask_RMASK 0x1
@@ -796,15 +796,15 @@
 #define QIB_7322_HwErrStatus_IBCBusFromSPCParityErr_1_LSB 0xF
 #define QIB_7322_HwErrStatus_IBCBusFromSPCParityErr_1_MSB 0xF
 #define QIB_7322_HwErrStatus_IBCBusFromSPCParityErr_1_RMASK 0x1
-#define QIB_7322_HwErrStatus_statusValidNoEop_1_LSB 0xE
-#define QIB_7322_HwErrStatus_statusValidNoEop_1_MSB 0xE
-#define QIB_7322_HwErrStatus_statusValidNoEop_1_RMASK 0x1
+#define QIB_7322_HwErrStatus_IBCBusToSPCParityErr_1_LSB 0xE
+#define QIB_7322_HwErrStatus_IBCBusToSPCParityErr_1_MSB 0xE
+#define QIB_7322_HwErrStatus_IBCBusToSPCParityErr_1_RMASK 0x1
 #define QIB_7322_HwErrStatus_IBCBusFromSPCParityErr_0_LSB 0xD
 #define QIB_7322_HwErrStatus_IBCBusFromSPCParityErr_0_MSB 0xD
 #define QIB_7322_HwErrStatus_IBCBusFromSPCParityErr_0_RMASK 0x1
-#define QIB_7322_HwErrStatus_statusValidNoEop_0_LSB 0xC
-#define QIB_7322_HwErrStatus_statusValidNoEop_0_MSB 0xC
-#define QIB_7322_HwErrStatus_statusValidNoEop_0_RMASK 0x1
+#define QIB_7322_HwErrStatus_statusValidNoEop_LSB 0xC
+#define QIB_7322_HwErrStatus_statusValidNoEop_MSB 0xC
+#define QIB_7322_HwErrStatus_statusValidNoEop_RMASK 0x1
 #define QIB_7322_HwErrStatus_LATriggered_LSB 0xB
 #define QIB_7322_HwErrStatus_LATriggered_MSB 0xB
 #define QIB_7322_HwErrStatus_LATriggered_RMASK 0x1
@@ -850,15 +850,15 @@
 #define QIB_7322_HwErrClear_IBCBusFromSPCParityErrClear_1_LSB 0xF
 #define QIB_7322_HwErrClear_IBCBusFromSPCParityErrClear_1_MSB 0xF
 #define QIB_7322_HwErrClear_IBCBusFromSPCParityErrClear_1_RMASK 0x1
-#define QIB_7322_HwErrClear_IBCBusToSPCparityErrClear_1_LSB 0xE
-#define QIB_7322_HwErrClear_IBCBusToSPCparityErrClear_1_MSB 0xE
-#define QIB_7322_HwErrClear_IBCBusToSPCparityErrClear_1_RMASK 0x1
+#define QIB_7322_HwErrClear_IBCBusToSPCParityErrClear_1_LSB 0xE
+#define QIB_7322_HwErrClear_IBCBusToSPCParityErrClear_1_MSB 0xE
+#define QIB_7322_HwErrClear_IBCBusToSPCParityErrClear_1_RMASK 0x1
 #define QIB_7322_HwErrClear_IBCBusFromSPCParityErrClear_0_LSB 0xD
 #define QIB_7322_HwErrClear_IBCBusFromSPCParityErrClear_0_MSB 0xD
 #define QIB_7322_HwErrClear_IBCBusFromSPCParityErrClear_0_RMASK 0x1
-#define QIB_7322_HwErrClear_IBCBusToSPCparityErrClear_0_LSB 0xC
-#define QIB_7322_HwErrClear_IBCBusToSPCparityErrClear_0_MSB 0xC
-#define QIB_7322_HwErrClear_IBCBusToSPCparityErrClear_0_RMASK 0x1
+#define QIB_7322_HwErrClear_statusValidNoEopClear_LSB 0xC
+#define QIB_7322_HwErrClear_statusValidNoEopClear_MSB 0xC
+#define QIB_7322_HwErrClear_statusValidNoEopClear_RMASK 0x1
 #define QIB_7322_HwErrClear_LATriggeredClear_LSB 0xB
 #define QIB_7322_HwErrClear_LATriggeredClear_MSB 0xB
 #define QIB_7322_HwErrClear_LATriggeredClear_RMASK 0x1
@@ -880,15 +880,15 @@
 #define QIB_7322_HwDiagCtrl_ForceIBCBusFromSPCParityErr_1_LSB 0xF
 #define QIB_7322_HwDiagCtrl_ForceIBCBusFromSPCParityErr_1_MSB 0xF
 #define QIB_7322_HwDiagCtrl_ForceIBCBusFromSPCParityErr_1_RMASK 0x1
-#define QIB_7322_HwDiagCtrl_ForcestatusValidNoEop_1_LSB 0xE
-#define QIB_7322_HwDiagCtrl_ForcestatusValidNoEop_1_MSB 0xE
-#define QIB_7322_HwDiagCtrl_ForcestatusValidNoEop_1_RMASK 0x1
+#define 

[PATCH 4/7] IB/qib: clear eager buffer memory for each new process

2010-06-17 Thread Ralph Campbell
The eager buffers are not being cleared before being mmapped into a new
user address space. This is a potential security risk and should be fixed.
Note that the eager header queue is already being cleared OK.

Signed-off-by: Ralph Campbell ralph.campb...@qlogic.com
---

 drivers/infiniband/hw/qib/qib_init.c |3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/qib/qib_init.c b/drivers/infiniband/hw/qib/qib_init.c
index 2589599..1d4db4b 100644
--- a/drivers/infiniband/hw/qib/qib_init.c
+++ b/drivers/infiniband/hw/qib/qib_init.c
@@ -1472,6 +1472,9 @@ int qib_setup_eagerbufs(struct qib_ctxtdata *rcd)
dma_addr_t pa = rcd->rcvegrbuf_phys[chunk];
unsigned i;
 
+   /* clear for security and sanity on each use */
+   memset(rcd->rcvegrbuf[chunk], 0, size);
+
for (i = 0; e < egrcnt && i < egrperchunk; e++, i++) {
dd->f_put_tid(dd, e + egroff +
  (u64 __iomem *)



[PATCH 5/7] IB/qib: clear 6120 hardware error register

2010-06-17 Thread Ralph Campbell
The hardware error register needs to be cleared or another interrupt
will be generated, thus causing an infinite loop.
This is a regression introduced when removing debug output.

Signed-off-by: Ralph Campbell ralph.campb...@qlogic.com
---

 drivers/infiniband/hw/qib/qib_iba6120.c |3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/qib/qib_iba6120.c b/drivers/infiniband/hw/qib/qib_iba6120.c
index 1eadadc..a5e29db 100644
--- a/drivers/infiniband/hw/qib/qib_iba6120.c
+++ b/drivers/infiniband/hw/qib/qib_iba6120.c
@@ -1355,8 +1355,7 @@ static int qib_6120_bringup_serdes(struct qib_pportdata *ppd)
hwstat = qib_read_kreg64(dd, kr_hwerrstatus);
if (hwstat) {
/* should just have PLL, clear all set, in an case */
-   if (hwstat & ~QLOGIC_IB_HWE_SERDESPLLFAILED)
-   qib_write_kreg(dd, kr_hwerrclear, hwstat);
+   qib_write_kreg(dd, kr_hwerrclear, hwstat);
qib_write_kreg(dd, kr_errclear, ERR_MASK(HardwareErr));
}
 



[PATCH 6/7] IB/qib: update 7322 serdes tables

2010-06-17 Thread Ralph Campbell
Signed-off-by: Ralph Campbell ralph.campb...@qlogic.com
---

 drivers/infiniband/hw/qib/qib_iba7322.c |   16 
 1 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/hw/qib/qib_iba7322.c 
b/drivers/infiniband/hw/qib/qib_iba7322.c
index 8ee0ac6..5eedf83 100644
--- a/drivers/infiniband/hw/qib/qib_iba7322.c
+++ b/drivers/infiniband/hw/qib/qib_iba7322.c
@@ -543,7 +543,7 @@ struct vendor_txdds_ent {
 static void write_tx_serdes_param(struct qib_pportdata *, struct txdds_ent *);
 
 #define TXDDS_TABLE_SZ 16 /* number of entries per speed in onchip table */
-#define TXDDS_EXTRA_SZ 11 /* number of extra tx settings entries */
+#define TXDDS_EXTRA_SZ 13 /* number of extra tx settings entries */
 #define SERDES_CHANS 4 /* yes, it's obvious, but one less magic number */
 
 #define H1_FORCE_VAL 8
@@ -5629,6 +5629,8 @@ static void set_no_qsfp_atten(struct qib_devdata *dd, int change)
if (ppd->port != port || !ppd->link_speed_supported)
continue;
ppd->cpspec->no_eep = val;
+   if (seth1)
+   ppd->cpspec->h1_val = h1;
/* now change the IBC and serdes, overriding generic */
init_txdds_table(ppd, 1);
any++;
@@ -6069,9 +6071,9 @@ static int qib_init_7322_variables(struct qib_devdata *dd)
 * the cable info setup here.  Can be overridden
 * in adapter-specific routines.
 */
-   if (!(ppd->dd->flags & QIB_HAS_QSFP)) {
-   if (!IS_QMH(ppd->dd) && !IS_QME(ppd->dd))
-   qib_devinfo(ppd->dd->pcidev, "IB%u:%u: "
+   if (!(dd->flags & QIB_HAS_QSFP)) {
+   if (!IS_QMH(dd) && !IS_QME(dd))
+   qib_devinfo(dd->pcidev, "IB%u:%u: "
"Unknown mezzanine card type\n",
dd->unit, ppd->port);
cp->h1_val = IS_QMH(dd) ? H1_FORCE_QMH : H1_FORCE_QME;
@@ -6953,6 +6955,8 @@ static const struct txdds_ent txdds_extra_sdr[TXDDS_EXTRA_SZ] = {
{  0, 0, 0, 11 },   /* QME7342 backplane settings */
{  0, 0, 0, 11 },   /* QME7342 backplane settings */
{  0, 0, 0, 11 },   /* QME7342 backplane settings */
+   {  0, 0, 0,  3 },   /* QMH7342 backplane settings */
+   {  0, 0, 0,  4 },   /* QMH7342 backplane settings */
 };
 
 static const struct txdds_ent txdds_extra_ddr[TXDDS_EXTRA_SZ] = {
@@ -6968,6 +6972,8 @@ static const struct txdds_ent txdds_extra_ddr[TXDDS_EXTRA_SZ] = {
{  0, 0, 0, 13 },   /* QME7342 backplane settings */
{  0, 0, 0, 13 },   /* QME7342 backplane settings */
{  0, 0, 0, 13 },   /* QME7342 backplane settings */
+   {  0, 0, 0,  9 },   /* QMH7342 backplane settings */
+   {  0, 0, 0, 10 },   /* QMH7342 backplane settings */
 };
 
 static const struct txdds_ent txdds_extra_qdr[TXDDS_EXTRA_SZ] = {
@@ -6983,6 +6989,8 @@ static const struct txdds_ent txdds_extra_qdr[TXDDS_EXTRA_SZ] = {
{  0, 1, 12,  6 },  /* QME7342 backplane setting */
{  0, 1, 12,  7 },  /* QME7342 backplane setting */
{  0, 1, 12,  8 },  /* QME7342 backplane setting */
+   {  0, 1,  0, 10 },  /* QMH7342 backplane settings */
+   {  0, 1,  0, 12 },  /* QMH7342 backplane settings */
 };
 
 static const struct txdds_ent *get_atten_table(const struct txdds_ent *txdds,



[PATCH 7/7] IB/qib: completion queue callback needs to be single threaded

2010-06-17 Thread Ralph Campbell
Workqueues aren't exactly equivalent to tasklets since the callback
function may be called from multiple CPUs before the callback returns.
This causes completion notification callbacks to have MT bugs since
they weren't expecting this behavior. The fix is to use a single
threaded work queue.

Signed-off-by: Ralph Campbell ralph.campb...@qlogic.com
---

 drivers/infiniband/hw/qib/qib_init.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/qib/qib_init.c 
b/drivers/infiniband/hw/qib/qib_init.c
index 1d4db4b..7831ff8 100644
--- a/drivers/infiniband/hw/qib/qib_init.c
+++ b/drivers/infiniband/hw/qib/qib_init.c
@@ -1059,7 +1059,7 @@ static int __init qlogic_ib_init(void)
goto bail_dev;
}
 
-   qib_cq_wq = create_workqueue("qib_cq");
+   qib_cq_wq = create_singlethread_workqueue("qib_cq");
if (!qib_cq_wq) {
ret = -ENOMEM;
goto bail_wq;



RE: [PATCH] librdmacm/mcraw: Add a new test application for user-space IBV_QPT_RAW_ETH QP type

2010-06-17 Thread Hefty, Sean
FYI - I don't track the ewg mail list, so I missed any discussion there.

See my detailed comments inline.  I track the librdmacm against Roland's 
libibverbs releases and upstream kernel features, rather than against OFED 
features.  As a result, I do think there's some functionality missing in both 
the upstream libibverbs and kernel that needs to be resolved.  I tried to 
identify these below.

 The patch adds a new test application describing a usage of the
 IBV_QPT_RAW_ETH

Where is IBV_QPT_RAW_ETH defined?

Roland's version of verbs.h only defines IBV_QPT_RC/UC/UD.  I think we need to 
get this defined there first, then figure out if anything new is needed for the 
librdmacm.

Also, if I understand this correctly, a RAW_ETH QP exposes the contents of the 
Ethernet frame and header to the user.  This should be restricted to privileged 
applications, and I'm guessing uverbs should verify that before allocating a 
RAW_ETH QP for the user.

 +struct cmatest {
 + struct rdma_event_channel *channel;
 + struct cmatest_node *nodes;
 + int conn_index;
 + int connects_left;
 +
 + struct sockaddr_in6 dst_in;
 + struct sockaddr *dst_addr;
 + struct sockaddr_in6 src_in;
 + struct sockaddr *src_addr;
 + int fd[1024];

See comments below regarding the fd array usage.

 +};
 +
 +static struct cmatest test;
 +static int connections = 1;
 +static int message_size = 100;
 +static int message_count = 10;
 +static int is_sender;
 +static int unmapped_addr;
 +static char *dst_addr;
 +static char *src_addr;
 +static enum rdma_port_space port_space = RDMA_PS_UDP;
 +
 +int vlan_flag;
 +int vlan_ident;
 +
 +static int cq_len = 512;
 +static int qp_len = 256;
 +
 +uint16_t IP_CRC(void *buf, int hdr_len)
 +{
 + unsigned long sum = 0;
 + const uint16_t *ip1;
 +
 + ip1 = (uint16_t *)buf;
 + while (hdr_len  1) {
 + sum += *ip1++;
 + if (sum  0x8000)
 + sum = (sum  0x) + (sum  16);
 + hdr_len -= 2;
 + }
 +
 + while (sum  16)
 + sum = (sum  0x) + (sum  16);
 +
 + return ~sum;
 +}
 +
 +uint16_t udp_checksum(struct udphdr *udp_head,
 + int header_size,
 + int pay_load_size,
 + uint32_t src_addr,
 + uint32_t dest_addr,
 + unsigned char *payload)
 +{
 + uint16_t *buf = (void *)udp_head;
 + uint16_t *ip_src = (void *)src_addr;
 + uint16_t *ip_dst = (void *)dest_addr;
 + uint32_t sum;
 + size_t len = header_size;
 +
 + sum = 0;
 + while (len  1) {
 + sum += *buf++;
 + if (sum  0x8000)
 + sum = (sum  0x) + (sum  16);
 + len -= 2;
 + }
 +
 + buf = (void *)payload;
 + len = pay_load_size;
 + while (len  1) {
 + sum += *buf++;
 + if (sum  0x8000)
 + sum = (sum  0x) + (sum  16);
 + len -= 2;
 + }
 +
 + if (len  1)
 + sum += *((uint8_t *)buf);
 + sum += *(ip_src++);
 + sum += *ip_src;
 +
 + sum += *(ip_dst++);
 + sum += *ip_dst;
 +
 + sum += htons(IPPROTO_UDP);
 + len = (header_size + pay_load_size);
 + sum += htons(len);
 +
 + while (sum  16)
 + sum = (sum  0x) + (sum  16);
 +
 + return (uint16_t)(~sum);
 +}

The above two calls look like candidates for common code - not part of 
librdmacm, but common to some other library that provides functionality similar 
to: ip_crc(), udp_checksum(), format_eth_hdr(), format_ip_hdr(), 
format_udp_hdr(), etc.  Even separating that functionality out into another 
source file would make it easier for another application to pick up and reuse.
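
To illustrate what such a shared helper might look like, here is a minimal sketch of the ones' complement Internet checksum in the style of RFC 1071. The name inet_checksum and its signature are assumptions for the example, not an existing library API. It reads the buffer byte by byte so the result does not depend on host endianness, and returns the checksum in host order (a caller would convert with htons before storing it in a header).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: ones' complement checksum over len bytes,
 * treating the data as big-endian 16-bit words. An odd trailing byte
 * is padded with a zero low byte. Returns the checksum in host order.
 */
static uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;

    /* Keeps the 32-bit accumulator from overflowing before the fold. */
    assert(len <= 0xffff);

    while (len > 1) {
        sum += ((uint32_t)data[0] << 8) | data[1];
        data += 2;
        len -= 2;
    }
    if (len == 1)
        sum += (uint32_t)data[0] << 8;

    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)~sum;
}
```

The same routine could serve both the IP header checksum and, once the pseudo-header words are included in the buffer or summed separately, the UDP checksum.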

 +static int create_message(struct cmatest_node *node)
 +{
 + if (!message_size)
 + message_count = 0;
 +
 + if (!message_count)
 + return 0;
 +
 + node-mem = NULL;
 + posix_memalign((void *)node-mem, 4096,
 + (message_size + HEADER_LEN ) * sizeof(char));
 + if (node-mem == NULL) {
 + printf(failed message allocation\n);
 + return -1;
 + }
 +
 + node-mr = ibv_reg_mr(node-pd, node-mem,
 + message_size + HEADER_LEN,
 + IBV_ACCESS_LOCAL_WRITE);
 + if (!node-mr) {
 + printf(failed to reg MR\n);
 + goto err;
 + }
 + return 0;
 +err:
 + free(node-mem);
 + return -1;
 +}
 +
 +static int verify_test_params(struct cmatest_node *node)
 +{
 + struct ibv_port_attr port_attr;
 + int ret;
 +
 + ret = ibv_query_port(node-cma_id-verbs, node-cma_id-port_num,
 +  port_attr);
 + if (ret)
 + return ret;
 +
 + printf(\nibv_query_port %x\n, node-cma_id-port_num);
 + if (message_count  message_size  (1